A parser is something which parses. The above does not.
You should ask yourself "Do I need an HTML parser?" and if the answer is "No", don't use one.
Parsing HTML is just about the most difficult thing possible. I have seen an HTML parser in Perl, and it was truly disgusting.
One problem is that many HTML elements don't have to be closed explicitly (a <p> or <li>, for example, can omit its end tag). So the parser has to figure out by itself where they are supposed to end.
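To make the unclosed-element problem concrete, here is a rough sketch using Python's standard html.parser module, which only tokenizes and does not build a tree. The class name TagLogger and the helper tag_events are made up for this illustration. It just records which start and end tags actually appear in the input:

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical helper: records the tags that literally appear in the markup."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def tag_events(markup):
    """Return the list of (kind, tag) events seen while tokenizing markup."""
    logger = TagLogger()
    logger.feed(markup)
    logger.close()
    return logger.events


# Two <li> elements, neither of them closed.
for event in tag_events("<ul><li>one<li>two</ul>"):
    print(event)
```

The only end tag reported is the one for ul. Nothing in the input says where either li ends, so anything that wants to build a tree from this has to infer it.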
And most HTML documents are not well-formed anyway, so the parser has to deal with:
- Broken entity references like &rubbish
- Tags which are malformed in some way
- Attributes not enclosed in quotes, containing funny characters
- General mess
- Documents which are not in the encoding they say they're in (or which contain contradictory encoding statements / headers)
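A couple of the items above can be poked at with Python's standard library, as a small sketch rather than a survey of what real parsers do. html.unescape and html.parser are real stdlib APIs; AttrLogger and attributes are names invented for this example:

```python
import html
from html.parser import HTMLParser


class AttrLogger(HTMLParser):
    """Hypothetical helper: collects every (name, value) attribute pair."""

    def __init__(self):
        super().__init__()
        self.attrs_seen = []

    def handle_starttag(self, tag, attrs):
        self.attrs_seen.extend(attrs)


def attributes(markup):
    """Return all attribute pairs found on start tags in markup."""
    parser = AttrLogger()
    parser.feed(markup)
    parser.close()
    return parser.attrs_seen


# An unknown entity reference has to be passed through untouched...
print(html.unescape("&rubbish"))
# ...while a known one is decoded even though its semicolon is missing.
print(html.unescape("fish &amp chips"))
# Unquoted attribute values containing '?' and '=' still have to be split correctly.
print(attributes("<a href=page.html?id=1 title=hi>link</a>"))
```

None of these inputs is rejected. The parser is expected to cope with each one and carry on, which is where a lot of the complexity comes from.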
Then parsing documents which contain script elements is different again, because a script element is allowed to contain text which isn't valid markup and which has to be taken literally (i.e. the parser doesn't attempt to fix it up or turn it into DOM nodes).
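The script case can also be seen with Python's html.parser, again only as a sketch. ScriptAwareLogger and scan are hypothetical names for this example. The parser switches to a raw-text mode inside script, so the "<b>" in the JavaScript string is never reported as a tag:

```python
from html.parser import HTMLParser


class ScriptAwareLogger(HTMLParser):
    """Hypothetical helper: logs start tags and the raw text inside <script>."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.script_chunks = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script text may arrive in more than one chunk, so collect the pieces.
        if self._in_script:
            self.script_chunks.append(data)


def scan(markup):
    """Return (start tags seen, script text taken literally)."""
    parser = ScriptAwareLogger()
    parser.feed(markup)
    parser.close()
    return parser.start_tags, "".join(parser.script_chunks)


tags, script_text = scan(
    '<script>if (a < b) document.write("<b>hi</b>");</script><p>x</p>'
)
print(tags)
print(script_text)
```

The start tags reported are script and p only, and the script body comes back character for character, including the "<" comparison and the markup-looking string.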
HTML is the worst.
Mark