HTML parser in PHP

jlive

Some time ago I developed a small parser that scans text containing custom tags and returns an array of the tags and the tokens. It was not meant to be used on HTML text.

Now I'm looking for an HTML parser.
I could adapt my own code, but first: is there a free HTML parser in PHP?

leetcrew

This is an object-oriented HTML parser.

[edit]
And one that is posted illegally. Really leetcrew; are you trying to get us shut down? Because if you are, you're going about it in a damn funny way - Jupitermedia's already covered, and the only entity at risk of prosecution by your posting material in violation of both the law (and I'm pretty certain you're in a territory that is a signatory to the Berne convention) and the Acceptable Use Policy you agreed to when you registered is you.

Signed Weedpacket.
[/edit]

jlive

Thank you, leetcrew.

|| # vShare 1.0 CVS (inc/template.php) # ||
|| # ---------------------------------------------------------------- # ||
|| # All PHP code in this file are copyrighted by rjregalado.net # ||
|| # This file may not be redistributed in whole or significant part. # ||
|| # ------------- vShare 1.0 CVS IS NOT FREE SOFTWARE -------------- # ||
|| # http://vshare.uni.cc | http://www.rjregalado.net # ||

What about that copyright notice?

MarkR

A parser is something which parses. The above does not.

You should ask yourself "Do I need a HTML parser?" and if the answer is "No", don't use one.

Parsing HTML is just about the most difficult thing possible. I have seen a HTML parser in Perl, and it was truly disgusting.

One problem is that in HTML many elements (p or li, for example) don't have to be closed, so the parser has to figure out for itself where they are supposed to end.

And most HTML documents are not well-formed anyway, so the parser has to deal with:
- Broken entity references like &rubbish
- Tags which are malformed in some way
- Attributes not enclosed in quotes, containing funny characters
- General mess
- Documents that are not in the encoding they say they're in (or that contain contradictory encoding statements / headers)
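
To get a feel for how a lenient parser copes with this kind of input, here is a minimal sketch using PHP's DOM extension (DOMDocument::loadHTML, available from PHP 5). Nobody in this thread mentioned it, so treat it as one possible option rather than a recommendation from the posters. The sample markup and the variable names are made up for illustration.

```php
<?php
// Hypothetical input: unclosed <p> and <a> elements, a broken entity
// reference and an unquoted attribute value.
$html = '<html><body><p>First paragraph<p>Second &rubbish paragraph'
      . '<a href=page.php?x=1>link</body></html>';

// Collect parse problems in a list instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);          // lenient HTML mode, not strict XML

$problems = libxml_get_errors();
libxml_clear_errors();

// The parser decides for itself where the first <p> ends.
$paragraphs = $doc->getElementsByTagName('p');
$link = $doc->getElementsByTagName('a')->item(0);

echo 'paragraphs found: ', $paragraphs->length, "\n";
echo 'link target: ', $link->getAttribute('href'), "\n";
echo 'problems reported: ', count($problems), "\n";
```

The parser repairs what it can and still builds a tree. How it repairs a given mess depends on the underlying library, so the result for badly broken pages should be checked rather than assumed.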

Parsing documents which contain script elements is different again, because script elements are allowed to contain text which isn't valid markup and which has to be taken literally (i.e. the parser must not attempt to fix it up and turn it into DOM nodes).

HTML is the worst.

Mark

Weedpacket

Originally posted by MarkR
HTML is the worst.

Roll on XHTML 🙂

leetcrew

Originally posted by leetcrew
this an Object Oriented HTML parser.

[edit]
And one that is posted illegally. Really leetcrew; are you trying to get us shut down? Because if you are, you're going about it in a damn funny way - Jupitermedia's already covered, and the only entity at risk of prosecution by your posting material in violation of both the law (and I'm pretty certain you're in a territory that is a signatory to the Berne convention) and the Acceptable Use Policy you agreed to when you registered is you.

Signed Weedpacket.
[/edit]

Oops, sorry... I will make it freeware.

jlive

MarkR wrote:

Parsing HTML is just about the most difficult thing possible. I have seen a HTML parser in Perl, and it was truly disgusting.

Yes MarkR, I totally agree.

First of all, I easily adapted my old small parser to process HTML text too, so I have some experience with that. It's not so different from processing natural-language text.

Weedpacket wrote:

Roll on XHTML

In fact, I use an RSS generator and parser whenever possible, but that is only suitable in special contexts.

Some (X)HTML parser must already exist: this seems to be exactly the kind of problem that should be solved (in the best way possible) once and for all and shared for free.

Weedpacket

Originally posted by jlive
Some (X)HTML parser must already exist: this seems to be exactly the kind of problem that should be solved (in the best way possible) once and for all and shared for free.

XHTML (unlike HTML) is an application of XML, so for XHTML there is already a parser built into PHP.
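
As a rough sketch of what that can look like, the following uses the event-based functions of PHP's XML extension (xml_parser_create and friends). The post above does not say which parser is meant, so that choice is an assumption here, and the helper name listXhtmlTags and the sample documents are hypothetical. The closures need PHP 5.3 or later; on older versions named callback functions would do the same job.

```php
<?php
// Hypothetical helper: returns the element names of a well-formed
// XHTML document in document order, or false if it is not well-formed.
function listXhtmlTags($source)
{
    $tags = array();

    $parser = xml_parser_create();
    // Keep element names as written instead of upper-casing them.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
    xml_set_element_handler(
        $parser,
        function ($p, $name, $attrs) use (&$tags) { $tags[] = $name; },
        function ($p, $name) { /* end tags are not needed here */ }
    );

    // Third argument true: this is the final (and only) chunk of data.
    $ok = xml_parse($parser, $source, true);

    return $ok ? $tags : false;
}

$good = '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>Test</title></head>'
      . '<body><p>One</p><p>Two<br /></p></body></html>';

// Fine as HTML, but not well-formed XML: the <p> elements are never closed.
$bad = '<html><body><p>One<p>Two</body></html>';

$goodTags = listXhtmlTags($good);
$badTags  = listXhtmlTags($bad);

echo implode(' ', $goodTags), "\n";
echo $badTags === false ? "second document rejected\n" : "second document accepted\n";
```

The catch is that an XML parser is strict: it stops at the first well-formedness error instead of guessing, so this only helps for pages that really are valid XHTML.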