Parsing

bernard_hinault

Hi there - good morning

I need to parse the result pages of this site:

http://www.bildung-lsa.de/bildungsland/schulen_und_hochschulen.html#art5511

I need to parse all the HTML pages to get the results, e.g. an entry like this one:

Adresse:
Friedrich-Schiller-Gymnasium Calbe
39240 Calbe
Große Angergasse 10
Homepage: http://www.gym-calbe.info
Telefon: 039291/2560
Telefax: 039291/78874
E-Mail: kontakt@gym-schiller-calbe.bildung-lsa.de

Well, this can be done with SAX or something like that.
It can be done with Perl too, for example with HTML::TreeBuilder::XPath.

Working with TreeBuilder, I have to identify the XPath expressions for the result pages.

I will try to come up with some examples; I guess that would be a good way to start. Getting the paths for one page would be good preparation for parsing all the pages.

I look forward to any ideas and help!

bernhard

sneakyimp

I can imagine a couple of ways to do this.

1) Parse the HTML contents of the page as an XML document using DOMDocument::load (or DOMDocument::loadHTML if the markup is not well-formed XML). You could then use DOMDocument::getElementById or a similar function to locate the region of the document you are after and parse its children.

2) Use [man]preg_match_all[/man] and try to concoct a pattern or two to do some pattern matching.
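
To make the two ideas concrete, here is a rough sketch that combines them: find the region by id, then pattern-match the labelled lines inside it. It is written in Python rather than PHP purely for illustration. The markup in SAMPLE is made up (I have not checked the real page structure), the id "art5511" is only guessed from the anchor in the URL, and the function names region_text and extract_fields are hypothetical. It also assumes well-formed markup, which real pages often are not.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup. The structure of the real page is an
# assumption; only the field values come from the example in the question.
SAMPLE = """<html><body>
<div id="art5511">
<p>Adresse:<br/>Friedrich-Schiller-Gymnasium Calbe<br/>39240 Calbe<br/>Große Angergasse 10</p>
<p>Homepage: http://www.gym-calbe.info<br/>Telefon: 039291/2560<br/>Telefax: 039291/78874<br/>E-Mail: kontakt@gym-schiller-calbe.bildung-lsa.de</p>
</div>
</body></html>"""

# Labels that are followed by a value on the same line.
LABELS = ("Homepage", "Telefon", "Telefax", "E-Mail")


def region_text(markup, element_id):
    """Idea 1: locate the element with the given id and return its text,
    one text fragment per line."""
    root = ET.fromstring(markup)
    node = root.find(f".//*[@id='{element_id}']")
    if node is None:
        return ""
    pieces = (piece.strip() for piece in node.itertext())
    return "\n".join(piece for piece in pieces if piece)


def extract_fields(text):
    """Idea 2: pattern-match 'Label: value' lines in the extracted text."""
    fields = {}
    for label in LABELS:
        pattern = rf"^{re.escape(label)}:[ \t]*(.+)$"
        match = re.search(pattern, text, re.MULTILINE)
        if match:
            fields[label] = match.group(1).strip()
    return fields


if __name__ == "__main__":
    print(extract_fields(region_text(SAMPLE, "art5511")))
```

The same split should carry over to PHP: the DOM step narrows the document down to one region, so the patterns only have to cope with a few short lines instead of the whole page.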