Dom Processing: code review of a little [10 liner-] parser-script

bernard_hinault

hello dear community!

good day!

I need to get all the data out of this site.See the target: www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder

I am trying to scrape the datas from a webpage, but I get need to get all the data in this link. I want to store the data in a Mysql-db for the sake of a better retrieval!

see an example:

I need to get all the data out of this site. www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder . I am trying to scrape the datas from a webpage, but I get need to get all the data in this link.

see an example:

Bürgerstiftung Lebensraum ******
    rechtsfähige Stiftung des bürgerlichen Rechts
    Ansprechpartner: Hubert Schramm
    Alexanderstr. 69/ 71
    52062 ******
    Telefon: 0241 - 4500130
    Telefax: 0241 - 4500131
    Email: [email]info@buergerstiftung-******.de[/email]
    [url]www.buergerstiftung-******.de[/url]
    >> Weitere Details zu dieser Stiftung

Bürgerstiftung Achim
    rechtsfähige Stiftung des bürgerlichen Rechts
    Ansprechpartner: Helga Kühn
    Rotkehlchenstr. 72
    28832 Achim
    Telefon: 04202-84981
    Telefax: 04202-955210
    Email: [email]info@buergerstiftung-achim.de[/email]
    [url]www.buergerstiftung-achim.de[/url]
    >> Weitere Details zu dieser Stiftung

I need to have the data that are "behind" the link - is there any way to do this
with a easy and understandable parser - one that can be understood and written by a newbie!?
well i could do this with XPahts - in PHP or Perl - (with mechanize)

i started with an php-approach: But -if i run the code (see below) i get this results

PHP Fatal error: Call to undefined function file_get_html() in /home/martin/perl/foundations/arbie_finder_de.php on line 5
martin@suse-linux:~/perl/foundations> cd foundations

caused by this code here

    <?php

// Create DOM from URL or file
$html = file_get_html('www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder');

// split it via body, so you only get to the contents inside body tag
$split = split('<body>', $html);
// it is usually in the top of the array but just check to be sure
$body = $split[1];
// split again with, say,<p class="divider">A</p>
$split = split('<p class="divider">A</p>', $body);
// now this should contain just the data table you want to process
$data = $split[1];

// Find all links from original html
foreach($html->find('a') as $element) {
       $link = $element->href;

       // check if this link is in our data table
       if(substr_count($data, $link) > 0) {
           // link is in our data table, follow the link
           $html = file_get_html($link);
          // do what you have to do
       }
}


?>

well some musings about my approach:

the standard practice for scrapping the pages would be:

read the page into a string (file_get_html or whatever is being used now)
split the string, This depends on the page structure. First split it via <body>, so one element of the array will contain the body, and so on until we get our target. Well I'm guessing the final split would be by

, since it has the link we described above:

If we wish to follow the link, just repeat the same process, but using the link.
Alternatively, we can search around for a PHP snippet that gets all links in a page. This is better if we have done 1 and 2 already, and we now have only the string inside the <body> tag. Much simpler that way.

Well - my question is: what can this errors cause - i have no glue...would be great if you have an idea look forward

Update: Hmm - i could try this:

addmiting that it doesn't get any simpler than using simple_html_dom.

    $records = array();
    foreach($html->find('#content dl') as $contact) {
       $record = array();
       $record["name"] = $contact->find("dt", 0)->plaintext;
       foreach($contact->find("dd") as $field) {
           /* parse each $field->plaintext in order to obtain $fieldname */
           $record[$fieldname] = $field->plaintext;
       }
       $records[] = $record;
    }

Well - i try to work from here. Perhaps i use a recent version of PHP to get the jQuery-like syntax.... hmmm...

any ideas

look forward