PHP file_get_contents: how to apply it with a PHP function

bernard_hinault

I am currently writing a small parser and harvester that collects the data from this website (see below):

http://www.aktive-buergerschaft.de/buergerstiftungsfinder

I want to collect all the foundations that are listed on this page (see the examples below). I think I need to choose between file_get_contents and cURL to fetch the data.
I also need some kind of parser, but I do not know which one I should use here. Can you give me some hints?

First, the fetching part, with cURL:

I have never needed to use cURL myself, but the obvious resource is the example from php.net:

<?php
// create a new cURL resource
$ch = curl_init();

// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "http://www.example.com/");
curl_setopt($ch, CURLOPT_HEADER, 0);

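// return the response as a string instead of printing it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

// grab the URL; $data holds the HTML, or false on failure
$data = curl_exec($ch);

// close cURL resource, and free up system resources
curl_close($ch);


// then you can use $data for parsing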
?>

To be frank:

If cURL is not available, file_get_contents() will work too. My impression is that it is about 1-2 seconds slower, but the call is much simpler.

<?php
$html = file_get_contents('http://www.example.com');

// now all the HTML is in $html
?>
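
A bare file_get_contents() call gives no control over timeouts and emits a warning on failure. The following is a minimal sketch of a wrapper that adds both; the function name fetch_html, the timeout default, and the user-agent string are assumptions for illustration, not something from this thread.

```php
<?php
// Hypothetical helper (not from the thread): fetch a URL or a local path and
// return its contents, or null on failure.
function fetch_html(string $url, int $timeoutSeconds = 10): ?string
{
    // The 'http' options apply only to http:// and https:// URLs.
    $context = stream_context_create([
        'http' => [
            'timeout'    => $timeoutSeconds,
            'user_agent' => 'example-harvester/0.1', // assumed value
        ],
    ]);

    // @ suppresses the warning; failure is reported through the return value.
    $html = @file_get_contents($url, false, $context);

    return $html === false ? null : $html;
}
```

Calling fetch_html('http://www.example.com') would then return either the page source or null, so the caller can check for null before starting to parse.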

Anyway, I think the much more interesting part is the parsing.

I have to parse the page in order to get the following data. See the site for examples: http://www.aktive-buergerschaft.de/buergerstiftungsfinder

Bürgerstiftung Lebensraum ***
rechtsfähige Stiftung des bürgerlichen Rechts
Ansprechpartner: Hubert Schramm
Alexanderstr. 69/ 71
52062
Telefon: 0241 - 4500130
Telefax: 0241 - 4500131
Email: info@buergerstiftung-.de
www.buergerstiftung-***.de
>> Weitere Details zu dieser Stiftung

Bürgerstiftung Achim
rechtsfähige Stiftung des bürgerlichen Rechts
Ansprechpartner: Helga Kühn
Rotkehlchenstr. 72
28832 Achim
Telefon: 04202-84981
Telefax: 04202-955210
Email: info@buergerstiftung-achim.de
www.buergerstiftung-achim.de
>> Weitere Details zu dieser Stiftung

BürgerStiftung Region Ahrensburg
rechtsfähige Stiftung des bürgerlichen Rechts
Ansprechpartner: Dr. Michael Eckstein
An der Reitbahn 3
22926 Ahrensburg
Telefon: 04102 - 67 84 89
Telefax: 04102 - 82 34 56
Email: info@buergerstiftung-ahrensburg.de
www.buergerstiftung-region-ahrensburg.de
>> Weitere Details zu dieser Stiftung

Note: see the link ">> Weitere Details zu dieser Stiftung". I also need to grab the data that is behind this link.
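
One possible way to find those detail links is to parse the page with DOMDocument and keep every anchor whose text contains the label. This is only a sketch: the function name find_detail_links and the sample markup (including the href values) are made up for illustration, and the real page may be structured differently.

```php
<?php
// Hypothetical helper: return the href of every link whose text contains $label.
function find_detail_links(string $html, string $label): array
{
    libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        if (strpos($a->textContent, $label) !== false) {
            $links[] = $a->getAttribute('href');
        }
    }
    return $links;
}

// Sample markup; the href values are invented for this example.
$sample = '<dl><dt>Buergerstiftung Achim</dt>'
        . '<dd><a href="/detail?id=1">&gt;&gt; Weitere Details zu dieser Stiftung</a></dd></dl>'
        . '<p><a href="/impressum">Impressum</a></p>';

print_r(find_detail_links($sample, 'Weitere Details'));
```

Each returned href could then be fetched in turn with cURL or file_get_contents, after resolving relative paths against the site's base URL.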

sneakyimp

file_get_contents only works for remote URLs if your configuration has allow_url_fopen = On in php.ini.

cURL may or may not be included in your installation of PHP.

In either case, getting the file is the important part and the two are roughly equivalent so don't sweat too much over the best way to fetch the file unless you plan to try and distribute your code for use on a wide variety of machines.
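
If the code does need to run on a variety of machines, it can check at runtime which of the two mechanisms is available. A small sketch, where the function name available_fetch_method is an assumption for illustration:

```php
<?php
// Hypothetical helper: report which fetch mechanism this PHP installation offers.
function available_fetch_method(): ?string
{
    // curl_init() only exists when the cURL extension is loaded.
    if (function_exists('curl_init')) {
        return 'curl';
    }
    // ini_get() returns the setting as a string such as "1", "On" or "".
    if (filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN)) {
        return 'file_get_contents';
    }
    return null;
}

echo available_fetch_method() ?? 'no HTTP fetch method available', "\n";
```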

Once you have the file's data in a string variable, you could use preg_match with an elaborate regular expression or you could parse the document using DOM.
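
To illustrate the regular-expression route, here is a sketch that works on one record as plain text (copied from the question). It uses preg_match_all rather than preg_match so that every "Label: value" line is captured in one pass; the function name parse_labelled_fields is hypothetical, and the real page would first have to be reduced to text like this.

```php
<?php
// One record as plain text, copied from the listing in the question.
$record = "Bürgerstiftung Achim\n"
        . "rechtsfähige Stiftung des bürgerlichen Rechts\n"
        . "Ansprechpartner: Helga Kühn\n"
        . "Rotkehlchenstr. 72\n"
        . "28832 Achim\n"
        . "Telefon: 04202-84981\n"
        . "Telefax: 04202-955210\n"
        . "Email: info@buergerstiftung-achim.de\n";

// Hypothetical helper: pull every "Label: value" line into an associative array.
function parse_labelled_fields(string $text): array
{
    $fields = [];
    // The m modifier makes ^ and $ match at the start and end of each line.
    if (preg_match_all('/^(\w+):\s*(.+)$/m', $text, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $fields[$m[1]] = trim($m[2]);
        }
    }
    return $fields;
}

$fields = parse_labelled_fields($record);
echo $fields['Telefon'], "\n";
```

Lines without a label, such as the name and the street address, are not matched by this pattern and would need separate handling, which is one reason the DOM route may be easier.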

I believe the text you'll get via cURL is not the simple text you have in your post, but is instead HTML (i.e., the HTML you see when you load that page and click 'view source'). That being the case, I think using DOM would be much easier. However, the document apparently doesn't parse cleanly: I tried the following and kept getting lots and lots of errors when trying to parse the remote document.

// fetch the remote document's HTML contents
$html = file_get_contents('http://www.aktive-buergerschaft.de/buergerstiftungsfinder');
if ($html === false) {
  throw new Exception("unable to fetch the remote document");
}

// load the HTML and parse it into a DOM object; loadHTML() tolerates
// real-world markup, whereas loadXML() requires well-formed XML and
// will typically raise many errors on an HTML page
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// fetch the DOM element that contains all the data:
$div = $doc->getElementById('divider');

if (is_null($div)) {
  throw new Exception("unable to retrieve content div");
}

// each DL tag contains one data record
$dl_elements = $div->getElementsByTagName('dl');
if ($dl_elements->length === 0) {
  throw new Exception("no dl elements found in content div");
}
$records = array();
for ($c = 0; $c < $dl_elements->length; $c++) {
  $dl = $dl_elements->item($c);
  // item() is a method, so it takes parentheses rather than brackets
  $title_element = $dl->getElementsByTagName('dt')->item(0);
  if ($title_element !== null) {
    $records[$c]['title'] = $doc->saveHTML($title_element);
  }
}

print_r($records);
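
The same idea can be tried offline against a small inline sample, which makes it easier to debug the extraction before pointing it at the live site. This sketch uses DOMXPath instead of nested getElementsByTagName calls; the sample markup is an assumption loosely modelled on the description above (umlauts are written as "ue" to sidestep encoding handling), and the real page may differ.

```php
<?php
// Assumed markup; the real page may be structured differently.
$html = '<div id="divider">'
      . '<dl><dt>Buergerstiftung Achim</dt><dd>28832 Achim</dd></dl>'
      . '<dl><dt>BuergerStiftung Region Ahrensburg</dt><dd>22926 Ahrensburg</dd></dl>'
      . '</div>';

libxml_use_internal_errors(true);       // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$records = [];
// Select every <dl> inside the element with id="divider".
foreach ($xpath->query('//*[@id="divider"]//dl') as $dl) {
    $dt = $dl->getElementsByTagName('dt')->item(0);
    $dd = $dl->getElementsByTagName('dd')->item(0);
    $records[] = [
        'title'   => $dt ? trim($dt->textContent) : null,
        'address' => $dd ? trim($dd->textContent) : null,
    ];
}

print_r($records);
```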