Good evening dear PHPFreaks, hello to everybody.
I want to create a link parser, and I have chosen to do it with cURL. I have put a few lines together and would love to hear your review. Since I am new to programming, I would appreciate some hints from experienced devs.
Here are some details: there are several hundred result pages derived from this one: http://www.educa.ch/dyn/79362.asp?action=search
Note: I want to iterate over the result pages with a loop. For example:
http://www.educa.ch/dyn/79376.asp?id=1568
http://www.educa.ch/dyn/79376.asp?id=2149
I use this loop:
for ($i = 1; $i <= $match[1]; $i++)
{
    $url = "http://www.example.com/page?page={$i}";
    // access new sub-page, extract necessary data
}
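A minimal sketch of how the loop could be combined with cURL, assuming the detail pages are addressed by a numeric id as in the two URLs above. The function names buildDetailUrl, fetchPage and fetchRange are made up for illustration, and the timeout and pause values are arbitrary.

```php
<?php
// Hypothetical helper: builds the detail-page URL for a given id.
function buildDetailUrl(int $id): string
{
    return "http://www.educa.ch/dyn/79376.asp?id={$id}";
}

// Fetches one page with cURL; returns the body, or null on failure.
function fetchPage(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);          // give up after 15 seconds (arbitrary)
    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    return ($body !== false && $status === 200) ? $body : null;
}

// Yields id => html for every page in the range that could be fetched.
function fetchRange(int $firstId, int $lastId): Generator
{
    for ($id = $firstId; $id <= $lastId; $id++) {
        $html = fetchPage(buildDetailUrl($id));
        if ($html !== null) {
            yield $id => $html;
        }
        usleep(250000); // short pause between requests (arbitrary)
    }
}
```

It could then be used as foreach (fetchRange(1, 3000) as $id => $html) { ... }, where the upper bound 3000 is only a guess and would have to be taken from the search result page.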
What do you think? What about the loop over the target URLs?
BTW: as you can see, some of the pages will be empty. The empty pages should be thrown away; I do not want to store "empty" stuff.
That is what I want to do, and now I need a good parser script.
Note: this is a three-part job:
- fetching the sub-pages
- parsing them
- storing the data in a mysql-db
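For the third step, a rough sketch with PDO and a prepared statement might look like the following. The table name schools, its columns, and the function name storeSchool are assumptions made up for this example; nothing here is a fixed schema.

```php
<?php
// Inserts one parsed record into a hypothetical "schools" table.
// Returns false, and stores nothing, when every field is empty.
function storeSchool(PDO $pdo, array $row): bool
{
    $fields = ['name', 'street', 'city', 'email', 'phone', 'fax'];
    $params = [];
    foreach ($fields as $f) {
        // Treat missing values and empty strings as NULL.
        $params[":$f"] = (isset($row[$f]) && $row[$f] !== '') ? $row[$f] : null;
    }

    $filled = array_filter($params, fn($v) => $v !== null);
    if (count($filled) === 0) {
        return false; // nothing worth storing
    }

    $stmt = $pdo->prepare(
        'INSERT INTO schools (name, street, city, email, phone, fax)
         VALUES (:name, :street, :city, :email, :phone, :fax)'
    );

    return $stmt->execute($params);
}
```

The connection itself would be created once, for example with new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', $user, $pass), where the host, database name and credentials are placeholders.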
The problem is that some of the pages mentioned above are empty, so I need to find a way to
skip them, because I do not want to fill my MySQL database with useless entries.
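One possible way to detect such pages is sketched below, under the assumption that an "empty" page is one whose body contains no visible text. Whether that matches what the site really returns for a missing id would need to be checked; the function name isEmptyPage is made up.

```php
<?php
// Returns true when the page has no visible text inside <body>.
// Assumption: this is what an "empty" result page looks like.
function isEmptyPage(string $html): bool
{
    if (trim($html) === '') {
        return true; // nothing was returned at all
    }

    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // keep warnings about sloppy markup quiet
    $doc->loadHTML($html);
    libxml_clear_errors();

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return true; // no body element found
    }

    // textContent concatenates the text of all descendant nodes.
    return trim($body->textContent) === '';
}
```

Pages for which isEmptyPage() returns true can simply be skipped with continue inside the loop, so they never reach the parsing and storing steps.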
BTW, the parsing part should be doable with DOMDocument. What do you think? I need to combine the first part with the second. Can you give me some starting points and hints for this?
The fetching should be done with cURL, and the fetched data should then be passed to a DOMDocument parser.
Fetching is no problem, but how do I do the DOMDocument part?
I have installed Firebug in Firefox,
and now I have the XPaths for these pages:
http://www.educa.ch/dyn/79376.asp?id=1187
http://www.educa.ch/dyn/79376.asp?id=2939
http://www.educa.ch/dyn/79376.asp?id=1515
http://www.educa.ch/dyn/79376.asp?id=1469
Altes Schulhaus Ossingen :: /html/body/div[2]
Guntibachstrasse 10 :: /html/body/div[4]
8475 Ossingen :: /html/body/div[6]
sekretariat.psossingen@bluewin.ch :: /html/body/div[9]/a
Tel:052 317 15 45 :: /html/body/div[11]
Fax:052 317 04 42 :: /html/body/div[12]
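A minimal sketch of how these XPaths might be applied with PHP's built-in DOMDocument and DOMXPath classes (not Simple HTML DOM). The field names and the sample markup are made up for illustration; the real pages will look different, and XPaths copied from Firebug reflect the browser's repaired DOM, so they may need adjusting for what libxml parses.

```php
<?php
// Hypothetical field names mapped to the XPaths reported by Firebug.
const FIELD_XPATHS = [
    'name'   => '/html/body/div[2]',
    'street' => '/html/body/div[4]',
    'city'   => '/html/body/div[6]',
    'email'  => '/html/body/div[9]/a',
    'phone'  => '/html/body/div[11]',
    'fax'    => '/html/body/div[12]',
];

// Returns field => text; a field is null when its XPath matches nothing.
function extractFields(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // keep warnings about sloppy markup quiet
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $row = [];
    foreach (FIELD_XPATHS as $field => $query) {
        $nodes = $xpath->query($query);
        $row[$field] = ($nodes !== false && $nodes->length > 0)
            ? trim($nodes->item(0)->textContent)
            : null;
    }
    return $row;
}

// Made-up sample page with twelve <div> elements, only to exercise the XPaths.
$divs = [
    'Schule', 'Altes Schulhaus Ossingen', '', 'Guntibachstrasse 10', '',
    '8475 Ossingen', '', '',
    '<a href="mailto:sekretariat.psossingen@bluewin.ch">sekretariat.psossingen@bluewin.ch</a>',
    '', 'Tel:052 317 15 45', 'Fax:052 317 04 42',
];
$sample = '<html><body><div>' . implode('</div><div>', $divs) . '</div></body></html>';

$row = extractFields($sample);
print_r($row);
```

Keeping the XPaths in one array means that only this mapping has to change if the page layout turns out to differ from what Firebug showed.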
But how do I apply these in Simple HTML DOM? I want to use this library: http://simplehtmldom.sourceforge.net/
I look forward to a hint that gives me a starting point.