Hi there, good day community,
I have a parsing task to do. I have several thousand HTML files in a folder and I want to parse them to extract some results.
This is the set of information I want to gather from each file:
country: countryname
name: myname
School-type: Type one
Adress: 20000 New York, Broadway 16
Telefon: 053333052-9899-0, Fax: 053333052-9899-55
index-number: 26666932002
Webmaster: Linus Thorwald
site registered at: 08.03.2010
Website:
I can also rebuild a URL from the index-number.
See the relevant HTML here (the full sample is further below):
<div style="display: inline;"><div class="logo_homepage"><a class="img_inl" href="http://www.the_search_site.org/26666932002"></a></div>
I have to extract the index-number (here: 26666932002) and append it to the short URL http://www.the_search_site.org/
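To make this step concrete, here is a rough sketch of what I have in mind, assuming Python with only the standard library. The function names and the regular expressions are my own guesses for illustration, and it assumes the index-number is always the last, purely numeric part of the link:

```python
import re
from urllib.parse import urlparse

# Short URL taken from the sample link above.
BASE_URL = "http://www.the_search_site.org/"


def index_number_from_href(href):
    """Return the trailing run of digits in the link path, or None if there is none."""
    match = re.search(r"/(\d+)/?$", urlparse(href).path)
    return match.group(1) if match else None


def rebuild_url(index_number):
    """Append the index-number to the short URL."""
    return BASE_URL + index_number


snippet = (
    '<div class="logo_homepage"><a class="img_inl" '
    'href="http://www.the_search_site.org/26666932002"></a></div>'
)

# Assumption: the link of interest is the <a> tag carrying class="img_inl".
href = re.search(r'class="img_inl"\s+href="([^"]+)"', snippet).group(1)
index_number = index_number_from_href(href)
print(rebuild_url(index_number))  # http://www.the_search_site.org/26666932002
```

A regular expression is probably fine for this one attribute, but I suspect a real HTML parser is the safer choice for the other fields.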
How should I proceed to gather the results mentioned above?
Below is the shortened HTML of one result:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<!-- einzelergebnis.html?Id=26666932002&treffer=2139&auswahl_1=0&auswahl_2=0&auswahl_3=0&suchtext=&kategorie=&region=de&trefferzahlauswahl=alle&trefferzahl=10517&list_anfang=0&sort=-->
<title>result-title: MyName, New York </title>
<img src="Contryname" title="Contryname" />
<div style="width: 40em;">
<div style="display: inline;"><div class="logo_homepage"><a class="img_inl" href="http://www.the_search_site.org/26666932002"></a></div>
<div class="fm_linkeSpalte"><h2>My name</h2>
<span class="schulart_text">School-type: Type one</span>
<p class="einzel_text">Adress: 20000 New York, Broadway 16
<br />
Telefon: 053333052-9899-0, Fax: 053333052-9899-55
<br />
index-number: 26666932002 <br />
Webmaster: <a href="mailto: webmaster@the-site.com" class="p1">Linus Thorwald</a><br /></p> </div>
<div>
<p class="ta_left einzel_text">
</p></div>
<br /><div><p class="ta_left einzel_text">registered at: 08.03.2010</p></div>
</div>
</div>
</div>
</div>
<!-- einzelergebnis.html?Id=26666932002&treffer=2139&auswahl_1=0&auswahl_2=0&auswahl_3=0&suchtext=&kategorie=&region=de&trefferzahlauswahl=alle&trefferzahl=10517&list_anfang=0&sort=-->
</html>
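For a single file, this is a minimal sketch of how the parsing might look, assuming Python and its built-in html.parser module. The class name ResultParser, the function parse_result, the dictionary keys and the label patterns are all my own inventions, and the embedded SAMPLE is a trimmed copy of the HTML above. It also assumes that the country sits in the title attribute of the first img tag and that the labels (Adress:, Telefon:, and so on) are spelled the same in every file:

```python
import re
from html.parser import HTMLParser


class ResultParser(HTMLParser):
    """Collects the visible text plus the <h2> name, the <img title> and the logo link."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.country = None
        self.name = None
        self.href = None
        self._in_h2 = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("title") and self.country is None:
            self.country = attrs["title"]
        elif tag == "a" and attrs.get("class") == "img_inl":
            self.href = attrs.get("href")
        elif tag == "h2":
            self._in_h2 = True
        elif tag == "br":
            self.text_parts.append("\n")  # treat <br /> as a line break

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False
        if tag in ("p", "div", "span", "h2"):
            self.text_parts.append("\n")  # keep block elements on separate lines

    def handle_data(self, data):
        if self._in_h2:
            self.name = data.strip()
        self.text_parts.append(data)


# One pattern per labelled field; each captures the rest of the line after the label.
LABELS = {
    "school_type": r"School-type:[ \t]*(.+)",
    "address": r"Adress:[ \t]*(.+)",
    "phone": r"Telefon:[ \t]*([^,\n]+)",
    "fax": r"Fax:[ \t]*(.+)",
    "index_number": r"index-number:[ \t]*(\d+)",
    "webmaster": r"Webmaster:[ \t]*(.+)",
    "registered_at": r"registered at:[ \t]*([\d.]+)",
}


def parse_result(html):
    """Return a dict of the fields found in one result page (missing fields are None)."""
    parser = ResultParser()
    parser.feed(html)
    text = "".join(parser.text_parts)
    record = {"country": parser.country, "name": parser.name, "link": parser.href}
    for key, pattern in LABELS.items():
        match = re.search(pattern, text)
        record[key] = match.group(1).strip() if match else None
    return record


SAMPLE = """
<img src="Contryname" title="Contryname" />
<div class="logo_homepage"><a class="img_inl" href="http://www.the_search_site.org/26666932002"></a></div>
<div class="fm_linkeSpalte"><h2>My name</h2>
<span class="schulart_text">School-type: Type one</span>
<p class="einzel_text">Adress: 20000 New York, Broadway 16
<br />
Telefon: 053333052-9899-0, Fax: 053333052-9899-55
<br />
index-number: 26666932002 <br />
Webmaster: <a href="mailto: webmaster@the-site.com" class="p1">Linus Thorwald</a><br /></p> </div>
<div><p class="ta_left einzel_text">registered at: 08.03.2010</p></div>
"""

for field, value in parse_result(SAMPLE).items():
    print(f"{field}: {value}")
```

The idea is to flatten the page into text lines first and then pick the values by their labels, since most of the fields sit in one paragraph separated only by line breaks.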
In short:
1) I have more than 10,000 files in a folder, all with the same structure. They contain the information I want to gather.
2) If I can parse one file, I can do the same with all the others.
3) How do I parse a file to get the information (the addresses mentioned above, with 5 lines of text; see the sample HTML above)?
4) After getting the addresses, I have to build the URL, which is the short URL combined with an id-number.
5) The address data set contains this id-number. I only have to append it to the short URL and then I get the URL.
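To illustrate points 1), 2) and 5), this is a rough sketch of the loop over the whole folder, again assuming Python with only the standard library. The names extract_index_number, collect and write_csv are hypothetical, the extraction is deliberately reduced to the index-number so the example stays short, and the UTF-8 encoding is a guess (undecodable bytes are replaced rather than raising an error):

```python
import csv
import re
from pathlib import Path

# Short URL taken from the sample link above.
BASE_URL = "http://www.the_search_site.org/"


def extract_index_number(html):
    """Hypothetical minimal parser: only pulls the index-number out of one page."""
    match = re.search(r"index-number:\s*(\d+)", html)
    return match.group(1) if match else None


def collect(folder, pattern="*.html"):
    """Yield one row per file; files without an index-number get empty fields."""
    for path in sorted(Path(folder).glob(pattern)):
        html = path.read_text(encoding="utf-8", errors="replace")
        number = extract_index_number(html)
        yield {
            "file": path.name,
            "index_number": number or "",
            "url": BASE_URL + number if number else "",
        }


def write_csv(rows, out_path):
    """Write the collected rows to a CSV file with a header line."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["file", "index_number", "url"])
        writer.writeheader()
        writer.writerows(rows)
```

With a folder called, say, html_files (a made-up name), calling write_csv(collect("html_files"), "results.csv") would produce one line per file. Because collect is a generator, the files are read one at a time, which should matter with more than 10,000 of them.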
Love to hear from you