Running XPather against HTML: finding and identifying the nodes

bernard_hinault · Oct 16, 2010

Hello dear PHP friends, I have a terrible job in front of me.

But I am pretty sure that this can be done with PHP!

I have a parsing job, and I need to use good tools.

I am bad at PHP and only know a little Perl.

I like the idea of using HTML::TokeParser::Simple and DBI to do the parsing job, with additional storage of the results.

I have very little experience with HTML::TokeParser::Simple, and this task goes over my head. Note: I have also had a look at other ideas that seem to be an appropriate way as well, but at the moment I have trouble getting the corresponding XPath expressions. I tried to determine the XPath expressions that need to be filled into the Perl program, using this example page:

http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04313488

Note: the grey shaded block on that page holds the wanted information, 17 lines in all. I have 5000 different HTML files that are all structured in the very same way.

This is what i have:


use strict;
use warnings;

use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;

#use real file name here
open(my $fh, "<", "file.html") or die $!;

$tree->parse_file($fh);

my ($name) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($type) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($adress) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($adress_two) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($telephone) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($fax) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($internet) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($officer) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($employees) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($offices) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($worker) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($country) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($the_council)= $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});


# print one field per line
print $name->as_text, "\n";
print $type->as_text, "\n";
print $adress->as_text, "\n";
print $adress_two->as_text, "\n";
print $telephone->as_text, "\n";
print $fax->as_text, "\n";
print $internet->as_text, "\n";
print $officer->as_text, "\n";
print $employees->as_text, "\n";
print $offices->as_text, "\n";
print $worker->as_text, "\n";
print $country->as_text, "\n";
print $the_council->as_text, "\n";

Question: is this all right? For reference, see the example page linked above; the wanted information is in the grey shaded block.
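One thing that stands out in the listing above is that all thirteen XPath expressions are identical, so every variable would receive the same node. Below is a minimal sketch of letting the row index vary per field. The markup, the field-to-row mapping and the sample values are made up for illustration, and the real page structure has to be inspected before any of these paths can be reused.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Made-up two-column table standing in for the grey block;
# the real pages may be structured differently.
my $html = <<'HTML';
<html><body><table>
<tr><td>Name</td><td>Beispielschule</td></tr>
<tr><td>Telefon</td><td>0123 456789</td></tr>
<tr><td>Fax</td><td>0123 456780</td></tr>
</table></body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# Field name => row number (hypothetical mapping).
my %row_of = ( name => 1, telephone => 2, fax => 3 );

my %record;
for my $field ( sort keys %row_of ) {
    # Only the row index changes from field to field.
    my $xpath = "//table//tr[$row_of{$field}]/td[2]";
    $record{$field} = $tree->findvalue($xpath);
}

print "$_: $record{$_}\n" for sort keys %record;
$tree->delete;
```

Keeping the mapping in one hash means a wrong path only has to be corrected in one place, and the same loop could be run over each of the 5000 files.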

That means I would be happy to have a template that can be run with HTML::TokeParser::Simple and DBI.

That would be great!!

Love to hear from you :)

bernard_hinault · Oct 16, 2010

Ohh, sorry, I have to correct the paths. They are wrong!

Sorry! I should have tested the code before posting it here. I will try to correct the paths.

bernhard

bernard_hinault · Oct 16, 2010

Hello dear folks, good evening!

This is solved! I have a solution with HTML::TableExtract.

I also read the documentation for HTML::TableExtract, which might help here. HTML::TableExtract does a good job: it extracts specific tables from HTML source code, and it does that really well.

BTW, I want/need to do this with a table on a site. See this page: SCHULE SUCHEN EINGANG

Note: tick all the checkboxes at the bottom of the page. You then get a result page with more than 6400 schools. At the right of each result there is a link "Weitere Informationen anzeigen" (show further information); clicking it gives the detailed information:

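9 (or 10) lines:
Schuldaten (school data)
Schulnummer: (school number)
Amtliche Bezeichnung: (official name)
Strasse: (street)
Plz und Ort: (postal code and town)
Telefon: (telephone)
Fax:
E-Mail-Adresse: (e-mail address)
Schuldaten ändern (change school data; is this UTF-8 encoded or what?)
Schülergesamtzahl (total number of pupils; is this UTF-8 encoded or what?)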

Question: can HTML::TableExtract be applied here, to the result page with more than 6400 schools (see above)?

Love to hear from you.

See what I have so far:

I make use of HTML::TableExtract:

#!/usr/bin/perl

use strict; use warnings;
use HTML::TableExtract;
use YAML;

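# The attribute names and values are missing from this listing.
# They go here as name => 'value' pairs taken from the target <table> tag.
my $te = HTML::TableExtract->new( attribs => {
});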

$te->parse_file('myFile.html');
my ($table) = $te->tables;

for my $row ( $table->rows ) {
    cleanup(@$row);
    print "@$row\n";
}

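# Tidy the cells in place (the elements of @_ alias the caller's values)
sub cleanup {
    for ( @_ ) {
        next unless defined;   # skip empty cells
        s/^\s+//;              # strip leading whitespace
        s/[\xa0 ]+\z//;        # strip trailing spaces and non-breaking spaces
        s/\s+/ /g;             # collapse inner whitespace runs
    }
}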
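A note on the cleanup routine: a substitution without an anchor, such as s/\s+//, removes the first run of whitespace wherever it occurs, not only at the start of the string. Here is a small sketch of an anchored variant that needs nothing beyond core Perl. The sample row is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Trim and normalise every string in place; the elements of @_
# alias the caller's values, so no return value is needed.
sub cleanup {
    for (@_) {
        next unless defined;    # leave undefined cells untouched
        s/\A[\xa0\s]+//;        # leading whitespace and non-breaking spaces
        s/[\xa0\s]+\z//;        # trailing whitespace and non-breaking spaces
        s/[\xa0\s]+/ /g;        # collapse inner runs to a single space
    }
}

# Hypothetical row, for illustration only.
my @row = ( "  Schulnummer:\n", "\xa0 04313488 \xa0", "Plz  und\tOrt:" );
cleanup(@row);
print join( '|', @row ), "\n";
```

The non-breaking space (\xa0) is listed explicitly because \s does not necessarily match it in byte strings.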

I need some help with the attributes!
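One possible way to sidestep the attributes while experimenting is to construct HTML::TableExtract without any constraints, so that every table is collected, and then read each row as a label/value pair. This is only a sketch under assumptions: the HTML below is a made-up stand-in for one detail page, and the real markup may not be a simple two-column table.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;

# Made-up stand-in for one detail page; the real markup may differ.
my $html = <<'HTML';
<html><body>
<table>
  <tr><td>Schulnummer:</td><td>04313488</td></tr>
  <tr><td>Strasse:</td><td>Beispielweg 1</td></tr>
  <tr><td>Telefon:</td><td>0123 456789</td></tr>
</table>
</body></html>
HTML

# No constraints given: every table in the document is collected.
my $te = HTML::TableExtract->new;
$te->parse($html);
$te->eof;

my %record;
for my $table ( $te->tables ) {
    for my $row ( $table->rows ) {
        my ( $label, $value ) = @$row;
        next unless defined $label && defined $value;
        s/^\s+|\s+$//g for $label, $value;    # trim both ends
        $label =~ s/:$//;                     # drop the trailing colon
        $record{$label} = $value;
    }
}

print "$_ = $record{$_}\n" for sort keys %record;
```

Once the right table has been identified this way, constraints such as attribs can be added back to narrow the match, and the %record hash is a convenient shape for a later DBI insert.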

Any and all help will be greatly appreciated.

Regards!
Bernhard_hinault
