Hello dear PHP friends, I have a terrible job in front of me.

But I am pretty sure that this can be done with PHP!

I have a parsing job, and I need good tools for it.

I am quite bad at PHP and only know a little Perl.

I like the idea of using HTML::TokeParser::Simple and DBI to do the parsing job, with additional storage of the results.

I have very little experience with HTML::TokeParser::Simple, and this task goes over my head. Note: I have also had a look at XPath, which seems to be an appropriate way as well. But at the moment I have trouble finding the corresponding XPath expressions. I tried to determine the expressions that need to be filled into the Perl program,

taken from this example page: http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04313488

Note: the grey shaded block contains the wanted information, 17 lines in total. I have 5000 different HTML files that are all structured in the very same way!
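Since all the files share one structure, the same extraction code can simply be run in a loop over them. The following is only a minimal sketch of that loop, using core Perl alone. The helper name html_files_in and the directory name pages are assumptions made for illustration, not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the sorted list of .htm/.html files in a directory.
sub html_files_in {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open directory $dir: $!";
    my @files = sort grep { /\.html?\z/i && -f "$dir/$_" } readdir $dh;
    closedir $dh;
    return map { "$dir/$_" } @files;
}

# 'pages' is an assumed directory name; point it at the real download folder.
my $dir = 'pages';
if ( -d $dir ) {
    for my $file ( html_files_in($dir) ) {
        # the per-file parsing would go here
        print "would parse $file\n";
    }
}
```

Sorting the file names keeps the processing order stable between runs, which makes it easier to resume after a failure.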

This is what i have:


use strict;
use warnings;

# the module is HTML::TreeBuilder::XPath (double colons, "XPath")
use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;

#use real file name here
open(my $fh, "<", "file.html") or die $!;

$tree->parse_file($fh);

my ($name) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($type) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($adress) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($adress_two) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($telephone) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($fax) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($internet) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($officer) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($employees) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($offices) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($worker) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($country) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($the_council)= $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});


# print one value per line so the fields do not run together
print $name->as_text, "\n";
print $type->as_text, "\n";
print $adress->as_text, "\n";
print $adress_two->as_text, "\n";
print $telephone->as_text, "\n";
print $fax->as_text, "\n";
print $internet->as_text, "\n";
print $officer->as_text, "\n";
print $employees->as_text, "\n";
print $offices->as_text, "\n";
print $worker->as_text, "\n";
print $country->as_text, "\n";
print $the_council->as_text, "\n";

Question: is this all right? BTW, this is one of the example pages: http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04313488

That means I would be happy to have a template that can be run with HTML::TokeParser::Simple and DBI.

That would be great!!

Love to hear from you :)
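For the storage part, a rough sketch of how extracted values could be written with DBI might look like the following. It assumes DBD::SQLite is installed and uses an in-memory database only for illustration. The table name schools, its columns, and the sample record are all made up and would need to match the real 17 lines.

```perl
use strict;
use warnings;
use DBI;

# In-memory SQLite database, only for illustration (requires DBD::SQLite).
my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1, AutoCommit => 1 } );

# Hypothetical table layout; the real one needs a column per wanted line.
$dbh->do(q{
    CREATE TABLE schools (
        source_file TEXT PRIMARY KEY,
        name        TEXT,
        telephone   TEXT
    )
});

# Placeholders (?) let DBI do the quoting of the values.
my $insert = $dbh->prepare(
    'INSERT INTO schools (source_file, name, telephone) VALUES (?, ?, ?)'
);

# Made-up sample record standing in for the values extracted from one file.
my %record = (
    source_file => 'file.html',
    name        => 'Example School',
    telephone   => '000 000000',
);
$insert->execute( @record{qw(source_file name telephone)} );

my ($count) = $dbh->selectrow_array('SELECT COUNT(*) FROM schools');
print "rows stored: $count\n";

$dbh->disconnect;
```

Preparing the statement once and calling execute per file should keep the loop over 5000 files reasonably cheap.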

    Ohh, sorry, I have to correct the paths. They are wrong!

    Sorry! I should have tested this before posting it here. I will try to correct the paths.

    bernhard

      Hello dear folks, good evening!

      This is solved! I have a solution with HTML::TableExtract.

      I also read the documentation for HTML::TableExtract, which might help here. HTML::TableExtract does a good job: it extracts specific tables from HTML source code, and it does that really well.

      BTW, I want/need to do this with a table on this site. See this page: SCHULE SUCHEN EINGANG

      Note: tick all checkboxes at the bottom of the page. You then get a result page with more than 6400 school results. On the right-hand side of each result there is a link Weitere Informationen anzeigen ("show further information"); clicking it gives you the detailed information:

      9 (or 10) lines:
      Schuldaten
      Schulnummer:
      Amtliche Bezeichnung:
      Strasse:
      Plz und Ort:
      Telefon:
      Fax:
      E-Mail-Adresse:
      Schuldaten ändern (is this UTF-8 encoded or what?)
      Schülergesamtzahl (is this UTF-8 encoded or what?)

      Question:
      Can HTML::TableExtract be applied here, on the result page with more than 6400 schools (see above)?

      Love to hear from you.

      See what I have until now:

      I am using HTML::TableExtract:

      #!/usr/bin/perl
      
      use strict; use warnings;
      use HTML::TableExtract;
      use YAML;
      
      my $te = HTML::TableExtract->new( attribs => {
          # TODO: the attribute => value pairs of the wanted <table> tag go here
      });
      
      $te->parse_file('myFile.html');
      my ($table) = $te->tables;
      
      for my $row ( $table->rows ) {
          cleanup(@$row);
          print "@$row\n";
      }
      
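      # Trim each cell in place: strip leading whitespace, strip trailing
      # spaces and non-breaking spaces, then collapse inner whitespace runs.
      sub cleanup {
          for ( @_ ) {
              next unless defined;   # empty cells can be undef
              s/\A\s+//;             # anchored, so only leading whitespace goes
              s/[\xa0 ]+\z//;
              s/\s+/ /g;
          }
      }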
      
      

      I need to have some help with the attributes!

      Any and all help will be greatly appreciated.
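As far as I understand the documentation, the attribs option restricts matching to tables whose <table> tag carries the given attributes. Here is a small sketch of that idea on made-up HTML. The border="1" attribute and the sample values are assumptions for illustration and are not taken from the real school pages.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Made-up HTML: two tables, only the second one has border="1".
my $html = <<'HTML';
<html><body>
<table><tr><td>navigation</td></tr></table>
<table border="1">
  <tr><td>Telefon:</td><td>000 000000</td></tr>
  <tr><td>Fax:</td><td>000 000001</td></tr>
</table>
</body></html>
HTML

# attribs: keep only tables whose <table> tag carries these attributes.
my $te = HTML::TableExtract->new( attribs => { border => 1 } );
$te->parse($html);
$te->eof;

# Turn the two-column label/value rows into a hash.
my %data;
for my $table ( $te->tables ) {
    for my $row ( $table->rows ) {
        my ( $label, $value ) = map { defined $_ ? $_ : '' } @$row;
        $label =~ s/:\s*\z//;    # drop the trailing colon from the label
        $data{$label} = $value;
    }
}

print "$_ => $data{$_}\n" for sort keys %data;
```

On the real pages, the attribute names and values would have to be read from the HTML source of the wanted table.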

      regards!
      Bernhard_hinault
