fetching data with PHP Simple HTML DOM Parser

dilbert2010 · Jan 29, 2018

Good day, dear experts,

I need to fetch the data from this page:

http://europa.eu/youth/volunteering/evs-organisation_en

First I view the page source to find the HTML elements: view-source:https://europa.eu/youth/volunteering/evs-organisation_en

Note: I need to fetch the data that comes right below this line:

<h3>EVS accredited organisations search results: <span class="ey_badge">6066</span></h3>  </div>

I have several options. One is to do this with PHP Simple HTML DOM Parser (cf. http://simplehtmldom.sourceforge.net/manual.htm ). That way I need to create an HTML DOM object.

BTW, there are other options. One is a special function, pc_link_extractor, which extracts all the links:

function pc_link_extractor($s) {
    $a = array();
    // Capture group 1 is the href value, group 2 is the link text.
    if (preg_match_all('/<a\s+.*?href=[\"\']?([^\"\' >]*)[\"\']?[^>]*>(.*?)<\/a>/i', $s, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            array_push($a, array($match[1], $match[2]));
        }
    }
    return $a;
}

Or I could do it with preg_match_all directly. See for example:

preg_match_all("|<[^>]+>(.*)</[^>]+>|U",
    '<b>example: </b><div align="left">this is a test</div>',
    $out,
    PREG_PATTERN_ORDER);

Here is the dataset I am interested in, taken from the site: http://europa.eu/youth/volunteering/evs-organisation_en

  <div class="view-content">

<div id="views-bootstrap-grid-1" class="views-bootstrap-grid-plugin-style">
            <div class="row is-flex">
                  <div class="col-md-4">
            <div class="vp ey_block block-is-flex">
  <div class="ey_inner_block">
    <h4 class="text-center"><a href="/youth/volunteering/organisation/948417016_en" target="_blank">&quot;Academy for Peace and Development&quot; Union</a></h4>
          <div class="org_cord"><strong>Topics: </strong>Access for disadvantaged; Youth (Participation, Youth Work, Youth Policy); Intercultural/intergenerational education and (lifelong)learning</div>
            <p class="ey_info">
    <i class="fa fa-location-arrow fa-lg"></i>
    Tbilisi, <strong>Georgia</strong>
</p>    <p class="ey_info"><i class="fa fa-hand-o-right fa-lg"></i> Receiving, Sending</p>
          <p class="ey_info"><i class="fa fa-external-link fa-lg"></i><span> <a href="http://www.apd.ge" target="_blank">www.apd.ge</a></span></p>
                  <p><strong>PIC no:</strong> 948417016</p>
        <div class="empty-block">
      <a href="/youth/volunteering/organisation/948417016_en" target="_blank" class="ey_btn btn btn-default pull-right">Read more</a>    </div>
  </div>
</div>
          </div>
                  <div class="col-md-4">

Note: there are hundreds of pages (see the pagination at the bottom of the page).

Well, you see that we have some options here.

Which way should I go? Which way would you go?

I'd love to hear from you.

Greetings

dalecosp · Jan 29, 2018

Looking at the target site, I wouldn't recommend using regular expressions OR SimpleDOM. I would use the real PHP DOMDocument extension.

I have quite a bit of experience with all of it (see ombe.com/ezlink), and DOM typically yields the most consistent results with the least head-scratching, although SimpleDOM is great if the tree is, as it says, "simple" ... but I don't feel that this one is.

I'll tell you kind of how I'd do it, which will, I guess, give you some insight into how our system works at OMBE. Grab that first page (file_get_contents() should work), find that "LI" with class "pager-last", parse it (probably $pl_node->firstChild) and get that page number.
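A minimal sketch of that first step, assuming the pager markup looks like the hypothetical snippet below (an LI with class "pager-last" wrapping a link whose href carries a zero-based "page" query parameter). The function name last_page_index() is made up for illustration, and the sketch reads the link's href rather than using firstChild. In practice $html would come from file_get_contents() on the first page.

```php
<?php
// Hypothetical pager markup; the real page may differ.
$html = <<<'HTML'
<ul class="pagination">
  <li class="pager-next"><a href="/youth/volunteering/evs-organisation_en?page=1">next</a></li>
  <li class="pager-last last"><a href="/youth/volunteering/evs-organisation_en?page=287">last</a></li>
</ul>
HTML;

// Returns the page index found in the "pager-last" link, or null if there is none.
function last_page_index(string $html): ?int
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('li') as $li) {
        // The class attribute may hold several names, so compare whole tokens.
        $classes = preg_split('/\s+/', trim($li->getAttribute('class')));
        if (!in_array('pager-last', $classes, true)) {
            continue;
        }
        $link = $li->getElementsByTagName('a')->item(0);
        if ($link === null) {
            return null;
        }
        // Pull "page=N" out of the href's query string.
        $query = parse_url($link->getAttribute('href'), PHP_URL_QUERY);
        parse_str((string) $query, $params);
        return isset($params['page']) ? (int) $params['page'] : null;
    }
    return null;
}

$last = last_page_index($html);
echo $last === null ? "no pager found\n" : "last page index: $last\n";
```

If the index really is zero-based, the number of pages to download is the returned value plus one.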

Then you're going to download a bunch of pages (as I currently see 288 pages of data).

Looping through those pages, you'll load it with file_get_contents(), looking for DIV elements with class "ey_inner_block". You may have to do saveHTML() on these elements and treat them as mini-documents within a loop. You'll be looking for <P>'s, <I>'s, it's rather a mishmash.
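Here is a rough sketch of the per-page step, run against one result block trimmed from the markup quoted in the first post. Instead of calling saveHTML() on each block, it scopes the lookups to the block element itself, which is a different route to the same result. The helper names has_class() and parse_blocks() and the output keys are made up for illustration.

```php
<?php
// One result block, trimmed from the markup quoted in the first post.
$html = <<<'HTML'
<div class="ey_inner_block">
  <h4 class="text-center"><a href="/youth/volunteering/organisation/948417016_en" target="_blank">&quot;Academy for Peace and Development&quot; Union</a></h4>
  <p class="ey_info"><i class="fa fa-location-arrow fa-lg"></i> Tbilisi, <strong>Georgia</strong></p>
  <p><strong>PIC no:</strong> 948417016</p>
</div>
HTML;

// True if the element's class attribute contains $class as a whole token.
function has_class(DOMElement $el, string $class): bool
{
    return in_array($class, preg_split('/\s+/', trim($el->getAttribute('class'))), true);
}

// Returns one associative array per DIV with class "ey_inner_block".
function parse_blocks(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $rows = [];
    foreach ($doc->getElementsByTagName('div') as $div) {
        if (!has_class($div, 'ey_inner_block')) {
            continue;
        }
        $row = ['title' => null, 'url' => null, 'location' => null, 'pic' => null];

        // Title and detail link both live in the H4.
        $h4 = $div->getElementsByTagName('h4')->item(0);
        if ($h4 !== null) {
            $row['title'] = trim($h4->textContent);
            $a = $h4->getElementsByTagName('a')->item(0);
            if ($a !== null) {
                $row['url'] = $a->getAttribute('href');
            }
        }

        // The P elements are told apart by their text or by the icon they contain.
        foreach ($div->getElementsByTagName('p') as $p) {
            $text = trim(preg_replace('/\s+/', ' ', $p->textContent));
            $icon = $p->getElementsByTagName('i')->item(0);
            if (strpos($text, 'PIC no:') === 0) {
                $row['pic'] = trim(substr($text, strlen('PIC no:')));
            } elseif ($icon !== null && has_class($icon, 'fa-location-arrow')) {
                $row['location'] = $text;
            }
        }
        $rows[] = $row;
    }
    return $rows;
}

$rows = parse_blocks($html);
print_r($rows);
```

For the real job, the same function would be called once per downloaded page and the rows appended to one list.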

These kinds of projects seem to take a LOT of time to do for what they're worth. Good luck.

dilbert2010 · Jan 31, 2018

Good evening, dear dalecosp,

First of all, many thanks for the quick reply. I am very glad about this answer, and that you stopped me from thinking that SimpleDOM would be the appropriate tool.

I am not so experienced and therefore in need of your advice. Many thanks for your idea about DOMDocument. I will go that way.

great ideas - also the following:

Looping through those pages, you'll load it with file_get_contents(), looking for DIV elements with class "ey_inner_block". You may have to do saveHTML() on these elements and treat them as mini-documents within a loop. You'll be looking for <P>'s, <I>'s, it's rather a mishmash.

Many thanks - I will follow your advice,

greetings

dilbert2010 · Feb 10, 2018

Good day, dear dalecosp,

Many thanks for the great ideas you have. I have some musings: can we port the following XPath (from a Perl script) to PHP?


{ 
    internal_url => [ q#//a/@href#, [ $handler_relurl ] ], 
    external_url => [ q#//i[@class="fa fa-external-link fa-lg"]/parent::p//a/@href#, [ $handler_trim ] ], 
    title        => [ q#//h4# ], 
    topics       => [ q#//div[@class="org_cord"]#, [ $handler_val, $handler_split_colon ] ], 
    location     => [ q#//i[@class="fa fa-location-arrow fa-lg"]/parent::p#, [ $handler_trim ] ], 
    hand         => [ q#//i[@class="fa fa-hand-o-right fa-lg"]/parent::p#, [ $handler_trim, $handler_split_comma ] ], 
    pic_number   => [ q#//p[contains(.,'PIC no')]#, [ $handler_val ] ], 
} 
}; 

print Dumper browse( $conf ); 

sub browse
{
    my $conf = shift;

    my $ref = [ ];

    my $lwp_useragent = LWP::UserAgent->new( agent => q#IE 6#, timeout => 10 );
    my $response = $lwp_useragent->get( $conf->{url} );
    die $response->status_line unless $response->is_success;
    my $content = $response->decoded_content;
    my $html_treebuilder_xpath = HTML::TreeBuilder::XPath->new_from_content( $content );
    my @nodes = $html_treebuilder_xpath->findnodes( $conf->{parent} );
    for my $node ( @nodes )
    {

Can we port this XPath to PHP too?

Weedpacket · Feb 11, 2018

RTFM: see DOMXPath in the PHP manual.
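As a hedged sketch of what such a port might look like with DOMXPath: the expressions below are taken from the Perl configuration above, made relative to each result block with a leading "." and wrapped in normalize-space() to stand in for the trim handler. The block selector (DIVs with class "ey_inner_block") and the function name extract_organisations() are assumptions, and the other Perl handlers (splitting, value extraction) are not ported. The sample markup is trimmed from the first post.

```php
<?php
// Sample block trimmed from the markup quoted in the first post.
$html = <<<'HTML'
<div class="ey_inner_block">
  <h4 class="text-center"><a href="/youth/volunteering/organisation/948417016_en" target="_blank">&quot;Academy for Peace and Development&quot; Union</a></h4>
  <div class="org_cord"><strong>Topics: </strong>Access for disadvantaged; Youth (Participation, Youth Work, Youth Policy)</div>
  <p class="ey_info"><i class="fa fa-location-arrow fa-lg"></i> Tbilisi, <strong>Georgia</strong></p>
  <p class="ey_info"><i class="fa fa-hand-o-right fa-lg"></i> Receiving, Sending</p>
  <p class="ey_info"><i class="fa fa-external-link fa-lg"></i><span> <a href="http://www.apd.ge" target="_blank">www.apd.ge</a></span></p>
  <p><strong>PIC no:</strong> 948417016</p>
</div>
HTML;

// Field name => XPath 1.0 expression, evaluated relative to one result block.
$fields = [
    'internal_url' => 'normalize-space(.//a/@href)',
    'external_url' => 'normalize-space(.//i[@class="fa fa-external-link fa-lg"]/parent::p//a/@href)',
    'title'        => 'normalize-space(.//h4)',
    'topics'       => 'normalize-space(.//div[@class="org_cord"])',
    'location'     => 'normalize-space(.//i[@class="fa fa-location-arrow fa-lg"]/parent::p)',
    'hand'         => 'normalize-space(.//i[@class="fa fa-hand-o-right fa-lg"]/parent::p)',
    'pic_number'   => 'normalize-space(.//p[contains(., "PIC no")])',
];

// Returns one array of field values per result block found in $html.
function extract_organisations(string $html, array $fields): array
{
    $doc = new DOMDocument();
    // Suppress warnings from imperfect HTML while parsing.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Assumed parent selector: match "ey_inner_block" as a whole class token.
    $blocks = $xpath->query(
        '//div[contains(concat(" ", normalize-space(@class), " "), " ey_inner_block ")]'
    );

    $rows = [];
    foreach ($blocks as $block) {
        $row = [];
        foreach ($fields as $name => $expression) {
            // evaluate() returns a string here because of normalize-space().
            $row[$name] = $xpath->evaluate($expression, $block);
        }
        $rows[] = $row;
    }
    return $rows;
}

$rows = extract_organisations($html, $fields);
print_r($rows);
```

Splitting "topics" on ";" or "hand" on "," can then be done with explode() on the returned strings.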
