Good day, dear experts,

I need to fetch the data from this page:

http://europa.eu/youth/volunteering/evs-organisation_en

First I viewed the page source to find the HTML elements: view-source:https://europa.eu/youth/volunteering/evs-organisation_en

Note: I need to fetch the data that comes right below this line:

<h3>EVS accredited organisations search results: <span class="ey_badge">6066</span></h3>  </div>

I have several options. One is to do this with PHP Simple HTML DOM Parser (cf. http://simplehtmldom.sourceforge.net/manual.htm ). This way I need to create an HTML DOM object.

BTW, there are other options. One is a special function, pc_link_extractor, which extracts all the links:

function pc_link_extractor($s) {
    $a = array();
    // Capture each link's href (group 1) and its anchor text (group 2).
    if (preg_match_all('/<a\s+.*?href=[\"\']?([^\"\' >]*)[\"\']?[^>]*>(.*?)<\/a>/i', $s, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            array_push($a, array($match[1], $match[2]));
        }
    }
    return $a;
}


Or I could do it with preg_match_all.

See for example:

preg_match_all("|<[^>]+>(.*)</[^>]+>|U",
    "<b>example: </b><div align=left>this is a test</div>",
    $out,
    PREG_PATTERN_ORDER);

Here is the dataset I am interested in, taken from the site: http://europa.eu/youth/volunteering/evs-organisation_en

  <div class="view-content">

<div id="views-bootstrap-grid-1" class="views-bootstrap-grid-plugin-style">
            <div class="row is-flex">
                  <div class="col-md-4">
            <div class="vp ey_block block-is-flex">
  <div class="ey_inner_block">
    <h4 class="text-center"><a href="/youth/volunteering/organisation/948417016_en" target="_blank">&quot;Academy for Peace and Development&quot; Union</a></h4>
          <div class="org_cord"><strong>Topics: </strong>Access for disadvantaged; Youth (Participation, Youth Work, Youth Policy); Intercultural/intergenerational education and (lifelong)learning</div>
            <p class="ey_info">
    <i class="fa fa-location-arrow fa-lg"></i>
    Tbilisi, <strong>Georgia</strong>
</p>    <p class="ey_info"><i class="fa fa-hand-o-right fa-lg"></i> Receiving, Sending</p>
          <p class="ey_info"><i class="fa fa-external-link fa-lg"></i><span> <a href="http://www.apd.ge" target="_blank">www.apd.ge</a></span></p>
                  <p><strong>PIC no:</strong> 948417016</p>
        <div class="empty-block">
      <a href="/youth/volunteering/organisation/948417016_en" target="_blank" class="ey_btn btn btn-default pull-right">Read more</a>    </div>
  </div>
</div>
          </div>
                  <div class="col-md-4">

Note that there are hundreds of pages (see the pagination at the bottom of the page).

Well, you see that we have some options here.

Which way should I go? Which way would you go?

love to hear from you

Greetings

    Looking at the target site, I wouldn't recommend using regular expressions OR SimpleDOM. I would use the real PHP DOMDocument extension.

    I have quite a bit of experience with all of it (see ombe.com/ezlink), and DOM typically yields the most consistent results with the least head-scratching, although SimpleDOM is great if the tree is, as it says, "simple" ... but I don't feel that this one is.

    I'll tell you kind of how I'd do it, which will, I guess, give you some insight into how our system works at OMBE. Grab that first page (file_get_contents() should work), find that "LI" with class "pager-last", parse it (probably $pl_node->firstChild) and get that page number.
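The "pager-last" step could be sketched roughly as below. This is only a sketch under assumptions: `last_page_index()` is a hypothetical helper name, the sample markup is made up for illustration, and it assumes the last-page link carries a zero-based `page` query parameter, which the thread does not confirm.

```php
<?php
// Hypothetical helper: returns the page index found in the "pager-last" link,
// or null when no such link (or no usable "page" parameter) is present.
function last_page_index(string $html): ?int
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect libxml warnings instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Match "pager-last" as a whole class token, then take the href of the link inside.
    $hrefs = $xpath->query(
        '//li[contains(concat(" ", normalize-space(@class), " "), " pager-last ")]//a/@href'
    );
    if ($hrefs === false || $hrefs->length === 0) {
        return null;
    }

    // Assumption: the page number travels in a "page" query parameter.
    $queryString = parse_url($hrefs->item(0)->nodeValue, PHP_URL_QUERY);
    if (!is_string($queryString)) {
        return null;
    }
    parse_str($queryString, $params);
    if (!isset($params['page']) || !is_string($params['page']) || !ctype_digit($params['page'])) {
        return null;
    }
    return (int) $params['page'];
}

// Made-up sample markup; the real pager may look different.
$sample = '<ul class="pager">'
    . '<li class="pager-next"><a href="/youth/volunteering/evs-organisation_en?page=1">next</a></li>'
    . '<li class="pager-last last"><a href="/youth/volunteering/evs-organisation_en?page=287">last</a></li>'
    . '</ul>';

var_dump(last_page_index($sample)); // int(287)
```

For the live site, the `$html` argument would come from file_get_contents() on the first page. If the number turns out to be in the link text instead of the href, the same query without `/@href` plus `textContent` would be the place to look.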

    Then you're going to download a bunch of pages (as I currently see 288 pages of data).

    Looping through those pages, you'll load it with file_get_contents(), looking for DIV elements with class "ey_inner_block". You may have to do saveHTML() on these elements and treat them as mini-documents within a loop. You'll be looking for <P>'s, <I>'s, it's rather a mishmash.
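A rough sketch of that loop, with each "ey_inner_block" DIV serialised through saveHTML() so it can be handled as a mini-document. The names `page_url()` and `organisation_blocks()` are hypothetical, the `?page=N` URL scheme is an assumption, and the sample markup is a cut-down stand-in for the real page.

```php
<?php
// Assumption: result pages are addressed as "?page=0", "?page=1", and so on.
function page_url(string $base, int $page): string
{
    return $base . '?page=' . $page;
}

// Hypothetical helper: returns the outer HTML of every DIV whose class list
// contains the token "ey_inner_block".
function organisation_blocks(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // keep warnings about sloppy HTML quiet
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query(
        '//div[contains(concat(" ", normalize-space(@class), " "), " ey_inner_block ")]'
    );

    $blocks = [];
    foreach ($nodes as $node) {
        // saveHTML($node) serialises only this subtree, giving one mini-document per block.
        $blocks[] = $doc->saveHTML($node);
    }
    return $blocks;
}

// Offline demo with made-up markup.
$sample = '<div class="view-content">'
    . '<div class="vp ey_block"><div class="ey_inner_block"><h4><a href="/a">Org A</a></h4></div></div>'
    . '<div class="vp ey_block"><div class="ey_inner_block"><h4><a href="/b">Org B</a></h4></div></div>'
    . '</div>';
$blocks = organisation_blocks($sample);
echo count($blocks), " blocks found\n";

// Live run (left commented out); $lastPage would come from the pager:
// $base = 'https://europa.eu/youth/volunteering/evs-organisation_en';
// for ($page = 0; $page <= $lastPage; $page++) {
//     $html = file_get_contents(page_url($base, $page));
//     if ($html === false) { continue; }   // skip pages that failed to download
//     foreach (organisation_blocks($html) as $block) { /* parse $block here */ }
//     sleep(1);                            // pause between requests
// }
```

Matching the class as a whole token keeps the outer "ey_block" wrapper from being picked up by accident.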

    These kinds of projects seem to take a LOT of time to do for what they're worth. Good luck.

      good evening dear dalescop,

      first of all, many many thanks for the quick reply. I am very glad about this answer, and that you stopped me from thinking that SimpleDOM would be the appropriate tool.

      I am not so experienced and therefore need your advice. Many thanks for your idea about DOMDocument. I will go that way.

      great ideas - also the following:

      Looping through those pages, you'll load it with file_get_contents(), looking for DIV elements with class "ey_inner_block". You may have to do saveHTML() on these elements and treat them as mini-documents within a loop. You'll be looking for <P>'s, <I>'s, it's rather a mishmash.

      Many thanks - I will follow your advice,

      greetings

        10 days later

        good day dear dalesciü

        Many thanks for the great ideas you have. I have some musings: can we port the following XPath to PHP?

        
        { 
            internal_url => [ q#//a/@href#, [ $handler_relurl ] ], 
            external_url => [ q#//i[@class="fa fa-external-link fa-lg"]/parent::p//a/@href#, [ $handler_trim ] ], 
            title        => [ q#//h4# ], 
            topics       => [ q#//div[@class="org_cord"]#, [ $handler_val, $handler_split_colon ] ], 
            location     => [ q#//i[@class="fa fa-location-arrow fa-lg"]/parent::p#, [ $handler_trim ] ], 
            hand         => [ q#//i[@class="fa fa-hand-o-right fa-lg"]/parent::p#, [ $handler_trim, $handler_split_comma ] ], 
            pic_number   => [ q#//p[contains(.,'PIC no')]#, [ $handler_val ] ], 
        } 
        }; 
        
        print Dumper browse( $conf ); 
        
        sub browse 
        { 
            my $conf = shift; 
        
        my $ref = [ ]; 
        
        my $lwp_useragent = LWP::UserAgent->new( agent => q#IE 6#, timeout => 10 ); 
        my $response = $lwp_useragent->get( $conf->{url} ); 
        die $response->status_line unless $response->is_success; 
        my $content = $response->decoded_content;  
        my $html_treebuilder_xpath = HTML::TreeBuilder::XPath->new_from_content( $content ); 
        my @nodes = $html_treebuilder_xpath->findnodes( $conf->{parent} ); 
        for my $node ( @nodes ) 
        { 
        

        can we port the xpath to php too?
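The XPath expressions themselves should carry over to PHP's DOMXPath more or less unchanged. A minimal sketch, assuming each record lives in one "ey_inner_block" DIV as in the markup quoted at the top of the thread. The Perl handlers are not shown in full, so `strip_label()`, `split_list()`, `parse_organisations()` and the `$base` prefix are hypothetical stand-ins for what they appear to do. The main change is the leading `.` on every expression: with a context node, `//a` in DOMXPath still searches the whole document, while `.//a` stays inside the block.

```php
<?php
// Hypothetical stand-in for $handler_val: drop a leading "Label:" and trim the rest.
function strip_label(string $text): string
{
    $pos = strpos($text, ':');
    return trim($pos === false ? $text : substr($text, $pos + 1));
}

// Hypothetical stand-in for the split handlers: split, trim, drop empty parts.
function split_list(string $text, string $separator): array
{
    return array_values(array_filter(array_map('trim', explode($separator, $text)), 'strlen'));
}

function parse_organisations(string $html, string $base = 'https://europa.eu'): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    // Same expressions as the Perl config, made relative with "." and wrapped so
    // that evaluate() returns a plain string.
    $fields = [
        'internal_url' => 'string(.//a/@href)',
        'external_url' => 'normalize-space(.//i[@class="fa fa-external-link fa-lg"]/parent::p//a/@href)',
        'title'        => 'normalize-space(.//h4)',
        'topics'       => 'normalize-space(.//div[@class="org_cord"])',
        'location'     => 'normalize-space(.//i[@class="fa fa-location-arrow fa-lg"]/parent::p)',
        'hand'         => 'normalize-space(.//i[@class="fa fa-hand-o-right fa-lg"]/parent::p)',
        'pic_number'   => 'normalize-space(.//p[contains(., "PIC no")])',
    ];

    $records = [];
    $blocks = $xpath->query('//div[contains(concat(" ", normalize-space(@class), " "), " ey_inner_block ")]');
    foreach ($blocks as $block) {
        $row = [];
        foreach ($fields as $name => $expression) {
            $row[$name] = $xpath->evaluate($expression, $block); // second argument = context node
        }
        // Post-processing guessed from the handler names in the Perl config.
        if (strpos($row['internal_url'], '/') === 0) {
            $row['internal_url'] = $base . $row['internal_url'];
        }
        $row['topics']     = split_list(strip_label($row['topics']), ';');
        $row['hand']       = split_list($row['hand'], ',');
        $row['pic_number'] = strip_label($row['pic_number']);
        $records[] = $row;
    }
    return $records;
}

// Cut-down copy of the block quoted earlier in the thread.
$sample = <<<'HTML'
<div class="ey_inner_block">
  <h4 class="text-center"><a href="/youth/volunteering/organisation/948417016_en" target="_blank">&quot;Academy for Peace and Development&quot; Union</a></h4>
  <div class="org_cord"><strong>Topics: </strong>Access for disadvantaged; Youth (Participation, Youth Work, Youth Policy)</div>
  <p class="ey_info"><i class="fa fa-location-arrow fa-lg"></i> Tbilisi, <strong>Georgia</strong></p>
  <p class="ey_info"><i class="fa fa-hand-o-right fa-lg"></i> Receiving, Sending</p>
  <p class="ey_info"><i class="fa fa-external-link fa-lg"></i><span> <a href="http://www.apd.ge" target="_blank">www.apd.ge</a></span></p>
  <p><strong>PIC no:</strong> 948417016</p>
</div>
HTML;

$orgs = parse_organisations($sample);
print_r($orgs);
```

Keeping the field-to-expression map in one array mirrors the Perl config, so adding or changing a field stays a one-line edit.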
