Making this local link crawler more efficient

Thread: Making this local link crawler more efficient

  1. #1
    Member
    Join Date
    Oct 2016
    Posts
    61

    Making this local link crawler more efficient

    Hi there everyone!

    I'm building a sitemap generator for a site with a BUNCH of links. The snippet was working very efficiently until I added the ability to check to see if the file was an image. Now it's super slow.

    I'm only using this script locally on the site it's crawling, so I'm wondering if there's a better way to handle this that would speed it back up and use fewer resources. I could check for an image extension at the end of the URL, but everything I found while googling has people in the know bad-mouthing that as a poor solution.

    Here's my current attempt:

    PHP Code:
    <?php

    $site_domain = 'https://wheeltastic.com';

    function isImage($url){
        $params = array('http' => array(
            'method' => 'HEAD'
        ));
        $ctx = stream_context_create($params);
        $fp  = @fopen($url, 'rb', false, $ctx);
        if (!$fp){
            return false;  // Problem with url
        }
        $meta = stream_get_meta_data($fp);
        if ($meta === false){
            fclose($fp);
            return false;  // Problem reading data from url
        }

        $wrapper_data = $meta["wrapper_data"];
        if(is_array($wrapper_data)){
            foreach(array_keys($wrapper_data) as $hh){
                if (substr($wrapper_data[$hh], 0, 19) == "Content-Type: image"){  // strlen("Content-Type: image") == 19
                    fclose($fp);
                    return true;
                }
            }
        }

        fclose($fp);
        return false;
    }

    $options = array('http' => array('user_agent' => 'Wheelie-Bot / Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'));
    $context = stream_context_create($options);
    $html    = file_get_contents('https://wheeltastic.com', false, $context);

    $dom = new DOMDocument();
    @$dom->loadHTML($html);

    // grab all the links on the page
    $xpath = new DOMXPath($dom);
    $hrefs = $xpath->evaluate("/html/body//a");

    for ($i = 0; $i < $hrefs->length; $i++) {
        $href = $hrefs->item($i);
        $url  = $href->getAttribute('href');

        if(!isImage($site_domain.$url) AND substr($url, 0, 1) === "/" AND $url != '/'){
            echo $url.'<br />';
        }
    }

    ?>
    Could anyone tell me if there's a more efficient/faster method of checking the local link to see if it's an image?

    Thanks for your time!

  2. #2
    Member
    Join Date
    Oct 2016
    Posts
    61
    I've managed to make it quite a bit faster by using getimagesize(), but I had to suppress the errors it throws on non-images:

    PHP Code:
    <?php

    $site_domain = 'https://wheeltastic.com';

    function is_image($path){
        $a = getimagesize($path);
        $image_type = $a[2];

        if(in_array($image_type, array(IMAGETYPE_GIF, IMAGETYPE_JPEG, IMAGETYPE_PNG, IMAGETYPE_BMP))){
            return true;
        }
        return false;
    }

    $options = array('http' => array('user_agent' => 'Wheelie-Bot / Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'));
    $context = stream_context_create($options);
    $html    = file_get_contents('https://wheeltastic.com', false, $context);

    $dom = new DOMDocument();
    @$dom->loadHTML($html);

    // grab all the links on the page
    $xpath = new DOMXPath($dom);
    $hrefs = $xpath->evaluate("/html/body//a");

    for ($i = 0; $i < $hrefs->length; $i++) {
        $href = $hrefs->item($i);
        $url  = $href->getAttribute('href');

        if(!@is_image('.'.$url) AND substr($url, 0, 1) === "/" AND $url != '/'){
            echo $url.'<br />';
        }
    }

    ?>

  3. #3
    Member
    Join Date
    Oct 2016
    Posts
    61
    Alright, I've made a modified version and it seems to be suffering as the list of links grows.

    PHP Code:
    $site_domain = 'https://wheeltastic.com';
    $tocrawl_array[0] = '/';
    $crawled_array = array();

    function is_image($path){
        $a = getimagesize($path);
        $image_type = $a[2];

        if(in_array($image_type, array(IMAGETYPE_GIF, IMAGETYPE_JPEG, IMAGETYPE_PNG, IMAGETYPE_BMP))){
            return true;
        }
        return false;
    }

    $cd = 0;
    $tc = 0;
    $ii = 0;

    while(array_key_exists($ii, $tocrawl_array)){

        $crawl = $tocrawl_array[$ii];
        if(!in_array($crawl, $crawled_array)){

            $options = array('http' => array('user_agent' => 'Wheelie-Bot / Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'));
            $context = stream_context_create($options);
            $html    = file_get_contents($site_domain.$crawl, false, $context);
            $crawled_array[$cd] = $crawl;
            $cd++;

            $dom = new DOMDocument();
            @$dom->loadHTML($html);

            // grab all the links on the page
            $xpath = new DOMXPath($dom);
            $hrefs = $xpath->evaluate("/html/body//a");

            for ($i = 0; $i < $hrefs->length; $i++) {
                $href = $hrefs->item($i);
                $url  = $href->getAttribute('href');

                if(!@is_image('.'.$url) AND substr($url, 0, 1) === "/" AND !in_array($url, $crawled_array) AND !in_array($url, $tocrawl_array)){

                    $tc++;
                    $tocrawl_array[$tc] = $url;

                    if($tc % 25 == 0) {
                        echo $tc.' records processed<br>';
                    }

                    if($tc == 1000){
                        print_r($crawled_array);
                        exit();
                    }

                }
            }
        }
        $ii++;
    }

    In another topic that I started here, it was mentioned that large arrays could cause problems with memory. Is that perhaps what's happening here? Would I be better off saving the arrays to the database every 1,000 links or so, refreshing the page, grabbing the arrays, doing another 1,000, and rinsing and repeating until finished?

    Is there another part to this code that raises red flags in regards to efficiency?

  4. #4
    Settled 4 red convertible dalecosp's Avatar
    Join Date
    Jul 2002
    Location
    Accelerating Windows at 9.81 m/s....
    Posts
    8,483
    Well, what I'm curious about is why you're using HTTP to traverse "your own" site. If it's DB-driven, you should be able to parse all the content without HTTP calls?? (I suppose by extension you could parse files from the local filesystem also if the DB didn't contain the site content ...)

    HTTP calls are expensive ... even if it's "local"....
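
    For the static-file portion of the site, the idea could look roughly like this; just a sketch, assuming the pages live under a docroot such as /home/wheeltastic/public_html (DB-driven and query-string URLs would still have to come from the database or the router):

    PHP Code:
    <?php

    // Rough sketch only: list site-relative URLs for static .html files
    // by walking the local filesystem instead of making HTTP requests.
    // $docroot is an assumption -- point it at the real document root.
    $docroot = '/home/wheeltastic/public_html';

    $it = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($docroot, FilesystemIterator::SKIP_DOTS)
    );

    foreach ($it as $file) {
        // only interested in HTML documents for a sitemap
        if (strtolower($file->getExtension()) !== 'html') {
            continue;
        }
        // turn the filesystem path back into a site-relative URL
        echo substr($file->getPathname(), strlen($docroot)), '<br>';
    }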
    /!!\ mysql_ is deprecated --- don't use it! Tell your hosting company you will switch if they don't upgrade! /!!!\ ereg() is deprecated --- don't use it!

    dalecosp "God doesn't play dice." --- Einstein "Perl is hardly a paragon of beautiful syntax." --- Weedpacket

    Getting Help at All --- Collected Solutions to Common Problems --- Debugging 101 --- Unanswered Posts --- OMBE: Office Machines, Business Equipment

  5. #5
    Member
    Join Date
    Oct 2016
    Posts
    61
    Hi there Dale and thanks very much for the help.

    Some are htaccess-enabled paths (/stuff-from-db), others are GET variables (/?action=contact), and yet others are static files. I couldn't think of an easy way to wrangle all the different types together, and before I invested 11 hours in modifying this, I wasn't aware it was going to work so poorly. My bash script on my computer (remote retrieval with wget) works much better than the PHP script, so I really thought I could get this to work better than its current form.

    I've also tried some of the online sitemap-generator offerings and they did a pretty good job speed-wise, so I'm going to try to make mine a bit less of a pig. I'm pretty sure I've done some bad things in this script.

  6. #6
    Pedantic Curmudgeon Weedpacket's Avatar
    Join Date
    Aug 2002
    Location
    General Contact Unit "Coping Mechanism"
    Posts
    22,489
    One quick improvement would be to shift the is_image test to after the others, since that's the expensive one.

    Starting the loop with a check to see if the URL you're about to crawl has already been crawled is redundant, because you checked that before adding it in the first place.

    Another would be to store the URLs as sets: use them as the keys of $crawled_array and $tocrawl_array, with an irrelevant (though non-null) value. isset($array[$url]) would be faster than in_array($url, $array).
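
    A tiny illustration of the difference (URLs made up):

    Code:
    // URLs stored as keys: membership check is a hash lookup, not a scan.
    $crawled_array = array();
    $crawled_array['/about'] = true;        // mark as crawled

    if (isset($crawled_array['/about'])) {
        // already crawled -- skip it
    }

    // versus scanning every element each time:
    // in_array('/about', $crawled_urls_list)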

    Yet another observation: you never remove anything from $tocrawl_array: that's going to take ever longer to search as well, for stuff that would almost always already be in $crawled_array. Running it as a stack or queue (depending on whether you want to go depth-first or breadth-first) would keep its size down:
    Code:
    $tocrawl_array = [Starting URL];
    $crawled_array = [];
    while(!empty($tocrawl_array))
    {
    	$crawl = array_pop($tocrawl_array);
    	get DOM from document at crawl;
    	make $urls_found a list of links in current page
    
    	// Filter out those already crawled
    	$urls_found = array_keys(array_diff_key(array_flip($urls_found), $crawled_array));
    	// Keep only those that are interesting enough to follow further
    	$urls_found = array_diff(array_filter($urls_found, '...interesting urls only...'), $tocrawl_array);
    	$crawled_array[$crawl] = true;
    	$tocrawl_array = array_merge($tocrawl_array, $urls_found);
    }
    Finally, since you are only interested in HTML documents, having a check to see if what you fetched was an image doesn't seem useful anyway; especially since there are many many other kinds of file than just "HTML document" and "JPEG/GIF/TIFF/PNG/BMP image". So it would make more sense to see if what you've fetched is an HTML document, and discard it if not.
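
    One way to make that "is it HTML?" check, as a sketch (the URL is only a placeholder): file_get_contents() fills in $http_response_header when it goes through the HTTP wrapper, so the Content-Type header can be inspected before bothering to parse:

    Code:
    <?php
    // Sketch: keep a fetched document only if the server calls it HTML.
    $html = @file_get_contents('https://wheeltastic.com/some-page');

    $is_html = false;
    if ($html !== false && isset($http_response_header)) {
        foreach ($http_response_header as $header) {
            if (stripos($header, 'Content-Type:') === 0
                && stripos($header, 'text/html') !== false) {
                $is_html = true;
                break;
            }
        }
    }

    if (!$is_html) {
        // not an HTML document -- discard instead of parsing
    }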
    THERE IS AS YET INSUFFICIENT DATA FOR A MEANINGFUL ANSWER
    FAQs! FAQs! FAQs! Most forums have them!
    Search - Debugging 101 - Collected Solutions - General Guidelines - Getting help at all

  7. #7
    Member
    Join Date
    Oct 2016
    Posts
    61
    Hi there weedpacket and thanks a bunch for your help. I think I've got the basic implementation of your fantastic alteration operating, and my one big issue is that I can't find a way to test if(is_html) rather than checking for is_image. I keep googling my way to "check if the file exists", which doesn't distinguish between the types. Is there a particular method of determining this that would work well for my needs? For now, I'm just using a simple extension check to handle it.

    Here's my latest effort:

    PHP Code:
    function is_valid($path){

        if(substr($path, 0, 1) != "/"){
            return false;
        }

        $supported_image = array(
            'gif',
            'jpg',
            'jpeg',
            'png'
        );

        $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
        if (in_array($ext, $supported_image)) {
            return false;
        }

        return true;
    }

    $site_domain = 'https://wheeltastic.com';
    $tocrawl_array[] = '/';
    $crawled_array = [];
    $ii = 0;
    while(!empty($tocrawl_array))
    {
        $crawl   = array_pop($tocrawl_array);
        $options = array('http' => array('user_agent' => 'Wheelie-Bot / Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'));
        $context = stream_context_create($options);
        echo $site_domain.$crawl.'<br>';
        $html = file_get_contents($site_domain.$crawl, false, $context);
        $dom  = new DOMDocument();
        @$dom->loadHTML($html);
        $xpath = new DOMXPath($dom);
        $hrefs = $xpath->evaluate("/html/body//a");

        $urls_found = array();  // links found on this page only
        for ($i = 0; $i < $hrefs->length; $i++){
            $href = $hrefs->item($i);
            $url  = $href->getAttribute('href');
            $urls_found[$url] = $url;
        }

        // Filter out those already crawled
        $urls_found = array_keys(array_diff_key(array_flip($urls_found), $crawled_array));
        // Keep only those that are interesting enough to follow further
        $urls_found = array_diff(array_filter($urls_found, 'is_valid'), $tocrawl_array);
        $crawled_array[$crawl] = true;
        $tocrawl_array = array_merge($tocrawl_array, $urls_found);

        $ii++;
        if($ii % 25 == 0){
            echo $ii.' links processed<br>';
        }

        if($ii == 1000){
            print_r($crawled_array);
            exit();
        }
    }

    Last edited by schwim2; 03-11-2017 at 10:40 PM.

  8. #8
    Senior Member Derokorian's Avatar
    Join Date
    Apr 2011
    Location
    Denver
    Posts
    2,261
    You should look at using curl instead. Then you can use curl_multi_exec and friends, meaning you can be processing a response while other requests are continuing, instead of only ever requesting one at a time as you are now.
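
    A rough sketch of the shape of that (the URLs, options, and batch size here are only illustrative, not a drop-in for the script above):

    PHP Code:
    <?php
    // Sketch: fetch a batch of pages in parallel with curl_multi.
    $urls = array('https://wheeltastic.com/', 'https://wheeltastic.com/?action=contact');

    $mh = curl_multi_init();
    $handles = array();

    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_USERAGENT, 'Wheelie-Bot');
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // drive all the transfers until every one has finished
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running) {
            curl_multi_select($mh);   // wait for activity instead of spinning
        }
    } while ($running && $status == CURLM_OK);

    // collect the response bodies and clean up
    $pages = array();
    foreach ($handles as $url => $ch) {
        $pages[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);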
    Sadly, nobody codes for anyone on this forum. People taste your dishes and tell you what is missing, but they don't cook for you. ~anoopmail
    I'd rather be a comma, then a full stop.
    User Authentication in PHP with MySQLi - Don't forget to mark threads resolved - MySQL(i) warning

  9. #9
    Pedantic Curmudgeon Weedpacket's Avatar
    Join Date
    Aug 2002
    Location
    General Contact Unit "Coping Mechanism"
    Posts
    22,489
    Quote Originally Posted by schwim2
    if(is_html) rather than checking for is_image.
    Well, if the DOM fails to parse it as HTML...
    THERE IS AS YET INSUFFICIENT DATA FOR A MEANINGFUL ANSWER
    FAQs! FAQs! FAQs! Most forums have them!
    Search - Debugging 101 - Collected Solutions - General Guidelines - Getting help at all

  10. #10
    Member
    Join Date
    Oct 2016
    Posts
    61
    Quote Originally Posted by Weedpacket View Post
    Well, if the DOM fails to parse it as HTML...
    I tried:

    PHP Code:
    if($dom->loadHTML($html)){ 
    But that resulted in errors:

    Warning: DOMDocument::loadHTML(): htmlParseEntityRef: no name in Entity, line: 215 in /home/wheeltastic/public_html/crawler.php on line 25

    Warning: DOMDocument::loadHTML(): htmlParseEntityRef: no name in Entity, line: 224 in /home/wheeltastic/public_html/crawler.php on line 25

    Warning: DOMDocument::loadHTML(): Tag nav invalid in Entity, line: 342 in /home/wheeltastic/public_html/crawler.php on line 25
    Am I on the correct path on how to check the DOM for failure?

  11. #11
    Member
    Join Date
    Oct 2016
    Posts
    61
    Quote Originally Posted by Derokorian View Post
    You should look at using curl instead. Then you can use curl_multi_exec and friends, meaning you can be processing a response while other requests are continuing, instead of only ever requesting one at a time as you are now.
    I've never once dealt with it and it looks kind of daunting (there seem to be a lot of reports of certain uses of it causing issues with CPU resources). Would something like this be a possible candidate for implementation, or do I need to look for something particular to my needs? I'm at such a loss concerning its usage that I'm not sure what to google.

  12. #12
    Pedantic Curmudgeon Weedpacket's Avatar
    Join Date
    Aug 2002
    Location
    General Contact Unit "Coping Mechanism"
    Posts
    22,489
    Quote Originally Posted by schwim2
    Am I on the correct path on how to check the DOM for failure?
    Quote Originally Posted by DOMDocument::loadHTML
    Since PHP 5.4.0 and Libxml 2.6.0, you may also use the options parameter to specify additional Libxml parameters. ... While malformed HTML should load successfully, this function may generate E_WARNING errors when it encounters bad markup. libxml's error handling functions may be used to handle these errors.
    PHP Code:
    <?php

    $document = new DOMDocument();

    $use_libxml_errors = libxml_use_internal_errors(true);
    $loaded = $document->loadHTMLFile('007845-3.png');
    if($loaded && count(libxml_get_errors()) == 0)
    {
        echo "Document loaded";
    }
    else
    {
        echo "BÝrk";
    }
    libxml_use_internal_errors($use_libxml_errors);
    THERE IS AS YET INSUFFICIENT DATA FOR A MEANINGFUL ANSWER
    FAQs! FAQs! FAQs! Most forums have them!
    Search - Debugging 101 - Collected Solutions - General Guidelines - Getting help at all

  13. #13
    Senior Member Derokorian's Avatar
    Join Date
    Apr 2011
    Location
    Denver
    Posts
    2,261
    Quote Originally Posted by schwim2 View Post
    I've never once dealt with it and it looks kind of daunting (there seem to be a lot of reports of certain uses of it causing issues with CPU resources). Would something like this be a possible candidate for implementation, or do I need to look for something particular to my needs? I'm at such a loss concerning its usage that I'm not sure what to google.
    I dunno about this gist, it might be useful - but what I'm suggesting is making multiple curl calls as you know them. So this would just be a replacement for file_get_contents such that multiple calls are happening at once. I'll see if I can find a script I wrote that might be helpful to you.
    Sadly, nobody codes for anyone on this forum. People taste your dishes and tell you what is missing, but they don't cook for you. ~anoopmail
    I'd rather be a comma, then a full stop.
    User Authentication in PHP with MySQLi - Don't forget to mark threads resolved - MySQL(i) warning

  14. #14
    Member
    Join Date
    Oct 2016
    Posts
    61
    Quote Originally Posted by Derokorian View Post
    I dunno about this gist, it might be useful - but what I'm suggesting is making multiple curl calls as you know them. So this would just be a replacement for file_get_contents such that multiple calls are happening at once. I'll see if I can find a script I wrote that might be helpful to you.
    I would really appreciate that. So far, I'm not succeeding in moving the system over using stuff I've found on the web.
