One quick improvement would be to shift the [font=monospace]is_image[/font] test to after the others, since that's the expensive one: with the cheap tests first, a URL can be rejected before the expensive one ever runs.
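Roughly like this (a sketch only: the exact condition in your code will differ, and [font=monospace]is_interesting_url[/font] is just a hypothetical stand-in for your other, cheaper checks):

// With &&, evaluation stops at the first test that fails,
// so put the cheap tests before the expensive is_image() call.
if (!in_array($url, $crawled_array)   // cheap-ish membership test
    && is_interesting_url($url)       // hypothetical cheap filter
    && !is_image($url))               // expensive test runs only if the rest pass
{
    $tocrawl_array[] = $url;
}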
Starting the loop with a check to see if the URL you're about to crawl has already been crawled is redundant, because you checked that before adding it in the first place.
Another would be to store the URLs as sets: make them the keys of [font=monospace]$crawled_array[/font] and [font=monospace]$tocrawl_array[/font], with an irrelevant (though non-null) value. [font=monospace]isset($array[$url])[/font] is a hash lookup and is much faster than [font=monospace]in_array($url, $array)[/font], which scans the whole array.
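To make that concrete, a small sketch (the example URLs are placeholders):

// Plain list: in_array() has to scan every element on each lookup
$crawled_list = ['http://example.com/', 'http://example.com/about'];
$seen = in_array($url, $crawled_list);            // slows down as the list grows

// Set: the URL is the key, the value is an irrelevant (but non-null) marker
$crawled_array = ['http://example.com/' => true, 'http://example.com/about' => true];
$seen = isset($crawled_array[$url]);              // hash lookup, effectively constant time
$crawled_array[$new_url] = true;                  // "add to the set"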
Yet another observation: you never remove anything from [font=monospace]$tocrawl_array[/font], so it too takes ever longer to search, mostly over URLs that are already in [font=monospace]$crawled_array[/font]. Running it as a stack or queue (depending on whether you want to go depth-first or breadth-first) keeps its size down:
$tocrawl_array = [$starting_url];   // seed with the starting URL
$crawled_array = [];
while (!empty($tocrawl_array))
{
    // Take the next URL off the end of the list (a stack, so depth-first)
    $crawl = array_pop($tocrawl_array);

    // ... fetch the document at $crawl, build its DOM, and collect the
    // links it contains into $urls_found, just as you do now ...

    // Filter out those already crawled ($crawled_array is keyed by URL)
    $urls_found = array_keys(array_diff_key(array_flip($urls_found), $crawled_array));
    // Keep only those that are interesting enough to follow further
    // and that aren't already queued
    $urls_found = array_diff(array_filter($urls_found, '...interesting urls only...'), $tocrawl_array);

    $crawled_array[$crawl] = true;
    $tocrawl_array = array_merge($tocrawl_array, $urls_found);
}
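As written, [font=monospace]array_pop[/font] takes URLs from the end of the list, so the crawl is depth-first; to go breadth-first instead, take from the front:

$crawl = array_shift($tocrawl_array);   // queue behaviour: oldest URL first (breadth-first)

([font=monospace]array_shift[/font] reindexes the array on every call, so for a very long queue something like [font=monospace]SplQueue[/font] would be cheaper, but for a sketch like this it doesn't matter.)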
Finally, since you are only interested in HTML documents, a check for whether what you fetched is an image isn't very useful anyway: there are many more kinds of file than just "HTML document" and "JPEG/GIF/TIFF/PNG/BMP image". It would make more sense to check whether what you've fetched is an HTML document, and discard it if not.
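For example, if you happen to fetch pages with cURL (an assumption; adapt this to however you actually fetch them), the Content-Type of the response tells you whether it's worth handing to the DOM parser:

// Sketch only, assuming cURL; runs inside the crawl loop from above.
$ch = curl_init($crawl);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$body = curl_exec($ch);
$type = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);   // e.g. "text/html; charset=UTF-8"
curl_close($ch);

if ($body === false || stripos((string)$type, 'text/html') === false) {
    // Not an HTML document (image, PDF, feed, whatever): skip it
    continue;
}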