PHP Web Crawler

joeiscoolone

Where would you store a lexicon for your search engine, and how would your search engine access it? By lexicon I mean a word list for spelling corrections. How can you record how many searches have been done on your search engine? How do you show how long a search took?

devinemke

joeiscoolone wrote:
Where would you store a lexicon for your search engine

in a database

joeiscoolone wrote:
and how would your search engine access it?

connect to the database, query the table
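
As a rough sketch of what that could look like, assuming a hypothetical lexicon table with one word per row: the example uses an in-memory SQLite database through PDO only so that it is self-contained, and any database would do. The suggestSpelling() name and the distance threshold are made up for illustration; levenshtein() is PHP's built-in edit-distance function.

```php
<?php
// Hypothetical schema: one known word per row.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE lexicon (word TEXT PRIMARY KEY)');

$insert = $db->prepare('INSERT INTO lexicon (word) VALUES (?)');
foreach (array('crawler', 'search', 'engine', 'spider') as $word) {
    $insert->execute(array($word));
}

// Returns the closest known word, or null if nothing is within $maxDistance edits.
function suggestSpelling(PDO $db, $term, $maxDistance = 2)
{
    $term = strtolower($term);
    $best = null;
    $bestDistance = $maxDistance + 1;
    foreach ($db->query('SELECT word FROM lexicon') as $row) {
        $distance = levenshtein($term, $row['word']);
        if ($distance < $bestDistance) {
            $best = $row['word'];
            $bestDistance = $distance;
        }
    }
    return $best;
}

echo suggestSpelling($db, 'crawlr'), "\n";
```

Scanning every row like this is only reasonable for a small word list; a larger lexicon would need some way of narrowing the candidates first.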

joeiscoolone wrote:
How can you record how many searches have been done on your search engine?

store each search in the database
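
A minimal sketch of that, assuming a hypothetical search_log table and, again, an in-memory SQLite database through PDO purely to keep the example self-contained. The recordSearch() and countSearches() names are illustrative, not from any library.

```php
<?php
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
// Hypothetical schema: one row per search performed.
$db->exec('CREATE TABLE search_log (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    term TEXT NOT NULL,
    searched_at INTEGER NOT NULL
)');

// Call this once each time a search is run.
function recordSearch(PDO $db, $term)
{
    $stmt = $db->prepare('INSERT INTO search_log (term, searched_at) VALUES (?, ?)');
    $stmt->execute(array($term, time()));
}

// Total number of searches recorded so far.
function countSearches(PDO $db)
{
    return (int) $db->query('SELECT COUNT(*) FROM search_log')->fetchColumn();
}

recordSearch($db, 'php web crawler');
recordSearch($db, 'relative urls');
echo countSearches($db), " searches so far\n";
```

Keeping the term and a timestamp, rather than a bare counter, also lets you see later what people searched for and when.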

joeiscoolone wrote:
How do you show how long a search took?

grab a microtime() before the search and another one after and compare
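
For illustration, a small sketch of that timing approach. runSearch() is a hypothetical stand-in for whatever your real search routine is; passing true to microtime() makes it return a float number of seconds, which is easy to subtract.

```php
<?php
// Hypothetical stand-in for the real search routine.
function runSearch($term)
{
    usleep(20000); // pretend the search takes about 20 ms
    return array("result for $term");
}

$start = microtime(true);            // float seconds since the Unix epoch
$results = runSearch('php crawler');
$elapsed = microtime(true) - $start; // seconds spent in the search

printf("Found %d result(s) in %.3f seconds\n", count($results), $elapsed);
```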

MarkR

Joe, I don't mean to discourage you, but so far all of your posts regarding PHP web crawlers have suggested to me that your level of programming ability is probably insufficient to complete even the least capable and most simplistic web crawler / search engine.

I have many years of commercial PHP (and more of many other languages) experience, and I consider myself underqualified for something of this magnitude. I have attempted to create a web crawler (not for indexing purposes though) and found it nontrivial (see my previous posts on the topic). The web crawling is only a small part of the problem.

Perhaps you should restrict your experiments to something slightly simpler, at least to start with?

What is your remit? What kind of funding do you have for this project? Perhaps it would be easier and more cost-effective to license someone else's technology?

Cheers

Mark

joeiscoolone

MarkR, my PHP skills are developing at a steady pace, and while I may not have the skill now, I soon may. The post above was actually about a search engine that searches your own site, not one that crawls the web; I'm just so used to titling my posts PHP Web Crawler. I have found sites that make it very easy to build a search engine that searches your own site. As far as the web crawler goes, I do appreciate how hard it is to make one, but it's hard for me to begin if I have nowhere to start. That was the purpose of my earlier posts: I was asking for someone to show me a block of code and say, OK, this does that and this does this, so I at least had an idea of where to start. Even if it's very complicated, there has to be specific syntax to start with.

Weedpacket

But there's no point showing you any syntax if you don't know how to use it. In fact, I'm guessing that you're not aware of what "syntax" actually means.

As far as crawling your own site goes ... it would be a lot simpler if you already knew what was on your site; generating its content from a database for example would mean that you already have all of the content available in the database - no need to search it.

joeiscoolone

Actually, I do know what syntax means, and the point of showing it to me would be to point out what it does. So yes, there would be a point. It's hard to start something when you have nowhere to start; all I hear is how complicated it is. OK, fine, it's complicated, but it can still be done. I don't ask questions just so people can tell me how hard it is; I ask questions about how to start. Even if there are a lot of different ways to make a PHP Web Crawler, there has to be some base to start from.

MarkR

joeiscoolone wrote:
Even if there are a lot of different ways to make a PHP Web Crawler, there has to be some base to start from.

There is no "base code" to start from. Web crawlers are very complicated. If, for the sake of argument, you wanted to just do a single-threaded crawl of a single web site, you'd need to:

Be able to fetch web pages over HTTP
Be able to parse the HTML for links
Be able to determine the locations of new pages based on relative links
Be able to ignore types of links you weren't interested in, or were malformed in some way.
Be able to remember where you'd been so you didn't go there again
Have some kind of data structure to remember which pages have been fetched, when, and their contents.

None of these things is individually trivial, although some are easier than others.
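
To illustrate just the bookkeeping in that list (a queue of pages still to visit plus a record of where you have already been), here is a rough sketch. It is not a working crawler: $site is a made-up in-memory stand-in for fetching pages over HTTP and parsing out their links, and it already contains absolute URLs, so the hard parts of the list are deliberately skipped. The crawl() name is hypothetical.

```php
<?php
// Made-up stand-in for "fetch the page and extract its links":
// maps each URL to the absolute URLs it links to.
$site = array(
    'http://www.example.com/'  => array('http://www.example.com/a', 'http://www.example.com/b'),
    'http://www.example.com/a' => array('http://www.example.com/', 'http://www.example.com/b'),
    'http://www.example.com/b' => array('http://www.example.com/c', 'http://other.example.org/'),
    'http://www.example.com/c' => array(),
);

function crawl(array $site, $startUrl)
{
    $host = parse_url($startUrl, PHP_URL_HOST);
    $queue = array($startUrl); // pages still to visit
    $visited = array();        // url => true, so lookups are cheap

    while (count($queue) > 0) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue;          // already been here
        }
        $visited[$url] = true;

        $links = isset($site[$url]) ? $site[$url] : array();
        foreach ($links as $link) {
            // Stay on the starting host and skip pages already seen.
            if (parse_url($link, PHP_URL_HOST) === $host && !isset($visited[$link])) {
                $queue[] = $link;
            }
        }
    }
    return array_keys($visited);
}

print_r(crawl($site, 'http://www.example.com/'));
```

In a real spider the array lookup would be replaced by an HTTP fetch plus link extraction, and the visited record would usually live in a database rather than in memory.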

I can post my "resolving relative links" routine if you like.

Mark

joeiscoolone

Thank you, MarkR. If you posted that routine, that would at least give me something to work with. I understand making a web crawler is very complicated, but it's something I want to do, so I am going to try.

MarkR

Here are some routines I used in my spider:


function HasTrailingSlash(& $str)
{
	if ($str == '') { return false; }
	return ($str[strlen($str) - 1] == '/');
}

function FixPath(& $path)
{
	$bits = explode('/', $path);
	$newbits = array();
	foreach ($bits as $bit) {
		// Ignore empty bits, and '.'
		if (($bit == '') || ($bit =='.')) {
			continue;
		}
		// If we see '..', chop the previous bit off.
		if ($bit == '..') {
			$numbits = count($newbits);
			if ($numbits > 0) {
				unset($newbits[count($newbits) -1 ]);
			}
		} else {
			// Otherwise, add it.
			$newbits[] = $bit;
		}
	}
	// Stick them back together
	$path = '/' . implode('/', $newbits);
}

// Does the inverse of parse_url
// Ignores username and password.
// Ignores fragment.
function ConstructUrl($bits)
{
	// Check we actually have a scheme...
	assert(isset($bits['scheme']));
	$url = $bits['scheme'] . '://';
	if (isset($bits['host'])) {
		$url = $url .  $bits['host'];
	}
	if (isset($bits['port'])) {
		$url = $url . ':' . $bits['port'];
	}
	// Add path (mandatory).
	if (! isset($bits['path'])) {
		$bits['path'] = '/';
	}
	FixPath($bits['path']);
	$url = $url . $bits['path'];
	if (isset($bits['query'])) {
		$url = $url . '?' . $bits['query'];
	}
	return $url;

}

/*
 * This is a work-around for the fact that parse_url fails
 * for URLs like
 * /blah?add=http://www.example.com
 *
 */
function MyParseUrl($url)
{
	$qpos = strpos($url, '?');
	if ($qpos !== FALSE) {
		$baseurl = substr($url, 0, $qpos);
		$bits = parse_url($baseurl);
		$bits['query'] = substr($url, $qpos+1);
		return $bits;
	} else {
		return parse_url($url);
	}
}

function IsAbsoluteUrl($url)
{
	return (preg_match('/^[a-z]+:/i', $url));
}

function FindRelativeUrl($url, $relative)
{
	// If it's an absolute URL already, return it unchanged.
	if (IsAbsoluteUrl($relative)) {
		return $relative;
	}
	// chop the filename off the original url
	$original_url_bits = parse_url($url);
	if (! isset($original_url_bits['path'])) {
		$original_url_bits['path'] = '';
	}
	// If path *does not* end in a /, use dirname on it.
	$path = $original_url_bits['path'];
	if (! HasTrailingSlash($path)) {
		$path = dirname($path); 
		// If we don't have a trailing slash, add one.
		if (! HasTrailingSlash($path)) {
			$path = $path . '/';
		}

	}
	try {
		$rel_url_bits = MyParseUrl($relative);
	} catch (Exception $e) {
		echo "WARNING: found a duff relative URL that we can't parse.\n";
		echo "Relative URL:$relative Linked from:$url\n";
		return "malformed:";
	}
	// If it has no path at all, then we have to assume it's something 
	// like an anchor, or invalid. Return the original URL.
	if (! isset($rel_url_bits['path'])  || ($rel_url_bits['path'] == '')) {
		$rel_url_bits['path'] = $original_url_bits['path'];
	}

	// If it has an absolute path, use that
	$new_url_bits = $original_url_bits;
	if (substr($rel_url_bits['path'],0,1) == '/') {
		$new_url_bits['path'] = $rel_url_bits['path'];
	} else {
		// Otherwise, stick it on to the original path.
		$new_url_bits['path'] = $path . $rel_url_bits['path'];
	}
	// Blank the previous query string
	unset($new_url_bits['query']);
	// If set, use the new one.
	if (isset($rel_url_bits['query'])) {
		$new_url_bits['query'] = $rel_url_bits['query'];
	}

	return ConstructUrl($new_url_bits);
}

function StrStartsWith(& $str, $start)
{
	return (substr($str, 0, strlen($start)) == $start);
}

function IsValidUrl($url)
{
	try {
		// MAX_URL_LENGTH is not defined in this excerpt; it must be define()d elsewhere.
		if (strlen($url) > MAX_URL_LENGTH) {
			return false;
		}
		$bits = MyParseUrl($url);
		// Check host for validity.
		if (! isset($bits['host'])) {
			return false;
		}
		$host = $bits['host'];
		// NOTE: we require at least one dot in the hostname.
		return preg_match('/^[a-z0-9\.\-]+\.[a-z]+$/i', $host);
	} catch (Exception $e) {
		echo "Really broken URL has caused an exception in IsValidUrl\n";
		echo $e . "\n";
		return false;
	}
}

function CanonicaliseUrl($url)
{
	try {
		$bits = MyParseUrl($url);
		if (isset($bits['host'])) {
			$bits['host'] = strtolower($bits['host']);
		}
		if (! isset($bits['path'])) {
			$bits['path'] = '/';
		}
		return ConstructUrl($bits);
	} catch (Exception $e) {
		echo "Cannot canonicalise URL: $url\n";
		return FALSE;
	}
}

I want to stress that this is a VERY SMALL part of a VERY COMPLICATED application. The above contains one of the key routines, FindRelativeUrl() which, given a base URL and relative URL, finds the new URL. This is not as easy as you may think as there are a lot of strange cases.

Here are some test cases for the above:

function TestUtils()
{
	echo FindRelativeUrl("http://www.example.com/testing/qwer", 'asdf') . "\n";
	echo FindRelativeUrl("http://www.example.com/testing/qwer", '/asdf') . "\n";
	echo FindRelativeUrl("http://www.example.com/testing/qwer", 'http://blah.example.com') . "\n";
	echo FindRelativeUrl("http://www.example.com/testing/qwer", '../asdf') . "\n";
	echo FindRelativeUrl("http://www.example.com/a/b/c/d", '././/../e') . "\n";
	echo FindRelativeUrl("http://www.example.com/a/b/c/d", 'test.html?x=42') . "\n";
	echo FindRelativeUrl("http://www.example.com/a/b/c/d", '?x=99') . "\n";
	echo FindRelativeUrl("http://www.example.com/a/b/c/d", '/blah?add=http://www.somewhere.com/') . "\n";
	echo FindRelativeUrl("http://www.example.com/a/b/c/d", '?add=http://www.example.com/blah') . "\n";
	echo FindRelativeUrl("http://www.example.com/a/b/c/d", 'javascript://') . "\n";
	echo "\n\n";
}

All of which it handled correctly last time I checked.

Mark

joeiscoolone

Thank you, MarkR, that gives me some idea. So would fopen() be where you would start? And how many different files would you use altogether, including everything that goes into the spider? I know that it depends, but what do you think would be a good estimate? And as you're crawling, how would you connect to the database and store the pages you're crawling? What would the code for parsing links look like? Again, I know it depends, but could you give some example code to give me some idea?

MarkR

joeiscoolone wrote:
Thank you, MarkR, that gives me some idea. So would fopen() be where you would start?

Use whatever HTTP method you like; but be prepared to revise this based on issues you discover (I did this at least once).
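
As one possible starting point, here is a sketch that uses PHP's built-in HTTP stream wrapper through file_get_contents(), which is the same mechanism fopen() uses for URLs and needs allow_url_fopen enabled. The fetchPage() name and the user-agent string are made up for illustration; cURL would be an equally reasonable choice.

```php
<?php
// Fetches a URL and returns the body as a string, or null on failure.
function fetchPage($url, $timeoutSeconds = 10)
{
    // Options for the HTTP wrapper; they are ignored for non-HTTP paths.
    $context = stream_context_create(array(
        'http' => array(
            'method'     => 'GET',
            'timeout'    => $timeoutSeconds,
            'user_agent' => 'ExampleSpider/0.1', // made-up name for illustration
        ),
    ));
    // @ suppresses the warning on failure; the false return is checked instead.
    $body = @file_get_contents($url, false, $context);
    return ($body === false) ? null : $body;
}
```

Calling fetchPage('http://www.example.com/') would return the page source as a string, or null if the request failed. This sketch ignores redirect policy, status codes, content types and robots.txt, all of which a real spider has to deal with.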

And how many different files would you use altogether, including everything that goes into the spider?

That's entirely a matter of taste. Mine had about 14, I think, including some non-production code (test harness, experiments etc).

And as you're crawling, how would you connect to the database and store the pages you're crawling?

Use whatever database method and database you're happy with - but again, be prepared to revise this. I started on SQLite, then changed to MySQL/InnoDB, finally switching to MySQL/MyISAM due to threading / locking issues.
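
A rough sketch of the storage side, assuming a hypothetical pages table keyed by URL and an in-memory SQLite database through PDO so the example runs on its own. The storePage() and hasPage() names are illustrative only.

```php
<?php
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
// Hypothetical schema: one row per URL, overwritten on each re-fetch.
$db->exec('CREATE TABLE pages (
    url TEXT PRIMARY KEY,
    fetched_at INTEGER NOT NULL,
    content TEXT NOT NULL
)');

function storePage(PDO $db, $url, $content)
{
    // INSERT OR REPLACE is SQLite syntax; MySQL would use REPLACE INTO
    // or INSERT ... ON DUPLICATE KEY UPDATE instead.
    $stmt = $db->prepare('INSERT OR REPLACE INTO pages (url, fetched_at, content) VALUES (?, ?, ?)');
    $stmt->execute(array($url, time(), $content));
}

// True if the URL has already been fetched and stored.
function hasPage(PDO $db, $url)
{
    $stmt = $db->prepare('SELECT COUNT(*) FROM pages WHERE url = ?');
    $stmt->execute(array($url));
    return $stmt->fetchColumn() > 0;
}

storePage($db, 'http://www.example.com/', '<html>first version</html>');
storePage($db, 'http://www.example.com/', '<html>second version</html>');
var_dump(hasPage($db, 'http://www.example.com/'));      // stored
var_dump(hasPage($db, 'http://www.example.com/other')); // not stored
```

Because the URL is the primary key, storing the same page twice leaves a single row, which is why canonicalising URLs before storing them matters.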

What would the code for parsing links look like? Again, I know it depends, but could you give some example code to give me some idea?

I used DOMDocument::loadHTML then searched for anchor nodes in the document (for instance with getElementsByTagName).
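
A minimal sketch of that approach, using PHP's DOM extension. The extractLinks() name is made up for this example, and this is not the code from the spider itself.

```php
<?php
// Returns the href of every <a> element in an HTML string.
function extractLinks($html)
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parse warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = array();
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        // Named anchors such as <a name="top"> have no href and are skipped.
        if ($anchor->hasAttribute('href')) {
            $links[] = $anchor->getAttribute('href');
        }
    }
    return $links;
}

$html = '<html><body><a href="/about">About</a> <a name="top">no href</a> '
      . '<a href="../news/index.html?page=2">News</a></body></html>';
print_r(extractLinks($html));
```

The values returned are exactly what the page author wrote, so they are often relative; each one still has to be resolved against the page's own URL with something like the FindRelativeUrl() routine posted earlier.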

Mark

joeiscoolone

Thank You, that helps me to better understand how it works.

kmussel

If you want a good start to building a web crawler, I found a tutorial that uses cURL. It uses preg_match to get all the links on the page and follows each link. The link is http://kevinmusselman.com/blog/2009/11/crawling-web-pages-for-sitemaps/