Hi, I've written a medium-to-large scale web spider. It's still an experimental prototype; over a few hours of runs it has fetched more than 50,000 pages so far.
abc123 wrote:
Open up the page as a file.
Well, you can use fopen(), but I've found it's not flexible enough: it doesn't give you enough control over timeouts, and its behaviour on redirects is a bit lame.
Therefore I wrote my own HTTP implementation (I normally advise against this for simple apps). This lets me use sensible timeouts and keep tabs on exactly what it's doing, and I can be confident it behaves correctly.
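Just to illustrate the level of control I mean, here's a rough sketch of a raw HTTP GET using fsockopen() with explicit connect and read timeouts. It's not my spider's actual client (that also has to deal with redirects, size limits and so on), and the user-agent string is made up:

<?php
// Rough sketch: a bare-bones HTTP/1.0 GET with explicit timeouts.
// A real client also needs redirect handling, response size limits,
// HTTPS support and so on.
function fetch_page($host, $path = '/', $timeout = 10)
{
    $errno = 0;
    $errstr = '';
    // Connect timeout.
    $fp = @fsockopen($host, 80, $errno, $errstr, $timeout);
    if (!$fp) {
        return false;
    }
    // Read timeout, so a stalled server can't hang the worker.
    stream_set_timeout($fp, $timeout);

    fwrite($fp, "GET $path HTTP/1.0\r\n" .
                "Host: $host\r\n" .
                "User-Agent: ExperimentalSpider/0.1\r\n" .
                "Connection: close\r\n\r\n");

    $response = '';
    while (!feof($fp)) {
        $chunk = fread($fp, 8192);
        $meta  = stream_get_meta_data($fp);
        if ($chunk === false || $meta['timed_out']) {
            fclose($fp);
            return false;   // give up rather than block forever
        }
        $response .= $chunk;
    }
    fclose($fp);

    // Split headers from body.
    $pos = strpos($response, "\r\n\r\n");
    if ($pos === false) {
        return false;
    }
    return array(
        'headers' => substr($response, 0, $pos),
        'body'    => substr($response, $pos + 4),
    );
}
?>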
Use some string functions to take out data that is useful. (Title, Keywords etc...)
Using string functions is a really bad idea when PHP already has a perfectly good HTML parser (DOMDocument::loadHTML()).
I'm using the DOM parser, although there are some issues with encoding - the PHP DOM parser assumes a default encoding (not sure what, and I don't think it can be changed) in the absence of a meta http-equiv content-type element in the head.
This is a problem at the moment, as a lot of documents on the web rely on HTTP headers for the encoding, which my spider does not (currently) respect.
Internally I store everything as UTF-8, of course.
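One common workaround (not necessarily what my spider does) is to convert the raw bytes to UTF-8 yourself, using whatever charset you got from the HTTP headers or a guess, and declare that to the parser before loading. Something along these lines:

<?php
// Sketch: load an HTML page into DOM as UTF-8, assuming we already
// know (or have guessed) the source charset, e.g. from the HTTP
// Content-Type header. $html and $charset are hypothetical inputs.
function load_html_as_utf8($html, $charset = 'ISO-8859-1')
{
    // Convert the raw bytes to UTF-8 up front.
    $utf8 = mb_convert_encoding($html, 'UTF-8', $charset);

    // libxml will still guess unless the document declares a charset,
    // so prepend a meta element declaring UTF-8.
    $utf8 = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
          . $utf8;

    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // real-world HTML is full of errors
    $doc->loadHTML($utf8);
    libxml_clear_errors();
    return $doc;
}
?>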
Take out any links from the page for traversing more pages -
That is actually much harder than you imagine. Using a DOM you can easily find the links, but how do you determine where they point?
PHP's parse_url() function is a bit lame: it throws errors on some valid types of URL instead of parsing them.
Moreover, a lot of pages use relative URLs, so you need a function which can interpret a relative URL against the URL of the page it appears on. You also need to decide which schemes you support, and take query strings into account. Links like
<a href="?blah=42">42 things</a>
need to work, as well as ../ and ../../ and so on.
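To give an idea of what's involved, here's a rough sketch of a resolver. It's deliberately simplified (no //host scheme-relative URLs, no userinfo, no real fragment handling), and the function name is just for illustration:

<?php
// Simplified sketch of resolving a relative URL against a base URL.
// Handles absolute URLs, "?query" links, "/absolute" paths and
// "../" style relative paths; ignores //host forms, userinfo etc.
function resolve_url($base, $href)
{
    // Already absolute?
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $href)) {
        return $href;
    }

    $b = parse_url($base);
    if ($b === false || !isset($b['scheme'], $b['host'])) {
        return false;
    }
    $port   = isset($b['port']) ? ':' . $b['port'] : '';
    $prefix = $b['scheme'] . '://' . $b['host'] . $port;
    $bpath  = isset($b['path']) ? $b['path'] : '/';

    if ($href === '' || $href[0] === '#') {
        return $base;                       // same document
    }
    if ($href[0] === '?') {
        return $prefix . $bpath . $href;    // e.g. <a href="?blah=42">
    }
    if ($href[0] === '/') {
        return $prefix . $href;             // absolute path
    }

    // Relative path: start from the base path's directory.
    $dir = substr($bpath, 0, strrpos($bpath, '/') + 1);
    $segments = array();
    foreach (explode('/', $dir . $href) as $seg) {
        if ($seg === '' || $seg === '.') {
            continue;
        }
        if ($seg === '..') {
            array_pop($segments);
        } else {
            $segments[] = $seg;
        }
    }
    return $prefix . '/' . implode('/', $segments);
}

// e.g. resolve_url('http://example.com/a/b/page.html', '../c.html')
//      => 'http://example.com/a/c.html'
?>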
If there is multi-threading in php then I would imagine spawning more crawlers would be the way.
There isn't a standard thread library (might be one in PECL, I haven't investigated). The obvious options are:
- Creating your own SAPI that spawns a number of threads
- Running it in multiple processes instead of threads.
I'm currently running my test jobs with 6 processes concurrently. This loads my machine up pretty well, especially since I haven't optimised the number of queries the spider does yet.
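For what it's worth, spawning the workers is the easy part. A minimal sketch using pcntl_fork() (CLI only, and it needs the pcntl extension; crawl_batch() here is just a stand-in for the real per-process crawl loop):

<?php
// Sketch: run N worker processes with pcntl_fork().
function crawl_batch($worker_id)
{
    // ... fetch and process URLs from the shared queue ...
    echo "worker $worker_id done\n";
}

$num_workers = 6;
$children = array();

for ($i = 0; $i < $num_workers; $i++) {
    $pid = pcntl_fork();
    if ($pid === -1) {
        die("fork failed\n");
    }
    if ($pid === 0) {
        // Child: do a batch of crawling, then exit.
        crawl_batch($i);
        exit(0);
    }
    $children[] = $pid;   // parent keeps track of its children
}

// Parent waits for all workers to finish.
foreach ($children as $pid) {
    pcntl_waitpid($pid, $status);
}
?>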
I store all the metadata in MySQL, including the queue of pages to index and so on. That means a lot of queries, mostly to look up URLs in the database (to find out whether we've already seen them, etc.).
Multiprocessing also means doing things in the right order and making sure there are no database contention problems. MySQL's MyISAM table type has fast table-level locking, which is pretty good for serialising operations you want to be atomic (although of course this has disadvantages too).
I found that using InnoDB was hopeless, because it deadlocks (and rolls back your transaction) far too often if you have lots of processes doing writes at the same time.
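To make that concrete, here's a sketch of serialising the "claim the next URL from the queue" step with table locking. The crawl_queue table and its columns (id, url, status, worker) are invented for the example, and I've used mysqli for brevity:

<?php
// Sketch: atomically claim the next unfetched URL from a MyISAM
// queue table using table-level locking. Table and column names
// are invented for this example.
function claim_next_url(mysqli $db, $worker_id)
{
    $db->query("LOCK TABLES crawl_queue WRITE");

    $res = $db->query(
        "SELECT id, url FROM crawl_queue
          WHERE status = 'queued'
          ORDER BY id LIMIT 1");
    $row = $res ? $res->fetch_assoc() : null;

    if ($row) {
        // Mark it as taken while we still hold the lock, so no other
        // worker can grab the same row.
        $db->query(
            "UPDATE crawl_queue
                SET status = 'fetching', worker = " . (int)$worker_id . "
              WHERE id = " . (int)$row['id']);
    }

    $db->query("UNLOCK TABLES");
    return $row ? $row['url'] : null;
}
?>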
Index the current page with the link and somehow have it organised by most relevant depending on the search being performed. (This part I find puzzling)
I'm not currently indexing them by topic at all. That is not my plan - rather I plan to make the spider go preferentially after certain types of page. I'm really interested in gathering technical data rather than words.
I'm also aware that some pages don't want web crawlers, so the headers that tell the crawler not to traverse the page will need to be checked for.
Yes. The robots exclusion protocol tells you, in robots.txt, which directories, URI prefixes etc. the site wants you to stay out of.
My spider understands robots.txt, but doesn't take any notice of robots meta tags (yet).
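For reference, checking a path against robots.txt doesn't take much code. A rough sketch (it only handles User-agent and Disallow lines, ignores Allow, wildcards and Crawl-delay, and the agent name is made up):

<?php
// Collect the Disallow prefixes that apply to "*" or to our own
// agent name from a robots.txt body, then test paths against them.
function disallowed_prefixes($robots_txt, $agent = 'experimentalspider')
{
    $prefixes = array();
    $applies  = false;   // does the current record apply to us?
    $in_ua    = false;   // are we inside a run of User-agent lines?

    foreach (preg_split('/\r\n|\r|\n/', $robots_txt) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line));   // strip comments
        if ($line === '') {
            continue;
        }
        if (stripos($line, 'User-agent:') === 0) {
            $ua    = strtolower(trim(substr($line, 11)));
            $match = ($ua === '*' || strpos($agent, $ua) !== false);
            // Consecutive User-agent lines form a single record.
            $applies = $in_ua ? ($applies || $match) : $match;
            $in_ua   = true;
        } else {
            $in_ua = false;
            if ($applies && stripos($line, 'Disallow:') === 0) {
                $path = trim(substr($line, 9));
                if ($path !== '') {
                    $prefixes[] = $path;   // empty Disallow means "allow all"
                }
            }
        }
    }
    return $prefixes;
}

function path_allowed($path, $prefixes)
{
    foreach ($prefixes as $prefix) {
        if (strpos($path, $prefix) === 0) {
            return false;
        }
    }
    return true;
}
?>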
Does anyone have any experience with this type of thing or any tips?
Yes. See above.
Also I am slightly worried about creating these spiders and then just causing problems on people's sites.
It's not really a big problem. If you simply make a rule that the spider must not visit the same site too often, it's fine. Because there are so many other sites in the queue, all the processes just keep themselves busy with other sites until it's time to go back.
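The rule itself can be as simple as remembering when you last fetched from each host and skipping URLs whose host was hit too recently. A sketch (in the real spider this state would have to live in the database so all the processes share it; the 30-second delay is just an example):

<?php
// Per-host politeness check: skip a URL if we fetched from its host
// less than $min_delay seconds ago.
function polite_to_fetch($url, &$last_fetch, $min_delay = 30)
{
    $host = parse_url($url, PHP_URL_HOST);
    if ($host === false || $host === null) {
        return false;
    }
    $now = time();
    if (isset($last_fetch[$host]) && ($now - $last_fetch[$host]) < $min_delay) {
        return false;   // too soon; leave it in the queue and move on
    }
    $last_fetch[$host] = $now;
    return true;
}
?>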
Mark