Well, you can use fopen() with the allow_url_fopen setting enabled in php.ini. You can use stream contexts to supply additional parameters (method, headers, timeouts and so on) for this.
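A minimal sketch of that approach - the user agent string, timeout and URL are just placeholder values:

    <?php
    // Minimal sketch: fetch a page via the http stream wrapper.
    // Requires allow_url_fopen = On; the UA string and timeout are examples.
    $context = stream_context_create(array(
        'http' => array(
            'method'          => 'GET',
            'user_agent'      => 'MyCrawler/0.1',  // hypothetical name
            'timeout'         => 10,               // seconds
            'follow_location' => 1,                // follow redirects
        ),
    ));

    $html = file_get_contents('http://example.com/', false, $context);
    if ($html === false) {
        // You get very little error detail back from the stream wrapper
        echo "Fetch failed\n";
    }
    // $http_response_header now holds the raw response headers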
Alternatively you can use the cURL extension (if it's enabled).
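The cURL version of the same fetch looks roughly like this - the options shown are just one reasonable set, not the only one:

    <?php
    // Minimal sketch using the cURL extension (assumes it's compiled in).
    $ch = curl_init('http://example.com/');
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,            // return the body as a string
        CURLOPT_FOLLOWLOCATION => true,            // follow redirects
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_TIMEOUT        => 10,
        CURLOPT_USERAGENT      => 'MyCrawler/0.1', // hypothetical name
    ));
    $html   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($html === false || $status >= 400) {
        echo "Fetch failed (HTTP $status)\n";
    }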
As a third option, you can use one of the existing HTTP client libraries (there are at least two in PEAR; I've not tried either of them).
As a fourth option you can write your own HTTP implementation (which is what I ultimately did after it became obvious that fopen() wasn't flexible enough even with the stream context options).
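A hand-rolled request over fsockopen() looks roughly like the sketch below (HTTP/1.0 with Connection: close to sidestep chunked encoding); a real implementation needs read timeouts, redirect handling and much more:

    <?php
    // Rough sketch of a hand-rolled HTTP GET over a raw socket.
    $host = 'example.com';
    $fp = fsockopen($host, 80, $errno, $errstr, 10);
    if (!$fp) {
        die("Connect failed: $errstr ($errno)\n");
    }

    $request  = "GET / HTTP/1.0\r\n";
    $request .= "Host: $host\r\n";
    $request .= "User-Agent: MyCrawler/0.1\r\n";   // hypothetical name
    $request .= "Connection: close\r\n\r\n";
    fwrite($fp, $request);

    $response = '';
    while (!feof($fp)) {
        $response .= fread($fp, 8192);
    }
    fclose($fp);

    // Split headers from body at the first blank line
    $parts   = explode("\r\n\r\n", $response, 2);
    $headers = $parts[0];
    $body    = isset($parts[1]) ? $parts[1] : '';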
You'll also need an HTML parser - fortunately PHP 5 has one built in via libxml2 - the DOMDocument::loadHTML() method will do what you want.
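For example, pulling the links out of a page (libxml_use_internal_errors() stops broken real-world markup drowning you in warnings; the URL is just a placeholder):

    <?php
    // Minimal sketch: parse fetched HTML and collect the <a href> values.
    $html = file_get_contents('http://example.com/');  // or any of the methods above

    libxml_use_internal_errors(true);   // real pages are rarely valid
    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $links = array();
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if ($href !== '') {
            $links[] = $href;   // still needs resolving against the base URL
        }
    }
    libxml_clear_errors();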
Making a web crawler is VERY involved and takes a huge amount of work. Real web pages are full of errors, and you'll run into a lot of problems.
Issues I found:
- Multithreading efficiently
- Database locking / contention issues
- Startup/shutdown and remembering what pages are done
- Parsing robots.txt
- Handling broken things (for example, servers which return a 200 status even for pages which don't exist).
- Handling SPAM sites created just to piss robots off (believe me, there are a LOT of these)
- Gracefully handling errors / exceptions thrown from inside the crawler itself and deciding what to do with those URLs in the queue
- Handling encodings correctly - even when the page gives several conflicting signals (HTTP headers, meta tags) about which encoding it's in, or just plain lies (see the sketch after this list)
- Handling non-HTML pages
- Redirect handling
- Deciding what to spider next / prioritisation
These are just a few of the issues I found when trying to do this.
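To give a flavour of the encoding problem, here's roughly the kind of thing you end up writing. This assumes the mbstring extension; detect_charset() is a made-up helper name, and the priority order (header, then meta tag, then byte-level detection) is just one policy - pages will lie to you in both places:

    <?php
    // Rough sketch: decide on a charset and normalise everything to UTF-8.
    // detect_charset() is a hypothetical helper, not a built-in.
    function detect_charset($contentTypeHeader, $html) {
        // 1. Content-Type response header, e.g. "text/html; charset=ISO-8859-1"
        if (preg_match('/charset=([\w-]+)/i', $contentTypeHeader, $m)) {
            return strtoupper($m[1]);
        }
        // 2. <meta> declaration inside the document itself
        if (preg_match('/<meta[^>]+charset=["\']?([\w-]+)/i', $html, $m)) {
            return strtoupper($m[1]);
        }
        // 3. Fall back to guessing from the bytes (strict mode)
        $guess = mb_detect_encoding($html, array('UTF-8', 'ISO-8859-1', 'Windows-1252'), true);
        return $guess !== false ? $guess : 'UTF-8';
    }

    // Example: a page that only declares its encoding in a meta tag
    $html = "<html><head><meta http-equiv=\"Content-Type\" "
          . "content=\"text/html; charset=iso-8859-1\"></head>"
          . "<body>caf\xe9</body></html>";

    $charset = detect_charset('text/html', $html);
    if ($charset !== 'UTF-8') {
        // Normalise to UTF-8 before it goes anywhere near storage
        $html = mb_convert_encoding($html, 'UTF-8', $charset);
    }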
My conclusion was that PHP isn't a very suitable language for an HTTP spider - it simply doesn't give you enough low-level control over most things (such as sockets, processes, threads, locking, high-performance database access).
But it did work and I spidered hundreds of thousands of web pages with it.
Mark