Well, you can use fopen() with the allow_url_fopen setting enabled in php.ini. You can use stream contexts to supply additional parameters (method, headers, timeouts and so on) for this.
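A minimal sketch of that approach - the user agent string, timeout and URL are just placeholder values:

    <?php
    // Minimal sketch: fetch a page via the http stream wrapper.
    // Requires allow_url_fopen = On; the UA string and timeout are examples.
    $context = stream_context_create(array(
        'http' => array(
            'method'          => 'GET',
            'user_agent'      => 'MyCrawler/0.1',  // hypothetical name
            'timeout'         => 10,               // seconds
            'follow_location' => 1,                // follow redirects
        ),
    ));

    $html = file_get_contents('http://example.com/', false, $context);
    if ($html === false) {
        // You get very little error detail back from the stream wrapper
        echo "Fetch failed\n";
    }
    // $http_response_header now holds the raw response headers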
Alternatively you can use the cURL extension (if it's enabled).
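The cURL version of the same fetch looks roughly like this - the options shown are just one reasonable set, not the only one:

    <?php
    // Minimal sketch using the cURL extension (assumes it's compiled in).
    $ch = curl_init('http://example.com/');
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,            // return the body as a string
        CURLOPT_FOLLOWLOCATION => true,            // follow redirects
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_TIMEOUT        => 10,
        CURLOPT_USERAGENT      => 'MyCrawler/0.1', // hypothetical name
    ));
    $html   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($html === false || $status >= 400) {
        echo "Fetch failed (HTTP $status)\n";
    }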
As a third option, you can use one of the existing HTTP client libraries (there are at least two in PEAR; I've not tried either of them).
As a fourth option you can write your own HTTP implementation (which is what I ultimately did after it became obvious that fopen() wasn't flexible enough even with the stream context options).
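A hand-rolled request over fsockopen() looks roughly like the sketch below (HTTP/1.0 with Connection: close to sidestep chunked encoding); a real implementation needs read timeouts, redirect handling and much more:

    <?php
    // Rough sketch of a hand-rolled HTTP GET over a raw socket.
    $host = 'example.com';
    $fp = fsockopen($host, 80, $errno, $errstr, 10);
    if (!$fp) {
        die("Connect failed: $errstr ($errno)\n");
    }

    $request  = "GET / HTTP/1.0\r\n";
    $request .= "Host: $host\r\n";
    $request .= "User-Agent: MyCrawler/0.1\r\n";   // hypothetical name
    $request .= "Connection: close\r\n\r\n";
    fwrite($fp, $request);

    $response = '';
    while (!feof($fp)) {
        $response .= fread($fp, 8192);
    }
    fclose($fp);

    // Split headers from body at the first blank line
    $parts   = explode("\r\n\r\n", $response, 2);
    $headers = $parts[0];
    $body    = isset($parts[1]) ? $parts[1] : '';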
You'll also need an HTML parser - fortunately PHP 5 has one built in via libxml2 - the DOMDocument::loadHTML() method will do what you want.
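For example, pulling the links out of a page (libxml_use_internal_errors() stops broken real-world markup drowning you in warnings; the URL is just a placeholder):

    <?php
    // Minimal sketch: parse fetched HTML and collect the <a href> values.
    $html = file_get_contents('http://example.com/');  // or any of the methods above

    libxml_use_internal_errors(true);   // real pages are rarely valid
    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $links = array();
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if ($href !== '') {
            $links[] = $href;   // still needs resolving against the base URL
        }
    }
    libxml_clear_errors();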
Making a web crawler is VERY involved and takes a huge amount of work. Real web pages are full of errors, and you'll run into a lot of problems.
Issues I found:
- Multithreading efficiently
- Database locking / contention issues
- Startup/shutdown and remembering what pages are done
- Parsing robots.txt
- Handling broken things (for example, servers which return a 200 status even for pages which don't exist).
- Handling SPAM sites created just to piss robots off (believe me, there are a LOT of these)
- Gracefully handling errors / exceptions thrown from inside the crawler itself and deciding what to do with those URLs in the queue
- Handling encodings correctly - even when the page gives several conflicting signals (HTTP headers, meta tags) about which encoding it's in, or just plain lies (see the sketch after this list)
- Handling non-HTML pages
- Redirect handling
- Deciding what to spider next / prioritisation
These are just a few of the issues I found when trying to do this.
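To give a flavour of the encoding problem, here's roughly the kind of thing you end up writing. This assumes the mbstring extension; detect_charset() is a made-up helper name, and the priority order (header, then meta tag, then byte-level detection) is just one policy - pages will lie to you in both places:

    <?php
    // Rough sketch: decide on a charset and normalise everything to UTF-8.
    // detect_charset() is a hypothetical helper, not a built-in.
    function detect_charset($contentTypeHeader, $html) {
        // 1. Content-Type response header, e.g. "text/html; charset=ISO-8859-1"
        if (preg_match('/charset=([\w-]+)/i', $contentTypeHeader, $m)) {
            return strtoupper($m[1]);
        }
        // 2. <meta> declaration inside the document itself
        if (preg_match('/<meta[^>]+charset=["\']?([\w-]+)/i', $html, $m)) {
            return strtoupper($m[1]);
        }
        // 3. Fall back to guessing from the bytes (strict mode)
        $guess = mb_detect_encoding($html, array('UTF-8', 'ISO-8859-1', 'Windows-1252'), true);
        return $guess !== false ? $guess : 'UTF-8';
    }

    // Example: a page that only declares its encoding in a meta tag
    $html = "<html><head><meta http-equiv=\"Content-Type\" "
          . "content=\"text/html; charset=iso-8859-1\"></head>"
          . "<body>caf\xe9</body></html>";

    $charset = detect_charset('text/html', $html);
    if ($charset !== 'UTF-8') {
        // Normalise to UTF-8 before it goes anywhere near storage
        $html = mb_convert_encoding($html, 'UTF-8', $charset);
    }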
My conclusion was that PHP isn't a very suitable language for an HTTP spider - it simply doesn't give you enough low-level control over most things (such as sockets, processes, threads, locking, high-performance database access).
But it did work and I spidered hundreds of thousands of web pages with it.
Mark