There are two main parts: a web spider and a search engine. In reality there are probably a lot more pieces than that.
The web spider finds pages to read and then goes and fetches them. These could include:
- Pages already in the database
- Pages linked from pages in the database
- Pages manually submitted
- Possibly entries in various directories
Each page is read and the relevant text extracted. The exact method for doing this varies, but it typically involves an HTML parser. Major search engines (such as Google) can also extract text and links from Microsoft Word documents, PDFs, etc.
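To make that concrete, here's a minimal sketch of the fetch-and-extract step in Python. It only handles plain HTML, and the example.com URL is just a placeholder; a real spider would also need robots.txt handling, politeness delays, script/style filtering and far better error handling.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class PageExtractor(HTMLParser):
    """Collects visible text and outgoing links from one HTML page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

    def handle_data(self, data):
        # Note: this also picks up <script>/<style> contents; a real
        # extractor would skip those tags.
        if data.strip():
            self.text_parts.append(data.strip())

def fetch_and_extract(url):
    """Download one page and return (text, links)."""
    with urlopen(url, timeout=10) as response:
        html = response.read().decode("utf-8", errors="replace")
    extractor = PageExtractor(url)
    extractor.feed(html)
    return " ".join(extractor.text_parts), extractor.links

if __name__ == "__main__":
    text, links = fetch_and_extract("https://example.com/")
    print(text[:200])
    print(links)
```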
The next part is to put the page into a highly specialised database that allows very fast searching - effectively a type of full-text index tailored to the application.
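To give a feel for what that index is, here's a toy inverted index in Python: each term maps to the set of page IDs that contain it. Real engines store much more (term positions, frequencies, link data) and compress it heavily; the page texts below are made up for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Very crude tokenizer: lowercase alphanumeric runs only."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """pages: dict of page_id -> extracted text.
    Returns an inverted index: term -> set of page_ids containing it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in tokenize(text):
            index[term].add(page_id)
    return index

pages = {
    1: "How web spiders crawl and index pages",
    2: "A search engine queries its full-text index",
    3: "Broken servers return 200 OK for missing pages",
}
index = build_index(pages)
print(index["index"])   # {1, 2}
```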
All the search engine itself really does is query this database using that highly specialised index.
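Answering a query is then largely set intersection over that index, plus ranking, which this sketch ignores. The small literal index below stands in for one built as above.

```python
# Toy inverted index: term -> set of page IDs containing it.
index = {
    "search": {2},
    "engine": {2},
    "index": {1, 2},
    "pages": {1, 3},
}

def search(index, query):
    """Return the page IDs containing every query term (AND semantics).
    A real engine would also rank these results; that part is omitted."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

print(search(index, "index pages"))   # {1}
```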
I've written a spider in PHP; it wasn't easy and it didn't work particularly well. There are a lot of issues you will need to think about, chiefly how to get around broken servers (e.g. ones which send back a 200 OK response for files which don't exist) and servers/domains which deliberately send content to confuse or pollute your spider.
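One trick that helps with the first problem is a "soft 404" probe: before trusting a host, request a URL that almost certainly doesn't exist and see whether it still answers 200 OK. The random path here is made up on the fly, and a real spider would cache the verdict per host rather than probing repeatedly.

```python
import uuid
from urllib.error import HTTPError, URLError
from urllib.parse import urljoin
from urllib.request import urlopen

def returns_fake_200(base_url):
    """Probe a host with a random, almost-certainly-missing path.
    If the server still answers 200 OK, it can't be trusted to report
    missing pages properly (a 'soft 404' server)."""
    probe = urljoin(base_url, "/" + uuid.uuid4().hex)
    try:
        with urlopen(probe, timeout=10) as response:
            return response.status == 200
    except HTTPError:
        # A proper 404 (or other error status) lands here, which is what
        # a well-behaved server should do.
        return False
    except URLError:
        # Couldn't reach the host at all; treat as "unknown", not broken.
        return False

if __name__ == "__main__":
    print(returns_fake_200("https://example.com/"))
```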
Mark