Hi,
I am looking for base code in PHP for building a web spidering system. I would like to build a spider that searches a given web site and its sub-sites for information useful to the user. The spider should be restricted to pages within the given domain.
A user would enter a few search terms they are generally interested in, and from each run the spider would return a list of the links most likely to match the user's requirements.
The spider, given a base URL, will visit all pages within 3 links of that URL and report which links it has visited.
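The depth-limited visit could be sketched as a breadth-first crawl with a visited map. This is only a sketch under one assumption: the `$fetchLinks` callable is a hypothetical stand-in that returns a page's outgoing links; in a real spider it would fetch the page over HTTP (e.g. with cURL) and parse its `<a href>` tags.

```php
<?php
// Breadth-first crawl limited to a fixed depth and a single domain.
// $fetchLinks is a caller-supplied callable (hypothetical, for illustration)
// mapping a URL to the list of links found on that page.
function crawl(string $baseUrl, int $maxDepth, callable $fetchLinks): array
{
    $host = parse_url($baseUrl, PHP_URL_HOST);
    $visited = [];                    // URL => depth at which it was reached
    $queue = [[$baseUrl, 0]];         // FIFO queue of [url, depth] pairs

    while ($queue) {
        [$url, $depth] = array_shift($queue);
        if (isset($visited[$url]) || $depth > $maxDepth) {
            continue;                 // already visited, or too deep
        }
        if (parse_url($url, PHP_URL_HOST) !== $host) {
            continue;                 // stay inside the starting domain
        }
        $visited[$url] = $depth;
        foreach ($fetchLinks($url) as $link) {
            $queue[] = [$link, $depth + 1];
        }
    }
    return $visited;                  // report of every URL visited, with depth
}
```

Because the visited map records depth, the same function doubles as the "report which links it has visited" step; a depth of 3 gives the "within 3 links" behaviour described above.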
The program would recognise and return a list of pages that match the search criteria entered by the user.
The program would also store some information about the matching pages visited, for example the title and the information from the meta-tags. This information could be stored in any convenient format. The spider should also record any URLs that were unavailable, along with the error codes returned.
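Extracting the title and meta-tags could look roughly like the sketch below, using PHP's built-in DOMDocument (an assumption; any HTML parser would do). Unavailable URLs could be recorded separately, e.g. from the HTTP status code that `curl_getinfo()` reports after a fetch.

```php
<?php
// Sketch: pull the title and named meta-tags out of a fetched page's HTML.
function extractPageInfo(string $html): array
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html);           // suppress warnings on messy real-world markup

    $titleNode = $doc->getElementsByTagName('title')->item(0);
    $info = [
        'title' => $titleNode ? trim($titleNode->textContent) : '',
        'meta'  => [],
    ];
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $name = $meta->getAttribute('name');
        if ($name !== '') {
            $info['meta'][$name] = $meta->getAttribute('content');
        }
    }
    return $info;                     // e.g. ['title' => ..., 'meta' => ['description' => ...]]
}
```

The returned array could then be serialised to whatever storage format is chosen (CSV, JSON, a database table, etc.).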
The program would be able to spider pages to an arbitrary depth, and be stopped either by a timeout or because there are no more unvisited pages. The spider should not loop; that is, it should not re-visit pages it has already visited, which will require choosing an appropriate data structure to track visited URLs. The spider should also be able to deal with more sophisticated search criteria using Boolean terms such as "drums and not guitar".
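A simple Boolean matcher for queries like "drums and not guitar" might be sketched as below. The grammar here is an assumption for illustration: single words combined left to right with "and", "or", and "not", with no parentheses; a real implementation might want a proper expression parser.

```php
<?php
// Sketch: evaluate a flat Boolean query ("drums and not guitar") against text.
// Operators "and"/"or" combine terms left to right; "not" negates the next term.
function matchesQuery(string $text, string $query): bool
{
    $text   = strtolower($text);
    $result = null;      // running Boolean value
    $op     = 'and';     // pending operator
    $negate = false;     // whether "not" precedes the next term

    foreach (preg_split('/\s+/', strtolower(trim($query))) as $token) {
        if ($token === 'and' || $token === 'or') {
            $op = $token;
        } elseif ($token === 'not') {
            $negate = true;
        } else {
            $present = str_contains($text, $token);  // PHP 8+; strpos() !== false on older PHP
            if ($negate) {
                $present = !$present;
                $negate  = false;
            }
            if ($result === null) {
                $result = $present;
            } elseif ($op === 'and') {
                $result = $result && $present;
            } else {
                $result = $result || $present;
            }
        }
    }
    return $result ?? false;
}
```

Each crawled page's text (or its stored title and meta information) could be passed through this check to build the list of matching pages.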