This isn't a specific question; it's more of a general one about architecture. Any comments and guidance are appreciated...
I have a requirement to search the home pages of about 40 sites and check whether there's any mention of a series of products (keywords I define). If there is a mention, I need to spider the corresponding link, assign a crawl date, and archive the link in a database. I'm only spidering one level deep.
Basically, I'm trying to create an "alert" system of sorts so I'm notified when someone has posted a preview, review, etc. of one of the defined products. Unlike a typical search engine, the latest crawl should only report qualified links that aren't already in the database. At a later time, I'll need to query document links by date ranges.
Examples (a rough query sketch follows the list):
a. Show me all widget links between January 1 and March 15th.
b. Which sites have posted information on my widget today?
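To make those two examples concrete, here's roughly the schema and queries I'm picturing. This is just a sketch using sqlite3 for illustration; the `links` table and its column names are placeholders I made up:

```python
import sqlite3

conn = sqlite3.connect("links.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS links (
        url        TEXT PRIMARY KEY,  -- the qualified link
        site       TEXT,              -- home page it was found on
        crawl_date TEXT               -- ISO date string, e.g. '2024-01-15'
    )
""")

# a. all widget links between January 1 and March 15
rows = conn.execute(
    "SELECT url FROM links WHERE crawl_date BETWEEN ? AND ?",
    ("2024-01-01", "2024-03-15"),
).fetchall()

# b. which sites have posted something today
sites = conn.execute(
    "SELECT DISTINCT site FROM links WHERE crawl_date = date('now')"
).fetchall()
```

Making the URL the primary key would also take care of "only report links that aren't already in the database."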
What I'm thinking I need to do is:
Establish a baseline (rough sketch after this list):
- Open the destination site and extract all the links
- Open each link and look for the keywords. If there's a match, record the link with the date in the database.
- Create a link exclusion list: archive all links from the homepage that won't lead to a mention of my products, plus the good links I just followed, so I don't follow any of them again.
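Here's the kind of thing I have in mind for the baseline pass, in plain Python with urllib. The `KEYWORDS` list, `SITES` list, and the dict/set standing in for the real database are all made up for illustration:

```python
import datetime
import re
import urllib.request
from urllib.parse import urljoin

KEYWORDS = ["widget", "widget pro"]        # hypothetical product keywords
SITES = ["http://example.com/"]            # the ~40 home pages would go here

def fetch(url):
    """Fetch a page; return empty text on failure so one bad site
    doesn't stop the whole run."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except Exception:
        return ""

def extract_links(base_url, html):
    """Very crude href extraction -- fine for a sketch; a real HTML
    parser would do a better job."""
    return {urljoin(base_url, href)
            for href in re.findall(r'href=["\'](.*?)["\']', html, re.I)}

def baseline(site, db, exclusions):
    """First pass over one home page: file matching links, exclude everything."""
    for link in extract_links(site, fetch(site)):
        page = fetch(link).lower()
        if any(kw in page for kw in KEYWORDS):
            db[link] = datetime.date.today().isoformat()  # crawl date
        exclusions.add(link)               # good or bad, never follow it again
```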
On an ongoing basis (sketch after this list):
- Open the destination site and extract all the links
- Compare against the link exclusion file
- Follow only new links, and archive each link with the date if a product is mentioned.
- Append both the bad links and the links just filed to the exclusion file.
- Move on to the next site.
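The ongoing pass would reuse the same pieces, only following links that aren't already in the exclusion set (again just a sketch; `fetch`, `extract_links`, `KEYWORDS`, and `SITES` are the hypothetical helpers from the baseline sketch above):

```python
import datetime

def ongoing(site, db, exclusions):
    """Re-crawl one home page, following only links we haven't seen before."""
    new_links = extract_links(site, fetch(site)) - exclusions
    for link in new_links:
        page = fetch(link).lower()
        if any(kw in page for kw in KEYWORDS):
            db[link] = datetime.date.today().isoformat()
        exclusions.add(link)               # filed or bad, either way it's done

# one full run:
# for site in SITES:
#     ongoing(site, db, exclusions)
```

My thinking is that the exclusion set keeps the re-crawls cheap, since most homepage links don't change from day to day.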
My concern is that the script will take eons to run and will likely "time out." Before I embark on this mini-adventure, I'd certainly appreciate any comments. Time-saving comments are welcome too 🙂
Regards,
Tim