This isn't a specific question; it's more of a general one about architecture. Any comments and guidance are appreciated...
I have a requirement to search the home pages of about 40 sites and check whether there's any mention of a series of products (keywords I define). If there is a mention, I need to spider the corresponding link, assign a crawl date, and archive the link in a database. I'm only spidering one level deep.
Basically, I'm trying to create an "alert" system of sorts so I'm notified when someone has posted a preview, review, etc. of one of the defined products. Unlike a typical search engine, the latest crawl should only report qualified links that aren't already in the database. At a later time, I'll need to query document links by date ranges.
Examples (a rough query sketch follows the list):
a. Show me all widget links between January 1 and March 15th.
b. Which sites have posted information on my widget today?
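To make those two examples concrete, here's roughly the schema and queries I'm picturing. This is just a sketch using sqlite3 for illustration; the `links` table and its column names are placeholders I made up:

```python
import sqlite3

conn = sqlite3.connect("links.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS links (
        url        TEXT PRIMARY KEY,  -- the qualified link
        site       TEXT,              -- home page it was found on
        crawl_date TEXT               -- ISO date string, e.g. '2024-01-15'
    )
""")

# a. all widget links between January 1 and March 15
rows = conn.execute(
    "SELECT url FROM links WHERE crawl_date BETWEEN ? AND ?",
    ("2024-01-01", "2024-03-15"),
).fetchall()

# b. which sites have posted something today
sites = conn.execute(
    "SELECT DISTINCT site FROM links WHERE crawl_date = date('now')"
).fetchall()
```

Making the URL the primary key would also take care of "only report links that aren't already in the database."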
What I'm thinking I need to do is:
Establish a baseline (rough sketch after this list):
- Open the destination site and extract all the links
- Open each link and look for the keywords. If there's a match, record the link with the date in the database.
- Create a link exclusion list: archive all links from the homepage that won't lead to a mention of my products, plus the good links I just followed, so I don't follow any of them again.
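Here's the kind of thing I have in mind for the baseline pass, in plain Python with urllib. The `KEYWORDS` list, `SITES` list, and the dict/set standing in for the real database are all made up for illustration:

```python
import datetime
import re
import urllib.request
from urllib.parse import urljoin

KEYWORDS = ["widget", "widget pro"]        # hypothetical product keywords
SITES = ["http://example.com/"]            # the ~40 home pages would go here

def fetch(url):
    """Fetch a page; return empty text on failure so one bad site
    doesn't stop the whole run."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except Exception:
        return ""

def extract_links(base_url, html):
    """Very crude href extraction -- fine for a sketch; a real HTML
    parser would do a better job."""
    return {urljoin(base_url, href)
            for href in re.findall(r'href=["\'](.*?)["\']', html, re.I)}

def baseline(site, db, exclusions):
    """First pass over one home page: file matching links, exclude everything."""
    for link in extract_links(site, fetch(site)):
        page = fetch(link).lower()
        if any(kw in page for kw in KEYWORDS):
            db[link] = datetime.date.today().isoformat()  # crawl date
        exclusions.add(link)               # good or bad, never follow it again
```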
On an ongoing basis (sketch after this list):
- Open the destination site and extract all the links
- Compare against the link exclusion file
- Follow only new links, and archive each link with the date if a product is mentioned.
- Append both the bad links and the links just filed to the exclusion file.
- Move on to the next site.
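The ongoing pass would reuse the same pieces, only following links that aren't already in the exclusion set (again just a sketch; `fetch`, `extract_links`, `KEYWORDS`, and `SITES` are the hypothetical helpers from the baseline sketch above):

```python
import datetime

def ongoing(site, db, exclusions):
    """Re-crawl one home page, following only links we haven't seen before."""
    new_links = extract_links(site, fetch(site)) - exclusions
    for link in new_links:
        page = fetch(link).lower()
        if any(kw in page for kw in KEYWORDS):
            db[link] = datetime.date.today().isoformat()
        exclusions.add(link)               # filed or bad, either way it's done

# one full run:
# for site in SITES:
#     ongoing(site, db, exclusions)
```

My thinking is that the exclusion set keeps the re-crawls cheap, since most homepage links don't change from day to day.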
My concern is that the script will take eons to run and will likely "time out." Before I embark on this mini-adventure, I'd certainly appreciate any comments. Time-saving comments are welcome too 🙂
Regards,
Tim