I'm in the process of creating a spidering system for a new site I'm designing, and have been getting decent ideas so far. What I envision is a script that works through a list of URLs assigned to it, stripping out everything except the text of each page and grabbing all words of 5 characters or more for inclusion in a database.
Since my data is 90% dynamic, I can't rely on meta tags as heavily as I once thought, so I'll need to build the index from the content itself. In any event, I know how to read in a URL and how to strip it down to plain text without any HTML (to a degree). Now all I need is a way to count all the words found and only keep those with 5 or more characters.
Can this be done?
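To make the counting step concrete, here is a minimal sketch in Python of what I have in mind; the function name and sample text are just illustrative, and it assumes the HTML has already been stripped to plain text:

```python
import re
from collections import Counter

def count_long_words(text, min_len=5):
    # Pull out runs of letters, lowercased so "Word" and "word" count together
    words = re.findall(r"[A-Za-z]+", text.lower())
    # Keep only words at or above the minimum length, with their frequencies
    return Counter(w for w in words if len(w) >= min_len)

# Hypothetical page text after HTML stripping
sample = "The quick brown foxes jumped over seventeen lazy hounds"
counts = count_long_words(sample)
for word, n in counts.most_common():
    print(word, n)
```

Each word that survives the length filter comes back with its frequency, which is the shape of data I'd want to insert into the database.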