I'm in the process of creating a spidering system for a new site I'm designing, and have been getting decent ideas so far. What I envision is a script that works through a list of URLs assigned to it, stripping out everything except the text of each page and grabbing all words of 5 characters or more for inclusion in a database.
Since my data is 90% dynamic, I can't rely on meta tags as heavily as I once thought, so I'll need to build the index from the content itself. In any event, I know how to read in a URL and how to strip it down to plain text without any HTML (to a degree). Now all I need is a way to count all the words found and only keep those with 5 or more characters.
Can this be done?
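To make the counting step concrete, here is a minimal sketch in Python of what I have in mind; the function name and sample text are just illustrative, and it assumes the HTML has already been stripped to plain text:

```python
import re
from collections import Counter

def count_long_words(text, min_len=5):
    # Pull out runs of letters, lowercased so "Word" and "word" count together
    words = re.findall(r"[A-Za-z]+", text.lower())
    # Keep only words at or above the minimum length, with their frequencies
    return Counter(w for w in words if len(w) >= min_len)

# Hypothetical page text after HTML stripping
sample = "The quick brown foxes jumped over seventeen lazy hounds"
counts = count_long_words(sample)
for word, n in counts.most_common():
    print(word, n)
```

Each word that survives the length filter comes back with its frequency, which is the shape of data I'd want to insert into the database.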