Hi,
I have a few questions about the following:
Let's say I have a page where I allow the user to upload a text file. I would then like to take the text file, open it, and create an array containing each unique word along with the number of occurrences of that word.
I'm not sure how to write an efficient algorithm to do that. I want to first remove 'stop words' from the text, then create an array that contains the number of occurrences of each unique word. That's the first step.
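To make the first step concrete, here is a minimal sketch of what I have in mind, not a finished implementation. The function name countWords and the short stop-word list are placeholders I made up, and it assumes plain ASCII English text (strtolower would not handle accented characters properly).

```php
<?php
// Hypothetical stop-word list; a real one would be much longer.
$stopWords = array('a', 'an', 'and', 'the', 'of', 'to', 'in', 'is', 'it');

/**
 * Count occurrences of each unique non-stop word in $text.
 * Returns an array of word => count, most frequent first.
 */
function countWords($text, array $stopWords)
{
    // Lowercase so "Apple" and "apple" count as the same word.
    $text = strtolower($text);

    // Split on anything that is not a letter, digit, or apostrophe.
    $words = preg_split("/[^a-z0-9']+/", $text, -1, PREG_SPLIT_NO_EMPTY);

    // Flip the stop list so each lookup is a hash check, not a scan.
    $stop = array_flip($stopWords);

    $counts = array();
    foreach ($words as $word) {
        if (isset($stop[$word])) {
            continue; // skip stop words
        }
        if (isset($counts[$word])) {
            $counts[$word]++;
        } else {
            $counts[$word] = 1;
        }
    }

    // Sort by count, highest first, keeping the word keys.
    arsort($counts);
    return $counts;
}

$counts = countWords('The apple pie is an apple dessert.', $stopWords);
print_r($counts);
```

This makes a single pass over the words, so the cost should grow roughly linearly with the size of the file, but I don't know if there is a better way.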
Second, I want to store the counts in a database in a way that lets me search by the number of occurrences of a word. So if I run a search for 'apple pie', I want to be able to figure out which word is more relevant (i.e. has occurred more often in the text). What I was thinking was to serialize the array, store it, then retrieve it from the db, unserialize it, and check for those words, but wouldn't that be too inefficient? (Obviously I would first check that those words are in the serialized array.)
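The alternative to serializing that I can think of is one row per document and word, so the database does the lookup and the ordering. Below is a rough sketch of that idea only. The table name word_counts, its columns, and the helper functions saveCounts and searchCounts are all names I invented, and it uses an in-memory SQLite database through PDO just to keep the example self-contained (my real database would be different).

```php
<?php
// In-memory SQLite only to keep the sketch self-contained (assumption).
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// One row per (document, word) instead of one serialized blob per document.
$db->exec('CREATE TABLE word_counts (
    doc_id      INTEGER NOT NULL,
    word        TEXT    NOT NULL,
    occurrences INTEGER NOT NULL,
    PRIMARY KEY (doc_id, word)
)');
// Index on the word so a search does not scan every row.
$db->exec('CREATE INDEX idx_word ON word_counts (word)');

// Store a word => count array for one document.
function saveCounts(PDO $db, $docId, array $counts)
{
    $stmt = $db->prepare(
        'INSERT INTO word_counts (doc_id, word, occurrences) VALUES (?, ?, ?)'
    );
    $db->beginTransaction();
    foreach ($counts as $word => $n) {
        // Cast the key: PHP turns numeric string keys into integers.
        $stmt->execute(array($docId, (string) $word, $n));
    }
    $db->commit();
}

// Return matching rows for the search terms, most occurrences first.
function searchCounts(PDO $db, array $terms)
{
    if (count($terms) === 0) {
        return array();
    }
    // One placeholder per search term.
    $marks = implode(',', array_fill(0, count($terms), '?'));
    $stmt = $db->prepare(
        "SELECT doc_id, word, occurrences
           FROM word_counts
          WHERE word IN ($marks)
          ORDER BY occurrences DESC"
    );
    $stmt->execute(array_values($terms));
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}

saveCounts($db, 1, array('apple' => 5, 'pie' => 2));
saveCounts($db, 2, array('apple' => 1));

$rows = searchCounts($db, array('apple', 'pie'));
foreach ($rows as $row) {
    echo $row['doc_id'] . ' ' . $row['word'] . ' ' . $row['occurrences'] . "\n";
}
```

With this layout nothing has to be unserialized in PHP to answer a search, but I am not sure whether it is the right trade-off for large files.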
Finally, if I am dealing with a text file that has a lot of text, the process will take a while, so I would like to put the code in a separate PHP file and execute it from the form-processing PHP script (i.e. run it in the background). In case the commands to do so are OS-specific, I am running Windows Server 2003. Also, would the best way to keep track of the process be to create a field in the database which I flag once the processing is done? (What if an error happens? How would I relay that information?)
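Here is roughly how I picture the background part, as a sketch under several assumptions: the php executable is on the PATH, the worker script takes the document id as its argument, and there is a jobs table with status and error columns. The function names and the table are hypothetical. The "start /B" with popen/pclose trick is the idiom I have seen suggested for Windows, and I have not confirmed it behaves the same on every setup.

```php
<?php
// Build a command that starts the (hypothetical) worker script detached.
function backgroundCommand($script, $docId)
{
    $args = escapeshellarg($script) . ' ' . escapeshellarg((string) $docId);
    if (strtoupper(substr(PHP_OS, 0, 3)) === 'WIN') {
        // "start /B" runs the program without opening a new window.
        return 'start /B php ' . $args;
    }
    // Elsewhere: discard output and detach with "&".
    return 'php ' . $args . ' > /dev/null 2>&1 &';
}

// Called from the form-processing script; meant to return without waiting.
function launchInBackground($script, $docId)
{
    pclose(popen(backgroundCommand($script, $docId), 'r'));
}

// Called inside the worker: record progress and any error in the jobs table.
function runJob(PDO $db, $docId, $work)
{
    $set = $db->prepare('UPDATE jobs SET status = ?, error = ? WHERE doc_id = ?');
    $set->execute(array('running', null, $docId));
    try {
        call_user_func($work, $docId);
        $set->execute(array('done', null, $docId));
    } catch (Exception $e) {
        // Store the message so the page that polls the status can show it.
        // Fatal errors are not exceptions and would not be caught here.
        $set->execute(array('error', $e->getMessage(), $docId));
    }
}

// Demo with an in-memory SQLite database (assumption, for illustration only).
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec("CREATE TABLE jobs (
    doc_id INTEGER PRIMARY KEY,
    status TEXT NOT NULL DEFAULT 'pending',
    error  TEXT
)");
$db->exec('INSERT INTO jobs (doc_id) VALUES (7)');
$db->exec('INSERT INTO jobs (doc_id) VALUES (8)');

runJob($db, 7, function ($id) { throw new Exception('could not read upload'); });
runJob($db, 8, function ($id) { /* pretend the counting succeeded */ });

$jobs = $db->query('SELECT doc_id, status, error FROM jobs ORDER BY doc_id')
           ->fetchAll(PDO::FETCH_ASSOC);
print_r($jobs);
```

The upload script would call launchInBackground with the worker path and the new document id, and the page the user sees could poll the status column. A job stuck in 'running' for a long time would presumably mean the worker died without reporting.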
Thanks!