Extracting Occurrences

goobers

Hi,

I have a few questions regarding the following:

Let's say I have a page where I allow the user to upload a text file. I would then like to take the text file, open it, and create an array that maps each unique word to the number of occurrences of that word.

I'm not sure how to write an efficient algorithm to do that. I want to first remove 'stop words' from the text, then create an array that contains the number of occurrences of each unique word. That's the first step.

Secondly, I want to be able to store it in a database in such a way that I can do a search based on the occurrences of a word. So if I run a search for 'apple pie', I want to be able to figure out which word is more relevant (i.e. has occurred more often in the text). What I was thinking was to serialize the array, then retrieve it from the db, unserialize it, and check for those words, but wouldn't that be too inefficient? (Obviously I would first check that those words are in the serialized array.)

Finally, if I am dealing with a text file that has a lot of text, the process will take a while, so I would like to put the code in a separate PHP file and execute it from the form-processing PHP script (i.e. run it in the background). In case the commands to do so are OS-specific, I am running a Windows Server 2003 system. Also, would the best way to keep track of the process be to create a field in the database which I flag once the processing is done? (What if an error happens? How would I relay that information?)

thanks!

NogDog

<?php
/**
 * Get a word count array for a string: word => count.
 * Uses function stopWords(). 
 * @param string $text
 * @return array
 */ 
function countWords($text)
{
   $words = str_word_count($text, 1);
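   // Lowercase every word so "TEST" and "test" are counted together.
   // (create_function() is deprecated as of PHP 7.2 and removed in PHP 8.)
   $words = array_map('strtolower', $words);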
   $words = array_filter($words, 'stopWords');
   return array_count_values($words);
}
/**
 * Filter callback: returns true for words to keep (i.e. words that are NOT stop words).
 * This could easily be modified to read in a file of stop words via file().
 */ 
function stopWords($word)
{
   $stopWords = array(
      'a',
      'an',
      'the',
      'this',
      'is',
      'are',
      'it'
   );
   return (!in_array(strtolower($word), $stopWords));
}
// TEST:
$test = "This is a test. It is only a TEST.";
$wordCount = countWords($test);
echo "<pre>".print_r($wordCount,1)."</pre>";
?>

You could then do a ksort() to alphabetize the list, or arsort() to sort the counts from highest to lowest while keeping each word attached to its count, etc.
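As a quick illustration of the sorting step, here is a minimal sketch using a made-up sample array in the same word => count shape that countWords() returns. Note that rsort() would throw away the keys (the words), so arsort() is used to keep each word with its count.

```php
<?php
// Sample word => count array, shaped like the output of countWords().
// The values here are invented for the demo.
$wordCount = array('test' => 2, 'only' => 1, 'apple' => 3);

// arsort() sorts by value, highest first, and preserves the keys.
$byCount = $wordCount;
arsort($byCount);

// ksort() sorts by key, which alphabetizes the words.
$byWord = $wordCount;
ksort($byWord);

print_r($byCount);
print_r($byWord);
```

Both functions sort in place, which is why the sketch copies the array first.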

goobers

Thanks,

As for running it as a background process, or retrieving relevancy from the database, any pointers on how I would go about doing that?

NogDog

I ran my script against this text, and it only took about 3-4 seconds to run on my PC. I did have to increase the memory_limit setting from the default '256M' setting. (It ran OK at 512 in this case.)
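On the relevancy question, one possible alternative to serializing the array is to store one row per document and word, so the database can do the ranking itself. The following is only a sketch under assumptions: it uses PDO with an in-memory SQLite database so that it is self-contained, and the table name word_counts, its columns, and the functions saveWordCounts() and rankWords() are all made up for illustration. The same idea should carry over to whatever database you actually use.

```php
<?php
// Hypothetical schema: one row per (document, word) pair.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE word_counts (
    doc_id INTEGER NOT NULL,
    word TEXT NOT NULL,
    occurrences INTEGER NOT NULL,
    PRIMARY KEY (doc_id, word)
)');

/** Store a word => count array (as returned by countWords()) for one document. */
function saveWordCounts(PDO $db, $docId, array $counts)
{
    $stmt = $db->prepare(
        'INSERT INTO word_counts (doc_id, word, occurrences) VALUES (?, ?, ?)'
    );
    $db->beginTransaction();
    foreach ($counts as $word => $n) {
        $stmt->execute(array($docId, $word, $n));
    }
    $db->commit();
}

/** Return the given search words found in one document, most frequent first. */
function rankWords(PDO $db, $docId, array $searchWords)
{
    if (!$searchWords) {
        return array();
    }
    // One placeholder per search word for the IN (...) list.
    $marks = implode(',', array_fill(0, count($searchWords), '?'));
    $stmt = $db->prepare(
        "SELECT word, occurrences FROM word_counts
         WHERE doc_id = ? AND word IN ($marks)
         ORDER BY occurrences DESC"
    );
    $stmt->execute(array_merge(array($docId), $searchWords));
    return $stmt->fetchAll(PDO::FETCH_KEY_PAIR);
}

// Demo data, invented for the example.
saveWordCounts($db, 1, array('apple' => 2, 'pie' => 5, 'crust' => 1));
$ranked = rankWords($db, 1, array('apple', 'pie'));
print_r($ranked);
```

With a layout like this, a search for 'apple pie' becomes an indexed lookup on two words rather than an unserialize of the whole array.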

goobers

Yeah,

However, I will also be running this along with some other processes which take much more time (the text will be extracted from proprietary file formats, and the extraction process takes a while). That's why I wish to create a 'thread' and not do it directly.

Also, just wondering: if I were to run it against a text file, wouldn't I have to read the whole file into memory all at once for it to work with your script? If I were to go line by line, would I then have to merge the arrays and sum the occurrences from each line?
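For reference, the line-by-line approach described here could look roughly like the sketch below. The function name countWordsInFile() is made up, the stop-word filtering is left out for brevity, and the temporary file exists only to make the example self-contained.

```php
<?php
/**
 * Count words in a file one line at a time, so only the current line
 * and the running totals are held in memory.
 * countWordsInFile() is a hypothetical name, not part of the script above.
 */
function countWordsInFile($path)
{
    $totals = array();
    $handle = fopen($path, 'r');
    if ($handle === false) {
        return false;
    }
    while (($line = fgets($handle)) !== false) {
        $words = array_map('strtolower', str_word_count($line, 1));
        // Add this line's counts to the running totals.
        foreach (array_count_values($words) as $word => $n) {
            $totals[$word] = isset($totals[$word]) ? $totals[$word] + $n : $n;
        }
    }
    fclose($handle);
    return $totals;
}

// Demo with a small temporary file.
$tmp = tempnam(sys_get_temp_dir(), 'wc');
file_put_contents($tmp, "This is a test.\nIt is only a TEST.\n");
$counts = countWordsInFile($tmp);
unlink($tmp);
print_r($counts);
```

The running totals still grow with the number of distinct words, but the raw text never has to fit in memory all at once.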

NogDog

I would create a separate shell script that will launch the desired script (PHP or otherwise) in the background, then just call that wrapper script from your PHP web page.
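If you would rather launch the script straight from PHP instead of going through a wrapper, a commonly used idiom looks something like the sketch below. Treat it as an assumption-laden example rather than a tested recipe for your server: it assumes the php executable is on the PATH, worker.php and the argument 42 are made-up names, and the function names are mine. On Windows it relies on the "start /B" command; on Unix-like systems it relies on output redirection plus "&".

```php
<?php
/**
 * Build a command line that starts a PHP script without waiting for it.
 * Assumes the "php" executable is on the PATH.
 */
function backgroundCommand($script, $arg, $isWindows)
{
    $cmd = 'php ' . escapeshellarg($script) . ' ' . escapeshellarg($arg);
    if ($isWindows) {
        // "start /B" runs the command without opening a new window.
        // The empty "" is a window title, so start does not mistake a
        // quoted argument for the title.
        return 'start /B "" ' . $cmd;
    }
    // Unix-like systems: discard output and detach with "&".
    return $cmd . ' > /dev/null 2>&1 &';
}

/** Launch the script and return without waiting for it to finish. */
function launchInBackground($script, $arg)
{
    $isWindows = (strtoupper(substr(PHP_OS, 0, 3)) === 'WIN');
    $cmd = backgroundCommand($script, $arg, $isWindows);
    if ($isWindows) {
        // popen() starts the command; pclose() releases the handle.
        pclose(popen($cmd, 'r'));
    } else {
        exec($cmd);
    }
}

// Usage from the form handler (not executed in this sketch):
// launchInBackground('worker.php', '42');
echo backgroundCommand('worker.php', '42', true), "\n";
```

The background script could then update a status field in the database when it finishes or fails, which is how the web page would find out what happened.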