[RESOLVED] Excluding overused words from populating the database

Simplyput

I am using a script to build a table of words taken from the titles of my articles.
I have a list of words that I want to keep out of the database, but the script ignores my overused-words array and still inserts them into the db.

Example: for the title "I like the Transformers movie", after going through the script the db should contain only words such as "Transformers" and "movie". However, "the" is also placed in the db, even though it's in my stop words array.

This is the code where I am having my hiccup. What am I doing wrong?

/* Start parsing through the text, and build an index in the database: */

$buf = $title;

   /* Remove whitespace from beginning and end of string: */
   $buf = trim($buf);

   /* Try to remove all HTML-tags: */
   $buf = strip_tags($buf);
   $buf = preg_replace('/&#?\w+;/', '', $buf);

//function message() {
     $overusedwords = array( 'a', 'an', 'the', 'and', 'of', 'i', 'to', 'is', 'in', 'with', 'for', 'as', 'that', 'on', 'at', 'this', 'my', 'was', 'our', 'it', 'you', 'we', '1', '2', '3', '4', '5', '6', '7', '8', '9', '0', '10', 'about', 'after', 'all', 'almost', 'along', 'also', 'amp', 'another', 'any', 'are', 'area', 'around', 'available', 'back', 'be', 'because', 'been', 'being', 'best', 'better', 'big', 'bit', 'both', 'but', 'by', 'c', 'came', 'can', 'capable', 'control', 'could', 'course', 'd', 'dan', 'day', 'decided', 'did', 'didnt', 'different', 'div', 'do', 'doesn', 'don', 'down', 'drive', 'e', 'each', 'easily', 'easy', 'edition', 'end', 'enough', 'even', 'every', 'example', 'few', 'find', 'first', 'found', 'from', 'get', 'go', 'going', 'good', 'got', 'gt', 'had', 'hard', 'has', 'have', 'he', 'her', 'here', 'how', 'if', 'into', 'isn', 'just', 'know', 'last', 'left', 'li', 'like', 'little', 'll', 'long', 'look', 'lot', 'lt', 'm', 'made', 'make', 'many', 'mb', 'me', 'menu', 'might', 'mm', 'more', 'most', 'much', 'name', 'nbsp', 'need', 'new', 'no', 'not', 'now', 'number', 'off', 'old', 'one', 'only', 'or', 'original', 'other', 'out', 'over', 'part', 'place', 'point', 'pretty', 'probably', 'problem', 'put', 'quite', 'quot', 'r', 're', 'really', 'results', 'right', 's', 'same', 'saw', 'see', 'set', 'several', 'she', 'sherree', 'should', 'since', 'size', 'small', 'so', 'some', 'something', 'special', 'still', 'stuff', 'such', 'sure', 'system', 't', 'take', 'than', 'their', 'them', 'then', 'there', 'these', 'they', 'thing', 'things', 'think', 'those', 'though', 'through', 'time', 'today', 'together', 'too', 'took', 'two', 'up', 'us', 'use', 'used', 'using', 've', 'very', 'want', 'way', 'well', 'went', 'were', 'what', 'when', 'where', 'which', 'while', 'white', 'who', 'will', 'would', 'your', 'to', 'can' );
     foreach ($overusedwords as $overusedwords) {
         $buf = str_replace($overusedwords, '', $buf);
     }
     return $buf;
}


   /* Extract all words matching the regexp from the current line: */
   preg_match_all("/(\b[\w+]+\b)/",$buf,$words);



   /* Loop through all words/occurrences and insert them into the database: */
   for( $i = 0; $words[$i]; $i++ )
   {
      for( $j = 0; $words[$i][$j]; $j++ )
      {
         /* Does the current word already have a record in the word-table? */
         $cur_word = addslashes( strtolower($words[$i][$j]) );

			 /* add the following to filter unwanted words */
     //if (!in_array( $cur_word, $overusedwords)) {
     //$cur_word = eregi_replace($overusedwords, " ", $cur_word);
} 


     $result3 = $db->sql_query("SELECT word_id FROM ".$prefix."_ct_word 
                            WHERE word_word = '$cur_word'");
     $row = $db->sql_fetchrow($result3);
     if( $row['word_id'] )
     {
        /* If yes, use the old word_id: */
        $word_id = $row['word_id'];
     }
     else
     {
        /* If not, create one: */
        $db->sql_query("INSERT INTO ".$prefix."_ct_word (word_word) VALUES (\"$cur_word\")");
        $word_id = mysql_insert_id();
     }

     /* And finally, register the occurrence of the word: */
     $db->sql_query("INSERT INTO ".$prefix."_ct_occurrence (word_id,page_id) 
                  VALUES ($word_id,$page_id)");
     print "Indexing: $cur_word<br>";
  }
   //}

include ('footer.php');

laserlight

This:

foreach ($overusedwords as $overusedwords) {
    $buf = str_replace($overusedwords, '', $buf);
}

should be:

foreach ($overusedwords as $overusedword) {
    $buf = str_replace($overusedword, '', $buf);
}

Actually, you can just write it as:

$buf = str_replace($overusedwords, '', $buf);
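A minimal sketch of why the loop variable name matters, using a hypothetical three-word list and sample string that are not from the original post. Reusing the array's name as the loop variable leaves it holding a single string once the loop ends, so any later check against the list no longer sees an array.

```php
<?php
// Hypothetical, shortened stop-word list for illustration only.
$overusedwords = array('the', 'and', 'of');
$buf = 'the king of the hill';

// Reusing the array's name as the loop variable overwrites the array:
// after the loop, $overusedwords holds only the last element.
foreach ($overusedwords as $overusedwords) {
    // loop body omitted
}
$shadowed = $overusedwords;

// Passing the whole array to str_replace() needs no loop at all
// and leaves the list intact for later use.
$stopwords = array('the', 'and', 'of');
$result = str_replace($stopwords, '', $buf);

// Split the result on whitespace to see which words survived.
$remaining = preg_split('/\s+/', trim($result));
```

Here `$shadowed` ends up as a plain string, while `$stopwords` is still an array after the call.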

Simplyput

I made the suggested change and ran the script. This is what ended up in the db:


f x th o n bu ng eff v s l clo p wh pl nn y ho how b h fun u g ffa ov op w cu co t hoo lf k hn qu z in so lpful ugg c why houl ful en copy sub j do own un gh m i ow nk ff ly po

Looks like it's chopping up the words after the filter. Where did I go wrong in the code?

laserlight

Oh, but of course: str_replace does not care whether your words to be ignored stand by themselves or sit within some other interesting word. You should use preg_replace instead, with care taken to match against word boundaries. Try it yourself first 🙂

Derokorian

Basically, str_replace is removing every "the", even when it's part of a word like "tithe", which would then become "ti". A regex will let you remove "the" only when it stands on its own (for example, with spaces around it).

Another option would be to explode the title by space, then compare each word to the disallowed array using in_array(): if it's there, don't insert it; if it's not, go for it! (Just another option; I think regular expressions will be marginally faster, though.)
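The explode()/in_array() idea could look roughly like this. It is only a sketch: `filter_title()` is an illustrative helper name and the stop-word list is a hypothetical, shortened one, neither taken from the original script.

```php
<?php
// Hypothetical, shortened stop-word list; the real list would be longer.
$stopwords = array('i', 'like', 'the', 'a', 'of');

/**
 * Return the lowercased words of $title that are not in $stopwords.
 * filter_title() is an illustrative name, not part of the original script.
 */
function filter_title($title, array $stopwords)
{
    $kept = array();
    foreach (explode(' ', strtolower(trim($title))) as $word) {
        // Consecutive spaces produce empty pieces; skip them.
        if ($word === '') {
            continue;
        }
        // Whole-word comparison, so "the" never matches inside "tithe".
        if (!in_array($word, $stopwords, true)) {
            $kept[] = $word;
        }
    }
    return $kept;
}

$kept = filter_title('I like the Transformers movie', $stopwords);
```

Because each word is compared as a whole, nothing gets chopped out of longer words. Punctuation attached to a word (for example "movie,") would still need stripping first.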

NogDog

See the "\b" word boundary assertion in the preg_*() functions.
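As a rough sketch of the word-boundary approach, assuming a hypothetical shortened stop-word list and sample titles: the words are joined into one alternation wrapped in `\b`, so only whole words are removed. The `str_replace` call is included only to contrast the substring behaviour.

```php
<?php
// Hypothetical, shortened stop-word list for illustration only.
$stopwords = array('the', 'of', 'a');

// Plain substring replacement also hits "the" inside "tithe".
$naive = str_replace($stopwords, '', 'the tithe of a farmer');

// Quote each word so it is matched literally, then build one alternation.
$quoted = array();
foreach ($stopwords as $w) {
    $quoted[] = preg_quote($w, '/');
}
// \b anchors the match at word boundaries; the i flag ignores case.
$pattern = '/\b(?:' . implode('|', $quoted) . ')\b/i';

$filtered = preg_replace($pattern, '', 'The tithe of a farmer');
// Collapse the runs of spaces left behind by the removed words.
$filtered = trim(preg_replace('/\s+/', ' ', $filtered));
```

With the boundaries in place "tithe" and "farmer" survive intact, whereas the naive version damages both.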

Simplyput

After lots of reading, searching, and trawling the internet, I found some code that helped me achieve what I needed. The finished code is presented below for anyone who might need this in the future.

Thanks

/* Start parsing through the text, and build an index in the database: */

$buf = $title;

   /* Remove whitespace from beginning and end of string: */
   $buf = trim($buf);

   /* Try to remove all HTML-tags: */
   $buf = strip_tags($buf);
   $buf = preg_replace('/&#?\w+;/', '', $buf);


 //Remove common words
$order = array(" a "," about "," after "," against "," all "," almost "," also "," am "," an "," and "," another "," any "," are "," around "," as "," at "," b "," be "," because "," been "," before "," behind "," being "," both "," but "," by "," c "," came "," come "," comes "," could "," d "," did "," do "," does "," done "," e "," each "," either "," etc "," ever "," every "," example "," f "," few "," for "," from "," g "," go "," h "," had "," has "," have "," here "," how "," however "," i "," if "," ii "," iii "," in "," include "," included "," including "," into "," is "," it "," its "," iv "," ix "," j "," just "," k "," l "," m "," many "," may "," midst "," might "," my "," n "," nbsp "," neither "," never "," next "," no "," nor "," not "," now "," o "," of "," often "," on "," once "," or "," other "," others "," our "," over "," p "," put "," q "," r "," s "," same "," shall "," should "," since "," so "," some "," something "," sometimes "," soon "," such "," t "," than "," that "," the "," their "," them "," then "," there "," these "," they "," this "," those "," through "," to "," too "," toward "," u "," under "," underneath "," until "," us "," use "," used "," uses "," using "," usually "," v "," very "," vi "," vii "," viii "," w "," was "," we "," went "," were "," what "," when "," where "," whether "," which "," while "," who "," why "," with "," within "," without "," would "," x "," xi "," xii "," xiii "," xiv "," xix "," xv "," xvi "," xvii "," xviii "," xx "," y "," you "," your "," z ");
$replace = ' ';
$buf = str_replace($order, $replace,$buf);

//remove duplicate words
$buf = preg_replace("/([,.?!])/"," \\1",$buf);
$parts = explode(" ",$buf);
$unique = array_unique($parts);
$unique = implode(" ",$unique);
$buf = preg_replace("/\s([,.?!])/","\\1",$unique); 	 


 /* Extract all words matching the regexp from the current line: */
   preg_match_all("/(\b[\w+]+\b)/",$buf,$words);



   /* Loop through the matched words and insert them into the database.
      $words[0] holds the full matches; $words[1] repeats them for the
      capture group, so looping over both would index every word twice. */
   foreach( $words[0] as $word )
   {
      /* Does the current word already have a record in the word-table? */
      $cur_word = addslashes( strtolower($word) );

      $result3 = $db->sql_query("SELECT word_id FROM ".$prefix."_ct_word 
                            WHERE word_word = '$cur_word'");
      $row = $db->sql_fetchrow($result3);
      if( !empty($row['word_id']) )
      {
         /* If yes, use the old word_id: */
         $word_id = $row['word_id'];
      }
      else
      {
         /* If not, create one: */
         $db->sql_query("INSERT INTO ".$prefix."_ct_word (word_word) VALUES (\"$cur_word\")");
         $word_id = mysql_insert_id();
      }

      /* And finally, register the occurrence of the word: */
      $db->sql_query("INSERT INTO ".$prefix."_ct_occurrence (word_id,page_id) 
                  VALUES ($word_id,$page_id)");
      print "Indexing: $cur_word<br>";
   }



}
    include ('footer.php');
?>
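One caveat with a space-padded stop-word list: an entry such as " the " only matches when the word has a space on both sides and the same letter case, so a stop word at the very start or end of the title, or a capitalised one, slips through. A possible workaround, sketched here with a hypothetical shortened list and sample title, is to lowercase the title and pad it with a space on each side before filtering.

```php
<?php
// Hypothetical, shortened space-padded list for illustration only.
$order = array(' i ', ' like ', ' the ');
$title = 'The Transformers movie I like';

// Without padding or lowercasing, none of the entries match here:
// "The" starts the string, "I" is upper case, "like" ends the string.
$unpadded = str_replace($order, ' ', $title);

// Lowercase and pad both ends so every word is surrounded by spaces.
$padded = str_replace($order, ' ', ' ' . strtolower($title) . ' ');
$padded = trim($padded);
```

This keeps the str_replace approach, though a word-boundary regex avoids the padding step altogether.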