I am using a script to build a table with words from the titles of my articles.
I have a list of variables that I want to exclude being placed in the database, but it ignores my overused words arrays and still places them in the db.
Example - Title "I like the Transformers movie" after going through the script I should have in the db the words like, Transformers, movie. However, I get "the" also placed in the db even though it's in my stop words array.
This is the code where I am having my hiccup. What am I doing wrong?
/* Start parsing through the text, and build an index in the database: */
$buf = $title;
/* Remove whitespace from beginning and end of string: */
$buf = trim($buf);
/* Try to remove all HTML-tags: */
$buf = strip_tags($buf);
$buf = ereg_replace('/&\w;/', '', $buf);
//function message() {
$overusedwords = array( 'a', 'an', 'the', 'and', 'of', 'i', 'to', 'is', 'in', 'with', 'for', 'as', 'that', 'on', 'at', 'this', 'my', 'was', 'our', 'it', 'you', 'we', '1', '2', '3', '4', '5', '6', '7', '8', '9', '0', '10', 'about', 'after', 'all', 'almost', 'along', 'also', 'amp', 'another', 'any', 'are', 'area', 'around', 'available', 'back', 'be', 'because', 'been', 'being', 'best', 'better', 'big', 'bit', 'both', 'but', 'by', 'c', 'came', 'can', 'capable', 'control', 'could', 'course', 'd', 'dan', 'day', 'decided', 'did', 'didnt', 'different', 'div', 'do', 'doesn', 'don', 'down', 'drive', 'e', 'each', 'easily', 'easy', 'edition', 'end', 'enough', 'even', 'every', 'example', 'few', 'find', 'first', 'found', 'from', 'get', 'go', 'going', 'good', 'got', 'gt', 'had', 'hard', 'has', 'have', 'he', 'her', 'here', 'how', 'if', 'into', 'isn', 'just', 'know', 'last', 'left', 'li', 'like', 'little', 'll', 'long', 'look', 'lot', 'lt', 'm', 'made', 'make', 'many', 'mb', 'me', 'menu', 'might', 'mm', 'more', 'most', 'much', 'name', 'nbsp', 'need', 'new', 'no', 'not', 'now', 'number', 'off', 'old', 'one', 'only', 'or', 'original', 'other', 'out', 'over', 'part', 'place', 'point', 'pretty', 'probably', 'problem', 'put', 'quite', 'quot', 'r', 're', 'really', 'results', 'right', 's', 'same', 'saw', 'see', 'set', 'several', 'she', 'sherree', 'should', 'since', 'size', 'small', 'so', 'some', 'something', 'special', 'still', 'stuff', 'such', 'sure', 'system', 't', 'take', 'than', 'their', 'them', 'then', 'there', 'these', 'they', 'thing', 'things', 'think', 'those', 'though', 'through', 'time', 'today', 'together', 'too', 'took', 'two', 'up', 'us', 'use', 'used', 'using', 've', 'very', 'want', 'way', 'well', 'went', 'were', 'what', 'when', 'where', 'which', 'while', 'white', 'who', 'will', 'would', 'your', 'to', 'can' );
foreach ($overusedwords as $overusedwords) {
$buf = str_replace($overusedwords, '', $buf);
}
return $buf;
}
/* Extract all words matching the regexp from the current line: */
preg_match_all("/(\b[\w+]+\b)/",$buf,$words);
/* Loop through all words/occurrences and insert them into the database: */
for( $i = 0; $words[$i]; $i++ )
{
for( $j = 0; $words[$i][$j]; $j++ )
{
/* Does the current word already have a record in the word-table? */
$cur_word = addslashes( strtolower($words[$i][$j]) );
/* add the following to filter unwanted words */
//if (!in_array( $cur_word, $overusedwords)) {
//$cur_word = eregi_replace($overusedwords, " ", $cur_word);
}
$result3 = $db->sql_query("SELECT word_id FROM ".$prefix."_ct_word
WHERE word_word = '$cur_word'");
$row = $db->sql_fetchrow($result3);
if( $row['word_id'] )
{
/* If yes, use the old word_id: */
$word_id = $row['word_id'];
}
else
{
/* If not, create one: */
$db->sql_query("INSERT INTO ".$prefix."_ct_word (word_word) VALUES (\"$cur_word\")");
$word_id = mysql_insert_id();
}
/* And finally, register the occurrence of the word: */
$db->sql_query("INSERT INTO ".$prefix."_ct_occurrence (word_id,page_id)
VALUES ($word_id,$page_id)");
print "Indexing: $cur_word<br>";
}
//}
include ('footer.php');