oh well, I felt like doing a simple hack before I sleep, so...
/* Compare 2 strings for similar words
 * Only the first 125 characters of each string are compared
 * Words are blocks of word characters separated by whitespace
 * Word characters are defined according to the current locale
 * Comparison is case-sensitive
 * If at least 4 words match, returns true
 * Else returns false
 */
function word_check($str1, $str2) {
    //take the first 125 characters
    $str1 = substr($str1, 0, 125);
    $str2 = substr($str2, 0, 125);
    //break $str1 into its constituent words
    //first eliminate non-word characters (keep all whitespace, so that
    //words separated by a newline are not glued together)
    $str1 = preg_replace("/[^\\w\\s]/", "", $str1);
    //reduce consecutive whitespace into a single space
    $str1 = preg_replace("/\\s+/", " ", $str1);
    //now split it (trim first so leading/trailing spaces don't yield empty words)
    $words1 = explode(" ", trim($str1));
    //break $str2 into its constituent words
    //first eliminate non-word characters
    $str2 = preg_replace("/[^\\w\\s]/", "", $str2);
    //reduce consecutive whitespace into a single space
    $str2 = preg_replace("/\\s+/", " ", $str2);
    //now split it
    $words2 = explode(" ", trim($str2));
    //now we compare arrays
    $count = 0;
    foreach ($words1 as $word) {
        if (in_array($word, $words2)) {
            $count++;
        }
    }
    //return the result
    return $count >= 4;
}
which may or may not do what you want, and isn't the most efficient approach: in_array() rescans $words2 for every word in $str1, although with only 125 characters per string that rarely matters.
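If the repeated in_array() scans ever become a concern, here is a rough sketch of a variant that indexes the words of the second string as array keys, so each lookup is a hash probe instead of a linear search. The names word_check_fast and split_words and the $min_matches parameter are made up for illustration. It also assumes two things that may not be what you want: any whitespace separates words, and a word repeated in $str1 counts only once.

```php
<?php
// Hypothetical helper: cut to 125 characters, strip non-word characters,
// and split on runs of whitespace, dropping empty pieces.
function split_words($str) {
    $str = substr($str, 0, 125);
    $str = preg_replace("/[^\\w\\s]/", "", $str);
    return preg_split("/\\s+/", $str, -1, PREG_SPLIT_NO_EMPTY);
}

// Hypothetical variant of word_check(): same idea, but with a keyed lookup
// table and each distinct word of $str1 counted at most once.
function word_check_fast($str1, $str2, $min_matches = 4) {
    //index the words of $str2 by key
    $lookup = array();
    foreach (split_words($str2) as $word) {
        $lookup[$word] = true;
    }
    //count the distinct words of $str1 that also occur in $str2
    $count = 0;
    foreach (array_unique(split_words($str1)) as $word) {
        if (isset($lookup[$word])) {
            $count++;
        }
    }
    return $count >= $min_matches;
}
```

Something like word_check_fast("the quick brown fox jumps", "a quick brown fox jumps high") should come out true, since quick, brown, fox and jumps appear in both strings.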
EDIT:
but only 6 words match in the first example?