It's hard to say. I used the function you had first posted:
function count_words($string) {
    $word_count = 0;
    $string = preg_replace('/(\s)+/', '$1', $string);
    $string = explode(' ', $string);
    while (list(, $word) = each ($string)) {
        if (eregi('[0-9A-Za-zÀ-ÖØ-öø-ÿ]', $word)) {
            $word_count++;
        }
    }
    return($word_count);
}
and got an accurate count in the sample text you gave, though you said in other instances it starts to fall short.
The problem with my function is that the regex's \b treats the ’ in don’t as a non-word character, so it sees a word boundary on each side of it. To fix this, we can simply strip any punctuation that sits between two letters (that is, inside a word, like an apostrophe). Here's the result:
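function wordcount($string) {
    // Strip punctuation that sits between two word characters (don't -> dont)
    $string = preg_replace('/(\w)[[:punct:]]+(\w)/', '$1$2', $string);
    // Match each run of non-whitespace that starts and ends on a word boundary
    preg_match_all('/\b[^\s]+\b/U', $string, $matches);
    return count($matches[0]);
}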
The result gives us 16, the accurate number.
Try running some of your short stories through this modified version of the function and let me know how far off it is.
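One edge case I'd watch for: the replace pattern consumes the letter on each side, so a word with single letters between punctuation (rock-n-roll) only loses its first hyphen. Here's a rough sketch of a variant that uses lookarounds instead. The name wordcount_lookaround is made up, and I'm assuming the text is valid UTF-8, which is why it uses the u modifier and \p{P} (neither is in the function above), so treat it as something to try rather than a drop-in fix:

```php
<?php
// Hypothetical variant of wordcount(); assumes $string is valid UTF-8.
function wordcount_lookaround(string $string): int {
    // Remove punctuation that sits directly between two word characters.
    // Lookarounds do not consume the neighbouring letters, so every
    // hyphen in "rock-n-roll" is removed, not just the first.
    $cleaned = preg_replace('/(?<=\w)\p{P}+(?=\w)/u', '', $string) ?? $string;
    // Count runs of non-whitespace that start and end on a word boundary.
    return (int) preg_match_all('/\b[^\s]+\b/uU', $cleaned);
}

echo wordcount_lookaround("I don’t like rock-n-roll"), "\n"; // 4
```

If your stories aren't stored as UTF-8, the u modifier will make the match fail, so check the encoding first.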
EDIT: In case you want to try to fix it further on your own, I'll tell you what I did.
When you said that my original function was giving an inaccurate word count, the first thing I did was make it spit out the different "words" it thought it had found, like so:
function wordcount($string) {
    preg_match_all('/\b[^\s]+\b/U', $string, $matches);
    print_r($matches[0]); // this shows us all "words" it matched
    return count($matches[0]);
}
I then noticed that "don", "’", and "t" were in three separate array elements, meaning it thought it had found three different words there. That's how I diagnosed the problem and decided to strip punctuation inside words. Once you run some of your short stories through, we can do the same thing and see what other situations come up that throw off the count.
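Dumping every match for a whole short story gets unwieldy, so here's a small sketch that lists only the matches containing no letter or digit, since those are usually the sign of a word that got split. The helper name suspicious_tokens is my own invention, and the letter check is ASCII-only, which is an assumption you may need to widen:

```php
<?php
// Hypothetical helper: return the matched "words" that contain no
// ASCII letter or digit, such as a stray hyphen or apostrophe.
function suspicious_tokens(string $string): array {
    preg_match_all('/\b[^\s]+\b/U', $string, $matches);
    $odd = array_filter($matches[0], function ($token) {
        return !preg_match('/[0-9A-Za-z]/', $token);
    });
    // Re-index so the result is a plain list.
    return array_values($odd);
}

print_r(suspicious_tokens("well-known fact")); // the only odd token is "-"
```

Anything this prints is worth looking up in the original text to see which construct caused it.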
The nature of this process means that we'll never know exactly how accurate it is. It isn't feasible to test every construct in the English language in combination with every other and adapt the function to each one; all we can do is fix it to the best of our ability and lower the margin of error as much as possible.
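One way to keep track of that margin of error is to hand-count a few passages and compare. This is only a sketch: count_mismatches is a made-up helper, the sample strings are invented, and wordcount() is repeated from above so the snippet runs on its own:

```php
<?php
// The modified wordcount() from this post, repeated so the sketch is self-contained.
function wordcount($string) {
    $string = preg_replace('/(\w)[[:punct:]]+(\w)/', '$1$2', $string);
    preg_match_all('/\b[^\s]+\b/U', $string, $matches);
    return count($matches[0]);
}

// Hypothetical helper: $cases maps a text to its hand-counted number of words.
// Returns only the texts where wordcount() disagrees with the hand count.
function count_mismatches(array $cases): array {
    $mismatches = [];
    foreach ($cases as $text => $expected) {
        $actual = wordcount($text);
        if ($actual !== $expected) {
            $mismatches[$text] = ['expected' => $expected, 'actual' => $actual];
        }
    }
    return $mismatches;
}

// Invented samples, counted by hand.
print_r(count_mismatches([
    "I don't know." => 3,
    "24-7-365 support" => 2,
]));
```

Each time a story turns up a new problem, add the offending sentence to the list so later changes don't quietly break it again.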