Would something like this help?
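if (!empty($_POST['submitted'])) {
    $filename = $_POST['url'];
    // Fetch the page once; file_get_contents() returns false on failure
    $raw = file_get_contents($filename);
    if ($raw !== false) {
        $data = html_entity_decode(strip_tags($raw));
        // Split on non-word characters and digits, so only letters (and underscores) remain
        $dataSplit = preg_split('#[\W\d-]#i', $data, -1, PREG_SPLIT_NO_EMPTY);
        $word = array_count_values(array_map('strtolower', $dataSplit));
        foreach ($word as $key => $val) {
            // Drop single characters other than 'a' and 'i'
            if (strlen($key) < 2 && $key != 'a' && $key != 'i') {
                unset($word[$key]);
            }
        }
        // Escape the output, since apostrophes or entities may have been decoded
        echo "<pre>" . htmlspecialchars(print_r($word, true)) . "</pre>";
    } else {
        exit("Unable to open file!");
    }
}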
Note that once a valid URL is passed, the tags are stripped, html_entity_decode is applied, and everything is split via preg_split, there may be tokens left over that are not really words. For example, if somewhere within the site in question there is text that contains 'www.amazon.com', this will end up as:
[www]
[amazon]
[com]
So in this context, what defines a word as an actual word is not so cut and dried...
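To see that splitting in isolation, here is a small sketch. The sample string is made up and simply stands in for the stripped page text; the pattern is the same one used in the snippet.

```php
<?php
// Hypothetical sample text standing in for the stripped page content.
$data = 'Visit www.amazon.com today';

// Same pattern as in the snippet: split on non-word characters and digits.
$parts = preg_split('#[\W\d-]#i', $data, -1, PREG_SPLIT_NO_EMPTY);

// Lowercase each token, then count occurrences.
$counts = array_count_values(array_map('strtolower', $parts));

// The URL shows up as three separate entries: www, amazon and com.
print_r($counts);
```

Whether you treat those pieces as words, or filter them out with a stop list of your own, depends on what you want the counts for.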
You'll also notice that I check whether single characters are something other than 'a' or 'i' and, if so, unset them. Depending on the circumstances, I have found some odd single-character entries like [d] or [x], so this measure should help. You can add specific allowable single-character words, or even delete all single characters outright if you don't care about such words.
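If you want that list of allowed single characters to be adjustable, something along these lines might do. The function name filterSingleChars and the $allowed parameter are my own inventions for illustration, not part of the snippet above.

```php
<?php
// Hypothetical helper: remove single-character keys unless they are in $allowed.
function filterSingleChars(array $counts, array $allowed = ['a', 'i']): array
{
    foreach ($counts as $key => $val) {
        $key = (string) $key;
        if (strlen($key) < 2 && !in_array($key, $allowed, true)) {
            unset($counts[$key]);
        }
    }
    return $counts;
}

// Made-up word counts for demonstration.
$counts = ['a' => 3, 'd' => 1, 'x' => 2, 'cat' => 4, 'i' => 1];

$kept      = filterSingleChars($counts);      // keeps 'a' and 'i'
$noSingles = filterSingleChars($counts, []);  // drops every single character

print_r($kept);
print_r($noSingles);
```

Passing an empty array as the second argument is the "delete all single characters" case.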
I provided the meat and potatoes (a version of many solutions I'm sure). I'll leave you to provide the gravy.
EDIT - When I tested this further on other web pages, words like "isn't" were broken into isn and t. So you could use this preg_split pattern, which doesn't split on apostrophes, instead of the one in the snippet above:
$dataSplit = preg_split('#[^a-z\']#i', $data, -1, PREG_SPLIT_NO_EMPTY);
Any additional characters that you want protected from the split can also be added into that character list.
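As a rough example of protecting one more character, here is the same idea with a hyphen added to the list. The sample sentence is made up, and the trim step is an extra of mine to tidy quotes or dashes left on the edges of a token; treat it as a sketch rather than the definitive way to do it.

```php
<?php
// Hypothetical sample text.
$data = "It isn't a well-known 'fact'";

// Letters, apostrophes and hyphens survive; everything else is a split point.
// The hyphen goes last in the character class so it is taken literally.
$parts = preg_split('#[^a-z\'-]#i', $data, -1, PREG_SPLIT_NO_EMPTY);

// Lowercase each token and strip apostrophes/hyphens from its edges,
// then discard anything that ended up empty.
$words = array_values(array_filter(
    array_map(function ($w) {
        return strtolower(trim($w, "'-"));
    }, $parts),
    'strlen'
));

print_r($words);
```

Without the trim, the quoted word would be counted as 'fact' with its quotes still attached, separately from any plain fact elsewhere on the page.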