You could try something like:
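<?php
$filename = '/path/to/file.html';
$words = array();  // word => number of times it was seen

$file = fopen( $filename, 'r' );
if( $file === false ) {
    die( "Could not open $filename\n" );
}
// fgets() returns false at end of file, which ends the loop
while(( $line = fgets( $file )) !== false ) {
    // split() was removed in PHP 7; preg_split() is its replacement.
    // \s also catches the trailing newline, and PREG_SPLIT_NO_EMPTY
    // drops the empty pieces left between adjacent delimiters.
    $split_line = preg_split( '/[.!?,;\s]+/', $line, -1, PREG_SPLIT_NO_EMPTY );
    foreach( $split_line as $word ) {
        $word = strtolower( $word );
        if( ! isset( $words[$word] ))
            $words[$word] = 1;
        else
            $words[$word]++;
    }
}
fclose( $file );
?>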
This opens the file named in $filename and reads it line by line until the end. Each line is split on periods, exclamation marks, question marks, commas, semicolons, or spaces.
Then an associative array, $words, is populated with each lowercased word as the key and the number of times it has been seen as the value.
For example, if file.html contains:
This is an example for this script.
you would end up with:
$words['this'] = 2
$words['is'] = 1
$words['an'] = 1
$words['example'] = 1
$words['for'] = 1
$words['script'] = 1
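As a side note, PHP's built-in functions can do the same counting in a few lines. This is only a sketch, not a drop-in replacement: count_words() is a name made up for this example, and it works on a string, so the whole file would have to fit in memory. strip_tags() is there so that HTML tag names are less likely to be counted as words. Note that str_word_count() only treats letters, apostrophes, and hyphens as parts of a word, so numbers are skipped.

```php
<?php
// Hypothetical helper (not part of the script above): counts words in a string.
function count_words( $text ) {
    // Remove HTML tags first, then lowercase so "This" and "this" match.
    $text = strtolower( strip_tags( $text ));
    // Format 1 makes str_word_count() return an array of the words found.
    $found = str_word_count( $text, 1 );
    // array_count_values() builds the word => count array.
    return array_count_values( $found );
}

$words = count_words( 'This is an example for this script.' );
print_r( $words );
```

To run it against a file, you could pass in file_get_contents( $filename ) instead of the literal string.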
So, to loop through all the distinct words found in alphabetical order, you could use:
<?php
ksort( $words );
foreach( $words as $word => $count ) {
    print $word . "\n";
}
?>
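If the counts matter more than the alphabetical order, arsort() sorts the array by value, highest first, and keeps each key attached to its value. A small sketch, with a hard-coded $words array standing in for the one built by the counting loop:

```php
<?php
// Sample data for illustration only; in practice $words comes from the counting loop.
$words = array( 'is' => 1, 'this' => 2, 'an' => 1 );

// Sort by count, highest first, keeping each word attached to its count.
arsort( $words );

foreach( $words as $word => $count ) {
    print $word . ': ' . $count . "\n";
}
```

The most frequent word comes out first; the order among words with the same count is not something I would rely on.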
Hope this helps!
-Rich