web crawler and word count

crazyconnie

I need to use a web crawler script to fetch a URL, then count the frequency of each word on that page and display the counts from the same script. This is what I have so far.

<form action="<?php echo htmlspecialchars($_SERVER['PHP_SELF']); ?>" method=post>
URL: <input name=url size=30 value="<?php echo htmlspecialchars($_POST['url'] ?? ''); ?>">
<input type=submit value=submit> 
<input type=reset value=reset> 
<input type=hidden name=submitted value=true> 
</form> 
</center> 
<hr> 
<?php 
if ($_POST['submitted']){ 	
	$filename=$_POST['url'];  	
	$file = fopen($filename, "r") or exit("Unable to open file!"); 	
	//Output a line of the file until the end is reached 	
	while(!feof($file)) 		
		{  		
		echo fgets($file); 		
		} 	
	fclose($file); 	
	} 

if ($_POST['submitted']){ 	
	$wordarray=explode(" ", $_POST['text']);  	
	foreach ($wordarray as $value){ 		
		$freq[$value] = ($freq[$value] ?? 0) + 1;
		} 	
	arsort($freq);  	
	echo "<table border=1><tr><th>word</th><th>freq.</th></tr>";  	
	foreach ($freq as $key=>$value){ 		
		if (strlen($key)>0){ 			
			echo "<tr><td>".$key."</td><td>".$value."</td></tr>";  			
			} 		
		} 	
	echo "</table>"; 	 	
	} 
?>

I know I need the first IF statement, but was going to get rid of everything after the first line and insert the word count function. Does this make sense to anyone?

nrg_alpha

Would this be an example of what you are looking for?

$data = "This is simply a test to break down a sentence into words, list them and also include how many actual repeat words there are!
Are there any repeat words here?";

$arr = preg_split('#[ ,.!?\r\n]#', $data, -1, PREG_SPLIT_NO_EMPTY);
echo "<pre>".print_r(array_count_values($arr), true);

Output:

Array
(
    [This] => 1
    [is] => 1
    [simply] => 1
    [a] => 2
    [test] => 1
    [to] => 1
    [break] => 1
    [down] => 1
    [sentence] => 1
    [into] => 1
    [words] => 3
    [list] => 1
    [them] => 1
    [and] => 1
    [also] => 1
    [include] => 1
    [how] => 1
    [many] => 1
    [actual] => 1
    [repeat] => 2
    [there] => 2
    [are] => 1
    [Are] => 1
    [any] => 1
    [here] => 1
)

EDIT - Note that 'are' and 'Are' are counted as two separate words. You could always apply strtolower() to the POST value you are checking first to merge duplicates like that.

EDIT 2 - Alternatively, I suppose you can combine the two (strtolower and array_count_values) via array_map() as such:

echo "<pre>".print_r(array_count_values(array_map('strtolower', $arr)), true);

crazyconnie

That is what the output is supposed to look like. But in the form that is included in the script, you must enter a URL and hit submit. Then, instead of displaying the page at that URL, it should display the word count for the site.

nrg_alpha

Would something like this help?

if (!empty($_POST['submitted'])) {
	$filename = $_POST['url'];
	// Fetch the page once; file_get_contents() returns false on failure
	$html = @file_get_contents($filename);
	if ($html !== false) {
		$data = html_entity_decode(strip_tags($html));
		$dataSplit = preg_split('#[\W\d-]#i', $data, -1, PREG_SPLIT_NO_EMPTY);
		$word = array_count_values(array_map('strtolower', $dataSplit));
		foreach ($word as $key => $val) {
			// Drop stray single characters, but keep the real words 'a' and 'i'
			if (strlen($key) < 2 && $key != 'a' && $key != 'i') {
				unset($word[$key]);
			}
		}
		echo "<pre>".print_r($word, true);
	} else {
		exit("Unable to open file!");
	}
}

Note that even when a valid URL is passed, and after the tags are stripped, html_entity_decode is applied and everything is split via preg_split, some of the leftover entries may not be real words. For example, if the page contains the text 'www.amazon.com' somewhere, it will end up as:

[www]
[amazon]
[com]

So in this context, what defines a word as an actual word is not so cut and dried...
You'll also notice that I check each single-character entry and unset it unless it is an 'a' or an 'i'. Depending on the circumstances, I have found some odd single-character entries like [d] or [x], so this measure should help. You can add specific allowable single-character words, or even delete all single characters outright if you don't care about such words...
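As a rough sketch of the allow-list idea, the single-character check could also be written with array_filter(). The sample counts and the $allowed list here are made up for illustration; they are not from a real page.

```php
<?php
// Hypothetical counts standing in for the $word array from the snippet above.
$word = array('a' => 4, 'i' => 2, 'd' => 1, 'x' => 3, 'amazon' => 1);

// Assumed allow-list of single-character words worth keeping.
$allowed = array('a', 'i');

// Keep every multi-character entry, plus any single character on the allow-list.
$word = array_filter(
    $word,
    function ($key) use ($allowed) {
        $key = (string) $key;
        return strlen($key) > 1 || in_array($key, $allowed, true);
    },
    ARRAY_FILTER_USE_KEY
);

print_r($word);
```

With an empty $allowed array, this would drop every single-character entry.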

I provided the meat and potatoes (a version of many solutions I'm sure). I'll leave you to provide the gravy.

EDIT - When I tested this further on other web pages, words like "isn't" were broken into isn and t. So you could use this preg_split pattern, which doesn't split on apostrophes, instead of the one I included in the snippet above:

$dataSplit = preg_split('#[^a-z\']#i', $data, -1, PREG_SPLIT_NO_EMPTY);

Any additional characters that you want protected from the split can also be added into that character list.
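For instance, here is a minimal sketch that also protects hyphens, so "well-known" stays in one piece. The sample sentence is made up, and putting the hyphen last in the character class is just one way to keep it literal.

```php
<?php
// Hypothetical sample text standing in for the stripped page content.
$data = "It isn't a well-known site, isn't it?";

// Split on anything that is not a letter, an apostrophe or a hyphen.
$dataSplit = preg_split('#[^a-z\'-]#i', $data, -1, PREG_SPLIT_NO_EMPTY);
$word = array_count_values(array_map('strtolower', $dataSplit));

print_r($word);
```

One trade-off to keep in mind: a hyphen or apostrophe standing on its own (as in "this - that") would then show up as its own entry, so it may need to be filtered out afterwards.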

crazyconnie

That works very nicely. Now I have to try to get it to show up in the table. I haven't been able to do that yet.
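
One possible way to get the counts into a table is to loop over the $word array and reuse the table markup from the first post. This is only a sketch: renderWordTable() is a hypothetical helper name, and the sample counts are made up.

```php
<?php
// Hypothetical helper (not from the thread): turns a word => count array
// into an HTML table, most frequent words first.
function renderWordTable(array $word): string
{
    arsort($word); // sort by count, descending, keeping the word keys
    $html = "<table border=1><tr><th>word</th><th>freq.</th></tr>";
    foreach ($word as $key => $value) {
        // Escape the word because it came from an external page
        $html .= "<tr><td>".htmlspecialchars((string) $key)."</td><td>".$value."</td></tr>";
    }
    return $html."</table>";
}

// Example counts standing in for the $word array built by the script
$table = renderWordTable(array('repeat' => 2, 'words' => 3, 'here' => 1));
echo $table;
```

In the script itself, the echo "<pre>".print_r($word, true); line would be swapped for echo renderWordTable($word);.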