I need to use a web crawler script to fetch a URL, then count the frequency of each word on that page and display the counts with the script. This is what I have so far.

<form action="<?php echo htmlspecialchars($_SERVER['PHP_SELF']); ?>" method="post"> 
URL: <input name="url" size="30" value="<?php echo isset($_POST['url']) ? htmlspecialchars($_POST['url']) : ''; ?>"> 
<input type=submit value=submit> 
<input type=reset value=reset> 
<input type=hidden name=submitted value=true> 
</form> 
</center> 
<hr> 
<?php 
if ($_POST['submitted']){ 	
	$filename=$_POST['url'];  	
	$file = fopen($filename, "r") or exit("Unable to open file!"); 	
	//Output a line of the file until the end is reached 	
	while(!feof($file)) 		
		{  		
		echo fgets($file); 		
		} 	
	fclose($file); 	
	} 

if ($_POST['submitted']){ 	
	$wordarray=explode(" ", $_POST['text']);  	
	foreach ($wordarray as $value){ 		
		$freq{$value}++;  		
		} 	
	arsort($freq);  	
	echo "<table border=1><tr><th>word</th><th>freq.</th></tr>";  	
	foreach ($freq as $key=>$value){ 		
		if (strlen($key)>0){ 			
			echo "<tr><td>".$key."</td><td>".$value."</td></tr>";  			
			} 		
		} 	
	echo "</table>"; 	 	
	} 
?> 

I know I need the first IF statement, but was going to get rid of everything after the first line and insert the word count function. Does this make sense to anyone?

    Would this be an example of what you are looking for?

    $data = "This is simply a test to break down a sentence into words, list them and also include how many actual repeat words there are!
    Are there any repeat words here?";
    
    $arr = preg_split('#[ ,.!?\r\n]#', $data, -1, PREG_SPLIT_NO_EMPTY);
    echo "<pre>".print_r(array_count_values($arr), true);
    

    Output:

    Array
    (
        [This] => 1
        [is] => 1
        [simply] => 1
        [a] => 2
        [test] => 1
        [to] => 1
        [break] => 1
        [down] => 1
        [sentence] => 1
        [into] => 1
        [words] => 3
        [list] => 1
        [them] => 1
        [and] => 1
        [also] => 1
        [include] => 1
        [how] => 1
        [many] => 1
        [actual] => 1
        [repeat] => 2
        [there] => 2
        [are] => 1
        [Are] => 1
        [any] => 1
        [here] => 1
    )
    

    EDIT - Note that 'are' and 'Are' are counted as two separate words. You could always apply strtolower() to the POST value you are checking first to reduce redundancies like that.

    EDIT 2 - Alternatively, I suppose you can combine the two (strtolower and array_count_values) via array_map(), as such:

    echo "<pre>".print_r(array_count_values(array_map('strtolower', $arr)), true);
    

      That is what the output is supposed to look like. But in the form that is included in the script, you must enter a URL and hit submit. Then, instead of displaying the URL, it displays the word count for the site.

        Would something like this help?

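        if (!empty($_POST['submitted'])){
        	$filename = $_POST['url'];
        	// Fetch the page once; file_get_contents() returns false on failure
        	$html = @file_get_contents($filename);
        	if( $html !== false ){
        		// Strip the markup, then turn entities such as &amp; back into characters
        		$data = html_entity_decode(strip_tags($html));
        		// Split on anything that is not a letter or underscore
        		$dataSplit = preg_split('#[\W\d-]#i', $data, -1, PREG_SPLIT_NO_EMPTY);
        		$word = array_count_values(array_map('strtolower', $dataSplit));
        		foreach($word as $key=>$val){
        			// Drop single characters other than the words 'a' and 'i'
        			if(strlen($key) < 2  && $key !='a' && $key != 'i'){
        				unset($word[$key]);
        			}
        		}
        		echo "<pre>".print_r($word, true);
        	} else {
        		exit("Unable to open file!");
        	}
        }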
        

        Note that when a valid URL is passed, then after the tags are stripped, html_entity_decode is applied, and everything is split via preg_split, there may be leftover entries that are not really words. For example, if somewhere within the site in question there is text that contains 'www.amazon.com', this will end up as:

        [www]
        [amazon]
        [com]

        So in this context, what defines a word as an actual word is not so cut and dried...
        You'll also notice that I check whether each single character is an 'a' or an 'i', and unset it if it is not. Depending on the circumstances, I have found some odd single-character entries like [d] or [x], so this measure should help. You can add specific allowable single-character words, or even delete all single characters outright if you don't care about such words.
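
        If the list of allowed single-character words grows, an allow list may be easier to maintain than chained comparisons. This is only a sketch: the sample counts and the $allowedSingles name are made up for illustration and stand in for the $word array built above.

        ```php
        <?php
        // Hypothetical sample counts, standing in for the $word array built above.
        $word = array('a' => 4, 'i' => 2, 'd' => 1, 'x' => 3, 'test' => 5);

        // Assumed allow list of single-character words worth keeping.
        $allowedSingles = array('a', 'i');

        foreach ($word as $key => $val) {
            // Cast to string, since PHP turns numeric-looking array keys into integers.
            $key = (string) $key;
            if (strlen($key) < 2 && !in_array($key, $allowedSingles, true)) {
                unset($word[$key]);
            }
        }

        print_r($word);
        ```

        Emptying $allowedSingles would drop every single-character entry.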

        I provided the meat and potatoes (a version of many solutions I'm sure). I'll leave you to provide the gravy.

        EDIT - When I tested this further on other web pages, words like "isn't" were broken into isn and t. So you could use this preg_split pattern, which doesn't split on apostrophes, instead of the one I included above with the snippet:

        $dataSplit = preg_split('#[^a-z\']#i', $data, -1, PREG_SPLIT_NO_EMPTY);
        

        Any additional characters that you want protected from the split can also be added into that character list.
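
        For instance, to keep hyphenated words together as well, the hyphen could be added to the class. A small sketch, with a made-up sample string in place of the fetched page text:

        ```php
        <?php
        // Hypothetical sample text; in the snippet above $data comes from the fetched page.
        $data = "It isn't a well-known e-mail trick, is it?";

        // Same pattern as above, with a hyphen added to the protected characters.
        // The hyphen goes last in the class so it is read literally, not as a range.
        $dataSplit = preg_split('#[^a-z\'-]#i', $data, -1, PREG_SPLIT_NO_EMPTY);

        print_r($dataSplit);
        ```

        Here "isn't", "well-known" and "e-mail" each survive as a single entry, while the comma and question mark still act as separators.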

          That works very nicely. Now I have to try to get it to show up in the table. I haven't been able to do that yet.
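
          A minimal sketch of the table step, assuming the counts are already in $word as in the snippet above. The sample counts here are made up; the table markup and arsort() are borrowed from the original script.

          ```php
          <?php
          // Hypothetical sample counts, standing in for the $word array from the snippet above.
          $word = array('test' => 1, 'a' => 2, 'words' => 3, 'repeat' => 2);

          // Sort by frequency, highest first, keeping the word keys.
          arsort($word);

          $rows = '';
          foreach ($word as $key => $val) {
              // Escape the word so stray characters from the page cannot break the table.
              $rows .= "<tr><td>" . htmlspecialchars((string) $key) . "</td><td>" . $val . "</td></tr>\n";
          }

          echo "<table border=1><tr><th>word</th><th>freq.</th></tr>\n" . $rows . "</table>";
          ```

          In the full script this would take the place of the echo "<pre>".print_r($word, true); line.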
