I have written a script to compare files.
There are 1000 files, so comparing each one with every other
is approx. 1 million combinations. 😃

However, it only compares two at a time, and it should
be overwriting the variables that hold the files on each iteration
of the for loop ... so I don't see why it is using up the memory! 😕

This is the script:

$path = "/home/compare/projects/$Db_directory/";
$art_path = $path.'spin/';

$out_file = $path.'page_check.txt';
$out_file1 = $path.'page_check80.txt';
$out_file2 = $path.'page_check70.txt';
$out_file3 = $path.'page_check60.txt';

$fp = fopen($out_file, "wb");
$fp1 = fopen($out_file1, "wb");
$fp2 = fopen($out_file2, "wb");
$fp3 = fopen($out_file3, "wb");


if ($fp === FALSE) {
    echo "Problem opening file: $out_file<br>";
    exit;
}

$output = 'Starting Comparison Run - '.$logstamp."\n"; // double quotes so \n is a real newline
fwrite($fp, $output, strlen($output));

$art_cnt = 0; 
if ($handle = opendir($art_path)) {
    while (($file = readdir($handle)) !== false) {
        if (!in_array($file, array('.', '..')) && !is_dir($art_path.$file)) {
            $art_cnt++;
        }
    }
    closedir($handle); // release the directory handle once counting is done
}

$remove_these = array(',', '.', '!', '?', ':', ';');

// 'XZ', '[hd2]', '[hd3]', '[hd4]', '[hd5]', '[hd6]',
// '[b]', '[z]', '[u]', '[li]', '[l]', '[r]', '[str]', '[c]', '[em]',  '[/b]', '[/z]', '[/u]', '[/li]', '[/l]', '[/r]', '[/str]', '[/c]', '[/em]',
// '[list=dc w=400]', '[list=dc w=500]', '[/list]');
$art1 = 0;
$art2 = 0;
$data = array();

// Select article to compare with all others
for ($art1 = 0; $art1 <= $art_cnt; $art1++) {

    // For files that are named: 1.txt, 2.txt etc.

    $filename1 = $art1.$Db_file_end;
    $art_path1 = $art_path.$filename1;

    // Now select the second article for the comparison

    for ($art2 = 0; $art2 <= $art_cnt; $art2++) {

        $filename2 = $art2.$Db_file_end;
        $art_path2 = $art_path.$filename2;

        // Ensure we are comparing different articles
        if ($art_path1 != $art_path2) {

            if (file_exists($art_path1)) {
                // echo "Filename1: $path1<br>";
                if (file_exists($art_path2)) {
                    // echo "Filename2: $path2<br>";
                    $article1 = file_get_contents($art_path1);
                    $article1 = strtolower($article1);
                    $article1 = str_replace($remove_these, '', $article1);
                    $article1 = str_replace('  ', ' ', $article1);
                    $words1 = explode(' ', $article1);

                    // LINE 127 (the line reported in the fatal error)
                    $article2 = file_get_contents($art_path2);
                    $article2 = strtolower($article2);
                    $article2 = str_replace($remove_these, '', $article2);

                    echo "<br>First Article: $art_path1 <br>$article1<br><br>";
                    echo "<br><br>Second Article: $art_path2 <br>$article2<br><br>";

                    $found_match = 0;
                    $word_count = count($words1) - $Db_words;
                    $found_match = 1;

                    [DO THE COMPARE ]

                    $output = "$art1 vs $art2 = $compare\r\n";
                    fwrite($fp, $output, strlen($output));
                    echo "$output<br>";
                    if ($unique < 80) {
                        fwrite($fp1, $output, strlen($output));
                    }
                    if ($unique < 70) {
                        fwrite($fp2, $output, strlen($output));
                    }
                    if ($unique < 60) {
                        fwrite($fp3, $output, strlen($output));
                    }

                } // end if - second file exists
            } // end if - first file exists
        } // end different articles
    } // end for - to select second article
} // end for - to select first article

$logstamp1 = date('H:i:s l, j F Y');
$output = "Finished $logstamp1";
fwrite($fp, $output, strlen($output));
fwrite($fp1, $output, strlen($output));
fwrite($fp2, $output, strlen($output));
fwrite($fp3, $output, strlen($output));

fclose($fp);
fclose($fp1);
fclose($fp2);
fclose($fp3);

The last entry written in the log file ( $fp ) is:
252 vs 236 = 85\r\n252 vs 237 = 85\r\n252 vs 238 = 85

So it gets through a lot of them ... it is comparing file 252 with file 238 when
it runs out of memory.

The failure is at line 127 (marked in the script above):
Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 8192 bytes) in /home/compare/check.php on line 127

Can you see why it is using up the memory?

Thanks
David.

    Are some files significantly larger than others? Maybe you're hitting a combination where $article1 and $article2 are two of the largest?
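
One way to tell a large-file spike from steady growth is to sample memory_get_usage() every few thousand iterations. This is only a sketch: sample_memory() is a hypothetical helper, and the loop body is a stand-in for the real comparison, not the script above.

```php
<?php
// Hypothetical helper: records current memory use every $every iterations.
function sample_memory(int $i, int $every, array &$samples): void
{
    if ($i % $every === 0) {
        $samples[$i] = memory_get_usage();
    }
}

$samples = array();
$kept = array();   // stand-in for any variable that survives between iterations

for ($i = 0; $i <= 20000; $i++) {
    $text = str_repeat('word ', 15);   // stand-in for a small file; overwritten on each pass
    $kept[] = $text . $i;              // stand-in for data that is never released
    sample_memory($i, 5000, $samples);
}

foreach ($samples as $i => $bytes) {
    printf("iteration %5d: %d bytes\n", $i, $bytes);
}
```

If the numbers climb steadily, something is accumulating across iterations. A single sudden jump would point at one unusually large input instead.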

      The fault could originate earlier: after $words1 is defined, $article1 is still defined as well. When line 127 is reached, $article1, $article2, $words1 and no doubt others all remain defined from the previous iteration, and on top of that the new $article2 content is being read.

      Incidentally, are you really reading both files every time, even though one of them (and its words) only changes once with each iteration of the outer loop?
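
A rough sketch of reading the first article only once per pass of the outer loop. The helper article_words(), the three demo files in a temp directory, and the shared-word count used in place of the real comparison are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: lowercases, strips punctuation, and splits into words.
function article_words(string $path, array $remove_these): array
{
    $text = strtolower(file_get_contents($path));
    $text = str_replace($remove_these, '', $text);
    // Split on any run of whitespace, dropping empty entries.
    return preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
}

// Demo setup (assumption): three tiny files in a temp directory.
$dir = sys_get_temp_dir() . '/cmp_demo_' . uniqid() . '/';
mkdir($dir);
file_put_contents($dir . '1.txt', 'The quick, brown fox.');
file_put_contents($dir . '2.txt', 'The slow brown dog!');
file_put_contents($dir . '3.txt', 'A quick red fox?');

$remove_these = array(',', '.', '!', '?', ':', ';');
$art_cnt = 3;
$reads = 0;      // how many times a file was read from disk
$compared = 0;   // how many pairs were compared

for ($art1 = 1; $art1 <= $art_cnt; $art1++) {
    $path1 = $dir . $art1 . '.txt';
    if (!file_exists($path1)) {
        continue;
    }
    // Read and split the first article once per outer iteration.
    $words1 = article_words($path1, $remove_these);
    $reads++;

    for ($art2 = 1; $art2 <= $art_cnt; $art2++) {
        $path2 = $dir . $art2 . '.txt';
        if ($art1 === $art2 || !file_exists($path2)) {
            continue;
        }
        $words2 = article_words($path2, $remove_these);
        $reads++;
        // Stand-in for the real comparison: count words of article 1 found in article 2.
        $shared = count(array_intersect($words1, $words2));
        $compared++;
        echo "$art1 vs $art2 = $shared shared words\n";
    }
}

// Remove the demo files again.
array_map('unlink', glob($dir . '*.txt'));
rmdir($dir);
```

For these three files that is 9 reads instead of 12. If the comparison happens to be symmetric (an assumption, since the compare step is not shown), starting the inner loop at $art1 + 1 would also cut 1000 files from 999,000 ordered pairs to 499,500.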

        Hi,
        Thanks for the replies.
        Very good questions.

        The files being compared are all very small, only 77 bytes.

        I think I have found my error.

        I was building an array inside the loop:

        $data[]=array($art1, $art2, $unique, $word_count, $found_match); // add one line to the data array

        This must have filled up the memory - I have deleted it and will run again 🙂
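
To illustrate the difference, here is a minimal sketch that compares keeping every row in an array with writing each row out as soon as it is produced. The function names, the row values and the temp file are made up for the demo. If the files are numbered consecutively, the real run would have collected roughly a quarter of a million rows by the time the log reached 252 vs 238.

```php
<?php
// Hypothetical: keeps every row in an array, as $data[] = array(...) does.
function run_accumulating(int $n): int
{
    $before = memory_get_usage();
    $data = array();
    for ($i = 0; $i < $n; $i++) {
        $data[] = array($i, $i + 1, 85, 12, 1);
    }
    // $data is still alive here, so the difference includes all stored rows.
    return memory_get_usage() - $before;
}

// Hypothetical: writes each row to a file immediately and keeps nothing.
function run_streaming(int $n, string $file): int
{
    $before = memory_get_usage();
    $fp = fopen($file, 'wb');
    for ($i = 0; $i < $n; $i++) {
        fwrite($fp, implode(',', array($i, $i + 1, 85, 12, 1)) . "\n");
    }
    fclose($fp);
    return memory_get_usage() - $before;
}

$n = 50000;
$log = tempnam(sys_get_temp_dir(), 'cmp');
$kept = run_accumulating($n);
$streamed = run_streaming($n, $log);
unlink($log);

printf("accumulating: %d bytes, streaming: %d bytes\n", $kept, $streamed);
```

The exact figures depend on the PHP version, but the accumulating run grows with the number of rows while the streaming run stays roughly flat.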

        Thanks for helping.

          Yep, arrays are notorious memory hogs in PHP. (I think it's supposed to be at least somewhat improved in PHP 7.) First thing I looked for in your code snippet was evidence of an array being built within the loop, but didn't see it.
