Hi,

I have a performance issue while doing a regex search and replace on a 2 GB file (I have to parse it line by line for memory reasons).

I'm using this code, under Windows 10, with PHP 7 64-bit. The code is pretty simple:

<?php
	$handle = fopen("2Gb.txt", "rb");
	$handle2 = fopen("2Gb-2.txt", "wb");
	stream_set_write_buffer($handle2, 32000);
	while (($line = fgets($handle)) !== false) {
		// replace '123456' at end-of-line (with an optional CR before the newline)
		fwrite($handle2, preg_replace('/123456\r?$/', 'azerty', $line));
	}
	fclose($handle);
	fclose($handle2);
?>

It takes me 17 minutes in PHP against 72 in Perl and 66 in C#. I've tried to tweak the write buffer (stream_set_write_buffer($handle2, x)), but it changes nothing. The file is generated by a script: random alphanumeric strings between 0 and 32 characters, one per line, filling 2 GB, nothing else...

Why is it so slow on the PHP side?

    PHP against 72 in Perl and 66 in C#

    72s and 66s of course...

      John.FENDER;11055083 wrote:

      I have a performance issue while doing a regex search and replace on a 2 GB file [...] Why is it so slow on the PHP side?

      Show us your C++ and/or Perl code. I certainly think we could do better if we think about it. fwrite(), for one thing, is costly.
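
      For instance (just a sketch of the idea, not tested against your file): collect the output in a string and flush it once per megabyte, so fwrite() is called a few thousand times instead of millions.

      	$handle = fopen("2Gb.txt", "rb");
      	$handle2 = fopen("2Gb-2.txt", "wb");
      	$buf = '';
      	while (($line = fgets($handle)) !== false) {
      		$buf .= preg_replace('/123456\r?$/', 'azerty', $line);
      		if (strlen($buf) >= 1048576) {	// flush roughly once per megabyte
      			fwrite($handle2, $buf);
      			$buf = '';
      		}
      	}
      	fwrite($handle2, $buf);	// write out the remainder
      	fclose($handle);
      	fclose($handle2);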

        To narrow down the problem to the I/O or the preg_replace(), I would temporarily change the fwrite(...) line so that it simply writes the input $line to the output file.
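
        Something like this, with the rest of the script left unchanged:

        	while (($line = fgets($handle)) !== false) {
        		fwrite($handle2, $line);	// copy only, no preg_replace()
        	}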

        Also, are you sure that the line endings in the file are what PHP recognizes? And what are your PHP memory settings?

          Show us your C++ and/or Perl code

          Yes sir! Perl code:

          open (FH, '<', "2Gb.txt");
          open(FILE, ">2Gb-2.txt");
          while (<FH>) {
          	$_ =~ s/123456$/azerty/g;
          	print FILE $_;
          }
          close FH;
          close FILE;

          Are you sure you wanna see the C# code?

          are you sure that the line endings in the file are what PHP recognizes

          It's CR+LF; I've generated the test file with Perl, but it could just as well be generated by PHP. A test with pure LF, Unix-style, doesn't change anything.

          what are your PHP memory settings?

          memory_limit = 3200000M.

          As the file is parsed line by line, I didn't see any big memory consumption on the graph.

            I would put a counter inside the loop and display its value at the end, to see how many lines PHP processes.
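
            For example (the rest of the original script unchanged):

            	$count = 0;
            	while (($line = fgets($handle)) !== false) {
            		fwrite($handle2, preg_replace('/123456\r?$/', 'azerty', $line));
            		$count++;	// count every line fgets() hands us
            	}
            	echo "Lines processed: $count\n";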

            I tried your code on a 1.5 GB iso file. A loop/line counter reports 5.6M (5,634,704) lines. You would likely have more, shorter lines, but the result should be proportional. With the preg_replace() in the code, it takes 21 seconds on my system. With just writing the input $line variable to the output file, it takes 14 seconds.

            That's a ridiculously large and improbable memory setting (3200000M = 3,200,000,000,000 = 3200 GB). I suspect that your line-ending character(s) (try it with just a CR) are not being recognized by PHP, and that it is trying to read the whole file into memory as one line. That memory setting is much larger than the amount of memory a personal computer can even support, and if PHP actually used a setting that large it could cause the contents of physical memory to be paged out to disk.

            Try it with a value such as 128M, and check what a PHP script with a phpinfo(); statement in it says the memory setting actually is.
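
            A one-liner with ini_get() works just as well for that check:

            	<?php
            	echo ini_get('memory_limit');	// the value php is actually running with
            	?>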

              About 95 seconds here.

              $f = "dummy.txt";
              $g = "dummy2.txt";
              
              $handle = fopen($f, "rb");
              $handle2 = fopen($g, "wb");
              stream_set_write_buffer($handle2, 32000);
              while (($line = fgets($handle)) !== false) {
                      fwrite($handle2, preg_replace('/booyah/', 'yabbadabba', $line));
              }
              fclose($handle);
              fclose($handle2);

              Interesting that a difference of 4 chars made a 200M difference in the output file size:

              du -hc dumm*
                2G    dummy.txt
              2.2G    dummy2.txt
              4.2G    total
              

                One thing I notice is that your regexes are different; the first won't match any line that ends with a "\n" (which is probably every line). I also don't recall Perl's '$' regexp symbol ignoring newlines, or its readline function stripping them; I know PHP's doesn't, and its regex language is supposed to be "Perl-compatible".

                You also seem to assume in the Perl code that you're reading a text file that uses the system's newline convention, while in PHP you read the file as binary and handle text line endings manually.

                I'm not sure about setting stream_set_write_buffer; do you have multiple processes writing to the same output file?

                Another thing: you're reading the input as binary and then fiddling with text line endings by hand. Are you unsure whether the lines of your data end with Windows or Unix line endings? Your Perl code assumes they match the system (you're not adjusting $/). Consider reading the input as text and using \b to match the end of the alphanumeric string instead of "\r?$" (or "$").
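
                To illustrate with a throwaway string (not your data): \b matches between the final digit and the CR, so no \r? is needed on CRLF files.

                	$line = "abc123456\r\n";
                	echo preg_replace('/123456\b/', 'azerty', $line);	// prints "abcazerty" plus CRLF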

                I don't have Perl installed, so I can't do a side-by-side comparison on my machine. However, your PHP code runs in 227s and a command-line sed substitution takes 80s.

                On the other hand, I thought of having PHP do the substitution one 32MB block at a time, which took about 4.5s to do the job:

                	$handle = fopen("2Gb.txt", "rb");
                	$handle2 = fopen("2Gb-2.txt", "wb");
                	stream_set_write_buffer($handle2, 32000);
                	while(!feof($handle)) {
                		$chunk = fread($handle, 32*1024*1024);
                		$chunk .= fgets($handle); // Avoid a false-positive where the chunk ends with '123456' but not at the end of a line
                		fwrite($handle2, preg_replace('/123456\b/', 'azerty', $chunk));
                	}
                	fclose($handle);
                	fclose($handle2);	
                

                So I reckon it's the underlying I/O that's the hassle, here (btw, these tests were using an SSD drive, so disk seeks between read and write positions weren't an issue).

                  I would put a counter inside the loop and display its value at the end, to see how many lines PHP processes.

                  It's done. Same count as Perl or wc -l reports.

                  it takes 21 seconds on my system

                  I use a USB 3 hard disk at 5200 rpm with 32 MB of cache, no SSD here.

                  ridiculously large and improbable memory setting

                  Yes, in fact, I boosted it a lot to see if it would change anything, and it changed nothing. The one time it was not enough, PHP sent me a memory error. And I don't see anything more than an extra 100 MB in Task Manager. It doesn't sound like a matter of memory consumption to me.

                  About 95 seconds here.

                  That's a time I'd like to get! But are you in the same conditions as me? Using a standard disk, under Windows 7/8 or 10?

                  the first won't match any line that ends with a "\n"

                  It will match all lines ending with CRLF. At the beginning I put \R in to be safe, but when I saw the performance, the first thing I did was tweak that part, and it changed nothing. In fact, a plain search currently gives a very good result: 50 s against 32 s in Perl. The regex engine in PHP is good, and I don't think the problem comes from the regex. I think it comes from fwrite(), and maybe its implementation under Windows.
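
                  One way to test that hypothesis (a sketch along the lines of the code above): run the same loop with the replacement but without the write, and compare timings.

                  	$start = microtime(true);
                  	$handle = fopen("2Gb.txt", "rb");
                  	while (($line = fgets($handle)) !== false) {
                  		preg_replace('/123456\r?$/', 'azerty', $line);	// result discarded, no fwrite()
                  	}
                  	fclose($handle);
                  	printf("read+replace only: %.1fs\n", microtime(true) - $start);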

                  FYI, here is the code I use to generate my 2GB.txt file.

                  open (FH, '>>', "2GB.txt");
                  srand( time() ^ ($$ + ($$ << 15)) );	# seed the RNG (srand, not rand)
                  my @v = qw ( a e i o u y 1 2 3 4 5 6 7 8 9 0);
                  my @c = qw ( b c d f g h j k l m n p q r s t v w x z );
                  for (1..99999999) {
                      my ($flip, $str) = (0,'');
                      $integer = rand(31)+1;
                      $str .= ($flip++ % 2) ? $v[rand(16)] : $c[rand(20)] for 1 .. $integer;
                      $str =~ s/(....)/$1 . int rand(10)/e;
                      $str = ucfirst $str if rand() > 0.5;
                      print FH "$str\n";
                  }
                  close FH;
                    John.FENDER;11055101 wrote:

                    That's a time I'd like to get! But are you in the same conditions as me? Using a standard disk, under Windows 7/8 or 10?

                    Alas, no: FreeBSD 9 on a VirtualBox VM running under Win7. Also, my text files probably didn't look like yours, so it's maybe not apples to apples.

                    If I can find time I'll try using your file generator, but they're piling it on here like nothing in recent memory ...

                      6 min 22 s on a 4.5 GB RAM disk. Perl does it in 50 s under the same conditions.

                      5.79 s on the RAM disk with your code. But it's biased, as we're no longer using the original regex pattern.

                        John.FENDER wrote:

                        5.79 s on the RAM disk with your code. But it's biased, as we're no longer using the original regex pattern.

                        Well, you can change that back (and add the /m modifier). I doubt it would make much difference.

                          Physical disk, on a file with 14,901 occurrences of 123456 at end-of-line, each line ending with CR+LF.

                          Perl: 1 min 20 s
                          Your code: 1 min 4 s
                          preg_replace('/123456\b/', 'azerty', $chunk) : grep -ic 123456$ file1 returns the same number as grep -ic azerty$ file2 -> 14901
                          preg_replace('/123456/', 'azerty', $chunk) : grep -ic 123456$ file1 returns the same number as grep -ic azerty$ file2 -> 14901
                          preg_replace('/123456/m', 'azerty', $chunk) : grep -ic 123456$ file1 returns the same number as grep -ic azerty$ file2 -> 14901
                          preg_replace('/123456$/', 'azerty', $chunk) : grep -ic 123456$ file1 does not return the same number as grep -ic azerty$ file2 -> 0
                          preg_replace('/123456$/m', 'azerty', $chunk) : grep -ic 123456$ file1 does not return the same number as grep -ic azerty$ file2 -> 0
                          preg_replace('/123456\R$/m', 'azerty', $chunk) : grep -ic 123456$ file1 does not return the same number as grep -ic azerty$ file2 -> 0
                          preg_replace('/123456\R$/', 'azerty', $chunk) : grep -ic 123456$ file1 does not return the same number as grep -ic azerty$ file2 -> 0
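
                          A small standalone check (throwaway sample string, not the real file) shows why: PCRE's $ matches before a final \n, but on a CRLF line the \r still sits between 123456 and that \n.

                          	$s = "foo123456\r\n";
                          	var_dump(preg_match('/123456$/m', $s));	// int(0) - blocked by the \r
                          	var_dump(preg_match('/123456\r?$/m', $s));	// int(1) - \r? absorbs the CR
                          	var_dump(preg_match('/123456\b/', $s));	// int(1) - \b matches before the \r
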
                          Cheers.
