I'm working on a CLI PHP script to parse the Common Crawl URL Index... this is 300 files, ~5 GB each.
I made a basic script that downloads each file (one at a time), reads it line by line with fgets(), does some minor processing to extract the raw URL from each line, and inserts the URLs into my database one at a time.
Each file has about 9 million entries, so that's 9 million fgets() calls and 9 million INSERT INTOs.
The script works great, but it takes about 6 days to run. So I quickly came up with three ways to speed it up:
- Do multiple INSERTs at once into MySQL
- Read large chunks of the source file at once
- Skip 90% of lines using a random-ish algorithm. This will cut the database down from roughly 3 billion rows to 300 million, which is plenty for my needs. I haven't implemented this yet, but it's straightforward enough; a minimal sketch of what I mean is right below.
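For the record, the skip logic I have in mind is nothing fancier than something like this (a minimal sketch, assuming $file is the open source file handle; mt_rand() is the real PHP function, the 1-in-10 ratio is just the target):

// Minimal sketch: keep roughly 1 line in 10, skip the rest.
while ( ($line = fgets($file)) !== false )
{
    if ( mt_rand(1, 10) !== 1 )
    {
        continue; // skip ~90% of lines before doing any processing
    }
    // ...extract the URL from $line and queue it as usual
}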
At this point I made two copies of my script. The first keeps reading one line at a time with fgets(), but stores the URLs in a PHP array and flushes it every n rows by doing one large INSERT INTO query for the entire array. I made a benchmark version of this script that tried many different queue sizes and got very good, very interesting results.
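To be concrete about what "one large INSERT" means here, the flushed query ends up shaped roughly like this (a sketch; the urls table and url column are placeholders for my real schema):

// One multi-row INSERT instead of n single-row ones.
$sql = "INSERT INTO urls (url) VALUES "
     . "('http://example.com/a'),"
     . "('http://example.com/b'),"
     . "('http://example.com/c')";   // ...one tuple per queued URL, up to n of them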
Note: all benchmark results are averages of many runs, and the individual runs were all very close to each other.
Rows per INSERT | Minutes
1 | 24.83
5 | 11.41
10 | 9.89
20 | 7.37
30 | 6.87
50 | 6.4
100 | 6.05
150 | 5.9
200 | 5.83
250 | 5.81
300 | 5.78
400 | 5.77
500 | 5.78
750 | 5.81
1000 | 5.83
As you can see, doing 400 rows at once gave me the best results, significantly better than individual INSERTs. After that, things started to slow down again, which makes sense considering the size of these queries.
Next I took the other copy of the original script, left it doing one INSERT INTO at a time, but set it up to use fread() to read larger chunks at a time. The average line in these files is probably 0.25 to 0.5 KB. Just like before, I set it up to benchmark many different read sizes, from 1 KB all the way up to 16 MB. I was concerned about whether I was joining the remainder lines properly, but every benchmark produced exactly the same table size, which is a good sign.
HOWEVER, the benchmark times are not good at all. There was basically zero speed difference between reading 1 line at a time and reading 16 MB at a time, all the way through this 5 GB source file!
Read size (KB) | Minutes
1 | 25.4
2 | 25.07
4 | 24.91
8 | 24.83
16 | 24.71
32 | 24.6
64 | 24.62
128 | 24.64
256 | 24.7
512 | 24.8
1024 | 24.84
2048 | 24.87
4096 | 24.87
8192 | 24.96
16384 | 25.12
This surprised me. How could reading 1 line at a time be just as fast as reading 16 MB at a time? The obvious answer is that those single MySQL INSERTs are the bottleneck. But even so, shouldn't there still be some meaningful speed increase?
Right now I have a benchmark running through the big-reads test again, this time using a 400-row INSERT queue. That should shed some light on things, but I have a funny feeling the results are going to be the same as doing 400-row INSERTs while reading 1 line at a time. Stay tuned for an hour or two after this post for the third benchmark results, which should be very interesting.
I've been working with PHP and MySQL for many years now, but one thing I don't have a lot of experience with is reading large files. The few times I've done this before, I've just read them line by line with fgets(). I've never had a script that was taking 6 days and needed to be sped up to, ideally, about 4 hours. Breaking a file into arbitrarily sized chunks and parsing through the lines, while leaving half-eaten lines for the next pass, was kind of an odd process. I'm surprised fgets() doesn't simply have an option (its second argument is a maximum byte length, not a line count) to return n lines as an array instead of 1 line as a string. That would be a super easy way to get the advantages of doing fewer large reads instead of many tiny reads (advantages which apparently don't exist, but more on that to come...).
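If such a thing existed, I picture it as a trivial wrapper along these lines (fgets_n() is a hypothetical helper, not a built-in; fgets() is the real function doing the work):

// Hypothetical helper: return up to $n lines from $file as an array,
// the way I wish fgets() could do natively.
function fgets_n($file, $n)
{
    $lines = array();
    for ($i = 0; $i < $n; $i++)
    {
        $line = fgets($file);
        if ($line === false)
        {
            break; // EOF or read error
        }
        $lines[] = rtrim($line, "\r\n");
    }
    return $lines;
}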
This is what I came up with as a process for reading large chunks, gluing the ends together, and hopefully not losing or mis-processing any data. Using my MySQL row count as a test, it does seem to work without issue.
$chunk = "";
$file = fopen("very_large_file.txt","r");
if ($file)
{
$insert_count = 0;
while ( !feof($file) )
{
// Read a chunk and append it to the main string $chunk
$chunk .= fread($file,$size*1024);
if ( feof($file) )
{
// If the last read hit EOF, then process $chunk and clear it for next loop (through next file)
$payload_lines = explode("\n",$chunk);
$chunk = "";
}
else
{
// If we're not at the end, find the last newline character in $chunk
$last_newline = strrpos($chunk,"\n");
if ($last_newline === false)
{
// If there is no newline, skip the rest of this loop and go read more
continue;
}
else
{
// Process the string up to the last newline, and set $chunk to the string after the newline, the "remainder" that
// I currently have no way to tell if it's a complete line or a partial line (probably partial but occasionally complete)
$payload = substr($chunk,0,$last_newline);
$chunk = substr( strrchr($chunk,10), 1);
$payload_lines = explode("\n",$payload);
}
}
foreach ($payload_lines as $line)
{
// Process each $line and add the end result to my insert queue
$insert_queue[] = "('$url')";
}
if ( count($insert_queue) > 399 ) // Flush the queue every 400 items
{
$insert_count += count($insert_queue);
flush_queue();
}
}
if ( count($insert_queue) )
{ flush_queue(); }
fclose($file);
}
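For reference, flush_queue() isn't shown above; it's roughly this (a sketch that assumes a mysqli connection in $db, a placeholder table named urls, and that the values were already escaped when they were added to the queue):

// Build one multi-row INSERT from the queue, run it, and reset the queue.
function flush_queue()
{
    global $db, $insert_queue;
    if ( count($insert_queue) == 0 )
    {
        return;
    }
    $sql = "INSERT INTO urls (url) VALUES " . implode(",", $insert_queue);
    mysqli_query($db, $sql);
    $insert_queue = array(); // clear the queue for the next batch
}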
I'm hoping there isn't some combination of line length and read size that will cause this to flake out and read improper data.