I'm working on a CLI PHP script to parse the Common Crawl URL Index... this is 300 files, ~5 GB each.

I made a basic script that downloads each file (one at a time), reads it line by line with fgets(), does some minor processing to extract the raw URL out of all the text in each line, and inserts the URLs into my database one at a time.
Each file has about 9 million entries. So 9 million fgets( ) and 9 million INSERT INTO's.

The script works great, but it takes about 6 days to run. So I quickly came up with three ways to speed it up:
- Do multiple INSERTs at once into MySQL
- Read large chunks of the source file at once
- Skip 90% of lines using a random-ish algorithm. This will cut the database down from about 3 billion rows to 300 million rows, which is plenty for my needs. I haven't implemented this yet, but it's straightforward enough that I'll leave the details out of this post.

At this point I made two copies of my script. The first keeps reading 1 line at a time with fgets( ), but stores the URLs in a PHP array and flushes it every n entries by doing one large multi-row INSERT INTO query for the entire array. I made a benchmark version of this script that tried many different queue sizes and got very good, interesting results.
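
For reference, the flush itself is nothing fancy, basically just an implode into one big multi-row INSERT, something along these lines (simplified sketch; assume $db is my mysqli link, the column name here is just illustrative, and each queue entry is already an escaped "('...')" string):

// Simplified sketch of the queue flush: one multi-row INSERT for everything queued so far.
function flush_queue()
	{
	global $db, $insert_queue;
	if ( !count($insert_queue) ) { return; }
	$sql = "INSERT INTO `commoncrawl` (`url`) VALUES ".implode(",",$insert_queue);
	mysqli_query($db,$sql);
	$insert_queue = array();	// empty the queue for the next batch
	}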

Note: all benchmark results are averages of many runs, and the individual runs were VERY close to each other.

Rows | Minutes
1    | 24.83
5    | 11.41
10   | 9.89
20   | 7.37
30   | 6.87
50   | 6.4
100  | 6.05
150  | 5.9
200  | 5.83
250  | 5.81
300  | 5.78
400  | 5.77
500  | 5.78
750  | 5.81
1000 | 5.83

As you can see, doing 400 rows at once gave me the best results, significantly better than individual inserts. Beyond that, things started to slow down again, which makes sense considering the size of those queries.

Next I took the other copy of the original script, left it doing 1 INSERT INTO at a time, but set it up to use fread( ) to read larger chunks at a time. The average line in these files is probably 0.25 to 0.5 KB. Just like before, I set it up to benchmark many different read sizes, from 1 KB all the way up to 16 MB. I was concerned about whether I was joining the remainder lines back together properly, but every benchmark produced exactly the same table size, which is a good sign.

HOWEVER, the benchmark times are not good at all. There was basically zero speed difference between reading 1 line at a time and reading 16 MB at a time, all the way through this 5 GB source file!

KB    | Minutes
1     | 25.4
2     | 25.07
4     | 24.91
8     | 24.83
16    | 24.71
32    | 24.6
64    | 24.62
128   | 24.64
256   | 24.7
512   | 24.8
1024  | 24.84
2048  | 24.87
4096  | 24.87
8192  | 24.96
16384 | 25.12

This surprised me. How could reading 1 line at a time be just as fast as reading 16 MB? The obvious answer is that those single MySQL INSERTs are the bottleneck. But even so, shouldn't there still be some meaningful speed increase?

Right now I have a benchmark running through the big-reads test again, this time using a 400 row INSERT queue. That should shed some light on things, but I have a funny feeling the results are going to be the same as doing 400-row INSERTs while reading 1 line at a time. Stay tuned an hour or two after this post for the 3rd benchmark results, which should be very interesting.

I've been working with PHP and MySQL for many years now, but one thing I do not have a lot of experience with is reading large files. The few times I've done it before, I've just read them line by line with fgets( ). I've never had a script that was taking 6 days and needed to be sped up to, ideally, about 4 hours 🙂 Breaking a file up into arbitrarily sized chunks and parsing through the lines, while leaving half-eaten lines for the next pass, was kind of an odd process. I am very surprised fgets( ) doesn't simply have an optional argument for a number of lines, which would have it return n lines as an array instead of 1 line as a string. That would be a super easy way to get the advantages of doing fewer large reads instead of many tiny reads (advantages which apparently don't exist, but more on that to come...)
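
(Just to illustrate, a hypothetical wrapper like this is the kind of thing I was picturing, although internally it would still be doing one fgets( ) call per line, so it wouldn't actually get you the fewer-bigger-reads benefit:)

// Hypothetical helper: read up to $n lines and return them as an array (fewer at EOF).
function fgets_n($fp, $n)
	{
	$lines = array();
	while ( count($lines) < $n && ($line = fgets($fp)) !== false )
		{
		$lines[] = $line;
		}
	return $lines;
	}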

This is what I came up with as a process for reading large chunks, gluing the ends together, and hopefully not losing or mis-processing any data. Using my MySQL row count as a test, it does seem to work without issue.

$chunk = "";
$file = fopen("very_large_file.txt","r");
if ($file)
	{
	$insert_count = 0;
	while ( !feof($file) )
		{
		// Read a chunk and append it to the main string $chunk
		$chunk .= fread($file,$size*1024);

	if ( feof($file) )
		{
		// If the last read hit EOF, then process $chunk and clear it for next loop (through next file)
		$payload_lines = explode("\n",$chunk);
		$chunk = "";
		}
	else
		{
		// If we're not at the end, find the last newline character in $chunk
		$last_newline = strrpos($chunk,"\n");
		if ($last_newline === false)
			{
			// If there is no newline, skip the rest of this loop and go read more
			continue;
			}
		else
			{
			// Process the string up to the last newline, and set $chunk to the string after the newline, the "remainder" that
			// I currently have no way to tell if it's a complete line or a partial line (probably partial but occasionally complete)
			$payload = substr($chunk,0,$last_newline);
			$chunk = substr( strrchr($chunk,10), 1);
			$payload_lines = explode("\n",$payload);
			}
		}

	foreach ($payload_lines as $line)
		{
		// Process each $line and add the end result to my insert queue
		$insert_queue[] = "('$url')";
		}

	if ( count($insert_queue) > 399 ) // Flush the queue every 400 items
		{
		$insert_count += count($insert_queue);
		flush_queue();
		}
	}

if ( count($insert_queue) )
	{ flush_queue(); }

fclose($file);
}

I'm hoping there isn't some combination of string and read length that will cause this to flake out and read improper data.

    What happens if you use MySQL's bulk load facilities? I don't use MySQL enough to know what format it wants the data to be in, but it should surely be able to work with CSV; if you've only got one column (the URL) then that should look enough like CSV to pass.

      To do that, I'd have to parse each 5 GB file into a bare "list of urls" CSV file that would probably be around 2-3 GB, then bulk load THAT into MySQL. I don't think that would be faster, although I haven't actually tried.

        Benchmarks are in for reading large chunks of the 5 GB file while also running a 400 row queue. The difference in speed is very subtle, but it is there, and it follows the same pattern as the large-chunk reads did with individual INSERT queries. So I guess this is as fast as it gets. And that's fine, I can work with this.

        KB    | Minutes
        1     | 6.17
        2     | 5.87
        4     | 5.71
        8     | 5.63
        16    | 5.57
        32    | 5.53
        64    | 5.53
        128   | 5.52
        256   | 5.58
        512   | 5.73
        1024  | 5.82
        2048  | 5.98
        4096  | 6.14
        8192  | 6.32
        16384 | Errors

          How are the input files formatted? Is there a columnar structure (tabbed, CSV, piped, spaces, whatever)? If so, then the LOAD DATA INFILE option is probably the way to go.

          You can also discard an input value by assigning it to a user variable and not assigning the variable to a table column:

          LOAD DATA INFILE 'file.txt'
            INTO TABLE t1
            (column1, @dummy, column2, @dummy, column3);

            Well, they definitely need to be processed first, if for no other reason than the fact that I'm going to dump a semi-random 90% of the entries, and I don't have enough room on my server to even temporarily store them all (it would be about 250 GB vs 25 GB). Plus there are some tests I want to do to filter out certain kinds of entries, which will be a lot easier to do this way than by doing full-text searches of the entire built database.

            Also, this is what each line of the file looks like. Kind of a terrible format, but easy enough to split at the open bracket, process the bracketed part as JSON, and get the info I want out of it:

            de,pedocs)/frontdoor.php?source_opus=7502 20170427222802 {"url": "http://www.pedocs.de/frontdoor.php?source_opus=7502", "mime": "text/html", "status": "200", "digest": "HSXDRYDKSLYEEVLXNLDXZL46TJE6IB56", "length": "8437", "offset": "653335664", "filename": "crawl-data/CC-MAIN-2017-17/segments/1492917122629.72/warc/CC-MAIN-20170423031202-00085-ip-10-145-167-34.ec2.internal.warc.gz"}

              Hmm...so then the question becomes whether optimizing for DB efficiency means less efficiency on the PHP side, and vice versa. 🙂

              One thought if it would apply in your case: if you are creating a new table when you do this process, and if you have any indexes on that table, creating the indexes after you do all the inserts can potentially save you some time, as each inserted row does not then have to be sorted and added to the indexes as it's inserted.
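
              Roughly this idea, as a sketch (the table, column, and index names here are just placeholders):

              // Sketch: create the table with no secondary indexes, do all of the bulk inserts,
              // then build the index once at the end instead of on every insert.
              mysqli_query($db, "CREATE TABLE `urls` (
                  `id`  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
                  `url` VARCHAR(255) NOT NULL
                  ) ENGINE=InnoDB");

              // ... all of the inserts happen here ...

              mysqli_query($db, "ALTER TABLE `urls` ADD INDEX `idx_url` (`url`)");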

                The reason that the read block-size didn't affect the time much is probably because the time is being spent processing the data, not reading it.

                Some things that will reduce the time -

                1) Use the bulk LOAD DATA INFILE method. Experience has shown that this results in the fastest time, though the php processing needed to write the parsed data out to a file may offset the savings (see the sketch after this list).

                2) If not using the LOAD DATA INFILE method, eliminate as much php code as possible. The foreach() loop you have processing the data and adding elements to $insert_queue[] can be sped up by turning the processing into a call-back function and using array_map() to apply it to all the elements in the $payload_lines array. Your parsing of the url out of each line should also be as efficient as possible, since you are doing it for every line. If the line is json, one call to json_decode may (untested) be better than calling multiple string functions.

                3) As an extension of item #2, use php array functions in the forming of the sql query statement, rather than any loops. I/We don't know what your current code is doing.

                4) Use a prepared query for the insert statement, even for a 400 multi-value statement. This will eliminate all the repetitive communication with the database server just to get the sql query syntax over to the server. The values you put into the $insert_queue will just be the url, not the (' and ') parts of the syntax. The communication with the sql server is where most of the time is taken for simple sql queries, which an insert query is (assuming that you are building any indexes after the data has been inserted, see reply #7 above.) You would dynamically prepare the 400 multi-value query once, before the start of all the processing, then just supply the data and execute the query as you process the data. The last partial block of data would need a separate query, using your existing code. Note: using a prepared query for a single value at a time will only save about 5%, since it only eliminates a portion of the communication and processing. The only way of significantly speeding this up is to use a multi-value query or the bulk LOAD DATA INFILE.
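
                Here's a rough sketch of what item #1 could look like (untested; the file path, table, and column names are placeholders): write the parsed urls out to a flat file as you go, then hand the whole file to the database in one statement.

                // Sketch only: stream the parsed urls to a temp file instead of queuing INSERTs.
                $tmp = fopen("/common_crawl_temp/urls.txt","w");
                foreach ($payload_lines as $line)
                	{
                	// ... your existing parsing/filtering of $line into $url ...
                	fwrite($tmp, $url."\n");
                	}
                fclose($tmp);

                // One bulk load instead of millions of INSERT statements. The LOCAL variant
                // requires local_infile to be enabled on both the client and the server.
                mysqli_query($db, "LOAD DATA LOCAL INFILE '/common_crawl_temp/urls.txt'
                    INTO TABLE `urls`
                    LINES TERMINATED BY '\\n'
                    (`url`)");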

                  Nix-like OS?

                  Might you gain any traction from split(1)?

                    dalecosp;11062229 wrote:

                    Nix-like OS?

                    Or grep(1), while we're there? Maybe some processing could be done in a command chain. Maybe enough to make the output ready for LOAD DATA INFILE. Or maybe enough that its output can be read by a [man]popen[/man]ed process handle - though at that point one might as well make the PHP script just another tool on the chain and pipe into its STDIN.

                    (If all the preprocessing were done in PHP, then it wouldn't be necessary to unpack the archive: [man]gzread[/man] it instead.)
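
                    Something like this, for instance (just a sketch; the .gz filename is only an example) - the zlib functions let you read the compressed index directly, line by line:

                    // Read the gzipped index file without unpacking it to disk first.
                    $gz = gzopen("/common_crawl_temp/cdx-00000.gz","r");
                    if ($gz)
                    	{
                    	while ( ($line = gzgets($gz)) !== false )
                    		{
                    		// ... same per-line parsing/filtering as before ...
                    		}
                    	gzclose($gz);
                    	}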

                      NogDog;11062225 wrote:

                      Hmm...so then the question becomes whether optimizing for DB efficiency means less efficiency on the PHP side, and vice versa. 🙂

                      One thought if it would apply in your case: if you are creating a new table when you do this process, and if you have any indexes on that table, creating the indexes after you do all the inserts can potentially save you some time, as each inserted row does not then have to be sorted and added to the indexes as it's inserted.

                      I am making a new table every time this script runs, and the only index is an auto-incremented ID field. Would the time savings of making the index after-the-fact still apply in that case?

                        Now that I have the insertion and the file-reading going nice and fast, I'm working through the rest of the script. Lots of little tweaks whose results add up when you are repeating them several million times. One that really stood out to me was this:

                        if ( substr( crc32($url) , -1) == "0" )

                        Replacing that with:

                        if ( crc32($url) % 10 === 0)

                        The output is different, but effectively the same, and much faster. Both ways give you about 10% of the whole, and both ways, unlike using a random function, will always give the same result for the same input, which is what I wanted. But using mod and keeping everything as an integer is far faster than turning the result into a string and comparing the last character. This is probably one of the rare cases where crc32( ) returning an integer instead of hex works out really well 😃

                        I'm going to keep tweaking things and see what I can come up with, then check back.
                        In the meantime, I've looked and haven't found any, but if anyone knows of a way to checksum a short string that is even faster than crc32, lemme know 😃

                          Ok this is as small as I've been able to get it. This, on my test machine (which is actually a little slower than my production server) runs through a 5 GB test file in 3 minutes, 2 seconds. Not too shabby. But of course, any other speed tips would be appreciated.

                          I found it odd that sometimes breaking up multiple nested functions would speed things up, and sometimes it would slow things down (always by tiny amounts).
                          And array_map( ) was actually slower than foreach( ), but array_walk( ) was faster.

                          So this whole chunk of the script gets repeated 300 times, once for each of the 300 index files. Each file has about 10 million lines. All things considered, I'm pretty happy with the performance. I want to wait until the whole thing is done before I try building with and without the index. Because it's an auto-increment index, I suspect I won't save any time by building without it and then indexing after the fact. But I feel like the results I get from a 900,000 row table could be different from what I get with a 300,000,000 row table.

                          define("CHUNK_READ_SIZE",131072);
                          define("MYSQL_INSERT_ITEMS",399);
                          
                          $begginingtime = time();
                          
                          $index_file = fopen("/common_crawl_temp/cdx-00194","r");
                          if ($index_file)
                          	{
                          	while ( !feof($index_file) )
                          		{
                          		$chunk .= fread($index_file,CHUNK_READ_SIZE);
                          
                          	if ( feof($index_file) )
                          		{
                          		$payload_lines = explode("\n",$chunk);
                          		$chunk = "";
                          		}
                          	else
                          		{
                          		$last_newline = strrpos($chunk,"\n");
                          		if ($last_newline === false)
                          			{
                          			continue;
                          			}
                          		else
                          			{
                          			$payload = substr($chunk,0,$last_newline);
                          			$chunk = substr( strrchr($chunk,10), 1);
                          			$payload_lines = explode("\n",$payload);
                          			}
                          		}
                          
                          	array_walk($payload_lines,function($line)
                          		{
                          		global $db,$insert_queue;
                          
                          		if ( strlen($line ) < 6)							{ return; }
                          
                          		$chr = strpos($line,"{");
                          		$json = substr($line,$chr);
                          		$data = json_decode($json,true);
                          
                          		$num = crc32($data['url']);
                          		if ($num % 10 !== 0)								{ return; }
                          		if ( $data['status'] !== "200" )					{ return; }
                          		if ( strlen($data['url']) > 255 )					{ return; }
                          		if ( $data['mime'] !== "text/html" )				{ return; }
                          		if ( substr($data['url'],-10) === "robots.txt" )	{ return; }
                          
                          		if ( !($url_parts = parse_url($data['url'])) )		{ return; }
                          
                          		if ( array_key_exists("host",$url_parts) )
                          			{
                          			if ( ip2long($url_parts['host']) !== false )	{ return; }
                          			if ( strpos($url_parts['host'],".") === false ) { return; }
                          			}
                          
                          		$url = mysqli_real_escape_string($db,trim($data['url']));
                          		$insert_queue[] = "('$url')";
                          		});
                          
                          
                          	if ( count($insert_queue) > MYSQL_INSERT_ITEMS )
                          		{
                          		flush_queue();
                          		}
                          	}
                          
                          if ( count($insert_queue) )
                          	{ flush_queue(); }
                          fclose($index_file);
                          
                          $timer = time() - $begginingtime;
                          $timer = round(($timer/60),2);
                          
                          $table_count = mysqli_fetch_assoc(mysqli_query($db,"SELECT COUNT(*) AS `count` FROM `commoncrawl`"));
                          $msg = "File 00194 Processed - ".number_format($table_count['count'])." rows - $timer minutes - 128 KB / 400 SQL Inserts\n";
                          echo $msg;
                          file_put_contents("/common_crawl_temp/big_read_log.txt",$msg,FILE_APPEND);
                          }
                          else
                          	{
                          	echo "Error Opening /common_crawl_temp/cdx-00194\n";
                          	}
                          
                          

                            any other speed tips would be appreciated.

                            Use a prepared query as mentioned in reply #8.

                            For a simple query using string input data, the time taken to communicate all the characters between php and the database server is much greater than the time it takes to execute the query on the database server. Reducing the number of separate communications (due to the hand-shaking for each one) and the size of each communication will speed up the process. The sql query syntax will only be sent once, when the query is first prepared, not each time the query gets executed, and the size of the sql statement will be reduced because it won't contain the single-quotes around the string values. Only the final url string data will be sent when the query gets executed. This will also reduce the amount of php code and php data for each value, since you won't need to call the escape string function and you won't need the extra (' and ') around each piece of data.

                            Since you are using the php mysqli extension, dynamically binding the 400 multi-values of input data can be simplified using the ... splat operator.
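
                            A sketch of the idea (untested; the table and column names are placeholders, and $queue here holds just the bare url strings, 400 at a time):

                            // Prepare the 400-placeholder statement once, up front.
                            $rows = 400;
                            $sql = "INSERT INTO `urls` (`url`) VALUES ".implode(",",array_fill(0,$rows,"(?)"));
                            $stmt = mysqli_prepare($db,$sql);

                            // Each time the queue reaches 400 bare urls, bind them and execute.
                            mysqli_stmt_bind_param($stmt, str_repeat("s",$rows), ...$queue);
                            mysqli_stmt_execute($stmt);
                            $queue = array();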

                                          $num = crc32($data['url']);
                                          if ($num % 10 !== 0)                                { return; }
                                          if ( $data['status'] !== "200" )                    { return; }
                                          if ( strlen($data['url']) > 255 )                    { return; }
                                          if ( $data['mime'] !== "text/html" )                { return; }
                                          if ( substr($data['url'],-10) === "robots.txt" )    { return; } 

                              Do you want 10% of all of the recorded URLs, or 10% of the "interesting" ones? If the latter, then you could move the crc and mod checks to after the status, MIME type, etc. tests (actually, you'd need to if you want the latter).
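
                              That is, something like this (same tests you already have, just reordered so the 10% sample is taken from the rows that survive the other filters):

                              if ( $data['status'] !== "200" )					{ return; }
                              if ( strlen($data['url']) > 255 )					{ return; }
                              if ( $data['mime'] !== "text/html" )				{ return; }
                              if ( substr($data['url'],-10) === "robots.txt" )	{ return; }

                              // Sample last, so you keep ~10% of the "interesting" URLs.
                              if ( crc32($data['url']) % 10 !== 0 )				{ return; }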

                              [man]parse_url[/man] takes a second parameter (you're only using the host part). Aside from that, parse_url doesn't return an array with null elements, so [man]array_key_exists[/man] could have been [man]isset[/man].

                              Since the anonymous function isn't closing over any variables in the outer scope, it doesn't need to be an anonymous function: an ordinary one, referred to by name, may be faster.

                                So I tried using this instead:

                                	if ( !($host = parse_url($data['url'],PHP_URL_HOST)) )	{ return; }
                                
                                	if ( !is_null($host) )
                                		{
                                		if ( ip2long($host) !== false )						{ return; }
                                		if ( strpos($host,".") === false ) 					{ return; }
                                		}
                                

                                Turns out it's a bit slower than returning the whole array from parse_url. If I skip the is_null( ) call and just do "if ($host)", it's even slower still. Granted these are just tiny amounts slower, but it was faster to return the full array.

                                So then I put it back to parse_url returning the array. I was going to try using isset instead of array_key_exists, but then I realized... I can just skip that check entirely. No need to wrap those next two lines in their own if statement. Removing that "if" check does speed me up a tiny bit.

                                So at this point, I tried getting parse_url to return just the host as a string one more time, but it was still slower than having it return the full array.

                                I'll try making my anon function into a regular function and see if that gives me any more time, at a later time when I have more time 😃

                                As far as which 10% I want... it doesn't really matter, and it's more important at this point to speed up the script while dumping "about" 90%. As far as that test block goes, the fastest order really depends on what the contents of the file are. I put the 10% test at the top because that saves 90% of the lines from hitting all of the rest of the tests, which seems faster than putting the 10% test at the end. Unless it takes more CPU power to do a CRC than it does all the other string-related tests combined? I can try it, but I suspect the current order will be the fastest.

                                  Also confirmed: breaking my anonymous function out into a regular function and calling it, is a little bit slower.
