OK, this is as small as I've been able to get it. On my test machine (which is actually a little slower than my production server), it runs through a 5 GB test file in 3 minutes, 2 seconds. Not too shabby. But of course, any other speed tips would be appreciated.
I found it odd that breaking up nested function calls would sometimes speed things up and sometimes slow things down (always by tiny amounts).
And array_map() was actually slower than foreach(), but array_walk() was faster.
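For reference, the three call shapes I compared look like this; process_line() is just a stand-in name for the closure you'll see in the script below:

function process_line($line) { /* same per-line filtering as the closure in the script below */ }

foreach ($payload_lines as $line) { process_line($line); } // plain loop
$ignored = array_map('process_line',$payload_lines); // builds (and then discards) a result array
array_walk($payload_lines,'process_line'); // walks the array in place, no result array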
So this whole chunk of the script gets repeated 300 times, once for each of the 300 index files, and each file has about 10 million lines. All things considered, I'm pretty happy with the performance. I want to wait until the whole run is done before I try building with and without the index. Because it's an autoincrement index, I suspect I won't save any time by building without it and then indexing after the fact, but I feel like the results I get from a 900,000-row table could be different from what I get with a 300,000,000-row table.
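For clarity, the "indexing after the fact" variant I have in mind is just one ALTER after the load finishes, something along these lines (the id column name here is only illustrative):

mysqli_query($db,"ALTER TABLE `commoncrawl` ADD `id` INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY");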
define("CHUNK_READ_SIZE",131072);
define("MYSQL_INSERT_ITEMS",399);
$begginingtime = time();
$index_file = fopen("/common_crawl_temp/cdx-00194","r");
if ($index_file)
{
while ( !feof($index_file) )
{
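// Read the file in 128 KB chunks; any partial line at the end of a chunk is carried over into the next read.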
$chunk .= fread($index_file,CHUNK_READ_SIZE);
if ( feof($index_file) )
{
$payload_lines = explode("\n",$chunk);
$chunk = "";
}
else
{
$last_newline = strrpos($chunk,"\n");
if ($last_newline === false)
{
continue;
}
else
{
$payload = substr($chunk,0,$last_newline);
$chunk = substr($chunk,$last_newline + 1);
$payload_lines = explode("\n",$payload);
}
}
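// Filter each complete line and queue qualifying URLs for a batched INSERT.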
array_walk($payload_lines,function($line)
{
global $db,$insert_queue;
if ( strlen($line) < 6 ) { return; }
$chr = strpos($line,"{");
if ( $chr === false ) { return; }
$json = substr($line,$chr);
$data = json_decode($json,true);
if ( !is_array($data) || !isset($data['url'],$data['status'],$data['mime']) ) { return; }
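// Keep only URLs whose CRC32 is divisible by 10 (a deterministic ~10% sample).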
$num = crc32($data['url']);
if ($num % 10 !== 0) { return; }
if ( $data['status'] !== "200" ) { return; }
if ( strlen($data['url']) > 255 ) { return; }
if ( $data['mime'] !== "text/html" ) { return; }
if ( substr($data['url'],-10) === "robots.txt" ) { return; }
if ( !($url_parts = parse_url($data['url'])) ) { return; }
if ( array_key_exists("host",$url_parts) )
{
if ( ip2long($url_parts['host']) !== false ) { return; }
if ( strpos($url_parts['host'],".") === false ) { return; }
}
$url = mysqli_real_escape_string($db,trim($data['url']));
$insert_queue[] = "('$url')";
});
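// Flush the queue to MySQL once more than MYSQL_INSERT_ITEMS (i.e. 400) rows are waiting.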
if ( count($insert_queue) > MYSQL_INSERT_ITEMS )
{
flush_queue();
}
}
if ( count($insert_queue) )
{
flush_queue();
}
fclose($index_file);
$timer = time() - $beginning_time;
$timer = round(($timer/60),2);
$table_count = mysqli_fetch_assoc(mysqli_query($db,"SELECT COUNT(*) AS `count` FROM `commoncrawl`"));
$msg = "File 00194 Processed - ".number_format($table_count['count'])." rows - $timer minutes - 128 KB / 400 SQL Inserts\n";
echo $msg;
file_put_contents("/common_crawl_temp/big_read_log.txt",$msg,FILE_APPEND);
}
else
{
echo "Error Opening /common_crawl_temp/cdx-00194\n";
}
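In case it matters, flush_queue() isn't shown in this chunk; it lives earlier in the script. A minimal sketch of what it amounts to, assuming a single url column (the schema isn't visible here):

function flush_queue()
{
global $db,$insert_queue;
if ( !count($insert_queue) ) { return; }
// $insert_queue already holds "('escaped-url')" strings, so one multi-row INSERT covers the whole batch.
mysqli_query($db,"INSERT INTO `commoncrawl` (`url`) VALUES ".implode(",",$insert_queue));
$insert_queue = array();
}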