I hear this is the place to brag, so I felt I wanted to show off a little. I had a script to import about 60,000 lines of data into MySQL from a couple of tab-delimited files using a PHP script. Initially I got it working but the script took over 5 hours to complete. I rearranged my thinking a little and changed my table structure, as well as found a nifty little SQL directive called "ON DUPLICATE KEY" and now the import script completes in 12.7 seconds (with no changes to the data files).
So here's the whole process:
1) Cost data is downloaded from a MS-SQL database to a couple of flat files -
there's one file for estimated costs, and one for actual costs.
2) My script runs as follows:
a) The "final" and "temp" tables are deleted, if they exist
b) Estimated costs (and related data) are imported into the final table
c) Actual costs are imported into the temp table
d) Actual costs are merged into the final table, inserting where there is no estimated cost data
e) Variance is calculated
And here's the code:
<?php
include('/srv/www/htdocs/includes/ffdb_connect.php');
include('/srv/www/htdocs/includes/functions.php');
// Data/log files...
$estFile = "/tmp/data/est_costs.csv";
$actFile = "/tmp/data/actual_costs.csv";
$errFile = "/tmp/logs/import_costs.log";
// Delete existing log file, if it exists...
if (file_exists($errFile))
unlink($errFile);
// Open error file for writing...
$efh = fopen($errFile, 'x');
fwrite($efh, "Costs Import Function Log File\n****\n\n");
// Clear out current tables...
$query = "DROP TABLE IF EXISTS Est_vs_Actual_Costs";
mysql_query($query, $link);
$query = "DROP TABLE IF EXISTS ActualCosts_Raw";
mysql_query($query, $link);
// Re-create main cost table...
$query = "CREATE TABLE Est_vs_Actual_Costs (".
"cost_id VARCHAR(32) NOT NULL, ".
"location_id INT, ".
"job_number VARCHAR(12), ".
"bus_segment VARCHAR(1), ".
"cost_type VARCHAR(1), ".
"phase_code VARCHAR(10), ".
"est_cost DECIMAL(12,2) DEFAULT 0, ".
"act_cost DECIMAL(12,2) DEFAULT 0, ".
"variance DECIMAL(12,2) DEFAULT 0, ".
"pm VARCHAR(24), ".
"PRIMARY KEY (cost_id)".
") ENGINE = MyISAM";
mysql_query($query, $link);
// Check for table creation errors...
if (mysql_error() != "")
{
$err = "CREATE TABLE Est_vs_Actual_Costs error: ".mysql_error()."\n".$query."\n\n";
fwrite($efh, $err);
}
// Re-create temporary estimate table...
$query = "CREATE TABLE ActualCosts_Raw (".
"cost_id VARCHAR(32) NOT NULL, ".
"location_id INT, ".
"job_number VARCHAR(12), ".
"bus_segment VARCHAR(1), ".
"cost_type VARCHAR(1), ".
"phase_code VARCHAR(10), ".
"act_cost DECIMAL(12,2) DEFAULT 0, ".
"pm VARCHAR(24), ".
"PRIMARY KEY (cost_id)".
") ENGINE = MyISAM";
mysql_query($query, $link);
// Check for table creation errors...
if (mysql_error() != "")
{
$err = "CREATE TABLE ActCost_Raw error: ".mysql_error()."\n".$query."\n\n";
fwrite($efh, $err);
}
// Open "Estimated Cost" data file...
$fh = fopen($estFile, 'r');
$i = 1;
fwrite($efh, "Starting Est_Cost imports...\n-------------------\n");
// Import Estimated Costs...
while ($Data = fgets($fh))
{
// Format data for import...
$Data = trim($Data);
// remove single-quotes
$Data = str_replace("'", "", $Data);
// remove commas
$Data = str_replace(",", "", $Data);
// remove newlines
$Data = str_replace("\n", "", $Data);
// replace tabs with commas
$Data = str_replace("\t", ",", $Data);
// Load formatted fields into an array...
$str_array = explode(",", $Data);
// Initialize job_id as a string...
$job_id = "";
// Get and validate fields from array...
$location_id = (int)trim($str_array[0]);
$job_number = trim($str_array[1]);
$bus_segment = trim($str_array[2]);
$cost_type = trim($str_array[3]);
$phase_code = trim($str_array[4]);
$est_cost = (float)trim($str_array[5]);
$pm = trim($str_array[6]);
$cost_id = $location_id.$job_number.$bus_segment.$cost_type.$phase_code;
// Skip rows with 0 cost...
if ($est_cost == 0)
continue;
// Insert values into table...
$query = "INSERT INTO Est_vs_Actual_Costs (cost_id, location_id, job_number, bus_segment, ".
"cost_type, phase_code, est_cost, pm) ".
"VALUES ('$cost_id', $location_id, '$job_number', '$bus_segment', '$cost_type', ".
"'$phase_code', $est_cost, '$pm') ".
"ON DUPLICATE KEY UPDATE est_cost = est_cost + $est_cost";
$result = mysql_query($query, $link);
// Check for insert query errors...
if (mysql_error() != "")
{
$err = "INSERT Est_Costs error: ".mysql_error()."\n".$query."\n\n";
fwrite($efh, $err);
}
$i++;
}
// Totals for Est_Costs...
fwrite($efh, "\n-------------------\n $i Est_Cost records imported.\n\n");
fclose($fh);
// Open "Actual Cost" data file...
$fh = fopen($actFile, 'r');
// Error file header...
fwrite($efh, "Starting Actual_Cost imports...\n-------------------\n");
$i = 1;
// Import Actual Costs...
while ($Data = fgets($fh))
{
// Format data for import...
$Data = trim($Data);
// remove single-quotes
$Data = str_replace("'", "", $Data);
// remove commas
$Data = str_replace(",", "", $Data);
// remove newlines
$Data = str_replace("\n", "", $Data);
// replace tabs with commas
$Data = str_replace("\t", ",", $Data);
// Load formatted fields into an array...
$str_array = explode(",", $Data);
// Get and validate fields from array...
$location_id = (int)trim($str_array[0]);
$job_number = trim($str_array[1]);
$bus_segment = trim($str_array[2]);
$cost_type = trim($str_array[3]);
$phase_code = trim($str_array[4]);
$act_cost = (float)trim($str_array[5]);
$pm = trim($str_array[6]);
$cost_id = $location_id.$job_number.$bus_segment.$cost_type.$phase_code;
// Skip rows with 0 cost...
if ($act_cost == 0)
continue;
// Insert data...
$query = "INSERT INTO ActualCosts_Raw ".
"(cost_id, job_number, bus_segment, cost_type, phase_code, act_cost, pm) VALUES ".
"('$cost_id', '$job_number', '$bus_segment', '$cost_type', '$phase_code', '$act_cost', '$pm')".
"ON DUPLICATE KEY UPDATE act_cost = act_cost + $act_cost";
$result = mysql_query($query, $link);
// Check for errors...
if (mysql_error() != "")
{
$err = "Error: " . mysql_error() . "\n" . $query . "\n\n";
fwrite($efh, $err);
continue;
}
$i++;
}
// Totals for Est_Costs...
fwrite($efh, "\n-------------------\n $i Est_Cost records imported.\n\n");
fclose($fh);
// Merge ActualCost_Raw data into Est_vs_Actual_Costs table where matches are found...
$query = "INSERT INTO Est_vs_Actual_Costs ".
"(cost_id, job_number, bus_segment, cost_type, phase_code, act_cost, pm) ".
"SELECT cost_id, job_number, bus_segment, cost_type, phase_code, act_cost, pm FROM ActualCosts_Raw ".
"ON DUPLICATE KEY UPDATE Est_vs_Actual_Costs.act_cost = ActualCosts_Raw.act_cost";
mysql_query($query, $link);
// Check for update errors...
if (mysql_error() != "")
{
$err = "Error merging actual cost data: " . mysql_error() . "\n" . $query . "\n\n";
fwrite($efh, $err);
}
// Update the Variance column...
$query = "UPDATE Est_vs_Actual_Costs SET variance = est_cost - act_cost";
$result = mysql_query($query, $link);
// Check for update errors...
if (mysql_error() != "")
{
$err = "Error updating variances: " . mysql_error() . "\n" . $query . "\n\n";
fwrite($efh, $err);
}
fclose($fh);
fclose($efh);
?>
And a little sample data from the final table for you:
cost_id location_id job_number bus_segment cost_type phase_code est_cost act_cost variance pm
101101BN7004BC051000 101 101BN7004 B C 51000 73.08 92.91 -19.83 MCK
There are around 30,000 rows of data in both the final and temp tables before the merge, and around 32,000 rows in the final table after the merge.