Hello,
I have about 25 CSV files, with a total of about 500,000 records, that I have to do a couple of things with.
I have currently been reading the files into an array with fgetcsv. After doing this, I set up a while loop where I select certain parts of each CSV file (since each of the 25 files includes different fields) and group them together one by one by assigning them to variables within the loop. Then I do an insert with those variables, which essentially writes each of these modified rows of data into the MySQL database one by one.
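The insert inside the loop looks roughly like this (simplified; the table and column names here are just placeholders, and I'm on the old mysql_* functions):

// simplified per-row insert (placeholder table/column names, assumes an open connection)
$name   = mysql_real_escape_string($record[0]);
$amount = mysql_real_escape_string($record[2]);
mysql_query("INSERT INTO records (name, amount) VALUES ('$name', '$amount')");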
The huge problem with this is the time it takes. For 500,000 rows it is taking over 24 hours using Apache 2 and PHP 4.3.8. It obviously takes the longest when it is reading the larger files. For example, there is one 20 MB CSV file and it literally takes "all day".
My question is this: is my approach the fastest way to accomplish this task? It definitely works, but it is SLOW. Would LOAD DATA INFILE be any faster? The problem is that I would have to normalize the data after it is in the database, which could be a bit tricky. Also, from what I have read, I am not even sure it is faster than row-by-row inserts.
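For reference, I imagine the LOAD DATA approach would look roughly like this, loading each file into a staging table and normalizing it afterwards (table and column names are just placeholders; the pipe delimiter matches my files):

// rough sketch: load one file into a staging table, normalize with SQL afterwards
// (placeholder table/column names)
mysql_query("LOAD DATA LOCAL INFILE 'filename.txt'
             INTO TABLE staging_records
             FIELDS TERMINATED BY '|'
             LINES TERMINATED BY '\\n'
             (field1, field2, field3)");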
Does anyone else have any thoughts on how to read in the different CSV files, use PHP to make them similar so they "fit" into the same table/database, and then insert the contents of these CSVs (or modified CSVs, more accurately) into the table at the fastest rate?
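For example, would batching rows into multi-row INSERT statements be significantly faster than one INSERT per row? Something like this (just a sketch; table/column names are placeholders and it assumes an open connection):

// sketch: batch rows into multi-row INSERTs instead of one query per row
$values = array();
foreach ($bcs as $record) {
    $name   = mysql_real_escape_string($record[0]);
    $amount = mysql_real_escape_string($record[2]);
    $values[] = "('$name', '$amount')";

    if (count($values) >= 500) {   // flush every 500 rows
        mysql_query("INSERT INTO records (name, amount) VALUES " . implode(",", $values));
        $values = array();
    }
}
if (count($values) > 0) {          // flush the remainder
    mysql_query("INSERT INTO records (name, amount) VALUES " . implode(",", $values));
}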
Further, I've done some tests.
I have used code to disable the indexes and then re-enable them at the end of the script to speed up the inserts. The inserts aren't really the problem now; I believe the slowdown is in the actual reading of the file into the array with fgetcsv.
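In case it matters, this is roughly what I used around the inserts (assuming a MyISAM table; 'records' is a placeholder name):

// disable non-unique index updates before the bulk insert, rebuild them after
mysql_query("ALTER TABLE records DISABLE KEYS");
// ... do all the inserts ...
mysql_query("ALTER TABLE records ENABLE KEYS");

Here is the read loop itself: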
$fn = "filename.txt";
$bcs = array();
$id = fopen($fn, "r"); //open the file
$size = filesize($fn) +1;
while (($record = fgetcsv($id, $size, "|")) !== FALSE) { //start a loop
$bcs[] = $record;
}
$count1 = (count($bcs));
// read file into main array
// set amount of times to go through the loop
$print = $count1;
echo $print . "\n";
$row = 0;
$a = 0;
do { assign parts of the array into variables & do inserts here } while (++$a < $print);
For some reason, especially with files around 50 MB, this goes very slowly. The slowdown appears to be somewhere in the code I've cited. Is it because the files are not broken into chunks and are taking up a lot of memory, or something to that effect? If so, how would I go about reading a file in chunks, and would that do the trick? This method does work; it's just VERY slow.
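Would something like this be the right idea, processing each row as it is read instead of building the whole array first? (Again, just a sketch, with the insert part left out and an assumed 8192-byte maximum line length:)

// sketch: handle each row as it is read, instead of loading the whole
// file into $bcs first (keeps memory use roughly constant per file)
$id = fopen($fn, "r");
while (($record = fgetcsv($id, 8192, "|")) !== FALSE) {
    // assign parts of $record to variables and do the insert here
}
fclose($id);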
Also, I've tried eliminating the do-while and timing that loop. I did that successfully, which eliminated the need for the count() function, but it didn't do much to speed things up. Again, I believe the slowdown is somewhere in reading the file. I'm sure the experts here know more about this, however; this was just some very light testing I did to try to speed it up. I am fairly certain the database structure is not the problem.
Thank you VERY much in advance!