[RESOLVED] csv file + data structures - algorithm?

h4x0rmx

I need your ideas to come up with the best solution to this problem.
I have a CSV file with information on when employees clock in and clock out. The data is separated by days, so if they work overnight it will appear as if they had punched out at 00:00 and punched in again at 00:00 of the following day:

Id	Name	Start_date	start_time	end_date	end_time
01	Ernie	20090331	19:30		20090331	00:00
01	Ernie	20090401	00:00		20090401	04:30

Also, there's a shift differential, which means that if they work during a certain time of the day they get paid more. Let's say that if they work between 20:00 and 08:00 they get paid double, but only for the hours inside that window. So an employee can be scheduled to work from 05:00 to 13:00. In order for them to get paid the shift differential, they have to clock out at 08:00 [or 20:00, as the case may be] and clock in again immediately:

02	Rachel	20090321	05:00		20090321	08:00
02	Rachel	20090321	08:00		20090321	13:00

When I read the CSV file I have to store the data in the DB. This information is going to be compared against their scheduled shifts. Whenever an employee works overnight and/or has a shift differential, I have to be able to combine the split records and store them as a single entry:

01	Ernie	20090331	19:30		20090401	04:30
02	Rachel	20090321	05:00		20090321	13:00

It would be somewhat easy if the data was ordered as I have presented it, but it is not:

Id	Name	Start_date	start_time	end_date	end_time
08	Michael	20090301	12:00		20090301	15:00
02	Rachel	20090321	08:00		20090321	13:00
01	Ernie	20090401	00:00		20090401	04:30
23	John	20090301	08:00		20090301	13:00
02	Rachel	20090321	05:00		20090321	08:00
52	Ben	20090205	10:33		20090205	13:58
01	Ernie	20090331	19:30		20090331	00:00
99	Dan	20090129	20:56		20090129	00:00

Any ideas on how to go about this?
I have to keep in mind that the CSV file usually has 500+ entries, and that while I'm reading and parsing the file I also have to store the parsed data in the DB without timing out the script.

Please give me any ideas or suggestions?
Any data structures on php that would help me?

Weedpacket

01	Ernie	20090331	19:30		20090331	00:00

First problem is that the ending date is wrong (unless Ernie managed to finish before he started).

h4x0rmx wrote:
at the same time I'm reading the file and parsing it, I have to store the parsed data on the DB

Or you could read all the data at once, fix it, then write it all to the database.

h4x0rmx wrote:
without timing out the script.

500 records shouldn't take too long (I just tried the below code for 1600 records and it ran in under two seconds). Besides, even if it did take a long time it's possible to adjust timeout whenever necessary.
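For what it's worth, raising the limit and timing a run only takes a couple of lines. This is just a sketch: the 120 seconds is an arbitrary figure picked for illustration, and the comment in the middle stands in for whatever parsing and storing the script actually does.

```php
<?php
// Allow this request up to 120 seconds (arbitrary figure); 0 would mean "no limit".
set_time_limit(120);

$t0 = microtime(true);

// ... read, fix and store the records here ...

// Report how long the run took, to see how close it gets to the limit.
$elapsed = microtime(true) - $t0;
printf("Import took %.2f seconds\n", $elapsed);
```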

You don't say what the ID relates to. Name? Shift? What I am going to assume is that two records with the same ID belong to the same shift by the same person.

I reckon this could all be done in the database using a temporary table for scratch space.

This does the first part (the shift differential stuff depends on information not given). All it really does is fix those broken end dates and join the pieces that were split over midnight. I've got it writing a new CSV file for the sake of having something to look at. Also, as it happens, it changes the timestamps into a format that would be easier to insert into a database's date/time type.

/* All standard disclaimers about forum-posted example code apply */

// Read the CSV (tab-delimited)
$csv = fopen('junk.csv','rb');
$lines = array();
while(($line = fgetcsv($csv,0,"\t")) !== false)
{
	$lines[] = $line;
}
fclose($csv);

// Process
$records = array();
foreach($lines as $line)
{
	// Skip blank lines and any row that doesn't have all seven fields.
	if(count($line) < 7) continue;

	// $junk because there are _two_ field delimiters between starttime and enddate
	list($id,$name,$startdate,$starttime,$junk,$enddate,$endtime) = $line;

	// Fix dates and join.
	$start = strtotime("$startdate $starttime");
	$end = strtotime("$enddate $endtime");
	// the ending dates are wrong at midnight.
	if($end<$start) $end = strtotime('+1 day', $end);
	$start = date('Y-m-d H:i', $start);
	$end = date('Y-m-d H:i', $end);
	echo "$id\t$name\t$start\t$end\n";

	if(isset($records[$id][$name]))
	{
		// This record has been split. Find when the existing part started and ended, and see how
		// the new piece relates.
		$found_start = $records[$id][$name]['start'];
		$found_end = $records[$id][$name]['end'];
		if($found_end == $start)
		{
			// The new piece follows the stored one.
			$records[$id][$name]['end'] = $end;
		}
		elseif($found_start == $end)
		{
			// The new piece precedes the stored one.
			$records[$id][$name]['start'] = $start;
		}
		// A piece that touches neither end of the stored record is ignored.
	}
	else
	{
		$records[$id][$name] = array('start'=>$start, 'end'=>$end);
	}
}

ksort($records);

// Write a fixed CSV
$csv = fopen('fixed.csv','wb');
foreach($records as $id=>$name_record)
{
	foreach($name_record as $name=>$times)
	{
		$start = $times['start'];
		$end = $times['end'];
		fputcsv($csv, array($id, $name, $start, $end), "\t");
	}
}
fclose($csv);

Using a temporary table in the db, and SQL to do further processing could take some of the load off the application.

h4x0rmx

I appreciate the time you took to answer my question...

Weedpacket;10909504 wrote:
01	Ernie	20090331	19:30		20090331	00:00
First problem is that the ending date is wrong (unless Ernie managed to finish before he started).

The ending is indeed wrong, but that's how it is stored on the CSV file... if the end_time is 00:00 I must assume that it's midnight of the current day.
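That rule can be applied while each row is parsed. Here is a minimal sketch, assuming the dates and times arrive exactly as in the samples (`Ymd` and `H:i`); the function name `punchInterval` is made up for illustration. It only shifts the end forward a day when the end time is 00:00 and the end would otherwise not come after the start.

```php
<?php
/**
 * Turn one CSV row's date/time fields into a [start, end] pair.
 * An end_time of 00:00 on the start date is read as midnight at the
 * END of that day, i.e. 00:00 of the following day.
 */
function punchInterval(string $startDate, string $startTime, string $endDate, string $endTime): array
{
    // The leading "!" resets every field that the format does not mention.
    $start = DateTimeImmutable::createFromFormat('!Ymd H:i', "$startDate $startTime");
    $end   = DateTimeImmutable::createFromFormat('!Ymd H:i', "$endDate $endTime");
    if ($start === false || $end === false) {
        throw new InvalidArgumentException("Cannot parse: $startDate $startTime / $endDate $endTime");
    }

    // Midnight stored against the current day: move it to the next day.
    if ($endTime === '00:00' && $end <= $start) {
        $end = $end->modify('+1 day');
    }
    return [$start, $end];
}

// Ernie's first row from the sample data.
[$start, $end] = punchInterval('20090331', '19:30', '20090331', '00:00');
echo $start->format('Y-m-d H:i'), ' to ', $end->format('Y-m-d H:i'), "\n";
```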

Weedpacket;10909504 wrote:
Or you could read all the data at once, fix it, then write it all to the database.
500 records shouldn't take too long (I just tried the below code for 1600 records and it ran in under two seconds). Besides, even if it did take a long time it's possible to adjust timeout whenever necessary.

The current CSV files that I'm working with have an average of 3,000 records (not 500 like I said)... the old script timed out when trying to parse such files, maybe because it was doing too much extra stuff.

Weedpacket;10909504 wrote:
You don't say what the ID relates to. Name? Shift? What I am going to assume is that two records with the same ID belong to the same shift by the same person.

The ID relates to the employee (all the records of a single employee will have the same ID)... I assume that I can then use $records[$id] instead of $records[$id][$name] in the code you provided, right?
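One thing to watch with that: keying on the ID alone leaves room for only one record per employee, so a file that covers several days would lose every shift that doesn't touch the first one stored. A possible way around it, as a rough sketch only, is to sort the punches by employee and start time and then merge any punch that starts exactly when the previous one ended. The function name `mergePunches` and the array layout are assumptions for illustration; the start and end values are expected to be 'Y-m-d H:i' strings with the midnight end dates already fixed, as in the code above.

```php
<?php
/**
 * Merge punches of the same employee that abut each other
 * (one ends at exactly the moment the next one starts).
 * Each punch is ['id' => string, 'name' => string, 'start' => 'Y-m-d H:i', 'end' => 'Y-m-d H:i'].
 */
function mergePunches(array $punches): array
{
    // Sort by employee ID, then by start; 'Y-m-d H:i' strings sort chronologically.
    usort($punches, function (array $a, array $b): int {
        return strcmp($a['id'], $b['id']) ?: strcmp($a['start'], $b['start']);
    });

    $merged = [];
    foreach ($punches as $punch) {
        $last = count($merged) - 1;
        if ($last >= 0
            && $merged[$last]['id'] === $punch['id']
            && $merged[$last]['end'] === $punch['start']) {
            // Continuation of the previous piece: extend it.
            $merged[$last]['end'] = $punch['end'];
        } else {
            // A new shift (or a new employee).
            $merged[] = $punch;
        }
    }
    return $merged;
}
```

Because the list is sorted first, the order of the rows in the CSV file stops mattering, and a shift split into three or more pieces is joined the same way as one split in two.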

Weedpacket;10909504 wrote:
I reckon this could all be done in the database using a temporary table for scratch space.
...
Using a temporary table in the db, and SQL to do further processing could take some of the load off the application.

Using a temporary table sounds like a good idea, but I've never done it. When you mention temporary tables all the complicated things that you have to do on an Oracle DB come to my mind... but since I'm using MySQL it'd seem a lot easier? How do you do it?
(This is my idea...)
Have an empty table to store all the information on the CSV file
Manipulate the info (sorting, etc) with queries and do the parsing
Move it to the real table
Empty the temporary table
Is this more or less what I've got to do? Is there a way to create a temporary table on the fly with pure SQL?
Any more ideas/suggestions?
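
For reference, MySQL can create a scratch table on the fly with CREATE TEMPORARY TABLE. Such a table is visible only to the connection that created it and is dropped automatically when that connection closes, so two people running the import at once would not see each other's rows, and nothing is left behind if the script dies. The steps listed above might look roughly like this through PDO. Everything named here is hypothetical: the function `importViaScratchTable`, the tables `punch_scratch` and `punches`, and the column names are placeholders, and the "manipulate" step is left as a comment.

```php
<?php
/**
 * Load parsed rows into a temporary table, then copy them to the real table.
 * Hypothetical names: punch_scratch (scratch) and punches (real table).
 * Each row is ['id' => ..., 'name' => ..., 'start' => ..., 'end' => ...].
 * Returns the number of rows copied into the real table.
 */
function importViaScratchTable(PDO $pdo, array $rows): int
{
    // Make every failed statement throw instead of failing silently.
    $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    // 1. Scratch table, private to this connection.
    $pdo->exec(
        'CREATE TEMPORARY TABLE punch_scratch (
            emp_id   VARCHAR(16) NOT NULL,
            emp_name VARCHAR(64) NOT NULL,
            start_at DATETIME    NOT NULL,
            end_at   DATETIME    NOT NULL
        )'
    );

    // 2. Store the CSV rows in it.
    $insert = $pdo->prepare(
        'INSERT INTO punch_scratch (emp_id, emp_name, start_at, end_at) VALUES (?, ?, ?, ?)'
    );
    foreach ($rows as $row) {
        $insert->execute([$row['id'], $row['name'], $row['start'], $row['end']]);
    }
    $insert = null; // release the statement before the table is dropped

    // 3. Any sorting/fixing queries against punch_scratch would go here.

    // 4. Move the result to the real table.
    $moved = $pdo->exec(
        'INSERT INTO punches (emp_id, emp_name, start_at, end_at)
         SELECT emp_id, emp_name, start_at, end_at FROM punch_scratch'
    );

    // 5. Drop the scratch table (it would also vanish when the connection closes).
    $pdo->exec('DROP TABLE punch_scratch');

    return (int) $moved;
}
```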

h4x0rmx

One more question... if I decide to use a temporary table, how should I handle the concurrency?
Usually there's only a couple of users that will have rights to this script, but there's still a risk... and what if the script doesn't end correctly and the table is not left empty?
Are there any other cases I should be aware of?
How should I handle them?

Thanks!

scrupul0us

h4x0rmx;10909524 wrote:
One more question... if I decide to use a temporary table, how should I handle the concurrency?
Usually there's only a couple of users that will have rights to this script, but there's still a risk... and what if the script doesn't end correctly and the table is not left empty?
Are there any other cases I should be aware of?
How should I handle them?

Thanks!

You could get spiffy and use locks and transactions. While the script is running, the table is locked for that transaction; if the script fails, no COMMIT is ever passed to the DB, so the transaction never completes and everything is rolled back.

h4x0rmx

scrupul0us;10909540 wrote:
you could get spiffy and use locks and transactions... while a script is running the table is locked for that transaction... if the script fails, then there is no COMMIT passed to the DB and the transaction never completes and everything is rolled back

Can I do this on MySQL?
Any code examples?
Thanks!

scrupul0us

http://www.databasejournal.com/features/mysql/article.php/3382171/Transactions-in-MySQL.htm

That's a good starting point, but googling for "MySQL transactions" will turn up plenty more.
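
As a rough idea of what it looks like from PHP, here is a minimal sketch using PDO. It assumes the table uses a transactional storage engine (InnoDB in MySQL; MyISAM tables ignore transactions), and the function name `storePunches`, the table `punches` and its columns are hypothetical placeholders.

```php
<?php
/**
 * Insert all rows inside one transaction: either every row is stored or none is.
 * Hypothetical table: punches(emp_id, emp_name, start_at, end_at).
 */
function storePunches(PDO $pdo, array $rows): void
{
    // Make failed statements throw so the catch block below sees them.
    $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    $insert = $pdo->prepare(
        'INSERT INTO punches (emp_id, emp_name, start_at, end_at) VALUES (?, ?, ?, ?)'
    );

    $pdo->beginTransaction();
    try {
        foreach ($rows as $row) {
            $insert->execute([$row['id'], $row['name'], $row['start'], $row['end']]);
        }
        // Only reached if every insert succeeded.
        $pdo->commit();
    } catch (Throwable $e) {
        // Undo whatever was inserted before the failure, then pass the error on.
        $pdo->rollBack();
        throw $e;
    }
}
```

If the script is killed outright before commit() is reached, the connection closes with the transaction still open and the database discards the uncommitted rows on its own.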