PHP5 - Large text files comparison.

robroy83

Hello everybody.

I would like to use PHP to make a script which compares two large text files (~500MB each). These files look like:

^L
RecordA

FieldA1 ValueA1
FieldA2 ValueA2
FieldA3 ValueA3

^L
^L
RecordB

FieldB1 ValueB1
FieldB2 ValueB2
FieldB3 ValueB3

^L
...

"^L" is Form Feed (0c hex).

In the above example there are two "records", each one with 3 fields. The order of the "records" differs between the two files being compared.

As I see it now I can:
- read both files to memory to two distinct tables, an element of a table would be "record" (data from ^L to ^L);
- compare elements of those tables based on Field/Value pairs.

Questions:
1) How to read input file into a table so that first element would be data from ^L to ^L, e.g.:

RecordA

FieldA1 ValueA1
FieldA2 ValueA2
FieldA3 ValueA3

and second element would be:

RecordB

FieldB1 ValueB1
FieldB2 ValueB2
FieldB3 ValueB3

How to divide input file based on "^L" and put every part of it into table?

2) What tool could you suggest to compare elements of the above table character by character? Maybe the Unix diff command?

3) Do you know about any standard problems when such large files (2*~500MB) are read to memory simultaneously?

Thank you very much for any help 🙂

Warm regards,
Rafał.

johanafm

3) If you have enough memory, it should be possible to do this. "Enough memory" does not simply mean 1GB free before your script starts, since other things need resources as well. This matters especially if you expect to handle your data with built-in PHP functions such as explode: you will end up with two copies of your data (string + array), thus doubling memory requirements. Depending on the algorithms you use, or those of the built-in functions you call, considerably more than this might be needed.

But, if you're not using a production machine, you can always try and see what happens. Worst case scenario is that your script runs out of memory.
If you're running your script via a web server, max execution time has to be altered to allow for long duration as well, and I'd rather recommend you run this from CLI.

But there are various ways to deal with reading the data in chunks.
file_get_contents lets you specify max number of characters to read, which you could set at something more reasonable than half a gig, e.g. 5MB.
But, unless you have fixed size entries, you can't know if you end up with whole records, and as such you'd have to perform a string search to find these. Something along the lines of

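$fh = fopen($path, 'rb');	# $path: your input file
$nextPart = '';
while (!feof($fh)) {
	# read next 5 MB of data (fread on an open handle does the same job here)
	$str = $nextPart . fread($fh, 5 * 1024 * 1024);

	$pos = strrpos($str, chr(12));		# last occurrence of ^L
	if ($pos === false) {			# no ^L in this chunk yet: keep reading
		$nextPart = $str;
		continue;
	}
	$nextPart = substr($str, $pos);		# from last occurrence of ^L
	$str = substr($str, 0, $pos);		# up to last occurrence of ^L

	# parse each record in $str and put it into DB. (see 1)
}
fclose($fh);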

Another way would be to use fgets to read one line at a time from the file. Parse each line as you go. Every second ^L closes a record: when you reach it, insert the record into the DB and continue. No problems parsing in this way.
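The fgets approach could look roughly like the sketch below. readNextRecord() is a made-up helper name, and the demo reads from an in-memory stream rather than a real file. It assumes each ^L sits on a line of its own, as in the sample from the first post, and it simply drops empty lines.

```php
<?php
# Hypothetical helper: returns the non-empty lines of the next record as an
# array, or null once the file is exhausted.
function readNextRecord($handle)
{
    $lines = array();
    $inRecord = false;
    while (($line = fgets($handle)) !== false) {
        $line = rtrim($line, "\r\n");
        if ($line === chr(12)) {
            if ($inRecord) {
                return $lines;      # closing ^L: the record is complete
            }
            $inRecord = true;       # opening ^L: start collecting lines
            continue;
        }
        if ($inRecord && $line !== '') {
            $lines[] = $line;
        }
    }
    return null;
}

# Demo on an in-memory stream standing in for the real 500MB file.
$fh = fopen('php://memory', 'w+b');
fwrite($fh, "\x0c\nRecordA\n\nFieldA1 ValueA1\nFieldA2 ValueA2\n\n\x0c\n"
          . "\x0c\nRecordB\n\nFieldB1 ValueB1\n\n\x0c\n");
rewind($fh);

$records = array();
while (($record = readNextRecord($fh)) !== null) {
    $records[] = $record;   # in the real script: insert into the DB here
}
fclose($fh);
```

Only one record is held in memory at a time, so the file size stops being a concern.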

1.
If you go with the "read large chunks of data" approach, you will need to parse it as well.

$str = 'chunk of data, starting and ending with ^L, with 2 ^L between each record';
$str = trim($str, chr(12) . "\r\n");	# strip form feeds and line breaks from start and end of string
# in your sample the two ^L between records sit on separate lines, so allow whitespace between them
$records = preg_split('/\x0c\s*\x0c/', $str);

foreach ($records as $r) {
	# $r is a string where each line is a field value pair from the current record.
}
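Inside that loop, each record string still has to be split into field/value pairs. A minimal sketch, with parseRecord() as a made-up name: it assumes the first non-empty line is the record name (as in the sample from the first post) and that every other line is "Field Value", separated by the first run of whitespace.

```php
<?php
# Hypothetical helper: turns one record string into
# array('name' => ..., 'fields' => array(field => value)).
function parseRecord($r)
{
    $record = array('name' => null, 'fields' => array());
    foreach (preg_split('/\r\n|\r|\n/', trim($r)) as $line) {
        $line = trim($line);
        if ($line === '') {
            continue;                       # skip blank lines
        }
        if ($record['name'] === null) {
            $record['name'] = $line;        # assumption: first line names the record
            continue;
        }
        # split on the first whitespace only, so values may contain spaces
        $parts = preg_split('/\s+/', $line, 2);
        $record['fields'][$parts[0]] = isset($parts[1]) ? $parts[1] : '';
    }
    return $record;
}

$parsed = parseRecord("\nRecordA\n\nFieldA1 ValueA1\nlocation new york\n\n");
```

If your records have no name line, drop the 'name' branch and treat every line as a field.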

2.
First off, I'm assuming that when you wrote

RecordA

FieldA1 ValueA1
FieldA2 ValueA2

RecordB

FieldB1 ValueB1
FieldB2 ValueB2

"FieldA1" and "FieldB1" would actually be the same identical text, such that the file actually looks like (example)

^L
name john
age 20
location new york
^L
^L
name jane
age 25
location paris
^L

Assuming you put file1 into table t1, file2 into table t2, and that the field names are col1, col2 and col3

To get all records appearing in file 1, but not in file 2
SELECT t1.col1 AS t1c1, t1.col2 AS t1c2, t1.col3 AS t1c3,
t2.col1 AS t2c1, t2.col2 AS t2c2, t2.col3 AS t2c3
FROM t1
LEFT JOIN t2 ON t1.col1 = t2.col1 AND t1.col2 = t2.col2 AND t1.col3 = t2.col3
WHERE t2.col1 IS NULL OR t2.col2 IS NULL OR t2.col3 IS NULL

(Column aliases cannot be used in the ON and WHERE clauses, hence the full t1.col1 / t2.col1 names there.)

And to get all records appearing in file 2 but not in file 1, just swap the tables: use FROM t2 LEFT JOIN t1 and test the t1 columns in the where clause

robroy83

Thank you for the reply.

I thought for some time about my problem. Currently I see two solutions:
1 a) read both files into table1 and table2 in RAM using file() function
1 b) then find 1st record (lines from ^L to ^L) in table1 and search for its equivalent in table2 (the order of records in both tables can vary), that "find" would be based on conjunction of two values from the record in table 1, for example on "ValueA1.ValueA2" (string unique in whole file for a record)
1 c) then parse table2 to find relevant "record" (which can be on a different position than in table1)
1 d) compare fields/values in both records and output a summary of differences
Questions to method 1:
- I think that if file1 is ~500MB and I read it into table1 and then for every record (~10 lines of text) there would be a search in table2 the whole algorithm could be VERY VERY VERY time consuming, what do you think?

2 a) read both files into database tables dbtable1 and dbtable2
2 b) then query the database for one record in dbtable1, find the relevant (based on "ValueA1.ValueA2" (string unique in whole file for a record - primary key)) record in dbtable2
2 c) compare values for all fields in both records in PHP and output the differences

Questions to method 2:
- as I see it now querying both tables could be easier in SQL than in PHP but not faster (when compared to storing records in tables in memory), what do you think about that solution?

Warm regards,
Rafał.

johanafm

First off, how often is this supposed to run? Once? Once per month? Once per day? Once per 5 minutes?
If you only run it once, or very rarely, does it really matter if it is slow? If it is easy to implement in a certain way, why not do this?
If, on the other hand, it has to run rather frequently, and perhaps on a system serving other needs as well, then efficiency becomes an issue.

robroy83;10958603 wrote:
- I think that if file1 is ~500MB and I read it into table1 and then for every record (~10 lines of text) there would be a search in table2 the whole algorithm could be VERY VERY VERY time consuming, what do you think?

Yes. If you are talking about keeping the records in memory as a string, i.e. same as in file, then for each record that does have a related entry in the other "table", you will on average be comparing against half the data, or ~250MB. For each record that does not have this, you will have to go through the whole data set.
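One way around the repeated scan, as a rough sketch only: parse each file once into an array indexed by the record key, so that finding the matching record is a direct array lookup instead of a search. indexByKey() and the sample data are made up for illustration, and the key is assumed to be unique per record, built from two field values as you described.

```php
<?php
# Hypothetical helper: index parsed records by a composite key made of two
# field values (the "ValueA1.ValueA2" idea).
function indexByKey(array $records, $field1, $field2)
{
    $index = array();
    foreach ($records as $fields) {
        $index[$fields[$field1] . '.' . $fields[$field2]] = $fields;
    }
    return $index;
}

# Made-up sample data standing in for the parsed contents of each file.
$table1 = array(
    array('name' => 'john', 'age' => '20', 'location' => 'new york'),
    array('name' => 'jane', 'age' => '25', 'location' => 'paris'),
);
$table2 = array(
    array('name' => 'jane', 'age' => '25', 'location' => 'london'),
    array('name' => 'john', 'age' => '20', 'location' => 'new york'),
);

$index2 = indexByKey($table2, 'name', 'age');
$report = array();
foreach (indexByKey($table1, 'name', 'age') as $key => $fields) {
    if (!isset($index2[$key])) {
        $report[$key] = 'missing in file 2';
        continue;
    }
    # fields whose value differs from (or is absent in) the other record
    $changed = array_diff_assoc($fields, $index2[$key]);
    if ($changed) {
        $report[$key] = $changed;
    }
}
```

Here $report ends up with a single entry for the key jane.25, holding the location value from file 1. The price is that both indexes live in memory at once, so the memory concerns from before still apply.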

robroy83;10958603 wrote:
Questions to method 2:
- as I see it now querying both tables could be easier in SQL than in PHP but not faster (when compared to storing records in tables in memory), what do you think about that solution?

It will probably be faster with a SQL intermediary, at least if you implement your comparison as in how I understood your "method 1". But, if efficiency is important, implement it in all three ways and profile them. Then you will know for sure.
In for example MySQL you can specify ENGINE=memory when you create the tables: http://dev.mysql.com/doc/refman/5.1/en/memory-storage-engine.html

robroy83;10958603 wrote:
2 b) then query the database for one record in dbtable1, find the relevant (based on "ValueA1.ValueA2" (string unique in whole file for a record - primary key)) record in dbtable2
3) compare values for all fields in both records in PHP and output of differences

So, you have a key for each record? And these are consistent across both tables?

If this is so, you could read nothing but these keys from each file, for example with fgets, and store the keys as array keys for direct detection of what's missing:

# $file1 and $file2 are file handles opened with fopen();
# getValueFromRow() stands for your own function that extracts the key from a line
$line = 0;
$f1 = array();
while (($lineData = fgets($file1)) !== false) {
	# If all records have the same number of lines, then you know that
	# only one out of every 5 lines is a key (5 is just an example), starting on line 2
	if (++$line % 5 == 2) {
		# $value will be the primary key
		$value = getValueFromRow($lineData);
		$f1[$value] = true;		# any non-null value will do (isset() is false for null)
		# optionally, $f1[$value] = array( _rest of the record data here_ );
	}
}

$line = 0;
$diff = array();
$f2 = array();
while (($lineData = fgets($file2)) !== false) {
	if (++$line % 5 == 2) {
		$value = getValueFromRow($lineData);
		$f2[$value] = true;
		# same as above. optionally store rest of record data if needed...

		# See what's in $f2, but not $f1
		if (!isset($f1[$value])) {
			$diff[] = $value;
			# or to keep rest of record data
			# $diff[] = $f2[$value]; (assuming you stored the whole record instead of true)
		}
	}
}

# compare what's in $f1 but not $f2, and add to $diff
# this could also be done with a
# foreach ($f1 as $key => $v)
# 	if (!isset($f2[$key])) ...
# but I believe array_diff_key will be more efficient. once again, if in doubt: profile
# array_keys() is needed because array_diff_key returns key => value pairs, while $diff holds keys
$diff = array_merge($diff, array_keys(array_diff_key($f1, $f2)));

# $diff now contains the diff.