random line out of a large file

robertaccettura

Previously I've worked with pretty small text files (under 50k). Now I've got to work with some large ones (200k to a few MB or more).

The files contain somewhat random data on each line:

foo
bar
dog
cat
monkey
love

I need to get 1 random line.

The method I've always used is just to open the file, explode by returns, and generate a random number for the array index.

That's not a good idea here, as an array containing several MB of data would be a bad thing 🙁

So what should I do?

dalecosp

Could you use fseek to start at a random point in the file?

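// $somevarhere is a placeholder for the upper bound of the byte offset
$numpoint = rand(0, $somevarhere);

$fp = fopen($filename, "r");
fseek($fp, $numpoint);               // jump to the random byte offset
$content = fread($fp, $somevalue);   // read $somevalue bytes from that point
fclose($fp);

// then, do your usual stuff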

So, you first randomly grab a section of the file, then explode and pick a random line from the random section?

Determining how to set the start point randomly escapes me at the moment, though...there would have to be some way of knowing how the file's size relates to the number of lines therein ... 😕
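One way the idea above might be sketched, assuming the start point is simply a random byte offset taken from filesize() rather than anything tied to the line count. The function name randomLineFromChunk() and the $chunkSize parameter are made up for illustration. The pieces at either end of the chunk are usually partial lines, so they get thrown away. Blank lines are skipped too, which is an assumption about the data.

```php
<?php
// Hypothetical helper: pick a random complete line from a randomly placed chunk.
// Returns null when the chunk happens to contain no complete line (caller may retry).
function randomLineFromChunk(string $filename, int $chunkSize = 4096): ?string
{
    clearstatcache(true, $filename);
    $size = filesize($filename);
    if ($size === false || $size === 0) {
        return null;
    }
    $fp = fopen($filename, 'r');
    if ($fp === false) {
        return null;
    }

    $start = mt_rand(0, $size - 1);   // random byte offset, not a line number
    fseek($fp, $start);
    $content = (string) fread($fp, $chunkSize);
    fclose($fp);

    $lines = explode("\n", $content);
    if ($start > 0) {
        array_shift($lines);          // probably the tail of a line we landed inside
    }
    if ($start + $chunkSize < $size) {
        array_pop($lines);            // probably cut off by the end of the chunk
    }

    // Strip stray "\r" and drop empty pieces (assumes blank lines are not wanted)
    $lines = array_values(array_filter(
        array_map('rtrim', $lines),
        static fn(string $l): bool => $l !== ''
    ));
    if ($lines === []) {
        return null;
    }
    return $lines[mt_rand(0, count($lines) - 1)];
}
```

This only ever holds one chunk in memory, but the lines are not all equally likely to come up, and a line longer than the chunk can never be returned.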

robertaccettura

fseek() is something I never really played with, but seems like the perfect answer.

The problem of course is knowing how many lines are in the file. I don't see how that could be done with file size, since the lines can all be different lengths.

😕

Weedpacket

Originally posted by dalecosp
Could you use fseek to .... randomly grab a section of the file, then explode and pick a random line from the random section?

Rather than read the whole section of the file, you could fseek() to a random point in the file, fgets() to get to the next linebreak, and then fgets() the single line after that.
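A minimal sketch of that, assuming one entry per line. The function name randomLineBySeek() is made up for illustration. One detail the description leaves open is what happens when the seek lands inside the last line: the second fgets() then returns false, so this sketch wraps around to the first line, which would otherwise never be picked.

```php
<?php
// Hypothetical helper: seek to a random byte, skip to the next line break,
// and return the line that follows.
function randomLineBySeek(string $filename): ?string
{
    clearstatcache(true, $filename);
    $size = filesize($filename);
    if ($size === false || $size === 0) {
        return null;
    }
    $fp = fopen($filename, 'r');
    if ($fp === false) {
        return null;
    }

    fseek($fp, mt_rand(0, $size - 1));
    fgets($fp);                 // discard the remainder of the line we landed in
    $line = fgets($fp);         // the next complete line
    if ($line === false) {      // we landed in the last line: wrap to the first
        rewind($fp);
        $line = fgets($fp);
    }
    fclose($fp);

    return $line === false ? null : rtrim($line, "\r\n");
}
```

Only two lines are ever read, so memory use stays flat no matter how big the file is.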

The big drawback here is that some lines would be picked more often than others. Specifically, if "somereallylongword" is followed by "aword", and "short" is followed by "anotherword", then "aword" would be more likely to be picked than "anotherword", because the fseek() is roughly three times as likely to land in the middle of "somereallylongword" as in the middle of "short". This could be avoided by padding each line out until they're all the same length, but that (a) bulks out the size of the file, and (b) means examining the file to find the longest line.
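The padding idea could look something like the sketch below. The names buildPaddedFile() and randomPaddedLine() are made up for illustration, and it assumes no line has trailing spaces that matter, since the padding is stripped off again with rtrim(). With every record the same length, the record count is just the file size divided by the record length, so each line is equally likely.

```php
<?php
// Hypothetical helper: write a copy of $source in which every line is padded
// with spaces to the length of the longest line. Returns the record length
// (longest line plus the newline).
function buildPaddedFile(string $source, string $target): int
{
    $in = fopen($source, 'r');
    $width = 0;
    while (($line = fgets($in)) !== false) {      // pass 1: find the longest line
        $width = max($width, strlen(rtrim($line, "\r\n")));
    }
    rewind($in);
    $out = fopen($target, 'w');
    while (($line = fgets($in)) !== false) {      // pass 2: write padded records
        fwrite($out, str_pad(rtrim($line, "\r\n"), $width) . "\n");
    }
    fclose($in);
    fclose($out);
    return $width + 1;
}

// Hypothetical helper: jump straight to a random record and read it.
function randomPaddedLine(string $target, int $recordLength): ?string
{
    clearstatcache(true, $target);
    $count = $recordLength > 0 ? intdiv((int) filesize($target), $recordLength) : 0;
    if ($count === 0) {
        return null;
    }
    $fp = fopen($target, 'r');
    fseek($fp, mt_rand(0, $count - 1) * $recordLength);
    $record = fread($fp, $recordLength);
    fclose($fp);
    return rtrim($record);                        // strip padding and newline
}
```

The padded copy only has to be rebuilt when the source file changes.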

I thought of keeping and maintaining an auxiliary index file to track file offsets where each line began - but to make the idea worthwhile was a mission; since the index file itself would be large it would need its own index file and so on - eventually the index turned into a btree and the whole idea turned into "implement a database".
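For the narrower job of only picking a random line from a file that rarely changes, a flat index might be enough. The sketch below stores each line's starting offset as a fixed 4-byte value, so entry number i sits at byte 4*i of the index and can be found by arithmetic. The names buildLineIndex() and randomIndexedLine() are made up for illustration, and the 32-bit offsets assume the data file is under 4 GB. Keeping such an index in step with inserts and deletes is where the trouble described above starts; this sketch just rebuilds it from scratch.

```php
<?php
// Hypothetical helper: write the byte offset of every line start to $indexFile
// as unsigned 32-bit big-endian integers. Returns the number of lines indexed.
function buildLineIndex(string $dataFile, string $indexFile): int
{
    $in = fopen($dataFile, 'r');
    $out = fopen($indexFile, 'w');
    $count = 0;
    while (true) {
        $offset = ftell($in);           // where the next line begins
        if (fgets($in) === false) {
            break;                      // end of file
        }
        fwrite($out, pack('N', $offset));
        $count++;
    }
    fclose($in);
    fclose($out);
    return $count;
}

// Hypothetical helper: pick a random index entry, then read that one line.
function randomIndexedLine(string $dataFile, string $indexFile): ?string
{
    clearstatcache(true, $indexFile);
    $count = intdiv((int) filesize($indexFile), 4);
    if ($count === 0) {
        return null;
    }
    $idx = fopen($indexFile, 'r');
    fseek($idx, 4 * mt_rand(0, $count - 1));
    $offset = unpack('N', fread($idx, 4))[1];
    fclose($idx);

    $fp = fopen($dataFile, 'r');
    fseek($fp, $offset);
    $line = fgets($fp);
    fclose($fp);
    return $line === false ? null : rtrim($line, "\r\n");
}
```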

robertaccettura

I guess my other option could be to import the files into MySQL tables, and scan those:

Something like:

  

$sql = "SELECT * FROM tablename
        ORDER BY RAND()";

Is that more efficient? There might be more overhead from opening and closing the socket, and from the MySQL process.

My goal is efficiency.

Weedpacket

Considering the time it would take to read a multi-megabyte file (and considering the memory consumption limits imposed by PHP and/or the web server), the database overhead is probably the smaller cost. You can use a persistent connection if the few milliseconds' difference is important, and as far as the process time goes, well, I guess something has to take the time.

But the query is not that great. Find out how many rows there are in tablename, and select a random one of those instead of randomising the entire table and then selecting all of the rows - that would be a big performance hit.

$offset = mt_rand(0, $tablesize - 1);
$sql = "SELECT word FROM tablename LIMIT $offset, 1";

Where I guess $tablesize was determined with a SELECT COUNT(*) in an earlier query. The random offset is worked out in PHP because MySQL's LIMIT only takes integer constants, not an expression like FLOOR(RAND()*tablesize).
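Put together, the two queries might look like the sketch below. It assumes a PDO connection that throws exceptions on errors (the default in PHP 8) and the table and column names used above. The function name randomWord() is made up for illustration. The offset is computed in PHP because MySQL's LIMIT takes integer constants rather than expressions.

```php
<?php
// Hypothetical helper: count the rows, then fetch the single row at a random offset.
function randomWord(PDO $pdo): ?string
{
    $tablesize = (int) $pdo->query('SELECT COUNT(*) FROM tablename')->fetchColumn();
    if ($tablesize === 0) {
        return null;
    }

    // $offset is an integer generated here, never user input, so it is
    // safe to interpolate directly into the query string.
    $offset = mt_rand(0, $tablesize - 1);
    $word = $pdo->query("SELECT word FROM tablename LIMIT $offset, 1")->fetchColumn();

    return $word === false ? null : (string) $word;
}

// Usage (connection details are placeholders):
// $pdo = new PDO('mysql:host=localhost;dbname=mydb', $user, $password);
// echo randomWord($pdo);
```

If the row count changes rarely, it could be cached between requests so that only the second query runs each time.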

robertaccettura

Yea, I've made up my mind. That's the best way.

Thanks for the help.