Quite often we need to pull a random row out of a table, fast. A lot of new developers do it this way:
select * from table order by random() limit 1;
Seems simple enough. And it works just fine, as long as your table isn't too large. But it doesn't scale well, either in terms of table size or the number of users hitting the system at the same time.
At work, I've got a table that summarizes all the requests our systems see every minute. It has data going back to last August or so: every service (100 or so) times every machine (4 to 8) times every minute since then. Running something like order by random() limit 1 on a table like that will simply never return. In April alone, only 2 weeks in so far, it has 3,046,182 rows.
In the last hour, on a quiet Saturday night, it's gotten about 5,000 rows. Doing the order by random() limit 1 on the last hour's data takes 132 ms. The last 4 hours' data takes 484 ms. 24 hours takes 3 seconds. 48 hours takes 9 seconds. You can see where this is going. What if I needed a random row from the last month or something? And what if I need it several times a second? order by random() is not going to work for that.
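If you're curious where all that time goes, explain analyze makes it plain: PostgreSQL has to visit every row in the range, compute random() for each one, and sort the whole pile just to keep a single winner. A minimal sketch, using the bigtable and lasttime names from the queries further down:
explain analyze select * from bigtable where lasttime > now() - interval '48 hours' order by random() limit 1;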
Here's a generalized method to let me get random entries fast. It's not perfect, but most compromises aren't. First, I'll grab the last 48 hours' worth of ids into a summary table, and randomize them on the way. Since I'm only materializing the id, and not the whole row, the select for that is only about 4 seconds on this machine. Next I do a select count() on the summary table to find out how many rows it has. This takes 178 ms and I can store it. In this case, the summary table has 352,538 rows. We can run these two queries every hour or 5 minutes, or whatever we need, and store the result of that count() somewhere handy. Here's what I'm gonna do:
create table summary (id serial primary key, refid int);
insert into summary (refid) select summaryid from bigtable where lasttime > now() - interval '48 hours' order by random(); -- NOTE: the order by random() here is costly and unneeded... Thanks laserlght
INSERT 0 352538
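The count() step I mentioned above is nothing fancy; it's just a count on the summary table, and the number it returns (the same 352538 psql reported for the insert) is what you cache somewhere handy:
select count(*) from summary; -- returns 352538, cache this for the random lookups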
Now I have a lookup table with a gapless sequence from 1 to 352538 in the id column, and the matching summaryid from the master table in the refid column.
Let's grab a random row now! Using either PHP or your database, you can generate a random number in that range.
$id = rand(1, 352538);
// $id = 183965;
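If you'd rather have the database pick the number (the "or your db" route), a one-liner like this does the same job; the 352538 is just the cached count from above:
select 1 + floor(random() * 352538)::int as id; -- random integer between 1 and 352538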
We run a query like this:
select * from bigtable where summaryid=(select refid from summary where id=183965);
Time: 49.478 ms
We'll call it 50 ms. That's a little faster than the 9 seconds it took with order by random() limit 1. 180 times faster.
Wait, I just ran it again, and now that the first run has warmed the buffers, it takes 0.993 ms to run...
Let's call that 1 ms. That's 9000 times faster.
I ran a little more testing. It seems the time to pull any one row out of the database unprimed is about 20-30 ms. The overhead of the select from the summary table is very, very small, somewhere around 0.4 ms. All the rest of the time is spent looking up the indexed entry among 30+ million rows and returning it.
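One last housekeeping note: since the summary table only covers a window, it needs the periodic rebuild I mentioned earlier (every hour, every 5 minutes, whatever fits). Here's a rough sketch of how that refresh could look, dropping the unneeded order by random() and using truncate ... restart identity (PostgreSQL 8.4+) so the serial id column starts back at 1:
begin;
truncate summary restart identity;
insert into summary (refid) select summaryid from bigtable where lasttime > now() - interval '48 hours';
commit;
-- then re-run the count(*) and cache the new number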