Quite often we need fast random results. A lot of new developers do it this way:

select * from table order by random() limit 1;

Seems simple enough, and it works just fine as long as your table isn't too large. But it doesn't scale well, either in terms of table size or the number of users hitting the system at the same time.

At work, I've got a table that summarizes all the requests our systems see every minute. It has data going back to last August or so: every service (100 or so) times every machine (4 to 8) times every minute since then. Running something like order by random() limit 1 on such a table will simply never return. In April alone, only two weeks in so far, it has 3,046,182 rows.

In the last hour, on a quiet Saturday night, it's gotten about 5,000 rows. Doing the order by random() limit 1 on the last hour's data takes 132 ms. The last 4 hours' data takes 484 ms. 24 hours takes 3 seconds. 48 hours takes 9 seconds. You can see where this is going. What if I needed a random row from the last month? And what if I need it several times a second? order by random() is not going to work for that.

Here's a generalized method for getting random entries fast. It's not perfect, but most compromises aren't. First, I'll grab the last 48 hours' worth of ids into a summary table, randomizing them on the way. Since I'm only materializing the id, and not the whole row, that select takes only about 4 seconds on this machine. Next I do a select count(*) on the summary table to find out how many rows it has. That takes 178 ms, and I can store the result. In this case, the summary table has 352,538 rows. We can rerun these two queries every hour, or every 5 minutes, or whatever we need, and keep the count(*) result somewhere handy. Here's what I'm gonna do:

create table summary (id serial primary key, refid int);
insert into summary (refid) select summaryid from bigtable where lasttime > now() - interval '48 hours' order by random(); -- NOTE: the order by random() here is costly and unneeded... Thanks laserlight
INSERT 0 352538

Now I have a lookup table with a sequence from 1 to 352538 in the id column, and a number to match the master table in the refid column.
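
The count step mentioned above is just this, run against the summary table we just built:

select count(*) from summary;
-- returns 352538; cache this number somewhere the app can reach it cheaply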

Let's grab a random row now! Using either PHP or your database, generate a random number between 1 and the row count.

$id = rand(1, 352538);
// $id = 183965;

We run a query like this:

select * from bigtable where summaryid=(select refid from summary where id=183965);
Time: 49.478 ms

We'll call it 50 ms. That's a little faster than the 9 seconds it took with order by random() limit 1. 180 times faster.

Wait, I just ran it again, and now that the block is in the buffer cache, it takes 0.993 ms to run...

Let's call that 1 ms. That's 9000 times faster.

I ran a little more testing. It seems the time to pull any one row out of the db unprimed is about 20-30 ms. The overhead of the select from the random summary table is very, very small, somewhere around 0.4 ms. All the rest of the time is spent looking up the indexed entry from 30+ million rows and returning it.
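
If you want to see that split yourself, PostgreSQL's explain analyze will break it down (a sketch using the same query as above):

explain analyze
select * from bigtable where summaryid=(select refid from summary where id=183965);
-- the InitPlan is the summary lookup (sub-millisecond); the index scan on bigtable is the rest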

    hmm... but if you are caching primary keys, why bother ordering them randomly in the summary table? After all, the randomness can come from generating a random id in PHP.

      Because there might be holes in the sequence.

      OTOH, you can cache the start / stop numbers and use something like:

      select * from table where id >= $id order by id limit 1;
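
      Spelled out a little (a sketch; id is assumed indexed, and $min/$max are the cached start/stop numbers):

      -- refresh these cached bounds as often as you like
      select min(id), max(id) from table;
      -- then pick $id = rand($min, $max) in the app and run the query above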

        Just read what you wrote (a second time) and you're absolutely right.
        Will post an addendum.

          Just a thought I had, no idea if it would perform any better:

          1. Do a SELECT COUNT(*) to get the number of records

          2. Do a <?php $offset = rand(0, $count - 1); ?> where $count comes from the above query.

          3. Do a SELECT * FROM table LIMIT 1 OFFSET $offset

          I have no empirical evidence how this would perform versus the ORDER BY RAND() method, but figured I'd throw it out there in case you want to try it.
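
          Putting those steps together (a sketch; $offset computed app-side as rand(0, $count - 1)):

          SELECT COUNT(*) FROM table;                   -- step 1: gives $count
          SELECT * FROM table LIMIT 1 OFFSET $offset;   -- step 3: one row at a random offset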

            Thanks to laserlight's point, I ran the same query to populate the summary table, but with the last month's data instead of the last 48 hours. It took quite a while to build the summary table, 450 seconds, and it inserted 6,787,330 rows. Select time still hovers around 50 milliseconds the first time a row is hit, and 1 millisecond the second time, when its block is in memory.

              NogDog wrote:

              Just a thought I had, no idea if it would perform any better:

              1. Do a SELECT COUNT(*) to get the number of records

              2. Do a <?php $offset = rand(0, $count - 1); ?> where $count comes from the above query.

              3. Do a SELECT * FROM table LIMIT 1 OFFSET $offset

              I have no empirical evidence how this would perform versus the ORDER BY RAND() method, but figured I'd throw it out there in case you want to try it.

              Tried it before. It's about halfway between order by rand() and my method in terms of performance. The problem is that the db still has to generate the whole data set up to the offset before it can stop, so the higher the offset, the longer it takes to return.

              Note that select count(*) is NOT fast in most databases (Oracle, PostgreSQL, MySQL with InnoDB tables) when the tables are very large.

              For instance, on the table listed above, hitting the last month's data, here's what I got for varying offsets, in milliseconds:

              1: 1.3
              100: 3.5
              1000: 4.4
              10,000: 23
              100,000: 210
              500,000: 1280
              1,000,000: 3000

              You can see where that's heading. At 40+ million rows, it's not gonna scale.

              COUNT(*) pretty much has to traverse through the entire table, right?


              laserlight, yes, count(*) has to traverse the whole table in many databases, such as PostgreSQL and Oracle and MySQL with InnoDB tables...

                Note that select count(*) is NOT fast in most databases (Oracle, PostgreSQL, MySQL with InnoDB tables) when the tables are very large.


                COUNT(*) pretty much has to traverse through the entire table, right?

                  I was remembering this regarding MySQL:

                  COUNT(*) is optimized to return very quickly if the SELECT retrieves from one table, no other columns are retrieved, and there is no WHERE clause.

                  However, it then notes that this is only applicable in MyISAM tables.

                    Instead of COUNT(*) you can use SHOW TABLE STATUS with MyISAM tables to get the number of rows. Unfortunately this does not work for other table types, which only return an approximation of the row count. It'd be interesting if someone could benchmark this against count(*). My assumption is that SHOW TABLE STATUS will be independent of table size.
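
                    For example (table name assumed; with MyISAM the Rows column is exact, with InnoDB it's only an estimate):

                    SHOW TABLE STATUS LIKE 'bigtable';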

                      There's something similar in most databases for estimating rows cheaply. In PostgreSQL it's

                      select reltuples from pg_class where relname='tablename';

                      Clunky, but it works. It tells you how many tuples were in the table at the end of the last vacuum.


                        When I take those three steps back and look at the problem, I see that the solution is in the domain itself. What we have all forgotten is that the web is a random environment. We see so many folks trying to use this RAND() thing for good reasons on their websites, because they want to display random advertising banners, random users, random messages, etc. What we all forget is a) random does not mean unique, but more importantly b) it is the users who are already random.

                        The solution to displaying random events to individual users is already taken care of by their random page hits. All you need to do is cycle sequentially through your items, and that is a simple algorithm with easy optimisation.
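
                        A minimal sketch of that cycling idea (the banners table and its dense ids 1..N are assumptions; the one-row counter is bumped on every page hit):

                        create table banner_counter (n int not null);
                        insert into banner_counter values (0);

                        update banner_counter set n = n + 1;
                        select b.* from banners b
                         where b.id = (select n % (select count(*) from banners) + 1 from banner_counter);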

                        Those cases where RAND() is a necessary and applicable solution are complex apps like MRP/JIT, BI, Risk Analysis/Financial Modelling, Market Research, Scientific Analysis, etc: domains where complex and powerful analysis is involved and performance is secondary to results. When I worked in market research we did not mind if a report took 20 mins to process - just go to lunch, or run it overnight.

                          Because with all their wealth of experience and knowledge, this RAND() problem is not one that they have had to solve before. If it were, the solution would have been presented already.

                          I have run into it quite a bit before, and I believe I have presented the best solution for randomly selecting one row from a large data set. 😃

                          If you've got a better solution, I'd love to see it.

                          About the 20-minute reports: I've got some reports that take the better part of a day to run, and that's after massive optimization. When you're aggregating 40 million rows of data, it can only be done so fast. If that same report had an order by random() clause in it, it would, literally, eat every byte of RAM, gigs of storage, and probably not return for weeks, maybe even months.

                            Hey, cool down Sxooter, I was not knocking you. On the contrary, I was trying to say that generally you would not even think of doing this. Those cases where you would have done it would be apps where rand() was essential and the only way to do the job; and you would accept the performance hit just because it was the ONLY way to do the job.

                            If I've upset you, sorry. All I can say is that sitting drinking beer in the sunshine has got the better of me.

                            I do have a modification to your method that avoids the performance hit of using LIMIT with a large OFFSET on big data sets.

                            // get the two highest ids
                            $res1 = mysql_query("SELECT id FROM table ORDER BY id DESC LIMIT 2");
                            // generate a random number between 0 and the second-highest id
                            $top = mysql_result($res1, 1);
                            $rand = rand(0, $top);
                            // now fetch the first row whose id is greater than the random number
                            $sql = "SELECT * FROM table WHERE id > $rand ORDER BY id ASC LIMIT 1";
                            $res2 = mysql_query($sql);
                            

                            Size of dataset should not matter, but I stand to be corrected if you care to benchmark it.

                              I'm not upset in the least, honestly. I sorta kinda figured that was where you were heading.

                              I've posted something similar to your solution and that one also runs very quickly.

                              I think I might be due for a pub run myself! 🙂

                                Cool, Sxooter, go for it. I'm back down there myself in a mo to watch Chelsea v Blackburn in the FA cup semis. 😉


                                  Thread stickied; if any cleaning up needs to be done, Sxooter, let me know.


                                    Wow that's a sweet tip... I used to do it that way about 3 years ago when I was starting out and had no idea there was an ORDER BY rand() possibility.

                                      Roger Ramjet wrote:

                                      We see so many folk trying to use this RAND() thing for good reasons on their websites because they want to display random advertising banners

                                      I'm stuck on this one now, was going to use RAND() to pick 1 of 3 adverts, but decided to rotate them...

                                      Roger Ramjet wrote:

                                      The solution to displaying random events to individual users is already taken care of by their random page hits. All you need to do is to cycle sequentially through your items and that is a simple algorithm with easy optimisation.

                                      Can you share this solution/algorithm with us? Please 😃