Quite often we need fast random results. A lot of new developers do it this way:

select * from table order by random() limit 1;

Seems simple enough, and it works just fine as long as your table isn't too large. But it doesn't scale well, either in terms of table size or the number of users hitting the system at the same time.

At work, I've got a table that summarizes all the requests our systems see every minute. It has data going back to last August or so: every service (100 or so) times every machine (4 to 8) times every minute since then. Running something like order by random() limit 1 on such a table will simply never return. In April alone, only two weeks in so far, it has 3,046,182 rows.

In the last hour, on a quiet Saturday night, it's gotten about 5,000 rows. Doing the order by random() limit 1 on the last hour's data takes 132 ms. The last 4 hours' data takes 484 ms. 24 hours takes 3 seconds. 48 hours takes 9 seconds. You can see where this is going. What if I needed a random row from the last month? And what if I need it several times a second? order by random() is not going to work for that.

Here's a generalized method for getting random entries fast. It's not perfect, but most compromises aren't. First, I'll grab the last 48 hours' worth of ids into a summary table, randomizing them on the way. Since I'm only materializing the id, and not the whole row, that select takes only about 4 seconds on this machine. Next I do a select count(*) on the summary table to find out how many rows it has. That takes 178 ms, and I can store the result. In this case, the summary table has 352,538 rows. We can rerun these two queries every hour, or every 5 minutes, or whatever we need, and keep the count(*) result somewhere handy. Here's what I'm gonna do:

create table summary (id serial primary key, refid int);
insert into summary (refid) select summaryid from bigtable where lasttime > now() - interval '48 hours' order by random(); -- NOTE: the order by random() here is costly and unneeded... Thanks laserlight
INSERT 0 352538

Now I have a lookup table with a sequence from 1 to 352538 in the id column, and a number to match the master table in the refid column.
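
The count step mentioned above is just this, run against the summary table we just built:

select count(*) from summary;
-- returns 352538; cache this number somewhere the app can reach it cheaply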

Let's grab a random row now! Using either PHP or your database, generate a random number between 1 and the row count.

$id = rand(1, 352538);
// $id = 183965;

We run a query like this:

select * from bigtable where summaryid=(select refid from summary where id=183965);
Time: 49.478 ms

We'll call it 50 ms. That's a little faster than the 9 seconds it took with order by random() limit 1. 180 times faster.

Wait, I just ran it again, and now that the block is in the buffer cache, it takes 0.993 ms to run...

Let's call that 1 ms. That's 9000 times faster.

I ran a little more testing. It seems the time to pull any one row out of the db unprimed is about 20-30 ms. The overhead of the select from the random summary table is very, very small, somewhere around 0.4 ms. All the rest of the time is spent looking up the indexed entry from 30+ million rows and returning it.
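
If you want to see that split yourself, PostgreSQL's explain analyze will break it down (a sketch using the same query as above):

explain analyze
select * from bigtable where summaryid=(select refid from summary where id=183965);
-- the InitPlan is the summary lookup (sub-millisecond); the index scan on bigtable is the rest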

    hmm... but if you are caching primary keys, why bother ordering them randomly in the summary table? After all, the randomness can come from generating a random id in PHP.

      Because there might be holes in the sequence.

      OTOH, you can cache the start / stop numbers and use something like:

      select * from table where id >= $id order by id limit 1;
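
      Spelled out a little (a sketch; id is assumed indexed, and $min/$max are the cached start/stop numbers):

      -- refresh these cached bounds as often as you like
      select min(id), max(id) from table;
      -- then pick $id = rand($min, $max) in the app and run the query above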

        Just read what you wrote (a second time) and you're absolutely right.
        Will post an addendum.

          Just a thought I had, no idea if it would perform any better:

          1. Do a SELECT COUNT(*) to get the number of records

          2. Do a <?php $offset = rand(0, $count - 1); ?> where $count comes from the above query.

          3. Do a SELECT * FROM table LIMIT 1 OFFSET $offset

          I have no empirical evidence how this would perform versus the ORDER BY RAND() method, but figured I'd throw it out there in case you want to try it.
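
          Putting those steps together (a sketch; $offset computed app-side as rand(0, $count - 1)):

          SELECT COUNT(*) FROM table;                   -- step 1: gives $count
          SELECT * FROM table LIMIT 1 OFFSET $offset;   -- step 3: one row at a random offset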

            Thanks to laserlight's point, I ran the same query to populate the summary table, but with the last month's data instead of the last 48 hours. It took quite a while to build the summary table, 450 seconds, and it inserted 6,787,330 rows. Select time still hovers around 50 milliseconds the first time a row is hit, and 1 millisecond the second time, when its block is in memory.

              NogDog wrote:

              Just a thought I had, no idea if it would perform any better:

              1. Do a SELECT COUNT(*) to get the number of records

              2. Do a <?php $offset = rand(0, $count - 1); ?> where $count comes from the above query.

              3. Do a SELECT * FROM table LIMIT 1 OFFSET $offset

              I have no empirical evidence how this would perform versus the ORDER BY RAND() method, but figured I'd throw it out there in case you want to try it.

              Tried it before. It's about halfway between order by rand() and my method in terms of performance. The problem is that the db still has to generate the whole data set up to the offset before it can stop, so the higher the offset, the longer it takes to return.

              Note that select count(*) is NOT fast in most databases (Oracle, PostgreSQL, MySQL with InnoDB tables) when the tables are very large.

              For instance, on the table listed above, hitting the last month's data, here's what I got for varying offsets, in milliseconds:

              1: 1.3
              100: 3.5
              1000: 4.4
              10,000: 23
              100,000: 210
              500,000: 1280
              1,000,000: 3000

              You can see where that's heading. At 40+ million rows, it's not gonna scale.

              COUNT(*) pretty much has to traverse through the entire table, right?


              laserlight, yes, count(*) has to traverse the whole table in many databases, such as PostgreSQL and Oracle and MySQL with InnoDB tables...

                Note that select count(*) is NOT fast in most databases (Oracle, PostgreSQL, MySQL with InnoDB tables) when the tables are very large.


                COUNT(*) pretty much has to traverse through the entire table, right?

                  I was remembering this regarding MySQL:

                  COUNT(*) is optimized to return very quickly if the SELECT retrieves from one table, no other columns are retrieved, and there is no WHERE clause.

                  However, it then notes that this is only applicable in MyISAM tables.

                    Instead of COUNT(*) you can use SHOW TABLE STATUS with MyISAM tables to get the number of rows. Unfortunately this does not work for other table types, which only return an approximation of the row count. It'd be interesting if someone could benchmark this against count(*). My assumption is that SHOW TABLE STATUS will be independent of table size.
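
                    For example (table name assumed; with MyISAM the Rows column is exact, with InnoDB it's only an estimate):

                    SHOW TABLE STATUS LIKE 'bigtable';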

                      There's something similar in most databases for estimating rows cheaply. In PostgreSQL it's

                      select reltuples from pg_class where relname='tablename';

                      Clunky, but it works. It tells you how many tuples were in the table at the end of the last vacuum.


                        When I take those three steps back and look at the problem, I see that the solution is in the domain itself. What we have all forgotten is that the web is a random environment. We see so many folks trying to use this RAND() thing for good reasons on their websites, because they want to display random advertising banners, random users, random messages, etc. What we all forget is a) random does not mean unique, but more importantly b) it is the users who are already random.

                        The solution to displaying random events to individual users is already taken care of by their random page hits. All you need to do is cycle sequentially through your items, and that is a simple algorithm with easy optimisation.
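
                        A minimal sketch of that cycling idea (the banners table and its dense ids 1..N are assumptions; the one-row counter is bumped on every page hit):

                        create table banner_counter (n int not null);
                        insert into banner_counter values (0);

                        update banner_counter set n = n + 1;
                        select b.* from banners b
                         where b.id = (select n % (select count(*) from banners) + 1 from banner_counter);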

                        Those cases where RAND() is a necessary and applicable solution are complex apps like MRP/JIT, BI, Risk Analysis/Financial Modelling, Market Research, Scientific Analysis, etc: domains where complex and powerful analysis is involved and performance is secondary to results. When I worked in market research we did not mind if a report took 20 mins to process - just go to lunch, or run it overnight.

                          Because with all their wealth of experience and knowledge, this RAND() problem is not one that they have had to solve before. If it were, the solution would have been presented already.

                          I have run into it quite a bit before, and I believe I have presented the best solution for randomly selecting one row from a large data set. 😃

                          If you've got a better solution, I'd love to see it.

                          About the 20-minute reports: I've got some reports that take the better part of a day to run, and that's after massive optimization. When you're aggregating 40 million rows of data, it can only be done so fast. If that same report had an order by random() clause in it, it would, literally, eat every byte of RAM, gigs of storage, and probably not return for weeks, maybe even months.

                            Hey, cool down Sxooter, I was not knocking you. On the contrary, I was trying to say that generally you would not even think of doing this. Those cases where you would have done it would be apps where rand() was essential and the only way to do the job; and you would accept the performance hit just because it was the ONLY way to do the job.

                            If I've upset you, sorry. All I can say is that sitting drinking beer in the sunshine has got the better of me.

                            I do have a modification to your method that avoids the performance hit of using LIMIT with a large OFFSET on big data sets.

                            // get the two highest ids
                            $res1 = mysql_query("SELECT id FROM table ORDER BY id DESC LIMIT 2");
                            // generate a random number between 0 and the second-highest id
                            $top = mysql_result($res1, 1);
                            $rand = rand(0, $top);
                            // now fetch the first row whose id is greater than the random number
                            $sql = "SELECT * FROM table WHERE id > $rand ORDER BY id ASC LIMIT 1";
                            $res2 = mysql_query($sql);
                            

                            Size of dataset should not matter, but I stand to be corrected if you care to benchmark it.

                              I'm not upset in the least, honestly. I sorta kinda figured that was where you were heading.

                              I've posted something similar to your solution and that one also runs very quickly.

                              I think I might be due for a pub run myself! 🙂

                                Cool, Sxooter, go for it. I'm back down there myself in a mo to watch Chelsea v Blackburn in the FA cup semis. 😉


                                  Thread stickied; if any cleaning up needs to be done, Sxooter, let me know.


                                    Wow that's a sweet tip... I used to do it that way about 3 years ago when I was starting out and had no idea there was an ORDER BY rand() possibility.

                                      Roger Ramjet wrote:

                                      We see so many folk trying to use this RAND() thing for good reasons on their websites because they want to display random advertising banners

                                      I'm stuck on this one now, was going to use RAND() to pick 1 of 3 adverts, but decided to rotate them...

                                      Roger Ramjet wrote:

                                      The solution to displaying random events to individual users is already taken care of by their random page hits. All you need to do is to cycle sequentially through your items and that is a simple algorithm with easy optimisation.

                                      Can you share this solution/algorithm with us? Please 😃