Fast Random Results - "ORDER BY RAND()" is bad

Thread: Fast Random Results - "ORDER BY RAND()" is bad

  1. #1
    Sxooter

    Quite often we need fast random results. A lot of new developers do it this way (that's random() in PostgreSQL; in MySQL it's RAND()):

    select * from table order by random() limit 1;

    Seems simple enough, and it works just fine as long as your table isn't too large. But it doesn't scale, either with table size or with the number of users hitting the system at the same time. The problem is that order by random() has to generate a random value for every row and sort the entire set, just to throw away all but one row.

    At work, I've got a table that summarizes all the requests our systems see every minute. It has data going back to last August or so: every service (100 or so) times every machine (4 to 8) times every minute since then. Running something like order by random() limit 1 on a table that size will simply never return in a useful time. In April alone, only two weeks in, it has 3,046,182 rows.

    In the last hour, on a quiet Saturday night, it's gotten about 5,000 rows. Doing the order by random() limit 1 on the last hour's data takes 132 ms. The last 4 hours' data takes 484 ms. 24 hours takes 3 seconds. 48 hours takes 9 seconds. You can see where this is going. What if I needed a random row from the last month or so? And what if I need it several times a second? order by random() is not going to work for that.

    Here's a generalized method for getting random entries fast. It's not perfect, but most compromises aren't. First, I'll grab the last 48 hours' worth of ids into a summary table, randomizing them on the way. Since I'm only materializing the id, and not the whole row, the select for that takes only about 4 seconds on this machine. Next I do a select count(*) on the summary table to find out how many rows it has. That takes 178 ms, and I can store the result. In this case, the summary table has 352,538 rows. We can rerun these two queries every hour, or every 5 minutes, or whatever we need, and keep the result of that count(*) somewhere handy. Here's what I'm gonna do:

    create table summary (id serial primary key, refid int);
    insert into summary (refid) select summaryid from bigtable where lasttime > now() - interval '48 hours' order by random(); -- NOTE: the order by random() here is costly and unneeded... thanks, laserlight
    INSERT 0 352538

    Now I have a lookup table with a dense sequence from 1 to 352538 in the id column, and a refid column whose values match the master table's key.

    Let's grab a random row now! Using either PHP or your db, you can generate a random number.

    $id = rand(1, 352538);
    // $id = 183965;

    We run a query like this:

    select * from bigtable where summaryid=(select refid from summary where id=183965);
    Time: 49.478 ms

    We'll call it 50 ms. That's a little faster than the 9 seconds it took with order by random() limit 1: 180 times faster.

    Wait, I just ran it again, now that the first run has pulled those blocks into the buffer cache, and it takes 0.993 ms to run...

    Let's call that 1 ms. That's 9000 times faster.

    I ran a little more testing. It seems the time to pull any single unprimed row out of the db is about 20-30 ms. The overhead of the select from the random lookup table is very small, somewhere around 0.4 ms. All the rest of the time is looking up the indexed entry from 30+ million rows and returning it.
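
    Putting it all together, the lookup from PHP looks roughly like this. A minimal sketch, not my production code: the connection string, table names, and the hard-coded cached count are assumptions (you'd store the count when you rebuild the summary table), and it assumes PHP 5.1+'s pg_* functions since this is PostgreSQL.

    PHP Code:
    // minimal sketch; connection string, table names, and cached count are assumptions
    $db = pg_connect('dbname=mydb');
    // stored when the summary table was last rebuilt
    $count = 352538;
    // pick a random position in the dense 1..$count id sequence
    $id = rand(1, $count);
    // two indexed lookups: one in summary, one in bigtable
    $res = pg_query_params($db,
        'SELECT b.* FROM bigtable b
          WHERE b.summaryid = (SELECT refid FROM summary WHERE id = $1)',
        array($id));
    $row = pg_fetch_assoc($res);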

  2. #2
    laserlight
    hmm... but if you are caching primary keys, why bother ordering them randomly in the summary table? After all, the randomness can come from generating a random id in PHP.

  3. #3
    Sxooter
    Because there might be holes in the sequence.

    OTOH, you can cache the start/stop numbers and use something like:

    select * from table where id >= $id order by id limit 1;
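
    For instance (a minimal sketch; the cached $min/$max values and the table name are assumptions, refreshed whenever the table is rebuilt):

    PHP Code:
    // cache these when the table is (re)built
    $min = 1;
    $max = 352538;
    // the ">= ... order by id limit 1" pattern steps over any holes
    $id  = rand($min, $max);
    $res = pg_query($db, "SELECT * FROM mytable WHERE id >= $id ORDER BY id LIMIT 1");
    $row = pg_fetch_assoc($res);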

  4. #4
    Sxooter
    Just read what you wrote (a second time) and you're absolutely right.
    Will post an addendum.

  5. #5
    NogDog
    Just a thought I had, no idea if it would perform any better:

    1. Do a SELECT COUNT(*) to get the number of records

    2. Do a <?php $offset = rand(0, $count - 1); ?> where $count comes from the above query.

    3. Do a SELECT * FROM table LIMIT 1 OFFSET $offset

    I have no empirical evidence of how this would perform versus the ORDER BY RAND() method, but I figured I'd throw it out there in case you want to try it.
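
    In code that's roughly the following (a minimal sketch; the mysql_* calls and the table name are assumptions):

    PHP Code:
    // COUNT(*) + random OFFSET; table name is a placeholder
    $res    = mysql_query("SELECT COUNT(*) FROM mytable");
    $count  = (int) mysql_result($res, 0);
    $offset = rand(0, $count - 1);
    $res    = mysql_query("SELECT * FROM mytable LIMIT 1 OFFSET $offset");
    $row    = mysql_fetch_assoc($res);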
    "Well done....Consciousness to sarcasm in five seconds!" ~ Terry Pratchett, Night Watch

    How to Ask Questions the Smart Way (not affiliated with this site, but well worth reading)

    My Blog
    cwrBlog: simple, no-database PHP blogging framework

  6. #6
    Sxooter
    Thanks to laserlight's point, I ran the same query to populate the summary table, but with the last month's data instead of the last 48 hours'. It took quite a while to build the summary table, 450 seconds, and it inserted 6,787,330 rows. Select time still hovers around 50 milliseconds the first time a row is hit, and 1 millisecond the second time, when its block is in memory.

  7. #7
    Sxooter

    Quote Originally Posted by NogDog
    Just a thought I had, no idea if it would perform any better: [the COUNT(*) + random OFFSET method above]
    Tried it before. It's halfway between order by rand() and my method in terms of performance. The problem is that the db still has to scan past all the rows before the offset before it can stop, so the higher the offset, the longer it takes to return.

    Note that select count(*) is NOT fast in most databases (Oracle, PostgreSQL, MySQL with InnoDB tables) when the tables are very large.

    For instance, on the table listed above, hitting the last month's data, here's what I get for varying offsets, in milliseconds:

    1: 1.3
    100: 3.5
    1000: 4.4
    10,000: 23
    100,000: 210
    500,000: 1280
    1,000,000: 3000

    You can see where that's heading. At 40+ million rows, it's not gonna scale.
    Quote Originally Posted by laserlight
    COUNT(*) pretty much has to traverse through the entire table, right?
    laserlight, yes, count(*) has to traverse the whole table in many databases, such as PostgreSQL, Oracle, and MySQL with InnoDB tables...

  8. #8
    laserlight
    Quote Originally Posted by Sxooter
    Note that select count(*) is NOT fast in most databases (Oracle, PostgreSQL, MySQL with InnoDB tables) when the tables are very large.
    COUNT(*) pretty much has to traverse through the entire table, right?

  9. #9
    NogDog
    I was remembering this, from the MySQL manual:
    COUNT(*) is optimized to return very quickly if the SELECT retrieves from one table, no other columns are retrieved, and there is no WHERE clause.
    However, it then notes that this optimization only applies to MyISAM tables.
    "Well done....Consciousness to sarcasm in five seconds!" ~ Terry Pratchett, Night Watch

    How to Ask Questions the Smart Way (not affiliated with this site, but well worth reading)

    My Blog
    cwrBlog: simple, no-database PHP blogging framework

  10. #10
    Roger Ramjet
    Instead of COUNT(*) you can use SHOW TABLE STATUS with MyISAM tables to get the number of rows. Unfortunately this doesn't work for other table types, which only return an approximation of the row count. It'd be interesting if someone could benchmark this against count(*); my assumption is that SHOW TABLE STATUS will be independent of table size.
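
    Something like this, I'd guess (a minimal sketch; the table name is a placeholder):

    PHP Code:
    // the Rows column is exact for MyISAM, only an estimate for InnoDB
    $res   = mysql_query("SHOW TABLE STATUS LIKE 'mytable'");
    $row   = mysql_fetch_assoc($res);
    $count = (int) $row['Rows'];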

  11. #11
    Sxooter
    There's something similar in most databases for estimating the row count cheaply. In PostgreSQL it's

    select reltuples from pg_class where relname='tablename';

    Clunky, but it works. It tells you how many tuples were in the table as of the last vacuum.
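
    From PHP that's something like this (a minimal sketch; the table name is a placeholder):

    PHP Code:
    // cheap row estimate instead of a full COUNT(*);
    // reltuples is only as fresh as the last vacuum/analyze
    $res      = pg_query($db, "SELECT reltuples FROM pg_class WHERE relname = 'mytable'");
    $estimate = (int) pg_fetch_result($res, 0, 0);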

  12. #12
    Roger Ramjet

    When I take those three steps back and look at the problem, I see that the solution is in the domain itself. What we have all forgotten is that the web is a random environment. We see so many folks trying to use this RAND() thing for good reasons on their websites, because they want to display random advertising banners, random users, random messages, etc. What we all forget is that a) random does not mean unique, but more importantly, b) the users themselves are already random.

    The solution to displaying random events to individual users is already taken care of by their random page hits. All you need to do is to cycle sequentially through your items and that is a simple algorithm with easy optimisation.
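
    As a minimal sketch of that cycling idea (the counter file and the item count are assumptions; a db sequence or shared-memory counter would do just as well):

    PHP Code:
    // each page hit advances a shared pointer; the visitors themselves
    // supply the randomness, so no rand() is needed
    $count = 20; // however many banners/messages are in rotation
    $pos   = (int) @file_get_contents('/tmp/rotate_pos');
    echo "showing item " . ($pos % $count);
    file_put_contents('/tmp/rotate_pos', ($pos + 1) % $count);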

    Those cases where RAND() is a necessary and applicable solution are complex apps like MRP/JIT, BI, risk analysis/financial modelling, market research, scientific analysis, etc.: domains where complex and powerful analysis is involved and performance is secondary to results. When I worked in market research we did not mind if a report took 20 minutes to process: just go to lunch, or run it overnight.

  13. #13
    Sxooter
    Quote Originally Posted by Roger Ramjet
    Because with all their wealth of experience and knowledge, this RAND() problem is not one that they have had to solve before. If it were, then the solution would have been presented already.
    I have run into it quite a bit before, and I believe I have presented the best solution for randomly selecting one row from a large data set.

    If you've got a better solution, I'd love to see it.

    About the 20-minute reports: I've got some reports that take the better part of a day to run, and that's after massive optimization. When you're aggregating 40 million rows of data, it can only be done so fast. If that same report had an order by random() clause in it, it would eat every byte of RAM and gigs of storage, and probably not return for weeks, maybe even months.

  14. #14
    Roger Ramjet
    Hey, cool down Sxooter, I was not knocking you. On the contrary, I was trying to say that generally you would not even think of doing this. Those cases where you would have done it would be apps where rand() was essential and the only way to do the job; and you would accept the performance hit just because it was the ONLY way to do the job.

    If I've upset you, sorry. All I can say is that sitting drinking beer in the sunshine has got the better of me.

    I do have a modification to your method that avoids the performance hit of using LIMIT offset on large data sets.

    PHP Code:
    // get the two highest ids
    $res1 = mysql_query("SELECT id FROM table ORDER BY id DESC LIMIT 2");
    // generate a random number between 0 and the second-highest id,
    // so "id > $rand" is guaranteed to match at least one row
    $top  = mysql_result($res1, 1);
    $rand = rand(0, $top);
    // now query for the first id greater than the random number
    // (ASC, not DESC; DESC would always return the highest id)
    $sql  = "SELECT * FROM table WHERE id > $rand ORDER BY id ASC LIMIT 1";
    Size of dataset should not matter, but I stand to be corrected if you care to benchmark it.

  15. #15
    Sxooter
    I'm not upset in the least, honestly. I sorta kinda figured that was where you were heading.

    I've posted something similar to your solution and that one also runs very quickly.

    I think I might be due for a pub run myself!
