Sorry I've not replied sooner - I've spent the last few days finding out how much I've forgotten in the year since I last did any PHP/HTML/CSS. The worst part is I've lost all my base code, so I'm starting from scratch.

Anyway, I'm really not up on EXPLAIN output since I've never needed to use it myself: I've been doing relational databases since the '80s - which is why I mostly answer posts in the db forum.

My solution to your problem goes like this:

  1. Run a query that selects ALL the data in one go - by that I mean all the candidate events AND all the columns you want as well. Do not impose any grouping, distinct, ordering, randomisation, or anything else that will slow the query down. Keep your WHERE conditions to those that are required.

Do not worry about duplication of events since we are going to deal with that in code.

This should return a large resultset in only fractions of a second.

  2. Now process the results in code to select and display 10 random events with no duplicates. There is code for a function that does this in the manual under RAND(), but it is based on fetching all the rows first, which is a lot of unnecessary processing. I'm sure that my solution could be converted into a function that you just pass the resultset reference to - there's a rough sketch after the code below.

$sql = "SELECT e.id, e.title, e.subheading, e.zip, eta.start_timestamp, eta.end_timestamp, z.city, z.st
FROM demo_events e INNER JOIN demo_event_time_assoc eta ON e.id=eta.event_id INNER JOIN demo_zip_codes z ON e.zip=z.zip
WHERE e.active =1 AND
((eta.start_timestamp >=1162506800 AND eta.start_timestamp >=1163111600)
OR
(eta.start_timestamp <1162506800 AND eta.end_timestamp >=1162510400))";

$result = mysql_query($sql) or die(mysql_error());

// get limit for rand()
$up = mysql_num_rows($result);
// set up count
$c = 10;
// set up array to hold event ids and enable elimination of duplicates
$ary = array();

// note: assumes the resultset holds at least $c distinct event ids
while (count($ary) < $c) {
   // get a random index - remember the result index is zero-based, so we subtract 1
   $i = rand(1, $up) - 1;
   // move the pointer to that row number and fetch the data
   mysql_data_seek($result, $i);
   $row = mysql_fetch_array($result);
   // check if the event id has already been displayed
   // (the query selects e.id, so the array key is 'id')
   if (!in_array($row['id'], $ary)) {
      // display the event data
      echo '<tr><td>' . $row['id'] . 'etc, etc';
      // store the id to prevent duplication
      array_push($ary, $row['id']);
   }
}
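
As a rough sketch of that function idea (untested - the function name is made up, and it assumes the resultset columns above):

function display_random_events($result, $c)
{
   // total rows available in the resultset
   $up = mysql_num_rows($result);
   // never ask for more rows than exist
   $c = min($c, $up);
   // ids already displayed
   $ary = array();
   while (count($ary) < $c) {
      // random zero-based row index
      $i = rand(0, $up - 1);
      mysql_data_seek($result, $i);
      $row = mysql_fetch_array($result);
      if (!in_array($row['id'], $ary)) {
         echo '<tr><td>' . $row['id'] . 'etc, etc';
         $ary[] = $row['id'];
      }
   }
}

// usage: display_random_events($result, 10);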

Now, you may need to change the query if it is not getting the right fields - but don't be tempted to add any conditions. The random selection takes care of events with duplicate listings, because it already has to handle the possibility that rand() will return the same number more than once - which everyone forgets it can.

This approach should be fastest no matter how many records you are dealing with.

Hope this helps.

    Let me just say again that I really appreciate this, RR!

    I have basically done what you described in your last post. I have my initial query which fetches all the records, and then I have my own routine to choose the results randomly. I'm using mysql_result() instead of mysql_data_seek() - I'm not really sure what the difference is, but it seems to be working.
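
    For reference, the two approaches look roughly like this (a sketch, not my exact code - $i is some row number):

    // mysql_result() grabs a single field straight from row $i
    $id = mysql_result($result, $i, 'id');

    // mysql_data_seek() moves the result pointer to row $i, and
    // mysql_fetch_array() then pulls the whole row as an array
    mysql_data_seek($result, $i);
    $row = mysql_fetch_array($result);
    $id = $row['id'];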

    I must confess, I am running two queries: the first one uses 'SELECT DISTINCT e.id', and the 2nd query, which fetches the other info, typically takes a few milliseconds or less. The 2nd query doesn't seem to be the problem, though - the first query is what's taking all the time.

    I populated my tables with some random sample data:
    events - 50,000 records
    event_time_assoc - 115,607 records
    zip_codes - 43,000 records.

    I've set up a page that keeps updating the target timestamps with the current time so I can be sure that I'm not getting cached results.

    My first query typically runs in 1.5 seconds:

    SELECT DISTINCT e.id
    FROM demo_events e
    INNER JOIN demo_event_time_assoc eta ON e.id=eta.event_id
    WHERE e.active =1
    AND ((eta.start_timestamp >=1163590783
            AND eta.start_timestamp <=1164195583)
       OR (eta.start_timestamp <1163590783
            AND eta.end_timestamp >=1163594383)) ORDER BY NULL
    

    Your recommendation runs in about 2 seconds each time:

    SELECT e.id, e.title, e.subheading, e.zip, eta.start_timestamp, eta.end_timestamp, z.city, z.st
    FROM demo_events e
    INNER JOIN demo_event_time_assoc eta ON e.id=eta.event_id
    INNER JOIN demo_zip_codes z ON e.zip=z.zip
    WHERE e.active =1
    AND ((eta.start_timestamp >=1163591310
              AND eta.start_timestamp <=1164196110)
       OR (eta.start_timestamp <1163591310
              AND eta.end_timestamp >=1163594910))
    

    Seems to me it's these initial queries that are the problem because they seem to be scanning tables rather than using indexes.

    This EXPLAIN statement says as much, basically:

    EXPLAIN
    SELECT e.id, e.title, e.subheading, e.zip, eta.start_timestamp, eta.end_timestamp, z.city, z.st
    FROM demo_events e
    INNER JOIN demo_event_time_assoc eta ON e.id=eta.event_id
    INNER JOIN demo_zip_codes z ON e.zip=z.zip
    WHERE e.active =1
    AND ((eta.start_timestamp >=1163591310
              AND eta.start_timestamp <=1164196110)
       OR (eta.start_timestamp <1163591310
              AND eta.end_timestamp >=1163594910))
    

    the results:

    table   type     possible_keys                            key       key_len   ref           rows     Extra
    eta     ALL      event_id,start_timestamp,end_timestamp   NULL      NULL     NULL           115607   Using where
    e       eq_ref   PRIMARY                                  PRIMARY   4        eta.event_id   1        Using where
    z       eq_ref   PRIMARY                                  PRIMARY   5        e.zip          1
    

    The table has indexes which appear to be getting ignored for some reason. Too many indexes, maybe? I was originally using unix-style timestamps and couldn't decide between those and mysql's datetime format, which is why the table has both. You can recreate the eta table with this:

    CREATE TABLE `demo_event_time_assoc` (
      `id` int(15) unsigned NOT NULL auto_increment,
      `event_id` int(11) unsigned NOT NULL default '0',
      `start_datetime` datetime NOT NULL default '0000-00-00 00:00:00',
      `start_timestamp` int(12) NOT NULL default '0',
      `end_datetime` datetime NOT NULL default '0000-00-00 00:00:00',
      `end_timestamp` int(12) NOT NULL default '0',
      `start_hour` tinyint(3) unsigned NOT NULL default '0',
      `end_hour` tinyint(3) unsigned NOT NULL default '0',
      PRIMARY KEY  (`id`),
      KEY `event_id` (`event_id`),
      KEY `start_timestamp` (`start_timestamp`),
      KEY `end_timestamp` (`end_timestamp`),
      KEY `start_datetime` (`start_datetime`),
      KEY `end_datetime` (`end_datetime`)
    ) TYPE=MyISAM AUTO_INCREMENT=115608 ;
    
    

      OK, so it is still down to optimising that first query - and don't thank me too much cos I'm learning as much as you are here.

      Now let me see if I have got this right:
      1) you have a table of events, some of which occur more than once: so you have an events-dates table to cope with that.
      2) you are storing unix timestamps as integers, which you index: since integer indexes are fastest, this should help performance

      Now, the reason my query is taking longer will be that it is fetching the data, and so has to actually read the data in the joined tables, not just scan the indexes.

      Now 2 things may help here with your query:

      "SQL_BIG_RESULT can be used with GROUP BY or DISTINCT to tell the optimizer that the result set has many rows. In this case, MySQL directly uses disk-based temporary tables if needed, and prefers sorting to using a temporary table with a key on the GROUP BY elements."

      Depending on your mysql version you can use USE INDEX (pre MySQL 4.0.9) or FORCE INDEX (MySQL 4.0.9 +)

      SELECT SQL_BIG_RESULT DISTINCT e.id
      FROM demo_events e
      INNER JOIN demo_event_time_assoc eta
      FORCE INDEX (event_id, start_timestamp, end_timestamp)
      ON e.id=eta.event_id
      WHERE e.active =1
      AND ((eta.start_timestamp BETWEEN 1163590783 AND 1164195583)
         OR (eta.start_timestamp <1163590783
              AND eta.end_timestamp >=1163594383)) ORDER BY NULL
      

      Be interesting to see how that one goes now.

      Now, I found this in the notes in the manual (I've modified it for this case), and it will be interesting to compare speeds:

      SELECT DISTINCT e.id, e.id*0+RAND() as rnd_id 
      FROM demo_events e
      INNER JOIN demo_event_time_assoc eta ON e.id=eta.event_id
      WHERE e.active =1
      AND ((eta.start_timestamp >=1163590783
              AND eta.start_timestamp <=1164195583)
         OR (eta.start_timestamp <1163590783
              AND eta.end_timestamp >=1163594383))
      ORDER BY rnd_id LIMIT 10
      

      Give it a go and see how it performs.

        Thanks R.

        I got really excited at first after implementing the SQL_BIG_RESULT/FORCE INDEX part, but then realized that the speed gains were due to the absence of the DISTINCT bit. When I added that back, the FORCE INDEX query is actually about 18% slower with my current data. 😕 This is kind of mind-boggling.

        When I tried EXPLAINing it, it appears to be using the index and is still just plain slower:

        EXPLAIN SELECT SQL_BIG_RESULT DISTINCT e.id
        FROM demo_events e
        INNER JOIN demo_event_time_assoc eta
        FORCE INDEX (event_id, start_timestamp, end_timestamp)
        ON e.id = eta.event_id
        WHERE e.active =1
        AND ((eta.start_timestamp BETWEEN 1163785292 AND 1164390092)
          OR (eta.start_timestamp <1163785292
              AND eta.end_timestamp >=1163788892))
        ORDER BY NULL

        result:

        table   type     possible_keys                            key               key_len   ref            rows    Extra
        eta     range    event_id,start_timestamp,end_timestamp   start_timestamp   4         NULL           52501   Using where; Using temporary
        e       eq_ref   PRIMARY                                  PRIMARY           4         eta.event_id   1       Using where

        I tried with and without SQL_BIG_RESULT and it doesn't seem to make any appreciable difference for the current data set of 50,000 records.

        That 2nd query you have is pretty cool. It's much faster than those early ORDER BY RAND() attempts for some reason, but it's still enough slower than my last SELECT DISTINCT e.id query that my PHP code to do the random selecting would be faster.

        Here's some sample output from my test page:

        Query 1
        
        SELECT DISTINCT e.id
        FROM demo_events e
        INNER JOIN demo_event_time_assoc eta
        ON e.id=eta.event_id
        WHERE e.active =1
        AND ((eta.start_timestamp >=1163785286
              AND eta.start_timestamp <=1164390086)
          OR
            (eta.start_timestamp <1163785286
             AND eta.end_timestamp >=1163788886))
             ORDER BY NULL
        
        TIME EXPIRED:1.45012593269 seconds
        41342 records found.
        
        
        
        Query 2
        
        SELECT e.id, e.title, e.subheading, e.zip, eta.start_timestamp, eta.end_timestamp, z.city, z.st
        FROM demo_events e
        INNER JOIN demo_event_time_assoc eta
        ON e.id=eta.event_id
        INNER JOIN demo_zip_codes z
        ON e.zip=z.zip
        WHERE e.active =1
        AND ((eta.start_timestamp >= 1163785289
              AND eta.start_timestamp <=1164390089)
          OR
             (eta.start_timestamp < 1163785289
              AND eta.end_timestamp >=1163788889))
        
        TIME EXPIRED:2.23707103729 seconds
        71582 records found.
        
        
        
        Query 3
        sql:
        SELECT SQL_BIG_RESULT DISTINCT e.id
        FROM demo_events e
        INNER JOIN demo_event_time_assoc eta
        FORCE INDEX (event_id, start_timestamp, end_timestamp)
        ON e.id=eta.event_id
        WHERE e.active =1
        AND ((eta.start_timestamp BETWEEN 1163785292 AND 1164390092)
          OR (eta.start_timestamp < 1163785292
              AND eta.end_timestamp >=1163788892))
        ORDER BY NULL
        
        TIME EXPIRED:1.7027220726 seconds
        41342 records found.
        
        
        
        Query 4
        
        SELECT DISTINCT e.id, e.id*0+RAND() as rnd_id
        FROM demo_events e
        INNER JOIN demo_event_time_assoc eta
        ON e.id=eta.event_id
        WHERE e.active =1
        AND ((eta.start_timestamp >= 1163786237
             AND eta.start_timestamp <= 1164391037)
          OR (eta.start_timestamp < 1163786237
              AND eta.end_timestamp >= 1163789837))
        ORDER BY rnd_id LIMIT 10
        
        TIME EXPIRED:1.75384306908 seconds
        10 records found.

          Hm, not very good at all. It would be OK for a search, but not for random page displays, especially with any amount of traffic.

          Now, one thing that struck me is the e.active =1.
          Is this column indexed, and is it just a yes/no option?
          My guess is that this is what was forcing the table-scan/filesort - it would have to, really, if it were not indexed. Since this is the coarsest WHERE condition, the query optimiser should apply it last, but there is no guarantee of that - so move it to the end of the WHERE clause and see if it has any effect.

          I'd like to get down to what is actually taking all the time, so break the query down into parts and see how they perform. (Really, we should have done that from the outset.) Try a query with the individual timestamp conditions and see how that performs, then combine them, and lastly add the e.active=1. After that, add things like DISTINCT and ORDER BY RAND(). Get some benchmarks to clarify where the time is going.
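
          A simple timing harness along these lines would do (a rough sketch - swap in each candidate query; microtime(true) needs PHP 5):

          $queries = array(
             'timestamps only' => "SELECT e.id FROM demo_events e
                INNER JOIN demo_event_time_assoc eta ON e.id=eta.event_id
                WHERE ((eta.start_timestamp >=1163590783 AND eta.start_timestamp <=1164195583)
                   OR (eta.start_timestamp <1163590783 AND eta.end_timestamp >=1163594383))",
             'plus e.active'   => "SELECT e.id FROM demo_events e
                INNER JOIN demo_event_time_assoc eta ON e.id=eta.event_id
                WHERE ((eta.start_timestamp >=1163590783 AND eta.start_timestamp <=1164195583)
                   OR (eta.start_timestamp <1163590783 AND eta.end_timestamp >=1163594383))
                AND e.active=1",
          );
          foreach ($queries as $label => $sql) {
             $t0 = microtime(true);
             $result = mysql_query($sql) or die(mysql_error());
             echo $label . ': ' . (microtime(true) - $t0) . ' seconds, '
                . mysql_num_rows($result) . " rows<br>\n";
          }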

          If your version supports sub-queries (4.1 or above) then it may pay to use one here. eg. subquery with just the time stamp conditions and then the e.active = 1 in an outer query - all depends on the benchmarking results.
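
          Something like this, roughly (untested - just to illustrate the shape, with the e.active=1 kept in the outer query):

          SELECT e.id
          FROM (
             SELECT DISTINCT event_id
             FROM demo_event_time_assoc
             WHERE ((start_timestamp >=1163590783 AND start_timestamp <=1164195583)
                OR (start_timestamp <1163590783 AND end_timestamp >=1163594383))
          ) AS t
          INNER JOIN demo_events e ON e.id = t.event_id
          WHERE e.active = 1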

            OK... something to keep in mind is that this machine is anything but fast. However, I've tried running it on a fast, modern server in a shared hosting environment: it's faster there, but still slow.

            the machine:

            800 MHz P3
            504 MB memory (I think some is used for video)

            I ran a script to generate more data. The database currently has 250,000 records; about 90-95% of them have e.active=1.

            This is my current champion query:

            ORIGINAL QUERY
            
            SELECT DISTINCT e.id
            FROM demo_events e
            INNER JOIN demo_event_time_assoc eta
            ON e.id=eta.event_id
            WHERE e.active =1
            AND ((eta.start_timestamp >=1164082656 AND eta.start_timestamp <=1164687456) OR (eta.start_timestamp <1164082656 AND eta.end_timestamp >=1164086256))
            ORDER BY NULL
            
            TIME EXPIRED:8.34640598297 seconds
            67104 records found.
            

            I stripped down to the basic query. Is it surprising this takes almost 6 seconds? Joining a table with 250k records to one with nearly 580k sounds like it might be slow even WITH indexes.

            QUERY 1
            
            SELECT e.id
            FROM demo_events e
            INNER JOIN demo_event_time_assoc eta
            ON e.id=eta.event_id
            
            TIME EXPIRED:5.67353010178 seconds
            578265 records found.

            Interestingly, a DISTINCT query is slightly faster:

            QUERY 2
            
            SELECT DISTINCT e.id
            FROM demo_events e
            INNER JOIN demo_event_time_assoc eta ON e.id=eta.event_id
            
            TIME EXPIRED:3.95994400978 seconds
            250000 records found.

            Adding e.active=1 appears to slow things down pretty badly. e.active is either 0 or 1, and I don't see much difference whether I index it or not. As I understand it, an index doesn't help much unless it splits your data into a pretty large number of pieces; splitting it in half probably doesn't help much at all.

            QUERY 3
            
            SELECT e.id
            FROM demo_events e
            INNER JOIN demo_event_time_assoc eta ON e.id=eta.event_id
            WHERE e.active =1
            
            TIME EXPIRED:8.91257214546 seconds
            549954 records found.

            Removing the DISTINCT and e.active parts from my current champion query appears to gain maybe 2.5 seconds relative to the original query. Whoopee. I would still need to check any returned results to make sure they are in fact active.

            QUERY 4
            
            SELECT e.id
            FROM demo_events e
            INNER JOIN demo_event_time_assoc eta ON e.id=eta.event_id
            WHERE ((eta.start_timestamp >=1164081742 AND eta.start_timestamp <=1164686542) OR (eta.start_timestamp <1164081742 AND eta.end_timestamp >=1164085342))
            
            TIME EXPIRED:5.8220641613 seconds
            72068 records found.

            Removing the time limitation of my original query gains me perhaps one second. This is also not helpful:

            QUERY 5
            
            SELECT DISTINCT e.id FROM demo_events e INNER JOIN demo_event_time_assoc eta ON e.id=eta.event_id WHERE e.active=1
            
            TIME EXPIRED:7.45040893555 seconds
            237623 records found.

            If I just remove the DISTINCT part from my original query it actually takes longer:

            QUERY 6
            
            SELECT e.id
            FROM demo_events e
            INNER JOIN demo_event_time_assoc eta ON e.id=eta.event_id
            WHERE e.active=1
            AND ((eta.start_timestamp >=1164082661 AND eta.start_timestamp <=1164687461) OR (eta.start_timestamp <1164082661 AND eta.end_timestamp >=1164086261))
            
            TIME EXPIRED:8.63395309448 seconds
            68486 records found.
            

            Lastly, I tried using a GROUP BY in place of the DISTINCT, and it was slower than anything:

            QUERY 7
            
            SELECT e.id
            FROM demo_events e
            INNER JOIN demo_event_time_assoc eta ON e.id=eta.event_id
            WHERE e.active =1
            AND ((eta.start_timestamp >=1164083534 AND eta.start_timestamp <=1164688334) OR (eta.start_timestamp <1164083534 AND eta.end_timestamp >=1164087134)) GROUP BY e.id
            
            TIME EXPIRED:9.08280897141 seconds
            67104 records found.
            

            The prospects of improving this query any more are looking pretty bleak to me. I can remove the e.active=1, but then I have a pretty gnarly programming task of finding random records. Checking e.active on any random result is simple enough, but there could be situations where we have real performance problems because we keep picking records at random and keep getting the 'wrong' ones.

            Any thoughts?

              If it's not important to have the query result ordered by e.id, then you could put an ORDER BY NULL at the end of the query to suppress the sorting that a GROUP BY does by default, then see if that gains you anything.

                I tried that. Things were generally slower or there was no change 🙁

                QUERY 2
                
                SELECT DISTINCT e.id
                FROM demo_events e
                INNER JOIN demo_event_time_assoc eta
                ON e.id=eta.event_id
                ORDER BY NULL
                
                TIME EXPIRED:4.20983719826 seconds
                250000 records found.
                
                
                
                QUERY 5
                
                SELECT DISTINCT e.id
                FROM demo_events e
                INNER JOIN demo_event_time_assoc eta
                ON e.id=eta.event_id
                WHERE e.active=1
                ORDER BY NULL
                
                TIME EXPIRED:7.71770000458 seconds
                237623 records found.
                
                
                
                QUERY 7
                
                SELECT e.id
                FROM demo_events e
                INNER JOIN demo_event_time_assoc eta
                ON e.id=eta.event_id
                WHERE e.active =1
                AND ((eta.start_timestamp >=1164085602 AND eta.start_timestamp <=1164690402)
                OR (eta.start_timestamp <1164085602 AND eta.end_timestamp >=1164089202))
                GROUP BY e.id
                ORDER BY NULL
                
                TIME EXPIRED:8.98559093475 seconds
                66535 records found.
                  5 days later

                  Interestingly, I found that if I could avoid the JOIN between my events table and my event_time_assoc table, the query is dramatically faster:

                  Query 8
                  
                  SELECT DISTINCT event_id
                  FROM demo_event_time_assoc eta
                  WHERE ((eta.start_timestamp >=1164489469 AND eta.start_timestamp <=1165094269)
                  OR (eta.start_timestamp <1164489469 AND eta.end_timestamp >=1164493069))
                  ORDER BY NULL
                  
                  TIME EXPIRED:1.03439688683 seconds
                  61603 records found.
                  

                  This is not surprising at all I suppose. Joining a table with 580k records to another table with 250k records sounds like a lot of work even with limits and indexes. For other queries that I'm still working on we'll have to do yet another join to a table with about 400k records in it.

                  This got me thinking about indexes and the query optimizer. Is it possible to tweak some aspect of the indexes that could improve performance? Here are the details from phpMyAdmin on the event_time_assoc table's indexes:

                  KEYNAME           TYPE      CARDINALITY   FIELD
                  PRIMARY           PRIMARY   578265        id
                  event_id          INDEX     289132        event_id
                  start_timestamp   INDEX     16062         start_timestamp
                  end_timestamp     INDEX     16521         end_timestamp
                  

                    First point: if 90% are active then ignore it in the query. You can select extra records with your limit and test for and discard inactive records in your code.

                    I would do the timestamp selection in a sub-query. If your version does not support sub-queries then upgrade it. Join to the select sub-query just the same as you would to a table.
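
                    Something along these lines (untested - the derived table needs MySQL 4.1+, and per the first point I've left e.active out so you can filter it in code):

                    SELECT e.id, e.title, e.subheading, e.zip
                    FROM (
                       SELECT DISTINCT event_id
                       FROM demo_event_time_assoc
                       WHERE ((start_timestamp >=1164489469 AND start_timestamp <=1165094269)
                          OR (start_timestamp <1164489469 AND end_timestamp >=1164493069))
                    ) AS t
                    INNER JOIN demo_events e ON e.id = t.event_id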

                      Roger Ramjet wrote:

                      First point: if 90% are active then ignore it in the query. You can select extra records with your limit and test for and discard inactive records in your code.

                      I've considered this and it sounds plausible if you can depend on a 90% active rate but the data I have now is bogus...I just generated it at random trying to mimic what I hope we will see after launch. Truth is, I don't know how many events will be active and how many will not.

                      And while statistically finding some active records at random seems likely, the task of coming up with PHP code to make sure I get enough randomly selected ones seems pretty daunting and rife with contingencies. In my mind I imagine a situation in which 90% of records are NOT ACTIVE and I have to pick thousands of records at random before finding 10 that I can show.

                      Complicating matters is that there are 2 other 'zero or one' fields like active that must be checked in other queries and both are paid features - meaning the random selection has to be truly random.

                      Am I being paranoid, or are the boundary possibilities as complicated as I think? I imagine a process like this:

                      1) use a simple query to fetch ALL events in the time frame - both the event id and the rest of the event data
                      2) pick records randomly by:
                      i) generating a random number R between 1 and N
                      ii) if R is not in my list of ids I've already tried and can't use, then examine it
                      iii) examining it... if it's active, use it; if it's not active, put it in the ALREADY TRIED list
                      iv) if I don't have enough records yet and the ALREADY TRIED list isn't as large as the original record set, repeat

                      What I see as being a problem is the case where I just can't seem to find enough active records and my ALREADY TRIED list starts getting really large... just picking a random number I haven't already tried becomes increasingly difficult, which would mean terrible performance. I don't think there's any way for me to 'eliminate' records from my record set once I've tried them, right? Or maybe there is? Do you see what I mean?
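
                      One way I can imagine around it (just a sketch, untested) is to shuffle the row indexes up front instead of rejection-sampling, so each row gets visited at most once - it assumes the big query also returns the active flag:

                      $n = mysql_num_rows($result);
                      // every row index exactly once, in random order
                      $indexes = range(0, $n - 1);
                      shuffle($indexes);
                      $picked = array();
                      foreach ($indexes as $i) {
                         mysql_data_seek($result, $i);
                         $row = mysql_fetch_array($result);
                         if ($row['active'] == 1 && !in_array($row['id'], $picked)) {
                            $picked[] = $row['id'];
                            if (count($picked) == 10) break;
                         }
                      }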

                      Roger Ramjet wrote:

                      I would do the timestamp selection in a sub-query. If your version does not support sub-queries then upgrade it. Join to the select sub-query just the same as you would to a table.

                      I think by 'timestamp selection' you are referring to the event details (date, cost, etc.). I LIKE the idea of subqueries but am a bit confused by what you mean. As it is, the timestamp (an indexed INT field) is one of the best ways to reduce the number of records under consideration. The query to fetch the event details is fast - typically 0.002 seconds.

                        8 days later
                        a month later

                        OK, here's what you wanna do. Let's suppose that you have ids that range from 0 to 35,000 or so.

                        If you can have a table with a gapless, or nearly gapless, sequence, that is best. For instance, I've got a table with 33,478 rows, missing every 24th or so row, whose ids go from 0 to 34,999. It also has a date field, and a randomly tacked-on text field of about 25 characters.

                        Here's what I get with order by random:

                        select random(), * from testtable2 order by 1 limit 1;
                        Time: 198.268 ms

                        Note I had to move the random() into the select list, since you technically must have something in the select list to order by on. MySQL will let you get away without that. I also don't need DISTINCT since this is a unique field.

                        If I just pick a number from 0-34999 and use that as an offset, I get different times depending on which part of the table I'm hitting:
                        select * from testtable2 limit 1 offset 1;
                        Time: 11.983 ms

                        select * from testtable2 limit 1 offset 34999;
                        Time: 63.818 ms

                        Then, I can use just a single value that I've created with random():

                        select * from testtable2 where id > 34388 limit 1;
                        Time: 16.149 ms
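
                        The 34388 there is just a number I generated beforehand - something like this (a sketch):

                        select (random()*35000)::int;
                        -- then plug that value in; an order by id makes "the next row up" well defined:
                        select * from testtable2 where id > 34388 order by id limit 1;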

                        Now, how about when I go from 35000 rows to something much bigger, say 350,000 rows?

                          Sorry, in that last message I had a very bloated index on the table in the last query. With the index cleaned up, running this query:

                          select * from testtable2 where id > 34388 limit 1;
                          Time: 0.880 ms

                          Sub-millisecond times. Now, let's see how we scale to 350,000 rows!

                          select * from testtable4 where id > 134388 limit 1;
                          Time: 0.438 ms

                          Times for these two queries ranged from 0.5 to 1.0 milliseconds for both data sets, even though one was 10 times bigger.

                          Meanwhile,
                          select random(), * from testtable2 order by 1 limit 1;
                          Time: 4481.850 ms

                          The difference is huge.

                            Thanks for helping out, Sxooter! However, we already have a reasonable solution to the slowness caused by the random selection itself: we select a record set, then use PHP code to pick from it randomly, by counting the records returned and using mysql_result() to scan through the result set.

                            The crux of the problem (if you read this whole LONG thread) is that the inherent selection and join is too slow and doesn't scale. With a mere 250,000 events (occurring on average a little over 2 times each) we get unacceptably slow results:

                            SELECT e.id
                            FROM demo_events e
                            INNER JOIN demo_event_time_assoc eta
                            ON e.id=eta.event_id
                            
                            TIME EXPIRED:5.67353010178 seconds
                            578265 records found.
                            

                            In fact, just today I ran this query and got unacceptably slow results:

                            SELECT * FROM demo_events
                            TIME EXPIRED:12.1455380917 seconds
                            254000 records found.
                            

                            I'm wondering why this is, and my guess is that it's because I have variable-length TEXT and VARCHAR fields. I should probably move these to a separate table; that would dramatically reduce the size of the events table and hopefully improve the speed.

                              I'm wondering why this is, and my guess is that it's because I have variable-length TEXT and VARCHAR fields. I should probably move these to a separate table; that would dramatically reduce the size of the events table and hopefully improve the speed.

                              No, that's not the problem. With proper indexing, selecting a single rather wide row should still be very fast.

                              The problem is that you're asking for a whole file cabinet's worth of data, then throwing out all but a single sheet of paper. It's inherently inefficient.

                              The methodology I listed is a very fast way to pick a single id from a large set of them, and assuming a fairly normal distribution you'll get a truly random result set without a lot of records showing up too often. And, if you want a perfectly random result set, then you can always keep a table that maps a non-gapped sequence to the ids, but that is transactionally intensive on writes, so don't do it unless you have to.
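
                              A sketch of that mapping table idea, in case you ever want it (table and column names made up):

                              create table id_map (
                                 seq int not null primary key,  -- gapless 1..N
                                 event_id int not null          -- the real event's id
                              );
                              -- generate R between 1 and N in your code, then:
                              select event_id from id_map where seq = 12345;  -- 12345 being your R
                              -- the write cost: deletes/inserts have to renumber seq to stay gapless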

                              You did notice that my query response times for the random selection from a 350,000-row table were less than 1 millisecond, not less than 1 second.

                                I don't get the feeling you've read this thread in its entirety, which is understandable. I've got stacks of indexes on my database, and I've tried forcing particular indexes and join orders.

                                What's more, this gives me error 1064:

                                select random(), * from demo_events order by 1 limit 1

                                as does this

                                select rand(), * from demo_events order by 1 limit 1

                                My data is likely to have HUGE gaps in it because I expect to remove expired events to improve performance.

                                Lastly, I don't see anything in these queries that results in a new record each time. Are you sure you aren't receiving the same record over and over? Might your speed be due to query caching?

                                  Trust me, I've read the whole thread. I'm not sure you're reading my responses completely, though.

                                  No amount of indexing is gonna speed up retrieving a whole data set and then throwing all but one row away.

                                  Basically, you need to come up with a way to hit just one row or a short set of rows.

                                  Maybe you could do something like select * from the tables where id between 10000 and 12000 (i.e. 2,000 rows or so), then use a random offset inside that set. As long as there aren't too many huge gaps in your sequence, you should be able to get a fair number of records from that without having to select from the whole data set.
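
                                  A sketch of that, on your tables (untested; window numbers made up, and it assumes your ids run up to roughly 250,000):

                                  // pick a random 2,000-wide id window
                                  $lo = mt_rand(0, 248000);
                                  $sql = "SELECT id, title FROM demo_events
                                          WHERE id BETWEEN $lo AND " . ($lo + 2000);
                                  $result = mysql_query($sql) or die(mysql_error());
                                  $n = mysql_num_rows($result);
                                  if ($n > 0) {
                                     // random offset inside the window
                                     $id = mysql_result($result, mt_rand(0, $n - 1), 'id');
                                  }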

                                    Lastly, I don't see anything in these queries that results in a new record each time. Are you sure you aren't receiving the same record over and over? Might your speed be due to query caching?

                                    Yes, I'm sure I'm not receiving the same record over and over. And there ain't no query cache in postgresql.

                                      Sxooter wrote:

                                      No amount of indexing is gonna speed up retrieving a whole data set and then throwing all but one row away.

                                      Amen! I hear that.

                                      However, I tried your recommended code and it causes an error. That sort of put a damper on my ability to test it.