The GROUP BY clause was in there to cheat, more or less: an event might be associated with several different records in the eta table, and I only want to show each event once, regardless of how many occurrences it has.
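(For what it's worth, SELECT DISTINCT would express that one-row-per-event intent more directly than GROUP BY; a sketch against the same tables:)

```sql
-- one row per event, no matter how many eta rows match
SELECT DISTINCT e.id, e.title, e.subheading, e.zip, z.city, z.st
FROM demo_events e
INNER JOIN demo_event_time_assoc eta ON e.id = eta.event_id
INNER JOIN demo_zip_codes z ON e.zip = z.zip
WHERE e.active = 1
```

Note this only works if eta.start_timestamp and eta.end_timestamp are dropped from the select list; since those differ per occurrence, DISTINCT over them would still return multiple rows per event.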
I really appreciate your help, but I tried your query and it typically takes about three times as long as my original query. :( When I tried EXPLAIN on your query:
EXPLAIN SELECT e.id, e.title, e.subheading, e.zip, eta.start_timestamp, eta.end_timestamp, z.city, z.st
FROM demo_events e
INNER JOIN demo_event_time_assoc eta ON e.id = eta.event_id
INNER JOIN demo_zip_codes z ON e.zip = z.zip
WHERE e.active = 1
AND (
    (
        eta.start_timestamp >= 1162506800
        AND eta.start_timestamp <= 1163111600
    )
    OR (
        eta.start_timestamp < 1162506800
        AND eta.end_timestamp >= 1162510400
    )
)
ORDER BY RAND()
LIMIT 10
I get essentially the same results as with my query, only with more rows under consideration:
table type possible_keys key key_len ref rows Extra
eta range event_id,start_timestamp,end_timestamp start_timestamp 4 NULL 54522 Using where; Using temporary; Using filesort
e eq_ref PRIMARY,active PRIMARY 4 eta.event_id 1 Using where
z eq_ref PRIMARY PRIMARY 5 e.zip 1
As for the data, here is some info on the tables being queried:
demo_events: 32,354 records; indexed on id and zip.
demo_event_time_assoc: 75,325 records; indexed on id, event_id, start_timestamp, and end_timestamp.
demo_zip_codes: 43,104 records; indexed on zip.
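One thing I may still try: since the OR mixes range conditions on two different indexed columns (start_timestamp and end_timestamp), MySQL can generally use only one of those indexes for the scan. Rewriting the OR as a UNION sometimes lets each branch use its own index. An untested sketch against the same tables and constants (whether it actually helps depends on the optimizer):

```sql
(SELECT e.id, e.title, e.subheading, e.zip, eta.start_timestamp, eta.end_timestamp, z.city, z.st
 FROM demo_events e
 INNER JOIN demo_event_time_assoc eta ON e.id = eta.event_id
 INNER JOIN demo_zip_codes z ON e.zip = z.zip
 WHERE e.active = 1
   AND eta.start_timestamp >= 1162506800
   AND eta.start_timestamp <= 1163111600)
UNION
(SELECT e.id, e.title, e.subheading, e.zip, eta.start_timestamp, eta.end_timestamp, z.city, z.st
 FROM demo_events e
 INNER JOIN demo_event_time_assoc eta ON e.id = eta.event_id
 INNER JOIN demo_zip_codes z ON e.zip = z.zip
 WHERE e.active = 1
   AND eta.start_timestamp < 1162506800
   AND eta.end_timestamp >= 1162510400)
ORDER BY RAND()
LIMIT 10
```

UNION (without ALL) also removes duplicate rows, so it can't return the same occurrence from both branches.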
Interestingly, if I remove the ORDER BY RAND() bit, your query runs in 0.0014 seconds. If I remove the ORDER BY and GROUP BY bits of my query, it runs in 0.0071 seconds.
I tried an alternate method of choosing records at random from this set, which entailed:
1) Run a first query to fetch all event ids that match the time criteria.
2) Take the $count of records returned and choose ten random numbers between 1 and $count.
3) Loop through all the ids returned, incrementing $i. If $i is one of my random numbers, keep that record. Once we have 10 results, exit the loop.
4) Run another query to fetch the details for the 10 randomly chosen ids.
Here's the code, which still needs tweaking for situations where there are fewer than 10 eligible records:
$rows_to_fetch = 10;
// these variables are meaningful within the real page context but here
// we just initialize them to the current server time
$target_zip_timestamp = strtotime('-2 days', time());
$target_zip_timestamp_plus_week = strtotime('+1 week', $target_zip_timestamp);
$start = get_microtime();
// NEW APPROACH
// grab all the ids first
$sql = "SELECT DISTINCT e.id FROM
" . TABLE_EVENTS . " e,
" . TABLE_EVENT_TIME_ASSOC . " eta,
" . TABLE_ZIPS . " z
WHERE eta.event_id=e.id
AND z.zip=e.zip
AND e.active=1
AND (
((eta.start_timestamp >= " . $target_zip_timestamp . ") AND (eta.start_timestamp <= " . $target_zip_timestamp_plus_week . "))
OR
((eta.start_timestamp < " . $target_zip_timestamp . ") AND (eta.end_timestamp >= " . ($target_zip_timestamp + REQUIRED_REMAINING_EVENT_TIME) . "))
)
";
$query_start = get_microtime();
$result = $db->query($sql, 'testing_query_2')
or die('second approach query 1 failed');
echo 'New method Query 1 time:' . (get_microtime() - $query_start) . ' seconds<br>';
$total_ids = $db->numrows($result);
echo $total_ids . ' distinct event ids found.<br>';
// at this point we should know how many rows there are
// let's pick 10 numbers at random between 1 and the total
$r_start = get_microtime();
// clamp so the selection loop can't spin forever when fewer
// eligible records exist than we want to fetch
$rows_to_fetch = min($rows_to_fetch, $total_ids);
$rand_nums = array();
while (sizeof($rand_nums) < $rows_to_fetch) {
    $r = rand(1, $total_ids);
    if (!in_array($r, $rand_nums)) {
        $rand_nums[] = $r;
    }
}
echo 'Random number selection:' . (get_microtime() - $r_start) . ' seconds<br>';
print_r($rand_nums);
echo '<br>';
// loop through all the found ids, keeping only ones that match our random numbers
$loop_start = get_microtime();
$i=0;
$random_ids = array();
while ($row = $db->fetchrow($result)) {
    $i++;
    if (in_array($i, $rand_nums)) {
        $random_ids[] = $row['id'];
    }
    // if we have found enough, we can break out of the loop
    if (sizeof($random_ids) >= $rows_to_fetch) {
        break;
    }
}
echo 'Fetch loop time:' . (get_microtime() - $loop_start) . ' seconds<br>';
print_r($random_ids);
echo '<br>';
$db->freeresult($result);
// now fetch the important details that correspond to our random ids
$sql = "SELECT e.id, e.title, e.subheading, e.zip, eta.start_timestamp, eta.end_timestamp, z.city, z.st
FROM demo_events e, demo_event_time_assoc eta, demo_zip_codes z
WHERE eta.event_id=e.id
AND z.zip = e.zip
AND e.id IN (" . implode(',', $random_ids) . ")
AND (
((eta.start_timestamp >= " . $target_zip_timestamp . ") AND (eta.start_timestamp <= " . $target_zip_timestamp_plus_week . "))
OR
((eta.start_timestamp < " . $target_zip_timestamp . ") AND (eta.end_timestamp >= " . ($target_zip_timestamp + REQUIRED_REMAINING_EVENT_TIME) . "))
)
ORDER BY RAND()";
$query_start = get_microtime();
$result = $db->query($sql, 'testing_query_2b')
or die('final fetch failed');
echo 'New method Query 2 time:' . (get_microtime() - $query_start) . ' seconds<br>';
echo $db->numrows($result) . ' records found.<br>';
echo '<table width="100%">';
$prev_id = '';
while ($row = $db->fetchrow($result)) {
    if ($row['id'] != $prev_id) {
        echo '<tr>';
        echo '<td>' . $row['id'] . '</td>';
        echo '<td>' . $row['title'] . '</td>';
        echo '<td>' . $row['subheading'] . '</td>';
        echo '<td>' . $row['zip'] . '</td>';
        echo '<td>' . date('m-d-Y, H:i:s', $row['start_timestamp']) . '</td>';
        echo '<td>' . date('m-d-Y, H:i:s', $row['end_timestamp']) . '</td>';
        echo '<td>' . $row['city'] . '</td>';
        echo '<td>' . $row['st'] . '</td>';
        echo '</tr>';
    } // if id is new
    $prev_id = $row['id'];
}
echo '</table>';
$db->freeresult($result);
echo 'TIME ELAPSED NEW METHOD:' . (get_microtime() - $start) . ' seconds<br>';
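Incidentally, steps 2 and 3 above could be collapsed using PHP's built-in array_rand(), which picks distinct random keys in a single call. A self-contained sketch (the $all_ids data here is made up for illustration; in the real page it would be the ids fetched by query 1):

```php
<?php
// Hypothetical stand-in for the ids returned by query 1.
$all_ids = array(101, 102, 103, 104, 105);
$rows_to_fetch = 10;

// array_rand() errors out if asked for more keys than the array
// holds, so clamp first (this also covers the fewer-than-10 case).
$num = min($rows_to_fetch, count($all_ids));
$picks = array_rand($all_ids, $num);

$random_ids = array();
foreach ((array) $picks as $key) {   // (array) cast covers the $num == 1 case
    $random_ids[] = $all_ids[$key];
}
```

This replaces both the random-number do/while and the row-scanning loop with one pass over the picked keys.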
This approach was faster if the results of step 1 were already cached, but it was actually slower than my original approach whenever the query had new timestamp values, which it ALWAYS will, because the timestamps are derived from the current time and the current time is constantly changing.
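One mitigation I haven't tested yet: since the exact timestamp changes every second, the generated SQL string is different on every request, so the MySQL query cache never gets a hit. Rounding the target down to the nearest hour keeps the SQL text identical for up to an hour at a time:

```php
<?php
// Same setup as above, but snap the target to an hour boundary so the
// generated query string (and thus any cached result) stays stable
// for up to an hour instead of changing every second.
$target_zip_timestamp = strtotime('-2 days', time());
$target_zip_timestamp -= $target_zip_timestamp % 3600;  // snap down to the hour
$target_zip_timestamp_plus_week = strtotime('+1 week', $target_zip_timestamp);
```

The trade-off is that "now" is up to an hour stale, which may or may not be acceptable for the event-window logic.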
I'm really worried there isn't any way to make this query significantly faster. I was hoping to have this run in under a second when dealing with millions of event records. Sigh. Is this hopeless?