I'm working on a query that comma-joins four tables. Actually it's two queries, with only one simple difference between them. This one is fast (0.17 sec):

SELECT distinct l.listingId, l.title, l.slug, l.blurb, l.address1, l.address2, l.city, l.state, l.zip, l.recipientPhone, l.bbbLink, l.environixCertified, l.noEmail, l.serviceAreaCoords, l.dontDisplayAddress, l.latitude, l.longitude, SQRT(POWER(ABS((l.latitude * 69) - (47.611679 * 69)),2) + POWER(ABS((l.longitude * 60) - (-122.333261 * 60)),2)) AS distance 
FROM listings AS l, listingsServices AS ls, listingsCities AS lc, services AS s 
WHERE s.slug='mold-inspection' AND ls.serviceId=s.serviceId AND l.listingId=ls.listingId AND l.active > 0 AND ((lc.listingId=l.listingId AND lc.cityId=24001) OR (l.sourceCityId=24001)) AND l.recipientPhone IS NOT NULL 
ORDER BY rand();

and this one is slow - a lot slower (4.2 sec):

SELECT distinct l.listingId, l.title, l.slug, l.blurb, l.address1, l.address2, l.city, l.state, l.zip, l.recipientPhone, l.bbbLink, l.environixCertified, l.noEmail, l.serviceAreaCoords, l.dontDisplayAddress, l.latitude, l.longitude, SQRT(POWER(ABS((l.latitude * 69) - (47.611679 * 69)),2) + POWER(ABS((l.longitude * 60) - (-122.333261 * 60)),2)) AS distance 
FROM listings AS l, listingsServices AS ls, listingsCities AS lc, services AS s 
WHERE s.slug='mold-inspection' AND ls.serviceId=s.serviceId AND l.listingId=ls.listingId AND l.active > 0 AND ((lc.listingId=l.listingId AND lc.cityId=24001) OR (l.sourceCityId=24001)) AND l.recipientPhone IS NULL 
ORDER BY rand();

The only difference is the last condition in the WHERE clause - we're looking for IS NULL in the slow one and IS NOT NULL in the fast one.
All tables are indexed, at least as far as I can tell. I ran EXPLAIN and got the same plan for both queries:

id | select_type | table | type   | possible_keys                | key                   | key_len | ref                              | rows | Extra
1  | SIMPLE      | l     | ALL    | PRIMARY,id active recip city | NULL                  | NULL    | NULL                             | 1208 | Using where; Using temporary; Using filesort
1  | SIMPLE      | ls    | ref    | listing id service id        | listing id service id | 5       | environix_directory.l.listingId  | 1    | Using where; Using index; Distinct
1  | SIMPLE      | lc    | index  | listing city                 | listing city          | 10      | NULL                             | 90   | Using where; Using index; Distinct
1  | SIMPLE      | s     | eq_ref | PRIMARY,id slug              | PRIMARY               | 4       | environix_directory.ls.serviceId | 1    | Using where; Distinct

As you can see, the first row is using temporary and filesort (though it is also using where). I'm not sure how to get rid of the temporary/filesort, since I do have indexes on the fields being queried.

Is there a clear reason why looking for IS NULL would slow things down so much? I know 4 seconds may not seem that slow, but only about 2% of my eventual data is in the table so far.
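
In case the split matters: I can count how many listings fall into each case with something like this (just a quick sketch):

SELECT recipientPhone IS NULL AS phoneMissing, COUNT(*) AS listingCount
  FROM listings
 WHERE active > 0
 GROUP BY phoneMissing;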

Thanks!

  • Bob

    I'm no DBA guru, but the first thing my eyes are drawn to is the "ORDER BY RAND()" (and the subsequent "Using filesort").

    Have you by chance read the stickied thread in this forum entitled Fast Random Results - "ORDER BY RAND()" is bad?

    EDIT: Quick experiment; is the second query still just as slow if you remove the "ORDER BY" clause altogether?

      How many rows are returned by the two queries? It might be that the ORDER BY is the culprit if the number of rows is very different.

        bradgrafelman;10971450 wrote:

        I'm no DBA guru, but the first thing my eyes are drawn to is the "ORDER BY RAND()" (and the subsequent "Using filesort").

        Have you by chance read the stickied thread in this forum entitled Fast Random Results - "ORDER BY RAND()" is bad?

        EDIT: Quick experiment; is the second query still just as slow if you remove the "ORDER BY" clause altogether?

        Thanks Brad - I have now. I tried bypassing the ORDER BY clause, and that dropped the time on the second query from 9-something seconds (which it seems to have crept up to - hmm, not sure why) to 4.97 seconds, which is still too slow.

        So clearly I will have to work on my randomizing.

        One thing I did was replace the IS NULL check by adding a field with a defined value - 1 or 0 - depending on whether it's a paid listing or not, so there's no longer any need to search on IS NULL or IS NOT NULL. That seems to save a little.
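
        The schema change was basically this (simplified a bit):

        ALTER TABLE listings
          ADD COLUMN paidListing TINYINT NOT NULL DEFAULT 0;
        -- (an index on paidListing might be worth adding as well)

        -- backfill: a non-null recipientPhone meant a paid listing
        UPDATE listings
           SET paidListing = IF(recipientPhone IS NOT NULL, 1, 0);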

        Here's the current query:

        SELECT distinct l.listingId, l.title, l.slug, l.blurb, l.address1, l.address2, l.city, l.state, l.zip, l.recipientPhone, l.bbbLink, l.environixCertified, l.noEmail, l.serviceAreaCoords, l.dontDisplayAddress, l.latitude, l.longitude, SQRT(POWER(ABS((l.latitude * 69) - (47.611679 * 69)),2) + POWER(ABS((l.longitude * 60) - (-122.333261 * 60)),2)) AS distance 
        FROM listings AS l, listingsServices AS ls, listingsCities AS lc, services AS s 
        WHERE s.slug='mold-inspection' AND ls.serviceId=s.serviceId AND l.listingId=ls.listingId AND l.active > 0 AND ((lc.listingId=l.listingId AND lc.cityId=24001) OR (l.sourceCityId=24001)) AND l.paidListing=0

        And the EXPLAIN output:

        id | select_type | table | type   | possible_keys             | key             | key_len | ref                                      | rows | Extra
        1  | SIMPLE      | lc    | index  | listing_city              | listing_city    | 10      | NULL                                     | 1180 | Using index; Using temporary
        1  | SIMPLE      | ls    | index  | listing service           | listing service | 10      | NULL                                     | 1478 | Using index
        1  | SIMPLE      | l     | eq_ref | PRIMARY,active recipPhone | PRIMARY         | 4       | greenbiz_environixDirectory.ls.listingId | 1    | Using where
        1  | SIMPLE      | s     | eq_ref | PRIMARY,service id slug   | PRIMARY         | 4       | greenbiz_environixDirectory.ls.serviceId | 1    | Using where; Distinct

        Seems like that "Using temporary" is the problem, right? But I have that table indexed and it also seems to be using the index. So...I don't understand.

        The faster query (the one with "paidListing=1", which used to be "recipientPhone IS NOT NULL") shows only 51 rows in the first EXPLAIN line and 5 in the second, so perhaps that row count is the only real difference. But in Brad's link to the sticky about rand(), the OP was talking about millions of records - I have fewer than 10,000 so far.

        I also tried dropping the DISTINCT from the query, and that shaved off about 2.5 seconds - but then I get duplicate results. Perhaps DISTINCT has a similar issue to rand()?
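
        My guess is the duplicates come from the listingsCities rows multiplying the result set, so maybe an EXISTS check would let me drop DISTINCT altogether. Something like this untested sketch (column list trimmed for readability):

        SELECT l.listingId, l.title  -- (plus the rest of the columns and the distance expression)
        FROM listings AS l, listingsServices AS ls, services AS s
        WHERE s.slug = 'mold-inspection'
          AND ls.serviceId = s.serviceId
          AND l.listingId = ls.listingId
          AND l.active > 0
          AND l.paidListing = 0
          AND (l.sourceCityId = 24001
               OR EXISTS (SELECT 1
                            FROM listingsCities AS lc
                           WHERE lc.listingId = l.listingId
                             AND lc.cityId = 24001));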

        I wonder if I'm not better off just doing a bunch of separate queries, rather than trying to join all these tables together. I seem to recall some slowness in the past doing multi-table joins with DISTINCT.

        I also wonder if I should be using LEFT JOINs in some of these cases. I admit I don't understand join types very well.
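
        For what it's worth, here's the current query rewritten with explicit JOIN syntax, using a LEFT JOIN for listingsCities instead of the comma join (untested - I think it's equivalent, but I haven't verified it, and the column list is trimmed again):

        SELECT DISTINCT l.listingId, l.title  -- (plus the rest of the columns and the distance expression)
        FROM listings AS l
        INNER JOIN listingsServices AS ls ON ls.listingId = l.listingId
        INNER JOIN services AS s ON s.serviceId = ls.serviceId
        LEFT JOIN listingsCities AS lc
               ON lc.listingId = l.listingId
              AND lc.cityId = 24001
        WHERE s.slug = 'mold-inspection'
          AND l.active > 0
          AND l.paidListing = 0
          AND (lc.listingId IS NOT NULL OR l.sourceCityId = 24001);

        No idea yet whether the optimizer handles this any better than the comma-join version.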

        • Bob

          OKAY - I think I have found a pretty major bottleneck. It's the "OR" in the WHERE clause. Basically, I need to check whether a listing either belongs to the chosen city directly, or is associated with it in the listingsCities table. Trying to check both of those in a single query is killing the speed.

          So - how should I go about rewriting this? Should I run one of the queries (i.e. the associative listingsCities one) and then a second query against listings where the listingId is not in the first result (i.e. "AND l.listingId NOT IN (1, 2, 3, 4, 5...etc)" - a list that could get long)? On paper that sounds very brute force.

          Or perhaps I do two separate queries, walk thru one and create a PHP array indexed by listingId, then go thru the other and only add entries whose listingId isn't already indexed? Also kinda brute force but if that's what it takes...
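
          Or maybe I could keep the whole thing in SQL with a UNION - one half for the listingsCities matches and one half for the sourceCityId matches. Another untested sketch (column list trimmed; as I understand it, UNION would also de-duplicate across the two halves):

          SELECT l.listingId, l.title  -- (plus the rest of the columns and the distance expression)
            FROM listings AS l
            JOIN listingsServices AS ls ON ls.listingId = l.listingId
            JOIN services AS s ON s.serviceId = ls.serviceId
            JOIN listingsCities AS lc ON lc.listingId = l.listingId
           WHERE s.slug = 'mold-inspection'
             AND l.active > 0
             AND l.paidListing = 0
             AND lc.cityId = 24001

          UNION

          SELECT l.listingId, l.title  -- (same column list)
            FROM listings AS l
            JOIN listingsServices AS ls ON ls.listingId = l.listingId
            JOIN services AS s ON s.serviceId = ls.serviceId
           WHERE s.slug = 'mold-inspection'
             AND l.active > 0
             AND l.paidListing = 0
             AND l.sourceCityId = 24001;

          The random ordering could then go on the combined result, or be handled in PHP afterwards.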

          Any suggestions?
