query performance profiling

sneakyimp · 2006-11-02T00:43:37+00:00

I have a site that has a number of queries per page. I can tell that some of the queries are quite slow and I am going to push for a page redesign from my p...

sneakyimp

Sxooter wrote:
I'm guessing that you might get better performance with a star schema

http://en.wikipedia.org/wiki/Star_schema

I read that entry. I'm not sure I understand what it means unless
a) we are reducing the size of tables by breaking them into smaller and smaller bits.
b) that somehow different tables get stored on different machines to bring more processing power to bear on the situation.

I suspect point A is what the intent is. Sadly, a JOIN between two large tables (e and eta in my examples) is one of our biggest performance killers.

Sxooter, If you could spoon feed me by paraphrasing that article I'd sure appreciate it.

Roger, I hear what you are saying. But the big JOINs are troublesome. I calculate proximity to a zip code at the moment with an elaborate query that performs a great circle distance calculation from the latitude and longitude of a particular zip code. I'll probably reduce this to a much rougher but faster simple comparison of long and lat. Additionally, rather than trying to JOIN my event table to the zip table, I will simply de-normalize my DB yet again by storing the long and lat in the main event table.

Roger_Ramjet

sneakyimp wrote:
I calculate proximity to a zip code at the moment with an elaborate query that performs a great circle distance calculation from the latitude and longitude of a particular zip code.

When and where are you doing that? Is that data stored in a table or are you doing it on the fly?

Sxooter

From what you've posted recently, I'm not so sure a star schema would help. The primary reason to go with one is that you can easily view various dimensions of your data by including various tables for each one. For what you're doing, I'm not sure it would help.

OK, which is the MOST selective of your criteria, generally? Is it zip code, category, etc???

We need to focus on that first. If it's zip code, then we need an efficient and fast way to find all the entries in a particular zip code.

To do that, you can use some kind of simple lat/long based bounding box to do an initial search. I.e. every zip gets a rough max_lat,min_lat,max_long,min_long and you see if the event's lat/long get it inside a box or not with a very simple AND/OR in a where clause. (or use a db which can store geometric data types and use the overlap operator... ahem... 🙂 )

If it's something else, let me know. Basically, whichever thing will reduce your number of options the mostest the fastest, we want to optimize the firstest...

Sxooter

sneakyimp wrote:
I suspect point A is what the intent is. Sadly, a JOIN between two large tables (e and eta in my examples) is one of our biggest performance killers.

This is a rather unqualified statement. If you've got a selective enough where clause, then the join will likely be fast as long as it's indexed properly. It's when it's unconstrained that it's always gonna be slow.

select * from table1 join table2 on table1.id=table2.t1id

is not the same as

select * from table1 join table2 on table1.id=table2.t1id where table1.id between 10 and 12

Roger_Ramjet

I expect that zip code will be the most granular selection condition and should be the first used. Given that zip code lat and long are static, there is no need to calculate anything on the fly. Simply pre-calculate and store the zip codes within particular bands of distance for each specific zip code, e.g. those within 5, 10, 15 miles etc. up to a cut-off distance of say 50 miles. You may need to vary this by region as I am sure that zip codes vary from place to place in individual size, e.g. city and country codes.

Personally I would have them all stored in multiple tables according to the band, so you simply look up a zip code and get all the codes within eg 10 miles. (I'm in the UK where we are used to short distances, in the US I'm sure you expect to travel further so adjust to suit). Very simple table join will reduce your search space immensely and you can then apply the more complex conditions.

If you insist on a calculated condition then a quick and dirty approximation is to convert lat and long to seconds of arc and you can then do basic integer maths with Pythagoras's theorem. Again you store the zip code lat and long as seconds of arc.
Given that zips are not points in space but have area, this approximation is good enough until you get up near the Arctic Circle.
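
A minimal sketch of that seconds-of-arc idea, written in Python purely for illustration. The 69 miles per degree figure and the function names are my own assumptions, not something from Roger's post, and like the approach itself it ignores the way longitude lines converge away from the equator, so it overstates east-west distances.

```python
ARCSEC_PER_DEGREE = 3600
# Assumption: one degree of latitude is roughly 69 miles.
MILES_PER_ARCSEC = 69.0 / ARCSEC_PER_DEGREE


def to_arcsec(degrees: float) -> int:
    """Convert decimal degrees to whole seconds of arc (what you would store)."""
    return round(degrees * ARCSEC_PER_DEGREE)


def within_miles(lat1_s: int, lon1_s: int, lat2_s: int, lon2_s: int,
                 limit_miles: float) -> bool:
    """Integer-only Pythagoras: compare squared distances, no square root."""
    limit_s = round(limit_miles / MILES_PER_ARCSEC)
    dy = lat2_s - lat1_s
    dx = lon2_s - lon1_s
    return dx * dx + dy * dy <= limit_s * limit_s


# Two hypothetical zip centroids 0.1 degrees of latitude apart (about 6.9 miles).
a = (to_arcsec(40.0), to_arcsec(-75.0))
b = (to_arcsec(40.1), to_arcsec(-75.0))
inside_10 = within_miles(*a, *b, limit_miles=10)
inside_5 = within_miles(*a, *b, limit_miles=5)
```

Comparing squared values keeps everything in integers, which is the point of storing seconds of arc in the first place.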

sneakyimp

so i did some research and have verified that even my super slow dev server can get really fast results when picking records at random from a table with contiguous keys (no gaps). I put 100 million records into this table:

CREATE TABLE `test_simple_assoc` (
  `id` int(12) unsigned NOT NULL auto_increment,
  `f_id` int(12) unsigned NOT NULL default '0',
  PRIMARY KEY  (`id`)
) TYPE=MyISAM AUTO_INCREMENT=1 ;

Even with 10⁸ records, I could do the following operations in an average of 0.111 seconds:
1) query the table to find the total number of records, N
2) generate 5 random numbers between 1 and N
3) query the database to retrieve the records with ids corresponding to those 5 numbers.

This hints that Sxooter is totally right about maybe using distilled/condensed tables for my random selection stuff. I could totally live with that kind of performance. The data curve i plotted suggests that even with a billion records on my SLOW dev server, we could pick 5 at random in about .15 seconds. Execution time seems to vary linearly with Log N (see graph).
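
For anyone following along, the three steps above can be sketched like this. It uses Python and an in-memory SQLite table as a stand-in for the MySQL table, so it only illustrates the logic and says nothing about timings; the helper name is hypothetical. It also picks 5 distinct ids, which may differ slightly from the original test.

```python
import random
import sqlite3


def pick_random_rows(conn, how_many=5):
    """Count the rows, draw random ids in 1..N, fetch them by primary key."""
    (n,) = conn.execute("SELECT COUNT(*) FROM test_simple_assoc").fetchone()
    ids = random.sample(range(1, n + 1), how_many)  # distinct ids
    placeholders = ",".join("?" * how_many)
    sql = f"SELECT id, f_id FROM test_simple_assoc WHERE id IN ({placeholders})"
    return conn.execute(sql, ids).fetchall()


# Stand-in table with contiguous ids 1..1000 (no gaps), as the trick requires.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test_simple_assoc (id INTEGER PRIMARY KEY, f_id INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO test_simple_assoc (f_id) VALUES (?)",
    [(i * 7,) for i in range(1000)],
)
rows = pick_random_rows(conn)
```

The approach only works while the keys stay contiguous; a deleted row would make some draws come back empty.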

The trick then is to make sure I can populate that table reliably.

My next effort will be to check if I can incorporate a zip/proximity search into this random selection. I like Roger's suggestion of using banded, pre-calculated tables. Roger, what might you suggest for the structure of this table?

Sxooter

I'd look at using the "banded" table by having a foursome of ints in it that have max_x, min_x, max_y, and min_y and all four indexed (min_x,max_x) and (min_y, max_y) or possibly all four at once, but that would make for a really big index.

If you were using PostgreSQL I'd say to create functional indexes on the main table instead of making a banding table, cause you could use max(x) and min(x) in a functional index there, and I'd create four separate indexes cause it's got bit map indexing. Not sure what's most efficient for MySQL though.

Roger_Ramjet

Well mysql has its Spatial Extensions which are a non-standard and partial implementation of spatial geometry. In particular it has the functions for minimal bounding rectangles that work on minx maxx miny maxy.

Trouble with this approach is that you are talking about a rectangle and I'm sure that most zip codes are not in any way rectangular areas. However, it does give a rough method for finding places that are within a certain distance of each other.

Personally, as I said, I would just use trigonometry on zip lat/long to find out which zips were within 5, 10, 15, etc.

proximity table
zip_id zip_id distance
1 - 2 - 5
1 - 4 - 5
1 - 7 - 5
1 - 8 - 10
1 - 12 - 15
1 - 32 - 15
1 - 45 - 20

etc

Just query for zips where distance <= 10 and you have your list (2,4,7,8). Takes more processing to set up and requires an additional table but should be a lot faster at run-time. Of course it is only suitable for local searches whereas the geospatial approach is far more generalised and will be far more efficient over wider areas.
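
A rough sketch of how such a proximity table could be pre-calculated, in Python for illustration. The coordinates, the band limits and the Earth radius of 3959 miles are assumptions made up for the example; only pairs inside the largest band get a row, which is what keeps the table far smaller than every-zip-against-every-zip.

```python
import math
from itertools import permutations

EARTH_RADIUS_MILES = 3959.0   # assumed mean Earth radius
BANDS = (5, 10, 15, 20, 50)   # band limits in miles; 50 is the cut-off


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def build_proximity_rows(zips):
    """Return (zip_id, other_zip_id, band) rows for pairs within the cut-off."""
    rows = []
    for (a, (lat_a, lon_a)), (b, (lat_b, lon_b)) in permutations(zips.items(), 2):
        d = haversine_miles(lat_a, lon_a, lat_b, lon_b)
        band = next((limit for limit in BANDS if d <= limit), None)
        if band is not None:  # pairs beyond the cut-off get no row at all
            rows.append((a, b, band))
    return rows


# Made-up zip ids and centroids (lat, long in degrees).
zips = {1: (40.00, -75.00), 2: (40.05, -75.00), 8: (40.12, -75.00), 99: (45.00, -75.00)}
rows = build_proximity_rows(zips)
near_1 = sorted(b for a, b, band in rows if a == 1 and band <= 10)
```

At run-time the lookup is then just the `distance <= 10` filter on one indexed column, with all the trigonometry done once up front.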

sneakyimp

OK....two approaches...

Sxooter wrote:
I'd look at using the "banded" table by having a foursome of ints in it that have max_x, min_x, max_y, and min_y and all four indexed (min_x,max_x) and (min_y, max_y) or possibly all four at once, but that would make for a really big index.

If you were using PostgreSQL I'd say to create functional indexes on the main table instead of making a banding table, cause you could use max(x) and min(x) in a functional index there, and I'd create four separate indexes cause it's got bit map indexing. Not sure what's most efficient for MySQL though.

I think i understand. We give each zip code one record for each band which describes the limits--i.e, 1 record for 5 mile limit, 1 record for 10 mile limit, etc. This allows us to run queries with a pretty simple integer comparison. That 2nd paragraph kind of escapes me...perhaps you could translate to stupidese for me?

Also, I'm wondering if that approach might not be identical on some weird math plane to this approach i'm using for my current testing (i will have results very soon):

define('MILES_PER_RADIAN', 3959); // assumed value: roughly the Earth's mean radius in miles
$zip_lat = 0.72889590272; // target zip code's latitude in radians
$zip_long = -1.33418679246; // target zip code's long in radians
$proximity_limit = 50; // choose only zips within this many miles
$max_dist_radians = $proximity_limit/MILES_PER_RADIAN; // convert miles to radians
// rough square box around the target; the same offset is used for lat and long
$sql = "SELECT id FROM test_zip_assoc WHERE
	lat_radians > " . ($zip_lat - $max_dist_radians) . " AND
	lat_radians < " . ($zip_lat + $max_dist_radians) . " AND
	long_radians > " . ($zip_long - $max_dist_radians) . " AND
	long_radians < " . ($zip_long + $max_dist_radians);

I have no problem AT ALL with a really rough approximation. The performance so far has been so bad that accuracy is not my greatest concern at this point.
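
One thing worth noting about a box like this: a radian of longitude covers less ground the further you get from the equator, so using the same offset for both axes makes the box too narrow east-west. Below is a small Python sketch of one common rough correction, dividing the longitude offset by cos(latitude). The radius value and function name are assumptions, and the correction itself is only an approximation that falls apart near the poles.

```python
import math

MILES_PER_RADIAN = 3959.0  # assumed: Earth's mean radius in miles


def bounding_box(lat_rad, long_rad, limit_miles):
    """Return (lat_min, lat_max, long_min, long_max) in radians."""
    d_lat = limit_miles / MILES_PER_RADIAN
    d_long = d_lat / math.cos(lat_rad)  # widen the box east-west
    return (lat_rad - d_lat, lat_rad + d_lat,
            long_rad - d_long, long_rad + d_long)


lat_min, lat_max, long_min, long_max = bounding_box(
    0.72889590272, -1.33418679246, 50
)
```

The four values drop straight into the same kind of WHERE clause; at the latitude used in the thread the longitude range comes out about a third wider than the latitude range.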

Roger - your approach had occurred to me but my dbase has 43,000 zip codes. Aren't we talking about nearly 2 billion records there? I'm dying to try that one just for hell of it.

Sxooter

What I'm talking about is bounding boxes.

Say we've got some exact geometric types to define our zip codes. We can also have maximum boxes that contain them. Say the box that contains a zip code can be defined in miles as x_max=330, x_min=280, y_max=600, y_min=440. The actual zip code is not that big, but its furthest north, south, east and west points lie on those max/min lines.

If you make your where clause check to see if the point you're checking on is near / within that bounding box and then AND it with the more expensive geometric equation, the query executor will short circuit with the cheap x/y coords first. If it's not in there, then it doesn't bother with the expensive geometric function.
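
To make the cheap-test-first idea concrete, here is a small Python sketch. It is not how any database evaluates a WHERE clause (SQL engines are free to reorder conditions), it just shows the logic: a box comparison guards the expensive great-circle call, and a counter shows how rarely the expensive call runs. The box here is drawn around the search point rather than around each zip, and the sample points and radius are invented.

```python
import math

EARTH_RADIUS_MILES = 3959.0  # assumed mean Earth radius
expensive_calls = 0          # counts how often the costly function runs


def great_circle_miles(lat1, lon1, lat2, lon2):
    """Haversine distance; all angles in radians."""
    global expensive_calls
    expensive_calls += 1
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def nearby(points, lat, lon, limit_miles):
    """Cheap box test first; `and` short-circuits before the expensive call."""
    d_lat = limit_miles / EARTH_RADIUS_MILES
    d_lon = d_lat / math.cos(lat)  # rough east-west widening
    return [
        (p_lat, p_lon) for p_lat, p_lon in points
        if abs(p_lat - lat) <= d_lat and abs(p_lon - lon) <= d_lon
        and great_circle_miles(lat, lon, p_lat, p_lon) <= limit_miles
    ]


# Invented sample points (lat, long in radians): two close by, two far away.
points = [(0.7290, -1.3340), (0.7350, -1.3300), (0.9000, -1.3340), (0.7290, -1.0000)]
hits = nearby(points, 0.72889590272, -1.33418679246, 50)
```

With four candidate points, only the two that pass the box test ever reach the distance function.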

Sxooter

Let me just say I'm not explaining this all that well, cause I'm used to using a database that does most of the math pretty easily. i.e. in pgsql I can define a polygon and ask the db if a point is inside it. I can ask a lot of other geometric questions as well. And none of them require me to use where clauses with 18 terms to determine if the point is inside / outside / near to / far from a polygon.

Sxooter

PS, take a look at the geometric operators for pgsql to know what I'm talking about here:

http://www.postgresql.org/docs/8.2/interactive/functions-geometry.html

sneakyimp

Sxooter wrote:
Let me just say I'm not explaining this all that well, cause I'm used to using a database that does most of the math pretty easily. i.e. in pgsql I can define a polygon and ask the db if a point is inside it. I can ask a lot of other geometric questions as well. And none of them require me to use where clauses with 18 terms to determine if the point is inside / outside / near to / far from a polygon.

The bounding box concept i get...pretty straightforward 2d euclidean geometry.

It's more the database aspect I don't understand as well, although I do understand that PostgreSQL offers the geometric functions you linked and MySQL offers the spatial extensions Roger mentioned.

In my reading about query optimization, I understand it was a good idea to avoid performing a function operation on any particular data field because this would force an entire tablescan. I'm wondering if these geometric/spatial functions are 'index aware' - meaning that they would take advantage of an index rather than computing some value on every record in the database for comparison.
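
The general point about functions on indexed columns can be seen with a tiny experiment. The sketch below uses Python with SQLite as a stand-in, so it says nothing about how MySQL's spatial functions or PostgreSQL's geometric operators behave; it only shows that wrapping an indexed column in a function can turn an index search into a full scan. The table layout mirrors the thread, the index name is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test_zip_assoc "
    "(id INTEGER PRIMARY KEY, f_id INTEGER, lat_radians REAL)"
)
conn.execute("CREATE INDEX idx_lat ON test_zip_assoc (lat_radians)")


def plan(sql):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]


# Bare column compared against constants: the index can be used.
plain = plan(
    "SELECT f_id FROM test_zip_assoc WHERE lat_radians BETWEEN 0.7 AND 0.8"
)
# Column wrapped in a function: the plain index no longer applies.
wrapped = plan(
    "SELECT f_id FROM test_zip_assoc WHERE abs(lat_radians) BETWEEN 0.7 AND 0.8"
)
```

Printing `plain` and `wrapped` shows a SEARCH using `idx_lat` for the first query and a SCAN of the whole table for the second. In MySQL the equivalent check would be running EXPLAIN on the real queries.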

Sxooter

If you figure out the max box size, you can index those 4 digits and then use them in your where clause first, and then AND them with the geometric type.

One of the advantages of pgsql here is that you can use the newer gist indexes and index the functions themselves. i.e. something like:

create index zip_bound_box on zip_table using gist (box(polygon_field)); -- zip_table is a placeholder table name

and from then on references in a where clause to a box(polygon_field) will be able to hit an index. box(polygon), btw, returns the bounding box for a polygon.

functional indexes are very useful for gis stuff.

Roger_Ramjet

Hey guys, as I said in my earlier post, mysql has spatial extensions and some of the functions for geospatial queries. Not as many or as full as pgsql but enough for this problem since it only requires the Minimum Bounding Rectangle (MBR). Depending on your version you will find major differences in what is available; ver 5.1 is getting there now.

Now, I understood this to be a matter of local searches - find a gig within 20 miles of my zip code - and my solution is only offered on that basis. I don't know how far a zip code extends cos I'm in the uk, but I assumed that a 50 mile radius could only encompass less than 50 even in a dense urban area. Out in the sticks then yes you would probably want to expand the radius cos people expect to travel further anyway.

One thing that bugs me about this sort of subject is, I googled for info on spatial representations of zip codes and found sweet FA. Now, there are all sorts of fed projects but no actual data set. Maybe you can find one in which case go with that.

As Sxooter says, the spatial data types exist and one can only assume that the indexes and functions have been designed and optimised for decent performance.

Sxooter

And you don't necessarily need the geo types. if you can figure out the bounding box, you can just use that to narrow it down enough that you only have a handful of values the query planner has to do any serious math on.

Note that without specialized indexes, I wouldn't expect MySQL to be able to index geospatial types in a useful manner. But if they've added GIS type indexing, that would be very cool. Is 5.1 released yet? Or is it still considered beta?

Roger_Ramjet

5.1 is still Beta, but 5.2 is already in Alpha so it won't be long. They've used the OpenGIS geometry model so 5.0 has spatial indexes, indeed 4.1 had them. Certainly 4.1 was very patchy on GIS and limited, but they saw the light and implemented most of the standard model in 5.
I've not actually had any excuse to try them out, or any of the GIS stuff. Yet another thing I don't seem to find the time for.

sneakyimp

So I created this table:

CREATE TABLE `test_zip_assoc` (
  `id` int(12) unsigned NOT NULL auto_increment,
  `f_id` int(12) unsigned NOT NULL default '0',
  `zip` varchar(5) NOT NULL default '',
  `lat_radians` decimal(12,11) NOT NULL default '0.00000000000',
  `long_radians` decimal(12,11) NOT NULL default '0.00000000000',
  PRIMARY KEY  (`id`),
  KEY `lat_radians` (`lat_radians`,`long_radians`)
) TYPE=MyISAM AUTO_INCREMENT=10000001 ;

and I populated it with records. Sadly, performance is way way too slow once we pass about a million records with a query like this:

  $zip_lat = 0.72889590272; // target zip code's latitude in radians
  $zip_long = -1.33418679246; // target zip code's long in radians
  $proximity_limit = 50; // choose only zips within this many miles
  $max_dist_radians = $proximity_limit/MILES_PER_RADIAN; // convert miles to radians 
  $sql = "SELECT id FROM test_zip_assoc WHERE
	lat_radians > " . ($zip_lat - $max_dist_radians) . " AND
	lat_radians < " . ($zip_lat + $max_dist_radians) . " AND
	long_radians > " . ($zip_long - $max_dist_radians) . " AND
	long_radians < " . ($zip_long + $max_dist_radians);

I set up a script to run about 20 iterations of that query using randomly selected zip codes. it works pretty well up to about 100,000 records but anything higher is too slow. With 10 million records in the table, the query averages 1600 seconds.

It's worth noting that the table with 1 mil is about 48 MB with 42MB index file. The 10 mil table has 480MB data file with 492MB index file.

I had the zip field in there for a sanity check...it was only used for the output results. I just noticed the lat and long are in a combined index. I'll do a test now with the lat and long indexed independently and the zip field removed so all fields are fixed-length. I think I'll also shorten the lat/long fields a bit to shrink the table further.

sneakyimp

fyi, this dev server is CRAWLING with these large tables. Simply removing the combined long/lat index from the table above took nearly 300 seconds. Adding the lat index took 380 seconds:

ALTER TABLE test_zip_assoc ADD INDEX ( lat_radians )

this took nearly 600 secs:
REPAIR TABLE test_zip_assoc

sneakyimp

Ok before I even get into any geospatial calculations or extra libraries, I think I am bumping up against a pretty fundamental wall even with the simplest approach. My performance is about ten times faster now that I removed the zip field so that all i have is 4 fixed-length fields: id, f_id, longitude, and latitude:

CREATE TABLE `test_zip_assoc` (
  `id` int(12) unsigned NOT NULL auto_increment,
  `f_id` int(12) unsigned NOT NULL default '0',
  `lat_radians` decimal(6,5) NOT NULL default '0.00000',
  `long_radians` decimal(6,5) NOT NULL default '0.00000',
  PRIMARY KEY  (`id`),
  KEY `lat_radians` (`lat_radians`),
  KEY `long_radians` (`long_radians`)
) TYPE=MyISAM;

I also reduced longitude and latitude to only 6 significant digits (roughly 1/25th of a mile).

I set up a script to choose a zip code at random and then randomly select 5 f_ids within 50 miles of it and then repeat 50 times. The time required for the main query is about 1000 times faster than what I was originally trying to do:

  $sql = "SELECT id FROM test_zip_assoc WHERE
	lat_radians > " . ($zip_lat - $max_dist_radians) . " AND
	lat_radians < " . ($zip_lat + $max_dist_radians) . " AND
	long_radians > " . ($zip_long - $max_dist_radians) . " AND
	long_radians < " . ($zip_long + $max_dist_radians);

HOWEVER It starts to become too slow once 10 million records are in the database. This is very encouraging but I still need to do better. I suspect that the slowness might be related to the fact that the number of records returned from that query grows to around 50,000 when there are 10 million records in the db and so the large result set being copied from mysql to php could be my problem. On the other hand, fifty thousand 12-digit numbers doesn't sound like a hell of a lot of memory to me. The problem might be the growing size of the database and indexes.
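
If the large result set really is part of the problem, one thing that could be tried is letting the database do the random pick so that only 5 ids cross over to the script. The sketch below shows the idea in Python against an in-memory SQLite table filled with random made-up coordinates; it is an untested guess as far as MySQL performance goes, since the server still has to find and shuffle every row in the box. In MySQL the function would be RAND() rather than RANDOM().

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE test_zip_assoc (
           id INTEGER PRIMARY KEY, f_id INTEGER NOT NULL,
           lat_radians REAL NOT NULL, long_radians REAL NOT NULL)"""
)
rng = random.Random(42)  # fixed seed so the stand-in data is repeatable
conn.executemany(
    "INSERT INTO test_zip_assoc (f_id, lat_radians, long_radians) VALUES (?, ?, ?)",
    [(i, rng.uniform(0.6, 0.9), rng.uniform(-1.5, -1.2)) for i in range(10000)],
)


def random_ids_in_box(conn, lat_min, lat_max, long_min, long_max, how_many=5):
    """Filter by bounding box and pick the random sample inside the database."""
    sql = """SELECT id FROM test_zip_assoc
             WHERE lat_radians BETWEEN ? AND ?
               AND long_radians BETWEEN ? AND ?
             ORDER BY RANDOM() LIMIT ?"""
    args = (lat_min, lat_max, long_min, long_max, how_many)
    return [row[0] for row in conn.execute(sql, args)]


picked = random_ids_in_box(conn, 0.70, 0.75, -1.35, -1.30)
```

Timing this against the fetch-everything version on the real table would show whether the transfer or the index lookup is the actual bottleneck.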

I've attached some graphs and data u might find interesting.

Can anyone suggest further improvements? I've managed a many-fold increase in speed over my original approach but I want more more more. I'm starting to wonder if mysql configuration stuff (like memory limits and such) might be tweaked to get better response with growing database.