selecting 5 random events within X miles

sneakyimp

i'm working on an event listing site. Users choose a zip/postal code and we can show them what's happening. i have a db with about 50,000 zips in it. On the event listing page i want to list at random 5 events within a radius of N miles.

This might seem easy enough...I can associate each event with a zip code, i have the latitude and longitude of each zip code, i can do a great circle distance calculation to find the distance between the current $zip and the event.

I'm starting to wonder if this will scale though. With these assumptions:

50,000 zip codes
5 events per hour all day on average per zip code (more in NYC, less in Bald Knob, AR)
24 hours a day
30 days a month
3 months worth of listings active at any given time

that's 540,000,000 events that would need to be searched in the database! that's probably not a big deal if things are indexed BUT if i'm going to be doing a great circle distance calculation on ALL those records i could have one really SLOW QUERY on my hands. the logic might look something like this:

// 1 - find a complete list of ids that are within $x miles

$x = 100;
// DISTANCE CALCULATIONS
define('EARTH_CIRC_MILES', 24901.55);
define('EARTH_CIRC_KM', 40075.16);
define('MILES_PER_RADIAN', EARTH_CIRC_MILES/deg2rad(360));
define('KM_PER_RADIAN', EARTH_CIRC_KM/deg2rad(360));

$my_latitude_radians = 0.59533458956;
$my_longitude_radians = -2.06592416764;
$sql = " SELECT e.id
        FROM events e, zips z WHERE
        e.zip = z.zip
        AND (" . MILES_PER_RADIAN . 
"*(2*asin(sqrt(POWER(sin((" . $my_latitude_radians . " -z.lat_rad)/2),2) + cos(" . $my_latitude_radians . ")*cos(z.lat_rad)*POWER(sin((" . $my_longitude_radians . " - z.long_rad)/2),2)))) < " . $x . ")";
$result = mysql_query($sql)
  or die('damn query failed.');

// 2 - put them in an array

$ids = array();
while($row = mysql_fetch_assoc($result)) {
  $ids[] = $row['id'];
}
mysql_free_result($result);

// 3 - pick five at random...this could be thousands of ids!  millions?  who knows?  Might take a lot of RAM
if (count($ids) < 6) {
  // take what we can find
  $random_event_ids = $ids;
} else {
  // pick 5
  $total_ids = count($ids);
  $random_event_ids = array();
  do {
    $rnd = rand(0, ($total_ids-1));
    $random_eid = $ids[$rnd];
    if (!in_array($random_eid, $random_event_ids)) {
      $random_event_ids[] = $random_eid;
    }
  } while (count($random_event_ids) < 5);

}

// 4 - run another query to get the event details
$sql = "SELECT * FROM events WHERE id in(" . implode(',', $random_event_ids) . ")";
//etc
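
As a side note, the pick-five loop in step 3 could be sketched more compactly with PHP's built-in array_rand(). This is only a sketch: pick_random_ids() is a made-up helper name, and it still needs the full id list in memory, so it does nothing for the RAM worry.

```php
<?php
// Hypothetical helper: pick up to $k distinct ids at random from $ids.
function pick_random_ids(array $ids, int $k = 5): array
{
    $ids = array_values(array_unique($ids));
    if ($k < 1) {
        return array();
    }
    if (count($ids) <= $k) {
        return $ids; // take what we can find
    }
    // array_rand() returns $k distinct keys; (array) covers the $k == 1 case,
    // where it returns a single key instead of an array.
    $keys = (array) array_rand($ids, $k);
    $picked = array();
    foreach ($keys as $key) {
        $picked[] = $ids[$key];
    }
    return $picked;
}

// Stand-in data: pretend the distance query returned ids 1..1000.
$random_event_ids = pick_random_ids(range(1, 1000), 5);
```

array_rand() never returns the same key twice, so the in_array() retry loop is not needed.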

As you might imagine, doing that great circle distance calculation 500 million times for every user who visits that page might not work so well.

Might there be some more efficient way to do this? I considered trying to pre-calculate the distances between any two zips...my stats are a little rusty but that sounds like:
n= 50,000 zips
r = 2 zips

you can't repeat a zip (and order doesn't matter) so it would be:
n! / (r! * (n-r)!)

(50,000!) / ((2!) * (49,998!))

which is

(50,000*49,999)/2

which is

1,249,975,000

a billion and a quarter records in the zip to zip distance table. That would make for one nasty join.
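
For anyone who wants to check that arithmetic, here is a small sketch. zip_pair_count() is a made-up name, and it assumes 64-bit PHP, since the intermediate product does not fit in a 32-bit integer.

```php
<?php
// Number of unordered pairs of distinct items: n choose 2 = n * (n - 1) / 2.
// Assumes 64-bit integers (50,000 * 49,999 is about 2.5 billion).
function zip_pair_count(int $n): int
{
    return intdiv($n * ($n - 1), 2);
}

echo number_format(zip_pair_count(50000)), "\n"; // 1,249,975,000
```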

On the other hand, it might be possible to dramatically reduce the number of zips in the zip/zip distance table by weeding out any records for zips more than 100 miles apart. i don't expect to offer searches for anything beyond 100 miles.

Thoughts?

mjax

You could select them randomly from the SQL statement by adding this to the end of the SQL:

ORDER BY RAND() LIMIT 0,5

sneakyimp

That sounds really useful. I'm wondering how the ORDER BY RAND() gets implemented...would i still need to run the distance calc on the entire record set?

Weedpacket

An observation: you can assume the Earth is flat. Not only is the distance only going to be approximate anyway (zip code regions are not geometric points), but your users aren't likely to be making the trip in a geometrically straight line (or great circle, for that matter). So inaccuracies from using a Euclidean metric aren't going to mean the end of the world. You can be approximate.
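
A rough way to see how little is lost: the sketch below compares the great circle formula from the first post with a flat approximation for two points about 50 miles apart. The flat version scales the longitude difference by the cosine of the mean latitude, which is an assumption on my part rather than something spelled out above. The function names are made up, and all inputs are in radians.

```php
<?php
define('MILES_PER_RADIAN', 24901.55 / (2 * M_PI)); // same value as in the first post

// Great-circle (haversine) distance in miles; all inputs in radians.
function great_circle_miles(float $lat1, float $lon1, float $lat2, float $lon2): float
{
    $a = pow(sin(($lat1 - $lat2) / 2), 2)
       + cos($lat1) * cos($lat2) * pow(sin(($lon1 - $lon2) / 2), 2);
    return MILES_PER_RADIAN * 2 * asin(sqrt($a));
}

// Flat-earth approximation: scale the longitude difference by
// cos(mean latitude), then apply Pythagoras.
function flat_miles(float $lat1, float $lon1, float $lat2, float $lon2): float
{
    $x = ($lon1 - $lon2) * cos(($lat1 + $lat2) / 2);
    $y = $lat1 - $lat2;
    return MILES_PER_RADIAN * sqrt($x * $x + $y * $y);
}

// The sample point from the first post, and a second point 0.01 rad away
// in both latitude and longitude.
$lat = 0.59533458956;
$lon = -2.06592416764;
$exact = great_circle_miles($lat, $lon, $lat + 0.01, $lon + 0.01);
$flat  = flat_miles($lat, $lon, $lat + 0.01, $lon + 0.01);
printf("great circle: %.2f miles, flat: %.2f miles\n", $exact, $flat);
```

At this range the two results should agree to well within a tenth of a mile, far less than the uncertainty from treating a zip code as a single point.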

Plasma

Weedpacket brings up some interesting points.

The query you pasted looks quite expensive in terms of doing it often (if you are planning to).

You could maybe use the basic lat/long of the points and the simple distance formula between the two points to locate any relevant area - and just restrict the searching to X km/units from each point.

You may also want to look into summary tables, where you pre-calculate some data and reference those instead of your raw information.

Also:

// 3 pick five at random...this could be thousands of ids! millions? who knows? Might take a lot of RAM

Be sure you apply the LIMIT keyword to your query - if you're only ever going to be showing at most 5 results, limit the query to five!

This will allow the query server to end the query when it has enough results.

As previously suggested, you can use the ORDER BY rand technique to randomize the returned results.

Additionally, you can't really index 540,000,000 records when you're performing those calculations - I would suspect, even with indexing, your query will take long enough for the user to notice and get annoyed with waiting and just leave.

That is my own assumption though, do you happen to have a table already?

You should compile a test table and benchmark your code against it - create a script to inject some (valid) test data in there so you have something to work with!

Weedpacket

Plasma wrote:
You could maybe use the basic lat/long of the points and the simple distance formula between the two points to locate any relevant area - and just restrict the searching to X km/units from each point.

A precondition for (x1,y1) being within n miles of (x2,y2) is that the difference between x1 and x2 cannot be more than n miles itself. Ditto for y1 and y2. This adds additional tests that can be made to filter the potential results.
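
That precondition amounts to a bounding box around the search point. A minimal sketch, assuming coordinates stored in radians as in the first post: bounding_box() is a made-up helper, and it ignores the poles and the 180-degree meridian. The limits are written as plain BETWEEN ranges on the columns, a form that an index on those columns could in principle serve.

```php
<?php
define('MILES_PER_RADIAN', 24901.55 / (2 * M_PI));

// Hypothetical helper: latitude/longitude limits (radians) of a box that
// contains every point within $miles of the centre. Not valid near the
// poles or across the 180-degree meridian.
function bounding_box(float $lat, float $lon, float $miles): array
{
    $dlat = $miles / MILES_PER_RADIAN;
    // Longitude lines converge towards the poles, so size the box using the
    // edge furthest from the equator; that keeps the limit wide enough.
    $dlon = $dlat / cos(min(abs($lat) + $dlat, M_PI / 2 - 0.001));
    return array(
        'min_lat' => $lat - $dlat, 'max_lat' => $lat + $dlat,
        'min_lon' => $lon - $dlon, 'max_lon' => $lon + $dlon,
    );
}

$box = bounding_box(0.59533458956, -2.06592416764, 50);
$where = sprintf(
    'z.lat_rad BETWEEN %.8F AND %.8F AND z.long_rad BETWEEN %.8F AND %.8F',
    $box['min_lat'], $box['max_lat'], $box['min_lon'], $box['max_lon']
);
```

Anything outside the box is definitely too far away; anything inside still needs the real distance check, because the corners of the box lie further out than the radius.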

sneakyimp

THANKS guys

Weedpacket wrote:
An observation: you can assume the Earth is flat. Not only is the distance only going to be approximate anyway (zip code regions are not geometric points), but your users aren't likely to be making the trip in a geometrically straight line (or great circle, for that matter). So inaccuracies from using a Euclidean metric aren't going to mean the end of the world. You can be approximate.

That's a good point. I was originally doing the calc with simple euclidean distance calculation but then someone made me paranoid by pointing out that miles per degree of longitude varies with your latitude. My minimum latitude is 0.32 rads in Puerto Rico (65 miles per deg. of long) and maximum is 1.23 rads (22.7 miles per deg. of long) in Alaska. Seems to me an approximate euclidean calc might be good enough...at least good enough to limit the query return for a more accurate calc.

Plasma wrote:
You may also want to look into summary tables, where you pre-calculate some data and reference those instead of your raw information.

That's more what i was thinking but i'm kind of at a loss as to where to start. Most important to me seems to be avoiding either kind of distance calc on the entire db. That's why i inquired about how ORDER BY RAND might get implemented? Does it pick records at random before checking the WHERE clause? Or does it perform all the nasty calculations first and then pick a resulting item? In other words, does order by RAND really help?

i'm also worried about RAND bugs and performance. Some of the posts I've seen suggest ORDER BY RAND() doesn't always work. I'm using MySQL V 4.0.27 so I'm thinking i'm safe. On the other hand, some of the workarounds I've seen look like the code I came up with:

http://bugs.mysql.com/bug.php?id=817
http://www.greggdev.com/web/articles.php?id=6

Weedpacket

sneakyimp wrote:
miles per degree of longitude varies with your latitude.

That's very true; when you make a plot of it it becomes obvious (in the extreme case, the pole stretches out to become as long as the equatorial circumference) even at middling latitudes (the end closer towards the poles ends up looking squished).

Rather than store latitudes and longitudes (which won't be of much help unless your visitors want GPS coordinates, and the coordinates you have won't be any use, being only of (I'm guessing) the central post office), you could convert them to northings/eastings.

How you go about that is fairly dull, but straightforward and you only need to do it once 🙂. I found this which at a glance looks like it will do you well enough (especially since you're looking for distances between two points, rather than an actual position).

sneakyimp

right on. eastings...northings. leave it to the military to come up with a practical simplification. i've been reading those links about UTM.

If I understand correctly that would seem to reduce this nasty bit:

$sql = " SELECT e.id 
        FROM events e, zips z WHERE 
        e.zip = z.zip 
        AND (" . MILES_PER_RADIAN . 
"*(2*asin(sqrt(POWER(sin((" . $my_latitude_radians . " -z.lat_rad)/2),2) + cos(" . $my_latitude_radians . ")*cos(z.lat_rad)*POWER(sin((" . $my_longitude_radians . " - z.long_rad)/2),2)))) < " . $x . ")";

to the considerably less nasty:

$sql = " SELECT e.id 
        FROM events e, zips z WHERE 
        e.zip = z.zip 
        AND (sqrt(POWER((" . $my_easting . " - z.easting), 2) + POWER((" . $my_northing . " - z.northing), 2)) < $distance_factor)";

and we would not have to compromise too much accuracy.

On the other hand, we still have the issue of possibly running the calculation on every single event record in my database to determine which are within a certain distance, right? I'm really wondering if the ORDER BY RAND() combined with a LIMIT 0, 5 is really going to reduce the amount of time spent on this query. Seems to me that MySQL would need to first calculate the distances for all the records before knowing which records would be included so that it could then proceed to order them. Can anyone point me to implementation details of ORDER BY RAND and LIMIT?
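
To make that worry concrete, here is a conceptual model in plain PHP of what WHERE ... ORDER BY RAND() LIMIT has to do when no index narrows the WHERE clause. It is not MySQL's actual code, and order_by_rand_limit() is a made-up name. The point is only that the sort needs every matching row, so the predicate has to run on every row before LIMIT can throw anything away.

```php
<?php
// Conceptual model only (not MySQL's actual code) of
//   SELECT id FROM t WHERE <predicate> ORDER BY RAND() LIMIT $limit
// when no index narrows the WHERE clause.
function order_by_rand_limit(array $rows, callable $predicate, int $limit): array
{
    $keyed = array();
    foreach ($rows as $row) {
        if ($predicate($row)) {                // WHERE runs for every row...
            $keyed[] = array(mt_rand(), $row); // ...each survivor gets a random sort key
        }
    }
    // ORDER BY RAND(): sort all the survivors by their random key.
    usort($keyed, function ($a, $b) {
        return $a[0] <=> $b[0];
    });
    // LIMIT is applied last, after the sort.
    return array_column(array_slice($keyed, 0, $limit), 1);
}

$evaluations = 0;
$rows = range(1, 10000); // stand-in for event ids
$picked = order_by_rand_limit($rows, function ($id) use (&$evaluations) {
    $evaluations++;       // stands in for the expensive distance calculation
    return $id % 2 === 0; // arbitrary filter for the demonstration
}, 5);
echo "$evaluations predicate evaluations for ", count($picked), " rows returned\n";
```

In this model the LIMIT only shrinks what is returned; the number of predicate evaluations stays equal to the number of rows.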

In the 2nd article I linked, the author reported this:

For most purposes on smaller database tables, the following will work fine:

$random_row = mysql_fetch_row(mysql_query("select * from YOUR_TABLE order by rand() limit 1"));

$random_row will be an array containing the data extracted from the random row. However, when the table is large (over about 10,000 rows) this method of selecting a random row becomes increasingly slow with the size of the table and can create a great load on the server. I tested this on a table I was working that contained 2,394,968 rows. It took 717 seconds (12 minutes!) to return a random row.

2.4 million records is considerably less than 540 million. Seems to me I need something to reduce the number of records under consideration before even attempting a random row type thing.

I really like weedpacket's point here:

A precondition for (x1,y1) being within n miles of (x2,y2) is that the difference between x1 and x2 cannot be more than n miles itself. Ditto for y1 and y2. This adds additional tests that can be made to filter the potential results.

If I could first use simple comparisons like that--or better yet check indexed values in my event table to weed out events before the distance calculation gets applied, that might speed things up much more dramatically.

for instance:

$max_distance = 50; // find everything within 50 miles
$max_longitude_difference = $max_distance / MILES_PER_LONGITUDE_RADIAN; // use some safe value for MPLR
$max_latitude_difference = $max_distance / MILES_PER_LATITUDE_RADIAN; // this should be constant

$sql = " SELECT e.id 
        FROM events e, zips z
        WHERE e.zip = z.zip
        AND index_field < 5
        AND (ABS(" . $my_longitude_radians . " - z.long_rad) < " . $max_longitude_difference . ")
        AND (ABS(" . $my_latitude_radians . " - z.lat_rad) < " . $max_latitude_difference . ")
        AND (" . MILES_PER_RADIAN . 
"*(2*asin(sqrt(POWER(sin((" . $my_latitude_radians . " -z.lat_rad)/2),2) + cos(" . $my_latitude_radians . ")*cos(z.lat_rad)*POWER(sin((" . $my_longitude_radians . " - z.long_rad)/2),2)))) < " . $max_distance . ")
        ORDER BY RAND()
        LIMIT 0,5";

Would those extra checks before the nasty calculation cause mysql to skip the calculation for records that are obviously out of bounds?

There seem to be several components to the performance problem here. One is reducing the nastiness of the calculations being done. Hopefully this can be done by checking easier calculations as in that last bit of code i just typed. A second is to use ORDER BY RAND() and LIMIT to hopefully skip the calculation on all the records. I'm skeptical that approach will actually limit the amount of calculations that MySQL will perform. A third approach is summary tables of some kind.

I have no experience at all with temporary tables, but it had occurred to me that when a user selects a zip code Z and radius R, I could create a temporary table that calculated the distance to all zip codes within R miles of Z. If they choose no radius, then no distance calculation is necessary.

dschreck

just cache them....

ie:

get your 5 results. put them into a different db. group them by numbers or sets if you will..
If someone puts in a zip, you search the cache DB for that zip, if it comes back with something, grab that group set. if there's no set for that zip yet, create one.

then you could do the " ORDER BY RAND() LIMIT 0,5 " once you have a cache db.

kthx.

sneakyimp

dschreck wrote:
just cache them....

Sadly, the client has indicated that the randomly displayed events should change with each page access so I can't cache the returned events.

However, I have considered caching another type of data. When a user selects a zip code and radius for events to view, I could create a table containing all zip codes within that radius.

Then, when finding events, i could find the appropriate zip/radius table and just do a join on the zips in there or select all the zips in the zip/radius table and create a WHERE ZIP IN (ZIP1, ZIP2, ZIP3) clause in my query that would limit the event results automatically by distance.
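
A minimal sketch of the WHERE ZIP IN (...) variant: the distance calculation runs once per zip code (50,000 rows at most) instead of once per event, and the event query is then a plain lookup on e.zip. zips_within() is a made-up helper, and the coordinates below are invented purely for illustration.

```php
<?php
define('MILES_PER_RADIAN', 24901.55 / (2 * M_PI));

// $zips maps zip code => array(lat_rad, long_rad); in practice this would
// be loaded from the zips table. Returns the zips within $miles of the centre.
function zips_within(array $zips, float $lat, float $lon, float $miles): array
{
    $found = array();
    foreach ($zips as $zip => $pos) {
        list($zlat, $zlon) = $pos;
        $a = pow(sin(($lat - $zlat) / 2), 2)
           + cos($lat) * cos($zlat) * pow(sin(($lon - $zlon) / 2), 2);
        if (MILES_PER_RADIAN * 2 * asin(sqrt($a)) < $miles) {
            $found[] = (string) $zip; // numeric array keys come back as ints
        }
    }
    return $found;
}

// Made-up coordinates, for illustration only.
$zips = array(
    '90026' => array(0.5953, -2.0659),
    '90027' => array(0.5960, -2.0650),
    '10001' => array(0.7112, -1.2915),
);
$nearby = zips_within($zips, 0.59533458956, -2.06592416764, 20);

// Only all-digit zips are kept, so quoting them directly is safe here.
$in  = "'" . implode("','", array_filter($nearby, 'ctype_digit')) . "'";
$sql = "SELECT e.id FROM events e WHERE e.zip IN ($in)";
```

With an index on events.zip, the database never has to evaluate a distance formula against the events table at all.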

EDIT: I have never once in my life created a temporary table in a MySQL database. The ones I have seen have all had totally random names. How might I keep track of which temporary table belongs to zip code 90026 and radius 20 miles?
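
One possible way around the naming question, offered as a sketch rather than a recommendation: MySQL TEMPORARY tables are visible only to the connection that created them and are dropped when it closes, so they would not survive from one page request to the next anyway. A single ordinary table keyed by origin zip and radius can hold every cached zip/radius list. The table name zip_radius, its columns, the CHAR(5) zip format and the helper nearby_events_sql() are all assumptions made up for this example.

```php
<?php
// Hypothetical schema: one permanent table instead of one temporary table
// per search. Assumes 5-digit US zip codes.
$create_sql = "CREATE TABLE IF NOT EXISTS zip_radius (
    origin_zip   CHAR(5) NOT NULL,
    radius_miles SMALLINT UNSIGNED NOT NULL,
    nearby_zip   CHAR(5) NOT NULL,
    PRIMARY KEY (origin_zip, radius_miles, nearby_zip)
)";

// Hypothetical helper: builds the query that finds events near $zip, assuming
// the rows for ($zip, $radius) have already been inserted into zip_radius.
function nearby_events_sql(string $zip, int $radius): string
{
    if (!ctype_digit($zip)) {
        throw new InvalidArgumentException('zip must be numeric');
    }
    return sprintf(
        "SELECT e.id FROM events e, zip_radius r"
        . " WHERE r.origin_zip = '%s' AND r.radius_miles = %d"
        . " AND e.zip = r.nearby_zip",
        $zip,
        $radius
    );
}

echo nearby_events_sql('90026', 20), "\n";
```

Looking up "zip code 90026 and radius 20 miles" then becomes a WHERE condition on the primary key rather than a question of finding the right table name.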