Hi, I am in need of some help. I'm currently working on some code that allows me to determine the zip codes within a certain radius of a user-given zip code. In my DB I have close to 4000 records of zip codes and their city, state, longitude, and latitude.

My problem is that I am having to query the DB for each zip code and then run the function below to determine its proximity to the given zip code. Obviously, this takes quite some time, and it eventually gives me the maximum execution timeout error. I don't want to increase the execution time limit, so I am hoping someone out there is smart enough to provide a solution. I tried limiting the search to a single state, but that is not optimal if someone provides a zip code close to a state border.

The function I am using to determine the number of miles between two zip codes is:

function distance($lat1, $lon1, $lat2, $lon2, $unit = 'M')
{
  $theta = $lon1 - $lon2;
  $dist = sin(deg2rad($lat1)) * sin(deg2rad($lat2)) + cos(deg2rad($lat1)) * cos(deg2rad($lat2)) * cos(deg2rad($theta));
  $dist = acos($dist);
  $dist = rad2deg($dist);
  $miles = $dist * 60 * 1.1515; // 60 nautical miles per degree of arc, 1.1515 statute miles per nautical mile

  switch (strtoupper($unit)) {
    case 'K': return $miles * 1.609344; // kilometers
    case 'N': return $miles * 0.8684;   // nautical miles
    default:  return $miles;            // statute miles ('M' or anything else)
  }
}
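For anyone wanting to sanity-check the formula outside of PHP, here is the same spherical law of cosines math as a quick standalone sketch (Python, with approximate city coordinates chosen just for illustration):

```python
import math

def distance_miles(lat1, lon1, lat2, lon2):
    """Spherical law of cosines, mirroring the PHP function above."""
    theta = math.radians(lon1 - lon2)
    lat1, lat2 = math.radians(lat1), math.radians(lat2)
    d = math.acos(math.sin(lat1) * math.sin(lat2)
                  + math.cos(lat1) * math.cos(lat2) * math.cos(theta))
    # degrees of arc -> nautical miles -> statute miles
    return math.degrees(d) * 60 * 1.1515

# Rough sanity check: New York to Los Angeles is about 2,450 statute miles.
print(round(distance_miles(40.7128, -74.0060, 34.0522, -118.2437)))
```

The 60 * 1.1515 factor is what converts degrees of arc into statute miles; it implies an earth radius of roughly 3,960 miles.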

Thanks for any help/thoughts you can offer.

    If you're using Postgres or any decent DB, it's probably best to rewrite this as a stored procedure in the database. If the database executes the function, not only will it be a LOT faster (unless it's lousy code, of course!) but it also means (as far as my understanding goes) that PHP won't time out while it waits for the DB to return. You can even write stored procedures in PHP for Postgres: google for PL/php. It will then be as easy to call as AVG() or COUNT().

    HTH
    Melody

      I highly recommend doing a search in this forum (well, this one and the Code one). You'll find ALL sorts of nifty code snippets for doing this. I hate to say it, but you've picked one of the less optimal ways to do it (as you'll note, it works, but takes too long).

      For uber-fast performance, I'd recommend a MySQL C function you can build in (or hopefully your ISP could for you) or you might want to look at PostgreSQL and some of the tricks it can do.

      But there are code snippets out there that do some initial calculations based on the initial zip and then narrow down which zips need to be tested. This isn't the best approach, but it would probably be a bit quicker than what you have right now (you're on the right track though).

        Thanks, I've included my query in case anyone ever needs it:

        $query2k = "SELECT city, state, zipcode, DEGREES(ACOS(SIN(RADIANS($lat))*SIN(RADIANS(latitude)) + COS(RADIANS($lat))*COS(RADIANS(latitude))*COS(RADIANS($long - longitude))))*60*1.1515 AS Distance FROM zipcodes HAVING Distance <= $radius ORDER BY Distance";
        

          Well, I don't think asking MySQL to do the whole calculation is wise; it will definitely still time out. It may be true that MySQL has a compiled calculating module which runs faster, but the more important point is that MySQL needs to calculate every single row before returning you the result. This is unnecessary.

          The trick you can play is like this:

          The longitude and latitude stored in the database are angular measurements, in degrees or radians. Either way, you can translate them into linear distance.

          Take this example as an illustration. Don't read anything into the numerical values, as I don't have real data and the order of magnitude is wrong as well. But I'm sure you see what I mean.

          Let your location be at (5.0 North, 15.0 East).
          Let the earth's radius be 1000 km (for easy calculation).
          Say you want to find all records within a 1 km radius.

          To begin: with a 1000 km radius, an arc of 0.001 radians corresponds to 1 km of distance on the earth's surface.

          So your target will be limited to the square area between (4.999 North and 5.001 North, in the S-N direction) and (14.999 East and 15.001 East, in the E-W direction).

          Then make your SQL select only postal points within this area:

          $query2k = "SELECT city, state, zipcode, latitude, longitude FROM zipcodes WHERE (longitude BETWEEN 14.999 AND 15.001) AND (latitude BETWEEN 4.999 AND 5.001)"

          [Fine-tune the syntax for your database, please.]

          This will return you a finite subset of locations. Then use a PHP function to loop through the result set and eliminate the addresses that are inside the square but outside the inscribed circle.
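          A rough sketch of that square-then-circle filter (Python just for illustration; the record list and the use of the real mean earth radius are my own assumptions rather than the toy numbers above):

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean earth radius (the post used 1000 km only for easy numbers)

def in_radius(center, records, radius_km):
    """Prefilter with a square bounding box, then keep only the points
    inside the inscribed circle using the exact great-circle distance."""
    lat0, lon0 = center
    # Half-width of the bounding box, in degrees of arc.
    # (Strictly, the longitude half-width should be divided by cos(lat0);
    # omitted here for simplicity.)
    delta_deg = math.degrees(radius_km / EARTH_RADIUS_KM)
    box = [(lat, lon) for lat, lon in records
           if abs(lat - lat0) <= delta_deg and abs(lon - lon0) <= delta_deg]
    inside = []
    for lat, lon in box:
        # Spherical law of cosines; min() guards against acos domain errors.
        d = EARTH_RADIUS_KM * math.acos(min(1.0,
            math.sin(math.radians(lat0)) * math.sin(math.radians(lat))
            + math.cos(math.radians(lat0)) * math.cos(math.radians(lat))
            * math.cos(math.radians(lon0 - lon))))
        if d <= radius_km:
            inside.append((lat, lon))
    return inside
```

          The square prefilter is cheap; the exact great-circle test only runs on the few records that survive it.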

          Try this. It should work.

          But please kindly post back the result. I'm just interested to know.

          Fred

            why not rewrite the math so that your function returns a min and a max longitude and latitude that are x distance from the provided longitude and latitude. Then you can just query the db for the long and lat of the provided zip, run your function, and then query the db again for all the longs and lats that fall within those bounds.
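            That two-query flow might look something like this sketch (Python with sqlite3 standing in for the real database; the table and column names are invented for illustration):

```python
import sqlite3

MILES_PER_DEG = 69.0  # roughly 69 statute miles per degree of latitude

def zips_within(conn, zip_code, miles):
    # Query 1: look up the coordinates of the starting zip code.
    row = conn.execute(
        "SELECT latitude, longitude FROM zipcodes WHERE zipcode = ?",
        (zip_code,)).fetchone()
    if row is None:
        return []
    lat, lon = row
    d = miles / MILES_PER_DEG  # min/max half-width in degrees
    # Query 2: everything whose coordinates fall inside the min/max bounds.
    return conn.execute(
        "SELECT zipcode FROM zipcodes"
        " WHERE latitude BETWEEN ? AND ? AND longitude BETWEEN ? AND ?",
        (lat - d, lat + d, lon - d, lon + d)).fetchall()
```

            The box returns slightly more than a true circle would; a final exact-distance pass in PHP (or SQL) can trim the corners.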

              Originally posted by amagrace
              [...] It may be true that MySQL has a compiled calculating module which runs faster, but the more important point is that MySQL needs to calculate every single row before returning you the result. This is unnecessary.
              Fred

              I'm not sure if you're referring to the C code (it's compiled in as a module to MySQL). Have you seen it run? Even if it isn't 100% optimized, it's uber-fast. Check it out:

              http://www.axtime.com/zip/geodist.php

              It doesn't have all the zip codes and the zip codes are not postal code structured. But check out how quick it returns the results.

              If you're not referring to the C code, then yeah, checking each row by hand is very ugly...

                Well, I do not work with MySQL a lot and do not really have any knowledge of C code. But that is not the main point of my post.

                The key point I was trying to make is to limit the scope of the processing the database has to do. The timeout is not due to the complexity of the calculation alone; it is largely multiplied by the unnecessary scope. SQL is easy, but sometimes SQL alone does not provide the best solution.

                  Check this out: http://www.enderswebdev.com/test/zipdistance.php

                   I did all the zip codes within 1000 miles of my zip code and it took about 20 seconds to print it all out, but less than 1 second to process. I did it by moving the math inside the query, like this:

                   <?php
                   function getZipsWithin($zip, $miles) {
                       if (($zipInfo = getZipInfo($zip)) === FALSE)
                           return FALSE;

                       // 57.2958 is degrees per radian; 3963 is the earth's radius in statute miles.
                       $sql  = "SELECT zip_code,\n";
                       $sql .= "ROUND((ACOS((SIN(" . $zipInfo->lattitude . "/57.2958) * SIN(lattitude/57.2958)) +\n";
                       $sql .= "(COS(" . $zipInfo->lattitude . "/57.2958) * COS(lattitude/57.2958) *\n";
                       $sql .= "COS(longitude/57.2958 - " . $zipInfo->longitude . "/57.2958)))) * 3963, 3) AS distance\n";
                       $sql .= "FROM zip_codes\n";
                       // Bounding box: one degree of latitude is roughly 69 statute miles.
                       $sql .= "WHERE (lattitude >= " . $zipInfo->lattitude . " - (" . $miles . "/69))\n";
                       $sql .= "AND (lattitude <= " . $zipInfo->lattitude . " + (" . $miles . "/69))\n";
                       $sql .= "AND (longitude >= " . $zipInfo->longitude . " - (" . $miles . "/69))\n";
                       $sql .= "AND (longitude <= " . $zipInfo->longitude . " + (" . $miles . "/69))\n";
                       // Trim the box's corners so only zips truly within $miles come back.
                       $sql .= "HAVING distance <= " . $miles . "\n";
                       $sql .= "ORDER BY distance";

                       if (($query = mysql_query($sql)) === FALSE)
                           return FALSE;

                       $retval = array();
                       while (($row = mysql_fetch_object($query)) !== FALSE)
                           $retval[] = $row;

                       return $retval;
                   } //end getZipsWithin
                   ?>

                   For the truly curious, here is getZipInfo as well:

                   <?php
                   function getZipInfo($zip) {
                       $sql = "SELECT * FROM zip_codes WHERE zip_code='" . $zip . "'";
                       $query = mysql_query($sql);
                       if (mysql_num_rows($query) < 1)
                           return FALSE;

                       return mysql_fetch_object($query);
                   } //end getZipInfo
                   ?>

                  A couple of sources worth noting:
                  Where I got the zip codes - http://www.census.gov/geo/www/tiger/zip1999.html

                  Where I got the query - http://forums.devshed.com/t119637/s.html (I always knew cold fusion was good for something)

                    P.S. I misspelled latitude in my database as lattitude. You'll have to correct this in the queries in my code.

                      I think you are making this more difficult than it needs to be. Suppose you have all the memory in the world to work with on your HTTP server. What's the simplest solution that is also the fastest?

                      A lookup table.

                      Now of course, it would be dumb to store a huge lookup table of all the proximities in memory. So what do you do? You write a program that takes the longitude and latitude of every combination of zip codes and stores the distance in another database.

                      Then, your web application would simply query on this precalculated data not having to do it at runtime (because its already calculated). This will result in the number of database hits to be kept to a minimum at the cost of having to take up disk space for the database.

                      In other words, don't write the PHP code to do the calculation at run time on the webserver. Eliminate it entirely. Precalculate what you are looking for ONE TIME in a custom separate program and store it in a database.

                      Examples:

                      "Give me the distance in miles from zipcode 123456 and 56789":

                      SELECT DistanceMiles FROM ZipDistances WHERE FromZip = '123456' AND ToZip = '56789'

                      or

                      "Find all zipcodes from zipcode 123456 where the distance is less than 300 miles":

                      SELECT ToZip FROM ZipDistances WHERE FromZip = '123456' AND DistanceMiles <= 300

                      So your database will have all precalculated combinations looking something like this:

                      FromZip ToZip DistanceMiles

                      123456 56789 500
                      123456 23231 300
                      .
                      .
                      etc.

                      There will be less than 16,000,000 of these records in the database (which isn't large at all)

                      Hope this makes sense.
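                      A one-time precalculation program along those lines might be sketched like this (Python with sqlite3 standing in for the real database; the ZipDistances table and column names are invented for illustration):

```python
import math
import sqlite3

def precompute_distances(conn):
    """Fill a FromZip/ToZip/DistanceMiles lookup table from a zipcodes table."""
    conn.execute("CREATE TABLE IF NOT EXISTS ZipDistances "
                 "(FromZip TEXT, ToZip TEXT, DistanceMiles REAL)")
    rows = conn.execute("SELECT zipcode, latitude, longitude FROM zipcodes").fetchall()
    for zip1, lat1, lon1 in rows:
        for zip2, lat2, lon2 in rows:
            if zip1 == zip2:
                continue
            # Spherical law of cosines, the same math as the PHP function
            # earlier in the thread; min() guards against acos domain errors.
            d = math.acos(min(1.0,
                math.sin(math.radians(lat1)) * math.sin(math.radians(lat2))
                + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
                * math.cos(math.radians(lon1 - lon2))))
            miles = math.degrees(d) * 60 * 1.1515
            conn.execute("INSERT INTO ZipDistances VALUES (?, ?, ?)",
                         (zip1, zip2, miles))
    conn.commit()
```

                      After this runs once, the distance lookups become plain indexed SELECTs at runtime.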

                        Originally posted by kkobashi

                        There will be less than 16,000,000 of these records in the database (which isn't large at all)

                        16,000,000 x (medium int (3 bytes) x 3 fields) = 144,000,000 bytes of storage, not including indexes.

                        From an upkeep standpoint, this approach is rather ugly. Although from a raw performance standpoint, this should work great.

                        I haven't tried it yet, but I have a really good feeling PostgreSQL could actually do this task way better than MySQL (by using PostgreSQL's user defined functions). In a few weeks I wish to sit down and give it a whirl...

                          The US Census Bureau zip code database from 1999 has 42,192 entries in it. If you're storing the distance from every point A to every point B then you would have 42,192 x 42,192 entries in the database. If you make your program smart enough to realize that the distance from point A to point B is the same as the distance from point B to point A, then you only have to store 42,192 x 42,191 / 2 entries in this lookup table. However, the retrieval logic would be twice as complex.
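                          As a quick arithmetic check of those pair counts (Python, just for the arithmetic):

```python
n = 42192  # entries in the 1999 Census zip code file

ordered_pairs = n * n               # every point A to every point B
unordered_pairs = n * (n - 1) // 2  # A-to-B same as B-to-A, self-pairs excluded

print(ordered_pairs)    # 1,780,164,864
print(unordered_pairs)  # 890,061,336 -- about 890 million
```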

                          This is not a good solution to the problem; what I've posted above is. If you look closely at the WHERE clause you'll see that the math used to decide which zip codes to retrieve is very simple and quick, especially when it's done right in the DBMS.

                          P.S. As a proof of concept that your numbers were way off:
                          select count(*) from zip_codes as a, zip_codes as b returned 1,780,164,864, which is the number of records stored if you store every point A to every point B.

                          After 4:38 minutes the second proof of concept query is still running.

                          P.P.S. After 18:23 minutes of execution on select count(a.zip_code) from zip_codes as a, zip_codes as b where b.zip_code > a.zip_code I decided that proving your stupidity wasn't worth my processor time.

                          BTW: don't be offended just ask Weedpacket how many times he's proved my stupidity.

                            Originally posted by AstroTeg
                            16,000,000 x (medium int (3 bytes) x 3 fields) = 144,000,000 bytes of storage, not including indexes.

                            From an upkeep standpoint, this approach is rather ugly. Although from a raw performance standpoint, this should work great.

                            I haven't tried it yet, but I have a really good feeling PostgreSQL could actually do this task way better than MySQL (by using PostgreSQL's user defined functions). In a few weeks I wish to sit down and give it a whirl...

                            How, from an upkeep standpoint, is this ugly? You are writing the program ONCE. After you get it working, where's the maintenance?
                            It's just disk space, fellas.

                              Originally posted by kkobashi
                              How, from an upkeep standpoint, is this ugly? You are writing the program ONCE. After you get it working, where's the maintenance?
                              It's just disk space, fellas.

                              The USPS zip code database is rereleased every month and almost always contains changes. So you're running this code every month to calculate over 1.7 billion values.

                              Do you have any idea the strain that you'd be putting your db server under to even implement that?

                                Originally posted by drawmack
                                The US Census Bureau zip code database from 1999 has 42,192 entries in it. If you're storing the distance from every point A to every point B then you would have 42,192 x 42,192 entries in the database. If you make your program smart enough to realize that the distance from point A to point B is the same as the distance from point B to point A, then you only have to store 42,192 x 42,191 / 2 entries in this lookup table. However, the retrieval logic would be twice as complex.

                                This is not a good solution to the problem; what I've posted above is. If you look closely at the WHERE clause you'll see that the math used to decide which zip codes to retrieve is very simple and quick, especially when it's done right in the DBMS.

                                P.S. As a proof of concept that your numbers were way off:
                                select count(*) from zip_codes as a, zip_codes as b returned 1,780,164,864, which is the number of records stored if you store every point A to every point B.

                                After 4:38 minutes the second proof of concept query is still running.

                                P.P.S. After 18:23 minutes of execution on select count(a.zip_code) from zip_codes as a, zip_codes as b where b.zip_code > a.zip_code I decided that proving your stupidity wasn't worth my processor time.

                                BTW: don't be offended just ask Weedpacket how many times he's proved my stupidity.

                                This morning I was taking a shower, thinking about GMT times for another problem I was working on. Then it hit me: use a fixed zip code that the other two zip codes are measured relative to (that is, reduce the longitude/latitude problem to miles/km relative to the fixed zip code). Then you could figure out the distance along the remaining side of the triangle connecting the two zip codes with basic trigonometry.

                                Duh. My bad.

                                  Originally posted by kkobashi
                                  This morning I was taking a shower, thinking about GMT times for another problem I was working on. Then it hit me: use a fixed zip code that the other two zip codes are measured relative to (that is, reduce the longitude/latitude problem to miles/km relative to the fixed zip code). Then you could figure out the distance along the remaining side of the triangle connecting the two zip codes with basic trigonometry.

                                  Duh. My bad.

                                  But how is implementing the Pythagorean theorem any less work for the db than what I have done above? If you look at my post that includes code you'll see a where clause like this:

                                  latitude > y - x/69 and latitude < y + x/69 and
                                  longitude > z - x/69 and longitude < z + x/69

                                  y = latitude of the starting zip code
                                  z = longitude of the starting zip code
                                  x = number of miles you want zip codes within
                                  (one degree of latitude is roughly 69 statute miles)

                                  So basically a square is drawn with the starting zip code at the center point and x/69 degrees as the half-width.

                                  The complex and costly math is only done after the DBMS has used these simple comparisons to decide whether a zip code could be within the acceptable range.

                                    Originally posted by drawmack
                                    The USPS zip code database is rereleased every month and almost always contains changes. So you're running this code every month to calculate over 1.7 billion values.

                                    Do you have any idea the strain that you'd be putting your db server under to even implement that?

                                    Like I said earlier, this is pre-calculated. You would generate the result on a separate machine and transfer it over to the production machine (or switch).

                                    I was assuming the poster had only about 4,000 zip codes to work with, not the 42,000+ U.S. zip codes. I knew nothing about which country he was working with.

                                    I simply offered a solution given what he said in his original question that precalculated all the point to point locations in miles, then use the database as a lookup table. The solution was quick and easy to implement, takes up little disk space (150MB is hardly high maintenance), and offers fast performance.

                                    If indeed he is using a larger zip code range, then my solution would not be practical if you don't have the resources to pull it off. That doesn't mean it won't work.

                                    There is always more than one way to skin a cat.

                                      Originally posted by kkobashi
                                      Like I said earlier, this is pre-calculated. You would generate the result on a separate machine and transfer it over to the production machine (or switch).



                                      Okay, but right now you're talking about taking three zip codes and triangulating them. That triangulation needs to be turned into a distance with the Pythagorean theorem. When are you going to do this? If you're going to do it ahead of time for all the zip codes it doesn't make any sense, because it's easier to store one distance (point A to point B) than to store two distances (point A to point C, point B to point C) and then use those two to calculate a third distance, point A to point B.

                                      I was assuming the poster had only a 4000 zip code range to work with, not the 42,000+ U.S. zipcode. I knew nothing about what country he was working with.


                                      Why? I usually assume the worst case scenario, not the best case scenario. If you write code that will work for the 42,000+ entry database, you sure as hell won't have any problems with it on the 4,000 entry database. Here is my code working on the 42,000 entry database: http://www.enderswebdev.com/test/zipdistance.php (of course, since I posted that link before, I'm sure you've already gone there and seen it work). Try the zip 18331 and enter 1000 for the miles. You'll see it's over 2 megs of data and it starts loading in less than three seconds.

                                      I simply offered a solution given what he said in his original question that precalculated all the point to point locations in miles, then use the database as a lookup table. The solution was quick and easy to implement, takes up little disk space (150MB is hardly high maintenance), and offers fast performance.


                                      But it only offers fast performance on small datasets. That's not a good solution.

                                      If indeed he is using a larger zip code range, then my solution would not BE PRACTICAL if you don't have the resources to pull it off. That doesn't say that it wont work.


                                      If your solution has problems with scalability then it is not a solution; it is simply a different problem.

                                      There is always more than one way to skin a cat.



                                      but only one of them minimizes the pain and blood loss and makes the process relatively quick and easy to clean up after.

                                        Didn't I just say that my solution may not be practical depending on his computing environment? It is, however, doable and gets the job done, doesn't it? If you do the numbers, it's several GB of pre-calculated data (based on 42K zip codes and every combination). Again, it is a solution, and one of many.

                                        Now take a deep breath and exhale. Life is too short to get your panties all wedged up. 😃