Need a smart person to solve zip code distance (long,lat) lookup performance issue.

drawmack · Aug 28, 2004

The US cen bureau zip codes database from 1999 has 42,192 entries in it. If you're storing the distance from every point A to every point B then you would have 42,192^42,192 entries in the database. If you make your program smart enough to realize that the distance from point a to point b is the same as the distance from point b to point a then you would only have to store 42,192! entries in this look up table. However the retrieving logic would be twice as complex.

This is not a good solution to the problem, but rather what I've posted above is. If you look closely at the where clause you'll see that the math to decide which zip codes to retrieve is very simple and quick to do expecially when it's done right in the dbms.

P.S. As a proof of concept that you're numbers were way off here:
select count(*) from zip_codes as a, zip_codes as b returned 1,780,164,864 which is the number of records stored if you store from every point a to every point b

After 4:38 minutes the second proof of concept query is still running.

P.P.S. After 18:23 minutes of execution on select count(a.zip_code) from zip_codes as a, zip_codes as b where b.zip_code > a.zip_code I decided that proving your stupidity wasn't worth my processor time.

BTW: don't be offended just ask Weedpacket how many times he's proved my stupidity.

kkobashi · Aug 28, 2004

Originally posted by AstroTeg
16,000,000 x (medium int (3 bytes) x 3 fields) = 144,000,000 bytes of storage, not including indexes.

From an upkeep standpoint, this approach is rather ugly. Although from a raw performance standpoint, this should work great.

I haven't tried it yet, but I have a really good feeling PostgreSQL could actually do this task way better than MySQL (by using PostgreSQL's user defined functions). In a few weeks I wish to sit down and give it a whirl...

How from an upkeep standpoint is this ugly. You are writing a program ONCE . After you get it working, where's the maintenace?
Its just disk space fellas.

drawmack · Aug 28, 2004

Originally posted by kkobashi
How from an upkeep standpoint is this ugly. You are writing a program ONCE . After you get it working, where's the maintenace?
Its just disk space fellas.

The USPS zip code database is rereleased every month and almost always contains changes. So you're running this code every month to calculate over 1.7 billion values.

Do you have any idea the strain that you'd be putting your db server under to even implement that?

kkobashi · Aug 28, 2004

Originally posted by drawmack
The US cen bureau zip codes database from 1999 has 42,192 entries in it. If you're storing the distance from every point A to every point B then you would have 42,192^42,192 entries in the database. If you make your program smart enough to realize that the distance from point a to point b is the same as the distance from point b to point a then you would only have to store 42,192! entries in this look up table. However the retrieving logic would be twice as complex.

This is not a good solution to the problem, but rather what I've posted above is. If you look closely at the where clause you'll see that the math to decide which zip codes to retrieve is very simple and quick to do expecially when it's done right in the dbms.

P.S. As a proof of concept that you're numbers were way off here:
select count(*) from zip_codes as a, zip_codes as b returned 1,780,164,864 which is the number of records stored if you store from every point a to every point b

After 4:38 minutes the second proof of concept query is still running.

P.P.S. After 18:23 minutes of execution on select count(a.zip_code) from zip_codes as a, zip_codes as b where b.zip_code > a.zip_code I decided that proving your stupidity wasn't worth my processor time.

BTW: don't be offended just ask Weedpacket how many times he's proved my stupidity.

This morning I was taking a shower and thinking of GMT times for another problem I was working on. Then it hit me. A fixed zip code where the other two zip codes are relative from (that is, reduce the longitude/latitude problem into miles/km relative from the fixed zip code). And that one could figure out the distance from the other side of the triangle connecting the two zip codes based on trigonometry.

Duh. My bad.

drawmack · Aug 28, 2004

Originally posted by kkobashi
This morning I was taking a shower and thinking of GMT times for another problem I was working on. Then it hit me. A fixed zip code where the other two zip codes are relative from (that is, reduce the longitude/latitude problem into miles/km relative from the fixed zip code). And that one could figure out the distance from the other side of the triangle connecting the two zip codes based on trigonometry.

Duh. My bad.

But how is implementing the pathagorian theorum any less work on the db then what I have done above? If you look at my post that includes code you'll see a where clause like this:

latitude > y - x/111 and latitude < y + x/111 and
longitude > z - x/111 and latitude < z - x/111

y = latitude of the starting zip code
z = longitude of the starting zip code
x = number of miles you want zip codes within

So basically what's being done is an orb is created with the starting zip code as the center point and miles/111 as the radius.

The complex and costly math is only completed after the DBMS has used these simple formulas to decide if this zip code is within the acceptable range of distances.

kkobashi · Aug 29, 2004

Originally posted by drawmack
The USPS zip code database is rereleased every month and almost always contains changes. So you're running this code every month to calculate over 1.7 billion values.

Do you have any idea the strain that you'd be putting your db server under to even implement that?

Like I said earlier, this is pre-calculated. You would generate the result on a separate machine and transfer it over to the production machine (or switch).

I was assuming the poster had only a 4000 zip code range to work with, not the 42,000+ U.S. zipcode. I knew nothing about what country he was working with.

I simply offered a solution given what he said in his original question that precalculated all the point to point locations in miles, then use the database as a lookup table. The solution was quick and easy to implement, takes up little disk space (150MB is hardly high maintenance), and offers fast performance.

If indeed he is using a larger zip code range, then my solution would not BE PRACTICAL if you don't have the resources to pull it off. That doesn't say that it wont work.

There is always more than one way to skin a cat.

drawmack · Aug 29, 2004

Originally posted by kkobashi
Like I said earlier, this is pre-calculated. You would generate the result on a separate machine and transfer it over to the production machine (or switch).

Okay, but right now you're talking about taking three cipcodes and triangulating them. This triangulation needs to be turned into a distance between with the pathagorean theorum. When are you going to do this? If you're going to do it ahead of time for all the zip codes it doesn't make any sense cause it's easier to get one distance (point a to point b) then to get two distances (point a to point c, point b to point c) then use the two of those to calculate a third distance point a to point b.

I was assuming the poster had only a 4000 zip code range to work with, not the 42,000+ U.S. zipcode. I knew nothing about what country he was working with.

Why? I usually assume the worst case scenario not the best case scenario. If you write code that will work for the 42,000+ entry database you sure as hell won't have any problems with it on the 4000 entry database. Here is my code working on the 42,000 entry database: http://www.enderswebdev.com/test/zipdistance.php of course since I posted that link before I'm sure you've already gone there and seen it work. Try the zip 18331 and enter 1000 for the miles. You see it's over 2 megs of data and it starts loading in less then three seconds.

I simply offered a solution given what he said in his original question that precalculated all the point to point locations in miles, then use the database as a lookup table. The solution was quick and easy to implement, takes up little disk space (150MB is hardly high maintenance), and offers fast performance.

But it only offers fast performance on small datasets. That's not a good solution.

If indeed he is using a larger zip code range, then my solution would not BE PRACTICAL if you don't have the resources to pull it off. That doesn't say that it wont work.

If you solution has problems with scalability then it is not a solution it is simply a different problem.

There is always more than one way to skin a cat.

but only one of them minimizes pain and bloodloss to make the process relatively quite and easy to clean up after.

kkobashi · Aug 29, 2004

Didn't I just say that my solution may not be practical depending on his computing environment? It is however doable and gets the job done, doesn't it? If you do the numbers, its several GB of pre-calculated data (based on 42K zipcodes and each combination). Again, it is a solution and one of many.

Now take a deep breathe and exhale. Life is too short to get your panties all wedged up.

AstroTeg · Aug 31, 2004

Originally posted by drawmack
Check this out: http://www.enderswebdev.com/test/zipdistance.php

Drawmack, I was poking around and noticed this odd behavior from your code:

zip 44137 and 44138. I'll admit, these are two towns I've lived in and I have a good idea how far apart these two are.

Using your results, it shows these two zips being 11.715 miles apart. If you do 44137 and a radius of 20 miles, Olmsted Falls (44138) doesn't show up. But if you do 25 or 30 miles, it then appears.

Weedpacket · Aug 31, 2004

Of course you could save a bit of the work by doing all the degree-to-radians conversion and some of the trigonometry during the initial insertion: cos(38°N) isn't going to be changing any time soon, so you might as well just store 0.788011 in the database and use that instead of recalculating it every time.

Maybe if you're prepared to tolerate some amount of error (e.g., "give or take 5km") (seeing has how geographic regions aren't geometric points, this seems like it could be reasonable), it might be possible to project the relevant section of the Earth's surface onto a flat plane in such a way that the old Euclidean metric sqrt((lat1-lat2)^{2+(lon1-lon2)²⁾} could be used to get a "good enough" measure of the distance. Then you'd be storing the projected coordinates.

(If you'd been using PostgreSQL, I'd've cited this but that would probably be too much for your purposes, as all you'd be using would be its distance operator.

kkobashi · Aug 31, 2004

Originally posted by AstroTeg
Drawmack, I was poking around and noticed this odd behavior from your code:

zip 44137 and 44138. I'll admit, these are two towns I've lived in and I have a good idea how far apart these two are.

Using your results, it shows these two zips being 11.715 miles apart. If you do 44137 and a radius of 20 miles, Olmsted Falls (44138) doesn't show up. But if you do 25 or 30 miles, it then appears.

He snagged the code off the Devshed website link and reworked it calling it "his". Whether that code was good or not, who knows. Apparently it isn't according to your analysis AstroTeg?

You know, when you get headstrong, take someone else's crappy code, then scream that you got the Holy Grail only to find out that you don't, you tend to get arrows thrown at your back

AstroTeg · Aug 31, 2004

I'm not interested in bashing one's code - regardless of how it was created. My bigger concern is if one is using funky/buggy code to perform production operations, they may not be aware there is a problem.

Weekpacket: thanks for the link. I'd like to give that code a whirl in the future.

Although yesterday I had fun and converted a C++ latitude/longitude distance finding function for MySQL to a C flavored one for PostgreSQL. I got it to work and can be found in action here. Its doing the ugly "calculate every zip code distance" trick before handing out the results (the difference is that its all done in C code). It seems to go pretty quick...

Manat · Aug 31, 2004

drawmack,

I read your query and it seems pretty straightfoward.
But if I have a table with store names and addresses(and zip code), how do I cross reference those store zip codes with the stores I have in my store table?

Thx

AstroTeg · Aug 31, 2004

Originally posted by Manat
drawmack,

I read your query and it seems pretty straightfoward.

I'd be cautious using that query. It looks like it me have a bug. I haven't had time to figure out the source, but its not bringing back all the results it is supposed to.

drawmack · Aug 31, 2004

Originally posted by AstroTeg
Drawmack, I was poking around and noticed this odd behavior from your code:

zip 44137 and 44138. I'll admit, these are two towns I've lived in and I have a good idea how far apart these two are.

Using your results, it shows these two zips being 11.715 miles apart. If you do 44137 and a radius of 20 miles, Olmsted Falls (44138) doesn't show up. But if you do 25 or 30 miles, it then appears.

I snagged by bd from the census beareu which is not a garenteed acruate source. But for garenteed acuracy you have to pay $20.00 for the data and I didn't feel like paying for something that I was only using as a sample.

drawmack · Aug 31, 2004

Originally posted by kkobashi
He snagged the code off the Devshed website link and reworked it calling it "his". Whether that code was good or not, who knows. Apparently it isn't according to your analysis AstroTeg?

I posted links to the devshed page in my original post and stated that this is where I got the query the php code is mine as the page I got it from was a cold fusion page.

You know, when you get headstrong, take someone else's crappy code, then scream that you got the Holy Grail only to find out that you don't, you tend to get arrows thrown at your back

Of course you know that my system runns off of a db and this is a problem with the data not the code. Of course when you make assumptions and then go posting inflamatory statements based on those assumptions you've got to expect this soemtimes.

drawmack · Aug 31, 2004

Originally posted by Manat
drawmack,

I read your query and it seems pretty straightfoward.
But if I have a table with store names and addresses(and zip code), how do I cross reference those store zip codes with the stores I have in my store table?

Thx

Just feed the store's zip and the distance radius you want in miles to getZipsWithin function.

Take the zips from the array returned and feed those back into your locations table by selecting the stores where zip in (...) and you'll get a list of stores with the matching zips.

AstroTeg · Aug 31, 2004

Originally posted by drawmack
I snagged by bd from the census beareu which is not a garenteed acruate source. But for garenteed acuracy you have to pay $20.00 for the data and I didn't feel like paying for something that I was only using as a sample.

Right. But you missed my point.

Your distance algorithm correctly calculates the distance.

It does not correctly calculate the radius.

Test it.

Enter 44137 as the zip and search in 10 mile, 15 mile, 20, and 25 mile radius.

Look at the results and find when 44138 appears. Also note the calculated distance. Its about 11 miles. Notice when it appears. It should show up in any query over 10 miles. But you have to query with 24 miles for this town to show up.

Sounds like your code has a bug in its radius detection (or its doing too much trimming on the lat/long and removing some records prematurely from the results). Make sense?

AstroTeg · Aug 31, 2004

Originally posted by drawmack
Of course you know that my system runns off of a db and this is a problem with the data not the code. Of course when you make assumptions and then go posting inflamatory statements based on those assumptions you've got to expect this soemtimes.

Actually, re-read my post. Its a problem with your radius calculation(s), not the data.

If there was a data problem, then the mileage would just be wrong. But your mileage appears correct as calculated for your data set. The problem you're dropping valid records which are supposed to be part of the results.

drawmack · Sep 1, 2004

Originally posted by AstroTeg
Enter 44137 as the zip and search in 10 mile, 15 mile, 20, and 25 mile radius.

Look at the results and find when 44138 appears. Also note the calculated distance. Its about 11 miles. Notice when it appears. It should show up in any query over 10 miles. But you have to query with 24 miles for this town to show up.

Sounds like your code has a bug in its radius detection (or its doing too much trimming on the lat/long and removing some records prematurely from the results). Make sense?

okay I got ya now, I'll look into that. I thought you were pointing out a bug I already found that was not a bug in the code but rather faulty data which give out the wrong distance between some towns.

Need a smart person to solve zip code distance (long,lat) lookup performance issue.

Ddrawmack

Kkkobashi

Ddrawmack

Kkkobashi

Ddrawmack

Kkkobashi

Ddrawmack

Kkkobashi

AAstroTeg

Weedpacket

Kkkobashi

AAstroTeg

MManat

AAstroTeg

Ddrawmack

Ddrawmack

Ddrawmack

AAstroTeg

AAstroTeg

Ddrawmack