Hi

I have a tracking table that logs IP addresses and the numerical equivalent of each IP address. I need to run a report that joins the tracking table to a lead conversions table and also looks up the country of each record from its numerical IP address.

I can perform a joined select query and then run a separate lookup query against the country/IP mapping table for each record, but that would result in hundreds of queries per report.

I think a single subquery would solve this database overload, but I cannot get it to work. I'm using MySQL v4.1.21.

The tables involved are tracking, conversions and ipcountry (which holds the ip vs country code data)

Every tracking record must be displayed whether or not an entry can be found in the ipcountry table.

This is the query that isn't working:

SELECT tracking.*, conversions.lead_type, conversions.posted,
(ipcountry.countryshort WHERE tracking.ip_number BETWEEN ipcountry.ipfrom AND ipcountry.ipto as countryshort)
FROM tracking
LEFT JOIN conversions ON tracking.id=conversions.track_id
WHERE month='Jan' AND year = '07'
ORDER BY rec_timestamp DESC
LIMIT 0, 800

Any ideas gratefully received

Thanks

    why can't you just join all three tables in one query?

      why can't you just join all three tables in one query?

      The problem is that there is no foreign key in the ipcountry table that matches records in the tracking table. The only relationship between these two tables is that the ip_number in the tracking table may or may not lie between two values in the ipcountry table.

      I've tried joining the tables in one query, but couldn't get the query to work:

      SELECT tracking.*, conversions.lead_type, conversions.posted, FROM tracking LEFT JOIN conversions ON tracking.id=conversions.track_id LEFT JOIN ipcountry ON tracking.ip_number BETWEEN ipcountry.ipfrom AND ipcountry.ipto WHERE month='Jan' AND year = '07' ORDER BY rec_timestamp DESC LIMIT 0, 800

        If you were using a db that had an IP type (see my sig) the between might work like that.

        Since IP addresses are really just 32-bit numbers, you could always convert your IP table to a series of base-10 ints (or base-16, if MySQL has base-16 capability) and use BETWEEN that way.

        The other way is to use regex matching. But that will likely be slow.
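
        Something along these lines, for example - MySQL's built-in INET_ATON() already does the dotted-quad-to-integer conversion; the column names below are only guesses based on this thread:

        -- assumed columns: tracking.ip_address (dotted quad), tracking.ip_number
        -- (unsigned INT, since values above 127.255.255.255 overflow a signed INT)
        UPDATE tracking SET ip_number = INET_ATON(ip_address);

        -- a country lookup then becomes a plain range test on integer columns
        SELECT countryshort
        FROM ipcountry
        WHERE INET_ATON('192.0.2.33') BETWEEN ipfrom AND ipto;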

          Thanks for your help, Sxooter

          In fact I am converting the IP addresses as you suggest, and I even got the query to work joining the three tables; HOWEVER, the query was so slow that I had to stop it for fear of the server (dual Xeons, 2GB RAM).

          The tracking table has 196,000 records and the location lookup table has 86,000 records. I stopped the query after about a minute 🙁

          This is the query that took so long:

          SELECT tracking.*, conversions.lead_type, conversions.posted
          FROM tracking
          LEFT JOIN conversions ON tracking.id=conversions.track_id
          LEFT JOIN ipcountry ON (tracking.ip_number BETWEEN ipcountry.ipfrom AND ipcountry.ipto)
          WHERE month='Jan' AND year = '07' ORDER BY rec_timestamp DESC LIMIT 0, 800

          The ipcountry lookup table has a primary key on the ipfrom and ipto fields

          Any suggestions?

            Do you have indexes on the fields on both sides of those joins, like tracking.ip_number?

              Actually I didn't have an index on the tracking.ip_number field (I had deleted it when I was messing around with the table structures)

              I've put an index on it now; however, the query is still excruciatingly slow. I'm getting the database checked by our hosting company to make sure it's OK.

              I know you can tell MySQL to use a specific index, but I'm not sure of the syntax when joins are involved.
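
              For reference, the index hint goes straight after the table name in the FROM clause, even when joining - a rough sketch only, and the index name idx_ip_number (plus which hints, if any, actually help here) is an assumption:

              SELECT tracking.*, conversions.lead_type, conversions.posted
              FROM tracking USE INDEX (idx_ip_number)
              LEFT JOIN conversions ON tracking.id = conversions.track_id
              LEFT JOIN ipcountry USE INDEX (PRIMARY)
                  ON tracking.ip_number BETWEEN ipcountry.ipfrom AND ipcountry.ipto
              WHERE month='Jan' AND year = '07'
              ORDER BY rec_timestamp DESC
              LIMIT 0, 800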

                Try it with a regular join once just to see if the query gets faster. MySQL's query planner has some issues with trying to optimize outer joins, and that may be what's causing your problem here.

                It may be that a subselect will make the query planner make the right decision.
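
                A subselect version might look something like this - just a sketch, and it assumes the ipfrom/ipto ranges don't overlap, so the correlated subquery returns at most one row per tracking record:

                SELECT tracking.*, conversions.lead_type, conversions.posted,
                       (SELECT ipcountry.countryshort
                        FROM ipcountry
                        WHERE tracking.ip_number BETWEEN ipcountry.ipfrom AND ipcountry.ipto
                        LIMIT 1) AS countryshort
                FROM tracking
                LEFT JOIN conversions ON tracking.id = conversions.track_id
                WHERE month='Jan' AND year = '07'
                ORDER BY rec_timestamp DESC
                LIMIT 0, 800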

                  Just to jump in here:

                  Sxooter wrote:

                  If you were using a db that had an IP type (see my sig) the between might work like that.

                  MySQL has such a type - it's called integer!

                  In MySQL, localhost = 2130706433. Why? Because:

                   (first octet * 256³) + (second octet * 256²) + (third octet * 256) + (fourth octet) 
                  =  (first octet * 16777216) + (second octet * 65536) + (third octet * 256) + (fourth octet) 
                  =  (127 * 16777216) + (0 * 65536) + (0 * 256) + (1) 
                  =  2130706433 
                    bradgrafelman wrote:

                    Just to jump in here:

                    MySQL has such a type - it's called integer!

                    In MySQL, localhost = 2130706433. Why? Because:

                     (first octet * 256³) + (second octet * 256²) + (third octet * 256) + (fourth octet) 
                    =  (first octet * 16777216) + (second octet * 65536) + (third octet * 256) + (fourth octet) 
                    =  (127 * 16777216) + (0 * 65536) + (0 * 256) + (1) 
                    =  2130706433 

                    So, if you feed it an address like 234.435.17.87, it throws an error?
                    pgsql can also store the subnet masks, and it handles IPv6 as well as IPv4.
                    Oh, and there's a MAC address type too.

                    http://www.postgresql.org/docs/8.2/static/datatype-net-types.html

                    There are a range of comparison functions too

                    http://www.postgresql.org/docs/8.2/static/functions-net.html
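
                    For instance, the containment check is a single operator with the inet/cidr types - a minimal sketch, assuming the country ranges could be expressed as CIDR blocks (the table and column names here are made up):

                    CREATE TABLE ipcountry_pg (
                        network      cidr,
                        countryshort char(2)
                    );

                    -- "is contained within" operator; no integer conversion needed
                    SELECT countryshort
                    FROM ipcountry_pg
                    WHERE inet '192.0.2.33' << network;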

                    But, if you wanna do it all by hand, and reinvent the wheel, feel free.

                    I know what you're saying, just pointing out that a proper type takes more than a bit of simple maths to get right. 😃

                      Hey, you can still collate different columns with different locales, something postgresql can't do.

                      For me, for full text indexing and content management, mysql is still king. But for anything involving math (and getting the right answer) pgsql is king.

                        Hi Guys

                        I tried the query with regular joins and it was still very slow. The hosting company has checked the database and it's OK, so I've decided to abandon this approach and look up the location when the entry is logged, rather than during report generation.

                        It's the first time that MySQL has let me down.

                        Thanks very much for all your help.

                        Chris

                          You might do better making it into a subselect.

                          But this is the kind of problem that mysql's rather limited query planner is known to have issues with.
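
                          For example, cutting the row set down in a derived table first, and only then doing the country lookup, keeps the BETWEEN work to the 800 report rows rather than all 196,000 - just a sketch against the table names in this thread:

                          SELECT t.*,
                                 (SELECT c.countryshort
                                  FROM ipcountry c
                                  WHERE t.ip_number BETWEEN c.ipfrom AND c.ipto
                                  LIMIT 1) AS countryshort
                          FROM (SELECT tracking.*, conversions.lead_type, conversions.posted
                                FROM tracking
                                LEFT JOIN conversions ON tracking.id = conversions.track_id
                                WHERE month='Jan' AND year = '07'
                                ORDER BY rec_timestamp DESC
                                LIMIT 0, 800) AS t
                          ORDER BY t.rec_timestamp DESC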

                          If you wanna test this query with PostgreSQL, give me some info on your distribution and number of entries etc and I'd be happy to build a test data set (or download one from you if that's ok) and test it on a postgresql db.

                            If you wanna test this query with PostgreSQL, give me some info on your distribution and number of entries etc and I'd be happy to build a test data set (or download one from you if that's ok) and test it on a postgresql db.

                            That's a very kind offer - thanks. I'm not sure that I can give you a copy of the data, as the tracking data is commercially sensitive and the IP/geographical localisation table is under licence. However, there is a table of 196,000 IP addresses, each with its INT(10) equivalent, that I was looking up against a table containing 88,000 IP/location rows, though I was limiting the recordset to 800 for the report that caused all the problems.

                            MySQL vs PostgreSQL is a very relevant decision for us at this stage, though, as these problems have been encountered during the development of a prototype system that we want to release as a commercial product.

                            We intend to release the product as a hosted, rather than distributed application, and hence have to consider server sizing, choice of database and the overall feasibility of the project from a database size and handling point of view.

                            There could potentially be 10,000 web sites to log, with each web site generating 10,000 log entries, though this is an end goal that would take time to achieve.

                            So decisions about running a single table with potentially 100 million entries, or splitting the data into many databases, have to be made, and server sizing and infrastructure design have to be taken into account.

                            I've no experience with PostgreSQL, so I'd be interested in your views about which would be more suitable for database applications of this size. We don't want to use Oracle as it would be too expensive.

                            The entry logging is actually easy and not load intensive - a simple lookup against the location table (now that we are doing it that way, after these issues) and then an insert query. The reporting, however, could be more challenging, as we will need joined queries against large data sets.
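
                            Roughly this shape, in other words - one range lookup and one insert per hit (the countryshort column on tracking and any column names not mentioned above are just placeholders):

                            -- 1. one range lookup per logged hit
                            SELECT countryshort
                            FROM ipcountry
                            WHERE INET_ATON('192.0.2.33') BETWEEN ipfrom AND ipto;

                            -- 2. store the result alongside the tracking row
                            -- ('GB' stands in for whatever step 1 returned)
                            INSERT INTO tracking (ip_number, countryshort, rec_timestamp, month, year)
                            VALUES (INET_ATON('192.0.2.33'), 'GB', NOW(), 'Jan', '07');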

                            Thanks
                            Chris

                              I would definitely look at PostgreSQL version 8.2.1, or even Bizgres / Bizgres MPP from Greenplum. Those guys have done some amazing work on the reporting capability of pgsql.

                              My experience has been that as the queries become more complex, PostgreSQL generally does better. Plus, MySQL seems to still have some issues with scaling in 5.x that they need to address. For an example of this behaviour on big machines, read this article:

                              http://pda.tweakers.net/?reviews/649

                              Note that that is a worst-case scenario, and it's a (mostly) transactional, not reporting, application.

                              I'd test both dbs side by side and see which one seems to work better for what you're doing.

                                I'll follow those suggestions up.

                                Thanks for all the help and advice.

                                  Just a point: this is one example of when and why you should de-normalise your db.
                                  Reason 1
                                  Doing the lookup when you store the data is a one-time operation, while doing it every time you run a report is just nuts; it's a trade of extra disk space against extra processing, and that's a no-brainer.
                                  Reason 2
                                  Even if there were no problem with the lookup overhead, this is historic data: you want to capture what the location is NOW, not what it may be when you run your report some time in the future. OK, IP locations may not change that much, but in other situations the data can be very dynamic, so your reports would be subject to untracked aberration and bias.

                                  Just something you may like to think about.

                                  For me, mysql started out as a very fast content repository and was excellent at it. Then people wanted to use it for transaction processing and it was not so good at that, certainly not as good as other dbs that were designed specifically for that. Now people have started trying to do BI and data warehousing with it and, as you can see, mysql is not very good at it.

                                  Half the art of being a software engineer, or any kind of engineer for that matter, is choosing the right tools for the job. There is nothing to say that you cannot use mysql for your data capture and content presentation, and another db for BI reporting. Just dump the data into another db with the right functionality and use that as your data warehouse for reporting.

                                    Thanks, Roger, for this sound advice.

                                    I had moved from performing the lookups during report production to performing them at data capture for the very reasons that you mentioned, as well as the performance problems with MySQL, even though it goes against the normalisation grain.

                                    I've been looking into PostgreSQL and checking out a few sites that use it. This is mainly because the complexities of our app reside in reporting rather than data capture.

                                     One site - http://www.flightaware.com - displays an impressive range of reports and BI from large volumes of data. The reports are fast and the site is a good advertisement for the powers of PostgreSQL.

                                      Wow! Very cool website there. I gotta show my work buddies that one.
