Performance on LEFT JOIN

ratamacue

Hello, I am having some performance issues with a certain Postgres database query joining 2 tables (containing 1 record each) and 1 view (containing 3 million records). Here are the table definitions, simplified as much as possible:

table1 (contains 1 record):
record_id SERIAL
table2_id INT4 (foreign key, can be null)
view1_id INT4 (foreign key, can be null)

table2 (contains 1 record):
record_id SERIAL

view1 (contains 3 million records, joins 3 other tables):
record_id SERIAL
column2 TEXT NOT NULL

Note that table1.view1_id is actually a foreign key to a table containing 3 million records, which is reconstructed by view1. Indexes have been created on all columns shown above, and VACUUM ANALYZE has been run. The tables are normalized. Here is the query (simplified):

SELECT *
FROM (table1 LEFT JOIN table2 ON table1.table2_id = table2.record_id)
LEFT JOIN view1 ON table1.view1_id = view1.record_id
WHERE view1.column2 = 'TEST';

As written above, the query takes well over 1 minute to execute. If I substitute "INNER JOIN" for the second "LEFT JOIN" (the one involving view1), performance improves dramatically and the query executes in about 1 second.

Here is my EXPLAIN output. As you can see I have a table of cities, a table of countries, and a table of provinces. The table containing cities is the one with 3 million records.

EXPLAIN output for LEFT JOIN version of query (slow)

Nested Loop  (cost=182.85..576063.89 rows=220 width=629)
  ->  Hash Join  (cost=1.02..2.09 rows=3 width=377)
        ->  Seq Scan on table1 t1  (cost=0.00..1.03 rows=3 width=276)
        ->  Hash  (cost=1.02..1.02 rows=2 width=101)
              ->  Seq Scan on table2 t2  (cost=0.00..1.02 rows=2 width=101)
  ->  Subquery Scan v1  (cost=181.83..148003.47 rows=2938516 width=105)
        ->  Hash Join  (cost=181.83..148003.47 rows=2938516 width=105)
              ->  Seq Scan on table_city t3  (cost=0.00..59666.16 rows=2938516 width=47)
              ->  Hash  (cost=170.93..170.93 rows=4360 width=58)
                    ->  Hash Join  (cost=6.12..170.93 rows=4360 width=58)
                          ->  Seq Scan on table_province t2  (cost=0.00..77.60 rows=4360 width=27)
                          ->  Hash  (cost=5.50..5.50 rows=250 width=31)
                                ->  Seq Scan on table_country t1  (cost=0.00..5.50 rows=250 width=31)

EXPLAIN output for INNER JOIN version of query (fast)

Hash Join  (cost=477.17..488.27 rows=1 width=482)
  ->  Merge Join  (cost=475.07..486.12 rows=10 width=105)
        ->  Sort  (cost=434.49..434.49 rows=4360 width=58)
              ->  Hash Join  (cost=6.12..170.93 rows=4360 width=58)
                    ->  Seq Scan on table_province t2  (cost=0.00..77.60 rows=4360 width=27)
                    ->  Hash  (cost=5.50..5.50 rows=250 width=31)
                          ->  Seq Scan on table_country t1  (cost=0.00..5.50 rows=250 width=31)
        ->  Sort  (cost=40.58..40.58 rows=10 width=47)
              ->  Index Scan using table_city_city_name_key on trek_city t3  (cost=0.00..40.43 rows=10 width=47)
  ->  Hash  (cost=2.09..2.09 rows=3 width=377)
        ->  Hash Join  (cost=1.02..2.09 rows=3 width=377)
              ->  Seq Scan on table1 t1  (cost=0.00..1.03 rows=3 width=276)
              ->  Hash  (cost=1.02..1.02 rows=2 width=101)
                    ->  Seq Scan on table2 t2  (cost=0.00..1.02 rows=2 width=101)

Could anyone offer any suggestions on how to improve the performance of the LEFT JOIN? I can't figure out why it won't use the index like the INNER JOIN does. Thank you for your time.

tomhath

I don't know Postgres well enough to answer the question, but since you filter on view1 in the WHERE clause, the INNER JOIN and LEFT JOIN should return the same result set (any NULL-extended rows produced by the LEFT JOIN will be excluded by the WHERE).
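
The equivalence described above can be checked on a toy dataset. The sketch below is an illustration only, not something from the original posters: it assumes Python's built-in sqlite3 module as a stand-in for Postgres, a plain table named view1 in place of the real view, and made-up rows. It shows the join semantics, which are standard SQL; it does not reproduce the planner behavior discussed in this thread.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Postgres (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (record_id INTEGER PRIMARY KEY, view1_id INTEGER);
    -- A plain table stands in for the real view (assumption).
    CREATE TABLE view1  (record_id INTEGER PRIMARY KEY, column2 TEXT NOT NULL);
    INSERT INTO table1 VALUES (1, 10), (2, 20), (3, NULL);
    INSERT INTO view1  VALUES (10, 'TEST'), (20, 'OTHER');
""")

# Same query text for both runs; only the join keyword changes.
query = """
    SELECT table1.record_id, view1.column2
    FROM table1 {join} view1 ON table1.view1_id = view1.record_id
    WHERE view1.column2 = 'TEST'
    ORDER BY table1.record_id
"""

left_rows = conn.execute(query.format(join="LEFT JOIN")).fetchall()
inner_rows = conn.execute(query.format(join="INNER JOIN")).fetchall()

# The row with a NULL view1_id survives the LEFT JOIN itself, but its
# column2 is NULL, so the WHERE clause discards it again.
print(left_rows)
print(inner_rows)
```

With this data both queries return the same single row, because a NULL column2 can never satisfy column2 = 'TEST'.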

ratamacue

You're right, the WHERE clause should read

WHERE view1.column2 = 'TEST' OR table1.view1_id IS NULL

Thanks for pointing that out. Anyone have any suggestions on the performance issue?

Sxooter

Well, I just tried the same thing, and it ran quite fast. But I was using indexes. Have you indexed the join field on your large table? My query and plan looked like this:

explain select * from j1 join j2 on (j1.id=j2.id2) left join logs on (j2.id2=logs.logid);
NOTICE:  QUERY PLAN:

Nested Loop  (cost=0.00..4.27 rows=1 width=369)
  ->  Nested Loop  (cost=0.00..2.08 rows=1 width=28)
        ->  Seq Scan on j1  (cost=0.00..1.01 rows=1 width=12)
        ->  Seq Scan on j2  (cost=0.00..1.01 rows=1 width=16)
  ->  Index Scan using logs_pkey on logs  (cost=0.00..2.13 rows=1 width=341)

EXPLAIN

Note the index scan on the logs table, and the use of the nested loop. You can turn nested loops and the other plan types on and off with a session variable. Try this first:

show all;

from the psql prompt. Then, you can use something like:

set enable_mergejoin = off;

and see what happens.

P.S. You should probably put the name of the DBMS you're using in the title, as performance tuning is quite different for each engine.

Hope this helps.

Sxooter

Quick followup: what flavor of pgsql are you running? I'm not sure indexes were used under views before 7.3 came out.

ratamacue

Thanks for the reply. Yes, all join fields are indexed. When I use the INNER JOIN, the index is used, but when I use the LEFT JOIN the index is not used. I am using Postgres 7.2.1.

Sxooter

I'm testing on 7.3 and it seems to work. I think there were some issues in 7.2 with using indexes under a view.

Have you tried:

set enable_seqscan = off;
select (your query here)

yet?

ratamacue

Yes I have tried that and there is no significant difference in performance. I will see if I can test the query on Postgres 7.3 later today. The query planner seems to want to select all 3 million records from the view and join that with the other tables and then apply the filter, instead of filtering the view first and then performing the join. As far as I can tell that is the root of the problem. It seems like there would be a simple fix but I can't figure it out.

chriskl

Sounds like the problem that was fixed in 7.3.

Try this:

SELECT *
FROM (table1 LEFT JOIN table2 ON table1.table2_id = table2.record_id)
LEFT JOIN (SELECT * FROM view1 WHERE column2 = 'TEST') AS sub ON table1.view1_id = sub.record_id;

That way you're FORCING postgres to filter first.

Chris
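
One thing to check before adopting the subquery form: it does not necessarily return the same rows as the version that filters in the outer WHERE clause. The sketch below is an illustration under assumptions (Python's built-in sqlite3 as a stand-in for Postgres, a plain table named view1 in place of the real view, made-up rows). It compares the corrected WHERE version from earlier in the thread with the subquery version on a row whose view1_id points at a non-matching view1 record.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Postgres (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (record_id INTEGER PRIMARY KEY, view1_id INTEGER);
    -- A plain table stands in for the real view (assumption).
    CREATE TABLE view1  (record_id INTEGER PRIMARY KEY, column2 TEXT NOT NULL);
    INSERT INTO table1 VALUES (1, 10), (2, 20), (3, NULL);
    INSERT INTO view1  VALUES (10, 'TEST'), (20, 'OTHER');
""")

# Filter applied after the join, in the outer WHERE clause.
where_rows = conn.execute("""
    SELECT table1.record_id, view1.column2
    FROM table1 LEFT JOIN view1 ON table1.view1_id = view1.record_id
    WHERE view1.column2 = 'TEST' OR table1.view1_id IS NULL
    ORDER BY table1.record_id
""").fetchall()

# Filter applied inside the subquery, before the join.
sub_rows = conn.execute("""
    SELECT table1.record_id, sub.column2
    FROM table1
    LEFT JOIN (SELECT * FROM view1 WHERE column2 = 'TEST') AS sub
           ON table1.view1_id = sub.record_id
    ORDER BY table1.record_id
""").fetchall()

print(where_rows)  # table1 row 2 (pointing at 'OTHER') is dropped
print(sub_rows)    # table1 row 2 is kept, with NULL in the view columns
```

In this toy data the WHERE version drops the table1 row that references a non-'TEST' record, while the subquery version keeps it with NULLs on the view side. If table1 never references a non-matching record, the two forms agree.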

Sxooter

Oh yeah, a point I kept forgetting to make is that a left join is almost always gonna be slower than a standard join, since it returns every row in the left-hand table, whether or not it matches one in the right table. A standard join only returns the rows from the left table that match the right. Since the left join returns more rows and data, it's usually slower, anywhere from a little to a bunch, mostly depending on the difference in the size of the returned dataset.

ratamacue

Thanks for the info. I tried the query on 7.3 but unfortunately it still follows the same query plan where the filter is applied to the view after the join, instead of before the join. Then I tried the subquery version that chriskl posted and it worked, it definitely forced the query planner to filter the view first. So I'll probably go with that for now, but I was thinking that it would be possible to bypass the left joins completely and just use special values (-1) in the join fields (type INT4) instead of NULL. The fields would then have the NOT NULL constraint and I could use the inner join all around. Would that be a good idea, or would I do better to use NULL to indicate a missing foreign key?
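
One consequence of the sentinel idea can be seen on a toy schema. The sketch below is an illustration under assumptions: Python's built-in sqlite3 stands in for Postgres, the table name city and the placeholder label '(none)' are hypothetical, and the rows are made up. It shows that if the join column is a real foreign key, the value -1 needs a matching placeholder row in the referenced table, and that an inner join then returns that placeholder like any other row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- Hypothetical stand-in for the large referenced table.
    CREATE TABLE city (record_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE table1 (
        record_id INTEGER PRIMARY KEY,
        city_id   INTEGER NOT NULL REFERENCES city(record_id)
    );
    INSERT INTO city   VALUES (10, 'TEST');
    INSERT INTO table1 VALUES (1, 10);
""")

# Without a placeholder row, the sentinel violates the foreign key.
try:
    conn.execute("INSERT INTO table1 VALUES (2, -1)")
    sentinel_rejected = False
except sqlite3.IntegrityError:
    sentinel_rejected = True

# Add a placeholder row for the sentinel, then retry the insert.
conn.execute("INSERT INTO city VALUES (-1, '(none)')")
conn.execute("INSERT INTO table1 VALUES (2, -1)")

rows = conn.execute("""
    SELECT table1.record_id, city.name
    FROM table1 INNER JOIN city ON table1.city_id = city.record_id
    ORDER BY table1.record_id
""").fetchall()

print(sentinel_rejected)
print(rows)
```

Every query that reads the referenced table would then have to treat the placeholder row specially, which is part of the trade-off against using NULL for a missing foreign key.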

Sxooter

Could you post your postgresql.conf and tell us what kind of memory / CPU / hard drive setup you are running? Especially look at sort_mem, shared_buffers, effective_cache_size, random_page_cost, and the three cpu_*_cost variables (cpu_tuple_cost, cpu_index_tuple_cost, cpu_operator_cost).

Change those to your best guess, run a full vacuum and an analyze, and see if things change. Also try issuing a "set enable_seqscan = off" before running your query and see if that helps.