Re: Performance improvement for joins where outer side is unique - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: Performance improvement for joins where outer side is unique |
Date | |
Msg-id | 55DA1F9F.5040204@2ndquadrant.com Whole thread Raw |
In response to | Re: Performance improvement for joins where outer side is unique (Erik Rijkers <er@xs4all.nl>) |
Responses |
Re: Performance improvement for joins where outer side is unique
|
List | pgsql-hackers |
Hi, I did some initial performance evaluation of this patch today, and I see a clear improvement on larger joins. The scenario I've chosen for the experiments is a simple fact-dimension join, i.e. a small table referenced by a large table. So effectively something like this: CREATE TABLE dim (id INT PRIMARY KEY, ...); CREATE TABLE fact (dim_d INT REFERENCES dim(id), ...); except that I haven't used the foreign key constraint. In all the experiments I've used a fact table 10x the size of the dimension, but I believe what really matters most is the size of the dimension (and how the hash table fits into the L2/L3 cache). The query tested is very simple: select count(1) from ( select * from f join d on (f.fact_id = d.dim_id) ) foo; The outer aggregation is intentional - the join produces many rows and formatting them as string would completely eliminate any gains from the patch (even with "\o /dev/null" or such). The following numbers come current master, running on E5-2630 v3 (2.40GHz), 64GB of RAM and this configuration: checkpoint_timeout = 15min effective_cache_size = 48GB maintenance_work_mem = 1GB max_wal_size = 4GB min_wal_size = 1GB random_page_cost = 1.5 shared_buffers = 4GB work_mem = 1GB all the other values are set to default. I did 10 runs for each combination of sizes - the numbers seem quite consistent and repeatable. I also looked at the median values. dim 100k rows, fact 1M rows --------------------------- master patched ------- ------- 1 286.184 265.489 2 284.827 264.961 3 281.040 264.768 4 280.926 267.720 5 280.984 261.348 6 280.878 261.463 7 280.875 261.338 8 281.042 261.265 9 281.003 261.236 10 280.939 261.185 ------- ------- med 280.994 261.406 (-7%) dim 1M rows, fact 10M rows -------------------------- master patched -------- -------- 1 4316.235 3690.373 2 4399.539 3738.097 3 4360.551 3655.602 4 4359.763 3626.142 5 4361.821 3621.941 6 4359.205 3654.835 7 4371.438 3631.212 8 4361.857 3626.237 9 4357.317 3676.651 10 4359.561 3641.830 -------- -------- med 4360.157 3648.333 (-17%) dim 10M rows, fact 100M rows ---------------------------- master patched -------- -------- 1 46246.467 39561.597 2 45982.937 40083.352 3 45818.118 39674.661 4 45716.281 39616.585 5 45651.117 40463.966 6 45979.036 41395.390 7 46045.400 41358.047 8 45978.698 41253.946 9 45801.343 41156.440 10 45720.722 41374.688 --------- --------- med 45898.408 40810.203 (-10%) So the gains seem quite solid - it's not something that would make the query an order of magnitude faster, but it's well above the noise. Of course, in practice the queries will be more complicated, making the improvement less significant, but I don't think that's a reason not to apply it. Two minor comments on the patch: 1) the 'subquery' variable in specialjoin_is_unique_join is unused 2) in the explain output, there should probably be a space before the '(inner unique)' text, so Hash Join (inner unique) ... instead of Hash Join(inner unique) but that's just nitpicking at this point. Otherwise the patch seems quite solid to me. regard -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
pgsql-hackers by date: