Re: tweaking NTUP_PER_BUCKET - Mailing list pgsql-hackers
From:           Tomas Vondra
Subject:        Re: tweaking NTUP_PER_BUCKET
Msg-id:         53CADE1D.3010601@fuzzy.cz
In response to: Re: tweaking NTUP_PER_BUCKET (Tomas Vondra <tv@fuzzy.cz>)
Responses:      Re: tweaking NTUP_PER_BUCKET
List:           pgsql-hackers
On 19.7.2014 20:28, Tomas Vondra wrote:
> On 19.7.2014 20:24, Tom Lane wrote:
>> Tomas Vondra <tv@fuzzy.cz> writes:
>>> I've reviewed the two test cases mentioned here, and sadly there's
>>> nothing that can be 'fixed' by this patch. The problem here lies in the
>>> planning stage, which decides to hash the large table - we can't fix
>>> that in the executor.
>>
>> We've heard a couple reports before of the planner deciding to hash a
>> larger table rather than a smaller one. The only reason I can think of
>> for that is if the smaller table has many more duplicates, so that the
>> planner thinks the executor might end up traversing long hash chains.
>> The planner's estimates could easily be off in this area, of course.
>> estimate_hash_bucketsize() is the likely culprit if it's wrong.
>>
>> Which test case are you seeing this in, exactly?
>
> The two reported by Stephen here:
>
> http://www.postgresql.org/message-id/20130328235627.GV4361@tamriel.snowman.net
>
> Just download this (I had to gunzip it):
>
> http://snowman.net/~sfrost/test_case.sql
> http://snowman.net/~sfrost/test_case2.sql

Anyway, looking a bit at the distributions, I don't think they are
significantly skewed. In the first test case, I get this:

test=# select cnt, count(*)
         from (select id_short, count(*) AS cnt from small_table group by 1) foo
        group by cnt order by cnt;
 cnt | count
-----+-------
   1 |    23
   2 | 18795
   3 |    13
   4 |   726
   5 |     4
   6 |    93
   8 |     4
  10 |     1
(8 rows)

and in the second one I get this:

test=# select cnt, count(*)
         from (select id_short, count(*) AS cnt from small_table group by 1) foo
        group by cnt order by cnt;
 cnt | count
-----+--------
   1 | 182161
   2 |   5582
   3 |    371
   4 |     28
   5 |      2
   6 |      3
   9 |      1
(7 rows)

So while #1 contains most values twice, #2 is almost perfectly distinct.
In big_table, the values are unique.

For the first case, a WARNING at the end of estimate_hash_bucketsize
says this:

WARNING:  nbuckets=8388608.00 estfract=0.000001
WARNING:  nbuckets=65536.00 estfract=0.000267

There are ~4.3M rows in big_table, so this amounts to ~4 tuples per
bucket (the estimate sits at the 1e-6 lower bound), and ~10 tuples per
bucket for small_table (42k rows).

For the second case, I get this:

WARNING:  nbuckets=16777216.00 estfract=0.000001
WARNING:  nbuckets=262144.00 estfract=0.000100

i.e. ~11 tuples/bucket for big_table and ~20 tuples/bucket for
small_table. That sounds really strange, because small_table in the
second test case is almost perfectly unique, and in the first test case
there are only very few keys with more than 2 occurrences.

By explicitly setting the selectivity in estimate_hash_bucketsize to
1e-6, I got these plans:

Test case #1

                              QUERY PLAN
---------------------------------------------------------------------
 Hash Join  (cost=1229.46..79288.95 rows=41176 width=24)
            (actual time=50.397..634.986 rows=13731 loops=1)
   Hash Cond: (big_table.id_short = small_table.id_short)
   ->  Seq Scan on big_table  (cost=0.00..61626.71 rows=4272271 width=4)
                              (actual time=0.018..229.597 rows=4272271 loops=1)
   ->  Hash  (cost=714.76..714.76 rows=41176 width=24)
             (actual time=31.653..31.653 rows=41176 loops=1)
         Buckets: 65536  Batches: 1  Memory Usage: 3086kB
         ->  Seq Scan on small_table  (cost=0.00..714.76 rows=41176 width=24)
                                      (actual time=0.012..13.341 rows=41176 loops=1)
 Planning time: 0.597 ms
 Execution time: 635.498 ms
(8 rows)

Without explain analyze: 486 ms (original plan: >850 ms).
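For completeness, the WARNING lines quoted above come from a debug elog
I dropped in at the very end of estimate_hash_bucketsize() (selfuncs.c),
and the "selectivity set to 1e-6" experiment was just an override at the
same spot. Roughly like this - the surrounding code is quoted from
memory, so it may not match the sources exactly, and it's obviously not
meant for commit:

    /* ... at the end of estimate_hash_bucketsize() ... */

    /* existing clamp of the estimate to a sane range */
    if (estfract < 1.0e-6)
        estfract = 1.0e-6;
    else if (estfract > 1.0)
        estfract = 1.0;

    /* debug output that produced the WARNING lines above */
    elog(WARNING, "nbuckets=%.2f estfract=%f", nbuckets, estfract);

    /* the "selectivity forced to 1e-6" experiment - uncomment to reproduce */
    /* estfract = 1.0e-6; */

    ReleaseVariableStats(vardata);

    return (Selectivity) estfract;

The planner calls this once for each inner relation it considers
hashing, which is why there are two lines per test case - one for
big_table, one for small_table.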
Test case #2

                              QUERY PLAN
---------------------------------------------------------------------
 Hash Join  (cost=5240.21..220838.44 rows=194587 width=4)
            (actual time=88.252..2143.457 rows=149540 loops=1)
   Hash Cond: (big_table.id_short = small_table.id_short)
   ->  Seq Scan on big_table  (cost=0.00..169569.72 rows=11755372 width=4)
                              (actual time=0.018..533.955 rows=11755475 loops=1)
   ->  Hash  (cost=2807.87..2807.87 rows=194587 width=4)
             (actual time=87.548..87.548 rows=194587 loops=1)
         Buckets: 262144  Batches: 1  Memory Usage: 8889kB
         ->  Seq Scan on small_table  (cost=0.00..2807.87 rows=194587 width=4)
                                      (actual time=0.012..29.929 rows=194587 loops=1)
 Planning time: 0.438 ms
 Execution time: 2146.818 ms
(8 rows)

Without explain analyze: 1600 ms (original plan: >2300 ms).

The differences are fairly small - well, it's 30-40% faster, but the
absolute saving is less than 1 second in both cases. I'm wondering
whether we could run into similar problems with much longer queries,
and whether we'd still see a 30-40% improvement there.

Tomas