Re: eqjoinsel_semi still sucks ... - Mailing list pgsql-hackers
From | Tom Lane
---|---
Subject | Re: eqjoinsel_semi still sucks ...
Date |
Msg-id | 1235050.1687969679@sss.pgh.pa.us
In response to | Re: eqjoinsel_semi still sucks ... (Alena Rybakina <lena.ribackina@yandex.ru>)
List | pgsql-hackers

Alena Rybakina <lena.ribackina@yandex.ru> writes:
> On 23.06.2023 14:28, Andrey Lepikhov wrote:
>> On 2/5/2012 20:34, Tom Lane wrote:
>>> On reflection I think that the idea of clamping ndistinct beforehand is
>>> just wrong, and what we ought to do instead is apply a multiplier to the
>>> selectivity estimate afterwards.

>> I got stuck in some cases where (due to a tree of filters) the planner
>> underestimates the JOIN just because the ndistinct conveys a huge
>> number to the selectivity estimation formula.  However, the estimation
>> of both input relations is made correctly and is limited.
>> I've tried to understand the logic through commits 0d3b231eebf,
>> 97930cf578e and 7f3eba30c9d.  But it is still not clear.
>> So, why is the idea of clamping ndistinct terrible in general?  Could
>> you explain your reasons a bit more?

> Hi, I also have an interest in understanding the same issue, and I dug
> into the above commits and the topic.  I hope this letter will help to
> sort out this issue.

Note that nothing's actually been done about the idea I expressed at the
start of this thread.  The behavior is better now: if you try the example
in v12 or later you get

regression=# explain analyze select * from tenk1 a where not exists(select 1 from tenk1 b where a.thousand = b.unique2 and b.two = 0);
                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Hash Anti Join  (cost=532.50..1059.38 rows=5000 width=244) (actual time=1.993..3.959 rows=5090 loops=1)
   Hash Cond: (a.thousand = b.unique2)
   ->  Seq Scan on tenk1 a  (cost=0.00..445.00 rows=10000 width=244) (actual time=0.003..0.502 rows=10000 loops=1)
   ->  Hash  (cost=470.00..470.00 rows=5000 width=4) (actual time=1.960..1.960 rows=5000 loops=1)
         Buckets: 8192  Batches: 1  Memory Usage: 240kB
         ->  Seq Scan on tenk1 b  (cost=0.00..470.00 rows=5000 width=4) (actual time=0.003..1.449 rows=5000 loops=1)
               Filter: (two = 0)
               Rows Removed by Filter: 5000
 Planning Time: 0.760 ms
 Execution Time: 4.087 ms
(10 rows)

I believe this improvement just stems from commit a314c3407 ("Clamp
semijoin selectivity to be not more than inner-join selectivity"), and so
this example looks good only because the actual semijoin output size
happens to be the same as the inner-join output size.  With a slight
adjustment, that's no longer true and the estimate still sucks:

regression=# explain analyze select * from tenk1 a, tenk1 b where a.thousand = b.fivethous and b.two = 0;
                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=532.50..1127.50 rows=10000 width=488) (actual time=2.000..5.742 rows=10000 loops=1)
 ...

regression=# explain analyze select * from tenk1 a where exists(select 1 from tenk1 b where a.thousand = b.fivethous and b.two = 0);
                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Hash Semi Join  (cost=532.50..1127.50 rows=10000 width=244) (actual time=1.529..3.474 rows=5000 loops=1)
 ...

regression=# explain analyze select * from tenk1 a where not exists(select 1 from tenk1 b where a.thousand = b.fivethous and b.two = 0);
                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Hash Anti Join  (cost=532.50..1027.50 rows=1 width=244) (actual time=1.564..3.476 rows=5000 loops=1)
 ...

So we still have an issue to fix.
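
For readers who want to poke at these examples without the regression
database: a rough stand-in for tenk1 can be built by hand.  This is only
a sketch under assumptions, not the real regression loader -- it creates
just the columns the queries above touch, with the same modular structure
(two = i % 2, thousand = i % 1000, fivethous = i % 5000), and it uses a
plain serial ordering where the real table's unique2 is a random
permutation, so row counts and costs will differ slightly from the plans
quoted above.

create table tenk1 as
select i        as unique2,   -- stand-in; the real unique2 is a random permutation of 0..9999
       i % 2    as two,
       i % 1000 as thousand,
       i % 5000 as fivethous
from generate_series(0, 9999) as i;
analyze tenk1;  -- collect statistics so the planner has ndistinct estimates to work with

With that in place, the explain analyze statements above should run
as-is and reproduce the same shape of misestimation.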
I've not had time to pursue the idea I expressed at the start of the
thread.  Also, it's a bit scary to change this logic with only a few
examples to look at.  If you could reduce the problem cases you're
looking at to something sharable, that could help move things along.

			regards, tom lane