Re: eqjoinsel_semi still sucks ... - Mailing list pgsql-hackers
From | Tom Lane
---|---
Subject | Re: eqjoinsel_semi still sucks ...
Date |
Msg-id | 1235050.1687969679@sss.pgh.pa.us
In response to | Re: eqjoinsel_semi still sucks ... (Alena Rybakina <lena.ribackina@yandex.ru>)
List | pgsql-hackers

Alena Rybakina <lena.ribackina@yandex.ru> writes:
> On 23.06.2023 14:28, Andrey Lepikhov wrote:
>> On 2/5/2012 20:34, Tom Lane wrote:
>>> On reflection I think that the idea of clamping ndistinct beforehand is
>>> just wrong, and what we ought to do instead is apply a multiplier to the
>>> selectivity estimate afterwards.

>> I got stuck in some cases where (due to a tree of filters) the planner
>> underestimates the JOIN just because the ndistinct conveys a huge
>> number to the selectivity estimation formula.  However, the estimation
>> of both input relations is made correctly and is limited.
>> I've tried to understand the logic through commits 0d3b231eebf,
>> 97930cf578e and 7f3eba30c9d.  But it is still not clear.
>> So, why is the idea of clamping ndistinct terrible in general?  Could
>> you explain your reasons a bit more?

> Hi, I also have an interest in understanding the same issue, and I dug
> into the above commits and the topic.  I hope this letter will help to
> sort out this issue.

Note that nothing's actually been done about the idea I expressed at the
start of this thread.  The behavior is better now: if you try the example
in v12 or later you get

regression=# explain analyze select * from tenk1 a where not exists(select 1 from tenk1 b where a.thousand = b.unique2 and b.two = 0);
                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Hash Anti Join  (cost=532.50..1059.38 rows=5000 width=244) (actual time=1.993..3.959 rows=5090 loops=1)
   Hash Cond: (a.thousand = b.unique2)
   ->  Seq Scan on tenk1 a  (cost=0.00..445.00 rows=10000 width=244) (actual time=0.003..0.502 rows=10000 loops=1)
   ->  Hash  (cost=470.00..470.00 rows=5000 width=4) (actual time=1.960..1.960 rows=5000 loops=1)
         Buckets: 8192  Batches: 1  Memory Usage: 240kB
         ->  Seq Scan on tenk1 b  (cost=0.00..470.00 rows=5000 width=4) (actual time=0.003..1.449 rows=5000 loops=1)
               Filter: (two = 0)
               Rows Removed by Filter: 5000
 Planning Time: 0.760 ms
 Execution Time: 4.087 ms
(10 rows)

I believe this improvement just stems from commit a314c3407 ("Clamp
semijoin selectivity to be not more than inner-join selectivity"), and so
this example looks good only because the actual semijoin output size
happens to be the same as the inner-join output size.  With a slight
adjustment, that's no longer true and the estimate still sucks:

regression=# explain analyze select * from tenk1 a, tenk1 b where a.thousand = b.fivethous and b.two = 0;
                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=532.50..1127.50 rows=10000 width=488) (actual time=2.000..5.742 rows=10000 loops=1)
 ...

regression=# explain analyze select * from tenk1 a where exists(select 1 from tenk1 b where a.thousand = b.fivethous and b.two = 0);
                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Hash Semi Join  (cost=532.50..1127.50 rows=10000 width=244) (actual time=1.529..3.474 rows=5000 loops=1)
 ...

regression=# explain analyze select * from tenk1 a where not exists(select 1 from tenk1 b where a.thousand = b.fivethous and b.two = 0);
                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Hash Anti Join  (cost=532.50..1027.50 rows=1 width=244) (actual time=1.564..3.476 rows=5000 loops=1)
 ...

So we still have an issue to fix.
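
For readers who want to poke at these examples without the regression
database: a rough stand-in for tenk1 can be built by hand.  This is only
a sketch under assumptions, not the real regression loader -- it creates
just the columns the queries above touch, with the same modular structure
(two = i % 2, thousand = i % 1000, fivethous = i % 5000), and it uses a
plain serial ordering where the real table's unique2 is a random
permutation, so row counts and costs will differ slightly from the plans
quoted above.

create table tenk1 as
select i        as unique2,   -- stand-in; the real unique2 is a random permutation of 0..9999
       i % 2    as two,
       i % 1000 as thousand,
       i % 5000 as fivethous
from generate_series(0, 9999) as i;
analyze tenk1;  -- collect statistics so the planner has ndistinct estimates to work with

With that in place, the explain analyze statements above should run
as-is and reproduce the same shape of misestimation.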
I've not had time to pursue the idea I expressed at the start of the
thread.  Also, it's a bit scary to change this logic with only a few
examples to look at.  If you could reduce the problem cases you're
looking at to something sharable, that could help move things along.

			regards, tom lane