Thread: partition table optimizer join cost misestimation

partition table optimizer join cost misestimation

From: James Pang
Date:
Hi, 
   PostgreSQL v14.8: we found that the optimizer does not take the "Merge Append" cost into the SQL plan's total cost and therefore makes a bad SQL plan. Please find the details attached.

  Query (table and column names masked):

   select ....
   from tablea aa
        inner join tableb bb on aa.ind1 = bb.ind1
        inner join tablec cc on aa.ind2 = cc.ind2
   where aa.ind3 in ($1)
   order by cc.id3;

The optimizer chooses a merge join with a full index scan that reads a lot of rows and needs huge I/O.
It looks like the optimizer does not take the huge "Merge Append" cost into account in the whole SQL plan cost;
it only considers the Sort and Merge Join costs, which makes it choose a merge join instead of a nested loop.

--bad plan: it took hundreds of seconds to complete.
Sort  (cost=9624.33..9677.60 rows=21306 width=981)
   Sort Key: cc.id3
   ->  Nested Loop  (cost=99.78..2717.44 rows=21306 width=981)
         ->  Merge Join  (cost=99.35..2009.19 rows=21306 width=915)
               Merge Cond: (cc.ind2 = aa.id2)
                ->  Merge Append  (cost=8.15..17046177.32 rows=98194074 width=903)            <<< huge Merge Append cost, which the optimizer does not seem to take into the total plan cost.

--good plan: with merge joins disabled the optimizer chooses a nested loop, and the query took only 20 milliseconds to complete.

 Sort  (cost=19618.71..19671.98 rows=21306 width=981)
   Sort Key: cc.ind2
   ->  Nested Loop  (cost=0.72..12711.82 rows=21306 width=981)
         ->  Nested Loop  (cost=0.29..12003.57 rows=21306 width=915)
               ->  Append  (cost=0.29..42.46 rows=30 width=28)
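
For reference, the nested-loop plan above was produced roughly like this (a sketch: enable_mergejoin is the standard planner switch for disabling merge joins, and the query text is masked as above):

   -- Disable merge joins for the current session only,
   -- then re-run the query under EXPLAIN to compare plans.
   SET enable_mergejoin = off;

   EXPLAIN (ANALYZE, BUFFERS)
   select ....
   from tablea aa
        inner join tableb bb on aa.ind1 = bb.ind1
        inner join tablec cc on aa.ind2 = cc.ind2
   where aa.ind3 in ($1)
   order by cc.id3;

   -- Restore the default planner behavior afterwards.
   RESET enable_mergejoin;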

Thanks,

James

Re: partition table optimizer join cost misestimation

From: Andrei Lepikhov
Date:
On 4/2/25 12:18, James Pang wrote:
> Hi,
>     PostgreSQL v14.8: we found that the optimizer does not take the "Merge
> Append" cost into the SQL plan's total cost and therefore makes a bad SQL
> plan. Please find the details attached.
I suppose there is a different type of issue here.
MergeJoin sometimes doesn't need to scan the whole inner or outer side
(see the MergeScanSelCache structure and how it is used in the cost
estimation routine).

So, the cost can be less because the optimizer predicted that only a
small part of the Append would be scanned, and it used a sort of
interpolation between the startup cost and the total cost.
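
As a back-of-the-envelope illustration with the numbers from your plan (a sketch only; the real cost estimation routine also charges for the other input and the comparisons): the charged cost is roughly startup + fraction_scanned * (total - startup). The Merge Append is costed 8.15..17046177.32 while the Merge Join totals only 2009.19, so the implied scan fraction of the inner side is at most about (2009.19 - 8.15) / (17046177.32 - 8.15), i.e. around 0.01%.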

But to be sure, could you send the results of EXPLAIN ANALYZE VERBOSE? 
If you also send the data to reproduce the case, we may find the source 
of the problem more precisely.
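
That is, something like (a sketch; run it on your original, unmasked query):

   EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
   select ....  -- the full three-way join query from your first mail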

-- 
regards, Andrei Lepikhov



Re: partition table optimizer join cost misestimation

From: James Pang
Date:
Andrei,
   Yes, from the explain output the optimizer already gets the Merge Append cost but does not take it into account in the total cost, and that makes a big difference. I have shared the table DDLs and the EXPLAIN (ANALYZE, BUFFERS) output. I think the data could be generated some other way to reproduce this issue; sorry for not sharing the commercial production data here.

Andrei Lepikhov <lepihov@gmail.com> wrote on Wed, Apr 2, 2025 at 7:03 PM:
On 4/2/25 12:18, James Pang wrote:
> Hi,
>     PostgreSQL v14.8: we found that the optimizer does not take the "Merge
> Append" cost into the SQL plan's total cost and therefore makes a bad SQL
> plan. Please find the details attached.
I suppose there is a different type of issue here.
MergeJoin sometimes doesn't need to scan the whole inner or outer side
(see the MergeScanSelCache structure and how it is used in the cost
estimation routine).

So, the cost can be less because the optimizer predicted that only a
small part of the Append would be scanned, and it used a sort of
interpolation between the startup cost and the total cost.

But to be sure, could you send the results of EXPLAIN ANALYZE VERBOSE?
If you also send the data to reproduce the case, we may find the source
of the problem more precisely.

--
regards, Andrei Lepikhov

Re: partition table optimizer join cost misestimation

From: Andrei Lepikhov
Date:
On 4/3/25 02:46, James Pang wrote:
> Andrei,
>     Yes, from the explain output the optimizer already gets the Merge Append
> cost but does not take it into account in the total cost, and that makes a
> big difference. I have shared the table DDLs and the EXPLAIN (ANALYZE,
> BUFFERS) output. I think the data could be generated some other way to
> reproduce this issue; sorry for not sharing the commercial production data here.
I think it is almost enough to identify the issue. Let me ask you the 
following questions:
1. Can you provide the min and max values of the columns cc.sss_id and aa.mmm_id?
2. How often do you analyze your *parent* tables tablea and tablec?
Remember: if you want to build statistics on a partitioned table (not on a
partition), you have to explicitly run

ANALYZE tablea, tablec;

mentioning these tables in the ANALYZE list.

3. Could you provide a dump of pg_statistic for the attributes cc.sss_id and
aa.mmm_id? (See the query sketch below.)
4. Is there a possibility of changing a single code line and rebuilding 
your DB instance to check a conjecture?
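
For questions 1 and 3, queries along these lines should do (a sketch; table and column names follow the masked schema above, and pg_stats is the human-readable view over pg_statistic):

   -- Question 1: min/max of the join keys.
   SELECT min(mmm_id), max(mmm_id) FROM tablea;
   SELECT min(sss_id), max(sss_id) FROM tablec;

   -- Question 3: the relevant statistics rows. For a partitioned parent,
   -- the whole-table statistics are the rows with inherited = true.
   SELECT *
   FROM pg_stats
   WHERE tablename IN ('tablea', 'tablec')
     AND attname IN ('mmm_id', 'sss_id');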

-- 
regards, Andrei Lepikhov



Re: partition table optimizer join cost misestimation

From: James Pang
Date:
  Following your suggestion to increase the statistics target (I increased it to 1000 for aa.mmm_id and cc.sss_id and ran ANALYZE tablea, tablec again), the optimizer now chooses the good SQL plan.
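
For the record, the commands were along these lines (a sketch; per-column targets are set with ALTER TABLE ... ALTER COLUMN ... SET STATISTICS):

   -- Raise the per-column statistics target on the join keys.
   ALTER TABLE tablea ALTER COLUMN mmm_id SET STATISTICS 1000;
   ALTER TABLE tablec ALTER COLUMN sss_id SET STATISTICS 1000;

   -- Partitioned parents are not analyzed automatically, so do it explicitly.
   ANALYZE tablea, tablec;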

Thanks,

James 

Andrei Lepikhov <lepihov@gmail.com> wrote on Thu, Apr 3, 2025 at 4:44 PM:
On 4/3/25 10:04, James Pang wrote:
> one more comment: for vacuum/analyze, we enable autovacuum=on, which may
> sometimes automatically analyze a partition of the partitioned table directly.
I see some incoherence in the data provided. The ranges of the joining columns
intersect only partially:

cc.sss_id: 5 100 001 101 - 7 999 999 601
aa.mmm_id: 2 005 242 651 - 5 726 786 022

So the intersection range, 5100001101 - 5726786022, spans about 626.8 million
values out of a combined range of roughly 5.99 billion (2005242651 -
7999999601), i.e. about 10% of the whole range.
But I don't see that reflected in the column statistics you provided, and
Postgres may not see it either.
So, at first, I recommend increasing default_statistics_target, or just the
statistics target on the partitioned tables only. For such big tables I
usually set it to at least 2500.
Also, don't trust autovacuum on a partitioned table: to analyze it, autovacuum
would need to lock each partition, which is highly unlikely to happen.
So, increase the stats target, run ANALYZE tablea, tablec, and let me know
what happens. Maybe after the analyze the statistics will be more consistent.

--
regards, Andrei Lepikhov