Thread: Performance issue post upgrade on Version 13 - Incorrect Estimation Cost choosing Hash Aggregate-Nested Left Loop Join
Hello Team,
There is a change in query plan between version 12.4 and Version 13, resulting in a performance regression after the upgrade.
- In 12.4, a Sort operation feeding a Group Aggregate is selected, which leads to a Merge Join. The query takes ~5 seconds.
- In 13.5, the optimizer misestimates and, due to the new disk-based Hash Aggregate feature, prefers a Hash Aggregate instead of a Sort. This blocks the merge join, and it chooses a Nested Loop Left Join instead. The query takes ~5 minutes.
When we increase work_mem to 23 MB, Disk Usage disappears from the query plan, but the optimizer still chooses the Hash Aggregate / Nested Loop Left Join plan (rather than Sort / Merge Join), so it remains slow. The query takes ~22 seconds.
The Version 13 query plan has a lower estimated cost than the 12.4 plan, which implies the 13.5 planner thought it had found a better plan, yet it runs slower.
12.4 Version:
"Merge Right Join (cost=202198.78..295729.10 rows=1 width=8) (actual time=1399.727..5224.574 rows=296 loops=1)"
13.5 version:
"Nested Loop Left Join (cost=196360.90..287890.45 rows=1 width=8) (actual time=3209.577..371300.693 rows=296 loops=1)"
Thanks & Regards,
Prajna Shetty
Technical Specialist,
Data Platform Support & Delivery
Re: Performance issue post upgrade on Version 13 - Incorrect Estimation Cost choosing Hash Aggregate-Nested Left Loop Join
Prajna Shetty <Prajna.Shetty@mindtree.com> writes:
> There is change in query plan in 12.4 version and Version 13 resulting in performance slowness post upgrade.

Standard upgrade methods don't transfer statistics from the old version, so the first question to ask is: have you ANALYZE'd the relevant tables since upgrading? If you have, then to offer useful help with this we'll need to see all the details described in

https://wiki.postgresql.org/wiki/Slow_Query_Questions

In any case, this is unlikely to be a bug. The pgsql-performance list would be a more suitable place to discuss it.

			regards, tom lane
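Tom's first point is quick to verify: pg_upgrade does not carry planner statistics over from the old cluster, so until the tables are re-analyzed the planner is guessing. A minimal sketch of the check (run as a superuser or the table owner):

```sql
-- Re-gather planner statistics for the whole database after an upgrade;
-- pg_upgrade leaves pg_statistic empty for migrated tables.
ANALYZE VERBOSE;

-- Confirm when each table was last analyzed, manually or by autovacuum:
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY relname;
```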
Performance issue post upgrade on Version 13 - Incorrect Estimation Cost choosing Hash Aggregate-Nested Left Loop Join
- In 12.4, a Sort operation feeding a Group Aggregate is selected, which leads to a Merge Join. The query takes ~5 seconds.
- In 13.5, the optimizer misestimates and, due to the new disk-based Hash Aggregate feature, prefers a Hash Aggregate instead of a Sort. This blocks the merge join, and it chooses a Nested Loop Left Join instead. The query takes ~5 minutes.
- It is an AWS Aurora-PostgreSQL RDS instance. We have raised a case with AWS, and since this issue appears to be a regression coming from the community PostgreSQL code, we would like to raise a bug here.
- We were upgrading from version 12.4 to 13.4 and later.
- vCPU: 2 , RAM: 8 GB
- Attached are stats for all tables in this schema for your reference.
- Attached is the metadata for one of the tables, person, for your reference.
- We have performed VACUUM/ANALYZE/REINDEX post upgrade.
- Tweaked work_mem so the hash aggregate does not spill to disk. Disk Usage no longer appears in the query plan, but it is still using a Hash Aggregate; runtime came down from ~5 minutes to ~20 seconds (expected ~5 seconds). Attached is the plan after modifying work_mem.
- Disabled enable_seqscan / enable_nestloop.
- Tweaked random_page_cost/seq_page_cost
- Set default_statistics_target=1000 and then ran VACUUM (ANALYZE, VERBOSE) on the selected tables.
- We have also tested performance by increasing resources up to 4 vCPU and 32 GB RAM.
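For reference, the knobs listed above can all be tried at session level before changing anything instance-wide, and the statistics target can be raised per column instead of globally (table/column names here are from our schema; the query itself is in the attachment, abbreviated below):

```sql
-- Session-level experiments; nothing here affects other connections.
SET work_mem = '23MB';           -- stop the hash aggregate spilling to disk
SET enable_nestloop = off;       -- steer the planner away from the nested loop
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;   -- re-check the chosen plan

-- Per-column alternative to raising default_statistics_target globally:
ALTER TABLE student_class_detail ALTER COLUMN aamc_id SET STATISTICS 1000;
ANALYZE student_class_detail;
```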
From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: Monday, March 21, 2022 7:29 PM
To: Prajna Shetty <Prajna.Shetty@mindtree.com>
Cc: pgsql-bugs@lists.postgresql.org; Beenu Sharma <Beenu.Sharma@mindtree.com>
Subject: Re: Performance issue post upgrade on Version 13 - Incorrect Estimation Cost choosing Hash Aggregate-Nested Left Loop Join
Re: Performance issue post upgrade on Version 13 - Incorrect Estimation Cost choosing Hash Aggregate-Nested Left Loop Join
On 3/22/22 13:57, Prajna Shetty wrote:
> ++ pgsql-performance@postgresql.org <https://www.postgresql.org/list/pgsql-performance/>
>
> Hello Team,
>
> There is change in query plan in 12.4 version and Version 13 resulting
> in performance slowness post upgrade.
>
> * In 12.4 version, Sort Operation Group Aggregate is selected which
>   results to Merge Join. Query takes ~5 seconds.
> * In 13.5 version, optimizer wrongly estimates and due to new Disk
>   Based Hash Aggregate feature, it prefers Hash Aggregate instead of
>   Sort Operation which finally blocks merge-join and chooses Nested
>   Loop Left Join. Query takes ~5 minutes.
>
> NOTE: Disabling Hash Aggregate on instance level forces optimizer to
> choose merge operation but such instance level modification is not
> possible in terms of Application Functionality.
>
> This performance issue is on all over most of queries. Attached one of
> the query and its plan in both version for reference in case that helps
> for recreating the issue.

It's impossible to comment on those other queries, but chances are the root cause is the same.

> Version 13 query plan has lower estimated cost than that of 12.4 which
> implies 13.5 planner thought it found a better plan, but it is running
> slower and actual cost show more.
>
> 12.4 Version:
> "Merge Right Join (cost=202198.78..295729.10 rows=1 width=8) (actual time=1399.727..5224.574 rows=296 loops=1)"
>
> 13.5 version:
> "Nested Loop Left Join (cost=196360.90..287890.45 rows=1 width=8) (actual time=3209.577..371300.693 rows=296 loops=1)"

This is not a costing issue. The problem is that we expect 1 row and calculate the cost for that, but then get 296, and unfortunately a nested loop degrades much faster than a merge join. I'm not sure why exactly 12.4 picked a merge join; chances are the costing formula changed a bit somewhere. But as I said, the problem is the bogus row cardinality estimates - 12.4 is simply lucky.
The problem most likely stems from this part:

  -> GroupAggregate (cost=0.43..85743.24 rows=1830 width=72) (actual time=1.621..3452.034 rows=282179 loops=3)
       Group Key: student_class_detail.aamc_id
       Filter: (max((student_class_detail.class_level_cd)::text) = '4'::text)
       Rows Removed by Filter: 76060
       -> Index Scan using uk_student_class_detail_aamcid_classlevelcd on student_class_detail (cost=0.43..74747.61 rows=1284079 width=6) (actual time=1.570..2723.014 rows=1272390 loops=3)
            Filter: (class_level_start_dt IS NOT NULL)
            Rows Removed by Filter: 160402

The filter is bound to be misestimated, and the error then snowballs. Try replacing this part with a temporary table (with pre-aggregated results) - you can run ANALYZE on it, etc. I'd bet that'll make the issue go away.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
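The pre-aggregation Tomas suggests could look roughly like the sketch below. The temp table name is illustrative, and the SELECT is reconstructed from the plan fragment above (group key aamc_id, the max(class_level_cd) = '4' filter, and the class_level_start_dt IS NOT NULL filter); the real query is in the attachment.

```sql
-- Materialize the misestimated GroupAggregate so the planner can work from
-- real statistics (282179 rows) instead of the 1830-row guess.
CREATE TEMPORARY TABLE tmp_final_year_students AS
SELECT aamc_id
FROM student_class_detail
WHERE class_level_start_dt IS NOT NULL
GROUP BY aamc_id
HAVING max(class_level_cd::text) = '4';

ANALYZE tmp_final_year_students;  -- give the planner an accurate row count

-- ...then join the main query against tmp_final_year_students in place of
-- the GroupAggregate subquery.
```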