Thread: Performance Issue on Query 18 of TPC-H Benchmark

Performance Issue on Query 18 of TPC-H Benchmark

From
Ba Jinsheng
Date:
For this query 18 of TPC-H benchmark:
select c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice, sum(l_quantity)
from CUSTOMER, ORDERS, LINEITEM
where o_orderkey in (
        select l_orderkey
        from LINEITEM
        group by l_orderkey
        having sum(l_quantity) > 314
      )
  and c_custkey = o_custkey
  and o_orderkey = l_orderkey
group by c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice
order by o_totalprice desc, o_orderdate
limit 100;


Based on 1GB data (https://drive.google.com/file/d/1ZBLHanIRwxbaMQIhRUSPv4I7y8g_0AWi/view?usp=drive_link), its query plan (EXPLAIN ANALYZE) is as follows:
                                                                                                       QUERY PLAN                                                                                                        
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=666783.65..666783.90 rows=100 width=71) (actual time=2511.074..2517.146 rows=9 loops=1)
   ->  Sort  (cost=666783.65..668052.46 rows=507522 width=71) (actual time=2511.073..2517.144 rows=9 loops=1)
         Sort Key: orders.o_totalprice DESC, orders.o_orderdate
         Sort Method: quicksort  Memory: 25kB
         ->  Finalize GroupAggregate  (cost=583237.80..647386.53 rows=507522 width=71) (actual time=2511.057..2517.137 rows=9 loops=1)
               Group Key: customer.c_custkey, orders.o_orderkey
               ->  Gather Merge  (cost=583237.80..636813.14 rows=422936 width=71) (actual time=2511.053..2517.128 rows=9 loops=1)
                     Workers Planned: 2
                     Workers Launched: 2
                     ->  Partial GroupAggregate  (cost=582237.78..586995.81 rows=211468 width=71) (actual time=2507.789..2507.795 rows=3 loops=3)
                           Group Key: customer.c_custkey, orders.o_orderkey
                           ->  Sort  (cost=582237.78..582766.45 rows=211468 width=44) (actual time=2507.783..2507.786 rows=21 loops=3)
                                 Sort Key: customer.c_custkey, orders.o_orderkey
                                 Sort Method: quicksort  Memory: 26kB
                                 Worker 0:  Sort Method: quicksort  Memory: 26kB
                                 Worker 1:  Sort Method: quicksort  Memory: 25kB
                                 ->  Nested Loop  (cost=458569.18..557026.85 rows=211468 width=44) (actual time=2464.821..2507.757 rows=21 loops=3)
                                       ->  Parallel Hash Join  (cost=458568.75..492734.09 rows=52844 width=43) (actual time=2464.793..2507.699 rows=3 loops=3)
                                             Hash Cond: (orders.o_custkey = customer.c_custkey)
                                             ->  Hash Join  (cost=453562.50..487589.13 rows=52844 width=24) (actual time=2433.507..2476.356 rows=3 loops=3)
                                                   Hash Cond: (orders.o_orderkey = lineitem_1.l_orderkey)
                                                   ->  Parallel Seq Scan on orders  (cost=0.00..32386.00 rows=625000 width=20) (actual time=0.024..33.467 rows=500000 loops=3)
                                                   ->  Hash  (cost=451977.19..451977.19 rows=126825 width=4) (actual time=2412.836..2412.836 rows=9 loops=3)
                                                         Buckets: 131072  Batches: 1  Memory Usage: 1025kB
                                                         ->  GroupAggregate  (cost=0.43..451977.19 rows=126825 width=4) (actual time=708.272..2412.797 rows=9 loops=3)
                                                               Group Key: lineitem_1.l_orderkey
                                                               Filter: (sum(lineitem_1.l_quantity) > '314'::numeric)
                                                               Rows Removed by Filter: 1499991
                                                               ->  Index Scan using lineitem_pkey on lineitem lineitem_1  (cost=0.43..416256.96 rows=6002623 width=9) (actual time=0.052..1331.820 rows=6001215 loops=3)
                                             ->  Parallel Hash  (cost=4225.00..4225.00 rows=62500 width=23) (actual time=30.683..30.683 rows=50000 loops=3)
                                                   Buckets: 262144  Batches: 1  Memory Usage: 10304kB
                                                   ->  Parallel Seq Scan on customer  (cost=0.00..4225.00 rows=62500 width=23) (actual time=0.019..11.368 rows=50000 loops=3)
                                       ->  Index Scan using lineitem_pkey on lineitem  (cost=0.43..1.06 rows=16 width=9) (actual time=0.016..0.017 rows=7 loops=9)
                                             Index Cond: (l_orderkey = orders.o_orderkey)
 Planning Time: 0.833 ms
 Execution Time: 2517.189 ms
(36 rows)


However, I found that if we disable the path sorting here:
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 0c7273b9cc..91320f473a 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -6928,7 +6928,7 @@ make_ordered_path(PlannerInfo *root, RelOptInfo *rel, Path *path,
                                                                                        path->pathkeys,
                                                                                        &presorted_keys);
 
-       if (!is_sorted)
+       if (false)
        {
                /*
                 * Try at least sorting the cheapest path and also try incrementally

Both the execution time and the estimated cost are reduced by around 40%:
                                                                                         QUERY PLAN                                                                                        
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=390889.59..390889.84 rows=100 width=71) (actual time=1652.572..1655.298 rows=9 loops=1)
   ->  Sort  (cost=390889.59..392158.39 rows=507522 width=71) (actual time=1652.571..1655.296 rows=9 loops=1)
         Sort Key: orders.o_totalprice DESC, orders.o_orderdate
         Sort Method: quicksort  Memory: 25kB
         ->  Finalize GroupAggregate  (cost=215938.45..371492.46 rows=507522 width=71) (actual time=1651.864..1655.256 rows=9 loops=1)
               Group Key: customer.c_custkey, orders.o_orderkey
               ->  Gather  (cost=215938.45..360919.08 rows=422936 width=71) (actual time=1651.670..1655.245 rows=9 loops=1)
                     Workers Planned: 2
                     Workers Launched: 2
                     ->  Partial GroupAggregate  (cost=214938.45..317625.48 rows=211468 width=71) (actual time=1616.827..1648.580 rows=3 loops=3)
                           Group Key: customer.c_custkey, orders.o_orderkey
                           ->  Nested Loop  (cost=214938.45..313396.12 rows=211468 width=44) (actual time=1609.796..1648.571 rows=21 loops=3)
                                 ->  Parallel Hash Join  (cost=214938.02..249103.36 rows=52844 width=43) (actual time=1609.777..1648.532 rows=3 loops=3)
                                       Hash Cond: (orders.o_custkey = customer.c_custkey)
                                       ->  Hash Join  (cost=209931.77..243958.39 rows=52844 width=24) (actual time=1573.950..1612.634 rows=3 loops=3)
                                             Hash Cond: (orders.o_orderkey = lineitem_1.l_orderkey)
                                             ->  Parallel Seq Scan on orders  (cost=0.00..32386.00 rows=625000 width=20) (actual time=0.016..32.211 rows=500000 loops=3)
                                             ->  Hash  (cost=208346.45..208346.45 rows=126825 width=4) (actual time=1549.849..1549.849 rows=9 loops=3)
                                                   Buckets: 131072  Batches: 1  Memory Usage: 1025kB
                                                   ->  GroupAggregate  (cost=0.00..208346.45 rows=126825 width=4) (actual time=334.397..1549.837 rows=9 loops=3)
                                                         Group Key: lineitem_1.l_orderkey
                                                         Filter: (sum(lineitem_1.l_quantity) > '314'::numeric)
                                                         Rows Removed by Filter: 1500388
                                                         ->  Seq Scan on lineitem lineitem_1  (cost=0.00..172626.23 rows=6002623 width=9) (actual time=0.051..438.226 rows=6001215 loops=3)
                                       ->  Parallel Hash  (cost=4225.00..4225.00 rows=62500 width=23) (actual time=34.762..34.763 rows=50000 loops=3)
                                             Buckets: 262144  Batches: 1  Memory Usage: 10304kB
                                             ->  Parallel Seq Scan on customer  (cost=0.00..4225.00 rows=62500 width=23) (actual time=0.011..10.406 rows=50000 loops=3)
                                 ->  Index Scan using lineitem_pkey on lineitem  (cost=0.43..1.06 rows=16 width=9) (actual time=0.009..0.011 rows=7 loops=9)
                                       Index Cond: (l_orderkey = orders.o_orderkey)
 Planning Time: 1.644 ms
 Execution Time: 1655.490 ms
(31 rows)

The major difference between the two query plans is that the first one has an additional Sort node. I believe the second query plan is more efficient. Similar to my last report, perhaps we can optimize the code to enable it.
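(As a sketch of how to explore this without patching the planner: stock planner GUCs can discourage the extra sort. They are blunt instruments, so this may not reproduce exactly the plan above:)

SET enable_sort = off;               -- plan exploration only, never production
SET enable_incremental_sort = off;
EXPLAIN ANALYZE
select c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice, sum(l_quantity)
from CUSTOMER, ORDERS, LINEITEM
where o_orderkey in (select l_orderkey from LINEITEM group by l_orderkey having sum(l_quantity) > 314)
  and c_custkey = o_custkey and o_orderkey = l_orderkey
group by c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice
order by o_totalprice desc, o_orderdate
limit 100;
RESET enable_sort;
RESET enable_incremental_sort;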

I also tried 10 GB of data, where the execution time is reduced from 30 s to 16 s, but the estimated cost is increased. I can provide more info if you need.


Best regards,

Jinsheng Ba

 

Notice: This email is generated from the account of an NUS alumnus. Contents, views, and opinions therein are solely those of the sender.

Re: Performance Issue on Query 18 of TPC-H Benchmark

From
David Rowley
Date:
On Wed, 16 Oct 2024 at 07:29, Ba Jinsheng <bajinsheng@u.nus.edu> wrote:
> However, I found that if we disable the path sorting here:
> diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
> index 0c7273b9cc..91320f473a 100644
> --- a/src/backend/optimizer/plan/planner.c
> +++ b/src/backend/optimizer/plan/planner.c
> @@ -6928,7 +6928,7 @@ make_ordered_path(PlannerInfo *root, RelOptInfo *rel, Path *path,
>                                                                                         path->pathkeys,
>                                                                                         &presorted_keys);
>
> -       if (!is_sorted)
> +       if (false)
>         {
>                 /*
>                  * Try at least sorting the cheapest path and also try incrementally
>
> Both the execution time and the estimated cost are reduced by around 40%:

>                      ->  Partial GroupAggregate  (cost=214938.45..317625.48 rows=211468 width=71) (actual time=1616.827..1648.580 rows=3 loops=3)
>                            Group Key: customer.c_custkey, orders.o_orderkey
>                            ->  Nested Loop  (cost=214938.45..313396.12 rows=211468 width=44) (actual time=1609.796..1648.571 rows=21 loops=3)
>                                  ->  Parallel Hash Join  (cost=214938.02..249103.36 rows=52844 width=43) (actual time=1609.777..1648.532 rows=3 loops=3)
>                                        Hash Cond: (orders.o_custkey = customer.c_custkey)
>                                        ->  Hash Join  (cost=209931.77..243958.39 rows=52844 width=24) (actual time=1573.950..1612.634 rows=3 loops=3)
>                                              Hash Cond: (orders.o_orderkey = lineitem_1.l_orderkey)
>                                              ->  Parallel Seq Scan on orders  (cost=0.00..32386.00 rows=625000 width=20) (actual time=0.016..32.211 rows=500000 loops=3)

GroupAggregate needs sorted input, so what you've done isn't valid.
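(A minimal illustration of the point on toy data, not TPC-H:)

-- GroupAggregate emits a group whenever the key changes, so unsorted
-- input silently splits groups:
CREATE TABLE t (k int, q numeric);
INSERT INTO t VALUES (1, 10), (2, 5), (1, 10);
-- Sorted by k, the input (1,10),(1,10),(2,5) forms groups with sums 20 and 5,
-- so HAVING sum(q) > 15 keeps the k = 1 group.
-- In the physical order above, a group aggregate would see k = 1, 2, 1 as
-- three separate groups with sums 10, 5, 10, and HAVING sum(q) > 15 would
-- wrongly discard all of them.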

David



Re: Performance Issue on Query 18 of TPC-H Benchmark

From
Alena Rybakina
Date:

Hi!

On 15.10.2024 21:28, Ba Jinsheng wrote:
> Based on 1GB data (https://drive.google.com/file/d/1ZBLHanIRwxbaMQIhRUSPv4I7y8g_0AWi/view?usp=drive_link), its query plan (EXPLAIN ANALYZE) is as follows:
> [...]

Can you tell me more about the user and the name of your dump database? I can't log into the database.

$ my/inst/bin/psql -U postgres -d postgres
psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: FATAL:  role "postgres" does not exist

-- 
Regards,
Alena Rybakina
Postgres Professional

Re: Performance Issue on Query 18 of TPC-H Benchmark

From
Ba Jinsheng
Date:

>can you tell me more information about the user and the name of your dump database? I can't log into the database.
I used this connection string to log in:
"postgresql://ubuntu:ubuntu@127.0.0.1:5432/tpch"


Best regards,

Jinsheng Ba

 


Re: Performance Issue on Query 18 of TPC-H Benchmark

From
Andrei Lepikhov
Date:
On 10/16/24 01:28, Ba Jinsheng wrote:
> The major difference between the two query plans is that the first one has
> an additional Sort node. I believe the second query plan is more
> efficient. Similar to my last report, perhaps we can optimize the code to
> enable it.
I would like to know if you can improve that case by switching from the
sorted group to a hashed one.
I see a huge underestimation caused by the HAVING clause on an aggregate.
It would be interesting to correct the prediction and observe what
happens.
Can you reproduce the same query on SQL Server? It would highlight some
techniques Postgres has not adopted yet.

-- 
regards, Andrei Lepikhov




Re: Performance Issue on Query 18 of TPC-H Benchmark

From
Alena Rybakina
Date:

Hi!

On 16.10.2024 09:28, Andrei Lepikhov wrote:
> On 10/16/24 01:28, Ba Jinsheng wrote:
>> The major difference between the two query plans is that the first one has an additional Sort node. I believe the second query plan is more efficient. Similar to my last report, perhaps we can optimize the code to enable it.
> I would like to know if you can improve that case by switching from the sorted group to a hashed one.
> I see a huge underestimation caused by the HAVING clause on an aggregate. It would be interesting to correct the prediction and observe what happens.

Unfortunately my laptop is not so fast, so my execution time is 25822.899 ms.

The initial query plan on my laptop is the same:

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=668833.97..668834.22 rows=100 width=71) (actual time=25814.870..25822.461 rows=9 loops=1)
   ->  Sort  (cost=668833.97..670117.12 rows=513262 width=71) (actual time=25814.867..25822.454 rows=9 loops=1)
         Sort Key: orders.o_totalprice DESC, orders.o_orderdate
         Sort Method: quicksort  Memory: 25kB
         ->  Finalize GroupAggregate  (cost=584343.41..649217.47 rows=513262 width=71) (actual time=25814.736..25822.399 rows=9 loops=1)
               Group Key: customer.c_custkey, orders.o_orderkey
               ->  Gather Merge  (cost=584343.41..638524.51 rows=427718 width=71) (actual time=25814.716..25822.328 rows=9 loops=1)
                     Workers Planned: 2
                     Workers Launched: 2
                     ->  Partial GroupAggregate  (cost=583343.39..588155.22 rows=213859 width=71) (actual time=25792.566..25792.609 rows=3 loops=3)
                           Group Key: customer.c_custkey, orders.o_orderkey
                           ->  Sort  (cost=583343.39..583878.04 rows=213859 width=44) (actual time=25792.512..25792.527 rows=21 loops=3)
                                 Sort Key: customer.c_custkey, orders.o_orderkey
                                 Sort Method: quicksort  Memory: 25kB
                                 Worker 0:  Sort Method: quicksort  Memory: 26kB
                                 Worker 1:  Sort Method: quicksort  Memory: 26kB
                                 ->  Nested Loop  (cost=458633.58..557830.13 rows=213859 width=44) (actual time=25480.327..25792.187 rows=21 loops=3)
                                       ->  Parallel Hash Join  (cost=458633.15..492800.08 rows=53450 width=43) (actual time=25480.163..25791.800 rows=3 loops=3)
                                             Hash Cond: (orders.o_custkey = customer.c_custkey)
                                             ->  Hash Join  (cost=453626.90..487653.53 rows=53450 width=24) (actual time=25367.158..25677.611 rows=3 loops=3)
                                                   Hash Cond: (orders.o_orderkey = lineitem_1.l_orderkey)
                                                   ->  Parallel Seq Scan on orders  (cost=0.00..32386.00 rows=625000 width=20) (actual time=0.110..237.788 rows=500000 loops=3)
                                                   ->  Hash  (cost=452023.40..452023.40 rows=128280 width=4) (actual time=25157.711..25157.713 rows=9 loops=3)
                                                         Buckets: 131072  Batches: 1  Memory Usage: 1025kB
                                                         ->  GroupAggregate  (cost=0.43..452023.40 rows=128280 width=4) (actual time=5918.735..25157.610 rows=9 loops=3)
                                                               Group Key: lineitem_1.l_orderkey
                                                               Filter: (sum(lineitem_1.l_quantity) > '314'::numeric)
                                                               Rows Removed by Filter: 1499991
                                                               ->  Index Scan using lineitem_pkey on lineitem lineitem_1  (cost=0.43..416242.51 rows=6001659 width=9) (actual time=0.186..11092.476 rows=6001215 loops=3)
                                             ->  Parallel Hash  (cost=4225.00..4225.00 rows=62500 width=23) (actual time=110.867..110.868 rows=50000 loops=3)
                                                   Buckets: 262144  Batches: 1  Memory Usage: 10304kB
                                                   ->  Parallel Seq Scan on customer  (cost=0.00..4225.00 rows=62500 width=23) (actual time=0.071..49.232 rows=50000 loops=3)
                                       ->  Index Scan using lineitem_pkey on lineitem  (cost=0.43..1.06 rows=16 width=9) (actual time=0.096..0.109 rows=7 loops=9)
                                             Index Cond: (l_orderkey = orders.o_orderkey)
 Planning Time: 5.025 ms
 Execution Time: 25822.899 ms
(36 rows)

To get a better plan, I ran the query with the AQO [0] extension, and it cut the execution time in half. Unfortunately, AQO doesn't remember the cardinality of hash nodes, so, to be honest, I hard-coded the correct cardinality in the code.
I know this is the wrong way, but I haven't had enough time to fix it properly.

I also disabled parallelism, so the query plan looks like this:

set max_parallel_maintenance_workers = 0;
set max_parallel_workers = 0;
set max_parallel_workers_per_gather = 0;

                                                                                              QUERY PLAN                                                                                               
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=451107.39..451107.48 rows=36 width=71) (actual time=12582.114..12582.136 rows=9 loops=1)
   ->  Sort  (cost=451107.39..451107.48 rows=36 width=71) (actual time=12582.112..12582.118 rows=9 loops=1)
         Sort Key: orders.o_totalprice DESC, orders.o_orderdate
         Sort Method: quicksort  Memory: 25kB
         ->  GroupAggregate  (cost=451105.65..451106.46 rows=36 width=71) (actual time=12581.981..12582.082 rows=9 loops=1)
               Group Key: customer.c_custkey, orders.o_orderkey
               ->  Sort  (cost=451105.65..451105.74 rows=36 width=44) (actual time=12581.951..12581.969 rows=63 loops=1)
                     Sort Key: customer.c_custkey, orders.o_orderkey
                     Sort Method: quicksort  Memory: 28kB
                     ->  Nested Loop  (cost=1.71..451104.72 rows=36 width=44) (actual time=2704.881..12581.823 rows=63 loops=1)
                           ->  Nested Loop  (cost=1.28..451093.77 rows=9 width=43) (actual time=2704.826..12581.457 rows=9 loops=1)
                                 ->  Nested Loop  (cost=0.86..451089.74 rows=9 width=24) (actual time=2704.741..12581.024 rows=9 loops=1)
                                       ->  GroupAggregate  (cost=0.43..451013.73 rows=9 width=4) (actual time=2697.354..12522.126 rows=9 loops=1)
                                             Group Key: lineitem_1.l_orderkey
                                             Filter: (sum(lineitem_1.l_quantity) > '314'::numeric)
                                             Rows Removed by Filter: 1499991
                                             ->  Index Scan using lineitem_pkey on lineitem lineitem_1  (cost=0.43..416226.85 rows=6000615 width=9) (actual time=0.030..5515.665 rows=6001215 loops=1)
                                       ->  Index Scan using orders_pkey on orders  (cost=0.43..8.45 rows=1 width=20) (actual time=6.530..6.530 rows=1 loops=9)
                                             Index Cond: (o_orderkey = lineitem_1.l_orderkey)
                                 ->  Index Scan using customer_pkey on customer  (cost=0.42..0.45 rows=1 width=23) (actual time=0.039..0.039 rows=1 loops=9)
                                       Index Cond: (c_custkey = orders.o_custkey)
                           ->  Index Scan using lineitem_pkey on lineitem  (cost=0.43..1.06 rows=16 width=9) (actual time=0.026..0.033 rows=7 loops=9)
                                 Index Cond: (l_orderkey = orders.o_orderkey)
 Planning Time: 1.403 ms
 Execution Time: 12582.321 ms
(25 rows)


[0] https://github.com/postgrespro/aqo
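(For reference, a rough sketch of how AQO is wired in; the mode name is an assumption taken from the project's README, and details vary by AQO version:)

-- postgresql.conf (restart required): shared_preload_libraries = 'aqo'
CREATE EXTENSION aqo;
SET aqo.mode = 'learn';   -- assumed mode name; check the README of your AQO version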


I think it was more interesting when I turned off parallelism and built the query plan without AQO; the execution time there was also significantly reduced:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=722284.69..722284.94 rows=100 width=71) (actual time=16251.401..16251.413 rows=9 loops=1)
   ->  Sort  (cost=722284.69..723567.84 rows=513262 width=71) (actual time=16251.399..16251.408 rows=9 loops=1)
         Sort Key: orders.o_totalprice DESC, orders.o_orderdate
         Sort Method: quicksort  Memory: 25kB
         ->  GroupAggregate  (cost=512218.34..702668.18 rows=513262 width=71) (actual time=16227.765..16251.371 rows=9 loops=1)
               Group Key: customer.c_custkey, orders.o_orderkey
               ->  Incremental Sort  (cost=512218.34..692402.94 rows=513262 width=44) (actual time=16227.735..16251.241 rows=63 loops=1)
                     Sort Key: customer.c_custkey, orders.o_orderkey
                     Presorted Key: customer.c_custkey
                     Full-sort Groups: 2  Sort Method: quicksort  Average Memory: 27kB  Peak Memory: 27kB
                     ->  Nested Loop  (cost=512217.20..678432.66 rows=513262 width=44) (actual time=16094.663..16251.114 rows=63 loops=1)
                           ->  Merge Join  (cost=512216.77..522360.54 rows=128280 width=43) (actual time=16094.596..16250.531 rows=9 loops=1)
                                 Merge Cond: (customer.c_custkey = orders.o_custkey)
                                 ->  Index Scan using customer_pkey on customer  (cost=0.42..7524.45 rows=150000 width=23) (actual time=0.039..148.356 rows=147198 loops=1)
                                 ->  Materialize  (cost=512216.29..512857.69 rows=128280 width=24) (actual time=16068.493..16068.529 rows=9 loops=1)
                                       ->  Sort  (cost=512216.29..512536.99 rows=128280 width=24) (actual time=16068.478..16068.496 rows=9 loops=1)
                                             Sort Key: orders.o_custkey
                                             Sort Method: quicksort  Memory: 25kB
                                             ->  Hash Join  (cost=453626.90..498700.41 rows=128280 width=24) (actual time=15309.799..16068.390 rows=9 loops=1)
                                                   Hash Cond: (orders.o_orderkey = lineitem_1.l_orderkey)
                                                   ->  Seq Scan on orders  (cost=0.00..41136.00 rows=1500000 width=20) (actual time=0.011..439.488 rows=1500000 loops=1)
                                                   ->  Hash  (cost=452023.40..452023.40 rows=128280 width=4) (actual time=15104.560..15104.562 rows=9 loops=1)
                                                         Buckets: 131072  Batches: 1  Memory Usage: 1025kB
                                                         ->  GroupAggregate  (cost=0.43..452023.40 rows=128280 width=4) (actual time=3731.232..15104.495 rows=9 loops=1)
                                                               Group Key: lineitem_1.l_orderkey
                                                               Filter: (sum(lineitem_1.l_quantity) > '314'::numeric)
                                                               Rows Removed by Filter: 1499991
                                                               ->  Index Scan using lineitem_pkey on lineitem lineitem_1  (cost=0.43..416242.51 rows=6001659 width=9) (actual time=0.034..6936.304 rows=6001215 loops=1)
                           ->  Index Scan using lineitem_pkey on lineitem  (cost=0.43..1.06 rows=16 width=9) (actual time=0.045..0.053 rows=7 loops=9)
                                 Index Cond: (l_orderkey = orders.o_orderkey)
 Planning Time: 1.782 ms
 Execution Time: 16251.698 ms
(32 rows)

Looking at the code, I think the problem with cost estimation in the aggregation nodes is caused by incorrect selectivity estimation (the cost_agg function) and could be addressed with extended statistics. But to be honest, I'm still investigating this.
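(A sketch of what extended statistics could look like here; the statistics object name is illustrative, and whether this actually improves the number-of-groups estimate behind the HAVING clause is exactly what remains to be verified:)

-- hypothetical: give the planner multi-column ndistinct data on lineitem,
-- which feeds the rows estimate of the inner GroupAggregate
CREATE STATISTICS lineitem_groups (ndistinct) ON l_orderkey, l_suppkey FROM lineitem;
ANALYZE lineitem;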

-- 
Regards,
Alena Rybakina
Postgres Professional

Re: Performance Issue on Query 18 of TPC-H Benchmark

From
Ba Jinsheng
Date:
>I would like to know if you can improve that case by switching from the sorted group to a hashed one.

I used this patch to enable only the first HashAggregate in the query plan:
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 0c7273b9cc..b410452df1 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -6983,8 +6983,9 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
        bool            can_sort = (extra->flags & GROUPING_CAN_USE_SORT) != 0;
        List       *havingQual = (List *) extra->havingQual;
        AggClauseCosts *agg_final_costs = &extra->agg_final_costs;
-
-       if (can_sort)
+       static int call_count = 0;
+       call_count++;
+       if (can_sort && call_count != 2)
        {
                /*
                 * Use any available suitably-sorted path as input, and also consider

And got this query plan:
                                                                                                 QUERY PLAN                                                                                                  
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=693083.48..693083.73 rows=100 width=71) (actual time=2624.282..2629.392 rows=9 loops=1)
   ->  Sort  (cost=693083.48..694352.29 rows=507522 width=71) (actual time=2624.281..2629.390 rows=9 loops=1)
         Sort Key: orders.o_totalprice DESC, orders.o_orderdate
         Sort Method: quicksort  Memory: 25kB
         ->  HashAggregate  (cost=658421.05..673686.36 rows=507522 width=71) (actual time=2624.162..2629.346 rows=9 loops=1)
               Group Key: customer.c_custkey, orders.o_orderkey
               Planned Partitions: 32  Batches: 1  Memory Usage: 793kB
               ->  Gather  (cost=459569.18..608779.05 rows=507522 width=44) (actual time=2623.805..2629.229 rows=63 loops=1)
                     Workers Planned: 2
                     Workers Launched: 2
                     ->  Nested Loop  (cost=458569.18..557026.85 rows=211468 width=44) (actual time=2581.717..2620.494 rows=21 loops=3)
                           ->  Parallel Hash Join  (cost=458568.75..492734.09 rows=52844 width=43) (actual time=2581.704..2620.448 rows=3 loops=3)
                                 Hash Cond: (orders.o_custkey = customer.c_custkey)
                                 ->  Hash Join  (cost=453562.50..487589.13 rows=52844 width=24) (actual time=2541.024..2579.759 rows=3 loops=3)
                                       Hash Cond: (orders.o_orderkey = lineitem_1.l_orderkey)
                                       ->  Parallel Seq Scan on orders  (cost=0.00..32386.00 rows=625000 width=20) (actual time=0.028..32.135 rows=500000 loops=3)
                                       ->  Hash  (cost=451977.19..451977.19 rows=126825 width=4) (actual time=2515.787..2515.788 rows=9 loops=3)
                                             Buckets: 131072  Batches: 1  Memory Usage: 1025kB
                                             ->  GroupAggregate  (cost=0.43..451977.19 rows=126825 width=4) (actual time=608.052..2515.758 rows=9 loops=3)
                                                   Group Key: lineitem_1.l_orderkey
                                                   Filter: (sum(lineitem_1.l_quantity) > '314'::numeric)
                                                   Rows Removed by Filter: 1499991
                                                   ->  Index Scan using lineitem_pkey on lineitem lineitem_1  (cost=0.43..416256.96 rows=6002623 width=9) (actual time=0.043..1399.708 rows=6001215 loops=3)
                                 ->  Parallel Hash  (cost=4225.00..4225.00 rows=62500 width=23) (actual time=39.601..39.602 rows=50000 loops=3)
                                       Buckets: 262144  Batches: 1  Memory Usage: 10304kB
                                       ->  Parallel Seq Scan on customer  (cost=0.00..4225.00 rows=62500 width=23) (actual time=0.032..15.561 rows=50000 loops=3)
                           ->  Index Scan using lineitem_pkey on lineitem  (cost=0.43..1.06 rows=16 width=9) (actual time=0.012..0.014 rows=7 loops=9)
                                 Index Cond: (l_orderkey = orders.o_orderkey)
 Planning Time: 1.850 ms
 Execution Time: 2630.023 ms
(30 rows)

Compared to the query plan with GroupAggregate, both the estimated cost and the execution time are similar, with no significant difference.



> I think it was more interesting when I turned off parallelism and built the query plan without AQO; the execution time there was also significantly reduced:
Does turning off parallelism alone bring this significant performance improvement?

Best regards,

Jinsheng Ba

 


Re: Performance Issue on Query 18 of TPC-H Benchmark

From
Alena Rybakina
Date:


On 16.10.2024 17:56, Ba Jinsheng wrote:
> I used this patch to enable only the first HashAggregate in the query plan: [...]
> Compared to the query plan with GroupAggregate, both the estimated cost and the execution time are similar, with no significant difference.


To be honest, I don't quite understand what you are showing. I believe you are not allowing the optimizer to generate a different aggregation path (Group Aggregate) because it requires a sort operation. So I think this is not correct.

>> I think it was more interesting when I turned off parallelism and built the query plan without AQO; the execution time there was also significantly reduced:
> Does turning off parallelism alone bring this significant performance improvement?


You may notice that disabling parallelism results in improved cardinality estimation and therefore a better query plan, since the optimizer selects paths based on their cost. With parallelism disabled, the query plan becomes more accurate.
-- 
Regards,
Alena Rybakina
Postgres Professional

Re: Performance Issue on Query 18 of TPC-H Benchmark

From
Ba Jinsheng
Date:
>I believe you are not allowing the optimizer to generate a different aggregation path (Group Aggregate) because it requires a sort operation. So I think this is not correct.

Yes, this is what I did. I thought that was what you were asking? I have not found another way to enforce HashAggregate, so I directly modified the code. Can you elaborate on why it is incorrect?



>You may notice that disabling parallelism results in improved cardinality estimation and therefore a better query plan, since the optimizer selects paths based on their cost. If parallelism is disabled, query plan have become more correct.

Can I take it that disabling parallelism is a good setup for finding performance issues?


Best regards,

Jinsheng Ba


Re: Performance Issue on Query 18 of TPC-H Benchmark

From
David Rowley
Date:
On Thu, 17 Oct 2024 at 07:26, Ba Jinsheng <bajinsheng@u.nus.edu> wrote:
>
> >I believe you are not allowing the optimizer to generate a different aggregation path (Group Aggregate) because it requires a sort operation. So I think this is not correct.
>
> Yes, this is what I did. I thought that was what you were asking? I have not found another way to enforce HashAggregate, so I directly modified the code. Can you elaborate on why it is incorrect?

Like I mentioned earlier, GroupAggregate needs its input to be
guaranteed to be sorted by the group key. Above you've hacked the
planner to produce:

->  GroupAggregate  (cost=0.00..208346.45 rows=126825 width=4) (actual time=334.397..1549.837 rows=9 loops=3)
      Group Key: lineitem_1.l_orderkey
      Filter: (sum(lineitem_1.l_quantity) > '314'::numeric)
      Rows Removed by Filter: 1500388
      ->  Seq Scan on lineitem lineitem_1  (cost=0.00..172626.23 rows=6002623 width=9) (actual time=0.051..438.226 rows=6001215 loops=3)

This is simply not a valid plan.  GroupAggregate relies on the rows
arriving in Group Key order, as it checks whether the current row is in
the same group as the previous row. When it's not, that group is classed
as complete and the HAVING clause can be evaluated.  It's possible the
plan you've ended up with happens to produce the correct results
because the lineitem table is already in l_orderkey order.  Try
updating one of the earlier rows in a way that puts the heap out of
order and you'll likely notice the "Rows Removed by Filter" change. Or
try without the HAVING clause and observe the number of groups
changing.
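(A sketch of that experiment; the no-op UPDATE is illustrative, and works because the updated tuple version is written to a later spot in the heap:)

-- relocate the tuples of an early order, breaking the accidental
-- physical l_orderkey order of the table
UPDATE lineitem SET l_comment = l_comment WHERE l_orderkey = 1;

-- with the hacked planner, the unsorted Seq Scan now splits that group,
-- so "Rows Removed by Filter" (and possibly the result) changes
EXPLAIN ANALYZE
SELECT l_orderkey
FROM lineitem
GROUP BY l_orderkey
HAVING sum(l_quantity) > 314;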

I strongly suggest you experiment further before proposing changes in
this area. Also, this is not a valid discussion for the pgsql-bugs
mailing list.  This list is about reporting newly discovered bugs.
It's not a place to discuss proposing new ones.

David



Re: Performance Issue on Query 18 of TPC-H Benchmark

From
Andrei Lepikhov
Date:
On 10/17/24 01:26, Ba Jinsheng wrote:
>  >I believe you are not allowing the optimizer to generate a different 
> aggregation path (Group Aggregate) because it requires a sort operation. 
> So I think this is not correct.
> 
> Yes, this is what I did. I though it is what you were asking? I have not 
> found another way to enforce HashAggregate, so I directly modified the 
> code. Can you eliberate why it is incorrect?
As I see it, the main problem lies in the first aggregate (the one with
the HAVING clause), where hash aggregation seems preferable. To steer
the planner away from the sorted variant, you can use a trick:
SET enable_hashagg = 'on';
SET enable_sort = 'off';
SET work_mem = '1GB';

In my case the optimiser built this aggregation:

->  HashAggregate  (cost=202639.35..208346.45 rows=126825 width=4) (actual time=3540.761..4224.547 rows=9 loops=3)
      Group Key: lineitem_1.l_orderkey
      Filter: (sum(lineitem_1.l_quantity) > '314'::numeric)
      Batches: 1  Memory Usage: 638993kB
      Rows Removed by Filter: 1499991
      Worker 0:  Batches: 1  Memory Usage: 638993kB
      Worker 1:  Batches: 1  Memory Usage: 638993kB
      ->  Seq Scan on lineitem lineitem_1  (cost=0.00..172626.23 rows=6002623 width=9) (actual time=0.014..675.552 rows=6001215 loops=3)

Not sure it is the best plan possible in this case. The only thing I know
of is a re-optimisation feature, which can provide aggregates with a
correct number-of-groups estimate, as well as correct cardinality and
work_mem estimations for the optimiser, but it is still an enterprise
feature :(.

> Can I take it that disabling parallelism is a good setup for finding
> performance issues?
I usually use such a switch to identify problems. Parallel workers use
HashJoin more frequently, which sometimes causes non-linear behaviour of
execution time compared to a sequential plan. At the same time, they
don't support parameterised paths, and that may end up in interesting
performance jumps ...

BTW, I also wonder why you report to pgsql-bugs when the better-fitting
pgsql-performance list exists?

-- 
regards, Andrei Lepikhov




Re: Performance Issue on Query 18 of TPC-H Benchmark

From
Ba Jinsheng
Date:
Thanks a lot for your explanation!

> BTW, I also wonder why you report to pgsql-bugs when the better-fitting
> pgsql-performance list exists?

I see. I will make more attempts. If I find something interesting, I will report to pgsql-performance instead.


Best regards,

Jinsheng Ba

 
