BUG #18109: enable_partitionwise_aggregate does not apply when inserting or updating
The following bug has been logged on the website:

Bug reference: 18109
Logged by: Stephen Flavin
Email address: stephen_flav@hotmail.com
PostgreSQL version: 15.3
Operating system: macOS Ventura 13.4.1

Description:

I have a partitioned table which contains some raw data that I need to aggregate and upsert into another table (partitioned by the same key). The plan is to aggregate the raw table at a regular interval; when aggregation runs, I use a procedure to rename the existing raw table and create a new one, which stops writes to the old table and allows the apps to continue writing while the aggregation and upsert are running. The problem I have is that the procedure to run the aggregates takes too long and allows too much lag to build up, so the procedure gets slower and slower over time. I've been looking at `enable_partitionwise_aggregate`, which significantly increases the speed of the aggregation; however, it seems that when I combine it with the insert, the select that normally launches the parallel aggregations is forced to be sequential, even if I wrap the aggregation query in a materialised CTE.
Replication steps:

```
create unlogged table datapoints (id bigint, tag text, day date, count bigint) partition by hash (id);
create unlogged table aggregated_datapoints (like datapoints, primary key (id, tag, day)) partition by hash (id);

-- generate partitions
do $do$
declare
    i smallint;
begin
    for i in select generate_series(0, 15) loop
        execute $$
            create unlogged table datapoints_$$ || i || $$
                partition of datapoints for values with (modulus 16, remainder $$ || i || $$);
            create unlogged table aggregated_datapoints_$$ || i || $$
                partition of aggregated_datapoints for values with (modulus 16, remainder $$ || i || $$);
        $$;
    end loop;
end
$do$;

set work_mem = '128MB';
set force_parallel_mode = 'on';
set max_parallel_workers = 16; -- max_worker_processes = 16 (set in postgresql.conf and restarted)
set max_parallel_workers_per_gather = 16;
set parallel_leader_participation = 'off';
set enable_partitionwise_aggregate = 'on';

begin;
-- seed data
insert into datapoints
select (random()*(16*10))::int,
       ((random()*(16*10))::int)::text,
       to_timestamp((random()*16)::int)::date,
       (random()*100)::int
from generate_series(1, 1000000 * 16); -- 1M per partition
-- INSERT 0 16000000
-- Time: 7395.269 ms (00:07.395)

explain analyse
select id, tag, day, sum(count) from datapoints group by id, tag, day;
-- Gather  (cost=24216.48..172624.93 rows=1039570 width=76) (actual time=250.718..607.141 rows=25921 loops=1)
--   Workers Planned: 5
--   Workers Launched: 5
--   ->  Parallel Append  (cost=23216.48..67667.93 rows=207914 width=76) (actual time=258.961..571.139 rows=5184 loops=5)
-- Planning Time: 1.651 ms
-- Execution Time: 623.554 ms
rollback; -- to ensure there's no caching going on

begin;
-- seed data
insert into datapoints
select (random()*(16*10))::int,
       ((random()*(16*10))::int)::text,
       to_timestamp((random()*16)::int)::date,
       (random()*100)::int
from generate_series(1, 1000000 * 16); -- 1M per partition
-- INSERT 0 16000000
-- Time: 7395.269 ms (00:07.395)

explain analyse
insert into aggregated_datapoints
select id, tag, day, sum(count) from datapoints group by id, tag, day;
-- Insert on aggregated_datapoints  (cost=12093.01..314337.65 rows=0 width=0) (actual time=6973.353..6973.361 rows=0 loops=1)
--   ->  Subquery Scan on "*SELECT*"  (cost=12093.01..314337.65 rows=516734 width=52) (actual time=379.111..6939.722 rows=25921 loops=1)
--         ->  Append  (cost=12093.01..307878.48 rows=516734 width=76) (actual time=379.110..6937.932 rows=25921 loops=1)
-- Planning Time: 0.166 ms
-- Execution Time: 7001.725 ms
rollback;
```

Two side questions here:

1. I can't seem to get the parallel aggregation to use all 16 available workers; is there some additional config I need to bump how many workers would be planned?
2. The best scenario would be that the insert itself would be partitionwise, since both the raw and aggregated tables will have the same keys in each partition, but I'm not sure if that's possible without running the inserts in parallel manually on each partitioned table?
Re: BUG #18109: enable_partitionwise_aggregate does not apply when inserting or updating
On Thu, 14 Sept 2023 at 21:04, PG Bug reporting form <noreply@postgresql.org> wrote:
> I've been looking at `enable_partitionwise_aggregate` which significantly
> increases the speed of the aggregation however, it seems that when I combine
> the select that normally launches the parallel aggregations it's forced to
> be sequential even if I wrap the aggregation query in a materialised cte.

(it does not really seem like you're reporting a bug here, so the pgsql-general mailing list would have been a better place to ask these questions)

It's confusing what this complaint is about. In the subject, you've indicated that you can't get partitionwise aggregation working with INSERT, but what you've written above seems to be about the lack of parallel query for the INSERT operation. I'm assuming the subject line is wrong, as I see partition-wise aggregate on my local instance in the EXPLAIN output of your INSERT query.

Going by https://www.postgresql.org/docs/15/when-can-parallel-query-be-used.html the text there says:

"the planner will not generate them for a given query if any of the following are true:
The query writes any data or locks any database rows"

So that should answer why you're not seeing a parallel plan in your INSERT, as INSERTs write data.

> two side questions here:
> 1. I can't seem to get the parallel aggregation to use all 16 available
> workers, is there some additional config I need to bump how many workers
> would be planned?

Unfortunately not. You're seeing 5 workers because of the following code in add_paths_to_append_rel():

    parallel_workers = Max(parallel_workers,
                           pg_leftmost_one_pos32(list_length(live_childrels)) + 1);

list_length(live_childrels) is 16 (one for each of your partitions).

> 2. The best scenario would be that the insert itself would be partitionwise
> since both the raw and aggregated tables will have the same keys in each
> partition but I'm not sure if that's possible without running the inserts in
> parallel manually on each partitioned table?

Running 16 separate INSERT ... SELECT ... GROUP BY queries sounds like it would be the fastest way. We don't do parallel query for INSERTs yet.

David
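[Editor's note: the per-partition approach suggested above can be sketched against the schema in the replication steps. This is a hypothetical illustration, not from the thread: the partition names (`datapoints_0` .. `datapoints_15`, `aggregated_datapoints_0` ..) come from the repro's DO block, and the ON CONFLICT arithmetic assumes counts should accumulate across runs, per the upsert described in the report.]

```sql
-- One INSERT ... SELECT ... GROUP BY per matching partition pair; run each
-- from its own connection (e.g. 16 concurrent psql sessions) so the 16
-- statements execute concurrently. Shown here for remainder 0 only.
insert into aggregated_datapoints_0 (id, tag, day, count)
select id, tag, day, sum(count)
from datapoints_0
group by id, tag, day
on conflict (id, tag, day) do update
    set count = aggregated_datapoints_0.count + excluded.count;
-- ... repeat for remainders 1 through 15 ...
```

Because both tables are hash-partitioned on `id` with the same modulus, rows from `datapoints_N` can only ever land in `aggregated_datapoints_N`, so targeting the leaf partitions directly is safe here.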
Re: BUG #18109: enable_partitionwise_aggregate does not apply when inserting or updating
Thanks for the speedy response.
Apologies if this is the wrong mailing list; my complaint is purely that the insert with the select doesn't seem to launch parallel workers.

I missed https://www.postgresql.org/docs/15/when-can-parallel-query-be-used.html, so I guess this isn't a bug, but you say you're seeing parallel workers in the insert, which confuses me more:

> I'm assuming the subject line is wrong as I see partition-wise aggregate on
> my local instance in the EXPLAIN output of your INSERT query.

This is the `explain analyse` output I get on the insert ... select ... group by, with no mention of workers planned or parallel append:
```
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Insert on aggregated_datapoints  (cost=18382.88..341024.32 rows=0 width=0) (actual time=2678.225..2678.233 rows=0 loops=1)
   ->  Subquery Scan on "*SELECT*"  (cost=18382.88..341024.32 rows=1039584 width=52) (actual time=153.587..2650.315 rows=25921 loops=1)
         ->  Append  (cost=18382.88..328029.52 rows=1039584 width=76) (actual time=153.586..2648.681 rows=25921 loops=1)
               ->  HashAggregate  (cost=18382.88..19153.87 rows=61679 width=76) (actual time=153.585..154.107 rows=1610 loops=1)
                     Group Key: datapoints.id, datapoints.tag, datapoints.day
                     Batches: 1  Memory Usage: 3345kB
                     ->  Seq Scan on datapoints_0 datapoints  (cost=0.00..12214.94 rows=616794 width=52) (actual time=0.009..42.242 rows=949248 loops=1)
               ->  HashAggregate  (cost=19340.48..20151.63 rows=64892 width=76) (actual time=168.131..168.649 rows=1610 loops=1)
                     Group Key: datapoints_1.id, datapoints_1.tag, datapoints_1.day
                     Batches: 1  Memory Usage: 3345kB
                     ->  Seq Scan on datapoints_1  (cost=0.00..12851.24 rows=648924 width=52) (actual time=0.008..46.165 rows=998810 loops=1)
               ->  HashAggregate  (cost=29044.16..30262.30 rows=97451 width=76) (actual time=254.215..254.830 rows=2415 loops=1)
                     Group Key: datapoints_2.id, datapoints_2.tag, datapoints_2.day
                     Batches: 1  Memory Usage: 3601kB
                     ->  Seq Scan on datapoints_2  (cost=0.00..19299.08 rows=974508 width=52) (actual time=0.011..69.580 rows=1499940 loops=1)
               ->  HashAggregate  (cost=21295.20..22188.34 rows=71451 width=76) (actual time=186.481..187.031 rows=1771 loops=1)
                     Group Key: datapoints_3.id, datapoints_3.tag, datapoints_3.day
                     Batches: 1  Memory Usage: 3601kB
                     ->  Seq Scan on datapoints_3  (cost=0.00..14150.10 rows=714510 width=52) (actual time=0.005..52.619 rows=1099665 loops=1)
               ->  HashAggregate  (cost=17413.12..18143.44 rows=58426 width=76) (actual time=152.666..153.004 rows=1449 loops=1)
                     Group Key: datapoints_4.id, datapoints_4.tag, datapoints_4.day
                     Batches: 1  Memory Usage: 1809kB
                     ->  Seq Scan on datapoints_4  (cost=0.00..11570.56 rows=584256 width=52) (actual time=0.016..43.800 rows=899178 loops=1)
               ->  HashAggregate  (cost=11640.16..12128.36 rows=39056 width=76) (actual time=101.427..101.699 rows=966 loops=1)
                     Group Key: datapoints_5.id, datapoints_5.tag, datapoints_5.day
                     Batches: 1  Memory Usage: 1809kB
                     ->  Seq Scan on datapoints_5  (cost=0.00..7734.58 rows=390558 width=52) (actual time=0.010..29.490 rows=601138 loops=1)
               ->  HashAggregate  (cost=27074.24..28209.75 rows=90841 width=76) (actual time=244.688..245.294 rows=2254 loops=1)
                     Group Key: datapoints_6.id, datapoints_6.tag, datapoints_6.day
                     Batches: 1  Memory Usage: 3601kB
                     ->  Seq Scan on datapoints_6  (cost=0.00..17990.12 rows=908412 width=52) (actual time=0.008..73.015 rows=1398205 loops=1)
               ->  HashAggregate  (cost=15513.12..16163.76 rows=52051 width=76) (actual time=140.067..140.378 rows=1288 loops=1)
                     Group Key: datapoints_7.id, datapoints_7.tag, datapoints_7.day
                     Batches: 1  Memory Usage: 1809kB
                     ->  Seq Scan on datapoints_7  (cost=0.00..10308.06 rows=520506 width=52) (actual time=0.010..43.419 rows=801031 loops=1)
               ->  HashAggregate  (cost=31020.16..32321.17 rows=104081 width=76) (actual time=268.779..269.424 rows=2576 loops=1)
                     Group Key: datapoints_8.id, datapoints_8.tag, datapoints_8.day
                     Batches: 1  Memory Usage: 3601kB
                     ->  Seq Scan on datapoints_8  (cost=0.00..20612.08 rows=1040808 width=52) (actual time=0.009..74.481 rows=1601875 loops=1)
               ->  HashAggregate  (cost=25207.68..26264.90 rows=84578 width=76) (actual time=212.029..212.658 rows=2093 loops=1)
                     Group Key: datapoints_9.id, datapoints_9.tag, datapoints_9.day
                     Batches: 1  Memory Usage: 3601kB
                     ->  Seq Scan on datapoints_9  (cost=0.00..16749.84 rows=845784 width=52) (actual time=0.009..53.442 rows=1301782 loops=1)
               ->  HashAggregate  (cost=13561.44..14130.22 rows=45502 width=76) (actual time=110.546..110.840 rows=1127 loops=1)
                     Group Key: datapoints_10.id, datapoints_10.tag, datapoints_10.day
                     Batches: 1  Memory Usage: 1809kB
                     ->  Seq Scan on datapoints_10  (cost=0.00..9011.22 rows=455022 width=52) (actual time=0.010..27.684 rows=700231 loops=1)
               ->  HashAggregate  (cost=3879.04..4041.73 rows=13015 width=76) (actual time=30.914..31.001 rows=322 loops=1)
                     Group Key: datapoints_11.id, datapoints_11.tag, datapoints_11.day
                     Batches: 1  Memory Usage: 465kB
                     ->  Seq Scan on datapoints_11  (cost=0.00..2577.52 rows=130152 width=52) (actual time=0.009..7.793 rows=200218 loops=1)
               ->  HashAggregate  (cost=19358.72..20170.65 rows=64954 width=76) (actual time=154.465..154.963 rows=1610 loops=1)
                     Group Key: datapoints_12.id, datapoints_12.tag, datapoints_12.day
                     Batches: 1  Memory Usage: 3345kB
                     ->  Seq Scan on datapoints_12  (cost=0.00..12863.36 rows=649536 width=52) (actual time=0.007..38.759 rows=999663 loops=1)
               ->  HashAggregate  (cost=23234.72..24209.21 rows=77959 width=76) (actual time=185.336..185.886 rows=1932 loops=1)
                     Group Key: datapoints_13.id, datapoints_13.tag, datapoints_13.day
                     Batches: 1  Memory Usage: 3601kB
                     ->  Seq Scan on datapoints_13  (cost=0.00..15438.86 rows=779586 width=52) (actual time=0.008..46.450 rows=1199799 loops=1)
               ->  HashAggregate  (cost=20328.48..21181.07 rows=68207 width=76) (actual time=161.628..162.166 rows=1771 loops=1)
                     Group Key: datapoints_14.id, datapoints_14.tag, datapoints_14.day
                     Batches: 1  Memory Usage: 3601kB
                     ->  Seq Scan on datapoints_14  (cost=0.00..13507.74 rows=682074 width=52) (actual time=0.006..40.190 rows=1049859 loops=1)
               ->  HashAggregate  (cost=13543.20..14111.21 rows=45441 width=76) (actual time=115.563..115.853 rows=1127 loops=1)
                     Group Key: datapoints_15.id, datapoints_15.tag, datapoints_15.day
                     Batches: 1  Memory Usage: 1809kB
                     ->  Seq Scan on datapoints_15  (cost=0.00..8999.10 rows=454410 width=52) (actual time=0.007..31.461 rows=699358 loops=1)
Planning Time: 0.215 ms
Execution Time: 2687.251 ms
(69 rows)
Re: BUG #18109: enable_partitionwise_aggregate does not apply when inserting or updating
On Fri, 15 Sept 2023 at 01:57, Stephen Flavin <stephen_flav@hotmail.com> wrote:
> I missed https://www.postgresql.org/docs/15/when-can-parallel-query-be-used.html
> so I guess this isn't a bug but you say you're seeing
> parallel workers in the insert which confuses me more.

(sorry) Parallel workers are not the same thing as partition-wise aggregate. Partition-wise aggregate just means that the aggregation will be performed below the Append/MergeAppend node. In the query you've just shown, you can see that as the HashAggregate below (before) the Append.

Parallel workers are something that is used by parallel query. A parallel plan will have a Gather or Gather Merge node. I certainly didn't see one of those with the INSERT.

David
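[Editor's note: the distinction drawn above can be checked by toggling the two features independently against the repro schema. This is a sketch, not output from the thread; exact plan shapes depend on PostgreSQL version and statistics.]

```sql
-- Partition-wise aggregate with parallelism disabled: HashAggregate nodes
-- appear below the Append, but there is no Gather node and no workers.
set enable_partitionwise_aggregate = on;
set max_parallel_workers_per_gather = 0;
explain select id, tag, day, sum(count) from datapoints group by id, tag, day;

-- Parallel query with partition-wise aggregate disabled: the plan has a
-- Gather or Gather Merge node with "Workers Planned", and the finalize
-- aggregation step sits above it.
set enable_partitionwise_aggregate = off;
set max_parallel_workers_per_gather = 4;
explain select id, tag, day, sum(count) from datapoints group by id, tag, day;
```

The presence or absence of a Gather/Gather Merge node is the reliable tell for parallelism; where the aggregate node sits relative to the Append is the tell for partition-wise aggregation.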