From: Yura Sokolov
Subject: Re: BUG #17335: Duplicate result rows in Gather node when "Suppress Append ... that have a single child" applied
Date:
Msg-id: 0f58e3496cfa13c024dee2c645cacdb621864c8a.camel@postgrespro.ru
In response to: Re: BUG #17335: Duplicate result rows in Gather node when "Suppress Append ... that have a single child" applied (Yura Sokolov <y.sokolov@postgrespro.ru>)
Responses: Re: BUG #17335: Duplicate result rows in Gather node when "Suppress Append ... that have a single child" applied
List: pgsql-bugs
On Fri, 17/12/2021 at 09:00 +0300, Yura Sokolov wrote:
> On Mon, 13/12/2021 at 14:45 +0000, PG Bug reporting form wrote:
> > The following bug has been logged on the website:
> >
> > Bug reference:      17335
> > Logged by:          Yura Sokolov
> > Email address:      y.sokolov@postgrespro.ru
> > PostgreSQL version: 14.1
> > Operating system:   any
> > Description:
> >
> > Our customer faced a query that returns duplicate rows when parallel
> > execution is enabled (max_parallel_workers_per_gather > 0) after an
> > 11->12 upgrade.
> >
> > Investigation shows the bug was introduced in
> > 8edd0e79460b414b1d971895312e549e95e12e4f
> > "Suppress Append and MergeAppend plan nodes that have a single child."
> > The bug is present in all versions since 12 beta, and it reproduces on
> > the master branch as well.
> >
> > Reproduction:
> > https://gist.github.com/funny-falcon/45fa9460507c015390db99a7efd18a75
> >
> > The dump contains definitions of two tables (with 3 and 2 fields), an
> > index, and a data dump large enough to trigger parallel plan
> > execution. An `analyze public.testi3, public.testi4` line is added as
> > well.
> > https://gist.github.com/funny-falcon/45fa9460507c015390db99a7efd18a75/raw/06a55e57bc6c460ea8800600b1b597c1ed216259/repro_dump.sql.zst
> >
> > Query:
> >
> > SELECT count(1)
> > FROM (
> >     SELECT fk, id
> >     FROM testi3
> >     WHERE flagged
> >   UNION ALL
> >     SELECT fk, id
> >     FROM testi3
> >     WHERE false
> > ) as t
> > LEFT OUTER JOIN testi4
> > USING (fk);
> >
> > With `set max_parallel_workers_per_gather = 0;` it returns 44, which
> > is correct.
> > With `set max_parallel_workers_per_gather = 2;` it returns 88, which
> > is wrong.
> >
> > Sometimes (for example, when you try to debug it) it returns 44 even
> > with max_parallel_workers_per_gather = 2. It seems to happen when a
> > worker returns the first row fast enough that the "Gather" node
> > doesn't start executing the plan by itself.
> >
> > This is confirmed by `set parallel_leader_participation = false`,
> > which forces the correct result. `set enable_parallel_append = false`
> > also forces the correct result.
> >
> > The original plan included a Gather Merge node and
> > max_parallel_workers_per_gather = 1. The reproduction code uses
> > Gather instead and max_parallel_workers_per_gather = 2 just because
> > it was easier to get the bad plan.
> >
> > Plan before change 8edd0e79460b414b1d971895312e549e95e12e4f (returns
> > the correct result):
> >
> > Finalize Aggregate  (cost=25003.16..25003.17 rows=1 width=8) (actual time=435.473..435.473 rows=1 loops=1)
> >   ->  Gather  (cost=25003.05..25003.16 rows=1 width=8) (actual time=434.646..456.889 rows=2 loops=1)
> >         Workers Planned: 1
> >         Workers Launched: 1
> >         ->  Partial Aggregate  (cost=24003.05..24003.06 rows=1 width=8) (actual time=433.000..433.000 rows=1 loops=2)
> >               ->  Parallel Hash Left Join  (cost=21701.47..24002.86 rows=74 width=0) (actual time=312.576..432.981 rows=22 loops=2)
> >                     Hash Cond: ("*SELECT* 1".fk = testi4.fk)
> >                     ->  Parallel Append  (cost=0.42..9.18 rows=24 width=4) (actual time=0.021..0.112 rows=16 loops=2)
> >                           ->  Subquery Scan on "*SELECT* 1"  (cost=0.42..9.11 rows=24 width=4) (actual time=0.041..0.216 rows=33 loops=1)
> >                                 ->  Index Scan using testi3_ix on testi3  (cost=0.42..8.87 rows=24 width=8) (actual time=0.039..0.207 rows=33 loops=1)
> >                                       Index Cond: (flagged = true)
> >                     ->  Parallel Hash  (cost=12084.91..12084.91 rows=586091 width=4) (actual time=305.473..305.473 rows=703309 loops=2)
> >                           Buckets: 131072  Batches: 32  Memory Usage: 2784kB
> >                           ->  Parallel Seq Scan on testi4  (cost=0.00..12084.91 rows=586091 width=4) (actual time=0.032..138.958 rows=703309 loops=2)
> >
> > Plan after that change (returns the wrong result):
> >
> > Finalize Aggregate  (cost=25003.87..25003.88 rows=1 width=8) (actual time=652.062..652.062 rows=1 loops=1)
> >   ->  Gather  (cost=25003.75..25003.86 rows=1 width=8) (actual time=651.482..668.718 rows=2 loops=1)
> >         Workers Planned: 1
> >         Workers Launched: 1
> >         ->  Partial Aggregate  (cost=24003.75..24003.76 rows=1 width=8) (actual time=649.150..649.150 rows=1 loops=2)
> >               ->  Parallel Hash Left Join  (cost=21701.47..24003.49 rows=106 width=0) (actual time=461.751..649.111 rows=44 loops=2)
> >                     Hash Cond: ("*SELECT* 1".fk = testi4.fk)
> >                     ->  Subquery Scan on "*SELECT* 1"  (cost=0.42..9.51 rows=36 width=4) (actual time=0.050..0.229 rows=33 loops=2)
> >                           ->  Index Scan using testi3_ix on testi3  (cost=0.42..9.15 rows=36 width=8) (actual time=0.048..0.214 rows=33 loops=2)
> >                                 Index Cond: (flagged = true)
> >                     ->  Parallel Hash  (cost=12084.91..12084.91 rows=586091 width=4) (actual time=444.931..444.931 rows=703309 loops=2)
> >                           Buckets: 131072  Batches: 32  Memory Usage: 2784kB
> >                           ->  Parallel Seq Scan on testi4  (cost=0.00..12084.91 rows=586091 width=4) (actual time=0.045..202.913 rows=703309 loops=2)
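
Condensed, the reproduction boils down to a short psql session like the
following (a sketch, assuming the dump from the gist above has already
been restored):

  -- serial execution: correct result
  SET max_parallel_workers_per_gather = 0;
  SELECT count(1)
  FROM (
      SELECT fk, id FROM testi3 WHERE flagged
      UNION ALL
      SELECT fk, id FROM testi3 WHERE false
  ) AS t
  LEFT OUTER JOIN testi4 USING (fk);
  -- count = 44

  -- parallel execution on affected versions: rows come back twice
  SET max_parallel_workers_per_gather = 2;
  -- (same query as above)
  -- count = 88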
>
> I found a simple fix:
>
> --- a/src/backend/optimizer/util/pathnode.c
> +++ b/src/backend/optimizer/util/pathnode.c
> @@ -1284,14 +1284,17 @@ create_append_path(PlannerInfo *root,
>  	if (list_length(pathnode->subpaths) == 1)
>  	{
>  		Path	   *child = (Path *) linitial(pathnode->subpaths);
>  
>  		pathnode->path.rows = child->rows;
>  		pathnode->path.startup_cost = child->startup_cost;
>  		pathnode->path.total_cost = child->total_cost;
>  		pathnode->path.pathkeys = child->pathkeys;
> +		pathnode->path.parallel_aware = child->parallel_aware;
> +		pathnode->path.parallel_safe = child->parallel_safe;
> +		pathnode->path.parallel_workers = child->parallel_workers;
>  	}
>  	else
>  	{
>  		pathnode->path.pathkeys = NIL;	/* unsorted if more than 1 subpath */
>  		cost_append(pathnode);
>  	}
>
> With this fix the Gather node deduces `Single Copy: true`, and the
> leader doesn't participate in query execution.
>
> It seems to be a correct change: if the Append node is going to be
> removed, it should take parallel awareness/safety from its single
> child.
>
> With this change, a plan with a Gather Merge node fails on an assert
> in cost_gather_merge:
>
> 	/*
> 	 * Add one to the number of workers to account for the leader.  This might
> 	 * be overgenerous since the leader will do less work than other workers
> 	 * in typical cases, but we'll go with it for now.
> 	 */
> 	Assert(path->num_workers > 0);
> 	N = (double) path->num_workers + 1;
> 	logN = LOG2(N);
>
> Commenting out this assert doesn't seem to break anything and doesn't
> seem to mess up the cost calculation (i.e. it remains logically
> correct).
>
> To preserve the assert I've changed it to
> `Assert(path->num_workers >= 0);`.
>
> With these changes the queries from the reproduction code return the
> correct 44 rows.
>
> # explain analyze SELECT id, val
> FROM (
>     SELECT fk, id
>     FROM testi3
>     WHERE flagged
>   UNION ALL
>     SELECT fk, id
>     FROM testi3
>     WHERE false
> ) as t
> LEFT OUTER JOIN testi4
> USING (fk);
>                                                                  QUERY PLAN
> --------------------------------------------------------------------------------------------------------------------------------------------
>  Gather  (cost=22701.47..25015.79 rows=130 width=8) (actual time=1268.071..1686.918 rows=44 loops=1)
>    Workers Planned: 1
>    Workers Launched: 1
>    Single Copy: true
>    ->  Parallel Hash Left Join  (cost=21701.47..24002.79 rows=130 width=8) (actual time=1242.936..1634.421 rows=44 loops=1)
>          Hash Cond: ("*SELECT* 1".fk = testi4.fk)
>          ->  Subquery Scan on "*SELECT* 1"  (cost=0.42..9.09 rows=24 width=8) (actual time=0.049..0.166 rows=33 loops=1)
>                ->  Index Scan using testi3_ix on testi3  (cost=0.42..8.85 rows=24 width=8) (actual time=0.047..0.151 rows=33 loops=1)
>                      Index Cond: (flagged = true)
>          ->  Parallel Hash  (cost=12084.91..12084.91 rows=586091 width=8) (actual time=1238.029..1238.029 rows=1406618 loops=1)
>                Buckets: 131072  Batches: 32  Memory Usage: 2784kB
>                ->  Parallel Seq Scan on testi4  (cost=0.00..12084.91 rows=586091 width=8) (actual time=0.070..540.652 rows=1406618 loops=1)
>
> Interestingly, the Gather Merge node doesn't run any workers:
>
> # explain analyze SELECT id, val
> FROM (
>     SELECT fk, id
>     FROM testi3
>     WHERE flagged
>   UNION ALL
>     SELECT fk, id
>     FROM testi3
>     WHERE false
> ) as t
> LEFT OUTER JOIN testi4
> USING (fk) ORDER BY val;
>                                                                     QUERY PLAN
> --------------------------------------------------------------------------------------------------------------------------------------------------
>  Gather Merge  (cost=25007.36..25007.68 rows=0 width=8) (actual time=909.975..909.984 rows=44 loops=1)
>    Workers Planned: 0
>    Workers Launched: 0
>    ->  Sort  (cost=24007.36..24007.68 rows=130 width=8) (actual time=909.948..909.952 rows=44 loops=1)
>          Sort Key: testi4.val
>          Sort Method: quicksort  Memory: 27kB
>          ->  Parallel Hash Left Join  (cost=21701.47..24002.79 rows=130 width=8) (actual time=757.099..909.779 rows=44 loops=1)
>                Hash Cond: ("*SELECT* 1".fk = testi4.fk)
>                ->  Subquery Scan on "*SELECT* 1"  (cost=0.42..9.09 rows=24 width=8) (actual time=0.224..0.576 rows=33 loops=1)
>                      ->  Index Scan using testi3_ix on testi3  (cost=0.42..8.85 rows=24 width=8) (actual time=0.202..0.543 rows=33 loops=1)
>                            Index Cond: (flagged = true)
>                ->  Parallel Hash  (cost=12084.91..12084.91 rows=586091 width=8) (actual time=755.983..755.983 rows=1406618 loops=1)
>                      Buckets: 131072  Batches: 32  Memory Usage: 2730kB
>                      ->  Parallel Seq Scan on testi4  (cost=0.00..12084.91 rows=586091 width=8) (actual time=0.052..330.983 rows=1406618 loops=1)
>
> So these changes could be taken as a quick valid fix.
>
> A further improvement is to remove the Append node earlier, so the
> planner can detect that there is no need for parallelism when the
> single Append child doesn't support it. I've attached these changes
> as a patch.
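
For anyone hitting this on unpatched 12+ builds, the observations above
also give a session-level workaround (a sketch; per the tests above,
either setting alone restores the correct result):

  -- keep the leader from executing the plan in addition to the worker:
  SET parallel_leader_participation = false;
  -- ...or disable Parallel Append plans altogether:
  SET enable_parallel_append = false;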
--------
regards,
Sokolov Yura
Postgres Professional
y.sokolov@postgrespro.ru
funny.falcon@gmail.com