Thread: pgsql: Support partition pruning at execution time
Support partition pruning at execution time

Existing partition pruning is only able to work at plan time, for query quals that appear in the parsed query. This is good but limiting, as there can be parameters that appear later that can usefully be applied to further prune partitions.

This commit adds support for pruning subnodes of Append which cannot possibly contain any matching tuples, during execution, by evaluating Params to determine the minimum set of subnodes that can possibly match. We support more than just simple Params in WHERE clauses. Support additionally includes:

1. Parameterized Nested Loop Joins: The parameter from the outer side of the join can be used to determine the minimum set of inner side partitions to scan.

2. Initplans: Once an initplan has been executed we can then determine which partitions match the value from the initplan.

Partition pruning is performed in two ways. When Params external to the plan are found to match the partition key, we attempt to prune away unneeded Append subplans during the initialization of the executor. This allows us to bypass the initialization of non-matching subplans, meaning they won't appear in the EXPLAIN or EXPLAIN ANALYZE output.

For parameters whose values are only known during actual execution, the pruning of these subplans must wait until then. Subplans which are eliminated during this stage of pruning are still visible in the EXPLAIN output. To determine whether pruning has actually taken place, the EXPLAIN ANALYZE output must be inspected: if a certain Append subplan was never executed due to the elimination of its partition, the execution timing area will state "(never executed)". In the case of parameterized nested loops, on the other hand, the number of loops stated in the EXPLAIN ANALYZE output for certain subplans may simply appear lower than for others, because those subplans were scanned fewer times. This is because the list of matching subnodes must be re-evaluated whenever a parameter that was found to match the partition key changes.

This commit required some additional infrastructure that permits the building of a data structure which is able to translate the matching partition IDs, as returned by get_matching_partitions, into the list indexes of a subpaths list, as exists in node types such as Append, MergeAppend and ModifyTable. This allows us to translate a list of clauses into a Bitmapset of all the subpath indexes which must be included to satisfy the clause list.
Author: David Rowley, based on an earlier effort by Beena Emerson
Reviewers: Amit Langote, Robert Haas, Amul Sul, Rajkumar Raghuwanshi, Jesper Pedersen
Discussion: https://postgr.es/m/CAOG9ApE16ac-_VVZVvv0gePSgkg_BwYEV1NBqZFqDR2bBE0X0A@mail.gmail.com

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/499be013de65242235ebdde06adb08db887f0ea5

Modified Files
--------------
doc/src/sgml/perform.sgml                     |   12 +
src/backend/commands/explain.c                |   51 +-
src/backend/executor/execPartition.c          |  419 +++++++++
src/backend/executor/nodeAppend.c             |  268 ++++--
src/backend/nodes/copyfuncs.c                 |   21 +
src/backend/nodes/nodeFuncs.c                 |   28 +-
src/backend/nodes/outfuncs.c                  |   28 +
src/backend/nodes/readfuncs.c                 |   20 +
src/backend/optimizer/path/allpaths.c         |   12 +-
src/backend/optimizer/path/joinrels.c         |    2 +-
src/backend/optimizer/plan/createplan.c       |   45 +-
src/backend/optimizer/plan/planner.c          |    8 +-
src/backend/optimizer/prep/prepunion.c        |    6 +-
src/backend/optimizer/util/pathnode.c         |   26 +-
src/backend/partitioning/partprune.c          |  267 +++++-
src/include/executor/execPartition.h          |   77 ++
src/include/nodes/execnodes.h                 |   12 +-
src/include/nodes/nodes.h                     |    1 +
src/include/nodes/plannodes.h                 |    5 +
src/include/nodes/primnodes.h                 |   23 +
src/include/optimizer/pathnode.h              |    2 +-
src/include/partitioning/partprune.h          |   14 +
src/test/regress/expected/partition_prune.out | 1135 +++++++++++++++++++++++++
src/test/regress/sql/partition_prune.sql      |  344 ++++++++
24 files changed, 2714 insertions(+), 112 deletions(-)
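As a concrete illustration of the execution-time pruning described above, here is a minimal sketch (the table and values are hypothetical, not taken from the commit's regression tests):

-- A simple range-partitioned table.
create table measurements (ts_id int) partition by range (ts_id);
create table measurements_1 partition of measurements
  for values from (0) to (100);
create table measurements_2 partition of measurements
  for values from (100) to (200);
insert into measurements values (10), (150);

-- The uncorrelated subquery becomes an initplan, so its value is
-- unknown at plan time and plan-time pruning cannot apply.
-- Execution-time pruning skips the non-matching partition; the
-- EXPLAIN ANALYZE output should show the scan of measurements_1
-- as "(never executed)".
explain (analyze, costs off, timing off, summary off)
select * from measurements where ts_id = (select 150);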
On 8 April 2018 at 09:02, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Support partition pruning at execution time

I'm looking at buildfarm member lapwing's failure [1] now. Probably it can be fixed by adding a vacuum, but will need a few mins to test and produce a patch.

[1] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=lapwing&dt=2018-04-07%2021%3A20%3A01&stg=check

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Support partition pruning at execution time

Buildfarm member lapwing doesn't like this.  I can reproduce the failures here by setting force_parallel_mode = regress.  Kind of looks like instrumentation counts aren't getting propagated from workers back to the leader?

			regards, tom lane
On 8 April 2018 at 09:57, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
>> Support partition pruning at execution time
>
> Buildfarm member lapwing doesn't like this.  I can reproduce the
> failures here by setting force_parallel_mode = regress.  Kind
> of looks like instrumentation counts aren't getting propagated
> from workers back to the leader?

I'm looking at this now. I've tried adding vacuum (analyze) to the tables before the queries in order to have relallvisible set, so that the index only scan's "Heap Fetches" becomes stable, but very weirdly it still sometimes fetches from the heap after having vacuumed.

To help see what's going on while testing this I added:

select relname, relallvisible from pg_class
where relname like 'tprt%' and relkind = 'r';

just before the:

explain (analyze, costs off, summary off, timing off)
select * from tbl1 join tprt on tbl1.col1 > tprt.col1;

Sometimes I see:

 relname | relallvisible
---------+---------------
 tprt_1  |             0
 tprt_2  |             1

Other times I see:

 relname | relallvisible
---------+---------------
 tprt_1  |             0
 tprt_2  |             0

I thought maybe something might be holding a pin on a page somewhere and vacuum could be skipping it, so I added VERBOSE to the vacuum and I see:

Skipped 0 pages due to buffer pins, 0 frozen pages.

I'd considered just doing:

set enable_indexonlyscan = off;

for all these tests, but I don't have an explanation for this vacuum behaviour yet. I'll need to dig through the vacuum code that sets the visibility bit and see if there's some good reason for this. I have a local patch ready to go with the set enable_indexonlyscan = off;

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 8 April 2018 at 10:59, David Rowley <david.rowley@2ndquadrant.com> wrote:
> Sometimes I see:
>
>  relname | relallvisible
> ---------+---------------
>  tprt_1  |             0
>  tprt_2  |             1
>
> Other times I see:
>
>  relname | relallvisible
> ---------+---------------
>  tprt_1  |             0
>  tprt_2  |             0

The minimum set of commands I can find to recreate this are:

drop table if exists tprt;
create table tprt (col1 int) partition by range (col1);
create table tprt_1 partition of tprt for values from (1) to (5001);
create index tprt1_idx on tprt_1 (col1);
insert into tprt values (10), (20), (501), (502), (505), (1001), (4500);
vacuum tprt;
select relname, relallvisible from pg_class
where relname like 'tprt%' and relkind = 'r';

I get relallvisible = 0 once in maybe 20 or so attempts.

I didn't manage to get the same without a partitioned table.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 8 April 2018 at 11:26, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 8 April 2018 at 10:59, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> Sometimes I see:
>>
>>  relname | relallvisible
>> ---------+---------------
>>  tprt_1  |             0
>>  tprt_2  |             1
>>
>> Other times I see:
>>
>>  relname | relallvisible
>> ---------+---------------
>>  tprt_1  |             0
>>  tprt_2  |             0
>
> The minimum set of commands I can find to recreate this are:
>
> drop table if exists tprt;
> create table tprt (col1 int) partition by range (col1);
> create table tprt_1 partition of tprt for values from (1) to (5001);
> create index tprt1_idx on tprt_1 (col1);
> insert into tprt values (10), (20), (501), (502), (505), (1001), (4500);
> vacuum tprt;
> select relname, relallvisible from pg_class
> where relname like 'tprt%' and relkind = 'r';
>
> I get relallvisible = 0 once in maybe 20 or so attempts.
>
> I didn't manage to get the same without a partitioned table.

Anyway, this does not seem related to this patch, so there's no point in the buildfarm blaming it. There might be some reasonable explanation for this that I just can't think of now.

I've attached a patch which gets rid of the index only scans in the tests.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Yeah, I don't quite understand this problem, and I tend to agree that it likely isn't this patch's fault.  However, for the moment I'm going to avoid pushing the patch you propose, because maybe there's a bug elsewhere and it'd be good to understand it.  I'm looking at it now.

If others would prefer me to push David's patch (or do so themselves), I'm not dead set against that.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 8 April 2018 at 12:15, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Yeah, I don't quite understand this problem, and I tend to agree that
> it likely isn't this patch's fault.  However, for the moment I'm going
> to avoid pushing the patch you propose because maybe there's a bug
> elsewhere and it'd be good to understand it.  I'm looking at it now.
>
> If others would prefer me to push David's patch (or do so themselves),
> I'm not dead set against that.

I just wanted to share this:

#!/bin/bash

for i in {1..1000000}
do
    if [ $(psql --no-psqlrc -w -v ON_ERROR_STOP=0 -d postgres -q -A -F " " -t <<EOF
drop table if exists tprt;
create table tprt (col1 int);
create index tprt_idx on tprt (col1);
insert into tprt values (10), (20), (501), (502), (505), (1001), (4500);
vacuum tprt;
select relallvisible from pg_class where relname like 'tprt%' and relkind = 'r';
EOF
) = "0" ]; then
        echo "[$(date --iso-8601=seconds)]: 0"
    fi
done

If I run this I only get the wrong result from the visibility map in 60 second intervals. Check this output:

[2018-04-08T02:50:34+0000]: 0
[2018-04-08T02:50:34+0000]: 0
[2018-04-08T02:50:34+0000]: 0
[2018-04-08T02:50:34+0000]: 0
[2018-04-08T02:50:35+0000]: 0
[2018-04-08T02:50:35+0000]: 0
[2018-04-08T02:50:35+0000]: 0
[2018-04-08T02:50:35+0000]: 0
[2018-04-08T02:50:35+0000]: 0
[2018-04-08T02:50:35+0000]: 0
[2018-04-08T02:50:35+0000]: 0
[2018-04-08T02:50:35+0000]: 0
[2018-04-08T02:50:35+0000]: 0
[2018-04-08T02:51:35+0000]: 0
[2018-04-08T02:51:35+0000]: 0
[2018-04-08T02:51:35+0000]: 0
[2018-04-08T02:51:35+0000]: 0
[2018-04-08T02:51:35+0000]: 0
[2018-04-08T02:51:35+0000]: 0
[2018-04-08T02:51:35+0000]: 0
[2018-04-08T02:51:35+0000]: 0
[2018-04-08T02:51:35+0000]: 0
[2018-04-08T02:51:35+0000]: 0
[2018-04-08T02:51:35+0000]: 0
[2018-04-08T02:51:35+0000]: 0
[2018-04-08T02:52:35+0000]: 0
[2018-04-08T02:52:35+0000]: 0
[2018-04-08T02:52:35+0000]: 0
[2018-04-08T02:52:35+0000]: 0
[2018-04-08T02:52:35+0000]: 0
[2018-04-08T02:52:35+0000]: 0
[2018-04-08T02:52:35+0000]: 0
[2018-04-08T02:52:35+0000]: 0
[2018-04-08T02:52:35+0000]: 0
[2018-04-08T02:52:35+0000]: 0
[2018-04-08T02:52:35+0000]: 0

It happens 12 or 13 times on my machine, then does not happen again for 60 seconds, then happens again.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 8 April 2018 at 14:56, David Rowley <david.rowley@2ndquadrant.com> wrote:
> It happens 12 or 13 times on my machine, then does not happen again
> for 60 seconds, then happens again.

Setting autovacuum_naptime to 10 seconds makes it occur in 10 second intervals...

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
>>>>> "David" == David Rowley <david.rowley@2ndquadrant.com> writes: >> It happens 12 or 13 times on my machine, then does not happen again >> for 60 seconds, then happens again. David> Setting autovacuum_naptime to 10 seconds makes it occur in 10 David> second intervals... Analyze (including auto-analyze on a different table entirely) has a snapshot, which can hold back OldestXmin, hence preventing the all-visible flag from being set. -- Andrew (irc:RhodiumToad)
On 8 April 2018 at 15:02, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 8 April 2018 at 14:56, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> It happens 12 or 13 times on my machine, then does not happen again
>> for 60 seconds, then happens again.
>
> Setting autovacuum_naptime to 10 seconds makes it occur in 10 second
> intervals...

Ok, I thought it might have been some concurrent vacuum on the table, but the only tables I see being vacuumed are system tables. I tried performing a manual vacuum of each of these and could not get it to trigger, but then I did:

select * from pg_class;

from another session, and then the script starts spitting out some errors.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
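One way to watch for this kind of concurrent activity (a sketch, not a query used in the thread; it assumes PostgreSQL 10 or later, where pg_stat_activity carries a backend_type column) is:

-- Show any autovacuum workers and what they are running; auto-analyze
-- shows up here too, as "autovacuum: VACUUM ANALYZE ..." in query.
select pid, backend_type, query
from pg_stat_activity
where backend_type = 'autovacuum worker';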
On 8 April 2018 at 15:21, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
> David> Setting autovacuum_naptime to 10 seconds makes it occur in 10
> David> second intervals...
>
> Analyze (including auto-analyze on a different table entirely) has a
> snapshot, which can hold back OldestXmin, hence preventing the
> all-visible flag from being set.

urg, that's true. Seems like there are no bugs here then;

begin work;
set transaction isolation level repeatable read;
select * from pg_class; -- do nothing

makes the script go crazy. You're right, thanks.

I guess the patch I sent is the way forward with this.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
>>>>> "David" == David Rowley <david.rowley@2ndquadrant.com> writes: >> Setting autovacuum_naptime to 10 seconds makes it occur in 10 second >> intervals... David> Ok, I thought it might have been some concurrent vacuum on the David> table but the only tables I see being vacuumed are system David> tables. It's not vacuum that tends to be the problem, but analyze (on any table). Lazy-vacuum's snapshots are mostly ignored for computing global xmin horizons by other vacuums, but analyze's snapshots are not. David> I tried performing a manual vacuum of each of these and could David> not get it to trigger, but then I did: David> select * from pg_class; David> from another session and then the script starts spitting out David> some errors. Obviously, because the select holds a snapshot and therefore also holds back OldestXmin. You can't ever assume that data you just inserted will become all-visible just because you just vacuumed the table, unless you know that there is NO concurrent activity that might have a snapshot (and no other possible reason why OldestXmin might be older than your insert). -- Andrew (irc:RhodiumToad)
On 8 April 2018 at 15:34, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
> You can't ever assume that data you just inserted will become
> all-visible just because you just vacuumed the table, unless you know
> that there is NO concurrent activity that might have a snapshot (and no
> other possible reason why OldestXmin might be older than your insert).

Thanks. I got it. It just slipped my slightly paranoid and sleep-deprived mind.

I've attached my proposed fix for the unstable regression tests. I removed the vacuums I'd added in the last version and commented why we're doing set enable_indexonlyscan = off;

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
>>>>> "David" == David Rowley <david.rowley@2ndquadrant.com> writes: David> I've attached my proposed fix for the unstable regression tests. David> I removed the vacuums I'd added in the last version and David> commented why we're doing set enable_indesonlyscan = off; Looks basically sane - I'll try it out and commit it shortly. -- Andrew (irc:RhodiumToad)
Andrew Gierth wrote:
> >>>>> "David" == David Rowley <david.rowley@2ndquadrant.com> writes:
>
>  David> I've attached my proposed fix for the unstable regression tests.
>  David> I removed the vacuums I'd added in the last version and
>  David> commented why we're doing set enable_indexonlyscan = off;
>
> Looks basically sane - I'll try it out and commit it shortly.

Thanks for cleaning that up. I'll look into why the test (without this commit) fails with force_parallel_mode=regress next week.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>>>>> "Alvaro" == Alvaro Herrera <alvherre@alvh.no-ip.org> writes: Alvaro> Thanks for cleaning that up. I'll look into why the test Alvaro> (without this commit) fails with force_parallel_mode=regress Alvaro> next week. Seems clear enough to me - the "Heap Fetches" statistic is kept in the IndexOnlyScanState node in its own field, not part of ss.ps.instrument, and is therefore not reported from workers to leader. -- Andrew (irc:RhodiumToad)
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> "Alvaro" == Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Alvaro> Thanks for cleaning that up.  I'll look into why the test
> Alvaro> (without this commit) fails with force_parallel_mode=regress
> Alvaro> next week.
> Seems clear enough to me - the "Heap Fetches" statistic is kept in the
> IndexOnlyScanState node in its own field, not part of ss.ps.instrument,
> and is therefore not reported from workers to leader.

BTW, pademelon just exhibited a different instability in this test:

*** /home/bfarm/bf-data/HEAD/pgsql.build/src/test/regress/expected/partition_prune.out  Sun Apr  8 01:56:04 2018
--- /home/bfarm/bf-data/HEAD/pgsql.build/src/test/regress/results/partition_prune.out   Sun Apr  8 17:48:14 2018
***************
*** 1606,1612 ****
           ->  Partial Aggregate (actual rows=1 loops=3)
                 ->  Parallel Append (actual rows=0 loops=3)
                       Subplans Removed: 6
!                      ->  Parallel Seq Scan on ab_a2_b1 (actual rows=0 loops=1)
                             Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
                       ->  Parallel Seq Scan on ab_a2_b2 (actual rows=0 loops=1)
                             Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
--- 1606,1612 ----
           ->  Partial Aggregate (actual rows=1 loops=3)
                 ->  Parallel Append (actual rows=0 loops=3)
                       Subplans Removed: 6
!                      ->  Parallel Seq Scan on ab_a2_b1 (actual rows=0 loops=2)
                             Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
                       ->  Parallel Seq Scan on ab_a2_b2 (actual rows=0 loops=1)
                             Filter: ((a >= $1) AND (a <= $2) AND (b < 4))

======================================================================

Dunno quite what to make of that, but this animal previously passed at

commit b47a86f  Sun Apr  8 05:35:42 2018 UTC
Attempt to stabilize partition_prune test output.

so it's not a consistent failure.

			regards, tom lane
On 9 April 2018 at 09:54, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> BTW, pademelon just exhibited a different instability in this test:
>
> *** /home/bfarm/bf-data/HEAD/pgsql.build/src/test/regress/expected/partition_prune.out  Sun Apr  8 01:56:04 2018
> --- /home/bfarm/bf-data/HEAD/pgsql.build/src/test/regress/results/partition_prune.out   Sun Apr  8 17:48:14 2018
> ***************
> *** 1606,1612 ****
>            ->  Partial Aggregate (actual rows=1 loops=3)
>                  ->  Parallel Append (actual rows=0 loops=3)
>                        Subplans Removed: 6
> !                      ->  Parallel Seq Scan on ab_a2_b1 (actual rows=0 loops=1)
>                              Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
>                        ->  Parallel Seq Scan on ab_a2_b2 (actual rows=0 loops=1)
>                              Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
> --- 1606,1612 ----
>            ->  Partial Aggregate (actual rows=1 loops=3)
>                  ->  Parallel Append (actual rows=0 loops=3)
>                        Subplans Removed: 6
> !                      ->  Parallel Seq Scan on ab_a2_b1 (actual rows=0 loops=2)
>                              Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
>                        ->  Parallel Seq Scan on ab_a2_b2 (actual rows=0 loops=1)
>                              Filter: ((a >= $1) AND (a <= $2) AND (b < 4))

Reading the code, it looks like a bug in choose_next_subplan_for_worker(). The following needs to be changed for this patch:

/* Advance to next plan. */
pstate->pa_next_plan++;

I'll think a bit harder about the best way to fix it and submit a patch for it later.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 9 April 2018 at 13:03, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 9 April 2018 at 09:54, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> BTW, pademelon just exhibited a different instability in this test:
>>
>> *** /home/bfarm/bf-data/HEAD/pgsql.build/src/test/regress/expected/partition_prune.out  Sun Apr  8 01:56:04 2018
>> --- /home/bfarm/bf-data/HEAD/pgsql.build/src/test/regress/results/partition_prune.out   Sun Apr  8 17:48:14 2018
>> ***************
>> *** 1606,1612 ****
>>            ->  Partial Aggregate (actual rows=1 loops=3)
>>                  ->  Parallel Append (actual rows=0 loops=3)
>>                        Subplans Removed: 6
>> !                      ->  Parallel Seq Scan on ab_a2_b1 (actual rows=0 loops=1)
>>                              Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
>>                        ->  Parallel Seq Scan on ab_a2_b2 (actual rows=0 loops=1)
>>                              Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
>> --- 1606,1612 ----
>>            ->  Partial Aggregate (actual rows=1 loops=3)
>>                  ->  Parallel Append (actual rows=0 loops=3)
>>                        Subplans Removed: 6
>> !                      ->  Parallel Seq Scan on ab_a2_b1 (actual rows=0 loops=2)
>>                              Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
>>                        ->  Parallel Seq Scan on ab_a2_b2 (actual rows=0 loops=1)
>>                              Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
>
> Reading the code, it looks like a bug in choose_next_subplan_for_worker():
>
> The following needs to be changed for this patch:
>
> /* Advance to next plan. */
> pstate->pa_next_plan++;
>
> I'll think a bit harder about the best way to fix and submit a patch
> for it later.

Okay, I've written and attached a fix for this. I'm not 100% certain that this is the cause of the problem on pademelon, but the code does look wrong, so needs to be fixed. Hopefully, it'll make pademelon happy; if not, I'll think a bit harder about what might be causing that instability.

I've also attached a 2nd patch to fix a spelling mistake and a misleading comment in the code. The misleading comment claimed we unset the extern params so we didn't perform pruning again using these. I'd failed to update this comment after I realised that we still need to attempt to prune again with the external params, since quals against the partition key may actually contain a mix of exec and external params, which would mean that it's only possible to prune partitions using both types of params. No actual code needs to be updated since the 2nd pass of pruning uses "allparams" anyway.

We could actually get away without the bms_free() and set to NULL in the lines below the comment, but I wanted some way to ensure that we never write any code which calls the function twice on the same PartitionPruneState, but maybe I'm just overly paranoid there.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 9 April 2018 at 15:03, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 9 April 2018 at 13:03, David Rowley <david.rowley@2ndquadrant.com> wrote:
> Okay, I've written and attached a fix for this. I'm not 100% certain
> that this is the cause of the problem on pademelon, but the code does
> look wrong, so needs to be fixed. Hopefully, it'll make pademelon
> happy, if not I'll think a bit harder about what might be causing that
> instability.

Added to PG11 open items list [1].

[1] https://wiki.postgresql.org/wiki/PostgreSQL_11_Open_Items#Open_Issues

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
David Rowley wrote:

> Okay, I've written and attached a fix for this. I'm not 100% certain
> that this is the cause of the problem on pademelon, but the code does
> look wrong, so needs to be fixed. Hopefully, it'll make pademelon
> happy, if not I'll think a bit harder about what might be causing that
> instability.

Pushed it just now. Let's see what happens with pademelon now.

> The misleading comment claimed we unset the extern params so we didn't
> perform pruning again using these. I'd failed to update this comment
> after I realised that we still need to attempt to prune again with the
> external params since quals against the partition key may actually
> contain a mix of exec and external params, which would mean that it's
> only possible to prune partitions using both types of params. No
> actual code needs to be updated since the 2nd pass of pruning uses
> "allparams" anyway. We could actually get away without the bms_free()
> and set to NULL in the lines below the comment, but I wanted some way
> to ensure that we never write any code which calls the function twice
> on the same PartitionPruneState, but maybe I'm just overly paranoid
> there.

Pushed this earlier today, thanks.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> David Rowley wrote:
>> Okay, I've written and attached a fix for this. I'm not 100% certain
>> that this is the cause of the problem on pademelon, but the code does
>> look wrong, so needs to be fixed. Hopefully, it'll make pademelon
>> happy, if not I'll think a bit harder about what might be causing that
>> instability.
> Pushed it just now.  Let's see what happens with pademelon now.

I've had pademelon's host running a "make installcheck" loop all day trying to reproduce the problem.  I haven't gotten a bite yet (although at 15+ minutes per cycle, this isn't a huge number of tests).  I think we were remarkably (un)lucky to see the problem so quickly after the initial commit, and I'm afraid pademelon isn't going to help us prove much about whether this was the same issue.

This does remind me quite a bit though of the ongoing saga with the postgres_fdw test instability.  Given the frequency with which that's failing in the buildfarm, you would not think it's impossible to reproduce outside the buildfarm, and yet I'm here to tell you that it's pretty damn hard.  I haven't succeeded yet, and that's not for lack of trying.  Could there be something about the buildfarm environment that makes these sorts of things more likely?

			regards, tom lane
Andrew Gierth wrote:
> >>>>> "Alvaro" == Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
>
>  Alvaro> Thanks for cleaning that up.  I'll look into why the test
>  Alvaro> (without this commit) fails with force_parallel_mode=regress
>  Alvaro> next week.
>
> Seems clear enough to me - the "Heap Fetches" statistic is kept in the
> IndexOnlyScanState node in its own field, not part of ss.ps.instrument,
> and is therefore not reported from workers to leader.

Right, thanks for the pointer. So here's a patch that makes things behave as expected.

I noticed that instrument->nfiltered3 was available, so I used that to keep the counter. I wanted to print it using show_instrumentation_count (which has the nice property that you don't even have to test for es->analyze), but it doesn't work, because it divides the number by nloops, which is not what we want in this case. (It also doesn't print if the counter is zero, which maybe is desirable for the other counters but probably not for this one.)

I then noticed that support for nfiltered3 was incomplete; hence 0001. (I then noticed that nfiltered3 was added for MERGE. It looks wrong to me.)

Frankly, I don't like this. I would rather have an instrument->ntuples2 than these "divide this by nloops, sometimes" schizoid counters. This is already being misused by ON CONFLICT (see "other_path" in show_modifytable_info). But it seems like a correct fix would require more code.

Anyway, the partition_prune test works correctly now (after reverting AndrewSN's b47a86f5008f26) in both force_parallel_mode settings.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
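For context on the ON CONFLICT misuse mentioned above, a minimal sketch (the table is hypothetical) showing the counter in question in EXPLAIN ANALYZE output:

-- Set up a conflict: the value 1 already exists.
create table t (a int primary key);
insert into t values (1);

explain (analyze, costs off, timing off, summary off)
insert into t values (1), (2) on conflict (a) do nothing;
-- The ModifyTable node reports "Tuples Inserted: 1" and
-- "Conflicting Tuples: 1"; the latter is kept in one of the generic
-- "filtered" instrumentation counters, even though nothing is being
-- filtered, which is the misuse being discussed.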
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> I then noticed that support for nfiltered3 was incomplete; hence 0001.
> (I then noticed that nfiltered3 was added for MERGE. It looks wrong to
> me.)

In that case, it's likely to go away when Simon reverts MERGE. Suggest you hold off committing until he's done so, as he probably already has some conflicts to deal with and doesn't need another.

> Frankly, I don't like this. I would rather have an instrument->ntuples2
> than these "divide this by nloops, sometimes" schizoid counters.
> This is already being misused by ON CONFLICT (see "other_path" in
> show_modifytable_info). But it seems like a correct fix would require
> more code.

The question then becomes whether to put back nfiltered3, or to do something more to your liking. Think I'd vote for the latter.

			regards, tom lane
On 10 April 2018 at 09:58, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I then noticed that support for nfiltered3 was incomplete; hence 0001.
> (I then noticed that nfiltered3 was added for MERGE. It looks wrong to
> me.)
>
> Frankly, I don't like this. I would rather have an instrument->ntuples2
> than these "divide this by nloops, sometimes" schizoid counters.
> This is already being misused by ON CONFLICT (see "other_path" in
> show_modifytable_info). But it seems like a correct fix would require
> more code.

+1 for a new field for this, and for making ON CONFLICT use it. ntuples2 seems fine. If we make it too specific then we'll end up with lots more than we need.

I don't think re-using the filter counters is very good when it's not for filtering. MERGE was probably just following the example set by ON CONFLICT.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 10 April 2018 at 08:55, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
>> David Rowley wrote:
>>> Okay, I've written and attached a fix for this. I'm not 100% certain
>>> that this is the cause of the problem on pademelon, but the code does
>>> look wrong, so needs to be fixed. Hopefully, it'll make pademelon
>>> happy, if not I'll think a bit harder about what might be causing that
>>> instability.
>
>> Pushed it just now. Let's see what happens with pademelon now.
>
> I've had pademelon's host running a "make installcheck" loop all day
> trying to reproduce the problem.  I haven't gotten a bite yet (although
> at 15+ minutes per cycle, this isn't a huge number of tests).  I think
> we were remarkably (un)lucky to see the problem so quickly after the
> initial commit, and I'm afraid pademelon isn't going to help us prove
> much about whether this was the same issue.
>
> This does remind me quite a bit though of the ongoing saga with the
> postgres_fdw test instability.  Given the frequency with which that's
> failing in the buildfarm, you would not think it's impossible to
> reproduce outside the buildfarm, and yet I'm here to tell you that
> it's pretty damn hard.  I haven't succeeded yet, and that's not for
> lack of trying.  Could there be something about the buildfarm
> environment that makes these sorts of things more likely?

coypu just demonstrated that this was not the cause of the problem [1]

I'll study the code a bit more and see if I can think why this might be happening.

[1] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=coypu&dt=2018-04-11%2004%3A17%3A38&stg=install-check-C

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 11 April 2018 at 18:58, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 10 April 2018 at 08:55, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
>>> David Rowley wrote:
>>>> Okay, I've written and attached a fix for this. I'm not 100% certain
>>>> that this is the cause of the problem on pademelon, but the code does
>>>> look wrong, so needs to be fixed. Hopefully, it'll make pademelon
>>>> happy, if not I'll think a bit harder about what might be causing that
>>>> instability.
>>
>>> Pushed it just now. Let's see what happens with pademelon now.
>>
>> I've had pademelon's host running a "make installcheck" loop all day
>> trying to reproduce the problem.  I haven't gotten a bite yet (although
>> at 15+ minutes per cycle, this isn't a huge number of tests).  I think
>> we were remarkably (un)lucky to see the problem so quickly after the
>> initial commit, and I'm afraid pademelon isn't going to help us prove
>> much about whether this was the same issue.
>>
>> This does remind me quite a bit though of the ongoing saga with the
>> postgres_fdw test instability.  Given the frequency with which that's
>> failing in the buildfarm, you would not think it's impossible to
>> reproduce outside the buildfarm, and yet I'm here to tell you that
>> it's pretty damn hard.  I haven't succeeded yet, and that's not for
>> lack of trying.  Could there be something about the buildfarm
>> environment that makes these sorts of things more likely?
>
> coypu just demonstrated that this was not the cause of the problem [1]
>
> I'll study the code a bit more and see if I can think why this might
> be happening.
>
> [1] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=coypu&dt=2018-04-11%2004%3A17%3A38&stg=install-check-C

I've spent a bit of time tonight trying to dig into this problem to see if I can figure out what's going on. I ended up running the following script on both a Linux x86_64 machine and also a power8 machine.

#!/bin/bash

for x in {1..1000}
do
    echo "$x";
    for i in {1..1000}
    do
        psql -d postgres -f test.sql -o test.out
        diff -u test.out test.expect
    done
done

I was unable to recreate this problem after about 700k loops on the Linux machine and 130k loops on the power8.

I've emailed the owner of coypu to ask if it would be possible to get access to the machine, or have him run the script to see if it does actually fail. Currently waiting to hear back.

I've made another pass over the nodeAppend.c code and I'm unable to see what might cause this, although I did discover a bug where first_partial_plan is not set taking into account that some subplans may have been pruned away during executor init. The only thing I think this would cause is for parallel workers to not properly help out with some partial plans if some earlier subplans were pruned. I can see no reason for this to have caused this particular issue, since the first_partial_plan would be 0 with and without the attached fix.

Tom, would there be any chance you could run the above script for a while on pademelon to see if it can in fact reproduce the problem? coypu did show this problem in the install check, so I don't think it will need the other concurrent tests to fail. If you can recreate it, after adjusting the expected output, does the problem still exist in 5c0675215?

I also checked whether other tests perform an EXPLAIN ANALYZE of a plan with a Parallel Append, and I see there are none. So I've not ruled out that this is an existing bug. git grep "explain.*analyze" also does not show much outside of the partition_prune tests.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
David Rowley wrote:

> I've made another pass over the nodeAppend.c code and I'm unable to
> see what might cause this, although I did discover a bug where
> first_partial_plan is not set taking into account that some subplans
> may have been pruned away during executor init. The only thing I think
> this would cause is for parallel workers to not properly help out with
> some partial plans if some earlier subplans were pruned. I can see no
> reason for this to have caused this particular issue since the
> first_partial_plan would be 0 with and without the attached fix.

Pushed this.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 18 April 2018 at 07:26, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> David Rowley wrote:
>
>> I've made another pass over the nodeAppend.c code and I'm unable to
>> see what might cause this, although I did discover a bug where
>> first_partial_plan is not set taking into account that some subplans
>> may have been pruned away during executor init. The only thing I think
>> this would cause is for parallel workers to not properly help out with
>> some partial plans if some earlier subplans were pruned. I can see no
>> reason for this to have caused this particular issue since the
>> first_partial_plan would be 0 with and without the attached fix.
>
> Pushed this.

Thanks!

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services