initial pruning in parallel append - Mailing list pgsql-hackers
From | Amit Langote |
---|---|
Subject | initial pruning in parallel append |
Date | |
Msg-id | CA+HiwqFA=swkzgGK8AmXUNFtLeEXFJwFyY3E7cTxvL46aa1OTw@mail.gmail.com |
Responses | Re: initial pruning in parallel append |
List | pgsql-hackers |
Hi,

In an off-list chat, Robert suggested that it might be a good idea to look more closely into $subject, especially in the context of the project of moving the locking of child tables / partitions to the ExecInitNode() phase when executing cached generic plans [1].

Robert's point is that a worker's output of initial pruning, which consists of the set of child subplans (of a parallel-aware Append or MergeAppend) it considers valid for execution, may not be the same as the leader's or that of other workers. If that does indeed happen, it may confuse the Append's parallel-execution code, possibly even cause crashes, because the ParallelAppendState set up by the leader assumes a certain number and identity (?) of valid-for-execution subplans. So he suggests that initial pruning should only be done once, in the leader, and the result of that put in the EState for ExecInitParallelPlan() to serialize and pass down to workers. Workers would simply consume that as-is to set the valid-for-execution child subplans in their copies of AppendState, instead of doing the initial pruning again.

Actually, earlier patches at [1] had implemented that mechanism (remembering the result of initial pruning and using it at a later time and place), because the earlier design there was to move the initial pruning on the nodes of a cached generic plan tree from ExecInitNode() to GetCachedPlan(). The result of initial pruning done in the latter would be passed down to and consumed in the former using what were called PartitionPruneResult nodes.

Maybe that stuff could be resurrected, though I was wondering if the risk of the same initial pruning steps returning different results when performed repeatedly in *one query lifetime* isn't pretty minimal, or maybe rather non-existent? I think that's because performing initial pruning steps entails computing constant and/or stable expressions and comparing them with an unchanging set of partition bound values, using comparison functions whose results are also presumed to be stable. Then there's also the step of mapping the partition indexes as they appear in the PartitionDesc to the indexes of their subplans under Append/MergeAppend using the information contained in PartitionPruneInfo (subplan_map), and the result of that mapping should be immutable too.

I considered that the comparison functions that match_clause_to_partition_key() obtains by calling get_opfamily_proc() may in fact not be stable, though that doesn't seem to be a worry, at least with the out-of-the-box pg_amproc collection:

select amproc, p.provolatile from pg_amproc, pg_proc p where amproc = p.oid and p.provolatile <> 'i';
           amproc           | provolatile
---------------------------+-------------
 date_cmp_timestamptz      | s
 timestamp_cmp_timestamptz | s
 timestamptz_cmp_date      | s
 timestamptz_cmp_timestamp | s
 pg_catalog.in_range       | s
(5 rows)

Is it possible for a user to add a volatile procedure to pg_amproc? If that is possible, match_clause_to_partition_key() may pick one as a comparison function for pruning, because it doesn't actually check the procedure's provolatile before doing so. I'd hope not, though I would like to be sure in order to support what I wrote above.

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

[1] https://commitfest.postgresql.org/43/3478/
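To illustrate the kind of plan being discussed, here is a minimal, unverified sketch (table and statement names are made up, and the exact plan shape will depend on costing and settings) that should exercise initial pruning of a generic plan at executor startup, hopefully under a Gather / Parallel Append:

create table pt (a int, b text) partition by range (a);
create table pt1 partition of pt for values from (0) to (1000);
create table pt2 partition of pt for values from (1000) to (2000);
create table pt3 partition of pt for values from (2000) to (3000);
insert into pt select i, i::text from generate_series(0, 2999) i;
analyze pt;

-- force a generic plan and make parallel plans cheap (illustrative settings)
set plan_cache_mode = force_generic_plan;
set parallel_setup_cost = 0;
set parallel_tuple_cost = 0;
set min_parallel_table_scan_size = 0;

prepare q (int) as select count(*) from pt where a < $1;

-- the generic plan's initial pruning uses the parameter value at executor
-- startup; pruned children show up as "Subplans Removed" in EXPLAIN
explain (costs off) execute q (500);

The "Subplans Removed" line in the EXPLAIN output reflects the executor-startup pruning whose per-process consistency (leader vs. workers) is the subject of this thread.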