Re: Declarative partitioning - Mailing list pgsql-hackers
From | Amit Langote |
---|---|
Subject | Re: Declarative partitioning |
Date | |
Msg-id | 563341AE.5010207@lab.ntt.co.jp Whole thread Raw |
In response to | Re: Declarative partitioning (Simon Riggs <simon@2ndquadrant.com>) |
Responses |
Re: Declarative partitioning
|
List | pgsql-hackers |
On 2015/08/26 23:01, Simon Riggs wrote: > The following review has been posted through the commitfest application: > ... > > Needs planner work and tests of that. ALTER TABLE etc can wait. > > The new status of this patch is: Waiting on Author Alright, after spending much time on this I came up with the attached. I'm sincerely sorry this took so long. I'll describe below what's in those patches. The DDL and catalogs part are not much different from what I had last described though I took a few steps to simplify things. I dropped the multi-level partitioning bit and changed range partition creation syntax a little. Ability to specify both min and max for every range partition had made life much difficult to construct and use the internal representation. Anyway, attached document patch (and regression tests to some degree) might give a picture of what user commands look like as of now, although things are very much subject to overhaul in this general area of user visible commands and features. As pointed out by Simon, there is a need to invent a better internal representation of partitioning constraints/metadata to address the planning considerations more effectively and scalably. To that end, I have tried to implement a number of things - There is a proposed set_partitioned_rel_size() that calls planner's partitioning module (proposed) which takes RelOptInfo of a partitioned table and returns a list of OIDs of partitions that are selected to be scanned based on the baserestrictinfo (other partitions are pruned!) The partitioned rel size is then computed using size estimates for selected partitions. Later, proposed set_partitioned_rel_pathlist() will create a Append/MergeAppend path with cheapest paths for individual partitions as its subpaths. It seems important to point out that the selected partitions each get simple_rte_array and a simple_rel_array slots (note that those arrays are required to be repalloc'd to make those slots available). The partition reloptinfos are of kind RELOPT_PARTITION_REL which are functionally quite similar to RELOPT_OTHER_MEMBER_REL but different in that they do not require using any auxiliary structures like AppendRelInfo and append_rel_list. The parent-partition mapping would still be necessary and is achieved using a field called parent_relid of partition reloptinfo. The proposed partitioning module (in the planner) figures out which of the clauses to use for partition pruning. The rules of matching clauses to partition key look similar to those used to match them to an index, at least with respect to the key representation. Key info includes attribute number, expression tree (for expression key columns), operator family OID, per key column. While clauses matched to an index would be returned as it is to make index paths with, things work differently for clauses matched to partition key column(s). For each matched clause, the Const operand and the operator would be added to operand list and operator list, respectively, for the matched column. Operands are later evaluated to datums. The datum arrays along with operator OIDs are passed along to another (general-purpose) partitioning module (proposed). That module returns a list of OIDs of selected partitions after comparing the clause operand datums with the partition bounds. The operator strategies (of passed operators) coupled with partitioning method (list or range, currently) determine how partitions are selected. For some partitioning methods, faster partition selection is enabled, for example, being able to use binary search helps make selecting range partitions reasonably fast. Note that the clauses matched for a given column are effectively ANDed with one another. At the moment, using OR clauses disables partition pruning. But it would be a matter of improving the structure used to pass around the operands and operators to also convey the logical structure of clauses. Also, even though range partitioning allows multi-column key, only the clauses referencing the first key are used to select partitions. Multi-column range partition pruning implementation is saved for later. Note that partition bounds are themselves stored as arrays of datums computed on-demand (cached in RelationData as part of a structure called PartitionDesc which stores other info about partitions like OIDs.) There are not many executor changes currently except for those made to ensure sane I/U/D behaviors (made to respective ExecI/U/D functions). We can certainly think about implementing specialized scan node for partitioned tables but I put it off for now. Attached user documentation and regression test patches, though not quite comprehensive, should help one get started. Thanks, Amit
Attachment
pgsql-hackers by date: