Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables - Mailing list pgsql-hackers
From | Ashutosh Bapat |
---|---|
Subject | Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables |
Date | |
Msg-id | CAFjFpRfHLrgni-1+C14Nj1R96dje-rGNorgEs1qvGJtqTM6=vQ@mail.gmail.com Whole thread Raw |
In response to | Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
|
List | pgsql-hackers |
On Sat, Sep 16, 2017 at 2:53 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Sep 15, 2017 at 6:11 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> Thanks a lot Robert. >> >> Here are rebased patches. > > I didn't get quite as much time to look at these today as I would have > liked, but here's what I've got so far. > > Comments on 0001: > > - In the RelOptInfo, part_oids is defined in a completely different > part of the structure than nparts, but you can't use it without nparts > because you don't know how long it is. I suggest moving the > definition to just after nparts. > > - On the other hand, maybe we should just remove it completely. I > don't see it in any of the subsequent patches. If it's used by the > advanced matching code, let's leave it out of 0001 for now and add it > back after the basic feature is committed. No, it's not used by advanced partition matching code. It was used by to match OIDs with the child rels to order those in the array. But now that we are expanding in EIBO fashion, it is not useful. Should have removed it earlier. Removed now. > > - Similarly, partsupfunc isn't used by the later patches either. It > seems it could also be removed, at least for now. It's used by advanced partition matching code to compare bounds. It will be required by partition pruning patch. But removed for now. > > - The comment for partexprs isn't very clear about how the lists > inside the array work. My understanding is that the lists have as > many members as the partition key has columns/expressions. Actually we are doing some preparation for partition-wise join here. partexprs and nullable_partexprs are used in partition-wise join implementation patch. I have updated prologue of RelOptInfo structure with the comments like below * Note: A base relation will always have only one set of partition keys. But a* join relation has as many sets of partitionkeys as the number of joining* relations. The number of partition keys is given by* "part_scheme->partnatts". "partexprs"and "nullable_partexprs" are arrays* containing part_scheme->partnatts elements. Each element of the array* containsa list of partition key expressions. For a base relation each list* contains only one expression. For a join relationeach list contains at* most as many expressions as the joining relations. The expressions in a list* at a given positionin the array correspond to the partition key at that* position. "partexprs" contains partition keys of non-nullablejoining* relations and "nullable_partexprs" contains partition keys of nullable* joining relations. For a baserelation only "partexprs" is populated. Let me know this looks fine. The logic to match the partition keys of joining relations in have_partkey_equi_join() and match_expr_to_partition_keys() becomes simpler if we arrange the partition key expressions as array indexed by position of partition key and each array element as list of partition key expressions at that position. partition pruning might need partexprs look up relevant quals, but nullable_partexprs doesn't have any use there. So may be we should add nullable_partexpr to RelOptInfo as part of 0002 (partition-wise join implementation) instead of 0001. What do you think? > > - I'm not entirely sure whether maintaining partexprs and > nullable_partexprs is the right design. If I understand correctly, > whether or not a partexpr is nullable is really a per-RTI property, > not a per-expression property. You could consider something like > "Relids nullable_rels". That's true. However in order to decide whether an expression falls on nullable side of a join, we will need to call pull_varnos() on it and check the output against nullable_rels. Separating the expressions themselves avoids that step. > > Comments on 0002: > > - The relationship between deciding to set partition scheme and > related details and the configured value of enable_partition_wise_join > needs some consideration. If the only use of the partition scheme is > partition-wise join, there's no point in setting it even for a baserel > unless enable_partition_wise_join is set -- but if there are other > important uses for that data, such as Amit's partition pruning work, > then we might want to always set it. And similarly for a join: if the > details are only needed in the partition-wise join case, then we only > need to set them in that case, but if there are other uses, then it's > different. If it turns out that setting these details for a baserel > is useful in other cases but that it's only for a joinrel in the > partition-wise join case, then the patch gets it exactly right. But > is that correct? I'm not sure. Partition scheme contains the information about data types of partition keys, which is required to compare partition bounds. Partition pruning will need to compare constants with partition bounds and hence will need information contained in partition scheme. So, we will need to set it for base relations whether or not partition-wise join is enabled. > > - The naming of enable_partition_wise_join might also need some > thought. What happens when we also have partition-wise aggregate? > What about the proposal to strength-reduce MergeAppend to Append -- > would that use this infrastructure? I wonder if we out to call this > enable_partition_wise or enable_partition_wise_planning to make it a > bit more general. Then, too, I've never really liked having > partition_wise in the GUC name because it might make someone think > that it makes you partitions have a lot of wisdom. Removing the > underscore might help: partitionwise. Or maybe there is some whole > different name that would be better. If anyone wants to bikeshed, > now's the time. partitions having a lot of wisdom would be wise_partitions rather than partition_wise ;). If partition-wise join is disabled, partition-wise aggregates, strength reduction of MergeAppend won't be possible on a join tree, but those will be possible on a base relation. Even if partition-wise join enabled, one may want to disable other partition-wise optimizations individually. So, they are somewhat independent switches. I don't think we should bundle all of those into one. Whatever names we choose for those GUCs, I think they should have same naming convention e.g. "partition_wise_xyz". I am open to suggestions about the names. > > - It seems to me that build_joinrel_partition_info() could be > simplified a bit. One thing is that list_copy() is perfectly capable > of handling a NIL input, so there's no need to test for that before > calling it. partexprs may be NULL for FULL JOIN and nullable_partexprs may be NULL when there is no nullable relation. So, we have to check existence of those arrays before accessing lists containing partition key expressions. list_copy() is being called on individual array elements and "if" conditions check for the existence of array. The functions might have become complicated because I am using outer/inner_partexprs to hold one of the lists and partexprs contains the array of lists. We may use better named, but I don't have any better ideas right now. Will think about them. We could simplify that function according to your suggestion of nullable_relids. Basically partexprs then contains partition key expressions all relations nullable and non-nullable. nullable_relids + pull_varnos() tells us which of those fall on nullable side and which ones don't. Is this how you are thinking of simplifying it? If we go with this scheme, again nullable_relids will not be useful for partition pruning, so may be we should add it as part of 0002 (partition-wise join implementation) instead of 0001. > > Comments on 0003: > > - Instead of reorganizing add_paths_to_append_rel as you did, could > you just add an RTE_JOIN case to the switch? Not sure if there's some > problem with that idea, but it seems like it might come out nicer. RTE_JOIN is created only for joins specified using JOIN clause i.e syntactic joins. The joins created during query planner like rel1, rel2, rel3 do not have RTE_JOIN. So, we can't use RTE_JOIN there. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
pgsql-hackers by date: