Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables - Mailing list pgsql-hackers
From | Ashutosh Bapat |
---|---|
Subject | Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables |
Date | |
Msg-id | CAFjFpRfbndFZUqM8u64nCcz8fEsgyDXEpq9Ma3jGziyoBJ8Bkw@mail.gmail.com Whole thread Raw |
In response to | Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
|
List | pgsql-hackers |
On Wed, Mar 15, 2017 at 6:51 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Mar 14, 2017 at 8:33 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> Of course, that supposes that 0009 can manage to postpone creating >> non-sampled child joinrels until create_partition_join_plan(), which >> it currently doesn't. In fact, unless I'm missing something, 0009 >> hasn't been even slightly adapted to take advantage of the >> infrastructure in 0001; it doesn't seem to reset the path_cxt or >> anything. That seems like a fairly major omission. > > Some other comments on 0009: > > Documentation changes for the new GUCs are missing. Done. The description might need more massaging, but I will work on that once we have fixed their names and usage. I think sample_partition_fraction and partition_wise_plan_weight, if retained, will be applicable to other partition-wise planning like partition-wise aggregates. So we will need more generic description there. > > +between the partition keys of the joining tables. The equi-join between > +partition keys implies that for a given row in a given partition of a given > +partitioned table, its joining row, if exists, should exist only in the > +matching partition of the other partitioned table; no row from non-matching > > There could be more than one. I'd write: The equi-join between > partition keys implies that all join partners for a given row in one > partitioned table must be in the corresponding partition of the other > partitioned table. Done. I think it's important to emphasize the the joining partners can not be in other partitions. So, added that sentence after your suggested sentence. > > +#include "miscadmin.h" > #include <limits.h> > #include <math.h> > > Added in wrong place. Done. > > + * System attributes do not need > translation. In such a case, > + * the attribute numbers of the parent > and the child should > + * start from the same minimum attribute. > > I would delete the second sentence and add an Assert() to that effect instead. The assertion is there just few lines down. Please let me know if that suffices. Deleted the second sentence. > > + /* Pass top parent's relids down the inheritance hierarchy. */ > > Why? That is required for a multi-level partitioned table. top_parent_relids are used for translating expressions of the top parent to that of child table. > > + for (attno = rel->min_attr; attno <= > rel->max_attr; attno++) > > Add add a comment explaining why we need to do this. The comment is there just few lines above. I have moved it just above this for loop. > > - add_paths_to_append_rel(root, rel, live_childrels); > + add_paths_to_append_rel(root, rel, live_childrels, false); > } > > - > > No need to remove blank line. Sorry. That was added by my patch to refactor set_append_rel_pathlist(). I have added a patch in the series to remove that line. > > + * When called on partitioned join relation with partition_join_path = true, it > + * adds PartitionJoinPath instead of Merge/Append path. This path is costed > + * based on the costs of sampled child-join and is expanded later into > + * Merge/Append plan. > > I'm not a big fan of the Merge/Append terminology here. If somebody > adds another kind of append-path someday, then all of these comments > will have to be updated. I think this can be phrased more > generically. Reworded as + * When partition_join_path is true, the caller intends to add a + * PartitionJoinPath costed based on the sampled child-joins passed as + * live_childrels. Also added an assertion to make sure the partition_join_path is true only for join relations. > > /* > + * While creating PartitionJoinPath, we sample paths from only > a few child > + * relations. Even if all of sampled children have partial > paths, it's not > + * guaranteed that all the unsampled children will have partial paths. > + * Hence we do not create partial PartitionJoinPaths. > + */ > > Very sad. I guess if we had parallel append available, we could maybe > dodge this problem, but for now I suppose we're stuck with it. Really sad. Is there a way to look at the relation (without any partial paths yet) and see whether the relation will have partial paths or not. Even if we don't have actual partial paths but know that there will be at least one added in the future, we will be able to fix this problem. > > + /* > + * Partitioning scheme in join relation indicates a possibility that the > + * join may be partitioned, but it's not necessary that every pair of > + * joining relations can use partition-wise join technique. If one of > + * joining relations turns out to be unpartitioned, this pair of joining > + * relations can not use partition-wise join technique. > + */ > + if (!rel1->part_scheme || !rel2->part_scheme) > + return; > > How can this happen? If rel->part_scheme != NULL, doesn't that imply > that every rel covered by the joinrel is partitioned that way, and > therefore this condition must necessarily hold? I don't remember exactly, but this was added considering a more generic partition-wise join. But then we would have more changes when we support that. So, turned this into an assertion. > > In general, I think it's better style to write explicit tests against > NULL or NIL than to just write if (blahptr). PG code uses both the styles. Take for example src/backend/rewrite/rewriteManip.c or any file, both styles are being used. I find this style useful, when I want to code, say "if this relation does not have a partitioning scheme" rather than "if this relation have NULL partitioning scheme". Although I don't have objections changing it as per your suggestion. > > + partitioned_join->sjinfo = copyObject(parent_sjinfo); > > Why do we need to copy it? > sjinfo in make_join_rel() may be from root->join_info_list or it could be one made up locally in that function. The one made up in that function would go away with that function, whereas we need it much later to create paths for child-joins. So, I thought it's better to copy it. But now I have changed to code to pass NULL for a made-up sjinfo. In such a case, the child-join's sjinfo is also made up. This required some refactoring to separate out the making-up code. So, there's new refactoring patch. > + /* > + * Remove the relabel decoration. We can assume that there is > at most one > + * RelabelType node; eval_const_expressions() simplifies multiple > + * RelabelType nodes into one. > + */ > + if (IsA(expr, RelabelType)) > + expr = (Expr *) ((RelabelType *) expr)->arg; > > Still, instead of assuming this, you could just s/if/while/, and then > you wouldn't need the assumption any more. Also, consider castNode(). Done. > > partition_wise_plan_weight may be useful for testing, but I don't > think it should be present in the final patch. partition_join test needs it so that it can work with smaller dataset and complete faster. For smaller data sets the partition-wise join paths come out to be costlier than other kinds and are never chosen. By setting partition_wise_plan_weight I can force partition-wise join to be chosen. An alternate solution would be to use sample_partition_fraction = 1.0, but then we will never test delayed planning for unsampled child-joins. I also think that users will find partition_wise_plan_weight useful when estimates based on samples are unrealistic. Obviously, in a longer run we should be able to provide better estimates. Apart from this, I have also removed recursive calls to try_partition_wise_join() and generate_partition_wise_join_paths() from 0009 and places them in the 0014 patch. Those are required for multi-level partitioned tables, which are not supported in 0009. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
pgsql-hackers by date: