Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables - Mailing list pgsql-hackers
From | Ashutosh Bapat |
---|---|
Subject | Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables |
Date | |
Msg-id | CAFjFpRfdv2HFv7yB9ER+Qc3pKLT1Wfi+mAoPf2z4E-s3DE+2eA@mail.gmail.com Whole thread Raw |
In response to | Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables (Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>) |
Responses |
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
|
List | pgsql-hackers |
Fixed a problem with the way qsort was being used in the earlier set of patches. Attached PFA the set of patches with that fixed. On Thu, Feb 9, 2017 at 4:20 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > Per your suggestion I have split the patch into many smaller patches. > > 0001-Refactor-set_append_rel_pathlist.patch > 0002-Refactor-make_join_rel.patch > 0003-Refactor-adjust_appendrel_attrs.patch > 0004-Refactor-build_join_rel.patch > 0005-Add-function-find_param_path_info.patch > > These four refactor existing code. > > 0006-Canonical-partition-scheme.patch > 0007-Partition-wise-join-tests.patch -- just tests, they fail > 0008-Partition-wise-join.patch -- actual patch implementing > partition-wise join, still some tests fail\ > > 0009-Adjust-join-related-to-code-to-accept-child-relation.patch > 0010-Parameterized-path-fixes.patch > 0011-Use-IS_JOIN_REL-instead-of-RELOPT_JOINREL.patch > > The last three patches change existing code to expect child(-join) > relations where they were not expected earlier. > > Each patch has summary of the changes. > > Partition-wise join for multi-level partitioned tables is not covered > by these patches. I will post those patches soon. > >> >>> >>> Other questions/comments: >>> >>> Why does find_partition_scheme need to copy the partition bound >>> information instead of just pointing to it? Amit went to some trouble >>> to make sure that this can't change under us while we hold a lock on >>> the relation, and we'd better hold a lock on the relation if we're >>> planning a query against it. >> >> PartitionScheme is shared across multiple relations, join or base, >> partitioned similarly. Obviously it can't and does not need to point >> partition bound informations (which should all be same) of all those >> base relations. O the the face of it, it looks weird that it points to >> only one of them, mostly the one which it encounters first. But, since >> it's going to be the same partition bound information, it doesn't >> matter which one. So, I think, we can point of any one of those. Do >> you agree? > > Instead of copying PartitionBoundInfo, used pointer of the first > encountered one. > >> >>> >>> I think the PartitionScheme stuff should live in the optimizer rather >>> that src/backend/catalog/partition.c. Maybe plancat.c? Perhaps we >>> eventually need a new file in the optimizer just for partitioning >>> stuff, but I'm not sure about that yet. >> >> I placed PartitionScheme stuff in partition.c because most of the >> functions and structures in partition.c are not visible outside that >> file. But I will try again to locate PartitionScheme to optimizer. > > Moved the code as per your suggestion. > >> >>> >>> The fact that set_append_rel_size needs to reopen the relation to >>> extract a few more bits of information is not desirable. You need to >>> fish this information through in some other way; for example, you >>> could have get_relation_info() stash the needed bits in the >>> RelOptInfo. >> >> I considered this option and discarded it, since not all partitioned >> relations will have OIDs for partitions e.g. partitioned joins will >> not have OIDs for their partitions. But now that I think of it, we >> should probably store those OIDs just for the base relation and leave >> them unused for non-base relations just like other base relation >> specific fields in RelOptInfo. > > Changed as per your suggestions. > >> >>> >>> + * For two partitioned tables with the same >>> partitioning scheme, it is >>> + * assumed that the Oids of matching partitions from >>> both the tables >>> + * are placed at the same position in the array of >>> partition oids in >>> >>> Rather than saying that we assume this, you should say why it has to >>> be true. (If it doesn't have to be true, we shouldn't assume it.) >> >> Will take care of this. > > Done. Please check. > >> >>> >>> + * join relations. Partition tables should have same >>> layout as the >>> + * parent table and hence should not need any >>> translation. But rest of >>> >>> The same attributes have to be present with the same types, but they >>> can be rearranged. This comment seems to imply the contrary. >> >> Hmm, will take care of this. > > Done. > >> >>> >>> FRACTION_PARTS_TO_PLAN seems like it should be a GUC. >> >> +1. Will take care of this. Does "representative_partitions_fraction" >> or "sample_partition_fraction" look like a good GUC name? Any other >> suggestions? > > used "sample_partition_fraction" for now. Suggestions are welcome. > >> >>> >>> + /* >>> + * Add this relation to the list of samples ordered by >>> the increasing >>> + * number of rows at appropriate place. >>> + */ >>> + foreach (lc, ordered_child_nos) >>> + { >>> + int child_no = lfirst_int(lc); >>> + RelOptInfo *other_childrel = rel->part_rels[child_no]; >>> + >>> + /* >>> + * Keep track of child with lowest number of >>> rows but higher than the >>> + * that of the child being inserted. Insert >>> the child before a >>> + * child with highest number of rows lesser than it. >>> + */ >>> + if (child_rel->rows <= other_childrel->rows) >>> + insert_after = lc; >>> + else >>> + break; >>> + } >>> >>> Can we use quicksort instead of a hand-coded insertion sort? >> >> I guess so, if I write comparison functions, which shouldn't be a >> problem. Will try that. > > Done. > >> >>> >>> + if (bms_num_members(outer_relids) > 1) >>> >>> Seems like bms_get_singleton_member could be used. > > That code is not required any more. > >>> >>> + * Partitioning scheme in join relation indicates a possibilty that the >>> >>> Spelling. > > Done. > >>> >>> There seems to be no reason for create_partition_plan to be separated >>> from create_plan_recurse. You can just add another case for the new >>> path type. > > Done. > >>> >>> Why does create_partition_join_path need to be separate from >>> create_partition_join_path_with_pathkeys? Couldn't that be combined >>> into a single function with a pathkeys argument that might sometimes >>> be NIL? I assume most of the logic is common. > > Combined those into a single function. > >>> >>> From a sort of theoretical standpoint, the biggest danger of this >>> patch seems to be that by deferring path creation until a later stage >>> than normal, we could miss some important processing. >>> subquery_planner() does a lot of stuff after >>> expand_inherited_tables(); if any of those things, especially the ones >>> that happen AFTER path generation, have an effect on the paths, then >>> this code needs to compensate for those changes somehow. It seems >>> like having the planning of unsampled children get deferred until >>> create_plan() time is awfully surprising; here we are creating the >>> plan and suddenly what used to be a straightforward path->plan >>> translation is running around doing major planning work. I can't >>> entirely justify it, but I somehow have a feeling that work ought to >>> be moved earlier. Not sure exactly where. > > Pasting my previous replies here to keep everything in one mail. > > I agree with this. Probably we should add a path tree mutator before > SS_identify_outer_params() to replace any Partition*Paths with > Merge/Append paths. The mutator will create paths for child-joins > within temporary memory context, copy the relevant paths and create > Merge/Append paths. There are two problems there 1. We have to write > code to copy paths; most of the paths would be flat copy but custom > scan paths might have some unexpected problems. 2. There will be many > surviving PartitionPaths, and all the corresponding child paths would > need copying and consume memory. In order to reduce that consumption, > we have run this mutator after set_cheapest() in subquery_planner(); > but then nothing interesting happens between that and create_plan(). > Expanding PartitionPaths during create_plan() does not need any path > copying and we expand only the PartitionPaths which will be converted > to plans. That does save a lot of memory; the reason why we defer > creating paths for child-joins. > >>> >>> This is not really a full review, mostly because I can't easily figure >>> out the motivation for all of the changes the patch makes. It makes a >>> lot of changes in a lot of places, and it's not really very easy to >>> understand why those changes are necessary. My comments above about >>> splitting the patch into a series of patches that can potentially be >>> reviewed and applied independently, with the main patch being the last >>> in the series, are a suggestion as to how to tackle that. There might >>> be some work that needs to or could be done on the comments, too. For >>> example, the patch splits out add_paths_to_append_rel from >>> set_append_rel_pathlist, but the comments don't say anything helpful >>> like "we need to do X after Y, because Z". They just say that we do >>> it. To some extent I think the comments in the optimizer have that >>> problem generally, so it's not entirely the fault of this patch; >>> still, the lack of those explanations makes the code reorganization >>> harder to follow, and might confuse future patch authors, too. >>> > > Specifically about add_paths_to_append_rel(), what do you expect the > comment to say? It would be obvious why we split that functionality > into a separate function: in fact, we don't necessarily explain why > certain code resides in a separate function in the comments. I think, > that particular comment (or for that matter other such comments in the > optimizer) can be removed altogether, since it just writes the > function names as an "English" sentence. I sometimes find those > comments useful, because I can read just those comments and forget > about the code, making comprehension easy. If highlighting is ON, your > brain habitually ignores the non-comment portions when required. I am > open to suggestions. > > > > -- > Best Wishes, > Ashutosh Bapat > EnterpriseDB Corporation > The Postgres Database Company -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
- 0001-Refactor-set_append_rel_pathlist.patch
- 0002-Refactor-make_join_rel.patch
- 0003-Refactor-adjust_appendrel_attrs.patch
- 0004-Refactor-build_join_rel.patch
- 0005-Add-function-find_param_path_info.patch
- 0006-Canonical-partition-scheme.patch
- 0007-Partition-wise-join-tests.patch
- 0008-Partition-wise-join.patch
- 0009-Adjust-join-related-to-code-to-accept-child-relation.patch
- 0010-Parameterized-path-fixes.patch
- 0011-Use-IS_JOIN_REL-instead-of-RELOPT_JOINREL.patch
pgsql-hackers by date: