Home > mailing lists
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables - Mailing list pgsql-hackers

From	Ashutosh Bapat
Subject	Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
Date	February 10, 2017 08:49:09
Msg-id	CAFjFpRfdv2HFv7yB9ER+Qc3pKLT1Wfi+mAoPf2z4E-s3DE+2eA@mail.gmail.com Whole thread Raw
In response to	Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables (Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>)
Responses	Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
List	pgsql-hackers
Tree view
Fixed a problem with the way qsort was being used in the earlier set
of patches. Attached PFA the set of patches with that fixed.

On Thu, Feb 9, 2017 at 4:20 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> Per your suggestion I have split the patch into many smaller patches.
>
> 0001-Refactor-set_append_rel_pathlist.patch
> 0002-Refactor-make_join_rel.patch
> 0003-Refactor-adjust_appendrel_attrs.patch
> 0004-Refactor-build_join_rel.patch
> 0005-Add-function-find_param_path_info.patch
>
> These four refactor existing code.
>
> 0006-Canonical-partition-scheme.patch
> 0007-Partition-wise-join-tests.patch -- just tests, they fail
> 0008-Partition-wise-join.patch -- actual patch implementing
> partition-wise join, still some tests fail\
>
> 0009-Adjust-join-related-to-code-to-accept-child-relation.patch
> 0010-Parameterized-path-fixes.patch
> 0011-Use-IS_JOIN_REL-instead-of-RELOPT_JOINREL.patch
>
> The last three patches change existing code to expect child(-join)
> relations where they were not expected earlier.
>
> Each patch has summary of the changes.
>
> Partition-wise join for multi-level partitioned tables is not covered
> by these patches. I will post those patches soon.
>
>>
>>>
>>> Other questions/comments:
>>>
>>> Why does find_partition_scheme need to copy the partition bound
>>> information instead of just pointing to it?  Amit went to some trouble
>>> to make sure that this can't change under us while we hold a lock on
>>> the relation, and we'd better hold a lock on the relation if we're
>>> planning a query against it.
>>
>> PartitionScheme is shared across multiple relations, join or base,
>> partitioned similarly. Obviously it can't and does not need to point
>> partition bound informations (which should all be same) of all those
>> base relations. O the the face of it, it looks weird that it points to
>> only one of them, mostly the one which it encounters first. But, since
>> it's going to be the same partition bound information, it doesn't
>> matter which one. So, I think, we can point of any one of those. Do
>> you agree?
>
> Instead of copying PartitionBoundInfo, used pointer of the first
> encountered one.
>
>>
>>>
>>> I think the PartitionScheme stuff should live in the optimizer rather
>>> that src/backend/catalog/partition.c.  Maybe plancat.c?  Perhaps we
>>> eventually need a new file in the optimizer just for partitioning
>>> stuff, but I'm not sure about that yet.
>>
>> I placed PartitionScheme stuff in partition.c because most of the
>> functions and structures in partition.c are not visible outside that
>> file. But I will try again to locate PartitionScheme to optimizer.
>
> Moved the code as per your suggestion.
>
>>
>>>
>>> The fact that set_append_rel_size needs to reopen the relation to
>>> extract a few more bits of information is not desirable.  You need to
>>> fish this information through in some other way; for example, you
>>> could have get_relation_info() stash the needed bits in the
>>> RelOptInfo.
>>
>> I considered this option and discarded it, since not all partitioned
>> relations will have OIDs for partitions e.g. partitioned joins will
>> not have OIDs for their partitions. But now that I think of it, we
>> should probably store those OIDs just for the base relation and leave
>> them unused for non-base relations just like other base relation
>> specific fields in RelOptInfo.
>
> Changed as per your suggestions.
>
>>
>>>
>>> +                * For two partitioned tables with the same
>>> partitioning scheme, it is
>>> +                * assumed that the Oids of matching partitions from
>>> both the tables
>>> +                * are placed at the same position in the array of
>>> partition oids in
>>>
>>> Rather than saying that we assume this, you should say why it has to
>>> be true.  (If it doesn't have to be true, we shouldn't assume it.)
>>
>> Will take care of this.
>
> Done. Please check.
>
>>
>>>
>>> +                * join relations. Partition tables should have same
>>> layout as the
>>> +                * parent table and hence should not need any
>>> translation. But rest of
>>>
>>> The same attributes have to be present with the same types, but they
>>> can be rearranged.  This comment seems to imply the contrary.
>>
>> Hmm, will take care of this.
>
> Done.
>
>>
>>>
>>> FRACTION_PARTS_TO_PLAN seems like it should be a GUC.
>>
>> +1. Will take care of this. Does "representative_partitions_fraction"
>> or "sample_partition_fraction" look like a good GUC name? Any other
>> suggestions?
>
> used "sample_partition_fraction" for now. Suggestions are welcome.
>
>>
>>>
>>> +               /*
>>> +                * Add this relation to the list of samples ordered by
>>> the increasing
>>> +                * number of rows at appropriate place.
>>> +                */
>>> +               foreach (lc, ordered_child_nos)
>>> +               {
>>> +                       int     child_no = lfirst_int(lc);
>>> +                       RelOptInfo *other_childrel = rel->part_rels[child_no];
>>> +
>>> +                       /*
>>> +                        * Keep track of child with lowest number of
>>> rows but higher than the
>>> +                        * that of the child being inserted. Insert
>>> the child before a
>>> +                        * child with highest number of rows lesser than it.
>>> +                        */
>>> +                       if (child_rel->rows <= other_childrel->rows)
>>> +                               insert_after = lc;
>>> +                       else
>>> +                               break;
>>> +               }
>>>
>>> Can we use quicksort instead of a hand-coded insertion sort?
>>
>> I guess so, if I write comparison functions, which shouldn't be a
>> problem. Will try that.
>
> Done.
>
>>
>>>
>>> +               if (bms_num_members(outer_relids) > 1)
>>>
>>> Seems like bms_get_singleton_member could be used.
>
> That code is not required any more.
>
>>>
>>> +        * Partitioning scheme in join relation indicates a possibilty that the
>>>
>>> Spelling.
>
> Done.
>
>>>
>>> There seems to be no reason for create_partition_plan to be separated
>>> from create_plan_recurse.  You can just add another case for the new
>>> path type.
>
> Done.
>
>>>
>>> Why does create_partition_join_path need to be separate from
>>> create_partition_join_path_with_pathkeys?  Couldn't that be combined
>>> into a single function with a pathkeys argument that might sometimes
>>> be NIL?  I assume most of the logic is common.
>
> Combined those into a single function.
>
>>>
>>> From a sort of theoretical standpoint, the biggest danger of this
>>> patch seems to be that by deferring path creation until a later stage
>>> than normal, we could miss some important processing.
>>> subquery_planner() does a lot of stuff after
>>> expand_inherited_tables(); if any of those things, especially the ones
>>> that happen AFTER path generation, have an effect on the paths, then
>>> this code needs to compensate for those changes somehow.  It seems
>>> like having the planning of unsampled children get deferred until
>>> create_plan() time is awfully surprising; here we are creating the
>>> plan and suddenly what used to be a straightforward path->plan
>>> translation is running around doing major planning work.  I can't
>>> entirely justify it, but I somehow have a feeling that work ought to
>>> be moved earlier.  Not sure exactly where.
>
> Pasting my previous replies here to keep everything in one mail.
>
> I agree with this. Probably we should add a path tree mutator before
> SS_identify_outer_params() to replace any Partition*Paths with
> Merge/Append paths. The mutator will create paths for child-joins
> within temporary memory context, copy the relevant paths and create
> Merge/Append paths. There are two problems there 1. We have to write
> code to copy paths; most of the paths would be flat copy but custom
> scan paths might have some unexpected problems. 2. There will be many
> surviving PartitionPaths, and all the corresponding child paths would
> need copying and consume memory. In order to reduce that consumption,
> we have run this mutator after set_cheapest() in subquery_planner();
> but then nothing interesting happens between that and create_plan().
> Expanding PartitionPaths during create_plan() does not need any path
> copying and we expand only the PartitionPaths which will be converted
> to plans. That does save a lot of memory; the reason why we defer
> creating paths for child-joins.
>
>>>
>>> This is not really a full review, mostly because I can't easily figure
>>> out the motivation for all of the changes the patch makes.  It makes a
>>> lot of changes in a lot of places, and it's not really very easy to
>>> understand why those changes are necessary.  My comments above about
>>> splitting the patch into a series of patches that can potentially be
>>> reviewed and applied independently, with the main patch being the last
>>> in the series, are a suggestion as to how to tackle that.  There might
>>> be some work that needs to or could be done on the comments, too.  For
>>> example, the patch splits out add_paths_to_append_rel from
>>> set_append_rel_pathlist, but the comments don't say anything helpful
>>> like "we need to do X after Y, because Z".  They just say that we do
>>> it.  To some extent I think the comments in the optimizer have that
>>> problem generally, so it's not entirely the fault of this patch;
>>> still, the lack of those explanations makes the code reorganization
>>> harder to follow, and might confuse future patch authors, too.
>>>
>
> Specifically about add_paths_to_append_rel(), what do you expect the
> comment to say? It would be obvious why we split that functionality
> into a separate function: in fact, we don't necessarily explain why
> certain code resides in a separate function in the comments. I think,
> that particular comment (or for that matter other such comments in the
> optimizer) can be removed altogether, since it just writes the
> function names as an "English" sentence. I sometimes find those
> comments useful, because I can read just those comments and forget
> about the code, making comprehension easy. If highlighting is ON, your
> brain habitually ignores the non-comment portions when required. I am
> open to suggestions.
>
>
>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Attachment

pgsql-hackers by date:
From: Michael Paquier
Date: 10 February 2017, 08:32:14
Subject: Re: [HACKERS] contrib modules and relkind check
From: Tom Lane
Date: 10 February 2017, 08:51:54
Subject: Re: [HACKERS] Removal of deprecated views pg_user, pg_group, pg_shadow
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables - Mailing list pgsql-hackers

Attachment

Previous

Next