Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables - Mailing list pgsql-hackers
From | Ashutosh Bapat |
---|---|
Subject | Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables |
Date | |
Msg-id | CAFjFpRcwBf0wMDSBVqh7pjxDogJNAgXC6Njw2raJSY8f=hx5BA@mail.gmail.com Whole thread Raw |
In response to | Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables (Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>) |
Responses |
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
|
List | pgsql-hackers |
On Tue, Dec 27, 2016 at 11:01 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > PFA patch rebased after partitioning code was committed. > > On Thu, Dec 1, 2016 at 4:32 PM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> Hi Robert, >> Sorry for delayed response. >> >> The attached patch implements following ideas: >> 1. At the time of creating paths - If the joining relations are both >> partitioned and join can use partition-wise join, we create paths for >> few child-joins. Similar to inheritance relations >> (set_append_rel_pathlist()), we collect paths with similar properties >> from all sampled child-joins and create one PartitionJoinPath with >> each set of paths. The cost of the PartitionJoinPath is obtained by >> multiplying the sum of costs of paths in the given set by the ratio of >> (number of rows estimated in the parent-join/sum of rows in >> child-joins). >> >> 2. If the PartitionJoinPath emerges as the best path, we create paths >> for each of the remaining child-joins. Then we collect paths with >> properties same as the given PartitionJoinPath, one from each >> child-join. These paths are converted into plans and a Merge/Append >> plan is created combing these plans. The paths and plans for >> child-join are created in a temporary memory context. The final plan >> for each child-join is copied into planner's context and the temporary >> memory context is reset. >> >> Right now, we choose 1% or 1 (whichever is higher) child-joins to base >> PartitionJoinPath costs on. >> >> Memory consumption >> ----------------------------- >> I tested a 5-way self-join for a table with 1000 partitions, each >> partition having 1M rows. The memory consumed in standard_planner() >> was measured with some granular tracking >> (mem_usage_func_wise_measurement_slabwise.patch). Partition-wise join >> consumed total of 289MB memory which is approx 6.6 times more than >> non-partition-wise join which consumed 44MB. That's much better than >> the earlier 16 times consumption for 5-way join with 100 partitions. >> >> The extra 245MB memory was consumed by child-join RelOptInfos (48MB), >> SpecialJoinInfos for child-joins (64MB), restrictlist translation >> (92MB), paths for sampled child-joins (1.5MB), building targetlists >> for child-joins (7MB). >> > > In the earlier implementation, a given clause which was applicable to > multiple join orders was getting translated as many times as the join > orders it was applicable in. I changed RestrictInfo for parent to > store a list of RestrictInfos applicable to children to avoid multiple > translations. > > My earlier patch created the child-join plans in a temporary context > and then copied them into planner context since the translated clauses > were allocated memory in temporary memory context then. Now that they > are stored in planner's context, we can directly create the plan in > the planner's context. > > Third, I added code to free up child SpecialJoinInfos after using those. > > As a result the total memory consumption now is 192MB, which is approx > 4.4 times the memory consumed during planning in case of > non-partition-wise join. > >> >> Choosing representative child-joins: >> -------------------------------------------------- >> There's another angle to choosing representative child joins. In a >> partitioned N-way join, different joins covering different subsets of >> N relations, will have different size distributions across the >> partitions. This means that the child-joins costed for (N-k) joins, >> may be different for those required for (N-k+1) joins. With a factor >> of 1% sampling, N is such that a child-join participates in 100 joins, >> we will end up creating paths for all partitions before creating >> PartitionJoinPaths for the final N-way join. Hopefully that will be a >> rare case and usually we will end up using paths already created. We >> can not avoid creating PartitionJoinPaths for subset joins, as there >> might be cases when partition-wise join will be optimal for an N-k way >> join but not for N-way join. We may avoid this if we choose >> representative child-joins based on their positions, in which case, we >> may end up with some or all of those being empty and thus skewing the >> costs heavily. >> >> Partial paths >> ----------------- >> AFAIU, we create partial paths for append relation, when all the >> children have partial paths. Unlike parameterized paths or path with >> pathkeys, there is no way to create a partial path for a normal path. >> This means that unless we create paths for all child-joins, we can not >> create partial paths for appendrel comprising of child-joins, and thus >> can not use parallel query right now. This may not be that bad, since >> it would be more efficient to run each child-join in a separate >> worker, rather than using multiple workers for a single child-join. > > This still applies. > >> >> regression tests >> ---------------------- >> I observed that for small relations (1000 rows in each partition and >> 100 partitions), the size estimates in append relations and sum of >> those in child relations are very different. As a result, the >> extrapolated costs for PartitionJoinPaths as described above, are way >> higher than costs of join of appends (or even append of joins if we >> are to create paths for all child-joins). Thus with this approach, we >> choose partition-wise join for large number of partitions with large >> data (e.g. 1000 partitions with 1M rows each). These are certainly the >> cases when partition-wise join is a big win. I have not tried to find >> out a threshold above which partition-wise join gets chosen with above >> approach, but it's going to be a larger threshold. That makes writing >> regression tests difficult, as those will require large data. So, we >> have to find a way so that we can test partition-wise join with >> smaller data. There are few possibilities like 1. convert the fraction >> of representative child-joins into GUC and setting it to 100% would >> start choosing partition-wise joins for tables with a few hundred rows >> per partition, like it did in earlier approach, 2. provide a way to >> force partition-wise join whenever possible, by say costing >> partition-wise joins much lesser than non-partition-wise join when a >> GUC is set (e.g. enable_partition_wise_join with values always, never, >> optimal or something like that). >> > > For now I have added a float GUC partition_wise_plan_weight. The > partition-wise join cost derived from the samples is multiplied by > this GUC and set as the cost of ParitionJoinPath. A value of 1 means > that the cost derived from the samples are used as is. A value higher > than 1 discourages use of partition-wise join and that lower than 1 > encourages use of partition-wise join. I am not very keen on keeping > this GUC, in this form. But we need some way to run regression with > smaller data. > > For now I have disabled partition-wise join for multi-level > partitions. I will post a patch soon with that enabled. PFA the patch (pg_dp_join_v6.patch) with some bugs fixed and rebased on the latest code. Also, PFA patch to support partition-wise join between multi-level partitioned tables. I copied the Amit Langote's patch for translating partition hierarchy into inheritance hierarchy and added code to support partition-wise join. You had expressed some concerns about Amit's approach in [1], but that discussion is still open. So, I haven't merged those changes to partition-wise join patch. We may continue to work on it as separate patch or I can include it in partition-wise join main patch. BTW, INSERT into multi-level partitioned tables is crashing with latest head. The issue was reported in [2]. Because of that multi_level_partition_join test crashes in pg_dp_join_v6.patch. Intestingly the crash vanishes when we apply patch supporting mult-level partition-wise join. [1] https://www.postgresql.org/message-id/CA%2BTgmoaEU10Kmdy44izcqJYLh1fkh58_6sbGGu0Q4b7PPE46eA%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAKcux6%3Dm1qyqB2k6cjniuMMrYXb75O-MB4qGQMu8zg-iGGLjDw%40mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
pgsql-hackers by date: