Re: [HACKERS] Partition-wise aggregation/grouping - Mailing list pgsql-hackers
From | Ashutosh Bapat
---|---
Subject | Re: [HACKERS] Partition-wise aggregation/grouping
Date |
Msg-id | CAFjFpRfyPVKD0ZqTYF8Y-pySu_M0wp_Vn2+NCUwHPs6AT_k2Jw@mail.gmail.com
In response to | Re: [HACKERS] Partition-wise aggregation/grouping (Jeevan Chalke <jeevan.chalke@enterprisedb.com>)
Responses | Re: [HACKERS] Partition-wise aggregation/grouping
Responses | Re: [HACKERS] Partition-wise aggregation/grouping
List | pgsql-hackers
On Wed, Nov 15, 2017 at 5:31 PM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
>
> OK. Done in the attached patch set.
>
> I have rebased all my patches on latest HEAD which is at
> 7518049980be1d90264addab003476ae105f70d4
>
> Thanks

These are review comments for the last set and I think most of them apply
to the new set as well.

Patches 0001 - 0005 refactor existing code. I haven't reviewed them in
detail to check whether we have missed anything in moving the code, but
they mostly look fine.

Comments on 0006

/*
+ * cost_append
+ *	  Determines and returns the cost of an Append node.
+ *
... clipped portion
+
+	/* Add Append node overhead. */
+	run_cost += cpu_tuple_cost * DEFAULT_APPEND_COST_FACTOR * tuples;
+

I am wondering whether it's really worth creating a new function for a
single line addition to create_append_path(). I think all we need to change
in create_append_path() is add (cpu_tuple_cost * DEFAULT_APPEND_COST_FACTOR
* tuples) to path->total_cost.

+	/* Add MergeAppend node overhead like we do it for the Append node */
+	run_cost += cpu_tuple_cost * DEFAULT_APPEND_COST_FACTOR * tuples;
+

With this change the following comment is no longer true. Please remove it.

 * extracted tuple.  We don't charge cpu_tuple_cost because a MergeAppend
 * node doesn't do qual-checking or projection, so it has less overhead
 * than most plan nodes.
 */

+/*
+ * Arbitrarily use 50% of the cpu_tuple_cost to cost append node. Note that

Maybe reword it as "... to cost per tuple processing by an append node ...".

+ * this value should be multiplied with cpu_tuple_cost wherever applicable.
+ */
+#define DEFAULT_APPEND_COST_FACTOR 0.5

I am wondering whether we should just define
#define APPEND_TUPLE_COST (cpu_tuple_cost * 0.5)
and use this macro everywhere. What other use would DEFAULT_APPEND_COST_FACTOR
have than being multiplied with cpu_tuple_cost?
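Something like this untested sketch is what I have in mind; the call site
shown is only illustrative, and cpu_tuple_cost is the existing planner GUC:

/*
 * Per-tuple overhead charged for Append/MergeAppend nodes.  Folding the
 * multiplication by cpu_tuple_cost into the macro means callers can't
 * forget it.
 */
#define APPEND_TUPLE_COST	(cpu_tuple_cost * 0.5)

	/* e.g. in cost_append() and cost_merge_append() */
	run_cost += APPEND_TUPLE_COST * tuples;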
 -- test partition matching with N-way join
 EXPLAIN (COSTS OFF)
 SELECT avg(t1.a), avg(t2.b), avg(t3.a + t3.b), t1.c, t2.c, t3.c
 FROM plt1 t1, plt2 t2, plt1_e t3 WHERE t1.c = t2.c AND ltrim(t3.c, 'A') = t1.c GROUP BY t1.c, t2.c, t3.c ORDER BY t1.c, t2.c, t3.c;
-                              QUERY PLAN
---------------------------------------------------------------------------------------
+                            QUERY PLAN
+--------------------------------------------------------------------------------
 Sort
   Sort Key: t1.c, t3.c
   ->  HashAggregate
         Group Key: t1.c, t2.c, t3.c
-        ->  Result
+        ->  Hash Join
+              Hash Cond: (t1.c = t2.c)
              ->  Append
-                    ->  Hash Join
-                          Hash Cond: (t1.c = t2.c)

That's sad. Interestingly this query has an aggregate, so the plan will use
partition-wise join again once the partition-wise aggregation patch is
applied. So this may be fine.

-         Append  (cost=0.00..0.04 rows=2 width=32)
+         Append  (cost=0.00..0.05 rows=2 width=32)

-   Append  (cost=0.00..0.04 rows=2 width=4)
+   Append  (cost=0.00..0.05 rows=2 width=4)

We do have some testcases which print costs. Interesting :). I don't have any
objection to this change.

Comments on 0007

+     <para>
+      Enables or disables the query planner's use of partition-wise grouping
+      or aggregation, which allows
+      If partition-wise aggregation does not result in the
+      cheapest path, it will still spend time in creating these paths and
+      consume memory which increase linearly with the number of partitions.
+      The default is <literal>off</>.
+     </para>
+    </listitem>
+   </varlistentry>
+

Maybe we should word this in the same manner as partition-wise join, like:

Enables or disables the query planner's use of partition-wise grouping or
aggregation, which allows aggregation or grouping on a partitioned table to be
spread across the partitions. If the <literal>GROUP BY</literal> clause
includes the partition keys, the rows are aggregated at each partition.
Otherwise, partial aggregates computed for each partition are required to be
combined. Because partition-wise aggregation/grouping can use significantly
more CPU time and memory during planning, the default is
<literal>off</literal>.

+
+Partition-wise aggregates/grouping
+----------------------------------

... clipped patch

+In above plan, aggregation is performed after append node which means that the
+whole table is an input for the aggregation node. However, with partition-wise
+aggregation, same query will have plane like:

s/plane/plan/

+  Append

... clipped patch

+PartialAggregate stage greatly reduces the number of groups and lose if we have
+lots of small groups.

To keep the discussion brief, I suggest we rewrite this paragraph as:
----
If the GROUP BY clause has all the partition keys, all the rows that belong to
a given group come from a single partition and thus aggregates can be
finalized separately for each partition. When the number of groups is far less
than the number of rows being grouped, as usually is the case, the number of
rows processed by the Append node is reduced, in addition to reducing the size
of the hash table or the amount of data to be sorted. This usually improves
efficiency of the query.

If the GROUP BY clause does not contain all the partition keys, partial
aggregates can be computed for each partition, followed by combining the
partial aggregates from one or more partitions belonging to the same group to
compute the complete aggregate for each group. This improves efficiency of the
query if the number of groups is far less than the number of rows produced by
the scan underneath.
----

I am not sure whether we should be discussing why this technique performs
better or when it performs better. We don't have a similar discussion for
partition-wise join. That paragraph just describes the technique and maybe we
want to do the same here.

+ *
+ * extra is the additional information required when we are doing aggregation
+ * or grouping below the append node. In case of partial partition-wise
+ * aggregation on a child node, we need to compute finalized step after the
+ * append, which cannot be done in this function. And thus if we have non-NULL
+ * value for extra, we call create_partition_agg_paths() to create an append
+ * node and finalization, if any.

Maybe we want to just say "extra provides more information about the
partitioned aggregation/grouping e.g. path target, whether to use partial
aggregate and so on. When present we call create_partition_agg_paths() to add
paths for partition-wise aggregates."

-		add_path(rel, (Path *) create_append_path(rel, subpaths,
-												  rel->reltarget, NULL, 0,
-												  partitioned_rels));
+	{
+		if (extra)
+			create_partition_agg_paths(root, rel, subpaths, NIL,
+									   NIL, NULL, 0,
+									   partitioned_rels, extra);
+		else
+			add_path(rel, (Path *) create_append_path(rel, subpaths,
+													  rel->reltarget, NULL, 0,
+													  partitioned_rels));
+	}

I am wondering whether we could write a function to call the appropriate one
out of create_append_path(), create_partition_agg_paths() or
create_merge_append_path() based on the presence of extra and/or pathkeys, and
use it everywhere such a change is made. I don't know whether that will be
worth the code, but there are a handful of places where such diffs are
required.
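For example, something like the untested sketch below; the helper name is made
up, the call signatures follow the hunks quoted in this mail (and may not
match the patch exactly), and "extra" is whatever struct the patch passes
around:

/*
 * Add an append-like path for 'rel': partition-wise aggregation paths when
 * grouping information is present, a MergeAppend when pathkeys are given,
 * and a plain Append otherwise.
 */
static void
add_append_like_path(PlannerInfo *root, RelOptInfo *rel,
					 List *subpaths, List *pathkeys,
					 List *partitioned_rels,
					 void *extra)		/* grouping info; actual type as per the patch */
{
	if (extra)
		create_partition_agg_paths(root, rel, subpaths, NIL,
								   NIL, NULL, 0,
								   partitioned_rels, extra);
	else if (pathkeys)
		add_path(rel, (Path *) create_merge_append_path(root, rel, subpaths,
														pathkeys, NULL,
														partitioned_rels));
	else
		add_path(rel, (Path *) create_append_path(rel, subpaths,
												  rel->reltarget, NULL, 0,
												  partitioned_rels));
}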
-
-	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys, NULL);
+	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
+								   IS_OTHER_REL(best_path->subpath->parent) ?
+								   best_path->path.parent->relids : NULL);

While I can guess why this change is required, it may be better to separate it
into a patch of its own and add some explanation in the commit message for
other reviewers.

+	/* Copy input rels's relids to grouped rel */
+	grouped_rel->relids = input_rel->relids;

I am fine with this change, but Tom may not agree [1]. Maybe we should get his
opinion on this one.

 	/*
+	 * If input relation is partitioned, check if we can perform
+	 * partition-wise grouping and/or aggregation.
+	 */

Just like partition-wise join, a concise "Apply partition-wise aggregation
technique, if possible." would suffice.

 		dNumPartialGroups = get_number_of_groups(root,
 												 cheapest_partial_path->rows,
 												 gd,
-												 parse->targetList);
+												 make_tlist_from_pathtarget(target));

Can we guarantee that the output of make_tlist_from_pathtarget() will be the
same as the translation of parse->targetList for the given child? Even if not,
maybe it's fine to pass a slightly different tlist to get_number_of_groups()
since it doesn't depend upon the exact shape, only on the right group column
references. Nonetheless, something to test and verify.

 *
- * Determines whether parallel grouping and/or aggrgation is possible, or not.
+ * Determines whether parallel grouping and/or aggregation is possible, or not.
 * Returns true when possible, false otherwise.

Does this hunk belong to one of the refactoring patches, or in a separate
patch correcting a typo?

+/*
+ * try_partition_wise_grouping
+ *
+ * If the input relation is partitioned and the partition keys are part of the
+ * group by clauses, each partition produces a different set of groups.
+ * Aggregates within each such group can be computed partition-wise. This

While these sentences are correct, I think the reason why we could compute an
aggregate at the level of each partition is because rows from a given group
belong to a single partition. So, I guess, we have to reword this as "If the
partition keys of the input relation are part of the group by clause, all the
rows belonging to a given group come from a single partition, each partition
producing a different set of groups. This allows aggregation/grouping over a
partitioned relation to be broken down into aggregation/grouping on each
partition. If the group by clause does not contain all the partition keys,
rows from a given group may be spread across multiple partitions. In that
case, we can combine partial aggregates for a given group across partitions to
produce the final aggregate for that group."

+ * might be optimal because of presence of suitable paths with pathkeys or
+ * because the hash tables for most of the partitions fit into the memory.
+ * However, if partition keys are not part of the group by clauses, then we
+ * still able to compute the partial aggregation for each partition and then
+ * finalize them over append. This can either win or lose. It may win if the
+ * PartialAggregate stage greatly reduces the number of groups and lose if we
+ * have lots of small groups.

I have not seen the prologue of a function implementing a query optimization
technique explain why that technique improves performance, so I am not sure
whether the comment should include this explanation. One reason is that the
reasons why a technique works might change over time with the introduction of
other techniques, thus obsoleting the comment. But maybe it's good to have it
here.

+	/*
+	 * Grouping sets plan does not work with an inheritance subtree (see notes
+	 * in create_groupingsets_plan). Thus do not handle grouping sets here.
+	 */
+	if (query->groupingSets || gd)
+		return;

Even if that restriction is lifted, we won't be able to compute "whole"
grouping sets for each partition, since grouping sets imply multiple group by
clauses, each of which may not have all the partition keys. Those sets which
have all the partition keys will be computed completely for each partition,
but others will require partial aggregation. I guess we will need to apply
partition-wise aggregation to each derived group by clause and not as a
wholesale strategy. Anyway, it doesn't look like a good idea to pass an
argument (gd) only to return from the function when it is present. Maybe we
should handle it outside this function.

+
+	/* Nothing to do, if the input relation is not partitioned. */
+	if (!input_rel->part_scheme)
+		return;
+
+	Assert(input_rel->part_rels);

A join between two partitioned tables, one of them being a dummy relation,
would have part_scheme set but not part_rels (see try_partition_wise_join()).
This assertion would fail in such a case. Have you tested the case? Maybe we
should just test whether input_rel->part_rels exists, similar to
generate_partition_wise_join_paths(). Also, how is a dummy input relation
handled in this function? Do we need to handle it?
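Concretely, an untested sketch of the early check I have in mind, using only
the fields discussed above:

	/*
	 * Bail out if the input relation is not partitioned, or if its child
	 * RelOptInfos were never built (e.g. a partitioned joinrel with a dummy
	 * side; see try_partition_wise_join()).
	 */
	if (input_rel->part_scheme == NULL || input_rel->part_rels == NULL)
		return;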
+	nparts = input_rel->nparts;
+	part_rels = (RelOptInfo **) palloc(nparts * sizeof(RelOptInfo *));
+	grouped_rel->part_rels = part_rels;

For a partial aggregation, we can't say that the child rels produced here are
partitions of the top grouped relation, so setting part_rels looks wrong. We
should set this only when a full aggregate is obtained from each partition.

+	scanjoin_target = copy_pathtarget(input_rel->cheapest_startup_path->pathtarget);
+	scanjoin_target->exprs = (List *) adjust_appendrel_attrs(root,
+												(Node *) scanjoin_target->exprs,
+												nappinfos,
+												appinfos);

Why can't we use input_child_rel->pathtarget? It should be the same as the
translation of its parent's path target. I probably understand why: the input
rel's path targets have been changed after the underlying join was planned, a
step which is not applied to the individual children. Maybe add a comment
here?

+	child_target->exprs = (List *) adjust_appendrel_attrs(root,
+														  (Node *) target->exprs,
+														  nappinfos,
+														  appinfos);

+	partial_target = make_partial_grouping_target(root, target);
+	partial_target->exprs = (List *) adjust_appendrel_attrs(root,
+												(Node *) partial_target->exprs,
+												nappinfos,
+												appinfos);

We need both of these steps for any aggregate since parallel paths will
compute partial paths anyway. If that's correct, maybe we should add a
comment?

+	extra.inputRows = 0;	/* Not needed at child paths creation */

Why? The comment should be on its own line.

+	if (!create_child_grouping_paths(root, input_child_rel, agg_costs, gd,
+									 &extra))
+	{
+		/* Could not create path for childrel, return */
+		pfree(appinfos);
+		return;
+	}

Can we detect this condition and bail out even before planning any of the
children? It looks wasteful to try to plan children only to bail out in this
case.

+	/* Nothing to do if we have no live children */
+	if (live_children == NIL)
+		return;

A parent relation with all dummy children will also be dummy. Maybe we should
mark the parent dummy in that case using mark_dummy_rel(), similar to
generate_partition_wise_join_paths().

+/*
+ * have_grouping_by_partkey
+ *

Somehow this name sounds like it would return true when GROUP BY contains only
the partition key. Maybe rename it as group_by_has_partkey, to indicate that
the GROUP BY clause has the partition keys?

+ * Returns true, if partition keys of the given relation are part of the
+ * GROUP BY clauses, false otherwise.

Reword as "... if all the partition keys of ...".

+static bool
+have_grouping_by_partkey(RelOptInfo *input_rel, PathTarget *target,
+						 List *groupClause)
+{
+	List	   *tlist = make_tlist_from_pathtarget(target);
+	List	   *groupexprs = get_sortgrouplist_exprs(groupClause, tlist);

Have we tested the case with a multi-level partitioned table and children with
a different order of partition key columns?

+		partexprs = input_rel->partexprs ? input_rel->partexprs[cnt] : NIL;
+
+		/* Rule out early, if there are no partition keys present */
+		if (partexprs == NIL)
+			return false;

If input_rel->partexprs is NIL, we should bail out even before the loop
starts.

+		foreach(lc, partexprs)
+		{
+			Expr	   *partexpr = lfirst(lc);
+
+			if (list_member(groupexprs, partexpr))
+			{
+				found = true;
+				break;
+			}
+		}

This looks like a useful piece of general functionality, a
list_has_intersection(), which would return a boolean instead of the whole
intersection. I am not sure whether we should add that function to list.c and
use it here.
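If we decide it is worth adding, an untested sketch of what I mean; it would
sit next to list_member() in src/backend/nodes/list.c:

/*
 * list_has_intersection
 *		Return true if any member of list1 is also a member of list2,
 *		comparing members with equal() as list_member() does.
 */
bool
list_has_intersection(const List *list1, const List *list2)
{
	ListCell   *lc;

	foreach(lc, list1)
	{
		if (list_member(list2, lfirst(lc)))
			return true;
	}
	return false;
}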
+ * If none of the partition key matches with any of the GROUP BY

Reword as "... the partition key expressions match with ...".

This isn't a full review of 0007, but I think it covers most of the new
functionality.

[1] https://www.postgresql.org/message-id/CAFjFpRdUz6h6cmFZFYAngmQAX8Zvo+MZsPXidZ077h=gp9bvQw@mail.gmail.com

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company