Re: [HACKERS] Partition-wise aggregation/grouping - Mailing list pgsql-hackers
From | Ashutosh Bapat
---|---
Subject | Re: [HACKERS] Partition-wise aggregation/grouping
Date |
Msg-id | CAFjFpRfyPVKD0ZqTYF8Y-pySu_M0wp_Vn2+NCUwHPs6AT_k2Jw@mail.gmail.com
In response to | Re: [HACKERS] Partition-wise aggregation/grouping (Jeevan Chalke <jeevan.chalke@enterprisedb.com>)
Responses | Re: [HACKERS] Partition-wise aggregation/grouping
Responses | Re: [HACKERS] Partition-wise aggregation/grouping
List | pgsql-hackers
On Wed, Nov 15, 2017 at 5:31 PM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
>
> OK. Done in the attached patch set.
>
> I have rebased all my patches on latest HEAD which is at
> 7518049980be1d90264addab003476ae105f70d4
>
> Thanks

These are review comments for the last set and I think most of them apply
to the new set as well.

Patches 0001 - 0005 refactor existing code. I haven't reviewed them in
detail to check whether we have missed anything in moving the code, but
they mostly look fine.

Comments on 0006

/*
+ * cost_append
+ *	  Determines and returns the cost of an Append node.
+ *
... clipped portion
+
+	/* Add Append node overhead. */
+	run_cost += cpu_tuple_cost * DEFAULT_APPEND_COST_FACTOR * tuples;
+

I am wondering whether it's really worth creating a new function for a
single line addition to create_append_path(). I think all we need to change
in create_append_path() is add (cpu_tuple_cost * DEFAULT_APPEND_COST_FACTOR
* tuples) to path->total_cost.

+	/* Add MergeAppend node overhead like we do it for the Append node */
+	run_cost += cpu_tuple_cost * DEFAULT_APPEND_COST_FACTOR * tuples;
+

With this change the following comment is no longer true. Please remove it.

 * extracted tuple.  We don't charge cpu_tuple_cost because a MergeAppend
 * node doesn't do qual-checking or projection, so it has less overhead
 * than most plan nodes.
 */

+/*
+ * Arbitrarily use 50% of the cpu_tuple_cost to cost append node. Note that

Maybe reword it as "... to cost per tuple processing by an append node ...".

+ * this value should be multiplied with cpu_tuple_cost wherever applicable.
+ */
+#define DEFAULT_APPEND_COST_FACTOR 0.5

I am wondering whether we should just define
#define APPEND_TUPLE_COST (cpu_tuple_cost * 0.5)
and use this macro everywhere. What other use would DEFAULT_APPEND_COST_FACTOR
have than being multiplied with cpu_tuple_cost?
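Something like this untested sketch is what I have in mind; the call site
shown is only illustrative, and cpu_tuple_cost is the existing planner GUC:

/*
 * Per-tuple overhead charged for Append/MergeAppend nodes.  Folding the
 * multiplication by cpu_tuple_cost into the macro means callers can't
 * forget it.
 */
#define APPEND_TUPLE_COST	(cpu_tuple_cost * 0.5)

	/* e.g. in cost_append() and cost_merge_append() */
	run_cost += APPEND_TUPLE_COST * tuples;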
 -- test partition matching with N-way join
 EXPLAIN (COSTS OFF)
 SELECT avg(t1.a), avg(t2.b), avg(t3.a + t3.b), t1.c, t2.c, t3.c
 FROM plt1 t1, plt2 t2, plt1_e t3 WHERE t1.c = t2.c AND ltrim(t3.c, 'A') = t1.c GROUP BY t1.c, t2.c, t3.c ORDER BY t1.c, t2.c, t3.c;
-                              QUERY PLAN
---------------------------------------------------------------------------------------
+                            QUERY PLAN
+--------------------------------------------------------------------------------
 Sort
   Sort Key: t1.c, t3.c
   ->  HashAggregate
         Group Key: t1.c, t2.c, t3.c
-        ->  Result
+        ->  Hash Join
+              Hash Cond: (t1.c = t2.c)
              ->  Append
-                    ->  Hash Join
-                          Hash Cond: (t1.c = t2.c)

That's sad. Interestingly this query has an aggregate, so the plan will use
partition-wise join again once the partition-wise aggregation patch is
applied. So this may be fine.

-         Append  (cost=0.00..0.04 rows=2 width=32)
+         Append  (cost=0.00..0.05 rows=2 width=32)

-   Append  (cost=0.00..0.04 rows=2 width=4)
+   Append  (cost=0.00..0.05 rows=2 width=4)

We do have some testcases which print costs. Interesting :). I don't have any
objection to this change.

Comments on 0007

+     <para>
+      Enables or disables the query planner's use of partition-wise grouping
+      or aggregation, which allows
+      If partition-wise aggregation does not result in the
+      cheapest path, it will still spend time in creating these paths and
+      consume memory which increase linearly with the number of partitions.
+      The default is <literal>off</>.
+     </para>
+    </listitem>
+   </varlistentry>
+

Maybe we should word this in the same manner as partition-wise join, like:

Enables or disables the query planner's use of partition-wise grouping or
aggregation, which allows aggregation or grouping on a partitioned table to be
spread across the partitions. If the <literal>GROUP BY</literal> clause
includes the partition keys, the rows are aggregated at each partition.
Otherwise, partial aggregates computed for each partition are required to be
combined. Because partition-wise aggregation/grouping can use significantly
more CPU time and memory during planning, the default is
<literal>off</literal>.

+
+Partition-wise aggregates/grouping
+----------------------------------

... clipped patch

+In above plan, aggregation is performed after append node which means that the
+whole table is an input for the aggregation node. However, with partition-wise
+aggregation, same query will have plane like:

s/plane/plan/

+  Append

... clipped patch

+PartialAggregate stage greatly reduces the number of groups and lose if we have
+lots of small groups.

To keep the discussion brief, I suggest we rewrite this paragraph as:
----
If the GROUP BY clause has all the partition keys, all the rows that belong to
a given group come from a single partition and thus aggregates can be
finalized separately for each partition. When the number of groups is far less
than the number of rows being grouped, as usually is the case, the number of
rows processed by the Append node is reduced, in addition to reducing the size
of the hash table or the amount of data to be sorted. This usually improves
efficiency of the query.

If the GROUP BY clause does not contain all the partition keys, partial
aggregates can be computed for each partition, followed by combining the
partial aggregates from one or more partitions belonging to the same group to
compute the complete aggregate for each group. This improves efficiency of the
query if the number of groups is far less than the number of rows produced by
the scan underneath.
----

I am not sure whether we should be discussing why this technique performs
better or when it performs better. We don't have a similar discussion for
partition-wise join. That paragraph just describes the technique and maybe we
want to do the same here.

+ *
+ * extra is the additional information required when we are doing aggregation
+ * or grouping below the append node. In case of partial partition-wise
+ * aggregation on a child node, we need to compute finalized step after the
+ * append, which cannot be done in this function. And thus if we have non-NULL
+ * value for extra, we call create_partition_agg_paths() to create an append
+ * node and finalization, if any.

Maybe we want to just say "extra provides more information about the
partitioned aggregation/grouping e.g. path target, whether to use partial
aggregate and so on. When present we call create_partition_agg_paths() to add
paths for partition-wise aggregates."

-		add_path(rel, (Path *) create_append_path(rel, subpaths,
-												  rel->reltarget, NULL, 0,
-												  partitioned_rels));
+	{
+		if (extra)
+			create_partition_agg_paths(root, rel, subpaths, NIL,
+									   NIL, NULL, 0,
+									   partitioned_rels, extra);
+		else
+			add_path(rel, (Path *) create_append_path(rel, subpaths,
+													  rel->reltarget, NULL, 0,
+													  partitioned_rels));
+	}

I am wondering whether we could write a function to call the appropriate one
out of create_append_path(), create_partition_agg_paths() or
create_merge_append_path() based on the presence of extra and/or pathkeys, and
use it everywhere such a change is made. I don't know whether that will be
worth the code, but there are a handful of places where such diffs are
required.
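For example, something like the untested sketch below; the helper name is made
up, the call signatures follow the hunks quoted in this mail (and may not
match the patch exactly), and "extra" is whatever struct the patch passes
around:

/*
 * Add an append-like path for 'rel': partition-wise aggregation paths when
 * grouping information is present, a MergeAppend when pathkeys are given,
 * and a plain Append otherwise.
 */
static void
add_append_like_path(PlannerInfo *root, RelOptInfo *rel,
					 List *subpaths, List *pathkeys,
					 List *partitioned_rels,
					 void *extra)		/* grouping info; actual type as per the patch */
{
	if (extra)
		create_partition_agg_paths(root, rel, subpaths, NIL,
								   NIL, NULL, 0,
								   partitioned_rels, extra);
	else if (pathkeys)
		add_path(rel, (Path *) create_merge_append_path(root, rel, subpaths,
														pathkeys, NULL,
														partitioned_rels));
	else
		add_path(rel, (Path *) create_append_path(rel, subpaths,
												  rel->reltarget, NULL, 0,
												  partitioned_rels));
}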
-
-	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys, NULL);
+	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
+								   IS_OTHER_REL(best_path->subpath->parent) ?
+								   best_path->path.parent->relids : NULL);

While I can guess why this change is required, it may be better to separate it
into a patch of its own and add some explanation in the commit message for
other reviewers.

+	/* Copy input rels's relids to grouped rel */
+	grouped_rel->relids = input_rel->relids;

I am fine with this change, but Tom may not agree [1]. Maybe we should get his
opinion on this one.

 	/*
+	 * If input relation is partitioned, check if we can perform
+	 * partition-wise grouping and/or aggregation.
+	 */

Just like partition-wise join, a concise "Apply partition-wise aggregation
technique, if possible." would suffice.

 		dNumPartialGroups = get_number_of_groups(root,
 												 cheapest_partial_path->rows,
 												 gd,
-												 parse->targetList);
+												 make_tlist_from_pathtarget(target));

Can we guarantee that the output of make_tlist_from_pathtarget() will be the
same as the translation of parse->targetList for the given child? Even if not,
maybe it's fine to pass a slightly different tlist to get_number_of_groups()
since it doesn't depend upon the exact shape, only on the right group column
references. Nonetheless, something to test and verify.

 *
- * Determines whether parallel grouping and/or aggrgation is possible, or not.
+ * Determines whether parallel grouping and/or aggregation is possible, or not.
 * Returns true when possible, false otherwise.

Does this hunk belong to one of the refactoring patches, or in a separate
patch correcting a typo?

+/*
+ * try_partition_wise_grouping
+ *
+ * If the input relation is partitioned and the partition keys are part of the
+ * group by clauses, each partition produces a different set of groups.
+ * Aggregates within each such group can be computed partition-wise. This

While these sentences are correct, I think the reason why we could compute an
aggregate at the level of each partition is because rows from a given group
belong to a single partition. So, I guess, we have to reword this as "If the
partition keys of the input relation are part of the group by clause, all the
rows belonging to a given group come from a single partition, each partition
producing a different set of groups. This allows aggregation/grouping over a
partitioned relation to be broken down into aggregation/grouping on each
partition. If the group by clause does not contain all the partition keys,
rows from a given group may be spread across multiple partitions. In that
case, we can combine partial aggregates for a given group across partitions to
produce the final aggregate for that group."

+ * might be optimal because of presence of suitable paths with pathkeys or
+ * because the hash tables for most of the partitions fit into the memory.
+ * However, if partition keys are not part of the group by clauses, then we
+ * still able to compute the partial aggregation for each partition and then
+ * finalize them over append. This can either win or lose. It may win if the
+ * PartialAggregate stage greatly reduces the number of groups and lose if we
+ * have lots of small groups.

I have not seen the prologue of a function implementing a query optimization
technique explain why that technique improves performance, so I am not sure
whether the comment should include this explanation. One reason is that the
reasons why a technique works might change over time with the introduction of
other techniques, thus obsoleting the comment. But maybe it's good to have it
here.

+	/*
+	 * Grouping sets plan does not work with an inheritance subtree (see notes
+	 * in create_groupingsets_plan). Thus do not handle grouping sets here.
+	 */
+	if (query->groupingSets || gd)
+		return;

Even if that restriction is lifted, we won't be able to compute "whole"
grouping sets for each partition, since grouping sets imply multiple group by
clauses, each of which may not have all the partition keys. Those sets which
have all the partition keys will be computed completely for each partition,
but others will require partial aggregation. I guess we will need to apply
partition-wise aggregation to each derived group by clause and not as a
wholesale strategy. Anyway, it doesn't look like a good idea to pass an
argument (gd) only to return from the function when it is present. Maybe we
should handle it outside this function.

+
+	/* Nothing to do, if the input relation is not partitioned. */
+	if (!input_rel->part_scheme)
+		return;
+
+	Assert(input_rel->part_rels);

A join between two partitioned tables, one of them being a dummy relation,
would have part_scheme set but not part_rels (see try_partition_wise_join()).
This assertion would fail in such a case. Have you tested the case? Maybe we
should just test whether input_rel->part_rels exists, similar to
generate_partition_wise_join_paths(). Also, how is a dummy input relation
handled in this function? Do we need to handle it?
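Concretely, an untested sketch of the early check I have in mind, using only
the fields discussed above:

	/*
	 * Bail out if the input relation is not partitioned, or if its child
	 * RelOptInfos were never built (e.g. a partitioned joinrel with a dummy
	 * side; see try_partition_wise_join()).
	 */
	if (input_rel->part_scheme == NULL || input_rel->part_rels == NULL)
		return;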
+	nparts = input_rel->nparts;
+	part_rels = (RelOptInfo **) palloc(nparts * sizeof(RelOptInfo *));
+	grouped_rel->part_rels = part_rels;

For a partial aggregation, we can't say that the child rels produced here are
partitions of the top grouped relation, so setting part_rels looks wrong. We
should set this only when a full aggregate is obtained from each partition.

+	scanjoin_target = copy_pathtarget(input_rel->cheapest_startup_path->pathtarget);
+	scanjoin_target->exprs = (List *) adjust_appendrel_attrs(root,
+												(Node *) scanjoin_target->exprs,
+												nappinfos,
+												appinfos);

Why can't we use input_child_rel->pathtarget? It should be the same as the
translation of its parent's path target. I probably understand why: the input
rel's path targets have been changed after the underlying join was planned, a
step which is not applied to the individual children. Maybe add a comment
here?

+	child_target->exprs = (List *) adjust_appendrel_attrs(root,
+														  (Node *) target->exprs,
+														  nappinfos,
+														  appinfos);

+	partial_target = make_partial_grouping_target(root, target);
+	partial_target->exprs = (List *) adjust_appendrel_attrs(root,
+												(Node *) partial_target->exprs,
+												nappinfos,
+												appinfos);

We need both of these steps for any aggregate since parallel paths will
compute partial paths anyway. If that's correct, maybe we should add a
comment?

+	extra.inputRows = 0;	/* Not needed at child paths creation */

Why? The comment should be on its own line.

+	if (!create_child_grouping_paths(root, input_child_rel, agg_costs, gd,
+									 &extra))
+	{
+		/* Could not create path for childrel, return */
+		pfree(appinfos);
+		return;
+	}

Can we detect this condition and bail out even before planning any of the
children? It looks wasteful to try to plan children only to bail out in this
case.

+	/* Nothing to do if we have no live children */
+	if (live_children == NIL)
+		return;

A parent relation with all dummy children will also be dummy. Maybe we should
mark the parent dummy in that case using mark_dummy_rel(), similar to
generate_partition_wise_join_paths().

+/*
+ * have_grouping_by_partkey
+ *

Somehow this name sounds like it would return true when GROUP BY contains only
the partition key. Maybe rename it as group_by_has_partkey, to indicate that
the GROUP BY clause has the partition keys?

+ * Returns true, if partition keys of the given relation are part of the
+ * GROUP BY clauses, false otherwise.

Reword as "... if all the partition keys of ...".

+static bool
+have_grouping_by_partkey(RelOptInfo *input_rel, PathTarget *target,
+						 List *groupClause)
+{
+	List	   *tlist = make_tlist_from_pathtarget(target);
+	List	   *groupexprs = get_sortgrouplist_exprs(groupClause, tlist);

Have we tested the case with a multi-level partitioned table and children with
a different order of partition key columns?

+		partexprs = input_rel->partexprs ? input_rel->partexprs[cnt] : NIL;
+
+		/* Rule out early, if there are no partition keys present */
+		if (partexprs == NIL)
+			return false;

If input_rel->partexprs is NIL, we should bail out even before the loop
starts.

+		foreach(lc, partexprs)
+		{
+			Expr	   *partexpr = lfirst(lc);
+
+			if (list_member(groupexprs, partexpr))
+			{
+				found = true;
+				break;
+			}
+		}

This looks like a useful piece of general functionality, a
list_has_intersection(), which would return a boolean instead of the whole
intersection. I am not sure whether we should add that function to list.c and
use it here.
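If we decide it is worth adding, an untested sketch of what I mean; it would
sit next to list_member() in src/backend/nodes/list.c:

/*
 * list_has_intersection
 *		Return true if any member of list1 is also a member of list2,
 *		comparing members with equal() as list_member() does.
 */
bool
list_has_intersection(const List *list1, const List *list2)
{
	ListCell   *lc;

	foreach(lc, list1)
	{
		if (list_member(list2, lfirst(lc)))
			return true;
	}
	return false;
}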
+ * If none of the partition key matches with any of the GROUP BY

Reword as "... the partition key expressions match with ...".

This isn't a full review of 0007, but I think it covers most of the new
functionality.

[1] https://www.postgresql.org/message-id/CAFjFpRdUz6h6cmFZFYAngmQAX8Zvo+MZsPXidZ077h=gp9bvQw@mail.gmail.com

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company