Thread: Optimizing aggregates

Optimizing aggregates

From

Heikki Linnakangas

Date:

31 August 2016, 14:47:25

I've been profiling simple aggregate queries, looking for any
low-hanging fruit. For this query:


-- setup
create table floats as select g::float8 as a, g::float8 as b, g::float8
as c from generate_series(1, 10000000) g;
vacuum freeze floats;

-- query
select sum(a), sum(b+c) from floats;


perf report says:

# Children      Self  Command     Shared Object      Symbol

# ........  ........  ..........  .................
........................................
#
     25.70%     0.00%  postmaster  [unknown]          [k] 0000000000000000
     14.23%    13.75%  postmaster  postgres           [.] ExecProject
     11.18%    10.57%  postmaster  postgres           [.] slot_deform_tuple
      9.58%     9.04%  postmaster  postgres           [.] advance_aggregates
      8.96%     0.00%  postmaster  [unknown]          [.] 0x00000000000298d4
      8.77%     8.42%  postmaster  postgres           [.]
ExecMakeFunctionResultNoSets
      7.78%     0.00%  postmaster  [unknown]          [.] 0x0000000001d38260
      6.63%     6.15%  postmaster  postgres           [.]
advance_transition_function
      6.61%     0.00%  postmaster  [unknown]          [.] 0x0000000001e99e40
      6.47%     0.00%  postmaster  libc-2.23.so       [.] __GI___libc_read
      6.24%     5.88%  postmaster  postgres           [.] heap_getnext
      4.62%     4.62%  postmaster  [kernel.kallsyms]  [k]
copy_user_enhanced_fast_string
      3.91%     3.82%  postmaster  postgres           [.] slot_getsomeattrs
      3.29%     3.18%  postmaster  postgres           [.] slot_getattr
      3.06%     3.00%  postmaster  postgres           [.] ExecClearTuple
      2.59%     0.00%  postmaster  [unknown]          [.] 0x0000000001e9a370
      2.57%     2.45%  postmaster  postgres           [.] ExecScan
      2.56%     2.37%  postmaster  postgres           [.] float8pl
      2.54%     2.43%  postmaster  postgres           [.] heapgetpage
      2.25%     2.17%  postmaster  postgres           [.] ExecAgg
      2.10%     1.96%  postmaster  postgres           [.] ExecStoreTuple
      2.00%     1.91%  postmaster  postgres           [.] ExecProcNode

ExecProject stands out. I find that pretty surprising.

We're using ExecProject to extract the arguments from the input tuples,
to pass to the aggregate transition functions. It looks like that's a
pretty expensive way of doing it, for a typical aggregate that takes
only one argument.

We actually used to call ExecEvalExpr() directly for each argument, but
that was changed by the patch that added support for ordered set
aggregates. It looks like that was a bad idea, from a performance point
of view.

I propose that we go back to calling ExecEvalExpr() directly, for
non-ordered aggregates, per the attached patch. That makes that example
query about 10% faster on my laptop, which is in line with the fact that
ExecProject() accounted for about 13% of the CPU time.

Another idea is that maybe we should add a fast-path to ExecProject(),
for these trivial cases.

- Heikki

Attachment

0001-Skip-ExecProject-for-non-ordered-aggregates.patch

Re: Optimizing aggregates

From

Andres Freund

Date:

31 August 2016, 15:52:05

Hi,

On 2016-08-31 17:47:18 +0300, Heikki Linnakangas wrote:
> # ........  ........  ..........  .................
> ........................................
> #
>     25.70%     0.00%  postmaster  [unknown]          [k] 0000000000000000
>     14.23%    13.75%  postmaster  postgres           [.] ExecProject

> ExecProject stands out. I find that pretty surprising.
>
> We're using ExecProject to extract the arguments from the input tuples, to
> pass to the aggregate transition functions. It looks like that's a pretty
> expensive way of doing it, for a typical aggregate that takes only one
> argument.
>
> We actually used to call ExecEvalExpr() directly for each argument, but that
> was changed by the patch that added support for ordered set aggregates. It
> looks like that was a bad idea, from a performance point of view.

I complained about that as well
http://archives.postgresql.org/message-id/20160519175727.ymv2y5tye4qgcmqx%40alap3.anarazel.de

> I propose that we go back to calling ExecEvalExpr() directly, for
> non-ordered aggregates, per the attached patch. That makes that example
> query about 10% faster on my laptop, which is in line with the fact that
> ExecProject() accounted for about 13% of the CPU time.

My approach is a bit different.

I've first combined the projection for all the aggregates, ordered set,
or not, into one projetion. That got rid of a fair amount of overhead
when you have multiple aggregates.  I attached an, probably out of date,
WIP version of that patch.

Secondly, I'm working on overhauling expression evaluation to be
faster. Even without the ExecProject overhead, the computations very
quickly become the bottleneck. During that I pretty much merged
ExecProject and ExecEvalExpr into one - they're really not that
different, and the distinction serves no purpose, except to increase the
number of function calls. The reason I'm working on getting rid of
targetlist SRFs is precisely that. A proof of concept of that is
attached to
http://archives.postgresql.org/message-id/20160714011850.bd5zhu35szle3n3c%40alap3.anarazel.de

Greetings,

Andres Freund

Attachment

0001-WIP-Only-perform-one-projection-in-aggregation.patch

Re: Optimizing aggregates

From

Heikki Linnakangas

Date:

31 August 2016, 16:07:10

On 08/31/2016 06:51 PM, Andres Freund wrote:
> On 2016-08-31 17:47:18 +0300, Heikki Linnakangas wrote:
>> We actually used to call ExecEvalExpr() directly for each argument, but that
>> was changed by the patch that added support for ordered set aggregates. It
>> looks like that was a bad idea, from a performance point of view.
>
> I complained about that as well
> http://archives.postgresql.org/message-id/20160519175727.ymv2y5tye4qgcmqx%40alap3.anarazel.de

Ah, missed that!

>> I propose that we go back to calling ExecEvalExpr() directly, for
>> non-ordered aggregates, per the attached patch. That makes that example
>> query about 10% faster on my laptop, which is in line with the fact that
>> ExecProject() accounted for about 13% of the CPU time.
>
> My approach is a bit different.
>
> I've first combined the projection for all the aggregates, ordered set,
> or not, into one projetion. That got rid of a fair amount of overhead
> when you have multiple aggregates.  I attached an, probably out of date,
> WIP version of that patch.

A-ha, I also considered doing just that! I also considered a variant 
where we call ExecProject once for all non-ordered aggregates, and a 
separate ExecProject() for each ordered one. But just switching back to 
straight ExecEvalExprs seemed easier.

> Secondly, I'm working on overhauling expression evaluation to be
> faster. Even without the ExecProject overhead, the computations very
> quickly become the bottleneck. During that I pretty much merged
> ExecProject and ExecEvalExpr into one - they're really not that
> different, and the distinction serves no purpose, except to increase the
> number of function calls. The reason I'm working on getting rid of
> targetlist SRFs is precisely that. A proof of concept of that is
> attached to
> http://archives.postgresql.org/message-id/20160714011850.bd5zhu35szle3n3c%40alap3.anarazel.de

Cool, yes, all that should help.

- Heikki

Re: Optimizing aggregates

From

Andres Freund

Date:

31 August 2016, 16:10:56

On 2016-08-31 19:07:00 +0300, Heikki Linnakangas wrote:
> On 08/31/2016 06:51 PM, Andres Freund wrote:
> > I've first combined the projection for all the aggregates, ordered set,
> > or not, into one projetion. That got rid of a fair amount of overhead
> > when you have multiple aggregates.  I attached an, probably out of date,
> > WIP version of that patch.
> 
> A-ha, I also considered doing just that! I also considered a variant where
> we call ExecProject once for all non-ordered aggregates, and a separate
> ExecProject() for each ordered one. But just switching back to straight
> ExecEvalExprs seemed easier.

The issue is that might, I think, end up iteratively deforming the
underlying tuple. The projection machinery takes care of that, if we do
it in one go.

Greetings,

Andres Freund