Re: The Future of Aggregation - Mailing list pgsql-hackers
From | Kevin Grittner |
---|---|
Subject | Re: The Future of Aggregation |
Date | |
Msg-id | 1986970674.211372.1433943568062.JavaMail.yahoo@mail.yahoo.com Whole thread Raw |
In response to | Re: The Future of Aggregation (David Rowley <david.rowley@2ndquadrant.com>) |
Responses |
Re: The Future of Aggregation
|
List | pgsql-hackers |
David Rowley <david.rowley@2ndquadrant.com> wrote: > On 10 June 2015 at 02:52, Kevin Grittner <kgrittn@ymail.com> wrote: >> David Rowley <david.rowley@2ndquadrant.com> wrote: >>> The idea I discussed in the link in item 5 above gets around this >>> problem, but it's a perhaps more surprise filled implementation >>> as it will mean "select avg(x),sum(x),count(x) from t" is >>> actually faster than "select sum(x),count(x) from t" as the agg >>> state for avg() will satisfy sum and count too. >> >> I'm skeptical that it will be noticeably faster. It's easy to see >> why this optimization will make a query *with all three* faster, >> but I would not expect the process of accumulating the sum and >> count to be about the same speed whether performed by one >> transition function or two. Of course I could be persuaded by a >> benchmark showing otherwise. Of course, after reading Tom's post and digging into what aggregates share a transition function, I was already prepared to eat my words above. Since the sum() aggregate is using the xxx_avg_accum transition function, it is clearly doing the work of accumulating the count already, so it's clear that the above can be a win. > Assuming that if we reuse the avg(x) state for count(x) and > sum(x) then it will perform almost exactly like a query > containing just avg(x), the only additional overhead is the call > to the final functions per group, so in the following case that's > likely immeasurable: > > /* setup */ create table millionrowtable as select > generate_series(1,1000000)::numeric as x; > /* test 1 */ SELECT sum(x) / count(x) from millionrowtable; > /* test 2 */ SELECT avg(x) from millionrowtable; > > Test 1: > 274.979 ms > 272.104 ms > 269.915 ms > > Test 2: > 229.619 ms > 220.703 ms > 234.743 ms > > (About 19% slower) Of course, with Tom's approach you would see the benefit; the two statements should run at about the same speed. I am a little curious what sort of machine you're running on, because my i7 is much slower. I ran a few other tests with your table for perspective. To get the raw time to just pass the tuples: SELECT from millionrowtable where xmin = '0'; Time: 125.340 ms Time: 124.443 ms Time: 115.629 ms Just the count(*) of those rows didn't boost the time much: SELECT count(*) from millionrowtable; Time: 132.128 ms Time: 128.036 ms Time: 125.400 ms The NULL check added by specifying count(x) boosted it more: SELECT count(x) from millionrowtable; Time: 165.858 ms Time: 163.872 ms Time: 165.448 ms A NULL check plus numeric addition gets expensive: SELECT sum(x) from millionrowtable; Time: 366.879 ms Time: 364.503 ms Time: 365.418 ms Since sum() and avg() use the same transition function, I was suprised to see a difference here: SELECT avg(x) from millionrowtable; Time: 374.339 ms Time: 372.294 ms Time: 366.933 ms Here's the statement you are talking about optimizing: SELECT sum(x), count(x) from millionrowtable; Time: 441.331 ms Time: 442.501 ms Time: 436.930 ms To confirm that projecting the extra column compared to avg() was not significant: SELECT sum(x) / count(x) from millionrowtable; Time: 442.404 ms Time: 436.241 ms Time: 442.381 ms So this can reasonably be compared to the avg(x) time above. On my machine this optimization could be expected to shave off about 16% of current run time. One question that arose in my mind running this was whether might be able to combine sum(x) with count(*) if x was NOT NULL, even though the arguments don't match. It might not be worth the gymnastics of recognizing the special case, and I certainly wouldn't recommend looking at that optimization in a first pass; but it might be worth jotting down on a list somewhere.... -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: