Thread: Optimizing numeric SUM() aggregate
Hi, I spent some time profiling a simple query with a SUM() aggregate. We've made some big improvements in this area in recent years, but it seems there's still some room for improvement. A lot of CPU time is spent in the numeric add_var() and friends, which isn't that surprising, I guess.

I came up with the attached patch that keeps the sum in a specialized accumulator, instead of a NumericVar. The specialized accumulator has a few tricks, compared to the status quo:

1. Uses 32-bit integers to represent each base-10000 "digit". Instead of propagating carry after each new value, it's done only every 9999 values (or at the end).

2. Accumulates positive and negative values separately. The positive and negative sums are added together only at the end. This avoids the overhead in add_var() of figuring out which value is larger and determining the result sign at each step.

3. Only allocates a new buffer when it needs to be enlarged; add_abs() allocates a new one on every call.

These optimizations give a nice speedup for SUM(), and other aggregates like AVG() and STDDEV() that use the same agg state. For example, using the same test query that Hadi Moshayedi used in previous work on numeric aggregates (https://www.postgresql.org/message-id/CAK%3D1%3DWrmCkWc_xQXs_bpUyswCPr7O9zkLmm8Oa7_nT2vybvBEQ%40mail.gmail.com):

CREATE TABLE avg_test AS SELECT (random() * 999)::decimal(5,2) as d FROM generate_series(1, 10000000) s;

SELECT avg(d) FROM avg_test;

On my laptop, with max_parallel_workers_per_gather=0, this runs in about 1.5 s without the patch, and 1.2 s with the patch.

- Heikki
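[Editor's note: for readers skimming the thread, here is a minimal standalone sketch of the accumulator idea described above. The names (SumAccumSketch, accum_add, carry_propagate) and the fixed-size ACCUM_DIGITS arrays are hypothetical simplifications for illustration only; the actual patch works on PostgreSQL's NumericVar digit representation, aligns digits by decimal weight, and grows its buffers on demand.]

    #include <stdint.h>
    #include <stdbool.h>

    #define NBASE        10000
    #define ACCUM_DIGITS 64     /* fixed size for this sketch only */

    typedef struct
    {
        int32_t pos_digits[ACCUM_DIGITS]; /* digits of the positive sum */
        int32_t neg_digits[ACCUM_DIGITS]; /* digits of the negative sum */
        int     num_uncarried;            /* additions since last carry pass */
    } SumAccumSketch;

    /* Propagate carries so that every digit is again < NBASE.
     * Digit 0 is the least significant, to keep the loop simple. */
    static void
    carry_propagate(int32_t *digits, int n)
    {
        int32_t carry = 0;
        int     i;

        for (i = 0; i < n; i++)
        {
            int32_t v = digits[i] + carry;

            digits[i] = v % NBASE;
            carry = v / NBASE;
        }
    }

    /* Add one value's digits (each < NBASE, least significant first;
     * alignment by decimal weight is omitted for brevity).
     * Trick 1: no per-addition carry handling.  Each input digit is at
     * most NBASE - 1, so after NBASE - 1 additions an accumulator digit
     * is at most (NBASE - 1) * NBASE = 99990000 < INT32_MAX, and only
     * then is a carry pass needed.  Trick 2: negative values go into a
     * separate digit array; the two sums are combined once, at the end. */
    static void
    accum_add(SumAccumSketch *accum, const int16_t *digits, int ndigits,
              bool negative)
    {
        int32_t *acc = negative ? accum->neg_digits : accum->pos_digits;
        int      i;

        for (i = 0; i < ndigits; i++)
            acc[i] += digits[i];

        if (++accum->num_uncarried >= NBASE - 1)
        {
            carry_propagate(accum->pos_digits, ACCUM_DIGITS);
            carry_propagate(accum->neg_digits, ACCUM_DIGITS);
            accum->num_uncarried = 0;
        }
    }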
Hi! I like the idea and implementation, but I have one suggestion.

> Instead of propagating carry after each new value, it's done only every 9999 values (or at the end).

I think we could do the carry only every 0x7FFFFFFF / 10000 accumulations, couldn't we?

Best regards, Andrey Borodin, Octonica & Ural Federal University.
> I think we could do the carry only every 0x7FFFFFFF / 10000 accumulations, couldn't we?

I feel that I have to elaborate a bit. Probably my calculations are wrong.

Let's assume we have already accumulated INT_MAX in the previous-place accumulator, in 9999-valued digits. That's almost an overflow, but not an overflow. Carrying that accumulator into the current place gives us INT_MAX / 10000 of carried sum.

So in the current-place accumulator we can accumulate ( INT_MAX - INT_MAX / 10000 ) / 9999 additions, where 9999 is the maximum value dropped into the current-place accumulator on each addition. That is INT_MAX * 9999 / 99990000, or simply INT_MAX / 10000.

If we use an unsigned 32-bit integer, that is 429496, which makes the carrying 43 times less frequent. As a bonus, we get rid of the 9999 constant in the code (:

Please correct me if I'm wrong.

Best regards, Andrey Borodin, Octonica & Ural Federal University.
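[Editor's note: a quick worked check of that arithmetic, written as a C comment; the numbers simply plug UINT_MAX into the formula above.]

    /* Worked check, unsigned 32-bit, NBASE = 10000:
     *
     *   (UINT_MAX - UINT_MAX / 10000) / 9999
     * = (4294967295 - 429496) / 9999
     * = 4294537799 / 9999
     * = 429496                     == UINT_MAX / 10000
     *
     * versus carrying every 9999 additions:
     *   429496 / 9999 ~= 43 times fewer carry passes
     */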
On 27 July 2016 at 07:33, Andrew Borodin <borodin@octonica.com> wrote:
>> I think we could do the carry only every 0x7FFFFFFF / 10000 accumulations, couldn't we?
>
> I feel that I have to elaborate a bit. Probably my calculations are wrong.
>
> Let's assume we have already accumulated INT_MAX in the previous-place
> accumulator, in 9999-valued digits. That's almost an overflow, but not
> an overflow. Carrying that accumulator into the current place gives us
> INT_MAX / 10000 of carried sum.
> So in the current-place accumulator we can accumulate ( INT_MAX -
> INT_MAX / 10000 ) / 9999 additions, where 9999 is the maximum value
> dropped into the current-place accumulator on each addition.
> That is INT_MAX * 9999 / 99990000, or simply INT_MAX / 10000.
>
> If we use an unsigned 32-bit integer, that is 429496, which makes the
> carrying 43 times less frequent.
>
> As a bonus, we get rid of the 9999 constant in the code (:
>
> Please correct me if I'm wrong.

This is basically the same problem that has already been solved in mul_var() and other places in numeric.c, so in this case it could be coded using something like

    accum->maxdig += NBASE - 1;
    if (accum->maxdig > (INT_MAX - INT_MAX / NBASE) / (NBASE - 1))
        Propagate carries...

I agree that the new code should avoid explicitly referring to constants like 9999, and I don't think there is any reason for this new code to assume NBASE = 10000.

Regards,
Dean
> if (accum->maxdig > (INT_MAX - INT_MAX / NBASE) / (NBASE - 1))

Worth noting that (INT_MAX - INT_MAX / NBASE) / (NBASE - 1) == INT_MAX / NBASE for any NBASE > 1.

> I don't think there is any reason for this new code to assume NBASE=10000

There is a comment on line 64 stating that the value 10000 is hardcoded somewhere else, that any other value is not recommended, and that a bunch of code is left for historical reasons.

Best regards, Andrey Borodin, Octonica & Ural Federal University.
On 27 July 2016 at 09:57, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
> it could be coded using something like
>
>     accum->maxdig += NBASE - 1;
>     if (accum->maxdig > (INT_MAX - INT_MAX / NBASE) / (NBASE - 1))
>         Propagate carries...

Oops, that's not quite right, because in mul_var() maxdig actually represents the maximum possible digit value divided by NBASE - 1, so the first line ought to have been accum->maxdig++.

Regards,
Dean
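[Editor's note: putting Dean's correction together, the check would presumably look something like the following, using mul_var()'s convention that maxdig counts the maximum possible digit value in units of NBASE - 1. A sketch, not code from the patch.]

    accum->maxdig++;            /* each addition adds at most NBASE - 1 per digit */
    if (accum->maxdig > (INT_MAX - INT_MAX / NBASE) / (NBASE - 1))
    {
        /* propagate carries here ... */
        accum->maxdig = 1;      /* all digits are < NBASE again */
    }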
On 27 July 2016 at 10:17, Andrew Borodin <borodin@octonica.com> wrote:
>> if (accum->maxdig > (INT_MAX - INT_MAX / NBASE) / (NBASE - 1))
> Worth noting that (INT_MAX - INT_MAX / NBASE) / (NBASE - 1) == INT_MAX
> / NBASE for any NBASE > 1.

Interesting. I think it's clearer the way it is in mul_var() though, because the intention is to avoid letting digits exceed INT_MAX - INT_MAX / NBASE, so that there is no danger of overflow in the carry propagation step. The long form makes that clearer (and is just as efficient, since these are constant expressions, folded at compile time).

Regards,
Dean
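[Editor's note: spelling out the overflow argument behind that bound, assuming the propagation loop computes v = digits[i] + carry at each place, as in the sketch upthread.]

    /* In the carry propagation loop (v = digits[i] + carry), by induction:
     *
     *     carry = v / NBASE        <= INT_MAX / NBASE
     *     digits[i]                <= INT_MAX - INT_MAX / NBASE   (the invariant)
     *  => v = digits[i] + carry    <= INT_MAX                     (no overflow)
     *
     * With the collapsed constant INT_MAX / NBASE, the reader has to
     * re-derive this chain to see why the loop is safe.
     */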
Andrew Borodin <borodin@octonica.com> writes:
>> I don't think there is any reason for this new code to assume NBASE=10000

> There is a comment on line 64 stating that the value 10000 is hardcoded
> somewhere else, that any other value is not recommended, and that a
> bunch of code is left for historical reasons.

Doesn't matter: please use NBASE when you mean NBASE. 10000 is just a magic number, and therefore bad coding style. For that matter, spelling INT_MAX as 0x7FFFFFFF is also not per project style.

regards, tom lane
I've tested the patch with this query: https://gist.github.com/x4m/fee16ed1a55217528f036983d88771b4

Test specs were: Ubuntu 14 LTS VM, dynamic RAM, hypervisor Windows Server 2016, SSD disk, Core i5-2500. Configuration: out-of-the-box master make.

On 10 test runs, timing of the select statement: AVG 3739.8 ms, STD 435.4193. With the patch applied (as is): 3017.8 ms, STD 319.893.

Increasing the overflow constant showed no statistically significant performance improvement (though, I do not think it is worth doing).

2016-07-27 17:32 GMT+05:00 Tom Lane <tgl@sss.pgh.pa.us>:
> For that matter, spelling INT_MAX as 0x7FFFFFFF is also not per project style.

Sure, 0x7FFFFFFF was not meant for the code but for mental arithmetic. I would even add that INT_MAX is system-dependent and varies between specs; I'd suggest INT32_MAX.

Best regards, Andrey Borodin, Octonica & Ural Federal University.
The following review has been posted through the commitfest application:
make installcheck-world: tested, failed
Implements feature: tested, failed
Spec compliant: tested, failed
Documentation: tested, failed

This is a review of the patch "Optimizing numeric SUM() aggregate" by Heikki Linnakangas
https://www.postgresql.org/message-id/flat/c0545351-a467-5b76-6d46-4840d1ea8aa4@iki.fi#c0545351-a467-5b76-6d46-4840d1ea8aa4@iki.fi

This review summarizes all my posts regarding this patch, plus a few additional suggestions.

Contents & Purpose
==================
This patch improves the performance of aggregate computation by delaying carry propagation in the NBASE-digit arbitrary-length arithmetic. This is possible because a 32-bit int is used to store each NBASE digit, where NBASE is 10000. The patch changes the file numeric.c only. The folder containing numeric.c has no README, so all documentation of the patch is done in comments. I consider the documentation sufficient.

Initial Run
===========
The patch applies cleanly to current HEAD. Regression tests pass before and after applying the patch.

Performance
===========
I've tested the patch with this query:

CREATE TABLE avg_test AS SELECT (random() * 999)::decimal(5,2) as d FROM generate_series(1, 1000000) s;

SELECT avg(d) a, avg(d+0) s0, avg(d+1) s1, avg(d+2) s2, avg(d+3) s3, avg(d+4) s4, avg(d+5) s5, avg(d+6) s6, avg(d+7) s7, avg(d+8) s8, avg(d+9) s9 FROM avg_test;

Test specs were: Ubuntu 14 LTS VM, dynamic RAM, hypervisor Windows Server 2016, SSD disk, Core i5-2500. Configuration: out-of-the-box master make.

On 10 test runs, timing of the select statement: AVG 3739.8 ms, STD 435.4193. With the patch applied (as is): 3017.8 ms, STD 319.893.

I suppose this proves the performance impact of the patch. I don't think sum calculation was a common bottleneck, but the patch will certainly give a slight improvement to a very large number of queries.

Suggestions
===========
1. Currently, the carry is propagated every 9999 additions. I suggest that it is possible to do it only every (INT32_MAX - INT32_MAX / NBASE) / (NBASE - 1) additions. Please note that with this carry interval it will be necessary to check whether two finishing carry passes are required.
2. The aggregate and numeric regression tests do not contain any test case covering overflows. I recommend adding a sum of 1 000 000 values of 99 999 999 each; see the sketch after this review. Maybe you will come up with more clever overflow testing.

Conclusion
==========
This patch is important and thorough work, but I'd suggest considering some polishing on the two points listed above.

The new status of this patch is: Waiting on Author
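[Editor's note: one possible shape for such an overflow-covering test, sketched in SQL; this is an illustration of the suggestion, not the test that was eventually committed.]

    -- 1,000,000 copies of 99999999 (digits 9999,9999 in base 10000) force
    -- many lazy-carry propagation passes and a final multi-digit carry.
    SELECT sum(99999999::numeric) FROM generate_series(1, 1000000);
    -- expected result: 99999999000000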
> On 3 Aug 2016, at 22:47, Andrey Borodin <amborodin@acm.org> wrote:
>
> make installcheck-world: tested, failed
> Implements feature: tested, failed
> Spec compliant: tested, failed
> Documentation: tested, failed
>
> This is a review of the patch "Optimizing numeric SUM() aggregate" by Heikki Linnakangas

Oh, I failed to use those checkboxes in the web app. In my review, please read "tested, failed" as "tested, passed". I'm sorry.

Best regards, Andrey Borodin.
On 2016-08-03 17:47:20 +0000, Andrey Borodin wrote:
> I suppose this proves the performance impact of the patch. I don't
> think sum calculation was a common bottleneck, but the patch will
> certainly give a slight improvement to a very large number of queries.

They commonly are in OLAP-style workloads.
On 08/03/2016 08:47 PM, Andrey Borodin wrote:
> 1. Currently, the carry is propagated every 9999 additions. I suggest
> that it is possible to do it only every (INT32_MAX - INT32_MAX /
> NBASE) / (NBASE - 1) additions. Please note that with this carry
> interval it will be necessary to check whether two finishing carry
> passes are required.

I left it at (NBASE - 1), but added a comment on that. I couldn't measure any performance difference between that and a larger interval, so it didn't seem worthwhile. The code wouldn't be much more complicated, but the math required to explain what the safe maximum is, would be :-).

> 2. The aggregate and numeric regression tests do not contain any test
> case covering overflows. I recommend adding a sum of 1 000 000 values
> of 99 999 999 each. Maybe you will come up with more clever overflow
> testing.

Hmm. Added a couple of simple SUM() queries to the regression tests, so that we have at least some test coverage of the overflows. But nothing very clever.

Looking closer, I don't think this implementation actually depends on NBASE being 10000, so I removed the comment changes that suggested so. I did some more minor comment fixes, and committed.

Thanks for the review!

- Heikki