Re: The Future of Aggregation - Mailing list pgsql-hackers
From | Kevin Grittner |
---|---|
Subject | Re: The Future of Aggregation |
Date | |
Msg-id | 2120479484.7962669.1433861571649.JavaMail.yahoo@mail.yahoo.com Whole thread Raw |
In response to | The Future of Aggregation (David Rowley <david.rowley@2ndquadrant.com>) |
Responses |
Re: The Future of Aggregation
Re: The Future of Aggregation Re: The Future of Aggregation |
List | pgsql-hackers |
David Rowley <david.rowley@2ndquadrant.com> wrote: > It appears to me that there's quite a few new features and > optimisations on the not too distant horizon which will require > adding yet more fields into pg_aggregate. > > These are things along the lines of: > 3. Auto-updating Materialized views (ones which contain aggregate > functions) Yes, that's certainly on the road map. The recent work to add support for inverse transition functions already goes a huge way toward allowing optimization of incremental maintenance of aggregates in materialized views. > Items 1-4 above I believe require support of "Aggregate State > Combine Support" -> https://commitfest.postgresql.org/5/131/ Yeah, that seems likely to extend optimized cases even further. > 5. Dependant Aggregates > Item 5 makes items 1-4 a bit more complex as with this item > there's opportunity for very good performance improvements by > allowing aggregates like AVG(x) also perform all the required > work to allow SUM(x) and COUNT(x) to be calculated for "free" in > a query containing all 3 aggregates. Not only CPU is saved, but the optimizations for materialized views would require the aggregate function's transition state to be saved in each row, and the duplicate state information among these functions would be a waste of space. > I've discussed item 5 off-list with Simon and he mentioned that > we might invent a transition state and transition functions which > can have parts switched on and off much like how calcSumX2 > controls if do_numeric_accum() should calculate sum(x*x) or not. > The problem with this is that if we ever want to store aggregate > states in an auto-updating materialized view, then this generic > aggregate state will have to contain at least 3 fields (to store > count(x), sum(x) and sum(x*x)), and those 3 fields would have to > be stored even if the aggregate was just a simple count(*). Yeah, I think we want to preserve the ability of count() to have a simple state, and implement dependent aggregates as discussed in the other thread -- where (as I understood it) having sum(x), count(x), and avg(x) in a query would avoid the row-by-row work for sum(x) and count(x), and just invoke a final function to extract those values from the transition state of the avg(x) aggregate. I see incremental maintenance of materialized views taking advantage of the same sort of behavior, only maintaining the state for avg(x) during incremental maintenance and *at the end* pulling the values for sum(x) and count(x) out of that. > The idea I discussed in the link in item 5 above gets around this > problem, but it's a perhaps more surprise filled implementation > as it will mean "select avg(x),sum(x),count(x) from t" is > actually faster than "select sum(x),count(x) from t" as the agg > state for avg() will satisfy sum and count too. I'm skeptical that it will be noticeably faster. It's easy to see why this optimization will make a query *with all three* faster, but I would not expect the process of accumulating the sum and count to be about the same speed whether performed by one transition function or two. Of course I could be persuaded by a benchmark showing otherwise. > The purpose of this email is to open the doors for discussion > about this so that nobody goes off and develops feature X into a > corner and disallows feature Y and so that we end up with the > most optimal solution that does not trip us up in the future. That laudable. So far I don't see anything that will do anything but make the materialized view maintenance easier, at least if dependent aggregates are implemented as described in the other thread. > I'm interested to hear if Kevin or Amit have had any time to give > this any thought before. It would be good to ensure we all have > the same vision here. It sounds great to me. I had thought about the need to deal with these issues at some point to allow optimizations to the incremental maintenance of materialized views (there needs to be a fall-back of recalculating from scratch where such optimizations are not possible, but people will not be happy with the performance of that for simple cases where it is intuitively clear that we could do better). You have developed the ideas farther than I have, and (at least on a first review) I like what I'm seeing. There may be some devils in the details down the road, but I would say that what you're doing looks like it will dovetail nicely with what's on my road map. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: