Re: Eager aggregation, take 3 - Mailing list pgsql-hackers
From: Richard Guo
Subject: Re: Eager aggregation, take 3
Msg-id: CAMbWs4-sEi3GeZA2trW87OMDuPhJQ6QhZ-KPMNFdCqC6BM+UMQ@mail.gmail.com
In response to: Re: Eager aggregation, take 3 (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On Fri, Sep 5, 2025 at 11:50 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Like Matheus, I think a GUC is reasonable. A significant danger here
> appears to be the possibility of a performance cliff, where queries
> are optimized very differently when the ratio is 9.99 vs. 10.01, say.
> It would be nice if there were some way to mitigate that danger, but
> at least a GUC avoids chaining the performance of the whole system to
> a hard-coded value.

Yeah, I think the performance cliff issue does exist. It might be
mitigated by carefully selecting the threshold value, so that small
differences in the average group size near the boundary don't cause
big performance swings with and without eager aggregation, but that
doesn't seem like an easy task.

How is this issue avoided for other thresholds? For example, with
min_parallel_table_scan_size, is there a performance cliff when the
table size is 7.99MB vs. 8.01MB, where a parallel scan is considered
in the latter case but not the former?

> It might be worth considering whether there are heuristics other than
> the group size that could help here. Possibly that's just making
> things more complicated to no benefit. It seems to me, for example,
> that reducing 100 rows to 10 is quite different from reducing a
> million rows to 100,000. On the whole, the latter seems more likely
> to work out well, but it's tricky, because the effort expended per
> group can be arbitrarily high. I think we do want to let the cost
> model make most of the decisions, and just use this threshold to
> prune ideas that are obviously bad at an early stage. That said,
> it's worth thinking about how this interacts with the
> just-considered-one-eager-agg strategy. Does this threshold apply
> before or after that rule?

If I understand correctly, this means that we need to explore each
join level to find the optimal position for applying partial
aggregation. For example, suppose Agg(B) reduces 100 rows to 10,
while Agg(A JOIN B) reduces a million rows to 100,000; it might then
be better to apply partial aggregation at the (A JOIN B) level rather
than just over B. However, that's not always the case: the Agg(B)
option reduces the number of rows fed into the join earlier, and can
therefore outperform the Agg(A JOIN B) approach. So we would need to
consider both options and compare their costs. This is actually what
the patch used to do before I introduced the always-push-to-lowest
heuristic.

> For instance, consider AGG(FACT_TABLE JOIN DIMENSION_TABLE), like a
> count of orders grouped by customer name. Aggregating on the
> dimension table (in this case, the list of customers) is probably
> useless, but aggregating on the join column of the fact table has a
> good chance of being useful. If we consider only one of those
> strategies, we want it to be the right one. This threshold could be
> the thing that helps us to get it right.

Now I see what you meant. However, in the current implementation, we
only push partial aggregation down to relations that contain all the
aggregation columns. So, in the case you mentioned, if the
aggregation columns come from the dimension table, unfortunately we
don't have the option to partially aggregate the fact table.

The paper does discuss several other transformations, such as "Eager
Count", "Double Eager", and "Eager Split", that can perform partial
aggregation on relations that don't contain the aggregation columns,
or even on both sides of the join. However, those are beyond the
scope of this patch.

- Richard
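To make the fact/dimension example above concrete, here is a rough SQL
sketch of the rewrite being discussed. The table and column names
(orders, customers, customer_id, name) are hypothetical, and this shows
the general eager-aggregation transformation rather than the exact plan
shape the patch produces; it assumes customers.id is unique, so both
forms return the same result.

    -- Original query: count of orders grouped by customer name.
    SELECT c.name, count(*)
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    GROUP BY c.name;

    -- Eager aggregation on the fact table's join column: pre-aggregate
    -- orders by customer_id before the join, then combine the partial
    -- counts afterwards. This is the rewrite Robert describes; per the
    -- reply above, the current patch cannot apply it here because the
    -- grouping column (c.name) comes from the dimension table.
    SELECT c.name, sum(o.cnt)
    FROM (SELECT customer_id, count(*) AS cnt
          FROM orders
          GROUP BY customer_id) o
    JOIN customers c ON o.customer_id = c.id
    GROUP BY c.name;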