Re: POC: GROUP BY optimization - Mailing list pgsql-hackers
From | Alexander Korotkov |
---|---|
Subject | Re: POC: GROUP BY optimization |
Date | |
Msg-id | CAPpHfdvyWLMGwvxaf=7KAp-z-4mxbSH8ti2f6mNOQv5metZFzg@mail.gmail.com Whole thread Raw |
In response to | Re: POC: GROUP BY optimization (Alexander Korotkov <aekorotkov@gmail.com>) |
Responses |
Re: POC: GROUP BY optimization
Re: POC: GROUP BY optimization |
List | pgsql-hackers |
Hi! On Thu, Apr 18, 2024 at 1:57 PM Alexander Korotkov <aekorotkov@gmail.com> wrote: > Thank you for the fixes you've proposed. I didn't look much into > details yet, but I think the main concern Tom expressed in [1] is > whether the feature is reasonable at all. I think at this stage the > most important thing is to come up with convincing examples showing > how huge performance benefits it could cause. I will return to this > later today and will try to provide some convincing examples. I took a fresh look at 0452b461b, and have the following thoughts: 1) Previously, preprocess_groupclause() tries to reorder GROUP BY clauses to match the required ORDER BY order. It only reorders if GROUP BY pathkeys are the prefix of ORDER BY pathkeys or vice versa. So, both of them need to be satisfied by one ordering. 0452b461b also tries to match GROUP BY clauses to ORDER BY clauses, but takes into account an incremental sort. Actually, instead of removing preprocess_groupclause(), we could just teach it to take incremental sort into account. 2) The get_useful_group_keys_orderings() function takes into account 3 orderings of pathkeys and clauses: original order as written in GROUP BY, matching ORDER BY clauses as much as possible, and matching the input path as much as possible. Given that even before 0452b461b, preprocess_groupclause() could change the original order of GROUP BY clauses, so we don't need to distinguish the first two. We could just consider output of new preprocess_groupclause() taking into account an incremental sort and the ordering matching the input path as much as possible. This seems like significant simplification. Let me also describe my thoughts about the justification of the feature itself. As Tom pointed upthread, Sort + Grouping is generally unlikely faster than Hash Aggregate. The main point of this feature is being able to match the order of GROUP BY clauses to the order of the input path. That input path could be Index Scan or Subquery. Let's concentrate on Index Scan. Index Scan can give us the required order, so we can do grouping without Sort or with significantly cheaper Incremental Sort. That could be much faster than Hash Aggregate. But if we scan the full table (or its significant fraction), Index Scan is typically much more expensive than Sequential Scan because of random page access. However, there are cases when Index Scan could be cheaper. 1) If the heap row is wide and the index contains all the required columns, Index Only Scan can be cheaper than Sequential Scan because of lesser volume. 2) If the query predicate is selective enough and matches the index, Index Scan might be significantly cheaper. One may say that if the query predicate is selective enough then there are not that many matching rows, so aggregating them either way isn't a big deal anyway. However, given that data volumes are growing tremendously, it's not hard to imagine that the index selected a small fraction of table rows, but they are still enough to be costly for aggregating. Therefore, I think there are significant cases where matching GROUP BY clauses to the order of the input path could give a substantial improvement over Hash Aggregate. While there are some particular use-cases by Jian He, I hope that above could give some rationale. ------ Regards, Alexander Korotkov
pgsql-hackers by date: