Re: BUG #17964: Missed query planner optimization - Mailing list pgsql-bugs

From: Mathias Kunter
Subject: Re: BUG #17964: Missed query planner optimization
Msg-id: 8ca5637b-32c4-cb17-2de6-10f9248c5857@gmail.com
In response to: Re: BUG #17964: Missed query planner optimization (David Rowley <dgrowleyml@gmail.com>)
Responses: Re: BUG #17964: Missed query planner optimization
List: pgsql-bugs
On 07.06.23 at 14:36, David Rowley wrote:
> On Wed, 7 Jun 2023 at 23:48, Mathias Kunter <mathiaskunter@gmail.com> wrote:
>> If I simply change the original query from this:
>>
>>> SELECT ... WHERE id IN (SELECT ...);
>>
>> into this:
>>
>>> SELECT ... WHERE id = ANY(ARRAY(SELECT ...));
>>
>> then Postgres uses an index scan, and the query is orders of magnitude
>> faster.
>
> What's going on here is that ...

Thank you very much for the explanation, fully understood.

> The main problem here is that in some cases the first of your queries
> will be faster and in other cases the 2nd will be faster. It'll depend
> on how many rows are in each table. So, really to find the fastest
> plan, the planner would have to consider both options.

Yes, considering both options is exactly what I'm suggesting.

> The join search is going to be the slowest part of planning
> when there are a few tables to join, so doing that twice could add
> quite a bit of overhead to the planner.

Yes, but I'd argue that this additional planning overhead has to be put
into perspective. The planning-time overhead will typically be on the
order of milliseconds, and it does not depend on the number of rows in
the tables. The execution-time overhead, in contrast, can be on the
order of minutes, since it does depend on the number of rows involved.
After all, Postgres will unnecessarily execute a sequential scan, and
the scanned table might be gigabytes in size.

> You might also then consider
> how many times you'd need to perform the join search if there were,
> say, 5 IN clauses. To exhaustively find the best plan we'd need to
> try all possible combinations of converting each IN clause to a
> semi-join vs leaving the qual alone.

We wouldn't necessarily have to use an exhaustive search; a heuristic
would already be a major improvement. For example, use the estimated
number of rows returned by the subselect of each IN clause. If this
number is below a certain threshold (which is yet to be defined), then
rewrite the corresponding IN clause to ANY. As a final verification
step, compare the rewritten query against the original query, and only
use the rewritten query if its estimated costs are lower. I think
something like this should be possible with reasonable planner
overhead.

> I'm not all that sure you're likely to see us
> making any improvements here.

I'm very sorry to hear this. Please note that I stumbled upon this
issue through a real-world use case. One of my queries was very slow.
When I looked at the query plan of the affected query, I saw this:

> Planning Time: 1.267 ms
> Execution Time: 69566.632 ms
> cost=3941783.88..3941783.92

Just by changing one IN clause to ANY, I now see this:

> Planning Time: 1.103 ms
> Execution Time: 0.232 ms
> cost=1514.95..1515.62

The query time went from over a minute to about a millisecond! I
couldn't believe it. Here, the overhead of finding the better query
plan would obviously have paid off hugely.

> I suggest just rewriting the query in a
> way that it executes more quickly for you.

Yes, I'm happily doing that now. The problem is that most other
Postgres users don't know that they might have to replace IN with ANY
in order to massively speed up their queries. How would they know? This
certainly comes unexpectedly, and as far as I can see it isn't
documented anywhere either.

Thanks
Mathias
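[For readers hitting the same issue, the rewrite discussed in this thread looks roughly like this. The table and column names below are hypothetical, made up for illustration; whether the rewrite actually wins depends on how many rows the subselect returns, so always compare the plans yourself.]

```sql
-- Original form: the planner converts the IN into a semi-join,
-- which in the reported case ended up as a sequential scan.
SELECT *
FROM orders
WHERE customer_id IN (SELECT id FROM customers WHERE country = 'AT');

-- Rewritten form: ARRAY(SELECT ...) forces the subselect to be
-- evaluated first; the outer qual then becomes a plain
-- "customer_id = ANY(...)" and an index on orders.customer_id
-- can be used.
SELECT *
FROM orders
WHERE customer_id = ANY (ARRAY(SELECT id FROM customers WHERE country = 'AT'));

-- Compare estimated costs and actual timings of both variants:
EXPLAIN ANALYZE
SELECT *
FROM orders
WHERE customer_id = ANY (ARRAY(SELECT id FROM customers WHERE country = 'AT'));
```

Note that the two forms are not always interchangeable: `= ANY(ARRAY(...))` materializes the whole subselect result into an array, which can itself become expensive when the subselect returns many rows.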