Re: BUG #17964: Missed query planner optimization - Mailing list pgsql-bugs
From: Mathias Kunter
Subject: Re: BUG #17964: Missed query planner optimization
Msg-id: b7440c1d-db4a-fb43-bd85-3a4f02277204@gmail.com
In response to: Re: BUG #17964: Missed query planner optimization (Mathias Kunter <mathiaskunter@gmail.com>)
Responses: Re: BUG #17964: Missed query planner optimization
List: pgsql-bugs
I'm sorry, but I have to bring this up again. As it currently stands, the query planner isn't able to find an adequate query plan for simple real-life queries like the following one, where we simply want to find books by either author name or publisher name:

> SELECT * FROM book WHERE
>   author_id IN (SELECT id FROM author WHERE name = 'some_name') OR
>   publisher_id IN (SELECT id FROM publisher WHERE name = 'some_name');

Complete example here: https://dbfiddle.uk/q6_4NuDX

The issue could be fixed quite easily by implementing a heuristic, the optimized query executes a few THOUSAND times faster, and most people have no clue that they could use ANY(ARRAY()) as a workaround. And still this optimization isn't considered worth implementing?

One of the more complex queries of our project now even executes 300,000 (that is: THREE HUNDRED THOUSAND) times faster with the optimization applied (execution time 72436.509 ms vs. 0.201 ms).

See original query plan here: https://pastebin.com/raw/JsY1PzG3
See optimized query plan here: https://pastebin.com/raw/Xvq7zUY2

So, please consider implementing this optimization. It would be a HUGE performance improvement for a lot of queries used out there.

Thank you.

Mathias
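P.S. For reference, this is the ANY(ARRAY()) workaround applied to the example query above (a sketch only; it assumes the same schema as the dbfiddle example, with indexes on book.author_id and book.publisher_id):

> SELECT * FROM book WHERE
>   author_id = ANY(ARRAY(SELECT id FROM author WHERE name = 'some_name')) OR
>   publisher_id = ANY(ARRAY(SELECT id FROM publisher WHERE name = 'some_name'));

In this form each subquery is evaluated first and its result is passed as an array, which lets the planner use the indexes on book instead of a sequential scan.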
On 07.06.23 at 18:53, Mathias Kunter wrote:
> On 07.06.23 at 14:36, David Rowley wrote:
>> On Wed, 7 Jun 2023 at 23:48, Mathias Kunter <mathiaskunter@gmail.com> wrote:
>>> If I simply change the original query from this:
>>>
>>>> SELECT ... WHERE id IN (SELECT ...);
>>>
>>> into this:
>>>
>>>> SELECT ... WHERE id = ANY(ARRAY(SELECT ...));
>>>
>>> then Postgres uses an index scan, and the query is orders of magnitude faster.
>>
>> What's going on here is that ...
>
> Thank you very much for the explanation, fully understood.
>
>> The main problem here is that in some cases the first of your queries will be faster and in other cases the 2nd will be faster. It'll depend on how many rows are in each table. So, really to find the fastest plan, the planner would have to consider both options.
>
> Yes, considering both options is exactly what I'm suggesting.
>
>> The join search is going to be the slowest part of planning when there are a few tables to join, so doing that twice could add quite a bit of overhead to the planner.
>
> Yes, but I'd argue that we have to put that additional planning overhead into perspective. The planning time overhead will typically be in the order of milliseconds. Also, the additional planning time will not depend on the number of rows within the tables.
>
> In contrast to that, the execution time overhead can be in the order of minutes, since it will depend on the number of rows which are involved. After all, Postgres will unnecessarily execute a sequential scan, and the size of the scanned table might be gigabytes.
>
>> You might also then consider how many times you'd need to perform the join search if there were, say, 5 IN clauses. To exhaustively find the best plan we'd need to try all possible combinations of converting each IN clause to a semi-join vs leaving the qual alone.
>
> We don't necessarily have to use exhaustive search, a heuristic would already be a major improvement.
>
> For example, use the estimated number of rows returned by the subselect of each IN clause. If this number is below a certain threshold (which is yet to be defined), then rewrite the corresponding IN clause to ANY. As a final verification step, check the rewritten query against the original query, and only use the rewritten query if the estimated costs are lower.
>
> I think something like this should be possible to do with reasonable planner overhead.
>
>> I'm not all that sure you're likely to see us making any improvements here.
>
> I'm very sorry to hear this. Please note that I stumbled upon this issue through a real-world use case. One of my queries was very slow. I looked at the query plan of the affected query, and I saw this:
>
>> Planning Time: 1.267 ms
>> Execution Time: 69566.632 ms
>> cost=3941783.88..3941783.92
>
> Just by changing one IN clause to ANY, I now see this:
>
>> Planning Time: 1.103 ms
>> Execution Time: 0.232 ms
>> cost=1514.95..1515.62
>
> The query time went from over a minute to about a millisecond! I couldn't believe it. Here, the overhead to find the better query plan would obviously have paid off hugely.
>
>> I suggest just rewriting the query in a way that it executes more quickly for you.
>
> Yes, I'm happily doing that now.
>
> The problem is that most other Postgres users don't know that they might have to replace IN with ANY in order to massively speed up their queries. How should they know? This certainly comes unexpected, and this also isn't documented anywhere, as far as I can see.
>
> Thanks
>
> Mathias
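P.P.S. The cost comparison mentioned above (keep the rewrite only if its estimated cost is lower) can be reproduced manually today. This is a sketch against the example schema, comparing both forms with EXPLAIN:

> EXPLAIN (ANALYZE, BUFFERS)
> SELECT * FROM book WHERE
>   author_id IN (SELECT id FROM author WHERE name = 'some_name') OR
>   publisher_id IN (SELECT id FROM publisher WHERE name = 'some_name');
>
> EXPLAIN (ANALYZE, BUFFERS)
> SELECT * FROM book WHERE
>   author_id = ANY(ARRAY(SELECT id FROM author WHERE name = 'some_name')) OR
>   publisher_id = ANY(ARRAY(SELECT id FROM publisher WHERE name = 'some_name'));

The linked pastebin plans show the kind of difference to expect: a sequential scan with a hashed subplan filter for the first form, and index scans driven by the array for the second.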