Re: BUG #17964: Missed query planner optimization - Mailing list pgsql-bugs
From: Mathias Kunter
Subject: Re: BUG #17964: Missed query planner optimization
Msg-id: b7440c1d-db4a-fb43-bd85-3a4f02277204@gmail.com
In response to: Re: BUG #17964: Missed query planner optimization (Mathias Kunter <mathiaskunter@gmail.com>)
Responses: Re: BUG #17964: Missed query planner optimization
List: pgsql-bugs
I'm sorry, but I have to bring this up again. As it currently stands, the query planner isn't able to find an adequate query plan for simple real-life queries like the following one, where we simply want to find books by either author name or publisher name:

> SELECT * FROM book WHERE
>   author_id IN (SELECT id FROM author WHERE name = 'some_name') OR
>   publisher_id IN (SELECT id FROM publisher WHERE name = 'some_name');

Complete example here: https://dbfiddle.uk/q6_4NuDX

The issue could be fixed quite easily by implementing a heuristic, the optimized query executes a few THOUSAND times faster, and most people have no clue that they could use ANY(ARRAY()) as a workaround. And still this optimization isn't considered worth implementing?

One of the more complex queries of our project now even executes 300,000 (that is: THREE HUNDRED THOUSAND) times faster with the optimization applied (execution time 72436.509 ms vs. 0.201 ms).

See original query plan here: https://pastebin.com/raw/JsY1PzG3
See optimized query plan here: https://pastebin.com/raw/Xvq7zUY2

So, please consider implementing this optimization. It would be a HUGE performance improvement for a lot of queries used out there.

Thank you.

Mathias
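P.S. For reference, this is the ANY(ARRAY()) workaround applied to the example query above (a sketch only; it assumes the same schema as the dbfiddle example, with indexes on book.author_id and book.publisher_id):

> SELECT * FROM book WHERE
>   author_id = ANY(ARRAY(SELECT id FROM author WHERE name = 'some_name')) OR
>   publisher_id = ANY(ARRAY(SELECT id FROM publisher WHERE name = 'some_name'));

In this form each subquery is evaluated first and its result is passed as an array, which lets the planner use the indexes on book instead of a sequential scan.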
On 07.06.23 at 18:53, Mathias Kunter wrote:
> On 07.06.23 at 14:36, David Rowley wrote:
>> On Wed, 7 Jun 2023 at 23:48, Mathias Kunter <mathiaskunter@gmail.com> wrote:
>>> If I simply change the original query from this:
>>>
>>>> SELECT ... WHERE id IN (SELECT ...);
>>>
>>> into this:
>>>
>>>> SELECT ... WHERE id = ANY(ARRAY(SELECT ...));
>>>
>>> then Postgres uses an index scan, and the query is orders of magnitude faster.
>>
>> What's going on here is that ...
>
> Thank you very much for the explanation, fully understood.
>
>> The main problem here is that in some cases the first of your queries will be faster and in other cases the 2nd will be faster. It'll depend on how many rows are in each table. So, really to find the fastest plan, the planner would have to consider both options.
>
> Yes, considering both options is exactly what I'm suggesting.
>
>> The join search is going to be the slowest part of planning when there are a few tables to join, so doing that twice could add quite a bit of overhead to the planner.
>
> Yes, but I'd argue that we have to put that additional planning overhead into perspective. The planning time overhead will typically be in the order of milliseconds. Also, the additional planning time will not depend on the number of rows within the tables.
>
> In contrast to that, the execution time overhead can be in the order of minutes, since it will depend on the number of rows which are involved. After all, Postgres will unnecessarily execute a sequential scan, and the size of the scanned table might be gigabytes.
>
>> You might also then consider how many times you'd need to perform the join search if there were, say, 5 IN clauses. To exhaustively find the best plan we'd need to try all possible combinations of converting each IN clause to a semi-join vs leaving the qual alone.
>
> We don't necessarily have to use exhaustive search, a heuristic would already be a major improvement.
>
> For example, use the estimated number of rows returned by the subselect of each IN clause. If this number is below a certain threshold (which is yet to be defined), then rewrite the corresponding IN clause to ANY. As a final verification step, check the rewritten query against the original query, and only use the rewritten query if the estimated costs are lower.
>
> I think something like this should be possible to do with reasonable planner overhead.
>
>> I'm not all that sure you're likely to see us making any improvements here.
>
> I'm very sorry to hear this. Please note that I stumbled upon this issue through a real-world use case. One of my queries was very slow. I looked at the query plan of the affected query, and I saw this:
>
>> Planning Time: 1.267 ms
>> Execution Time: 69566.632 ms
>> cost=3941783.88..3941783.92
>
> Just by changing one IN clause to ANY, I now see this:
>
>> Planning Time: 1.103 ms
>> Execution Time: 0.232 ms
>> cost=1514.95..1515.62
>
> The query time went from over a minute to about a millisecond! I couldn't believe it. Here, the overhead to find the better query plan would obviously have paid off hugely.
>
>> I suggest just rewriting the query in a way that it executes more quickly for you.
>
> Yes, I'm happily doing that now.
>
> The problem is that most other Postgres users don't know that they might have to replace IN with ANY in order to massively speed up their queries. How should they know? This certainly comes unexpected, and this also isn't documented anywhere, as far as I can see.
>
> Thanks
>
> Mathias
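P.P.S. The cost comparison mentioned above (keep the rewrite only if its estimated cost is lower) can be reproduced manually today. This is a sketch against the example schema, comparing both forms with EXPLAIN:

> EXPLAIN (ANALYZE, BUFFERS)
> SELECT * FROM book WHERE
>   author_id IN (SELECT id FROM author WHERE name = 'some_name') OR
>   publisher_id IN (SELECT id FROM publisher WHERE name = 'some_name');
>
> EXPLAIN (ANALYZE, BUFFERS)
> SELECT * FROM book WHERE
>   author_id = ANY(ARRAY(SELECT id FROM author WHERE name = 'some_name')) OR
>   publisher_id = ANY(ARRAY(SELECT id FROM publisher WHERE name = 'some_name'));

The linked pastebin plans show the kind of difference to expect: a sequential scan with a hashed subplan filter for the first form, and index scans driven by the array for the second.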