Re: BUG #17964: Missed query planner optimization - Mailing list pgsql-bugs

From: Mathias Kunter
Subject: Re: BUG #17964: Missed query planner optimization
Msg-id: 8ca5637b-32c4-cb17-2de6-10f9248c5857@gmail.com
In response to: Re: BUG #17964: Missed query planner optimization (David Rowley <dgrowleyml@gmail.com>)
Responses: Re: BUG #17964: Missed query planner optimization
List: pgsql-bugs
On 07.06.23 at 14:36, David Rowley wrote:
> On Wed, 7 Jun 2023 at 23:48, Mathias Kunter <mathiaskunter@gmail.com> wrote:
>> If I simply change the original query from this:
>>
>>> SELECT ... WHERE id IN (SELECT ...);
>>
>> into this:
>>
>>> SELECT ... WHERE id = ANY(ARRAY(SELECT ...));
>>
>> then Postgres uses an index scan, and the query is orders of magnitude
>> faster.
>
> What's going on here is that ...

Thank you very much for the explanation, fully understood.

> The main problem here is that in some cases the first of your queries
> will be faster and in other cases the 2nd will be faster. It'll depend
> on how many rows are in each table. So, really to find the fastest
> plan, the planner would have to consider both options.

Yes, considering both options is exactly what I'm suggesting.

> The join search is going to be the slowest part of planning
> when there are a few tables to join, so doing that twice could add
> quite a bit of overhead to the planner.

Yes, but I'd argue that this additional planning overhead has to be put
into perspective. The planning-time overhead will typically be on the
order of milliseconds, and it does not depend on the number of rows in
the tables. The execution-time overhead, in contrast, can be on the
order of minutes, since it does depend on the number of rows involved.
After all, Postgres will unnecessarily execute a sequential scan, and
the scanned table might be gigabytes in size.

> You might also then consider
> how many times you'd need to perform the join search if there were,
> say, 5 IN clauses. To exhaustively find the best plan we'd need to
> try all possible combinations of converting each IN clause to a
> semi-join vs leaving the qual alone.

We wouldn't necessarily have to use an exhaustive search; a heuristic
would already be a major improvement. For example, use the estimated
number of rows returned by the subselect of each IN clause. If this
number is below a certain threshold (which is yet to be defined), then
rewrite the corresponding IN clause to ANY. As a final verification
step, compare the rewritten query against the original query, and only
use the rewritten query if its estimated costs are lower. I think
something like this should be possible with reasonable planner
overhead.

> I'm not all that sure you're likely to see us
> making any improvements here.

I'm very sorry to hear this. Please note that I stumbled upon this
issue through a real-world use case. One of my queries was very slow.
When I looked at the query plan of the affected query, I saw this:

> Planning Time: 1.267 ms
> Execution Time: 69566.632 ms
> cost=3941783.88..3941783.92

Just by changing one IN clause to ANY, I now see this:

> Planning Time: 1.103 ms
> Execution Time: 0.232 ms
> cost=1514.95..1515.62

The query time went from over a minute to about a millisecond! I
couldn't believe it. Here, the overhead of finding the better query
plan would obviously have paid off hugely.

> I suggest just rewriting the query in a
> way that it executes more quickly for you.

Yes, I'm happily doing that now. The problem is that most other
Postgres users don't know that they might have to replace IN with ANY
in order to massively speed up their queries. How would they know? This
certainly comes unexpectedly, and as far as I can see it isn't
documented anywhere either.

Thanks
Mathias
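[For readers hitting the same issue, the rewrite discussed in this thread looks roughly like this. The table and column names below are hypothetical, made up for illustration; whether the rewrite actually wins depends on how many rows the subselect returns, so always compare the plans yourself.]

```sql
-- Original form: the planner converts the IN into a semi-join,
-- which in the reported case ended up as a sequential scan.
SELECT *
FROM orders
WHERE customer_id IN (SELECT id FROM customers WHERE country = 'AT');

-- Rewritten form: ARRAY(SELECT ...) forces the subselect to be
-- evaluated first; the outer qual then becomes a plain
-- "customer_id = ANY(...)" and an index on orders.customer_id
-- can be used.
SELECT *
FROM orders
WHERE customer_id = ANY (ARRAY(SELECT id FROM customers WHERE country = 'AT'));

-- Compare estimated costs and actual timings of both variants:
EXPLAIN ANALYZE
SELECT *
FROM orders
WHERE customer_id = ANY (ARRAY(SELECT id FROM customers WHERE country = 'AT'));
```

Note that the two forms are not always interchangeable: `= ANY(ARRAY(...))` materializes the whole subselect result into an array, which can itself become expensive when the subselect returns many rows.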