Home > mailing lists

Re: IN or EXISTS - Mailing list pgsql-performance

From	Andy Colson
Subject	Re: IN or EXISTS
Date	August 31, 2011 10:59:29
Msg-id	4E5E3E5F.4000807@squeakycode.net Whole thread Raw
In response to	Re: IN or EXISTS (Craig Ringer <ringerc@ringerc.id.au>)
Responses	Re: IN or EXISTS
List	pgsql-performance

Tree view

On 8/30/2011 8:33 PM, Craig Ringer wrote:
> On 31/08/2011 4:30 AM, Andy Colson wrote:
>> Hi all,
>>
>> I have read things someplace saying not exists was better than not
>> in... or something like that. Not sure if that was for in/exists and
>> not in/not exists, and for a lot of records or not.
>>
> `EXISTS' may perform faster than `IN', yes. Using `IN' it is necessary
> to build a list of values then iterate over them to check for a match.
> By contrast, `EXISTS' may use a simple index lookup or the like to test
> for the presence of a value.
>
> On the other hand, the `IN' subquery is uncorrelated needs only run
> once, where the `EXISTS' subquery is correlated and has to run once for
> every outer record. That means that the `IN' list approach can be a lot
> faster where the subquery in question is relatively time consuming for
> the number of values it returns. For example, if the `IN' query returns
> only 5 values and takes 100ms, you're scanning 1 million records in the
> outer query, and the subquery `EXISTS' version would take 50ms, using
> `IN' is a no-brainer since 1 million times 50ms will be a lot slower
> than 1 times 100ms plus the time required to scan 5 elements 1 million
> times.
>
> Another complication is the possible presence of NULL in an IN list.
> Getting NULLs in `IN' lists is a common source of questions on this
> list, because people are quite surprised by how it works. EXISTS avoids
> the NULL handling issue (and in the process demonstrates how woefully
> inconsistent SQL's handling of NULL really is).
>
> Theoretically the query planner could transform:
>
> SELECT * from y WHERE y.id IN (SELECT DISTINCT z.y_id FROM z WHERE
> z.y_id IS NOT NULL);
>
> into:
>
> SELECT * FROM y WHERE EXISTS (SELECT 1 FROM z WHERE z.y_id = y.id)
>
> ... or vice versa depending on which it thought would be faster. AFAIK
> it doesn't currently do this. To be able to do it the planner would need
> to know how to estimate the cost of scanning an `IN' result list. It'd
> also need to be able to use constraints on the target table to prove
> that the result of the `IN' may not contain nulls. To transform the
> EXISTS version into the IN version where it'd be more efficient, it'd
> also have to be able to use constraints on the target table to prove
> that results of a SELECT would be unique without explicit deduplication.
>
> All this makes me wonder ... does Pg currently support sorting IN lists
> and using a binary search? It'd be pretty nice to be able to prove that:
>
> SELECT * from y WHERE y.id IN (SELECT z.y_id FROM z);
>
> is equvalent to:
>
> SELECT * FROM y WHERE y.id IN (SELECT DISTINCT z.y_id FROM z WHERE z_id
> IS NOT NULL)
>
> ... and either transform it to an EXISTS test or add an ORDER BY z_id
> and flag the resultset as sorted so a binary search could be done on it
> whenever a row hits the IN test.
>
> --
> Craig Ringer
>

Yeah... my current code uses IN.  Most of my updates are small, so my
inner list is 500 integers.  It runs fine.  What I'm worried about is
when I update the entire table, so my inner list is 60k integers.  Maybe
I'm just worrying for naught.  I tested a table with 100k rows, ran both
with explain analyzes, and they look the same:

  Delete  (cost=11186.26..20817.60 rows=25911 width=12) (actual
time=408.138..408.138 rows=0 loops=1)
    ->  Hash Semi Join  (cost=11186.26..20817.60 rows=25911 width=12)
(actual time=61.997..182.573 rows=105434 loops=1)
          Hash Cond: (public.general.gid = upd.general.gid)
          ->  Seq Scan on general  (cost=0.00..9113.11 rows=25911
width=10) (actual time=0.004..42.364 rows=105434 loops=1)
          ->  Hash  (cost=9868.34..9868.34 rows=105434 width=10) (actual
time=61.958..61.958 rows=105434 loops=1)
                Buckets: 16384  Batches: 1  Memory Usage: 4531kB
                ->  Seq Scan on general  (cost=0.00..9868.34 rows=105434
width=10) (actual time=0.003..34.372 rows=105434 loops=1)

With or without an index, (even if I ANALYZE it) it still does a table
scan and builds a hash.  Both IN and EXISTS act the same way.

I assume:
Buckets: 16384  Batches: 1  Memory Usage: 4531kB

That means a total of 4.5 meg of ram was used for the hash, so if my
work_mem was lower than that it would swap?  (or choose a different plan?)

I'll only ever be running one update at a time, so I'm not worried about
multiple connections running at once.

Anyway, I'll just leave it alone (and stop optimizing things that dont
need it)

-Andy

pgsql-performance by date:

From: "Kevin Grittner"
Date: 31 August 2011, 09:40:46
Subject: Re: Query performance issue

From: "Tomas Vondra"
Date: 31 August 2011, 11:19:45
Subject: Re: IN or EXISTS

Re: IN or EXISTS - Mailing list pgsql-performance

Previous

Next