Re: IN or EXISTS - Mailing list pgsql-performance
From | Andy Colson |
---|---|
Subject | Re: IN or EXISTS |
Date | |
Msg-id | 4E5E3E5F.4000807@squeakycode.net Whole thread Raw |
In response to | Re: IN or EXISTS (Craig Ringer <ringerc@ringerc.id.au>) |
Responses |
Re: IN or EXISTS
|
List | pgsql-performance |
On 8/30/2011 8:33 PM, Craig Ringer wrote: > On 31/08/2011 4:30 AM, Andy Colson wrote: >> Hi all, >> >> I have read things someplace saying not exists was better than not >> in... or something like that. Not sure if that was for in/exists and >> not in/not exists, and for a lot of records or not. >> > `EXISTS' may perform faster than `IN', yes. Using `IN' it is necessary > to build a list of values then iterate over them to check for a match. > By contrast, `EXISTS' may use a simple index lookup or the like to test > for the presence of a value. > > On the other hand, the `IN' subquery is uncorrelated needs only run > once, where the `EXISTS' subquery is correlated and has to run once for > every outer record. That means that the `IN' list approach can be a lot > faster where the subquery in question is relatively time consuming for > the number of values it returns. For example, if the `IN' query returns > only 5 values and takes 100ms, you're scanning 1 million records in the > outer query, and the subquery `EXISTS' version would take 50ms, using > `IN' is a no-brainer since 1 million times 50ms will be a lot slower > than 1 times 100ms plus the time required to scan 5 elements 1 million > times. > > Another complication is the possible presence of NULL in an IN list. > Getting NULLs in `IN' lists is a common source of questions on this > list, because people are quite surprised by how it works. EXISTS avoids > the NULL handling issue (and in the process demonstrates how woefully > inconsistent SQL's handling of NULL really is). > > Theoretically the query planner could transform: > > SELECT * from y WHERE y.id IN (SELECT DISTINCT z.y_id FROM z WHERE > z.y_id IS NOT NULL); > > into: > > SELECT * FROM y WHERE EXISTS (SELECT 1 FROM z WHERE z.y_id = y.id) > > ... or vice versa depending on which it thought would be faster. AFAIK > it doesn't currently do this. To be able to do it the planner would need > to know how to estimate the cost of scanning an `IN' result list. It'd > also need to be able to use constraints on the target table to prove > that the result of the `IN' may not contain nulls. To transform the > EXISTS version into the IN version where it'd be more efficient, it'd > also have to be able to use constraints on the target table to prove > that results of a SELECT would be unique without explicit deduplication. > > All this makes me wonder ... does Pg currently support sorting IN lists > and using a binary search? It'd be pretty nice to be able to prove that: > > SELECT * from y WHERE y.id IN (SELECT z.y_id FROM z); > > is equvalent to: > > SELECT * FROM y WHERE y.id IN (SELECT DISTINCT z.y_id FROM z WHERE z_id > IS NOT NULL) > > ... and either transform it to an EXISTS test or add an ORDER BY z_id > and flag the resultset as sorted so a binary search could be done on it > whenever a row hits the IN test. > > -- > Craig Ringer > Yeah... my current code uses IN. Most of my updates are small, so my inner list is 500 integers. It runs fine. What I'm worried about is when I update the entire table, so my inner list is 60k integers. Maybe I'm just worrying for naught. I tested a table with 100k rows, ran both with explain analyzes, and they look the same: Delete (cost=11186.26..20817.60 rows=25911 width=12) (actual time=408.138..408.138 rows=0 loops=1) -> Hash Semi Join (cost=11186.26..20817.60 rows=25911 width=12) (actual time=61.997..182.573 rows=105434 loops=1) Hash Cond: (public.general.gid = upd.general.gid) -> Seq Scan on general (cost=0.00..9113.11 rows=25911 width=10) (actual time=0.004..42.364 rows=105434 loops=1) -> Hash (cost=9868.34..9868.34 rows=105434 width=10) (actual time=61.958..61.958 rows=105434 loops=1) Buckets: 16384 Batches: 1 Memory Usage: 4531kB -> Seq Scan on general (cost=0.00..9868.34 rows=105434 width=10) (actual time=0.003..34.372 rows=105434 loops=1) With or without an index, (even if I ANALYZE it) it still does a table scan and builds a hash. Both IN and EXISTS act the same way. I assume: Buckets: 16384 Batches: 1 Memory Usage: 4531kB That means a total of 4.5 meg of ram was used for the hash, so if my work_mem was lower than that it would swap? (or choose a different plan?) I'll only ever be running one update at a time, so I'm not worried about multiple connections running at once. Anyway, I'll just leave it alone (and stop optimizing things that dont need it) -Andy
pgsql-performance by date: