Thread: NOT IN clause performing badly

NOT IN clause performing badly

From

Christian Rudow

Date:

26 July 1999, 05:09:33

I have a problem with NOT IN queries
------------------------------------
Has anyone experinced the same problem, or even found a solution to it ?

-- simple NOT IN clause does not perform
-- create a table with an integer primary key and any number and type  of attributes.
-- load approx. 10'000 records and vacuum;
-- then select a subset of say 8000 id's into a seperate table.
-- now try to find out the Id's of the remaining 2000 records

-- I would use a NOT IN clause
-- but this is what happens :

drop table tempx1;

select mytable.id      into table tempx1      from mytable, othertable      where mytable.id = othertable.id      and
othertable.someattribute= "Somevalue";
 

-- this will take only a few seconds
-- on platform : PG6.4 SusE 6.0 PII/300mhz
-- rdbms startup : su - postgres -c with parameters :
-- "/sbin/startproc -l $LOGFILE $H -B 256 -i -o -F -D$DATADIR"

-- select count(*) from tempx1 : now returns 8000 rows

select id from mytable      where id NOT IN (      select id from tempx1      );

-- this will take some 100 minutes at least !!!
-- but it does finish ! no crash or hangup.
-- NOTICE:  QUERY PLAN:
-- Seq Scan on mytable  (cost=958.89 size=10000 width=4)
--   SubPlan
--     ->  Seq Scan on tempx1  (cost=0.00 size=0 width=4)
--
-- an index on mytbale(id) exists.

Questions
---------
Is this a known problem ?
What is causing the problem ?
Is it a problem of RDBMS parameters ?
Is there an alternative to the NOT IN clause ?


I'd appreciate any help in this matter.
ThanX
Chris
-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Christian Rudow                 E-Mail: Christian.Rudow@thinx.ch
ThinX networked business services    Stahlrain 10, CH-5200 Brugg
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Re: [SQL] NOT IN clause performing badly

From

Herouth Maoz

Date:

26 July 1999, 07:12:34

At 12:09 +0300 on 26/07/1999, Christian Rudow wrote:

> Questions
> ---------
> Is this a known problem ?
> What is causing the problem ?
> Is it a problem of RDBMS parameters ?
> Is there an alternative to the NOT IN clause ?

NOT IN is an inefficient way of running such a query. I would suggest:

- After creating tempx1, create an index on its id field
- Use the following instead of NOT IN:

select id from mytable      where NOT EXISTS (      select * from tempx1    where tempx1.id = mytable.id      );

What is the difference?

In the NOT IN version, for each of the 10000 ids in mytable, it has to
search through 8000 records to see whether the current mytable.id is there
or not. This would give you 80000000 operations. Especially since there is
no index on tempx1.

The EXISTS query I gave uses the index on tempx1.id for each of the 10000
ids. Thus, you have 10000 * log 8000, more or less.

But you have to count in the penalty of creating the index on tempx1.id in
the first place.

Herouth

--
Herouth Maoz, Internet developer.
Open University of Israel - Telem project
http://telem.openu.ac.il/~herutma