Thread: NOT IN clause performing badly
I have a problem with NOT IN queries ------------------------------------ Has anyone experinced the same problem, or even found a solution to it ? -- simple NOT IN clause does not perform -- create a table with an integer primary key and any number and type of attributes. -- load approx. 10'000 records and vacuum; -- then select a subset of say 8000 id's into a seperate table. -- now try to find out the Id's of the remaining 2000 records -- I would use a NOT IN clause -- but this is what happens : drop table tempx1; select mytable.id into table tempx1 from mytable, othertable where mytable.id = othertable.id and othertable.someattribute= "Somevalue"; -- this will take only a few seconds -- on platform : PG6.4 SusE 6.0 PII/300mhz -- rdbms startup : su - postgres -c with parameters : -- "/sbin/startproc -l $LOGFILE $H -B 256 -i -o -F -D$DATADIR" -- select count(*) from tempx1 : now returns 8000 rows select id from mytable where id NOT IN ( select id from tempx1 ); -- this will take some 100 minutes at least !!! -- but it does finish ! no crash or hangup. -- NOTICE: QUERY PLAN: -- Seq Scan on mytable (cost=958.89 size=10000 width=4) -- SubPlan -- -> Seq Scan on tempx1 (cost=0.00 size=0 width=4) -- -- an index on mytbale(id) exists. Questions --------- Is this a known problem ? What is causing the problem ? Is it a problem of RDBMS parameters ? Is there an alternative to the NOT IN clause ? I'd appreciate any help in this matter. ThanX Chris -- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Christian Rudow E-Mail: Christian.Rudow@thinx.ch ThinX networked business services Stahlrain 10, CH-5200 Brugg ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
At 12:09 +0300 on 26/07/1999, Christian Rudow wrote: > Questions > --------- > Is this a known problem ? > What is causing the problem ? > Is it a problem of RDBMS parameters ? > Is there an alternative to the NOT IN clause ? NOT IN is an inefficient way of running such a query. I would suggest: - After creating tempx1, create an index on its id field - Use the following instead of NOT IN: select id from mytable where NOT EXISTS ( select * from tempx1 where tempx1.id = mytable.id ); What is the difference? In the NOT IN version, for each of the 10000 ids in mytable, it has to search through 8000 records to see whether the current mytable.id is there or not. This would give you 80000000 operations. Especially since there is no index on tempx1. The EXISTS query I gave uses the index on tempx1.id for each of the 10000 ids. Thus, you have 10000 * log 8000, more or less. But you have to count in the penalty of creating the index on tempx1.id in the first place. Herouth -- Herouth Maoz, Internet developer. Open University of Israel - Telem project http://telem.openu.ac.il/~herutma