Interesting Query Performance Question - Mailing list pgsql-admin
From | Jonathan Hoover |
---|---|
Subject | Interesting Query Performance Question |
Date | |
Msg-id | 68C107706843A54BA7D9CAEF123ADF05010346CAE6@SP2-EX07VS03.ds.corp.yahoo.com Whole thread Raw |
Responses |
Re: Interesting Query Performance Question
|
List | pgsql-admin |
Any explanation for this is appreciated... I had a query which just seemed to peg the processor. I decided to let it go all night, and after 16 hours, it was stillstuck at 100% processor usage. I figured something must be wrong, so I tried a few things. Using the following logic,I was able to get the results I was looking for in about 1 minute (instead of 16+ hours of no result). It all camedown to getting around the use of "NOT IN (subquery)" where "subquery" has 20,000,000 rows. "IN (subquery)" was quickas could be. Here's what I found. Tables: ------------------------------ | a | b | --------------|--------------- |guid vc(32) | guid vc(32) | |unitid vc(15)| unitid vc(15)| |ftime ts | | |source vc(10)| | ------------------------------ |65,000,000 rw| 500,000 rows | ------------------------------ Here is the query that hung up for 16 hours: A: SELECT unitid,count(guid) AS total FROM b WHERE guid NOT IN (SELECT DISTINCT guid FROM a) AND unitid IN (SELECT DISTINCT unitid FROM a) GROUP BY unitid ORDER BY total DESC; I thought I'd narrow down a.guid and a.unitid and substitute these for the subqueries. B: SELECT DISTINCT unitid INTO a_u_unitid FROM a; ( 1,082 rows) C: SELECT DISTINCT guid INTO a_u_guid FROM a; (20,000,000 rows) That didn't help on its own. So I decided to try the query without the "NOT". It finished in a minute or two. Granted, theresult is the opposite of what I wanted, but at least I knew the major source of the problem. I still had the substitutedqueries B & C in, so it looked like this: D: SELECT unitid,count(guid) AS total FROM b WHERE guid IN (SELECT guid FROM a_u_guid) AND unitid IN (SELECT unitid FROM a_u_unitid) GROUP BY unitid ORDER BY total DESC; Next, I decided to just get my "NOT" set of guid's from b itself. If my logic isn't correct here, someone let me know. Idid this, and got results very quickly (a minute perhaps): E: SELECT unitid,count(guid) AS total FROM b WHERE guid NOT IN (SELECT DISTINCT guid FROM b WHERE guid IN (SELECT guid FROM a_u_guid)) AND unitid IN (SELECT unitid FROM a_u_unitid) GROUP BY unitid ORDER BY total DESC; Keep in mind a has 65,000,000 rows and b only has 500,000. The above query (E) gave me the original results I was lookingfor (E gives the expected results from A, if A had ever finished, assuming my logic is correct). I then decided toget rid of the intermediate tables B & C, so I was back to: F: SELECT unitid,count(guid) AS total FROM b WHERE guid NOT IN (SELECT DISTINCT guid FROM b WHERE guid IN (SELECT DISTINCT guid FROM a) AND Unitid IN (SELECT DISTINCT unitid FROM a) GROUP BY unitid ORDER BY total DESC; While F was running, I started typing this email. It still has not finished, but I have a feeling it will within anotherhalf hour or so. What I am taking away from this is that the use of "NOT IN (subquery)" where subquery has a largenumber of results will literally shut down performance and never give a result. Further, I am taking away that creatingseparate tables with just the DISTINCT values from a table with millions of rows also saves time. The problem herebeing that the separate table is not auto-updating when values in the master table change. Query E then is apparently the way to go, but shouldn't there be a way to get the query planner to take these steps itself?If A had ever finished, I'd sure like to have seen an EXPLAIN ANALYZE on it. If F does finish, I made run an EXPLAINANALYZE on it versus E (I kind of expected F to finish while writing this email). Thanks, Jon
pgsql-admin by date: