Re: Scrub one large table against another - Mailing list pgsql-performance
From: Brendan Curran
Subject: Re: Scrub one large table against another
Msg-id: 452C208C.5010603@gmail.com
In response to: Re: Scrub one large table against another (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Scrub one large table against another
List: pgsql-performance
Tom Lane wrote:
> Brendan Curran <brendan.curran@gmail.com> writes:
>> CREATE TEMP TABLE temp_list_suppress(email_record_id int8);
>>
>> INSERT INTO temp_list_suppress
>> SELECT email_record_id from ONLY email_record er
>> WHERE email_list_id = 9 AND email IN
>> (select email from suppress);
>>
>> CREATE INDEX unique_id_index on temp_list_suppress ( email_record_id );
>>
>> INSERT INTO er_banned
>> SELECT * from ONLY email_record er WHERE EXISTS
>> (SELECT 1 from temp_list_suppress ts where er.email_record_id =
>> ts.email_record_id);
>>
>> DELETE FROM ONLY email_record WHERE email_list_id = 9 AND email_record_id IN
>> (SELECT email_record_id from temp_list_suppress);
>>
>> TRUNCATE TABLE temp_list_suppress;
>> DROP TABLE temp_list_suppress;
>>
>> The performance is dreadful, is there a more efficient way to do this?
>
> Have you tried doing EXPLAIN ANALYZE of each of the INSERT/DELETE steps?
> If you don't even know which part is slow, it's hard to improve.

FIRST INSERT (only the SELECT is explained):

Hash Join  (cost=8359220.68..9129843.00 rows=800912 width=32)
  Hash Cond: (("outer".email)::text = ("inner".email)::text)
  ->  Unique  (cost=4414093.19..4522324.49 rows=21646260 width=25)
        ->  Sort  (cost=4414093.19..4468208.84 rows=21646260 width=25)
              Sort Key: suppress.email
              ->  Seq Scan on suppress  (cost=0.00..393024.60 rows=21646260 width=25)
  ->  Hash  (cost=3899868.47..3899868.47 rows=4606808 width=32)
        ->  Bitmap Heap Scan on email_record er  (cost=38464.83..3899868.47 rows=4606808 width=32)
              Recheck Cond: (email_list_id = 13)
              ->  Bitmap Index Scan on list  (cost=0.00..38464.83 rows=4606808 width=0)
                    Index Cond: (email_list_id = 13)

SECOND INSERT (using EXISTS):

Seq Scan on email_record er  (cost=0.00..381554175.29 rows=62254164 width=1863)
  Filter: (subplan)
  SubPlan
    ->  Index Scan using er_primeq_pk on er_primeq eq  (cost=0.00..3.03 rows=1 width=0)
          Index Cond: ($0 = email_record_id)

SECOND INSERT (using IN):

Nested Loop  (cost=26545.94..2627497.28 rows=27134 width=1863)
  ->  HashAggregate  (cost=26545.94..33879.49 rows=733355 width=8)
        ->  Seq Scan on er_primeq  (cost=0.00..24712.55 rows=733355 width=8)
  ->  Index Scan using email_record_pkey on email_record er  (cost=0.00..3.52 rows=1 width=1863)
        Index Cond: (er.email_record_id = "outer".email_record_id)
        Filter: (email_list_id = 13)

DELETE:

Nested Loop  (cost=26545.94..2627497.28 rows=50846 width=6)
  ->  HashAggregate  (cost=26545.94..33879.49 rows=733355 width=8)
        ->  Seq Scan on er_primeq  (cost=0.00..24712.55 rows=733355 width=8)
  ->  Index Scan using email_record_pkey on email_record  (cost=0.00..3.52 rows=1 width=14)
        Index Cond: (email_record.email_record_id = "outer".email_record_id)
        Filter: (email_list_id = 9)

To get this EXPLAIN data I used a sample "temp_suppress" table that contained about 700k rows and was indexed but not analyzed...

> It would probably help to do an "ANALYZE temp_list_suppress" right after
> populating the temp table.  As you have it, the second insert and delete
> are being planned with nothing more than a row count (obtained during
> CREATE INDEX) and no stats about distribution of the table contents.
>
> Also, I'd be inclined to try replacing the EXISTS with an IN test;
> in recent PG versions the planner is generally smarter about IN.
> (Is there a reason why you are doing the INSERT one way and the
> DELETE the other?)
>
> BTW, that TRUNCATE right before the DROP seems quite useless,
> although it's not the main source of your problem.
>
>             regards, tom lane
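Putting Tom's three suggestions together, the original workflow could be restructured as the following sketch. The table and column names are taken from the thread; the ANALYZE placement, the IN form for both the INSERT and the DELETE, and dropping the redundant TRUNCATE are the suggested changes, not a tested production script.

```sql
CREATE TEMP TABLE temp_list_suppress(email_record_id int8);

INSERT INTO temp_list_suppress
SELECT email_record_id FROM ONLY email_record
WHERE email_list_id = 9
  AND email IN (SELECT email FROM suppress);

CREATE INDEX unique_id_index ON temp_list_suppress (email_record_id);

-- Give the planner real distribution stats, not just the row count
-- picked up during CREATE INDEX.
ANALYZE temp_list_suppress;

-- Use IN for both statements; recent PG planners handle IN well,
-- and the two statements then get consistent plans.
INSERT INTO er_banned
SELECT * FROM ONLY email_record
WHERE email_record_id IN (SELECT email_record_id FROM temp_list_suppress);

DELETE FROM ONLY email_record
WHERE email_list_id = 9
  AND email_record_id IN (SELECT email_record_id FROM temp_list_suppress);

-- TRUNCATE immediately before DROP is redundant; DROP alone suffices.
DROP TABLE temp_list_suppress;
```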