Re: Scrub one large table against another - Mailing list pgsql-performance
From: Brendan Curran
Subject: Re: Scrub one large table against another
Msg-id: 452C208C.5010603@gmail.com
In response to: Re: Scrub one large table against another (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Scrub one large table against another
List: pgsql-performance
Tom Lane wrote:
> Brendan Curran <brendan.curran@gmail.com> writes:
>> CREATE TEMP TABLE temp_list_suppress(email_record_id int8);
>>
>> INSERT INTO temp_list_suppress
>> SELECT email_record_id from ONLY email_record er
>> WHERE email_list_id = 9 AND email IN
>> (select email from suppress);
>>
>> CREATE INDEX unique_id_index on temp_list_suppress ( email_record_id );
>>
>> INSERT INTO er_banned
>> SELECT * from ONLY email_record er WHERE EXISTS
>> (SELECT 1 from temp_list_suppress ts where er.email_record_id =
>> ts.email_record_id);
>>
>> DELETE FROM ONLY email_record WHERE email_list_id = 9 AND email_record_id IN
>> (SELECT email_record_id from temp_list_suppress);
>>
>> TRUNCATE TABLE temp_list_suppress;
>> DROP TABLE temp_list_suppress;
>>
>> The performance is dreadful, is there a more efficient way to do this?
>
> Have you tried doing EXPLAIN ANALYZE of each of the INSERT/DELETE steps?
> If you don't even know which part is slow, it's hard to improve.

FIRST INSERT (only the SELECT is explained):

Hash Join  (cost=8359220.68..9129843.00 rows=800912 width=32)
  Hash Cond: (("outer".email)::text = ("inner".email)::text)
  ->  Unique  (cost=4414093.19..4522324.49 rows=21646260 width=25)
        ->  Sort  (cost=4414093.19..4468208.84 rows=21646260 width=25)
              Sort Key: suppress.email
              ->  Seq Scan on suppress  (cost=0.00..393024.60 rows=21646260 width=25)
  ->  Hash  (cost=3899868.47..3899868.47 rows=4606808 width=32)
        ->  Bitmap Heap Scan on email_record er  (cost=38464.83..3899868.47 rows=4606808 width=32)
              Recheck Cond: (email_list_id = 13)
              ->  Bitmap Index Scan on list  (cost=0.00..38464.83 rows=4606808 width=0)
                    Index Cond: (email_list_id = 13)

SECOND INSERT (using EXISTS):

Seq Scan on email_record er  (cost=0.00..381554175.29 rows=62254164 width=1863)
  Filter: (subplan)
  SubPlan
    ->  Index Scan using er_primeq_pk on er_primeq eq  (cost=0.00..3.03 rows=1 width=0)
          Index Cond: ($0 = email_record_id)

SECOND INSERT (using IN):

Nested Loop  (cost=26545.94..2627497.28 rows=27134 width=1863)
  ->  HashAggregate  (cost=26545.94..33879.49 rows=733355 width=8)
        ->  Seq Scan on er_primeq  (cost=0.00..24712.55 rows=733355 width=8)
  ->  Index Scan using email_record_pkey on email_record er  (cost=0.00..3.52 rows=1 width=1863)
        Index Cond: (er.email_record_id = "outer".email_record_id)
        Filter: (email_list_id = 13)

DELETE:

Nested Loop  (cost=26545.94..2627497.28 rows=50846 width=6)
  ->  HashAggregate  (cost=26545.94..33879.49 rows=733355 width=8)
        ->  Seq Scan on er_primeq  (cost=0.00..24712.55 rows=733355 width=8)
  ->  Index Scan using email_record_pkey on email_record  (cost=0.00..3.52 rows=1 width=14)
        Index Cond: (email_record.email_record_id = "outer".email_record_id)
        Filter: (email_list_id = 9)

To get this EXPLAIN data I used a sample "temp_suppress" table that contained about 700k rows and was indexed but not analyzed...

> It would probably help to do an "ANALYZE temp_list_suppress" right after
> populating the temp table.  As you have it, the second insert and delete
> are being planned with nothing more than a row count (obtained during
> CREATE INDEX) and no stats about distribution of the table contents.
>
> Also, I'd be inclined to try replacing the EXISTS with an IN test;
> in recent PG versions the planner is generally smarter about IN.
> (Is there a reason why you are doing the INSERT one way and the
> DELETE the other?)
>
> BTW, that TRUNCATE right before the DROP seems quite useless,
> although it's not the main source of your problem.
>
>             regards, tom lane
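Putting Tom's three suggestions together, the original workflow could be restructured as the following sketch. The table and column names are taken from the thread; the ANALYZE placement, the IN form for both the INSERT and the DELETE, and dropping the redundant TRUNCATE are the suggested changes, not a tested production script.

```sql
CREATE TEMP TABLE temp_list_suppress(email_record_id int8);

INSERT INTO temp_list_suppress
SELECT email_record_id FROM ONLY email_record
WHERE email_list_id = 9
  AND email IN (SELECT email FROM suppress);

CREATE INDEX unique_id_index ON temp_list_suppress (email_record_id);

-- Give the planner real distribution stats, not just the row count
-- picked up during CREATE INDEX.
ANALYZE temp_list_suppress;

-- Use IN for both statements; recent PG planners handle IN well,
-- and the two statements then get consistent plans.
INSERT INTO er_banned
SELECT * FROM ONLY email_record
WHERE email_record_id IN (SELECT email_record_id FROM temp_list_suppress);

DELETE FROM ONLY email_record
WHERE email_list_id = 9
  AND email_record_id IN (SELECT email_record_id FROM temp_list_suppress);

-- TRUNCATE immediately before DROP is redundant; DROP alone suffices.
DROP TABLE temp_list_suppress;
```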