Home > mailing lists

Re: Using ctid column changes plan drastically - Mailing list pgsql-performance

From	Thomas Kellerer
Subject	Re: Using ctid column changes plan drastically
Date	July 25, 2012 05:10:47
Msg-id	juo9lb$4hm$1@dough.gmane.org Whole thread Raw
In response to	Re: Using ctid column changes plan drastically (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: Using ctid column changes plan drastically
List	pgsql-performance

Tree view

Tom Lane, 24.07.2012 19:12:
> Well, it would only help if you're running a PG version that's new
> enough to recognize the NOT EXISTS as an anti-join; and even then,
> it's possible that joining on a tid column forecloses enough plan
> types that you don't get any real benefit.  But I'm just guessing.
> Can you show exactly what you tried and what EXPLAIN ANALYZE results
> you got?
>

I am using 9.1.4 (as I said in my initial post).

I finally found a solution that runs fine:

DELETE FROM dupes a
WHERE EXISTS (SELECT 1
               FROM  dupes b
               WHERE b.first_name = a.first_name
                 AND b.last_name = a.last_name
                 AND b.ctid > a.ctid);

The execution plan for this is:

Delete on public.dupes a  (cost=14575.95..16978.87 rows=25000 width=12) (actual time=2419.334..2419.334 rows=0 loops=1)
   Buffers: shared hit=18029
   ->  Merge Semi Join  (cost=14575.95..16978.87 rows=25000 width=12) (actual time=2043.674..2392.707 rows=17097
loops=1)
         Output: a.ctid, b.ctid
         Merge Cond: ((a.first_name = b.first_name) AND (a.last_name = b.last_name))
         Join Filter: (b.ctid > a.ctid)
         Buffers: shared hit=930
         ->  Sort  (cost=7287.98..7475.48 rows=75000 width=20) (actual time=1024.195..1030.051 rows=75000 loops=1)
               Output: a.ctid, a.first_name, a.last_name
               Sort Key: a.first_name, a.last_name
               Sort Method: quicksort  Memory: 8870kB
               Buffers: shared hit=465
               ->  Seq Scan on public.dupes a  (cost=0.00..1215.00 rows=75000 width=20) (actual time=0.025..23.234
rows=75000loops=1) 
                     Output: a.ctid, a.first_name, a.last_name
                     Buffers: shared hit=465
         ->  Sort  (cost=7287.98..7475.48 rows=75000 width=20) (actual time=1019.148..1028.483 rows=105841 loops=1)
               Output: b.ctid, b.first_name, b.last_name
               Sort Key: b.first_name, b.last_name
               Sort Method: quicksort  Memory: 8870kB
               Buffers: shared hit=465
               ->  Seq Scan on public.dupes b  (cost=0.00..1215.00 rows=75000 width=20) (actual time=0.017..19.133
rows=75000loops=1) 
                     Output: b.ctid, b.first_name, b.last_name
                     Buffers: shared hit=465
Total runtime: 2420.953 ms

Which is a lot better than the plan using "WHERE ctid NOT IN (.....)":

Delete on public.dupes  (cost=1777.50..4925055.00 rows=37500 width=6) (actual time=582515.094..582515.094 rows=0
loops=1)
   Buffers: shared hit=18027
   ->  Seq Scan on public.dupes  (cost=1777.50..4925055.00 rows=37500 width=6) (actual time=1038.164..582332.927
rows=17097loops=1) 
         Output: dupes.ctid
         Filter: (NOT (SubPlan 1))
         Buffers: shared hit=930
         SubPlan 1
           ->  Materialize  (cost=1777.50..1890.00 rows=7500 width=20) (actual time=0.001..2.283 rows=35552
loops=75000)
                 Output: (min(b.ctid)), b.first_name, b.last_name
                 Buffers: shared hit=465
                 ->  HashAggregate  (cost=1777.50..1852.50 rows=7500 width=20) (actual time=90.964..120.228 rows=57903
loops=1)
                       Output: min(b.ctid), b.first_name, b.last_name
                       Buffers: shared hit=465
                       ->  Seq Scan on public.dupes b  (cost=0.00..1215.00 rows=75000 width=20) (actual
time=0.008..25.515rows=75000 loops=1) 
                             Output: b.ctid, b.first_name, b.last_name
                             Buffers: shared hit=465
Total runtime: 582517.711 ms

Using "WHERE id NOT IN (...)" is the fastest way:

Delete on public.dupes  (cost=1871.25..3273.75 rows=37500 width=6) (actual time=187.949..187.949 rows=0 loops=1)
   Buffers: shared hit=18490
   ->  Seq Scan on public.dupes  (cost=1871.25..3273.75 rows=37500 width=6) (actual time=125.351..171.108 rows=17097
loops=1)
         Output: dupes.ctid
         Filter: (NOT (hashed SubPlan 1))
         Buffers: shared hit=930
         SubPlan 1
           ->  HashAggregate  (cost=1777.50..1852.50 rows=7500 width=18) (actual time=73.131..93.421 rows=57903
loops=1)
                 Output: min(b.id), b.first_name, b.last_name
                 Buffers: shared hit=465
                 ->  Seq Scan on public.dupes b  (cost=0.00..1215.00 rows=75000 width=18) (actual time=0.004..8.515
rows=75000loops=1) 
                       Output: b.id, b.first_name, b.last_name
                       Buffers: shared hit=465
Total runtime: 189.222 ms

Regards
Thomas

pgsql-performance by date:

From: Aleksei Arefjev
Date: 25 July 2012, 04:38:15
Subject: Re: transactions start time

From: AI Rumman
Date: 25 July 2012, 05:40:50
Subject: Re: Why do I need more time with partition table?

Re: Using ctid column changes plan drastically - Mailing list pgsql-performance

Previous

Next