Thread: Deleting orphaned records (not exists is very slow)

Deleting orphaned records (not exists is very slow)

From

Tim Uckun

Date:

27 September 2010, 09:02:22

What is the best strategy for deleting orphaned records from a large table.

The usual NOT IN is very slow so there must be a better way in
postgres for dealing with these.

I know it's best not to have the orphans in the first place but in
this case it happened and I need to clean up before I can put in some
referential integrity.

Re: Deleting orphaned records (not exists is very slow)

From

Grzegorz Jaśkiewicz

Date:

27 September 2010, 12:16:21

prior to 8.4 not in will be slow. Just use left join.

Re: Deleting orphaned records (not exists is very slow)

From

Michael Glaesemann

Date:

27 September 2010, 20:23:57

On Sep 27, 2010, at 8:02 , Tim Uckun wrote:

> What is the best strategy for deleting orphaned records from a large table.
>
> The usual NOT IN is very slow so there must be a better way in
> postgres for dealing with these.

If the table is large, I sometimes use the following pattern:

1. Create a trigger on the referenced table (foo) to delete rows from the
referencing table (bar) when they're deleted from the referenced table.
This is a poor man's ON DELETE CASCADE and prevents any more rows from being
orphaned.

2. Create a table with the keys of the referencing table which are no longer
in the referenced table:

CREATE TABLE orphaned_bar
SELECT keycol
  FROM bar
  LEFT JOIN foo USING (keycol)
  WHERE foo.keycol IS NULL;

keycol may be multiple columns if you've got a multi-column key.

3. You're then free to delete the rows from bar however you wish, using orphaned_bar.
You might want to do them in one go, or in batches. You'll likely want to create an
index on orphaned_bar.keycol.

You can then add your foreign key and get rid of the trigger on foo when you're done.

Hope this helps.

Michael Glaesemann
grzm seespotcode net

Re: Deleting orphaned records (not exists is very slow)

From

Tim Uckun

Date:

27 September 2010, 20:37:33

> If the table is large, I sometimes use the following pattern:

The table is very large so I will use your advice thanks.

Re: Deleting orphaned records (not exists is very slow)

From

Grzegorz Jaśkiewicz

Date:

27 September 2010, 21:22:22

On Tue, Sep 28, 2010 at 12:37 AM, Tim Uckun <timuckun@gmail.com> wrote:
>> If the table is large, I sometimes use the following pattern:
>
> The table is very large so I will use your advice thanks.
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general

The other thing that just hit my mind, is that you mind need to bump
up work_mem a bit, just for the session. so
SET work_mem=32M
<your query>

and that should make it slightly faster.

--
GJ

Re: Deleting orphaned records (not exists is very slow)

From

Tim Uckun

Date:

27 September 2010, 23:08:37

>
> The other thing that just hit my mind, is that you mind need to bump
> up work_mem a bit, just for the session. so
> SET work_mem=32M
> <your query>
>
> and that should make it slightly faster.


I tried the method suggested. I created a table of IDs and a field
called "mark". I indexed both fields.

Then I did the following.

update to_be_deleted set mark = true where ctid  = any (array( select
ctid from to_be_deleted limit 10));

Followed by ....

DELETE FROM  table  WHERE (id in (select id from to_be_deleted where
mark = true))

This query took an extremely long time. I stopped it after about
fifteen minutes which seems outrageous to me because it's only trying
to delete ten records.

In the end I wrote a ruby script that does this.

loop do
      break if (to_be_deleted = ToBeDeleted.limit(10).map{|t| t.id}).size == 0
       ids =   to_be_deleted.join(',')
       SearchResult.delete_all "id in (#{ids})"
       ToBeDeleted.delete_all "id in (#{ids})"
       @logger.debug "Deleted #{ids}"
end


This is running now. It's running reasonably fast.  I presume it will
keep getting faster as the number or records on both tables keep
getting smaller.

Honestly there was no need for any of this. I can't believe I just
wasted a couple of hours trying to get this to go only to resort to
writing a ruby script.

The original query I had written was....

delete from
table_name
where id in
in
(SELECT id
FROM table_name  tb
LEFT OUTER JOIN other_table ot ON tb.id = ot.table_name_id
WHERE ot.id Is Null)


This should have "just worked" but in this case I would estimate it
would take a couple of months given the number of records in the
database.




Thanks for the advice  but man what a hassle.

Re: Deleting orphaned records (not exists is very slow)

From

Michael Glaesemann

Date:

27 September 2010, 23:49:09

On Sep 27, 2010, at 22:08 , Tim Uckun wrote:

> update to_be_deleted set mark = true where ctid  = any (array( select
> ctid from to_be_deleted limit 10));

Why are you messing with ctid? Does the table have no key? If not, you should fix that first.

Michael Glaesemann
grzm seespotcode net

Re: Deleting orphaned records (not exists is very slow)

From

Tim Uckun

Date:

28 September 2010, 00:20:26

> Why are you messing with ctid? Does the table have no key? If not, you should fix that first.
>

I got the idea from here

http://www.postgres.cz/index.php/PostgreSQL_SQL_Tricks

Re: Deleting orphaned records (not exists is very slow)

From

Michael Glaesemann

Date:

28 September 2010, 00:56:46

On Sep 27, 2010, at 23:20 , Tim Uckun wrote:

>> Why are you messing with ctid? Does the table have no key? If not, you should fix that first.
>>
>
> I got the idea from here
>
> http://www.postgres.cz/index.php/PostgreSQL_SQL_Tricks

If your table already has a key (some column or combination of columns that is unique per row),
there's really no need to use ctid. The only reason they're using ctid on that page because
they have duplicate rows: the table *doesn't* have a key and they have no other way to specify
rows uniquely. Given you reference an id column, I suspect your your table already has a key,
so you should just use that.

ctid is an implementation detail of PostgreSQL rather than part of the logical design of the
database: it really shouldn't be used unless you absolutely have to.

Anyway, sounds like you got it sussed out. Good luck with straightening out the rest of your data!

Michael Glaesemann
grzm seespotcode net