Thread: Deleting orphaned records (not exists is very slow)
What is the best strategy for deleting orphaned records from a large table. The usual NOT IN is very slow so there must be a better way in postgres for dealing with these. I know it's best not to have the orphans in the first place but in this case it happened and I need to clean up before I can put in some referential integrity.
prior to 8.4 not in will be slow. Just use left join.
On Sep 27, 2010, at 8:02 , Tim Uckun wrote: > What is the best strategy for deleting orphaned records from a large table. > > The usual NOT IN is very slow so there must be a better way in > postgres for dealing with these. If the table is large, I sometimes use the following pattern: 1. Create a trigger on the referenced table (foo) to delete rows from the referencing table (bar) when they're deleted from the referenced table. This is a poor man's ON DELETE CASCADE and prevents any more rows from being orphaned. 2. Create a table with the keys of the referencing table which are no longer in the referenced table: CREATE TABLE orphaned_bar SELECT keycol FROM bar LEFT JOIN foo USING (keycol) WHERE foo.keycol IS NULL; keycol may be multiple columns if you've got a multi-column key. 3. You're then free to delete the rows from bar however you wish, using orphaned_bar. You might want to do them in one go, or in batches. You'll likely want to create an index on orphaned_bar.keycol. You can then add your foreign key and get rid of the trigger on foo when you're done. Hope this helps. Michael Glaesemann grzm seespotcode net
> If the table is large, I sometimes use the following pattern: The table is very large so I will use your advice thanks.
On Tue, Sep 28, 2010 at 12:37 AM, Tim Uckun <timuckun@gmail.com> wrote: >> If the table is large, I sometimes use the following pattern: > > The table is very large so I will use your advice thanks. > > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general The other thing that just hit my mind, is that you mind need to bump up work_mem a bit, just for the session. so SET work_mem=32M <your query> and that should make it slightly faster. -- GJ
> > The other thing that just hit my mind, is that you mind need to bump > up work_mem a bit, just for the session. so > SET work_mem=32M > <your query> > > and that should make it slightly faster. I tried the method suggested. I created a table of IDs and a field called "mark". I indexed both fields. Then I did the following. update to_be_deleted set mark = true where ctid = any (array( select ctid from to_be_deleted limit 10)); Followed by .... DELETE FROM table WHERE (id in (select id from to_be_deleted where mark = true)) This query took an extremely long time. I stopped it after about fifteen minutes which seems outrageous to me because it's only trying to delete ten records. In the end I wrote a ruby script that does this. loop do break if (to_be_deleted = ToBeDeleted.limit(10).map{|t| t.id}).size == 0 ids = to_be_deleted.join(',') SearchResult.delete_all "id in (#{ids})" ToBeDeleted.delete_all "id in (#{ids})" @logger.debug "Deleted #{ids}" end This is running now. It's running reasonably fast. I presume it will keep getting faster as the number or records on both tables keep getting smaller. Honestly there was no need for any of this. I can't believe I just wasted a couple of hours trying to get this to go only to resort to writing a ruby script. The original query I had written was.... delete from table_name where id in in (SELECT id FROM table_name tb LEFT OUTER JOIN other_table ot ON tb.id = ot.table_name_id WHERE ot.id Is Null) This should have "just worked" but in this case I would estimate it would take a couple of months given the number of records in the database. Thanks for the advice but man what a hassle.
On Sep 27, 2010, at 22:08 , Tim Uckun wrote: > update to_be_deleted set mark = true where ctid = any (array( select > ctid from to_be_deleted limit 10)); Why are you messing with ctid? Does the table have no key? If not, you should fix that first. Michael Glaesemann grzm seespotcode net
> Why are you messing with ctid? Does the table have no key? If not, you should fix that first. > I got the idea from here http://www.postgres.cz/index.php/PostgreSQL_SQL_Tricks
On Sep 27, 2010, at 23:20 , Tim Uckun wrote: >> Why are you messing with ctid? Does the table have no key? If not, you should fix that first. >> > > I got the idea from here > > http://www.postgres.cz/index.php/PostgreSQL_SQL_Tricks If your table already has a key (some column or combination of columns that is unique per row), there's really no need to use ctid. The only reason they're using ctid on that page because they have duplicate rows: the table *doesn't* have a key and they have no other way to specify rows uniquely. Given you reference an id column, I suspect your your table already has a key, so you should just use that. ctid is an implementation detail of PostgreSQL rather than part of the logical design of the database: it really shouldn't be used unless you absolutely have to. Anyway, sounds like you got it sussed out. Good luck with straightening out the rest of your data! Michael Glaesemann grzm seespotcode net