Re: best way to do bulk delete? - Mailing list pgsql-novice
From:           Merlin Moncure
Subject:        Re: best way to do bulk delete?
Date:
Msg-id:         CAHyXU0wism0HkYj9A9=9AKD=07igQ8MZ3vuvfcZ6s3L8B_fing@mail.gmail.com
In response to: best way to do bulk delete? (pg noob <pgnube@gmail.com>)
Responses:      Re: best way to do bulk delete?
List:           pgsql-novice
On Fri, Oct 28, 2011 at 9:30 AM, pg noob <pgnube@gmail.com> wrote:
> Greetings,
>
> I have an occasional recurring use case where I need to delete a large
> number of rows from multiple tables as quickly as possible, something
> like a bulk delete.
>
> A typical example would be deleting a few hundred thousand rows at once
> from a set of tables each containing 1 to 2 million rows, but in a
> worst-case scenario it could be as large as 1 million rows from a set
> of tables each containing 4 to 5 million rows.
>
> These tables of course have indexes on them, as well as foreign key
> constraints and cascade deletes to other tables.
>
> I can't simply truncate the tables because the rows being deleted are
> subsets of the total amount of data in the tables.
>
> These tables have heavy insert/update/delete activity going on at the
> same time, but mostly not on the same set of rows that is being
> bulk-deleted (though there may be some update activity going on which
> accesses those rows and hasn't yet quiesced).
>
> What is the best way of going about this?
>
> I've considered a few options.
>
> One option is to open a single transaction, issue delete statements
> that delete huge numbers of rows from each table in question (probably
> with a DELETE ... USING query), and then commit the transaction.
> My concern with this approach is that it will hold a huge number of
> row locks while in progress, may take a long time to complete, and
> could introduce deadlocks if it competes with other updates that have
> acquired row locks out of order.
>
> Another option would be to delete one row per transaction, or a
> smaller set of rows per transaction, and commit, repeating in a loop.
> This has the advantage that if a transaction hits an error and has to
> be rolled back it doesn't have to redo the entire delete operation,
> and it doesn't hold as many row locks for as long. The drawback is
> that I believe this approach would be a lot slower.
>
> And finally, I've considered the idea of using COPY to copy the data
> that needs to be kept to temporary tables, truncating the original
> tables, and then copying the data back. I believe this would be the
> most efficient way to do the delete, but the implementation is quite a
> bit more complicated than the first two options I described, and it
> has implications for how to deal with error scenarios or
> database/system crashes while the operation is in progress.

you don't need to COPY -- just insert/select, like this:

CREATE TABLE keeprows (LIKE old_table INCLUDING INDEXES INCLUDING CONSTRAINTS);
INSERT INTO keeprows SELECT * FROM old_table WHERE ...;
DROP TABLE old_table;
ALTER TABLE keeprows RENAME TO old_table;
<reset RI rules>

This procedure is not side-effect free and has some caveats: for
example, if you have code that depends on the table's composite type,
that type will also be dropped and has to be reconstructed.

Perhaps a better way to go, if you can structure your code around it,
is to simply partition your table around when data gets loaded and
dropped -- then your deletion becomes a 'DROP TABLE' with no extra
processing. Sketches of the single-transaction, batched, and
partitioned approaches follow below.
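For the single-transaction option, a minimal sketch of DELETE ... USING
driven by a staging table of doomed keys -- all names here ("orders",
"order_items", "purge_keys") are invented for illustration:

-- purge_keys is a hypothetical staging table holding the ids to remove
BEGIN;
DELETE FROM order_items oi
  USING purge_keys pk
  WHERE oi.order_id = pk.order_id;
DELETE FROM orders o
  USING purge_keys pk
  WHERE o.order_id = pk.order_id;
COMMIT;

Deleting child rows explicitly before parent rows keeps the bulk of the
work in set-oriented statements instead of per-row cascade firing.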
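For the batched option, one common shape is to delete a bounded chunk,
commit, and repeat until the command reports DELETE 0 (again with
invented names; "expired" stands in for whatever predicate marks the
doomed rows):

DELETE FROM orders
 WHERE order_id IN (SELECT order_id
                      FROM orders
                     WHERE expired      -- hypothetical delete predicate
                     LIMIT 10000);

Run in autocommit mode each statement is its own transaction, so row
locks are released at every commit and a failure only rolls back one
batch.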
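And a sketch of the partitioning idea using inheritance, assuming data
arrives in monthly load batches (schema and names invented):

CREATE TABLE orders (order_id bigint, loaded timestamptz);
CREATE TABLE orders_2011_10 (
    CHECK (loaded >= '2011-10-01' AND loaded < '2011-11-01')
) INHERITS (orders);

-- new rows are routed into the current child table; retiring a whole
-- load batch is then just:
DROP TABLE orders_2011_10;

DROP TABLE only touches the catalogs, so it is near-instant no matter
how many rows the partition holds.

merlin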