POC: Cleaning up orphaned files using undo logs - Mailing list pgsql-hackers
From | Thomas Munro |
---|---|
Subject | POC: Cleaning up orphaned files using undo logs |
Date | |
Msg-id | CAEepm=0ULqYgM2aFeOnrx6YrtBg3xUdxALoyCG+XpssKqmezug@mail.gmail.com Whole thread Raw |
Responses |
Re: POC: Cleaning up orphaned files using undo logs
Re: POC: Cleaning up orphaned files using undo logs |
List | pgsql-hackers |
Hello hackers, The following sequence creates an orphaned file: BEGIN; CREATE TABLE t (); <kill -9 this backend> Occasionally there are reports of systems that have managed to produce a lot of them, perhaps through ENOSPC-induced panics, OOM signals or buggy/crashing extensions etc. The most recent example I found in the archives involved 1.7TB of unexpected files and some careful cleanup work. Relation files are created eagerly, and rollback is handled by pushing PendingRelDelete objects onto the pendingDeletes list, to be discarded on commit or processed on abort. That's effectively a kind of specialised undo log, but it's in memory only, so it's less persistent than the effects it is supposed to undo. Here's a proof-of-concept patch that plugs the gap using the undo log technology we're developing as part of the zHeap project. Note that zHeap is not involved here: the SMGR module is acting as a direct client of the undo machinery. Example: postgres=# begin; BEGIN postgres=# create table t1 (); CREATE TABLE postgres=# create table t2 (); CREATE TABLE ... now we can see that this transaction has some undo data (discard < insert): postgres=# select logno, discard, insert, xid, pid from pg_stat_undo_logs; logno | discard | insert | xid | pid -------+------------------+------------------+-----+------- 0 | 00000000000021EF | 0000000000002241 | 581 | 18454 (1 row) ... and, if the test_undorecord module is installed, we can inspect the records it holds: postgres=# call dump_undo_records(0); NOTICE: 0000000000002224: Storage: CREATE dbid=12655, tsid=1663, relfile=24594 NOTICE: 00000000000021EF: Storage: CREATE dbid=12655, tsid=1663, relfile=24591 CALL If we COMMIT, the undo data is discarded by advancing the discard pointer (tail) to match the insert pointer (head). If we ROLLBACK, either explicitly or automatically by crashing and recovering, then the files will be unlinked and the insert pointer will be rewound; either way the undo log eventually finishes up "empty" again (discard == insert). This is done with a system of per-rmgr-ID record types and callbacks, similar to redo. The rollback action are either executed immediately or offloaded to an undo worker process, depending on simple heuristics. Of course this isn't free, and the current patch makes table creation slower. The goal is to make sure that there is no scenario (kill -9, power cut etc) in which there can be a new relation file on disk, but not a corresponding undo record that would unlink that file if the transaction later has to roll back. Currently, that means that we need to flush the WAL record that will create the undo record that will unlink the file *before* we create the relation file. I suspect that could be mitigated quite easily, by deferring file creation in a backend-local queue until forced by access or commit. I didn't try to do that in this basic version. There are probably other ways to solve the specific problem of orphaned files, but this approach is built on a general reusable facility and I think it is a nice way to show the undo concepts, and how they are separate from zheap. Specifically, it demonstrates the more traditional of the two uses for undo logs: a reliable way to track actions that must be performed on rollback. (The other use is: seeing past versions of data, for vacuumless MVCC; that's a topic for later). Patches 0001-0006 are development snapshots of material posted on other threads already[1][2], hacked around by me to make this possible (see those threads for further developments in those patches including some major strengthening work, coming soon). The subject of this thread is 0007, the core of which is just a couple of hundred lines written by me, based on an idea from Robert Haas. Personally I think it'd be a good feature to get into PostgreSQL 12, and I will add it to the CF that is about to start to seek feedback. It passes make check on Unix and Windows, though currently it's failing some of the TAP tests for reasons I'm looking into (possibly due to bugs in the lower level patches, not sure). Thanks for reading, [1] https://www.postgresql.org/message-id/flat/CAEepm=2EqROYJ_xYz4v5kfr4b0qw_Lq_6Pe8RTEC8rx3upWsSQ@mail.gmail.com [2] https://www.postgresql.org/message-id/flat/CAFiTN-sYQ8r8ANjWFYkXVfNxgXyLRfvbX9Ee4SxO9ns-OBBgVA@mail.gmail.com -- Thomas Munro http://www.enterprisedb.com
Attachment
pgsql-hackers by date: