Re: POC: Cleaning up orphaned files using undo logs - Mailing list pgsql-hackers
From | Thomas Munro |
---|---|
Subject | Re: POC: Cleaning up orphaned files using undo logs |
Date | |
Msg-id | CA+hUKG+MpzRsZFE7ChhRq-Br5VYYi6mafVQ73Af7ahioWo5o8w@mail.gmail.com Whole thread Raw |
In response to | Re: POC: Cleaning up orphaned files using undo logs (Thomas Munro <thomas.munro@gmail.com>) |
Responses |
Re: POC: Cleaning up orphaned files using undo logs
Re: POC: Cleaning up orphaned files using undo logs |
List | pgsql-hackers |
Hello, Aside from code changes based on review (and I have more to come of those), the attached experimental patchset (also at https://github.com/EnterpriseDB/zheap/tree/undo) has a new protocol that, I hope, allows for better concurrency, reliability and readability, and removes a bunch of TODO notes about questionable interlocking. However, I'm not quite done figuring out if the bufmgr interaction is right and will be manageable on the undoaccess side, so I'm hoping to get some feedback, not asking for anyone to rebase on top of it yet. Previously, there were two LWLocks used to make sure that all discarding was prevented while anyone was reading or writing data in any part of an undo log, and (probably more problematically) vice versa. Here's a new approach that removes that blocking: 1. Anyone is allowed to try to read or write data at any UndoRecPtr that has been allocated, through the buffer pool (though you'd usually want to check it with UndoRecPtrIsDiscarded() first, and only rely on the system I'm describing to deal with races). 2. ReadBuffer() might return InvalidBuffer. This can happen for a cache miss, if the smgrread implementation wants to indicate that the buffer has been discarded/truncated and that is expected (md.c won't ever do that, but undofile.c can). 3. UndoLogDiscard() uses DiscardBuffer() to invalidate any currently unpinned buffers, and marks as BM_DISCARDED any that happen to be pinned right now, so they can't be immediately invalidated. Such buffers are never written back and are eligible for reuse on the next clock sweep, even if they're written into by a backend that managed to do that when we were trying to discard. 4. In order to make this work, I needed to track an extra offset 'begin' that says what physical storage exists. So [begin, end) give you the range of physical undo space (that is, files that exist on disk) and [discard, insert) give you the range of active data within it. There are now four offsets per log in shm and in the pg_stat_undo_logs view. 5. Separating begin from discard allows the WAL logging for UndoLogDiscard() to do filesystem actions before logging, and other effects after logging, which have several nice properties if you work through the various crash scenarios. This allowed a lot of direct UndoLogSlot access and locking code to be removed from undodiscard.c and undoaccess.c, because now they can just proceed as normal, as long as they are prepared to give up whenever the buffer manager tells them the buffer they're asking for has evaporated. Once they've pinned a buffer, they don't need to care if it becomes (or already was) BM_DISCARDED; attempts to dirty it will be silently ignore and eventually it'll be reclaimed. It also gets rid of 'oldest_data', which was another scheme tagging along behind the discard pointer. So now I'd like to get feedback on the sanity of this scheme. I'm not saying it doesn't have bugs right now -- I've been trying to figure out good ways to test it and I'm not quite there yet -- but the concept. One observation I have is that there were already code paths in undoaccess.c that can tolerate InvalidBuffer in recovery, due to the potentially different discard timing for DO vs REDO. I think that's a point in favour of this scheme, but I can see that it's inconvenient to have to deal with InvalidBuffer whenever you read. Some other changes in this patch set: 1. There is a superuser-only procedure pg_force_discard_undo(logno) that can discard on command. This can be used to get a system unwedged if rollback actions are failing. For example, if you insert an elog(ERROR, "boo!") into the smgr_undo() and then roll back a table creation, you'll see a discard worker repeatedly reporting the error, and pg_stat_undo_logs will show that the undo log space never gets freed. This can be fixed with CALL pg_force_discard_undo(<logno>). 2. There is a superuser-only testing-only procedure pg_force_switch_undo(logno) that can be used to force a transaction is currently writing to that log number to switch to a new one, as if it hit the end of the undo log (the 1TB address space within each undo log). This is good for exercising code that eg rolls back stuff spread over two undo logs. 3. When I was removing oldest_data from UndoLogSlot, I started wondering why wait_fxmin was in there, as it was almost the last reason why discard worker code needed to know about slots. Since we currently have only a single discard worker, and no facility for coordinating more than one discard worker, I think its bookkeeping might as well be backend local. Here I made the stupidest change that would work: a hash table to hold per-logno wait_fxmin. I'm not entirely sure what data structure we really want for this -- it's all a bit brute force right now. Thoughts? I pulled in the latest code from undoprocessing as of today, and I might be a bit confused about "Defect and enhancement in multi-log support" some of which I have squashed into the make undolog patch. BTW undoprocessing builds with initialized variable warnings in xact.c on clang today. -- Thomas Munro https://enterprisedb.com
Attachment
pgsql-hackers by date: