Re: DropRelFileNodeBuffers API change (was Re: [BUGS] BUG #5599: Vacuum fails due to index corruption issues) - Mailing list pgsql-hackers
From | Bruce Momjian |
---|---|
Subject | Re: DropRelFileNodeBuffers API change (was Re: [BUGS] BUG #5599: Vacuum fails due to index corruption issues) |
Date | |
Msg-id | 201102182143.p1ILhJo27167@momjian.us Whole thread Raw |
In response to | Re: DropRelFileNodeBuffers API change (was Re: [BUGS] BUG #5599: Vacuum fails due to index corruption issues) (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: DropRelFileNodeBuffers API change (was Re: [BUGS] BUG #5599: Vacuum fails due to index corruption issues)
|
List | pgsql-hackers |
Did we make any progress on this? Is it a TODO? --------------------------------------------------------------------------- Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > On Sun, Aug 15, 2010 at 9:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> ... and PANIC is absolutely, entirely, 100% unacceptable here. �I don't > >> think you understand the context. �We've already written the truncate > >> action to WAL (as we must: WAL before data change). �If we PANIC, that > >> means we'll PANIC again during WAL replay. �So that solution means a > >> down, and perhaps unrecoverably broken, database. > > > All right, that would be bad. > > Actually ... after some study of the code, I find that I'm holding this > proposal to a higher standard than the current code maintains. > According to our normal rules for applying WAL-loggable data changes, > there should be a critical section wrapping the application of the > data changes with making the WAL entry. RelationTruncate fails to > do any such thing: it just pushes out the WAL entry and then calls > smgrtruncate, and so any error inside the latter just results in > elog(ERROR). Whereupon the system state is entirely different from what > WAL says it should be. So my previous gut feeling that we need to > rethink this whole thing seems justified. > > I traced through everything that leads to an ftruncate() call in the > backend as of HEAD, and found that we have these cases to worry about: > > mdunlink calls ftruncate directly, and does nothing worse than > elog(WARNING) on failure. This is fine because all it wants to do > is save some disk space until the file gets deleted "for real" at > the next checkpoint. Failure only means we waste disk space > temporarily, and is surely not cause for panic. > > All other calls proceed (sometimes indirectly) from RelationTruncate > or replay of the WAL record it emits. We have not thought hard > about the order of the various truncations this does and whether or > not we have a self-consistent state if it fails partway through. > If we don't want to make the whole thing a critical section, we need > to figure that out. > > RelationTruncate (and its subsidiary RelationTruncateIndexes) is called > from heap_truncate_one_rel (which itself does things in an unsafe order), > and from lazy_truncate_heap in VACUUM. > > heap_truncate_one_rel has (indirectly) two call sources: > > from ExecuteTruncate for a locally created rel, where we don't care, > and would definitely rather NOT have a panic on error: just rolling back > the transaction is fine thank you very much. > > from PreCommit_on_commit_actions, to process ON COMMIT DELETE ROWS. > Here again, so far as heaps are concerned, rollback on error would be > plenty since any inserted rows would then just be dead. The tricky > part is the indexes for such a table. If we truncate them before > truncating the heap, then the worst possible case is an internally > inconsistent index on a temp table, which will be automatically cleaned > up during the next successful commit in its session. So it's pretty > hard to justify a PANIC response here either. > > So it seems like the only case where there is really grounds for PANIC > on failure is the VACUUM case. And there we might apply Heikki's idea > of trying to zero the untruncatable pages first. > > I'm thinking that we need some sort of what-to-do-on-error flag passed > into RelationTruncate, plus at least order-of-operations fixes in > several other places, if not a wholesale refactoring of this whole call > stack. But I'm running out of steam and don't have a concrete proposal > to make right now. In any case, we've got more problems here than just > the original one of forgetting dirty buffers too soon. > > regards, tom lane > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
pgsql-hackers by date: