Re: 'Invalid lp' during heap_xlog_delete - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: 'Invalid lp' during heap_xlog_delete |
Date | |
Msg-id | 20191206230640.2dvdjpcgn46q3ks2@alap3.anarazel.de Whole thread Raw |
In response to | 'Invalid lp' during heap_xlog_delete (Daniel Wood <hexexpert@comcast.net>) |
Responses |
Re: 'Invalid lp' during heap_xlog_delete
|
List | pgsql-hackers |
Hi, On 2019-11-08 12:46:51 -0800, Daniel Wood wrote: > Page on disk has empty lp 1 > * Insert into page lp 1 > > checkpoint START. Redo eventually starts here. > ** Delete all rows on page. It's worthwhile to note that this part cannot happen without full page writes disabled. By dint of a checkpoint having stared previously, this will otherwise always include an FPW (or be marked as WILL_INIT by a previous record, which functionally is equivalent). > autovac truncate > DropRelFileNodeBuffers - dirty page NOT written. lp 1 on disk still empty > checkpoint completes If I understand correctly, the DropRelFileNodeBuffers() needs to happen before the BufferSync() reaches the buffer containing the page with the deletion, but before the relevant file(s) are truncated. And obviously the deletion needs to have finished modifying the page in question. Not a conflict with what you wrote, just confirming. > crash > smgrtruncate - Not reached This seems like a somewhat confusing description to me, because smgrtruncate() is what calls DropRelFileNodeBuffers(). I assume what you mean by "smgrtruncate" is not the function, but the smgr_truncate() callback? > Even if we reach the truncate we don't fsync it till the next > checkpoint. So on filesystems which delay metadata updates a crash > can lose the truncate. I think that's probably fine though. Leaving the issue of checkpoint completing inbetween the DropRelFileNodeBuffers() and the actual truncation aside, we'd have the WAL logged truncation truncating the file. I don't think it's unreasonable to except a filesystem that claims to support running without full_page_writes (I've seen several such claims turning out not to be true under load), to preserve either the original page contents or the new file size after a a crash. If your filesystem doesn't, you really ought not to use it with FPW = off. I do wonder, halfway related, if there's an argument that XLogReadBufferForRedoExtended() ought to return something other than BLK_NEEDS_REDO for pages read during recovery that are all-zeroes, at least for some RBM_* modes. > Once we do the fsync(), for the truncate, the REDO read will return > BLK_NOTFOUND and the DELETE REDO attempt will be skipped. WIthout the > fsync() or crashing before the truncate, the delete redo depends on > the most recent version of the page having been written by the > checkpoint. We also need to correctly replay this on a standby, so I don't think just adding an smgrimmedsync() is the answer. We'd not replay the truncation standbys / during PITR, unless I miss something. So we'd just end up with the same problem in slightly different situations. > Is DropRelFileNodeBuffers purely for performance or would there be any > correctness problems if not done. There would be correctness problems if we left that out - the on-disk state and the in-memory state would diverge. To me it sounds the fix here would be to rejigger the RelationTruncate() sequence for truncation of the main for as follows: 1) MyPgXact->delayChkpt = true 2) XLogInsert(XLOG_SMGR_TRUNCATE) 3) smgrtruncate() (which, as now, first does a DropRelFileNodeBuffers(), and then calls the smgr_truncate callback) 4) MyPgXact->delayChkpt = false I'm not worried about the increased delayChkpt = true time. Compared with the frequency of RecordTransactionCommit() this seems harmless. I'm inclined to think that we should make the XLogFlush() in the RelationNeedsWAL() branch of RelationTruncate() unconditional. Performing the truncation on the filesystem level without actually having persisted the corresponding WAL record is dangerous. I think we need to backpatch a fix for this (even if one were to consider fpw = off unsupported). I think there's enough other nasty edge cases here. While fpw=on fixes the deletion case at hand, I think you could very well end up in a nasty situation in other cases where either redo location or the actual checkpoint record would fall between the WAL record and the actual truncation. Imagine e.g. a base backup starting in such a situation - you'd potentially end up with a relation that contains old data, without later replaying the truncation record. The window isn't huge, but also not negligible. I'll start a separate thread about whether we need to do a good bit of the work in smgrtruncate() / smgrdounlink() / ... in critical sections. - Andres
pgsql-hackers by date: