Re: Multixact slru doesn't don't force WAL flushes in SlruPhysicalWritePage() - Mailing list pgsql-hackers
From | Noah Misch |
---|---|
Subject | Re: Multixact slru doesn't don't force WAL flushes in SlruPhysicalWritePage() |
Date | |
Msg-id | 20151129041549.GA1852531@tornado.leadboat.com Whole thread Raw |
In response to | Re: Multixact slru doesn't don't force WAL flushes in SlruPhysicalWritePage() (Noah Misch <noah@leadboat.com>) |
Responses |
Re: Re: Multixact slru doesn't don't force WAL flushes in SlruPhysicalWritePage()
|
List | pgsql-hackers |
On Tue, Nov 10, 2015 at 11:22:47PM -0500, Noah Misch wrote: > On Mon, Nov 09, 2015 at 10:40:07PM +0100, Andres Freund wrote: > > /* > > * Optional array of WAL flush LSNs associated with entries in the SLRU > > * pages. If not zero/NULL, we must flush WAL before writing pages (true > > * for pg_clog, false for multixact, pg_subtrans, pg_notify). group_lsn[] > > * has lsn_groups_per_page entries per buffer slot, each containing the > > * highest LSN known for a contiguous group of SLRU entries on that slot's > > * page. > > */ > > XLogRecPtr *group_lsn; > > int lsn_groups_per_page; > > > Here's the multixact.c comment justifying it: > > * XLOG interactions: this module generates an XLOG record whenever a new > * OFFSETs or MEMBERs page is initialized to zeroes, as well as an XLOG record > * whenever a new MultiXactId is defined. This allows us to completely > * rebuild the data entered since the last checkpoint during XLOG replay. > * Because this is possible, we need not follow the normal rule of > * "write WAL before data"; the only correctness guarantee needed is that > * we flush and sync all dirty OFFSETs and MEMBERs pages to disk before a > * checkpoint is considered complete. If a page does make it to disk ahead > * of corresponding WAL records, it will be forcibly zeroed before use anyway. > * Therefore, we don't need to mark our pages with LSN information; we have > * enough synchronization already. > > The comment's justification is incomplete, though. What of pages filled over > the course of multiple checkpoint cycles? Having studied this further, I think it is safe for a reason given elsewhere: * Note: we need not flush this XLOG entry to disk before proceeding. The * only way for the MXID to be referenced from any data page is for * heap_lock_tuple() to have put it there, and heap_lock_tuple() generates * an XLOG record that must follow ours. The normal LSN interlock between * the data page and that XLOG record will ensure that our XLOG record * reaches disk first. If the SLRU members/offsets data reaches disk * sooner than the XLOG record, we do not care because we'll overwrite it * with zeroes unless the XLOG record is there too; see notes at top of * this file. I find no flaw in the first three sentences. In most cases, one of multixact_redo(), TrimMultiXact() or ExtendMultiXactOffset() will indeed overwrite the widowed mxid data with zeroes before the system again writes data in that vicinity. We can fail to do that if a backend stalls just after calling GetNewMultiXactId(), then recovers and updates SLRU pages long after other backends filled the balance of those pages. That's still okay; if we never flush the XLOG_MULTIXACT_CREATE_ID record, no xmax referring to the mxid will survive recovery. I drafted the attached comment update to consolidate and slightly correct this justification. (If we were designing the multixact subsystem afresh today, I'd vote for following the write-WAL-before-data rule despite believing it is not needed. The simplicity would be worth it.) While contemplating the "stalls ... just after calling GetNewMultiXactId()" scenario, I noticed a race condition not involving any write-WAL-before-data violation. MultiXactIdCreateFromMembers() and callees have this skeleton: GetNewMultiXactId LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE) ExtendMultiXactOffset() ExtendMultiXactMember() START_CRIT_SECTION() (MultiXactState->nextMXact)++ MultiXactState->nextOffset += nmembers LWLockRelease(MultiXactGenLock) XLogInsert(RM_MULTIXACT_ID, XLOG_MULTIXACT_CREATE_ID) RecordNewMultiXact LWLockAcquire(MultiXactOffsetControlLock, LW_EXCLUSIVE) *offptr = offset; /* update pg_multixact/offsets SLRU page */ LWLockRelease(MultiXactOffsetControlLock) LWLockAcquire(MultiXactMemberControlLock, LW_EXCLUSIVE) *memberptr = members[i].xid; /* update pg_multixact/members SLRU page */ *flagsptr = flagsval; /* update pg_multixact/members SLRU page */ LWLockRelease(MultiXactMemberControlLock) END_CRIT_SECTION Between GetNewMultiXactId() and XLogInsert(), other backends will be free to create higher-numbered mxids. They will skip over SLRU space the current backend reserved. Future GetMultiXactIdMembers() calls rely critically on the current backend eventually finishing: * 2. The next multixact may still be in process of being filled in: that * is, another process may have done GetNewMultiXactId but not yet written * the offset entry for that ID. In that scenario, it is guaranteed that * the offset entry for that multixact exists (because GetNewMultiXactId * won't release MultiXactGenLock until it does) but contains zero * (because we are careful to pre-zero offset pages). Because * GetNewMultiXactId will never return zero as the starting offset for a * multixact, when we read zero as the next multixact's offset, we know we * have this case. We sleep for a bit and try again. A crash while paused this way makes permanent the zero offset. After recovery, though no xmax carries the offset-zero mxid, GetMultiXactIdMembers() on the immediate previous mxid hangs indefinitely. Also, pg_get_multixact_members(zero_offset_mxid) gets an assertion failure. I have not thoroughly considered how best to fix these. Test procedure: -- backend 1 checkpoint; create table victim0 (c) as select 1; create table stall1 (c) as select 1; create table last2 (c) as select 1; begin; select c from victim0 for key share; select c from stall1 for key share; select c from last2 for key share; -- backend 2 begin; update victim0 set c = c + 1; rollback; -- burn one mxid -- In a shell, attach GDB to backend 2. GDB will stop the next SQL command at -- XLogInsert() in MultiXactIdCreateFromMembers(): -- gdb --pid $BACKEND_PID -ex 'b XLogInsert' -ex c select c from stall1 for key share; -- stall in gdb while making an mxid -- backend 3 select c from last2 for key share; -- use one more mxid; flush WAL -- in GDB session, issue command: kill -- backend 1, after recovery select c from victim0; -- hangs, uncancelable -- backend 2, after recovery: assertion failure select pg_get_multixact_members((xmax::text::bigint - 1)::text::xid) from last2;
Attachment
pgsql-hackers by date: