WAL dirty-buffer management bug - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | WAL dirty-buffer management bug |
Date | |
Msg-id | 16678.1143744716@sss.pgh.pa.us Whole thread Raw |
Responses |
Re: WAL dirty-buffer management bug
Re: WAL dirty-buffer management bug |
List | pgsql-hackers |
I added some notes to src/backend/access/transam/README explaining the protocol for executing a WAL-logged action: --- 1. Pin and exclusive-lock the shared buffer(s) containing the data page(s) to be modified. 2. START_CRIT_SECTION() (Any error during the next two steps must cause a PANIC because the shared buffers will contain unlogged changes, which we have to ensure don't get to disk. Obviously, you should check conditions such as whether there's enough free space on the page before you start the critical section.) 3. Apply the required changes to the shared buffer(s). 4. Build a WAL log record and pass it to XLogInsert(); then update the page's LSN and TLI using the returned XLOG location. For instance, recptr = XLogInsert(rmgr_id, info, rdata); PageSetLSN(dp, recptr); PageSetTLI(dp, ThisTimeLineID); 5. END_CRIT_SECTION() 6. Unlock and write the buffer(s): LockBuffer(buffer, BUFFER_LOCK_UNLOCK); WriteBuffer(buffer); (Note: WriteBuffer doesn't really "write" the buffer anymore, it just marks it dirty and unpins it. The write will not happen until a checkpoint occurs or the shared buffer is needed for another page.) --- This is pretty much what heapam and btree currently do, but on looking at it I think it's got a problem: we really ought to mark the buffer dirty before releasing the critical section. Otherwise, if there's an elog(ERROR) before the WriteBuffer call is reached, the backend would go on about its business, and we'd have changes in a disk buffer that isn't marked dirty. The changes would be uncommitted, presumably, because of the error --- but nonetheless this could result in inconsistency down the road. One example scenario is:1. We insert a tuple at, say, index 3 on a page.2. elog after making the XLOG entry, butbefore WriteBuffer.3. page is later discarded from shared buffers; since it's not marked dirty, it'll just be droppedwithout writing it.4. Later we need to insert another tuple in same table, and we again choose index 3 on this pageas the place to put it.5. system crash leads to replay from WAL. Now we'll have two different WAL records trying to insert tuple 3. Not good. While there's not much chance of an elog in just LockBuffer/WriteBuffer, some of the code paths in btree do quite a bit of stuff before they write the buffer (this was probably done to avoid multiple entries to bufmgr, back when the BufMgrLock was a severe source of contention). So I think there's a nontrivial risk of index corruption from this coding practice. I'm thinking we should change the code and the README to specify that you must mark the buffer dirty before you can END_CRIT_SECTION(). Comments? regards, tom lane
pgsql-hackers by date: