Re: Refactoring the checkpointer's fsync request queue - Mailing list pgsql-hackers
From: Thomas Munro
Subject: Re: Refactoring the checkpointer's fsync request queue
Msg-id: CAEepm=10A5eySmc9=sPje6nNbaSBYB0gQ3Rb0dDvph-Ei2H0cQ@mail.gmail.com
In response to: Re: Refactoring the checkpointer's fsync request queue (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: Refactoring the checkpointer's fsync request queue
List: pgsql-hackers
On Sat, Nov 17, 2018 at 10:53 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Thu, Nov 15, 2018 at 5:09 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > While testing this patch with frequent checkpoints I've stumbled upon an
> > interesting error, that happened already after I finished one test:
> >
> > TRAP: FailedAssertion("!(rc > 0)", File: "checkpointer.c", Line: 574)

Fixed in the 0001 patch (and a similar problem in the WIN32 branch).

On Thu, Nov 15, 2018 at 10:37 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Nov 13, 2018 at 6:44 PM Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
> > > That sounds a little like you are proposing to go back to the way
> > > things were before 806a2aee3791244bf0f916729bfdb5489936e068 (and,
> > > belatedly, bf405ba8e460051e715d0a91442b579e590328ce) although I guess
> > > the division of labor wouldn't be quite the same.
> >
> > But is there an argument against it?  The checkpointer would still be
> > creating checkpoints including running fsync, but the background
> > writer would be, erm, writing, erm, in the background.
>
> I don't know.  I guess the fact that the checkpointer is still
> performing the fsyncs is probably a key point.  I mean, in the old
> division of labor, fsyncs could interrupt the background writing that
> was supposed to be happening.

Robert explained off-list that BgBufferSync() and BufferSync() have
rather different goals, and performing them in the same loop without
major reengineering to merge their logic would probably not work out
well.  So I'm abandoning that plan for now (though it could perhaps
be interesting if done right).  I do have a new plan though...

On Wed, Nov 14, 2018 at 7:01 AM Andres Freund <andres@anarazel.de> wrote:
> ...
> I've previously wondered whether
> there's any way we could delay the write to a point where the buffer is
> not locked anymore - as far as I can tell it's actually not required for
> correctness that we send the fsync request before unlocking. It's
> architecturally a bit dicey tho :(

... and it's basically what Andres said there ^^^.

The specific hazard I wondered about is when a checkpoint begins after
BufferAlloc() calls pwrite() but before it calls sendto(), so that we
fail to fsync() a file that was modified before the checkpoint LSN.
But, AFAICS, assuming we call sendto() before we update the buffer
header, there are only two possibilities from the point of view of
BufferAlloc():

1.  The checkpointer's BufferSync() loop arrives before we update the
buffer header, so it sees the buffer as dirty, writes it out (again),
remembers that the segment is dirty, and then when we eventually get
the buffer header lock we see that it's not dirty anymore and we just
skip the buffer.

2.  The checkpointer's BufferSync() loop arrives after we updated the
buffer header, so it sees it as invalid (or some later state), which
means that we have already called sendto() (before we updated the
header).

Either way, the checkpointer finishes up calling fsync() before the
checkpoint completes, as it should, and the worst that can happen due
to bad timing is a harmless double pwrite().

I noticed a subtle problem though.  Suppose we have case 2 above.
After BufferSync() returns in the checkpointer, our backend has
called sendto() to register a dirty file.  In v2 the checkpointer
runs AbsorbAllFsyncRequests() to drain the pipe until it sees a
message for the current cycle (that is, it absorbs messages for the
previous cycle).  That's obviously not good enough, since backends
race to call sendto() and a message for cycle n - 1 might be hiding
behind a message for cycle n.
So I propose to drain the pipe until it is empty or we see a message
for cycle n + 1 (where n is the current cycle before we start
draining; seeing n + 1 means we ran out of fds and forced a new cycle
in FlushFsyncRequestQueueIfNecessary()).  I think that works:
although we can't be sure that we'll receive all messages for n - 1
before we receive a message for n, due to races on the insert side,
we *can* be certain that we'll receive all messages for n - 1 before
we receive a message for n + 1, because we know that they were
already in the pipe before we began.  In the happy case, our system
never runs out of fds, so the pipe will eventually be empty, since
backends send at most one message per cycle per file and the number
of files is finite.  In the sad case, there are too many dirty files
per cycle, so we keep creating new cycles while absorbing, but again
the loop is bounded, because we know that seeing n + 1 is enough for
our purpose (which is to fsync all files that were already mentioned
in messages sent to the pipe before we started our loop).  That's
implemented in the 0002 patch, separated for ease of review.

The above theories cover BufferAlloc()'s interaction with a
checkpoint, and that seems to be the main story.  I'm not entirely
sure about the other callers of FlushBuffer() and FlushOneBuffer()
(eg special case init fork stuff), but I've run out of brain power
for now and wanted to post an update.

--
Thomas Munro
http://www.enterprisedb.com