Thread: Condition variables vs interrupts
Hi, While testing something unrelated, Tomas reported[1] that he could make a parallel worker ignore a SIGTERM and hang forever in ConditionVariableSleep(). I looked into this and realised that it's more likely in master. Commit 1321509f refactored the latch wait loop to look a little bit more like other examples* by putting CHECK_FOR_INTERRUPTS() after ResetLatch(), whereas previously it was at the top of the loop. ConditionVariablePrepareToSleep() was effectively relying on that order when it reset the latch without its own CFI(). The bug goes back to the introduction of CVs however, because there was no guarantee that you'd ever reach ConditionVariableSleep(). You could call ConditionVariablePrepareToSleep(), test your condition, decide you're done, then call ConditionVariableCancelSleep(), then reach some other WaitLatch() with no intervening CFI(). It might be hard to find a code path that actually does that without a coincidental CFI() to save you, but at least in theory the problem exists. I think we should probably just remove the unusual ResetLatch() call, rather than adding a CFI(). See attached. Thoughts? *It can't quite be exactly like the two patterns shown in latch.h, namely { Reset, Test, Wait } and { Test, Wait, Reset }, because the real test is external to this function; we have the other possible rotation { Wait, Reset, Test }, and this code is only run if the client's test failed. Really it's a nested loop, with the outer loop belonging to the caller, so we have { Test', { Wait, Reset, Test } }. However, CHECK_FOR_INTERRUPTS() also counts as a test of work to do, and AFAICS it always belongs between Reset and Wait, no matter how far you rotate the loop. I wonder if latch.h should mention that. [1] https://www.postgresql.org/message-id/20191217232124.3dtrycatgfm6oxxb%40development
Attachment
On Fri, Dec 20, 2019 at 12:05:34PM +1300, Thomas Munro wrote: > I think we should probably just remove the unusual ResetLatch() call, > rather than adding a CFI(). See attached. Thoughts? I agree: removing the ResetLatch() and having the wait event code deal with it is a better way to go. I wonder why the reset was done in the first place? -- Shawn Debnath Amazon Web Services (AWS)
On Sat, Dec 21, 2019 at 2:10 PM Shawn Debnath <sdn@amazon.com> wrote: > On Fri, Dec 20, 2019 at 12:05:34PM +1300, Thomas Munro wrote: > > I think we should probably just remove the unusual ResetLatch() call, > > rather than adding a CFI(). See attached. Thoughts? > > I agree: removing the ResetLatch() and having the wait event code deal > with it is a better way to go. I wonder why the reset was done in the > first place? Thanks for the review. I was preparing to commit this today, but I think I've spotted another latch protocol problem in the stable branches only and I'd to get some more eyeballs on it. I bet it's really hard to hit, but ConditionVariableSleep()'s return path (ie when the CV has been signalled) forgets that the the latch is multiplexed and latches are merged: just because it was set by ConditionVariableSignal() doesn't mean it wasn't *also* set by die() or some other interrupt, and yet at the point we return, we've reset the latch and not run CFI(), and there's nothing to say the caller won't then immediately wait on the latch in some other code path, and sleep like a log despite the earlier delivery of (say) SIGTERM. If I'm right about that, I think the solution is to move that CFI down in the stable branches (which you already did for master in commit 1321509f).
Attachment
On Tue, Dec 24, 2019 at 3:10 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Sat, Dec 21, 2019 at 2:10 PM Shawn Debnath <sdn@amazon.com> wrote: > > On Fri, Dec 20, 2019 at 12:05:34PM +1300, Thomas Munro wrote: > > > I think we should probably just remove the unusual ResetLatch() call, > > > rather than adding a CFI(). See attached. Thoughts? > > > > I agree: removing the ResetLatch() and having the wait event code deal > > with it is a better way to go. I wonder why the reset was done in the > > first place? I have pushed this on master only. > Thanks for the review. I was preparing to commit this today, but I > think I've spotted another latch protocol problem in the stable > branches only and I'd to get some more eyeballs on it. I bet it's > really hard to hit, but ConditionVariableSleep()'s return path (ie > when the CV has been signalled) forgets that the the latch is > multiplexed and latches are merged: just because it was set by > ConditionVariableSignal() doesn't mean it wasn't *also* set by die() > or some other interrupt, and yet at the point we return, we've reset > the latch and not run CFI(), and there's nothing to say the caller > won't then immediately wait on the latch in some other code path, and > sleep like a log despite the earlier delivery of (say) SIGTERM. If > I'm right about that, I think the solution is to move that CFI down in > the stable branches (which you already did for master in commit > 1321509f). I'm not going to back-patch this (ie the back-branches version from my previous email) without more discussion, but I still think it's subtly broken.