global barrier & atomics in signal handlers (Re: Atomic operationswithin spinlocks) - Mailing list pgsql-hackers

From Andres Freund
Subject global barrier & atomics in signal handlers (Re: Atomic operationswithin spinlocks)
Date
Msg-id 20200609193723.eu5ilsjxwdpyxhgz@alap3.anarazel.de
Whole thread Raw
In response to Re: Atomic operations within spinlocks  (Andres Freund <andres@anarazel.de>)
Responses Re: global barrier & atomics in signal handlers (Re: Atomicoperations within spinlocks)
List pgsql-hackers
Hi,

On 2020-06-08 23:08:47 -0700, Andres Freund wrote:
> On 2020-06-05 17:19:26 -0700, Andres Freund wrote:
> > I wrote a patch for this, and when I got around to to testing it, I
> > found that our tests currently don't pass when using both
> > --disable-spinlocks and --disable-atomics. Turns out to not be related
> > to the issue above, but the global barrier support added in 13.
> >
> > That *reads* two 64 bit atomics in a signal handler. Which is normally
> > fine, but not at all cool when atomics (or just 64 bit atomics) are
> > backed by spinlocks. Because we can "self interrupt" while already
> > holding the spinlock.
> >
> > It looks to me that that's a danger whenever 64bit atomics are backed by
> > spinlocks, not just when both --disable-spinlocks and --disable-atomics
> > are used. But I suspect that it's really hard to hit the tiny window of
> > danger when those options aren't used. While we have buildfarm animals
> > testing each of those separately, we don't have one that tests both
> > together...
> >
> > I'm not really sure what to do about that issue. The easisest thing
> > would probably be to change the barrier generation to 32bit (which
> > doesn't have to use locks for reads in any situation).  I tested doing
> > that, and it fixes the hangs for me.
> >
> >
> > Randomly noticed while looking at the code:
> >     uint64        flagbit = UINT64CONST(1) << (uint64) type;
> >
> > that shouldn't be 64bit, right?
> 
> Attached is a series of patches addressing these issues, of varying
> quality:
> 
> 1) This fixes the above mentioned issue in the global barrier code by
>    using 32bit atomics. That might be fine, or it might not. I just
>    included it here because otherwise the tests cannot be run fully.

Hm. Looking at this again, perhaps the better fix would be to simply not
look at the concrete values of the barrier inside the signal handler?
E.g. we could have a new PROCSIG_GLOBAL_BARRIER, which just triggers
ProcSignalBarrierPending to be set. And then have
ProcessProcSignalBarrier do the check that's currently in
CheckProcSignalBarrier()?

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: Why is pq_begintypsend so slow?
Next
From: Ranier Vilela
Date:
Subject: Re: Speedup usages of pg_*toa() functions