Home > mailing lists

Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected - Mailing list pgsql-hackers

From	Greg Burd
Subject	Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected
Date	August 16 17:00:09
Msg-id	64F044AB-08FD-46E6-80FA-C912F42E77D2@burd.me Whole thread Raw
In response to	Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected (Thomas Munro <thomas.munro@gmail.com>)
List	pgsql-hackers

Tree view

> On Aug 15, 2025, at 10:25 PM, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Sat, Aug 16, 2025 at 5:58 AM Burd, Greg <greg@burd.me> wrote:
>>> 1. They use CAS in sem_post() because they want to report EOVERFLOW if
>>> you exceed SEM_VALUE_MAX, but POSIX doesn't seem to require that, so I
>>> just used fetch-and-add.  Is that bad?  I noticed their alternative
>>> older version (remove "_new" from [1]) also uses a whole 32 bit
>>> counter like me, so perhaps isn't so worried about exceeding a narrow
>>> field of bits, and does it my way.
>>
>> The Open Group Base Specifications Issue 7 (POSIX.1-2008) [1] says that
>> the sem_post() function shall fail if the maximum allowable value of the
>> semaphore would be exceeded [EOVERFLOW].  So, maybe a CAS is the way to
>> go after all?
>
> Hmm.  Before last year, POSIX *didn't* specify EOVERFLOW.  I was
> looking at one of the earlier versions.  But I see now that POSIX 2024
> has just added it!  So the question is: do we care?

I’m on the fence about "caring", if your code fixes the issue we could
simply add a comment about eventually needing to support EOVERFLOW.  I
don't think what you have in the patch is wrong and we're solving our
issue, not the worlds inconsistencies.

> Supposing posix_sema.c checked that the maximum number of backends
> didn't exceed SEM_VALUE_MAX and refused to start up if so (I suppose
> today if you later exceed it in sem_post() you'll get either FATAL:
> EOVERFLOW on POSIX 2024 systems or unspecified behaviour, likely
> including a hang due to a corrupted counter, I guess).  I don't know
> if systems with a low enough value exist in the wild, but if this
> patch were to define its own PG_SEM_VALUE_MAX and then redefine
> SEM_VALUE_MAX to that, it certainly wouldn't be one of them.  It would
> naturally use UINT32_MAX, and then that check would be tautological
> given the types involved, not to mention our own lower limits.  So
> perhaps it's not worth more than a comment?
>
> I've found only very brief discussions in the standards community so far:
>
> https://austingroupbugs.net/view.php?id=315
> https://austin-group-l.opengroup.narkive.com/8CNaq5S0/sem-post-overflow

I see what you mean, a comment that captures this decision and we're
done IMO.

>
>> So if I understand it the way your sem_trywait() works is:
>> * Try non-blocking acquire with pg_sem_trywait()
>> * If fails (semaphore = 0): Call pg_futex_wait_u32() to block efficiently
>> * When woken up: Loop back to try again
>>
>> This is essentially how sem_trywait() implementations work in Linux IIRC,
>> they use futexes or similar kernel primitives for efficient blocking.
>
> Yeah.  Linux gave us the name futex and seems to have established a
> sort of de facto standard semantics for it, as seen here in macOS.

Indeed, the "fast userspace mutex".  *sigh* ;-P

I was calling this out because an implementation of sem_trywait() that
used sem_wait() that wasn't based on futex would be less correct AFAICT.

> It's not the only way to skin that cat... IIUC there were various
> older systems with the double-check to avoid the race against user
> space, and "address as wait channel" wasn't new either, but I think at
> least some of the other things like that might have required a
> waitlist in user space; I'm not too sure about that history though, a
> lot of it was probably closed source.  It's not even too clear to me
> how that waitlist tradeoff works out, but in any case, that is exactly
> how a fallback implementation would probably work if we ever decided
> we wanted pg_futex everywhere: stick a ConditionVariable (or similar)
> into the struct, or if you want to preserve sizeof(pg_futex_u32) ==
> sizeof(uint32) and maybe even do away with the struct and just use
> pg_atomic_uint32 directly then you could hash the address to select a
> cv from a table of (say) 256 of them, with the annoying detail that
> you'd probably need a separate wake/wait function pair for {DSM
> segment handle, offset} for memory mapped at different addresses.  But
> we don't need to make decisions about any of that vapourware to solve
> this immediate single-platform problem.


>> Nit: pg_sem_init() doesn't return EINVAL if value > SEM_VALUE_MAX
>
> We'd be checking if a uint32 is > UINT32_MAX, assuming again we that
> we'd decided to define our own PG_SEM_VALUE_MAX and not the (entirely
> unrelated) system SEM_VALUE_MAX.  Might even get a compiler warning if
> you tried it.

Oops, yep.  You are right. :)

>> Overall, I like the approach, it addresses a real limitation on macOS and
>> could be expanded in the future to cover other similar issues on other
>> platforms.
>
> Thanks for looking!  And especially for the sanity check on the
> definitions of sem_wait()/sem_trywait().

Sure thing, anytime.  Not sure I helped much...

-greg

pgsql-hackers by date:

From: Robert Treat
Date: 16 August, 16:41:57
Subject: Re: Adding REPACK [concurrently]

From: Tom Lane
Date: 16 August, 17:22:50
Subject: Re: Retail DDL

Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected - Mailing list pgsql-hackers

Previous

Next