Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected - Mailing list pgsql-hackers
From | Greg Burd |
---|---|
Subject | Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected |
Date | |
Msg-id | 64F044AB-08FD-46E6-80FA-C912F42E77D2@burd.me Whole thread Raw |
In response to | Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected (Thomas Munro <thomas.munro@gmail.com>) |
List | pgsql-hackers |
> On Aug 15, 2025, at 10:25 PM, Thomas Munro <thomas.munro@gmail.com> wrote: > > On Sat, Aug 16, 2025 at 5:58 AM Burd, Greg <greg@burd.me> wrote: >>> 1. They use CAS in sem_post() because they want to report EOVERFLOW if >>> you exceed SEM_VALUE_MAX, but POSIX doesn't seem to require that, so I >>> just used fetch-and-add. Is that bad? I noticed their alternative >>> older version (remove "_new" from [1]) also uses a whole 32 bit >>> counter like me, so perhaps isn't so worried about exceeding a narrow >>> field of bits, and does it my way. >> >> The Open Group Base Specifications Issue 7 (POSIX.1-2008) [1] says that >> the sem_post() function shall fail if the maximum allowable value of the >> semaphore would be exceeded [EOVERFLOW]. So, maybe a CAS is the way to >> go after all? > > Hmm. Before last year, POSIX *didn't* specify EOVERFLOW. I was > looking at one of the earlier versions. But I see now that POSIX 2024 > has just added it! So the question is: do we care? I’m on the fence about "caring", if your code fixes the issue we could simply add a comment about eventually needing to support EOVERFLOW. I don't think what you have in the patch is wrong and we're solving our issue, not the worlds inconsistencies. > Supposing posix_sema.c checked that the maximum number of backends > didn't exceed SEM_VALUE_MAX and refused to start up if so (I suppose > today if you later exceed it in sem_post() you'll get either FATAL: > EOVERFLOW on POSIX 2024 systems or unspecified behaviour, likely > including a hang due to a corrupted counter, I guess). I don't know > if systems with a low enough value exist in the wild, but if this > patch were to define its own PG_SEM_VALUE_MAX and then redefine > SEM_VALUE_MAX to that, it certainly wouldn't be one of them. It would > naturally use UINT32_MAX, and then that check would be tautological > given the types involved, not to mention our own lower limits. So > perhaps it's not worth more than a comment? > > I've found only very brief discussions in the standards community so far: > > https://austingroupbugs.net/view.php?id=315 > https://austin-group-l.opengroup.narkive.com/8CNaq5S0/sem-post-overflow I see what you mean, a comment that captures this decision and we're done IMO. > >> So if I understand it the way your sem_trywait() works is: >> * Try non-blocking acquire with pg_sem_trywait() >> * If fails (semaphore = 0): Call pg_futex_wait_u32() to block efficiently >> * When woken up: Loop back to try again >> >> This is essentially how sem_trywait() implementations work in Linux IIRC, >> they use futexes or similar kernel primitives for efficient blocking. > > Yeah. Linux gave us the name futex and seems to have established a > sort of de facto standard semantics for it, as seen here in macOS. Indeed, the "fast userspace mutex". *sigh* ;-P I was calling this out because an implementation of sem_trywait() that used sem_wait() that wasn't based on futex would be less correct AFAICT. > It's not the only way to skin that cat... IIUC there were various > older systems with the double-check to avoid the race against user > space, and "address as wait channel" wasn't new either, but I think at > least some of the other things like that might have required a > waitlist in user space; I'm not too sure about that history though, a > lot of it was probably closed source. It's not even too clear to me > how that waitlist tradeoff works out, but in any case, that is exactly > how a fallback implementation would probably work if we ever decided > we wanted pg_futex everywhere: stick a ConditionVariable (or similar) > into the struct, or if you want to preserve sizeof(pg_futex_u32) == > sizeof(uint32) and maybe even do away with the struct and just use > pg_atomic_uint32 directly then you could hash the address to select a > cv from a table of (say) 256 of them, with the annoying detail that > you'd probably need a separate wake/wait function pair for {DSM > segment handle, offset} for memory mapped at different addresses. But > we don't need to make decisions about any of that vapourware to solve > this immediate single-platform problem. >> Nit: pg_sem_init() doesn't return EINVAL if value > SEM_VALUE_MAX > > We'd be checking if a uint32 is > UINT32_MAX, assuming again we that > we'd decided to define our own PG_SEM_VALUE_MAX and not the (entirely > unrelated) system SEM_VALUE_MAX. Might even get a compiler warning if > you tried it. Oops, yep. You are right. :) >> Overall, I like the approach, it addresses a real limitation on macOS and >> could be expanded in the future to cover other similar issues on other >> platforms. > > Thanks for looking! And especially for the sanity check on the > definitions of sem_wait()/sem_trywait(). Sure thing, anytime. Not sure I helped much... -greg
pgsql-hackers by date: