Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected - Mailing list pgsql-hackers
From | Thomas Munro |
---|---|
Subject | Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected |
Date | |
Msg-id | CA+hUKG+4XVv7xinPA2h7MNCnaTpD+0DRW_rCjE-AdPwW3OGd1Q@mail.gmail.com Whole thread Raw |
In response to | Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected ("Burd, Greg" <greg@burd.me>) |
Responses |
Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected
Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected |
List | pgsql-hackers |
On Sat, Aug 16, 2025 at 5:58 AM Burd, Greg <greg@burd.me> wrote: > > 1. They use CAS in sem_post() because they want to report EOVERFLOW if > > you exceed SEM_VALUE_MAX, but POSIX doesn't seem to require that, so I > > just used fetch-and-add. Is that bad? I noticed their alternative > > older version (remove "_new" from [1]) also uses a whole 32 bit > > counter like me, so perhaps isn't so worried about exceeding a narrow > > field of bits, and does it my way. > > The Open Group Base Specifications Issue 7 (POSIX.1-2008) [1] says that > the sem_post() function shall fail if the maximum allowable value of the > semaphore would be exceeded [EOVERFLOW]. So, maybe a CAS is the way to > go after all? Hmm. Before last year, POSIX *didn't* specify EOVERFLOW. I was looking at one of the earlier versions. But I see now that POSIX 2024 has just added it! So the question is: do we care? Supposing posix_sema.c checked that the maximum number of backends didn't exceed SEM_VALUE_MAX and refused to start up if so (I suppose today if you later exceed it in sem_post() you'll get either FATAL: EOVERFLOW on POSIX 2024 systems or unspecified behaviour, likely including a hang due to a corrupted counter, I guess). I don't know if systems with a low enough value exist in the wild, but if this patch were to define its own PG_SEM_VALUE_MAX and then redefine SEM_VALUE_MAX to that, it certainly wouldn't be one of them. It would naturally use UINT32_MAX, and then that check would be tautological given the types involved, not to mention our own lower limits. So perhaps it's not worth more than a comment? I've found only very brief discussions in the standards community so far: https://austingroupbugs.net/view.php?id=315 https://austin-group-l.opengroup.narkive.com/8CNaq5S0/sem-post-overflow > So if I understand it the way your sem_trywait() works is: > * Try non-blocking acquire with pg_sem_trywait() > * If fails (semaphore = 0): Call pg_futex_wait_u32() to block efficiently > * When woken up: Loop back to try again > > This is essentially how sem_trywait() implementations work in Linux IIRC, > they use futexes or similar kernel primitives for efficient blocking. Yeah. Linux gave us the name futex and seems to have established a sort of de facto standard semantics for it, as seen here in macOS. It's not the only way to skin that cat... IIUC there were various older systems with the double-check to avoid the race against user space, and "address as wait channel" wasn't new either, but I think at least some of the other things like that might have required a waitlist in user space; I'm not too sure about that history though, a lot of it was probably closed source. It's not even too clear to me how that waitlist tradeoff works out, but in any case, that is exactly how a fallback implementation would probably work if we ever decided we wanted pg_futex everywhere: stick a ConditionVariable (or similar) into the struct, or if you want to preserve sizeof(pg_futex_u32) == sizeof(uint32) and maybe even do away with the struct and just use pg_atomic_uint32 directly then you could hash the address to select a cv from a table of (say) 256 of them, with the annoying detail that you'd probably need a separate wake/wait function pair for {DSM segment handle, offset} for memory mapped at different addresses. But we don't need to make decisions about any of that vapourware to solve this immediate single-platform problem. > Nit: pg_sem_init() doesn't return EINVAL if value > SEM_VALUE_MAX We'd be checking if a uint32 is > UINT32_MAX, assuming again we that we'd decided to define our own PG_SEM_VALUE_MAX and not the (entirely unrelated) system SEM_VALUE_MAX. Might even get a compiler warning if you tried it. > Overall, I like the approach, it addresses a real limitation on macOS and > could be expanded in the future to cover other similar issues on other > platforms. Thanks for looking! And especially for the sanity check on the definitions of sem_wait()/sem_trywait().
pgsql-hackers by date: