Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected - Mailing list pgsql-hackers

From Tom Lane
Subject Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected
Date
Msg-id 2682137.1754860980@sss.pgh.pa.us
Whole thread Raw
In response to `pg_ctl init` crashes when run concurrently; semget(2) suspected  (Gavin Panella <gavinpanella@gmail.com>)
Responses Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected
Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected
List pgsql-hackers
Gavin Panella <gavinpanella@gmail.com> writes:
> I have many tests which spin up clusters using `pg_ctl init`, each in its
> own single-use temporary directory. Each test is run for every PostgreSQL
> installation found on the host machine. These tests are often run
> concurrently. Since adding PostgreSQL 17 to the mix, I've been getting
> sporadic failures on macOS:
>   FATAL:  could not create semaphores: Invalid argument
>   DETAIL:  Failed system call was semget(176163502, 20, 03600).
>   child process exited with exit code 1

Ugh.

> I think it's related to the increase of SEMAS_PER_SET
> in 38da053463bef32adf563ddee5277d16d2b6c5a (later reverted
> in 810a8b1c8051d4e8822967a96f133692698386de) combined with the behaviour of
> semget(2) on macOS.

Yeah, I guess you're right; but it's not macOS-specific AFAICS.
"man semget" quoth

     [EINVAL]           The number of semaphores requested is either less than
                        1 or greater than the system imposed maximum per set.

     [EINVAL]           A semaphore identifier exists for the argument key,
                        but the number of semaphores in the set associated
                        with it is less than nsems, and nsems is non-zero.

This is from current macOS, but equivalent text appears on Linux and
in the POSIX spec.  So it's just luck that nobody has reported the
same problem elsewhere --- unless maybe there is some macOS-specific
behavior making it more likely that different installs would try the
same key.  Nonetheless, key collisions must be expected, and we have
little right to assume that a pre-existing semaphore set must have
enough semaphores.

Using the same errno for these cases was really quite poorly chosen,
because in the first case we should not retry with another key, but
in the second case we should.  If the problem is that SEMMSL is too
small, and we retry, it'll be an infinite loop.

I don't care for your patch because (a) I'm not sure that the result
of asking for zero semaphores is portable (the Linux man page suggests
that it might be allowed there); and (b) if the problem is SEMMSL,
your code will fall into the infinite loop.

My first thought about fixing it is to go ahead and retry after
EINVAL, but keep a count of failures, and give up after 100 or 1000
or so.  This'd be safer in any case --- it's not really great that
IpcSemaphoreCreate is willing to retry an infinite number of times.

If we do retry, the semget() in IpcSemaphoreCreate will fail again,
so we'll just skip over that semaphore key and try another.  It
would be marginally nicer to release the semaphore set if it's
unused, but I'm not sure it'd be safe to assume that a set with
too few semaphores is one of ours.  It certainly didn't come from
a prior cycle of life of the same postmaster.

            regards, tom lane



pgsql-hackers by date:

Previous
From: Gavin Panella
Date:
Subject: `pg_ctl init` crashes when run concurrently; semget(2) suspected
Next
From: Tom Lane
Date:
Subject: Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected