Thread: [HACKERS] Unportable implementation of background worker start
I chanced to notice that on gaur/pademelon, the "select_parallel" regression
test sometimes takes a great deal longer than normal, for no obvious reason.
It does eventually terminate, but sometimes only after 10 to 20 minutes rather
than the couple dozen seconds that are typical on that slow machine.

After a fair amount of hair-pulling, I traced the problem to the fact that
maybe_start_bgworker() will only start at most one worker per call; after
that, it sets StartWorkerNeeded = true and returns, opining in a comment that
ServerLoop will make another call shortly.  Unfortunately, that's hogwash.
It happens to work reliably on Linux, because according to signal(7)

    The following interfaces are never restarted after being interrupted by
    a signal handler, regardless of the use of SA_RESTART; they always fail
    with the error EINTR when interrupted by a signal handler:
    ... select(2) ...

However, that's a Linux-ism.  What POSIX 2008 says about it, in the select(2)
reference page, is that

    If SA_RESTART has been set for the interrupting signal, it is
    implementation-defined whether the function restarts or returns
    with [EINTR].

HPUX apparently adopts the "restart" definition, and as we've previously
found out, not only does select(2) not return immediately but it actually
seems to reset the timeout timer to its original value.  (Some googling
suggests that restarting occurs on SVR4-derived kernels but not BSD-derived
ones.)

So what happens, if several RegisterDynamicBackgroundWorker requests arrive
in a single iteration of the postmaster's sigusr1_handler, is that the first
one is serviced thanks to the maybe_start_bgworker call appearing in
sigusr1_handler, and then we return to the select() call and sleep.  The next
start request is serviced only when the typically-60-second select timeout
expires, or more usually when some other interrupt arrives at the postmaster.
In the regression tests, what seems to happen is that we get a
PMSIGNAL_START_AUTOVAC_WORKER from the autovac launcher every thirty seconds,
allowing one more bgworker to get launched each time we go through
sigusr1_handler.
Here's an actual trace of one select_parallel.sql query trying to launch
four parallel workers; I added a bunch of elog(LOG) calls to help diagnose
what was going on:

2017-04-19 17:25:33.448 EDT [8827] LOG: signaling postmaster for startup of slot 1
2017-04-19 17:25:33.448 EDT [8827] STATEMENT: select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0;
2017-04-19 17:25:33.448 EDT [8827] LOG: signaling postmaster for startup of slot 2
2017-04-19 17:25:33.448 EDT [8827] STATEMENT: select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0;
2017-04-19 17:25:33.448 EDT [8827] LOG: signaling postmaster for startup of slot 3
2017-04-19 17:25:33.448 EDT [8827] STATEMENT: select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0;
2017-04-19 17:25:33.448 EDT [8827] LOG: signaling postmaster for startup of slot 4
2017-04-19 17:25:33.448 EDT [8827] STATEMENT: select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0;
2017-04-19 17:25:33.563 EDT [6456] LOG: entering sigusr1_handler
2017-04-19 17:25:33.563 EDT [6456] LOG: registering background worker "parallel worker for PID 8827"
2017-04-19 17:25:33.563 EDT [6456] LOG: registering background worker "parallel worker for PID 8827"
2017-04-19 17:25:33.563 EDT [6456] LOG: registering background worker "parallel worker for PID 8827"
2017-04-19 17:25:33.563 EDT [6456] LOG: registering background worker "parallel worker for PID 8827"
2017-04-19 17:25:33.563 EDT [6456] LOG: entered maybe_start_bgworker, StartWorkerNeeded=1, HaveCrashedWorker=1
2017-04-19 17:25:33.563 EDT [6456] LOG: starting background worker process "parallel worker for PID 8827"
2017-04-19 17:25:33.566 EDT [8871] LOG: starting parallel worker 3
2017-04-19 17:25:33.642 EDT [8871] LOG: exiting parallel worker 3
2017-04-19 17:25:33.642 EDT [8871] STATEMENT: select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0;
2017-04-19 17:25:33.647 EDT [6456] LOG: leaving sigusr1_handler
2017-04-19 17:25:33.647 EDT [6456] LOG: entering reaper
2017-04-19 17:25:33.647 EDT [6456] LOG: unregistering background worker "parallel worker for PID 8827"
2017-04-19 17:25:33.647 EDT [6456] LOG: reaped bgworker pid 8871 status 0
2017-04-19 17:25:33.647 EDT [6456] LOG: leaving reaper
2017-04-19 17:26:03.114 EDT [6456] LOG: entering sigusr1_handler
2017-04-19 17:26:03.115 EDT [6456] LOG: entered maybe_start_bgworker, StartWorkerNeeded=1, HaveCrashedWorker=1
2017-04-19 17:26:03.115 EDT [6456] LOG: starting background worker process "parallel worker for PID 8827"
2017-04-19 17:26:03.118 EDT [8874] LOG: starting parallel worker 2
2017-04-19 17:26:03.164 EDT [8874] LOG: exiting parallel worker 2
2017-04-19 17:26:03.164 EDT [8874] STATEMENT: select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0;
2017-04-19 17:26:03.185 EDT [6456] LOG: leaving sigusr1_handler
2017-04-19 17:26:03.185 EDT [6456] LOG: entering reaper
2017-04-19 17:26:03.186 EDT [6456] LOG: unregistering background worker "parallel worker for PID 8827"
2017-04-19 17:26:03.186 EDT [6456] LOG: reaped bgworker pid 8874 status 0
2017-04-19 17:26:03.186 EDT [6456] LOG: leaving reaper
2017-04-19 17:26:03.284 EDT [6456] LOG: entering reaper
2017-04-19 17:26:03.284 EDT [6456] LOG: leaving reaper
2017-04-19 17:26:33.378 EDT [6456] LOG: entering sigusr1_handler
2017-04-19 17:26:33.378 EDT [6456] LOG: entered maybe_start_bgworker, StartWorkerNeeded=1, HaveCrashedWorker=1
2017-04-19 17:26:33.378 EDT [6456] LOG: starting background worker process "parallel worker for PID 8827"
2017-04-19 17:26:33.382 EDT [8876] LOG: starting parallel worker 1
2017-04-19 17:26:33.428 EDT [8876] LOG: exiting parallel worker 1
2017-04-19 17:26:33.428 EDT [8876] STATEMENT: select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0;
2017-04-19 17:26:33.452 EDT [6456] LOG: leaving sigusr1_handler
2017-04-19 17:26:33.453 EDT [6456] LOG: entering reaper
2017-04-19 17:26:33.453 EDT [6456] LOG: unregistering background worker "parallel worker for PID 8827"
2017-04-19 17:26:33.453 EDT [6456] LOG: reaped bgworker pid 8876 status 0
2017-04-19 17:26:33.453 EDT [6456] LOG: leaving reaper
2017-04-19 17:26:33.560 EDT [6456] LOG: entering reaper
2017-04-19 17:26:33.560 EDT [6456] LOG: leaving reaper
2017-04-19 17:27:03.114 EDT [6456] LOG: entering sigusr1_handler
2017-04-19 17:27:03.115 EDT [6456] LOG: entered maybe_start_bgworker, StartWorkerNeeded=1, HaveCrashedWorker=1
2017-04-19 17:27:03.115 EDT [6456] LOG: starting background worker process "parallel worker for PID 8827"
2017-04-19 17:27:03.118 EDT [8879] LOG: starting parallel worker 0
2017-04-19 17:27:03.167 EDT [8879] LOG: exiting parallel worker 0
2017-04-19 17:27:03.167 EDT [8879] STATEMENT: select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0;
2017-04-19 17:27:03.174 EDT [6456] LOG: leaving sigusr1_handler
2017-04-19 17:27:03.174 EDT [6456] LOG: entering reaper
2017-04-19 17:27:03.175 EDT [6456] LOG: unregistering background worker "parallel worker for PID 8827"
2017-04-19 17:27:03.184 EDT [6456] LOG: reaped bgworker pid 8879 status 0
2017-04-19 17:27:03.184 EDT [6456] LOG: leaving reaper

While I haven't yet tested it, it seems like a fix might be as simple as
deleting these lines in maybe_start_bgworker:

            /*
             * Have ServerLoop call us again.  Note that there might not
             * actually *be* another runnable worker, but we don't care all
             * that much; we will find out the next time we run.
             */
            StartWorkerNeeded = true;
            return;

So I'm wondering what the design rationale was for only starting one
bgworker per invocation.

It also appears to me that do_start_bgworker's treatment of fork failure is
completely brain dead.  Did anyone really think about that case?

            regards, tom lane
Tom Lane wrote:

> While I haven't yet tested it, it seems like a fix might be as simple
> as deleting these lines in maybe_start_bgworker:
>
>             /*
>              * Have ServerLoop call us again.  Note that there might not
>              * actually *be* another runnable worker, but we don't care all
>              * that much; we will find out the next time we run.
>              */
>             StartWorkerNeeded = true;
>             return;
>
> So I'm wondering what the design rationale was for only starting one
> bgworker per invocation.

The rationale was that there may be other tasks waiting for postmaster
attention, and if there are many bgworkers needing to be started, the
other work may be delayed for a long time.  This is not the first time
that this rationale has been challenged, but so far there hasn't been
any good reason to change it.  One option is to just remove it as you
propose, but a different one is to stop using select(2) in ServerLoop,
because those behavior differences seem to make it rather unusable.

> It also appears to me that do_start_bgworker's treatment of fork
> failure is completely brain dead.  Did anyone really think about
> that case?

Hmm, I probably modelled it on autovacuum without giving that case much
additional consideration.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> So I'm wondering what the design rationale was for only starting one
>> bgworker per invocation.

> The rationale was that there may be other tasks waiting for postmaster
> attention, and if there are many bgworkers needing to be started, the
> other work may be delayed for a long time.  This is not the first time
> that this rationale has been challenged, but so far there hasn't been
> any good reason to change it.  One option is to just remove it as you
> propose, but a different one is to stop using select(2) in ServerLoop,
> because those behavior differences seem to make it rather unusable.

Hm.  Do you have a more-portable alternative?

            regards, tom lane
Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > Tom Lane wrote:
> >> So I'm wondering what the design rationale was for only starting one
> >> bgworker per invocation.
>
> > The rationale was that there may be other tasks waiting for postmaster
> > attention, and if there are many bgworkers needing to be started, the
> > other work may be delayed for a long time.  This is not the first time
> > that this rationale has been challenged, but so far there hasn't been
> > any good reason to change it.  One option is to just remove it as you
> > propose, but a different one is to stop using select(2) in ServerLoop,
> > because those behavior differences seem to make it rather unusable.
>
> Hm.  Do you have a more-portable alternative?

I was thinking in a WaitEventSet from latch.c.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> Hm.  Do you have a more-portable alternative?

> I was thinking in a WaitEventSet from latch.c.

Yeah, some googling turns up the suggestion that a self-pipe is a portable
way to get consistent semantics from select(); latch.c has already done
that.  I suppose that using latch.c would be convenient in that we'd have
to write little new code, but it's a tad annoying to add the overhead of a
self-pipe on platforms where we don't need it (which seems to be most).

            regards, tom lane
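[For readers unfamiliar with the trick being referred to: the self-pipe
technique routes signal wakeups through a file descriptor, so any fd-based
wait primitive sees them regardless of EINTR/restart semantics.  A minimal
sketch -- a simplified illustration, not the actual latch.c code -- looks
like this.]

    #include <errno.h>
    #include <fcntl.h>
    #include <signal.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* read and write ends of the self-pipe */
    static int selfpipe_readfd = -1;
    static int selfpipe_writefd = -1;

    /* signal handler: make the pipe readable so any select()/poll() wakes up */
    static void
    wakeup_handler(int signo)
    {
        int     save_errno = errno;
        char    dummy = 0;

        /* if the write fails with EAGAIN, a wakeup byte is already queued */
        (void) write(selfpipe_writefd, &dummy, 1);
        errno = save_errno;
    }

    static void
    init_self_pipe(void)
    {
        int     pipefd[2];

        if (pipe(pipefd) < 0)
            exit(1);
        /* non-blocking on both ends, so neither side can get stuck */
        (void) fcntl(pipefd[0], F_SETFL, O_NONBLOCK);
        (void) fcntl(pipefd[1], F_SETFL, O_NONBLOCK);
        selfpipe_readfd = pipefd[0];
        selfpipe_writefd = pipefd[1];
    }

[The wait loop then includes selfpipe_readfd in its read fd set and drains it
with read() until EAGAIN before re-checking whatever flags the handlers may
have set, so a signal arriving at any point makes the next wait return
promptly.]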
On 2017-04-19 18:56:26 -0400, Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > Tom Lane wrote:
> >> Hm.  Do you have a more-portable alternative?
>
> > I was thinking in a WaitEventSet from latch.c.
>
> Yeah, some googling turns up the suggestion that a self-pipe is a portable
> way to get consistent semantics from select(); latch.c has already done
> that.  I suppose that using latch.c would be convenient in that we'd have
> to write little new code, but it's a tad annoying to add the overhead of a
> self-pipe on platforms where we don't need it (which seems to be most).

FWIW, I'd wished before that we used something a bit more modern than
select() if available...  It's nice to be able to listen to a larger number
of sockets without repeated O(sockets) overhead.

BTW, we IIRC had discussed removing the select() backed latch
implementation in this release.  I'll try to dig up that discussion.

- Andres
Andres Freund <andres@anarazel.de> writes:
> FWIW, I'd wished before that we used something a bit more modern than
> select() if available...  It's nice to be able to listen to a larger
> number of sockets without repeated O(sockets) overhead.

[ raised eyebrow... ]  Is anyone really running postmasters with enough
listen sockets for that to be meaningful?

> BTW, we IIRC had discussed removing the select() backed latch
> implementation in this release.  I'll try to dig up that discussion.

Might be sensible.  Even my pet dinosaurs have poll(2).  We should check
the buildfarm to see if the select() implementation is being tested at all.

            regards, tom lane
I wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> Tom Lane wrote:
>>> Hm.  Do you have a more-portable alternative?

>> I was thinking in a WaitEventSet from latch.c.

My first reaction was that that sounded like a lot more work than removing
two lines from maybe_start_bgworker and adjusting some comments.  But on
closer inspection, the slow-bgworker-start issue isn't the only problem
here.  On a machine that restarts select()'s timeout after an interrupt, as
(at least) HPUX does, the postmaster will actually never iterate
ServerLoop's loop except immediately after receiving a new connection
request.  The stream of interrupts from the autovac launcher is alone
sufficient to prevent the initial 60-second timeout from ever elapsing.  So
if there are no new connection requests for awhile, none of the
housekeeping actions in ServerLoop get done.

Most of those actions are concerned with restarting failed background
tasks, which is something we could get by without --- it's unlikely that
those tasks would fail without causing a database-wide restart, and then
there really isn't going to be much need for them until at least one new
connection request has arrived.  But the last step in that loop is
concerned with touching the sockets and lock files to prevent aggressive
/tmp cleaners from removing them, and that's something that can't be let
slide, or we might as well not have the logic at all.  I've confirmed
experimentally that on my HPUX box, a postmaster not receiving new
connections for an hour or more in fact fails to update the mod times on
the sockets and lock files.  So that problem isn't hypothetical.

So it's looking to me like we do need to do something about this, and
ideally back-patch it all the way.  But WaitEventSet doesn't exist before
9.6.  Do we have the stomach for back-patching that into stable branches?

            regards, tom lane
I wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: >> Tom Lane wrote: >>> Hm. Do you have a more-portable alternative? >> I was thinking in a WaitEventSet from latch.c. > My first reaction was that that sounded like a lot more work than removing > two lines from maybe_start_bgworker and adjusting some comments. But on > closer inspection, the slow-bgworker-start issue isn't the only problem > here. On a machine that restarts select()'s timeout after an interrupt, > as (at least) HPUX does, the postmaster will actually never iterate > ServerLoop's loop except immediately after receiving a new connection > request. I had a go at making the postmaster use a WaitEventSet, and immediately ran into problems: latch.c is completely unprepared to be used anywhere except a postmaster child process. I think we can get around the issue for the self-pipe, as per the attached untested patch. But there remains a problem: we should do a FreeWaitEventSet() after forking a child process to ensure that postmaster children aren't running around with open FDs for the postmaster's stuff. This is no big problem in a regular Unix build; we can give ClosePostmasterPorts() the responsibility. It *is* a problem in EXEC_BACKEND children, which won't have inherited the WaitEventSet data structure. Maybe we could ignore the problem for Unix EXEC_BACKEND builds, since we consider those to be for debug purposes only, not for production. But I don't think we can get away with it for Windows --- or are the HANDLEs in a Windows WaitEventSet not inheritable resources? regards, tom lane diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c index 4798370..d4ac54c 100644 *** a/src/backend/storage/ipc/latch.c --- b/src/backend/storage/ipc/latch.c *************** static volatile sig_atomic_t waiting = f *** 129,134 **** --- 129,136 ---- /* Read and write ends of the self-pipe */ static int selfpipe_readfd = -1; static int selfpipe_writefd = -1; + /* Process owning the self-pipe --- needed for checking purposes */ + static int selfpipe_owner_pid = 0; /* Private function prototypes */ static void sendSelfPipeByte(void); *************** InitializeLatchSupport(void) *** 158,164 **** #ifndef WIN32 int pipefd[2]; ! Assert(selfpipe_readfd == -1); /* * Set up the self-pipe that allows a signal handler to wake up the --- 160,202 ---- #ifndef WIN32 int pipefd[2]; ! if (IsUnderPostmaster) ! { ! /* ! * We might have inherited connections to a self-pipe created by the ! * postmaster. It's critical that child processes create their own ! * self-pipes, of course, and we really want them to close the ! * inherited FDs for safety's sake. ! */ ! if (selfpipe_owner_pid != 0) ! { ! /* Assert we go through here but once in a child process */ ! Assert(selfpipe_owner_pid != MyProcPid); ! /* Release postmaster's pipe */ ! close(selfpipe_readfd); ! close(selfpipe_writefd); ! /* Clean up, just for safety's sake; we'll set these in a bit */ ! selfpipe_readfd = selfpipe_writefd = -1; ! selfpipe_owner_pid = 0; ! } ! else ! { ! /* ! * Postmaster didn't create a self-pipe ... or else we're in an ! * EXEC_BACKEND build, and don't know because we didn't inherit ! * values for the static variables. (In that case we'll fail to ! * close the inherited FDs, but that seems acceptable since ! * EXEC_BACKEND builds aren't meant for production on Unix.) ! */ ! Assert(selfpipe_readfd == -1); ! } ! } ! else ! { ! /* In postmaster or standalone backend, assert we do this but once */ ! Assert(selfpipe_readfd == -1); ! 
Assert(selfpipe_owner_pid == 0); ! } /* * Set up the self-pipe that allows a signal handler to wake up the *************** InitializeLatchSupport(void) *** 176,188 **** selfpipe_readfd = pipefd[0]; selfpipe_writefd = pipefd[1]; #else /* currently, nothing to do here for Windows */ #endif } /* ! * Initialize a backend-local latch. */ void InitLatch(volatile Latch *latch) --- 214,227 ---- selfpipe_readfd = pipefd[0]; selfpipe_writefd = pipefd[1]; + selfpipe_owner_pid = MyProcPid; #else /* currently, nothing to do here for Windows */ #endif } /* ! * Initialize a process-local latch. */ void InitLatch(volatile Latch *latch) *************** InitLatch(volatile Latch *latch) *** 193,199 **** #ifndef WIN32 /* Assert InitializeLatchSupport has been called in this process */ ! Assert(selfpipe_readfd >= 0); #else latch->event = CreateEvent(NULL, TRUE, FALSE, NULL); if (latch->event == NULL) --- 232,238 ---- #ifndef WIN32 /* Assert InitializeLatchSupport has been called in this process */ ! Assert(selfpipe_readfd >= 0 && selfpipe_owner_pid == MyProcPid); #else latch->event = CreateEvent(NULL, TRUE, FALSE, NULL); if (latch->event == NULL) *************** OwnLatch(volatile Latch *latch) *** 256,262 **** #ifndef WIN32 /* Assert InitializeLatchSupport has been called in this process */ ! Assert(selfpipe_readfd >= 0); #endif if (latch->owner_pid != 0) --- 295,301 ---- #ifndef WIN32 /* Assert InitializeLatchSupport has been called in this process */ ! Assert(selfpipe_readfd >= 0 && selfpipe_owner_pid == MyProcPid); #endif if (latch->owner_pid != 0) *************** DisownLatch(volatile Latch *latch) *** 289,295 **** * is incurred when WL_TIMEOUT is given, so avoid using a timeout if possible. * * The latch must be owned by the current process, ie. it must be a ! * backend-local latch initialized with InitLatch, or a shared latch * associated with the current process by calling OwnLatch. * * Returns bit mask indicating which condition(s) caused the wake-up. Note --- 328,334 ---- * is incurred when WL_TIMEOUT is given, so avoid using a timeout if possible. * * The latch must be owned by the current process, ie. it must be a ! * process-local latch initialized with InitLatch, or a shared latch * associated with the current process by calling OwnLatch. * * Returns bit mask indicating which condition(s) caused the wake-up. Note *************** FreeWaitEventSet(WaitEventSet *set) *** 597,603 **** * used to modify previously added wait events using ModifyWaitEvent(). * * In the WL_LATCH_SET case the latch must be owned by the current process, ! * i.e. it must be a backend-local latch initialized with InitLatch, or a * shared latch associated with the current process by calling OwnLatch. * * In the WL_SOCKET_READABLE/WRITEABLE case, EOF and error conditions are --- 636,642 ---- * used to modify previously added wait events using ModifyWaitEvent(). * * In the WL_LATCH_SET case the latch must be owned by the current process, ! * i.e. it must be a process-local latch initialized with InitLatch, or a * shared latch associated with the current process by calling OwnLatch. * * In the WL_SOCKET_READABLE/WRITEABLE case, EOF and error conditions are -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On 2017-04-20 19:23:28 -0400, Tom Lane wrote:
> or are the HANDLEs in a Windows WaitEventSet not inheritable
> resources?

I think we have control over that.  According to
https://msdn.microsoft.com/en-us/library/windows/desktop/ms724466(v=vs.85).aspx
CreateProcess() has to be called with bInheritHandles = true (which we do
for backends), and SECURITY_ATTRIBUTES.bInheritHandle has to be true too.
The latter we already only do for InitSharedLatch(), but not for
InitLatch(), nor for the WSACreateEvent's created for sockets - those
apparently can never be inherited.

So that kind of sounds like it should be doable.

- Andres
Andres Freund <andres@anarazel.de> writes:
> On 2017-04-20 19:23:28 -0400, Tom Lane wrote:
>> or are the HANDLEs in a Windows WaitEventSet not inheritable
>> resources?

> I think we have control over that.  According to
> https://msdn.microsoft.com/en-us/library/windows/desktop/ms724466(v=vs.85).aspx
> CreateProcess() has to be called with bInheritHandles = true (which we
> do for backends), and SECURITY_ATTRIBUTES.bInheritHandle has to be true
> too.  The latter we already only do for InitSharedLatch(), but not for
> InitLatch(), nor for the WSACreateEvent's created for sockets - those
> apparently can never be inherited.

> So that kind of sounds like it should be doable.

Ah, good.  I'll add a comment about that and press on.

            regards, tom lane
On 2017-04-20 00:50:13 -0400, Tom Lane wrote:
> I wrote:
> > Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> >> Tom Lane wrote:
> >>> Hm.  Do you have a more-portable alternative?
>
> >> I was thinking in a WaitEventSet from latch.c.
>
> My first reaction was that that sounded like a lot more work than removing
> two lines from maybe_start_bgworker and adjusting some comments.  But on
> closer inspection, the slow-bgworker-start issue isn't the only problem
> here.  On a machine that restarts select()'s timeout after an interrupt,
> as (at least) HPUX does, the postmaster will actually never iterate
> ServerLoop's loop except immediately after receiving a new connection
> request.  The stream of interrupts from the autovac launcher is alone
> sufficient to prevent the initial 60-second timeout from ever elapsing.
> So if there are no new connection requests for awhile, none of the
> housekeeping actions in ServerLoop get done.

FWIW, I vaguely remember somewhat related issues on x86/linux too.  On busy
machines in autovacuum_freeze_max_age territory (pretty frequent thing
these days), the signalling frequency caused problems in postmaster, but I
unfortunately don't remember the symptoms very well.

> So it's looking to me like we do need to do something about this, and
> ideally back-patch it all the way.  But WaitEventSet doesn't exist
> before 9.6.  Do we have the stomach for back-patching that into
> stable branches?

Hm, that's not exactly no code - on the other hand it's (excepting the
select(2) path) reasonably well exercised.  What we could do is to convert
master, see how the beta process likes it, and then backpatch?  These don't
seem like super pressing issues.

- Andres
I wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On 2017-04-20 19:23:28 -0400, Tom Lane wrote:
>>> or are the HANDLEs in a Windows WaitEventSet not inheritable
>>> resources?

>> So that kind of sounds like it should be doable.

> Ah, good.  I'll add a comment about that and press on.

So ... what would you say to replacing epoll_create() with
epoll_create1(EPOLL_CLOEXEC) ?  Then a WaitEventSet would not represent
inheritable-across-exec resources on any platform, making it a lot easier
to deal with the EXEC_BACKEND case.

AFAIK, both APIs are Linux-only, and epoll_create1() is not much newer than
epoll_create(), so it seems like we'd not be giving up much portability if
we insist on epoll_create1.

            regards, tom lane
On 2017-04-20 19:53:02 -0400, Tom Lane wrote:
> I wrote:
> > Andres Freund <andres@anarazel.de> writes:
> >> On 2017-04-20 19:23:28 -0400, Tom Lane wrote:
> >>> or are the HANDLEs in a Windows WaitEventSet not inheritable
> >>> resources?
>
> >> So that kind of sounds like it should be doable.
>
> > Ah, good.  I'll add a comment about that and press on.
>
> So ... what would you say to replacing epoll_create() with
> epoll_create1(EPOLL_CLOEXEC) ?  Then a WaitEventSet would not
> represent inheritable-across-exec resources on any platform,
> making it a lot easier to deal with the EXEC_BACKEND case.
>
> AFAIK, both APIs are Linux-only, and epoll_create1() is not much
> newer than epoll_create(), so it seems like we'd not be giving up
> much portability if we insist on epoll_create1.

I'm generally quite in favor of using CLOEXEC as much as possible in our
tree.  I'm a bit concerned with epoll_create1's availability tho - the
glibc support for it was introduced in 2.9, whereas epoll_create is in
2.3.2.  On the other hand 2.9 was released 2008-11-13.

If we remain concerned we could just fcntl(fd, F_SETFD, FD_CLOEXEC) instead
- that should only be like three lines more code or such, and should be
available for a lot longer.

- Andres
Andres Freund <andres@anarazel.de> writes:
> On 2017-04-20 19:53:02 -0400, Tom Lane wrote:
>> So ... what would you say to replacing epoll_create() with
>> epoll_create1(EPOLL_CLOEXEC) ?  Then a WaitEventSet would not
>> represent inheritable-across-exec resources on any platform,
>> making it a lot easier to deal with the EXEC_BACKEND case.

> I'm generally quite in favor of using CLOEXEC as much as possible in our
> tree.  I'm a bit concerned with epoll_create1's availability tho - the
> glibc support for it was introduced in 2.9, whereas epoll_create is in
> 2.3.2.  On the other hand 2.9 was released 2008-11-13.

Also, if it's not there we'd fall back to using plain poll(), which is not
so awful that we need to work hard to avoid it.  I'd just as soon keep the
number of combinations down.

            regards, tom lane
On 2017-04-20 20:05:02 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2017-04-20 19:53:02 -0400, Tom Lane wrote:
> >> So ... what would you say to replacing epoll_create() with
> >> epoll_create1(EPOLL_CLOEXEC) ?  Then a WaitEventSet would not
> >> represent inheritable-across-exec resources on any platform,
> >> making it a lot easier to deal with the EXEC_BACKEND case.
>
> > I'm generally quite in favor of using CLOEXEC as much as possible in our
> > tree.  I'm a bit concerned with epoll_create1's availability tho - the
> > glibc support for it was introduced in 2.9, whereas epoll_create is in
> > 2.3.2.  On the other hand 2.9 was released 2008-11-13.
>
> Also, if it's not there we'd fall back to using plain poll(), which is
> not so awful that we need to work hard to avoid it.  I'd just as soon
> keep the number of combinations down.

Just using fcntl(SET, CLOEXEC) wouldn't increase the number of
combinations?

- Andres
Andres Freund <andres@anarazel.de> writes:
> On 2017-04-20 20:05:02 -0400, Tom Lane wrote:
>> Also, if it's not there we'd fall back to using plain poll(), which is
>> not so awful that we need to work hard to avoid it.  I'd just as soon
>> keep the number of combinations down.

> Just using fcntl(SET, CLOEXEC) wouldn't increase the number of
> combinations?

True, if you just did it that way unconditionally.  But doesn't that
require an extra kernel call per CreateWaitEventSet()?

            regards, tom lane
On 2017-04-20 20:10:41 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2017-04-20 20:05:02 -0400, Tom Lane wrote:
> >> Also, if it's not there we'd fall back to using plain poll(), which is
> >> not so awful that we need to work hard to avoid it.  I'd just as soon
> >> keep the number of combinations down.
>
> > Just using fcntl(SET, CLOEXEC) wouldn't increase the number of
> > combinations?
>
> True, if you just did it that way unconditionally.  But doesn't that
> require an extra kernel call per CreateWaitEventSet()?

It does - the question is whether that matters much.  FE/BE uses a
persistent wait set, but unfortunately many of the other latch users don't.
And some of them can be somewhat frequent - so I guess that'd possibly be
measurable.  Ok, so I'm on board with epoll1.

If somebody were to change more frequent latch users to use persistent wait
sets, that'd be good too.

- Andres
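[To make the tradeoff concrete, here is a sketch -- illustrative only, not
the committed code -- of the two ways an epoll descriptor can be made
close-on-exec.  The epoll_create1() form sets the flag atomically at
creation; the fallback spends an extra fcntl() kernel call per
CreateWaitEventSet().]

    #include <fcntl.h>
    #include <sys/epoll.h>

    /* create an epoll fd that child processes won't inherit across exec() */
    static int
    create_epoll_fd_cloexec(int nevents)
    {
        int     fd;

    #if defined(EPOLL_CLOEXEC)
        /* one syscall: flag is set atomically at creation (glibc >= 2.9) */
        fd = epoll_create1(EPOLL_CLOEXEC);
    #else
        /* fallback: plain epoll_create() plus an extra fcntl() call */
        fd = epoll_create(nevents);
        if (fd >= 0)
            (void) fcntl(fd, F_SETFD, FD_CLOEXEC);
    #endif
        return fd;
    }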
Andres Freund <andres@anarazel.de> writes:
> On 2017-04-20 00:50:13 -0400, Tom Lane wrote:
>> My first reaction was that that sounded like a lot more work than removing
>> two lines from maybe_start_bgworker and adjusting some comments.  But on
>> closer inspection, the slow-bgworker-start issue isn't the only problem
>> here.

> FWIW, I vaguely remember somewhat related issues on x86/linux too.

After sleeping and thinking more, I've realized that the
slow-bgworker-start issue actually exists on *every* platform, it's just
harder to hit when select() is interruptible.  But consider the case where
multiple bgworker-start requests arrive while ServerLoop is actively
executing (perhaps because a connection request just came in).  The
postmaster has signals blocked, so nothing happens for the moment.  When we
go around the loop and reach

    PG_SETMASK(&UnBlockSig);

the pending SIGUSR1 is delivered, and sigusr1_handler reads all the
bgworker start requests, and services just one of them.  Then control
returns and proceeds to

    selres = select(nSockets, &rmask, NULL, NULL, &timeout);

But now there's no interrupt pending.  So the remaining start requests do
not get serviced until (a) some other postmaster interrupt arrives, or (b)
the one-minute timeout elapses.  They could be waiting awhile.

Bottom line is that any request for more than one bgworker at a time faces
a non-negligible risk of suffering serious latency.

I'm coming back to the idea that at least in the back branches, the thing
to do is allow maybe_start_bgworker to start multiple workers.

Is there any actual evidence for the claim that that might have bad side
effects?

            regards, tom lane
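[The window described above can be demonstrated with a stand-alone program
-- hypothetical test code, not postmaster.c.  A signal delivered between
unblocking and the subsequent select() is consumed by the handler, so
nothing remains to interrupt the sleep even though the handler left work
pending.]

    #include <signal.h>
    #include <stdio.h>
    #include <sys/select.h>
    #include <unistd.h>

    static volatile sig_atomic_t start_worker_needed = 0;

    /* stands in for sigusr1_handler: note the request, "start" only one worker */
    static void
    usr1_handler(int signo)
    {
        start_worker_needed = 1;
    }

    int
    main(void)
    {
        struct sigaction sa;
        sigset_t    blockmask,
                    unblockmask;
        struct timeval timeout = {10, 0};

        sa.sa_handler = usr1_handler;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = SA_RESTART;
        sigaction(SIGUSR1, &sa, NULL);

        sigemptyset(&blockmask);
        sigaddset(&blockmask, SIGUSR1);
        sigprocmask(SIG_BLOCK, &blockmask, NULL);

        raise(SIGUSR1);         /* "start request" arrives while signals are blocked */

        sigemptyset(&unblockmask);
        sigprocmask(SIG_SETMASK, &unblockmask, NULL);   /* handler runs right here */

        /* start_worker_needed is already set, yet nothing interrupts this wait */
        printf("pending work flag = %d, entering select()...\n",
               (int) start_worker_needed);
        (void) select(0, NULL, NULL, NULL, &timeout);
        printf("select() returned only after the full 10-second timeout\n");

        return 0;
    }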
Tom Lane wrote:

> After sleeping and thinking more, I've realized that the
> slow-bgworker-start issue actually exists on *every* platform, it's just
> harder to hit when select() is interruptible.  But consider the case
> where multiple bgworker-start requests arrive while ServerLoop is
> actively executing (perhaps because a connection request just came in).
> The postmaster has signals blocked, so nothing happens for the moment.
> When we go around the loop and reach
>
>     PG_SETMASK(&UnBlockSig);
>
> the pending SIGUSR1 is delivered, and sigusr1_handler reads all the
> bgworker start requests, and services just one of them.  Then control
> returns and proceeds to
>
>     selres = select(nSockets, &rmask, NULL, NULL, &timeout);
>
> But now there's no interrupt pending.  So the remaining start requests
> do not get serviced until (a) some other postmaster interrupt arrives,
> or (b) the one-minute timeout elapses.  They could be waiting awhile.
>
> Bottom line is that any request for more than one bgworker at a time
> faces a non-negligible risk of suffering serious latency.

Interesting.  It's hard to hit, for sure.

> I'm coming back to the idea that at least in the back branches, the
> thing to do is allow maybe_start_bgworker to start multiple workers.
>
> Is there any actual evidence for the claim that that might have
> bad side effects?

Well, I ran tests with a few dozen thousand sample workers and the neglect
for other things (such as connection requests) was visible, but that's
probably not a scenario many servers run often currently.

I don't strongly object to the idea of removing the "return" in older
branches, since it's evidently a problem.  However, as bgworkers start to
be used more, I think we should definitely have some protection.  In a
system with a large number of workers available for parallel queries, it
seems possible for a high velocity server to get stuck in the loop for some
time.  (I haven't actually verified this, though.  My experiments were with
the early kind, static bgworkers.)

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attached is a lightly-tested draft patch that converts the postmaster to use a WaitEventSet for waiting in ServerLoop. I've got mixed emotions about whether this is the direction to proceed, though. It adds at least a couple of kernel calls per postmaster signal delivery, and probably to every postmaster connection acceptance (ServerLoop iteration), to fix problems that are so corner-casey that we've never even known we had them till now. I'm looking longingly at pselect(2), ppoll(2), and epoll_pwait(2), which would solve the problem without need for a self-pipe. But the latter two are Linux-only, and while pselect does exist in recent POSIX editions, it's nonetheless got portability issues. Googling suggests that on a number of platforms, pselect is non-atomic, ie it's nothing but a wrapper for sigprocmask/select/sigprocmask, which would mean that the race condition I described in my previous message still exists. Despite that, a pretty attractive proposal is to do, essentially, #ifdef HAVE_PSELECT selres = pselect(nSockets, &rmask, NULL, NULL, &timeout, &UnBlockSig); #else PG_SETMASK(&UnBlockSig); selres = select(nSockets, &rmask, NULL, NULL, &timeout); PG_SETMASK(&BlockSig); #endif This fixes the race on platforms where pselect exists and is correctly implemented, and we're no worse off than before where that's not true. The other component of the problem is the possibility that select() will restart if the signal is marked SA_RESTART. (Presumably that would apply to pselect too.) I am thinking that maybe the answer is "if it hurts, don't do it" --- that is, in the postmaster maybe we shouldn't use SA_RESTART, at least not for these signals. A different line of thought is to try to provide a bulletproof solution, but push the implementation problems down into latch.c --- that is, the goal would be to provide a pselect-like variant of WaitEventSetWait that is guaranteed to return if interrupted, as opposed to the current behavior where it's guaranteed not to. But that seems like quite a bit of work. Whether or not we decide to change over the postmaster.c code, I think it'd likely be a good idea to apply most or all of the attached changes in latch.c. Setting CLOEXEC on the relevant FDs is clearly a good thing, and the other changes will provide some safety if some preloaded extension decides to create a latch in the postmaster process. regards, tom lane diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index 6831342..658ba73 100644 *** a/src/backend/postmaster/postmaster.c --- b/src/backend/postmaster/postmaster.c *************** *** 79,88 **** #include <netdb.h> #include <limits.h> - #ifdef HAVE_SYS_SELECT_H - #include <sys/select.h> - #endif - #ifdef USE_BONJOUR #include <dns_sd.h> #endif --- 79,84 ---- *************** static bool LoadedSSL = false; *** 379,384 **** --- 375,386 ---- static DNSServiceRef bonjour_sdref = NULL; #endif + /* WaitEventSet that ServerLoop uses to wait for connections */ + static WaitEventSet *ServerWaitSet = NULL; + + /* Latch used to ensure that signals interrupt the wait in ServerLoop */ + static Latch PostmasterLatch; + /* * postmaster.c - function prototypes */ *************** static int ServerLoop(void); *** 409,415 **** static int BackendStartup(Port *port); static int ProcessStartupPacket(Port *port, bool SSLdone); static void processCancelRequest(Port *port, void *pkt); ! 
static int initMasks(fd_set *rmask); static void report_fork_failure_to_client(Port *port, int errnum); static CAC_state canAcceptConnections(void); static bool RandomCancelKey(int32 *cancel_key); --- 411,417 ---- static int BackendStartup(Port *port); static int ProcessStartupPacket(Port *port, bool SSLdone); static void processCancelRequest(Port *port, void *pkt); ! static void initServerWaitSet(void); static void report_fork_failure_to_client(Port *port, int errnum); static CAC_state canAcceptConnections(void); static bool RandomCancelKey(int32 *cancel_key); *************** checkDataDir(void) *** 1535,1541 **** } /* ! * Determine how long should we let ServerLoop sleep. * * In normal conditions we wait at most one minute, to ensure that the other * background tasks handled by ServerLoop get done even when no requests are --- 1537,1543 ---- } /* ! * Determine how long should we let ServerLoop sleep (in msec). * * In normal conditions we wait at most one minute, to ensure that the other * background tasks handled by ServerLoop get done even when no requests are *************** checkDataDir(void) *** 1543,1551 **** * we don't actually sleep so that they are quickly serviced. Other exception * cases are as shown in the code. */ ! static void ! DetermineSleepTime(struct timeval * timeout) { TimestampTz next_wakeup = 0; /* --- 1545,1554 ---- * we don't actually sleep so that they are quickly serviced. Other exception * cases are as shown in the code. */ ! static long ! DetermineSleepTime(void) { + long timeout; TimestampTz next_wakeup = 0; /* *************** DetermineSleepTime(struct timeval * time *** 1558,1582 **** if (AbortStartTime != 0) { /* time left to abort; clamp to 0 in case it already expired */ ! timeout->tv_sec = SIGKILL_CHILDREN_AFTER_SECS - (time(NULL) - AbortStartTime); ! timeout->tv_sec = Max(timeout->tv_sec, 0); ! timeout->tv_usec = 0; } else ! { ! timeout->tv_sec = 60; ! timeout->tv_usec = 0; ! } ! return; } if (StartWorkerNeeded) ! { ! timeout->tv_sec = 0; ! timeout->tv_usec = 0; ! return; ! } if (HaveCrashedWorker) { --- 1561,1577 ---- if (AbortStartTime != 0) { /* time left to abort; clamp to 0 in case it already expired */ ! timeout = SIGKILL_CHILDREN_AFTER_SECS - (time(NULL) - AbortStartTime); ! timeout = Max(timeout, 0) * 1000; } else ! timeout = 60 * 1000; ! return timeout; } if (StartWorkerNeeded) ! return 0; if (HaveCrashedWorker) { *************** DetermineSleepTime(struct timeval * time *** 1617,1639 **** long secs; int microsecs; TimestampDifference(GetCurrentTimestamp(), next_wakeup, &secs, µsecs); ! timeout->tv_sec = secs; ! timeout->tv_usec = microsecs; /* Ensure we don't exceed one minute */ ! if (timeout->tv_sec > 60) ! { ! timeout->tv_sec = 60; ! timeout->tv_usec = 0; ! } } else ! { ! timeout->tv_sec = 60; ! timeout->tv_usec = 0; ! } } /* --- 1612,1630 ---- long secs; int microsecs; + /* XXX this is stupidly inefficient */ TimestampDifference(GetCurrentTimestamp(), next_wakeup, &secs, µsecs); ! timeout = secs * 1000 + microsecs / 1000; /* Ensure we don't exceed one minute */ ! if (timeout > 60 * 1000) ! timeout = 60 * 1000; } else ! timeout = 60 * 1000; ! ! return timeout; } /* *************** DetermineSleepTime(struct timeval * time *** 1644,1662 **** static int ServerLoop(void) { - fd_set readmask; - int nSockets; time_t last_lockfile_recheck_time, last_touch_time; last_lockfile_recheck_time = last_touch_time = time(NULL); ! nSockets = initMasks(&readmask); for (;;) { ! fd_set rmask; ! 
int selres; time_t now; /* --- 1635,1651 ---- static int ServerLoop(void) { time_t last_lockfile_recheck_time, last_touch_time; last_lockfile_recheck_time = last_touch_time = time(NULL); ! initServerWaitSet(); for (;;) { ! WaitEvent event; ! int nevents; time_t now; /* *************** ServerLoop(void) *** 1667,1742 **** * do nontrivial work. * * If we are in PM_WAIT_DEAD_END state, then we don't want to accept ! * any new connections, so we don't call select(), and just sleep. */ - memcpy((char *) &rmask, (char *) &readmask, sizeof(fd_set)); - if (pmState == PM_WAIT_DEAD_END) { PG_SETMASK(&UnBlockSig); pg_usleep(100000L); /* 100 msec seems reasonable */ ! selres = 0; PG_SETMASK(&BlockSig); } else { ! /* must set timeout each time; some OSes change it! */ ! struct timeval timeout; /* Needs to run with blocked signals! */ ! DetermineSleepTime(&timeout); PG_SETMASK(&UnBlockSig); ! selres = select(nSockets, &rmask, NULL, NULL, &timeout); PG_SETMASK(&BlockSig); } - /* Now check the select() result */ - if (selres < 0) - { - if (errno != EINTR && errno != EWOULDBLOCK) - { - ereport(LOG, - (errcode_for_socket_access(), - errmsg("select() failed in postmaster: %m"))); - return STATUS_ERROR; - } - } - /* ! * New connection pending on any of our sockets? If so, fork a child ! * process to deal with it. */ ! if (selres > 0) { ! int i; ! ! for (i = 0; i < MAXLISTEN; i++) { ! if (ListenSocket[i] == PGINVALID_SOCKET) ! break; ! if (FD_ISSET(ListenSocket[i], &rmask)) ! { ! Port *port; ! port = ConnCreate(ListenSocket[i]); ! if (port) ! { ! BackendStartup(port); ! /* ! * We no longer need the open socket or port structure ! * in this process ! */ ! StreamClose(port->sock); ! ConnFree(port); ! } } } } /* If we have lost the log collector, try to start a new one */ --- 1656,1719 ---- * do nontrivial work. * * If we are in PM_WAIT_DEAD_END state, then we don't want to accept ! * any new connections, so we don't call WaitEventSetWait(), and just ! * sleep. XXX not ideal */ if (pmState == PM_WAIT_DEAD_END) { PG_SETMASK(&UnBlockSig); pg_usleep(100000L); /* 100 msec seems reasonable */ ! nevents = 0; PG_SETMASK(&BlockSig); } else { ! long timeout; /* Needs to run with blocked signals! */ ! timeout = DetermineSleepTime(); PG_SETMASK(&UnBlockSig); ! nevents = WaitEventSetWait(ServerWaitSet, timeout, &event, 1, 0); PG_SETMASK(&BlockSig); } /* ! * Deal with detected event, if any. */ ! if (nevents > 0) { ! /* ! * New connection pending on any of our sockets? If so, fork a ! * child process to deal with it. ! */ ! if (event.events & WL_SOCKET_READABLE) { ! Port *port; ! port = ConnCreate(event.fd); ! if (port) ! { ! BackendStartup(port); ! /* ! * We no longer need the open socket or port structure in ! * this process ! */ ! StreamClose(port->sock); ! ConnFree(port); } } + + /* + * Latch set? If so, clear it. + */ + else if (event.events & WL_LATCH_SET) + ResetLatch(&PostmasterLatch); } /* If we have lost the log collector, try to start a new one */ *************** ServerLoop(void) *** 1874,1903 **** } /* ! * Initialise the masks for select() for the ports we are listening on. ! * Return the number of sockets to listen on. */ ! static int ! initMasks(fd_set *rmask) { - int maxsock = -1; int i; ! FD_ZERO(rmask); for (i = 0; i < MAXLISTEN; i++) { ! int fd = ListenSocket[i]; ! if (fd == PGINVALID_SOCKET) break; ! FD_SET(fd, rmask); ! ! if (fd > maxsock) ! maxsock = fd; } ! return maxsock + 1; } --- 1851,1879 ---- } /* ! * Create a WaitEventSet for ServerLoop() to wait on. This includes the ! 
* sockets we are listening on, plus the PostmasterLatch. */ ! static void ! initServerWaitSet(void) { int i; ! ServerWaitSet = CreateWaitEventSet(PostmasterContext, MAXLISTEN + 1); for (i = 0; i < MAXLISTEN; i++) { ! pgsocket sock = ListenSocket[i]; ! if (sock == PGINVALID_SOCKET) break; ! AddWaitEventToSet(ServerWaitSet, WL_SOCKET_READABLE, sock, NULL, NULL); } ! InitializeLatchSupport(); ! InitLatch(&PostmasterLatch); ! AddWaitEventToSet(ServerWaitSet, WL_LATCH_SET, PGINVALID_SOCKET, ! &PostmasterLatch, NULL); } *************** ClosePostmasterPorts(bool am_syslogger) *** 2436,2441 **** --- 2412,2431 ---- postmaster_alive_fds[POSTMASTER_FD_OWN] = -1; #endif + /* + * Close the postmaster's WaitEventSet, if any, to be sure that child + * processes don't accidentally mess with underlying file descriptors. + * (Note: in an EXEC_BACKEND build, this won't do anything because we + * didn't inherit the static pointer, much less the data structure itself. + * But it's OK because WaitEventSets don't contain any resources that can + * be inherited across exec().) + */ + if (ServerWaitSet) + { + FreeWaitEventSet(ServerWaitSet); + ServerWaitSet = NULL; + } + /* Close the listen sockets */ for (i = 0; i < MAXLISTEN; i++) { *************** SIGHUP_handler(SIGNAL_ARGS) *** 2553,2558 **** --- 2543,2551 ---- #endif } + /* Force ServerLoop to iterate */ + SetLatch(&PostmasterLatch); + PG_SETMASK(&UnBlockSig); errno = save_errno; *************** pmdie(SIGNAL_ARGS) *** 2724,2729 **** --- 2717,2725 ---- break; } + /* Force ServerLoop to iterate */ + SetLatch(&PostmasterLatch); + PG_SETMASK(&UnBlockSig); errno = save_errno; *************** reaper(SIGNAL_ARGS) *** 3037,3042 **** --- 3033,3041 ---- */ PostmasterStateMachine(); + /* Force ServerLoop to iterate */ + SetLatch(&PostmasterLatch); + /* Done with signal handler */ PG_SETMASK(&UnBlockSig); *************** sigusr1_handler(SIGNAL_ARGS) *** 5078,5083 **** --- 5077,5085 ---- signal_child(StartupPID, SIGUSR2); } + /* Force ServerLoop to iterate */ + SetLatch(&PostmasterLatch); + PG_SETMASK(&UnBlockSig); errno = save_errno; diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c index 4798370..92d2ff0 100644 *** a/src/backend/storage/ipc/latch.c --- b/src/backend/storage/ipc/latch.c *************** *** 62,67 **** --- 62,71 ---- #include "storage/pmsignal.h" #include "storage/shmem.h" + #ifndef FD_CLOEXEC + #define FD_CLOEXEC 1 + #endif + /* * Select the fd readiness primitive to use. Normally the "most modern" * primitive supported by the OS will be used, but for testing it can be *************** static volatile sig_atomic_t waiting = f *** 130,135 **** --- 134,142 ---- static int selfpipe_readfd = -1; static int selfpipe_writefd = -1; + /* Process owning the self-pipe --- needed for checking purposes */ + static int selfpipe_owner_pid = 0; + /* Private function prototypes */ static void sendSelfPipeByte(void); static void drainSelfPipe(void); *************** InitializeLatchSupport(void) *** 158,164 **** #ifndef WIN32 int pipefd[2]; ! Assert(selfpipe_readfd == -1); /* * Set up the self-pipe that allows a signal handler to wake up the --- 165,211 ---- #ifndef WIN32 int pipefd[2]; ! if (IsUnderPostmaster) ! { ! /* ! * We might have inherited connections to a self-pipe created by the ! * postmaster. It's critical that child processes create their own ! * self-pipes, of course, and we really want them to close the ! * inherited FDs for safety's sake. On most platforms we can use ! 
* F_SETFD/FD_CLOEXEC to make sure that happens, but if we lack that ! * facility, close the FDs the hard way. ! */ ! if (selfpipe_owner_pid != 0) ! { ! /* Assert we go through here but once in a child process */ ! Assert(selfpipe_owner_pid != MyProcPid); ! #ifndef F_SETFD ! /* Release postmaster's pipe FDs, if we lack FD_CLOEXEC */ ! close(selfpipe_readfd); ! close(selfpipe_writefd); ! #endif ! /* Clean up, just for safety's sake; we'll set these below */ ! selfpipe_readfd = selfpipe_writefd = -1; ! selfpipe_owner_pid = 0; ! } ! else ! { ! /* ! * Postmaster didn't create a self-pipe ... or else we're in an ! * EXEC_BACKEND build, and don't know because we didn't inherit ! * values for the static variables. (If we lack FD_CLOEXEC we'll ! * fail to close the inherited FDs, but that seems acceptable ! * since EXEC_BACKEND builds aren't meant for production on Unix.) ! */ ! Assert(selfpipe_readfd == -1); ! } ! } ! else ! { ! /* In postmaster or standalone backend, assert we do this but once */ ! Assert(selfpipe_readfd == -1); ! Assert(selfpipe_owner_pid == 0); ! } /* * Set up the self-pipe that allows a signal handler to wake up the *************** InitializeLatchSupport(void) *** 166,188 **** * SetLatch won't block if the event has already been set many times * filling the kernel buffer. Make the read-end non-blocking too, so that * we can easily clear the pipe by reading until EAGAIN or EWOULDBLOCK. */ if (pipe(pipefd) < 0) elog(FATAL, "pipe() failed: %m"); if (fcntl(pipefd[0], F_SETFL, O_NONBLOCK) < 0) ! elog(FATAL, "fcntl() failed on read-end of self-pipe: %m"); if (fcntl(pipefd[1], F_SETFL, O_NONBLOCK) < 0) ! elog(FATAL, "fcntl() failed on write-end of self-pipe: %m"); selfpipe_readfd = pipefd[0]; selfpipe_writefd = pipefd[1]; #else /* currently, nothing to do here for Windows */ #endif } /* ! * Initialize a backend-local latch. */ void InitLatch(volatile Latch *latch) --- 213,244 ---- * SetLatch won't block if the event has already been set many times * filling the kernel buffer. Make the read-end non-blocking too, so that * we can easily clear the pipe by reading until EAGAIN or EWOULDBLOCK. + * Also, if possible, make both FDs close-on-exec, since we surely do not + * want any child processes messing with them. */ if (pipe(pipefd) < 0) elog(FATAL, "pipe() failed: %m"); if (fcntl(pipefd[0], F_SETFL, O_NONBLOCK) < 0) ! elog(FATAL, "fcntl(F_SETFL) failed on read-end of self-pipe: %m"); if (fcntl(pipefd[1], F_SETFL, O_NONBLOCK) < 0) ! elog(FATAL, "fcntl(F_SETFL) failed on write-end of self-pipe: %m"); ! #ifdef F_SETFD ! if (fcntl(pipefd[0], F_SETFD, FD_CLOEXEC) < 0) ! elog(FATAL, "fcntl(F_SETFD) failed on read-end of self-pipe: %m"); ! if (fcntl(pipefd[1], F_SETFD, FD_CLOEXEC) < 0) ! elog(FATAL, "fcntl(F_SETFD) failed on write-end of self-pipe: %m"); ! #endif selfpipe_readfd = pipefd[0]; selfpipe_writefd = pipefd[1]; + selfpipe_owner_pid = MyProcPid; #else /* currently, nothing to do here for Windows */ #endif } /* ! * Initialize a process-local latch. */ void InitLatch(volatile Latch *latch) *************** InitLatch(volatile Latch *latch) *** 193,199 **** #ifndef WIN32 /* Assert InitializeLatchSupport has been called in this process */ ! Assert(selfpipe_readfd >= 0); #else latch->event = CreateEvent(NULL, TRUE, FALSE, NULL); if (latch->event == NULL) --- 249,255 ---- #ifndef WIN32 /* Assert InitializeLatchSupport has been called in this process */ ! 
Assert(selfpipe_readfd >= 0 && selfpipe_owner_pid == MyProcPid); #else latch->event = CreateEvent(NULL, TRUE, FALSE, NULL); if (latch->event == NULL) *************** InitLatch(volatile Latch *latch) *** 211,216 **** --- 267,276 ---- * containing the latch with ShmemInitStruct. (The Unix implementation * doesn't actually require that, but the Windows one does.) Because of * this restriction, we have no concurrency issues to worry about here. + * + * Note that other handles created in this module are never marked as + * inheritable. Thus we do not need to worry about cleaning up child + * process references to postmaster-private latches or WaitEventSets. */ void InitSharedLatch(volatile Latch *latch) *************** OwnLatch(volatile Latch *latch) *** 256,262 **** #ifndef WIN32 /* Assert InitializeLatchSupport has been called in this process */ ! Assert(selfpipe_readfd >= 0); #endif if (latch->owner_pid != 0) --- 316,322 ---- #ifndef WIN32 /* Assert InitializeLatchSupport has been called in this process */ ! Assert(selfpipe_readfd >= 0 && selfpipe_owner_pid == MyProcPid); #endif if (latch->owner_pid != 0) *************** DisownLatch(volatile Latch *latch) *** 289,295 **** * is incurred when WL_TIMEOUT is given, so avoid using a timeout if possible. * * The latch must be owned by the current process, ie. it must be a ! * backend-local latch initialized with InitLatch, or a shared latch * associated with the current process by calling OwnLatch. * * Returns bit mask indicating which condition(s) caused the wake-up. Note --- 349,355 ---- * is incurred when WL_TIMEOUT is given, so avoid using a timeout if possible. * * The latch must be owned by the current process, ie. it must be a ! * process-local latch initialized with InitLatch, or a shared latch * associated with the current process by calling OwnLatch. * * Returns bit mask indicating which condition(s) caused the wake-up. Note *************** CreateWaitEventSet(MemoryContext context *** 531,536 **** --- 591,600 ---- set->epoll_fd = epoll_create(nevents); if (set->epoll_fd < 0) elog(ERROR, "epoll_create failed: %m"); + /* Attempt to make the FD non-inheritable; if fail, no big deal */ + #ifdef F_SETFD + (void) fcntl(set->epoll_fd, F_SETFD, FD_CLOEXEC); + #endif #elif defined(WAIT_USE_WIN32) /* *************** CreateWaitEventSet(MemoryContext context *** 551,556 **** --- 615,626 ---- /* * Free a previously created WaitEventSet. + * + * Note: preferably, this shouldn't have to free any resources that could be + * inherited across an exec(). If it did, we'd likely leak those resources in + * many scenarios. For the epoll() case, we attempt to ensure that by setting + * FD_CLOEXEC when the FD is created. For the Windows case, we assume that + * the handles involved are non-inheritable. */ void FreeWaitEventSet(WaitEventSet *set) *************** FreeWaitEventSet(WaitEventSet *set) *** 597,603 **** * used to modify previously added wait events using ModifyWaitEvent(). * * In the WL_LATCH_SET case the latch must be owned by the current process, ! * i.e. it must be a backend-local latch initialized with InitLatch, or a * shared latch associated with the current process by calling OwnLatch. * * In the WL_SOCKET_READABLE/WRITEABLE case, EOF and error conditions are --- 667,673 ---- * used to modify previously added wait events using ModifyWaitEvent(). * * In the WL_LATCH_SET case the latch must be owned by the current process, ! * i.e. 
it must be a process-local latch initialized with InitLatch, or a * shared latch associated with the current process by calling OwnLatch. * * In the WL_SOCKET_READABLE/WRITEABLE case, EOF and error conditions are -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Hi,

On 2017-04-21 12:50:04 -0400, Tom Lane wrote:
> Attached is a lightly-tested draft patch that converts the postmaster to
> use a WaitEventSet for waiting in ServerLoop.  I've got mixed emotions
> about whether this is the direction to proceed, though.  It adds at least
> a couple of kernel calls per postmaster signal delivery, and probably to
> every postmaster connection acceptance (ServerLoop iteration), to fix
> problems that are so corner-casey that we've never even known we had them
> till now.

I'm not concerned much about the signal delivery paths, and I can't quite
imagine that another syscall in the accept path is going to be measurable -
worth ensuring though.  I do agree that it's a bit of a big stick for the
back-branches...

> A different line of thought is to try to provide a bulletproof solution,
> but push the implementation problems down into latch.c --- that is, the
> goal would be to provide a pselect-like variant of WaitEventSetWait that
> is guaranteed to return if interrupted, as opposed to the current behavior
> where it's guaranteed not to.  But that seems like quite a bit of work.

Seems like a sane idea to me.  The use of latches has already grown due to
parallelism and it'll likely grow further - some of them seem likely to
also have to deal with such concerns.  I'd much rather centralize things
down to a common place.

On the other hand most types of our processes do SetLatch() in just nearly
all the relevant signal handlers anyway, so they're pretty close to that
behaviour already.

> Whether or not we decide to change over the postmaster.c code, I think
> it'd likely be a good idea to apply most or all of the attached changes
> in latch.c.  Setting CLOEXEC on the relevant FDs is clearly a good thing,
> and the other changes will provide some safety if some preloaded extension
> decides to create a latch in the postmaster process.

On the principle, I agree.  Reading through the changes now.

> @@ -1667,76 +1656,64 @@ ServerLoop(void)
>  	 * do nontrivial work.
>  	 *
>  	 * If we are in PM_WAIT_DEAD_END state, then we don't want to accept
> -	 * any new connections, so we don't call select(), and just sleep.
> +	 * any new connections, so we don't call WaitEventSetWait(), and just
> +	 * sleep.  XXX not ideal
>  	 */

Couldn't we just deactivate the sockets in the set instead?

>  /*
> - * Initialise the masks for select() for the ports we are listening on.
> - * Return the number of sockets to listen on.
> + * Create a WaitEventSet for ServerLoop() to wait on.  This includes the
> + * sockets we are listening on, plus the PostmasterLatch.
>   */
> -static int
> -initMasks(fd_set *rmask)
> +static void
> +initServerWaitSet(void)
>  {
> -	int			maxsock = -1;
>  	int			i;
>
> -	FD_ZERO(rmask);
> +	ServerWaitSet = CreateWaitEventSet(PostmasterContext, MAXLISTEN + 1);

Why are we using MAXLISTEN, rather than the actual number of things to
listen to?  The only benefit of this seems to be that we could
theoretically allow dynamic reconfiguration of the sockets a bit more
easily in the future, but that could just as well be done by recreating
the set.

Random note: Do we actually have any code that errors out if too many
sockets are being listened to?
> @@ -2553,6 +2543,9 @@ SIGHUP_handler(SIGNAL_ARGS) > #endif > } > > + /* Force ServerLoop to iterate */ > + SetLatch(&PostmasterLatch); > + > PG_SETMASK(&UnBlockSig); > > errno = save_errno; > @@ -2724,6 +2717,9 @@ pmdie(SIGNAL_ARGS) > break; > } > > + /* Force ServerLoop to iterate */ > + SetLatch(&PostmasterLatch); > + > PG_SETMASK(&UnBlockSig); > > errno = save_errno; > @@ -3037,6 +3033,9 @@ reaper(SIGNAL_ARGS) > */ > PostmasterStateMachine(); > > + /* Force ServerLoop to iterate */ > + SetLatch(&PostmasterLatch); > + > /* Done with signal handler */ > PG_SETMASK(&UnBlockSig); > > @@ -5078,6 +5077,9 @@ sigusr1_handler(SIGNAL_ARGS) > signal_child(StartupPID, SIGUSR2); > } > > + /* Force ServerLoop to iterate */ > + SetLatch(&PostmasterLatch); > + > PG_SETMASK(&UnBlockSig); > > errno = save_errno; I kind of would like, in master, take a chance of replace all the work done in signal handlers, by just a SetLatch(), and do it outside of signal handlers instead. Forking from signal handlers is just plain weird. > diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c index 4798370..92d2ff0 100644 > --- a/src/backend/storage/ipc/latch.c > +++ b/src/backend/storage/ipc/latch.c > @@ -62,6 +62,10 @@ > #include "storage/pmsignal.h" > #include "storage/shmem.h" > > +#ifndef FD_CLOEXEC > +#define FD_CLOEXEC 1 > +#endif Hm? Are we sure this is portable? Is there really cases that have F_SETFD, but not CLOEXEC? Greetings, Andres Freund
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > Tom Lane wrote: >> I'm coming back to the idea that at least in the back branches, the >> thing to do is allow maybe_start_bgworker to start multiple workers. >> >> Is there any actual evidence for the claim that that might have >> bad side effects? > Well, I ran tests with a few dozen thousand sample workers and the > neglect for other things (such as connection requests) was visible, but > that's probably not a scenario many servers run often currently. Indeed. I'm pretty skeptical that that's an interesting case, and if it is, the current coding is broken anyway, because with that many workers you are going to start noticing that running maybe_start_bgworker over again for each worker is an O(N^2) proposition. Admittedly, iterating the loop in maybe_start_bgworker is really cheap compared to a fork(), but eventually the big-O problem is going to eat your lunch. > I don't strongly object to the idea of removing the "return" in older > branches, since it's evidently a problem. However, as bgworkers start > to be used more, I think we should definitely have some protection. In > a system with a large number of workers available for parallel queries, > it seems possible for a high velocity server to get stuck in the loop > for some time. (I haven't actually verified this, though. My > experiments were with the early kind, static bgworkers.) It might be sensible to limit the number of workers launched per call, but I think the limit should be quite a bit higher than 1 ... something like 100 or 1000 might be appropriate. regards, tom lane
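For illustration, the kind of per-call cap being discussed might look like the following standalone sketch; the cap value, the toy pending[] array, and all of the names here are hypothetical stand-ins rather than actual postmaster code:

#include <stdbool.h>
#include <stdio.h>

#define MAX_WORKERS_PER_CALL 100    /* hypothetical cap, per the discussion */
#define NUM_SLOTS 1000

static bool pending[NUM_SLOTS];     /* toy stand-in for the registered-worker list */
static bool start_worker_needed;    /* tells the caller to come back soon */

/*
 * Service pending start requests, but launch at most MAX_WORKERS_PER_CALL
 * of them before handing control back to the caller.
 */
static void
maybe_start_workers_sketch(void)
{
    int         launched = 0;
    int         i;

    start_worker_needed = false;
    for (i = 0; i < NUM_SLOTS; i++)
    {
        if (!pending[i])
            continue;
        pending[i] = false;         /* a real version would fork() the worker here */
        if (++launched >= MAX_WORKERS_PER_CALL)
        {
            /* More may be waiting; ask to be called again soon. */
            start_worker_needed = true;
            return;
        }
    }
}

int
main(void)
{
    int         i;

    for (i = 0; i < 250; i++)
        pending[i] = true;

    maybe_start_workers_sketch();
    printf("launch another batch soon? %s\n", start_worker_needed ? "yes" : "no");
    return 0;
}

The point is simply that one pass can absorb a burst of simultaneous requests in a single scan of the list, while still bounding how long the loop monopolizes the caller.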
Andres Freund <andres@anarazel.de> writes: > On 2017-04-21 12:50:04 -0400, Tom Lane wrote: >> Attached is a lightly-tested draft patch that converts the postmaster to >> use a WaitEventSet for waiting in ServerLoop. I've got mixed emotions >> about whether this is the direction to proceed, though. It adds at least >> a couple of kernel calls per postmaster signal delivery, and probably to >> every postmaster connection acceptance (ServerLoop iteration), to fix >> problems that are so corner-casey that we've never even known we had them >> till now. > I'm not concerned much about the signal delivery paths, and I can't > quite imagine that another syscall in the accept path is going to be > measurable - worth ensuring though. > ... > On the other hand most types of our processes do SetLatch() in just > nearly all the relevant signal handlers anyway, so they're pretty close > to behaviour already. True. Maybe I'm being too worried. >> * If we are in PM_WAIT_DEAD_END state, then we don't want to accept >> - * any new connections, so we don't call select(), and just sleep. >> + * any new connections, so we don't call WaitEventSetWait(), and just >> + * sleep. XXX not ideal >> */ > Couldn't we just deactive the sockets in the set instead? Yeah, I think it'd be better to do something like that. The pg_usleep call has the same issue of possibly not responding to interrupts. The risks are a lot less, since it's a much shorter wait, but I would rather eliminate the separate code path in favor of doing it honestly. Didn't seem like something to fuss over in the first draft though. >> + ServerWaitSet = CreateWaitEventSet(PostmasterContext, MAXLISTEN + 1); > Why are we using MAXLISTEN, rather than the actual number of things to > listen to? It'd take more code (ie, an additional scan of the array) to predetermine that. I figured the space-per-item in the WaitEventSet wasn't enough to worry about ... do you think differently? > Random note: Do we actually have any code that errors out if too many > sockets are being listened to? Yes, see StreamServerPort, about line 400. > I kind of would like, in master, take a chance of replace all the work > done in signal handlers, by just a SetLatch(), and do it outside of > signal handlers instead. Forking from signal handlers is just plain > weird. Yeah, maybe it's time. But in v11, and not for back-patch. >> +#ifndef FD_CLOEXEC >> +#define FD_CLOEXEC 1 >> +#endif > Hm? Are we sure this is portable? Is there really cases that have > F_SETFD, but not CLOEXEC? Copied-and-pasted from our only existing use of FD_CLOEXEC, in libpq. Might well be obsolete but I see no particular reason not to do it. regards, tom lane
I wrote: > Andres Freund <andres@anarazel.de> writes: >> On 2017-04-21 12:50:04 -0400, Tom Lane wrote: >>> +#ifndef FD_CLOEXEC >>> +#define FD_CLOEXEC 1 >>> +#endif >> Hm? Are we sure this is portable? Is there really cases that have >> F_SETFD, but not CLOEXEC? > Copied-and-pasted from our only existing use of FD_CLOEXEC, in libpq. Looking closer, that code dates to Author: Tom Lane <tgl@sss.pgh.pa.us> Branch: master Release: REL8_0_BR [7627b91cd] 2004-10-21 20:23:19 +0000 Set the close-on-exec flag for libpq's socket to the backend, to avoid any possible problems from child programs executedby the client app. Per suggestion from Elliot Lee of Red Hat. and while the public discussion about it https://www.postgresql.org/message-id/flat/18172.1098382248%40sss.pgh.pa.us doesn't really say, I suspect the specific coding was Elliot's suggestion as well. It probably had some value at one time ... but I see that SUSv2 mandates that fcntl.h provide both F_SETFD and FD_CLOEXEC, so by our own coding rules it ought to be okay to assume they're there. I'm tempted to rip out the quoted bit, as well as the #ifdef F_SETFD, from libpq and see if anything in the buildfarm complains. regards, tom lane
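For reference, the close-on-exec dance that this boils down to is just a couple of fcntl() calls; here is a self-contained sketch (an ordinary socket stands in for libpq's connection descriptor, and this is not the libpq code itself):

#include <fcntl.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int
main(void)
{
    int         sock = socket(AF_INET, SOCK_STREAM, 0);
    int         flags;

    if (sock == -1)
    {
        perror("socket");
        return 1;
    }

    /* SUSv2/POSIX guarantee that <fcntl.h> provides F_GETFD, F_SETFD, FD_CLOEXEC */
    flags = fcntl(sock, F_GETFD);
    if (flags == -1 ||
        fcntl(sock, F_SETFD, flags | FD_CLOEXEC) == -1)
    {
        perror("fcntl");
        return 1;
    }

    printf("close-on-exec set on fd %d\n", sock);
    close(sock);
    return 0;
}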
On 2017-04-21 14:08:21 -0400, Tom Lane wrote: > but I see that SUSv2 > mandates that fcntl.h provide both F_SETFD and FD_CLOEXEC, so by our own > coding rules it ought to be okay to assume they're there. I'm tempted to > rip out the quoted bit, as well as the #ifdef F_SETFD, from libpq and see > if anything in the buildfarm complains. +1 - Andres
Tom, On 2017-04-21 13:49:27 -0400, Tom Lane wrote: > >> * If we are in PM_WAIT_DEAD_END state, then we don't want to accept > >> - * any new connections, so we don't call select(), and just sleep. > >> + * any new connections, so we don't call WaitEventSetWait(), and just > >> + * sleep. XXX not ideal > >> */ > > > Couldn't we just deactive the sockets in the set instead? > > Yeah, I think it'd be better to do something like that. The pg_usleep > call has the same issue of possibly not responding to interrupts. The > risks are a lot less, since it's a much shorter wait, but I would rather > eliminate the separate code path in favor of doing it honestly. Didn't > seem like something to fuss over in the first draft though. Ok, cool. > >> + ServerWaitSet = CreateWaitEventSet(PostmasterContext, MAXLISTEN + 1); > > > Why are we using MAXLISTEN, rather than the actual number of things to > > listen to? > > It'd take more code (ie, an additional scan of the array) to predetermine > that. I figured the space-per-item in the WaitEventSet wasn't enough to > worry about ... do you think differently? I'm not sure. We do create an epoll handler with enough space, and that has some overhead. Don't know whether that's worthwhile to care about. > > I kind of would like, in master, take a chance of replace all the work > > done in signal handlers, by just a SetLatch(), and do it outside of > > signal handlers instead. Forking from signal handlers is just plain > > weird. > > Yeah, maybe it's time. But in v11, and not for back-patch. Agreed. - Andres
Andres Freund <andres@anarazel.de> writes: > On 2017-04-21 14:08:21 -0400, Tom Lane wrote: >> but I see that SUSv2 >> mandates that fcntl.h provide both F_SETFD and FD_CLOEXEC, so by our own >> coding rules it ought to be okay to assume they're there. I'm tempted to >> rip out the quoted bit, as well as the #ifdef F_SETFD, from libpq and see >> if anything in the buildfarm complains. > +1 Done, we'll soon see what happens. In the same area, I noticed that POSIX does not say that the success result for fcntl(F_SETFD) and related cases is 0. It says that the failure result is -1 and the success result is some other value. We seem to have this right in most places, but e.g. port/noblock.c gets it wrong. The lack of field complaints implies that just about everybody actually does return 0 on success, but I still think it would be a good idea to run around and make all the calls test specifically for -1. regards, tom lane
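Concretely, the distinction is between testing for the documented failure value and assuming a particular success value; a trivial sketch (generic illustration, not a proposed patch):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    int         fds[2];

    if (pipe(fds) == -1)
        return 1;

    /* Correct: POSIX only promises -1 on failure; success may be nonzero. */
    if (fcntl(fds[0], F_SETFL, O_NONBLOCK) == -1)
        perror("fcntl(F_SETFL)");
    else
        printf("set O_NONBLOCK\n");

    /*
     * Dubious: assumes the success result is exactly 0, which POSIX does
     * not guarantee for fcntl().  On most implementations it happens to be
     * 0, which is why this style rarely gets noticed.
     */
    if (fcntl(fds[1], F_SETFL, O_NONBLOCK) != 0)
        printf("this can fire even though the call succeeded\n");

    close(fds[0]);
    close(fds[1]);
    return 0;
}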
Tom Lane <tgl@sss.pgh.pa.us> writes: > Andres Freund <andres@anarazel.de> writes: >> On 2017-04-21 14:08:21 -0400, Tom Lane wrote: >>> but I see that SUSv2 >>> mandates that fcntl.h provide both F_SETFD and FD_CLOEXEC, so by our own >>> coding rules it ought to be okay to assume they're there. I'm tempted to >>> rip out the quoted bit, as well as the #ifdef F_SETFD, from libpq and see >>> if anything in the buildfarm complains. >> +1 > Done, we'll soon see what happens. Should have seen this coming, I guess: some of the Windows critters are falling over, apparently because they lack fcntl() altogether. So the #ifdef F_SETFD was really acting as a proxy for "#ifdef HAVE_FCNTL". There's no HAVE_FCNTL test in configure ATM, and I'm thinking it would be pretty silly to add one, since surely it would succeed on anything Unix-y enough to run the configure script. I'm guessing the best thing to do is put back #ifdef F_SETFD; alternatively we might spell it like "#ifndef WIN32", but I'm unsure if that'd do what we want on Cygwin or MinGW. In non-Windows code paths in latch.c, we probably wouldn't need to bother with #ifdef F_SETFD. Hopefully we can leave in the removal of "#define FD_CLOEXEC". Will wait a bit longer for more results. regards, tom lane
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > Tom Lane wrote: >> It also appears to me that do_start_bgworker's treatment of fork >> failure is completely brain dead. Did anyone really think about >> that case? > Hmm, I probably modelled it on autovacuum without giving that case much > additional consideration. Attached is a proposed patch that should make fork failure behave sanely, ie it works much the same as a worker crash immediately after launch. I also refactored things a bit to make do_start_bgworker fully responsible for updating the RegisteredBgWorker's state, rather than doing just some of it as before. I tested this by hot-wiring the fork_process call to fail some of the time, which showed that the postmaster now seems to recover OK, but parallel.c's logic is completely innocent of the idea that worker-startup failure is possible. The leader backend just freezes, and nothing short of kill -9 on that backend will get you out of it. Fixing that seems like material for a separate patch though. regards, tom lane diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index c382345..8461c24 100644 *** a/src/backend/postmaster/postmaster.c --- b/src/backend/postmaster/postmaster.c *************** static void TerminateChildren(int signal *** 420,425 **** --- 420,426 ---- #define SignalChildren(sig) SignalSomeChildren(sig, BACKEND_TYPE_ALL) static int CountChildren(int target); + static bool assign_backendlist_entry(RegisteredBgWorker *rw); static void maybe_start_bgworker(void); static bool CreateOptsFile(int argc, char *argv[], char *fullprogname); static pid_t StartChildProcess(AuxProcType type); *************** bgworker_forkexec(int shmem_slot) *** 5531,5543 **** * Start a new bgworker. * Starting time conditions must have been checked already. * * This code is heavily based on autovacuum.c, q.v. */ ! static void do_start_bgworker(RegisteredBgWorker *rw) { pid_t worker_pid; ereport(DEBUG1, (errmsg("starting background worker process \"%s\"", rw->rw_worker.bgw_name))); --- 5532,5564 ---- * Start a new bgworker. * Starting time conditions must have been checked already. * + * Returns true on success, false on failure. + * In either case, update the RegisteredBgWorker's state appropriately. + * * This code is heavily based on autovacuum.c, q.v. */ ! static bool do_start_bgworker(RegisteredBgWorker *rw) { pid_t worker_pid; + Assert(rw->rw_pid == 0); + + /* + * Allocate and assign the Backend element. Note we must do this before + * forking, so that we can handle out of memory properly. + * + * Treat failure as though the worker had crashed. That way, the + * postmaster will wait a bit before attempting to start it again; if it + * tried again right away, most likely it'd find itself repeating the + * out-of-memory or fork failure condition. + */ + if (!assign_backendlist_entry(rw)) + { + rw->rw_crashed_at = GetCurrentTimestamp(); + return false; + } + ereport(DEBUG1, (errmsg("starting background worker process \"%s\"", rw->rw_worker.bgw_name))); *************** do_start_bgworker(RegisteredBgWorker *rw *** 5549,5557 **** #endif { case -1: ereport(LOG, (errmsg("could not fork worker process: %m"))); ! return; #ifndef EXEC_BACKEND case 0: --- 5570,5586 ---- #endif { case -1: + /* in postmaster, fork failed ... */ ereport(LOG, (errmsg("could not fork worker process: %m"))); ! /* undo what assign_backendlist_entry did */ ! ReleasePostmasterChildSlot(rw->rw_child_slot); ! rw->rw_child_slot = 0; ! free(rw->rw_backend); ! rw->rw_backend = NULL; ! 
/* mark entry as crashed, so we'll try again later */ ! rw->rw_crashed_at = GetCurrentTimestamp(); ! break; #ifndef EXEC_BACKEND case 0: *************** do_start_bgworker(RegisteredBgWorker *rw *** 5575,5588 **** PostmasterContext = NULL; StartBackgroundWorker(); break; #endif default: rw->rw_pid = worker_pid; rw->rw_backend->pid = rw->rw_pid; ReportBackgroundWorkerPID(rw); ! break; } } /* --- 5604,5627 ---- PostmasterContext = NULL; StartBackgroundWorker(); + + exit(1); /* should not get here */ break; #endif default: + /* in postmaster, fork successful ... */ rw->rw_pid = worker_pid; rw->rw_backend->pid = rw->rw_pid; ReportBackgroundWorkerPID(rw); ! /* add new worker to lists of backends */ ! dlist_push_head(&BackendList, &rw->rw_backend->elem); ! #ifdef EXEC_BACKEND ! ShmemBackendArrayAdd(rw->rw_backend); ! #endif ! return true; } + + return false; } /* *************** bgworker_should_start_now(BgWorkerStartT *** 5629,5634 **** --- 5668,5675 ---- * Allocate the Backend struct for a connected background worker, but don't * add it to the list of backends just yet. * + * On failure, return false without changing any worker state. + * * Some info from the Backend is copied into the passed rw. */ static bool *************** assign_backendlist_entry(RegisteredBgWor *** 5647,5654 **** ereport(LOG, (errcode(ERRCODE_INTERNAL_ERROR), errmsg("could not generate random cancel key"))); - - rw->rw_crashed_at = GetCurrentTimestamp(); return false; } --- 5688,5693 ---- *************** assign_backendlist_entry(RegisteredBgWor *** 5658,5671 **** ereport(LOG, (errcode(ERRCODE_OUT_OF_MEMORY), errmsg("out of memory"))); - - /* - * The worker didn't really crash, but setting this nonzero makes - * postmaster wait a bit before attempting to start it again; if it - * tried again right away, most likely it'd find itself under the same - * memory pressure. - */ - rw->rw_crashed_at = GetCurrentTimestamp(); return false; } --- 5697,5702 ---- *************** assign_backendlist_entry(RegisteredBgWor *** 5687,5692 **** --- 5718,5728 ---- * As a side effect, the bgworker control variables are set or reset whenever * there are more workers to start after this one, and whenever the overall * system state requires it. + * + * The reason we start at most one worker per call is to avoid consuming the + * postmaster's attention for too long when many such requests are pending. + * As long as StartWorkerNeeded is true, ServerLoop will not block and will + * call this function again after dealing with any other issues. */ static void maybe_start_bgworker(void) *************** maybe_start_bgworker(void) *** 5694,5706 **** slist_mutable_iter iter; TimestampTz now = 0; if (FatalError) { StartWorkerNeeded = false; HaveCrashedWorker = false; ! return; /* not yet */ } HaveCrashedWorker = false; slist_foreach_modify(iter, &BackgroundWorkerList) --- 5730,5748 ---- slist_mutable_iter iter; TimestampTz now = 0; + /* + * During crash recovery, we have no need to be called until the state + * transition out of recovery. + */ if (FatalError) { StartWorkerNeeded = false; HaveCrashedWorker = false; ! return; } + /* Don't need to be called again unless we find a reason for it below */ + StartWorkerNeeded = false; HaveCrashedWorker = false; slist_foreach_modify(iter, &BackgroundWorkerList) *************** maybe_start_bgworker(void) *** 5709,5719 **** rw = slist_container(RegisteredBgWorker, rw_lnode, iter.cur); ! /* already running? */ if (rw->rw_pid != 0) continue; ! /* marked for death? 
*/ if (rw->rw_terminate) { ForgetBackgroundWorker(&iter); --- 5751,5761 ---- rw = slist_container(RegisteredBgWorker, rw_lnode, iter.cur); ! /* ignore if already running */ if (rw->rw_pid != 0) continue; ! /* if marked for death, clean up and remove from list */ if (rw->rw_terminate) { ForgetBackgroundWorker(&iter); *************** maybe_start_bgworker(void) *** 5735,5746 **** --- 5777,5790 ---- continue; } + /* read system time only when needed */ if (now == 0) now = GetCurrentTimestamp(); if (!TimestampDifferenceExceeds(rw->rw_crashed_at, now, rw->rw_worker.bgw_restart_time * 1000)) { + /* Set flag to remember that we have workers to start later */ HaveCrashedWorker = true; continue; } *************** maybe_start_bgworker(void) *** 5748,5782 **** if (bgworker_should_start_now(rw->rw_worker.bgw_start_time)) { ! /* reset crash time before calling assign_backendlist_entry */ rw->rw_crashed_at = 0; /* ! * Allocate and assign the Backend element. Note we must do this ! * before forking, so that we can handle out of memory properly. */ ! if (!assign_backendlist_entry(rw)) return; ! ! do_start_bgworker(rw); /* sets rw->rw_pid */ ! ! dlist_push_head(&BackendList, &rw->rw_backend->elem); ! #ifdef EXEC_BACKEND ! ShmemBackendArrayAdd(rw->rw_backend); ! #endif /* ! * Have ServerLoop call us again. Note that there might not ! * actually *be* another runnable worker, but we don't care all ! * that much; we will find out the next time we run. */ StartWorkerNeeded = true; return; } } - - /* no runnable worker found */ - StartWorkerNeeded = false; } /* --- 5792,5826 ---- if (bgworker_should_start_now(rw->rw_worker.bgw_start_time)) { ! /* reset crash time before trying to start worker */ rw->rw_crashed_at = 0; /* ! * Try to start the worker. ! * ! * On failure, give up processing workers for now, but set ! * StartWorkerNeeded so we'll come back here on the next iteration ! * of ServerLoop to try again. (We don't want to wait, because ! * there might be additional ready-to-run workers.) We could set ! * HaveCrashedWorker as well, since this worker is now marked ! * crashed, but there's no need because the next run of this ! * function will do that. */ ! if (!do_start_bgworker(rw)) ! { ! StartWorkerNeeded = true; return; ! } /* ! * Quit, but have ServerLoop call us again to look for additional ! * ready-to-run workers. There might not be any, but we'll find ! * out the next time we run. */ StartWorkerNeeded = true; return; } } } /* -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
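Reduced to its essentials, the behavior the patch aims for on fork failure is: undo whatever was set up before the fork, record a pseudo-crash time, and let the normal restart machinery retry after a delay instead of hammering fork() while the system is out of resources. A generic standalone sketch of that shape (the names and structures here are illustrative, not the actual RegisteredBgWorker code):

#include <stdio.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

typedef struct
{
    pid_t       pid;            /* 0 if not running */
    time_t      crashed_at;     /* 0, or when the last start attempt failed */
} worker_slot;

/*
 * Try to start one worker.  On fork failure, record a "crash" time so the
 * caller waits a bit before retrying this slot.
 */
static int
start_worker(worker_slot *w)
{
    pid_t       pid = fork();

    if (pid == -1)
    {
        perror("fork");
        w->crashed_at = time(NULL); /* retry later, not immediately */
        return 0;
    }
    if (pid == 0)
    {
        /* child: a real worker would run its main loop or exec() here */
        _exit(0);
    }
    w->pid = pid;                   /* parent: remember the child */
    return 1;
}

int
main(void)
{
    worker_slot w = {0, 0};

    if (start_worker(&w))
        printf("started worker pid %ld\n", (long) w.pid);
    else
        printf("start failed; will retry after a delay\n");
    return 0;
}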
I wrote: > Attached is a lightly-tested draft patch that converts the postmaster to > use a WaitEventSet for waiting in ServerLoop. I've got mixed emotions > about whether this is the direction to proceed, though. Attached are a couple of patches that represent a plausible Plan B. The first one changes the postmaster to run its signal handlers without specifying SA_RESTART. I've confirmed that that seems to fix the select_parallel-test-takes-a-long-time problem on gaur/pademelon. The second one uses pselect, if available, to replace the unblock-signals/ select()/block-signals dance in ServerLoop. On platforms where pselect exists and works properly, that should fix the race condition I described previously. On platforms where it doesn't, we're no worse off than before. As mentioned in the comments for the second patch, even if we don't have working pselect(), the only problem is that ServerLoop's response to an interrupt might be delayed by as much as the up-to-1-minute timeout. The only existing case where that's really bad is launching multiple bgworkers. I would therefore advocate also changing maybe_start_bgworker to start up to N bgworkers per call, where N is large enough to pretty much always satisfy simultaneously-arriving requests. I'd pick 100 or so, but am willing to negotiate. I think that these patches represent something we could back-patch without a lot of trepidation, unlike the WaitEventSet-based approach. Therefore, my proposal is to apply and backpatch these changes, and call it good for v10. For v11, we could work on changing the postmaster to not do work in signal handlers, as discussed upthread. That would supersede these two patches completely, though I'd still advocate for keeping the change in maybe_start_bgworker. Note: for testing purposes, these patches are quite independent; just ignore the hunk in the second patch that changes a comment added by the first one. regards, tom lane diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index c382345..52b5e2c 100644 *** a/src/backend/postmaster/postmaster.c --- b/src/backend/postmaster/postmaster.c *************** PostmasterMain(int argc, char *argv[]) *** 610,615 **** --- 610,624 ---- /* * Set up signal handlers for the postmaster process. * + * In the postmaster, we want to install non-ignored handlers *without* + * SA_RESTART. This is because they'll be blocked at all times except + * when ServerLoop is waiting for something to happen, and during that + * window, we want signals to exit the select(2) wait so that ServerLoop + * can respond if anything interesting happened. On some platforms, + * signals marked SA_RESTART would not cause the select() wait to end. + * Child processes will generally want SA_RESTART, but we expect them to + * set up their own handlers before unblocking signals. + * * CAUTION: when changing this list, check for side-effects on the signal * handling setup of child processes. See tcop/postgres.c, * bootstrap/bootstrap.c, postmaster/bgwriter.c, postmaster/walwriter.c, *************** PostmasterMain(int argc, char *argv[]) *** 620,635 **** pqinitmask(); PG_SETMASK(&BlockSig); ! pqsignal(SIGHUP, SIGHUP_handler); /* reread config file and have ! * children do same */ ! pqsignal(SIGINT, pmdie); /* send SIGTERM and shut down */ ! pqsignal(SIGQUIT, pmdie); /* send SIGQUIT and die */ ! pqsignal(SIGTERM, pmdie); /* wait for children and shut down */ pqsignal(SIGALRM, SIG_IGN); /* ignored */ pqsignal(SIGPIPE, SIG_IGN); /* ignored */ ! 
pqsignal(SIGUSR1, sigusr1_handler); /* message from child process */ ! pqsignal(SIGUSR2, dummy_handler); /* unused, reserve for children */ ! pqsignal(SIGCHLD, reaper); /* handle child termination */ pqsignal(SIGTTIN, SIG_IGN); /* ignored */ pqsignal(SIGTTOU, SIG_IGN); /* ignored */ /* ignore SIGXFSZ, so that ulimit violations work like disk full */ --- 629,648 ---- pqinitmask(); PG_SETMASK(&BlockSig); ! pqsignal_no_restart(SIGHUP, SIGHUP_handler); /* reread config file ! * and have children do ! * same */ ! pqsignal_no_restart(SIGINT, pmdie); /* send SIGTERM and shut down */ ! pqsignal_no_restart(SIGQUIT, pmdie); /* send SIGQUIT and die */ ! pqsignal_no_restart(SIGTERM, pmdie); /* wait for children and shut ! * down */ pqsignal(SIGALRM, SIG_IGN); /* ignored */ pqsignal(SIGPIPE, SIG_IGN); /* ignored */ ! pqsignal_no_restart(SIGUSR1, sigusr1_handler); /* message from child ! * process */ ! pqsignal_no_restart(SIGUSR2, dummy_handler); /* unused, reserve for ! * children */ ! pqsignal_no_restart(SIGCHLD, reaper); /* handle child termination */ pqsignal(SIGTTIN, SIG_IGN); /* ignored */ pqsignal(SIGTTOU, SIG_IGN); /* ignored */ /* ignore SIGXFSZ, so that ulimit violations work like disk full */ diff --git a/src/include/port.h b/src/include/port.h index c6937e5..52910ed 100644 *** a/src/include/port.h --- b/src/include/port.h *************** extern int pg_mkdir_p(char *path, int om *** 469,474 **** --- 469,479 ---- /* port/pqsignal.c */ typedef void (*pqsigfunc) (int signo); extern pqsigfunc pqsignal(int signo, pqsigfunc func); + #ifndef WIN32 + extern pqsigfunc pqsignal_no_restart(int signo, pqsigfunc func); + #else + #define pqsignal_no_restart(signo, func) pqsignal(signo, func) + #endif /* port/quotes.c */ extern char *escape_single_quotes_ascii(const char *src); diff --git a/src/port/pqsignal.c b/src/port/pqsignal.c index 5d366da..e7e4451 100644 *** a/src/port/pqsignal.c --- b/src/port/pqsignal.c *************** *** 32,38 **** #if !defined(WIN32) || defined(FRONTEND) /* ! * Set up a signal handler for signal "signo" * * Returns the previous handler. */ --- 32,38 ---- #if !defined(WIN32) || defined(FRONTEND) /* ! * Set up a signal handler, with SA_RESTART, for signal "signo" * * Returns the previous handler. */ *************** pqsignal(int signo, pqsigfunc func) *** 58,61 **** --- 58,90 ---- #endif } + /* + * Set up a signal handler, without SA_RESTART, for signal "signo" + * + * Returns the previous handler. + * + * On Windows, this would be identical to pqsignal(), so don't bother. + */ + #ifndef WIN32 + + pqsigfunc + pqsignal_no_restart(int signo, pqsigfunc func) + { + struct sigaction act, + oact; + + act.sa_handler = func; + sigemptyset(&act.sa_mask); + act.sa_flags = 0; + #ifdef SA_NOCLDSTOP + if (signo == SIGCHLD) + act.sa_flags |= SA_NOCLDSTOP; + #endif + if (sigaction(signo, &act, &oact) < 0) + return SIG_ERR; + return oact.sa_handler; + } + + #endif /* !WIN32 */ + #endif /* !defined(WIN32) || defined(FRONTEND) */ diff --git a/configure b/configure index 99d05bf..42ff097 100755 *** a/configure --- b/configure *************** fi *** 12892,12898 **** LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` ! 
for ac_func in cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat pthread_is_threaded_npreadlink setproctitle setsid shm_open symlink sync_file_range towlower utime utimes wcstombs wcstombs_l do : as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh` ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var" --- 12892,12898 ---- LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` ! for ac_func in cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pselect pstatpthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range towlower utime utimes wcstombswcstombs_l do : as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh` ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var" diff --git a/configure.in b/configure.in index c36c503..871edd8 100644 *** a/configure.in --- b/configure.in *************** PGAC_FUNC_WCSTOMBS_L *** 1429,1435 **** LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` ! AC_CHECK_FUNCS([cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat pthread_is_threaded_npreadlink setproctitle setsid shm_open symlink sync_file_range towlower utime utimes wcstombs wcstombs_l]) AC_REPLACE_FUNCS(fseeko) case $host_os in --- 1429,1435 ---- LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` ! AC_CHECK_FUNCS([cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pselect pstatpthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range towlower utime utimes wcstombswcstombs_l]) AC_REPLACE_FUNCS(fseeko) case $host_os in diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index 52b5e2c..382e6a8 100644 *** a/src/backend/postmaster/postmaster.c --- b/src/backend/postmaster/postmaster.c *************** PostmasterMain(int argc, char *argv[]) *** 613,621 **** * In the postmaster, we want to install non-ignored handlers *without* * SA_RESTART. This is because they'll be blocked at all times except * when ServerLoop is waiting for something to happen, and during that ! * window, we want signals to exit the select(2) wait so that ServerLoop * can respond if anything interesting happened. On some platforms, ! * signals marked SA_RESTART would not cause the select() wait to end. * Child processes will generally want SA_RESTART, but we expect them to * set up their own handlers before unblocking signals. * --- 613,621 ---- * In the postmaster, we want to install non-ignored handlers *without* * SA_RESTART. This is because they'll be blocked at all times except * when ServerLoop is waiting for something to happen, and during that ! * window, we want signals to exit the pselect(2) wait so that ServerLoop * can respond if anything interesting happened. On some platforms, ! * signals marked SA_RESTART would not cause the pselect() wait to end. * Child processes will generally want SA_RESTART, but we expect them to * set up their own handlers before unblocking signals. * *************** ServerLoop(void) *** 1669,1674 **** --- 1669,1676 ---- for (;;) { fd_set rmask; + fd_set *rmask_p; + struct timeval timeout; int selres; time_t now; *************** ServerLoop(void) *** 1678,1714 **** * We block all signals except while sleeping. 
That makes it safe for * signal handlers, which again block all signals while executing, to * do nontrivial work. - * - * If we are in PM_WAIT_DEAD_END state, then we don't want to accept - * any new connections, so we don't call select(), and just sleep. */ - memcpy((char *) &rmask, (char *) &readmask, sizeof(fd_set)); - if (pmState == PM_WAIT_DEAD_END) { ! PG_SETMASK(&UnBlockSig); ! ! pg_usleep(100000L); /* 100 msec seems reasonable */ ! selres = 0; ! ! PG_SETMASK(&BlockSig); } else { ! /* must set timeout each time; some OSes change it! */ ! struct timeval timeout; /* Needs to run with blocked signals! */ DetermineSleepTime(&timeout); PG_SETMASK(&UnBlockSig); ! selres = select(nSockets, &rmask, NULL, NULL, &timeout); PG_SETMASK(&BlockSig); } ! /* Now check the select() result */ if (selres < 0) { if (errno != EINTR && errno != EWOULDBLOCK) --- 1680,1743 ---- * We block all signals except while sleeping. That makes it safe for * signal handlers, which again block all signals while executing, to * do nontrivial work. */ if (pmState == PM_WAIT_DEAD_END) { ! /* ! * If we are in PM_WAIT_DEAD_END state, then we don't want to ! * accept any new connections, so pass a null rmask. ! */ ! rmask_p = NULL; ! timeout.tv_sec = 0; ! timeout.tv_usec = 100000; /* 100 msec seems reasonable */ } else { ! /* Normal case: check sockets, and compute a suitable timeout */ ! memcpy(&rmask, &readmask, sizeof(fd_set)); ! rmask_p = &rmask; /* Needs to run with blocked signals! */ DetermineSleepTime(&timeout); + } + /* + * We prefer to wait with pselect(2) if available, as using that, + * together with *not* using SA_RESTART for signals, guarantees that + * we will get kicked off the wait if a signal occurs. + * + * If we lack pselect(2), fake it with select(2). This has a race + * condition: a signal that was already pending will be delivered + * before we reach the select(), and therefore the select() will wait, + * even though we might wish to do something in response. Therefore, + * beware of putting any time-critical signal response logic into + * ServerLoop rather than into the signal handler itself. It will run + * eventually, but maybe not till after a timeout delay. + * + * It is rumored that some implementations of pselect() are not + * atomic, making the first alternative here functionally equivalent + * to the second. Not much we can do about that though. + */ + { + #ifdef HAVE_PSELECT + /* pselect uses a randomly different timeout API, sigh */ + struct timespec ptimeout; + + ptimeout.tv_sec = timeout.tv_sec; + ptimeout.tv_nsec = timeout.tv_usec * 1000; + + selres = pselect(nSockets, rmask_p, NULL, NULL, + &ptimeout, &UnBlockSig); + #else PG_SETMASK(&UnBlockSig); ! selres = select(nSockets, rmask_p, NULL, NULL, &timeout); PG_SETMASK(&BlockSig); + #endif } ! /* Now check the select()/pselect() result */ if (selres < 0) { if (errno != EINTR && errno != EWOULDBLOCK) diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in index 03e9803..9bbf1b1 100644 *** a/src/include/pg_config.h.in --- b/src/include/pg_config.h.in *************** *** 400,405 **** --- 400,408 ---- /* Define to 1 if the assembler supports PPC's LWARX mutex hint bit. */ #undef HAVE_PPC_LWARX_MUTEX_HINT + /* Define to 1 if you have the `pselect' function. */ + #undef HAVE_PSELECT + /* Define to 1 if you have the `pstat' function. 
*/ #undef HAVE_PSTAT diff --git a/src/include/pg_config.h.win32 b/src/include/pg_config.h.win32 index f9588b0..5d36768 100644 *** a/src/include/pg_config.h.win32 --- b/src/include/pg_config.h.win32 *************** *** 267,272 **** --- 267,275 ---- /* Define to 1 if you have the <poll.h> header file. */ /* #undef HAVE_POLL_H */ + /* Define to 1 if you have the `pselect' function. */ + /* #undef HAVE_PSELECT */ + /* Define to 1 if you have the `pstat' function. */ /* #undef HAVE_PSTAT */ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
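Stripped of the postmaster specifics, the second patch relies on the classic pselect() idiom: keep the signal blocked except while waiting, and let pselect() swap the mask and start waiting in one atomic step. A self-contained sketch of that idiom follows (not the patch itself; SIGUSR1 and the 60-second timeout are arbitrary choices for the example):

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/select.h>

static volatile sig_atomic_t wake_requested = 0;

static void
handler(int signo)
{
    (void) signo;
    wake_requested = 1;
}

int
main(void)
{
    struct sigaction act;
    sigset_t    blockmask,
                origmask;

    /* Install the handler *without* SA_RESTART, so waits get interrupted. */
    act.sa_handler = handler;
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;
    sigaction(SIGUSR1, &act, NULL);

    /* Keep the signal blocked except while actually waiting. */
    sigemptyset(&blockmask);
    sigaddset(&blockmask, SIGUSR1);
    sigprocmask(SIG_BLOCK, &blockmask, &origmask);

    for (;;)
    {
        struct timespec timeout;

        /* The signal is blocked here, so this check cannot race the handler. */
        if (wake_requested)
        {
            wake_requested = 0;
            printf("woken by signal\n");
            break;
        }

        timeout.tv_sec = 60;
        timeout.tv_nsec = 0;

        /*
         * pselect() installs origmask (signal unblocked) and begins waiting
         * as one atomic step, so a signal sent at any point after the check
         * above interrupts the wait.
         */
        if (pselect(0, NULL, NULL, NULL, &timeout, &origmask) == -1 &&
            errno != EINTR)
        {
            perror("pselect");
            return 1;
        }
    }
    return 0;
}

Sending the process a SIGUSR1 from another terminal wakes it immediately. With the plain unblock/select()/block sequence, a signal that lands just before select() is serviced first and the wait then sleeps out its timeout anyway, which is exactly the race described above.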
On 2017-04-21 23:50:41 -0400, Tom Lane wrote: > Attached are a couple of patches that represent a plausible Plan B. > The first one changes the postmaster to run its signal handlers without > specifying SA_RESTART. I've confirmed that that seems to fix the > select_parallel-test-takes-a-long-time problem on gaur/pademelon. > The second one uses pselect, if available, to replace the unblock-signals/ > select()/block-signals dance in ServerLoop. On platforms where pselect > exists and works properly, that should fix the race condition I described > previously. On platforms where it doesn't, we're no worse off than > before. We probably should note somewhere prominently that pselect isn't actually race-free on a number of platforms. > I think that these patches represent something we could back-patch > without a lot of trepidation, unlike the WaitEventSet-based approach. > Therefore, my proposal is to apply and backpatch these changes, and > call it good for v10. For v11, we could work on changing the postmaster > to not do work in signal handlers, as discussed upthread. That would > supersede these two patches completely, though I'd still advocate for > keeping the change in maybe_start_bgworker. Not yet having looked at your patches, that sounds like a reasonable plan. I'd still like to get something like your CLOEXEC patch applied independently however. < patches > Looks reasonable on a quick skim. Greetings, Andres Freund
On 2017-04-21 23:50:41 -0400, Tom Lane wrote: > I wrote: > > Attached is a lightly-tested draft patch that converts the postmaster to > > use a WaitEventSet for waiting in ServerLoop. I've got mixed emotions > > about whether this is the direction to proceed, though. > > Attached are a couple of patches that represent a plausible Plan B. > The first one changes the postmaster to run its signal handlers without > specifying SA_RESTART. I've confirmed that that seems to fix the > select_parallel-test-takes-a-long-time problem on gaur/pademelon. > The second one uses pselect, if available, to replace the unblock-signals/ > select()/block-signals dance in ServerLoop. On platforms where pselect > exists and works properly, that should fix the race condition I described > previously. On platforms where it doesn't, we're no worse off than > before. > > As mentioned in the comments for the second patch, even if we don't > have working pselect(), the only problem is that ServerLoop's response > to an interrupt might be delayed by as much as the up-to-1-minute timeout. > The only existing case where that's really bad is launching multiple > bgworkers. I would therefore advocate also changing maybe_start_bgworker > to start up to N bgworkers per call, where N is large enough to pretty > much always satisfy simultaneously-arriving requests. I'd pick 100 or > so, but am willing to negotiate. > > I think that these patches represent something we could back-patch > without a lot of trepidation, unlike the WaitEventSet-based approach. > Therefore, my proposal is to apply and backpatch these changes, and > call it good for v10. For v11, we could work on changing the postmaster > to not do work in signal handlers, as discussed upthread. That would > supersede these two patches completely, though I'd still advocate for > keeping the change in maybe_start_bgworker. > > Note: for testing purposes, these patches are quite independent; just > ignore the hunk in the second patch that changes a comment added by > the first one. Unclear if related, but https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2017-04-24%2019%3A30%3A42 has a suspicious timing of failing in a weird way. - Andres
On 2017-04-24 13:16:44 -0700, Andres Freund wrote: > Unclear if related, but > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2017-04-24%2019%3A30%3A42 > has a suspicious timing of failing in a weird way. Given that gharial is also failing on 9.6 (same set of commits) and coypu fails (again same set) on 9.6 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=coypu&dt=2017-04-24%2018%3A20%3A33 I'm afraid it's more likely to be related. gharial & coypu owners, any chance you could try starting postgres with log_min_messages=debug5 on one of the affected machines? - Andres
Andres Freund <andres@anarazel.de> writes: > On 2017-04-24 13:16:44 -0700, Andres Freund wrote: >> Unclear if related, but >> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2017-04-24%2019%3A30%3A42 >> has a suspicious timing of failing in a weird way. > Given that gharial is also failing on 9.6 (same set of commits) and > coypu fails (again same set) on 9.6 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=coypu&dt=2017-04-24%2018%3A20%3A33 coypu's problem is unrelated: running bootstrap script ... 2017-04-24 23:04:53.084 CEST [21114] FATAL: could not create semaphores: No space left on device 2017-04-24 23:04:53.084 CEST [21114] DETAIL: Failed system call was semget(1, 17, 03600). but it does seem likely that one of these patches broke gharial. That's pretty annoying :-( regards, tom lane
On 2017-04-24 17:33:39 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2017-04-24 13:16:44 -0700, Andres Freund wrote: > >> Unclear if related, but > >> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2017-04-24%2019%3A30%3A42 > >> has a suspicious timing of failing in a weird way. > > > Given that gharial is also failing on 9.6 (same set of commits) and > > coypu fails (again same set) on 9.6 > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=coypu&dt=2017-04-24%2018%3A20%3A33 > > coypu's problem is unrelated: > > running bootstrap script ... 2017-04-24 23:04:53.084 CEST [21114] FATAL: could not create semaphores: No space left on device > 2017-04-24 23:04:53.084 CEST [21114] DETAIL: Failed system call was semget(1, 17, 03600). Note I was linking the 9.6 report from coypu, not HEAD. Afaics the 9.6 failure is the same as gharial's mode of failure. - Andres
Andres Freund <andres@anarazel.de> writes: > On 2017-04-24 17:33:39 -0400, Tom Lane wrote: >> coypu's problem is unrelated: > Note I was linking the 9.6 report form coypu, not HEAD. Afaics the 9.6 > failure is the same as gharial's mode of failure. [ looks closer... ] Oh: the 9.6 run occurred first, and the failures on HEAD and 9.5 are presumably follow-on damage because the stuck postmaster hasn't released semaphores. A bit of googling establishes that NetBSD 5.1 has a broken pselect implementation: http://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=43625 That says they fixed it in later versions but not 5.1 :-( I can't find any similar smoking gun on the web for HPUX, but I'd fully expect their bug database to be behind a paywall. What I'm inclined to do is to revert the pselect change but not the other, to see if that fixes these two animals. If it does, we could look into blacklisting these particular platforms when choosing pselect. regards, tom lane
On 2017-04-24 18:14:41 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2017-04-24 17:33:39 -0400, Tom Lane wrote: > >> coypu's problem is unrelated: > > > Note I was linking the 9.6 report form coypu, not HEAD. Afaics the 9.6 > > failure is the same as gharial's mode of failure. > > [ looks closer... ] Oh: the 9.6 run occurred first, and the failures on > HEAD and 9.5 are presumably follow-on damage because the stuck postmaster > hasn't released semaphores. > > A bit of googling establishes that NetBSD 5.1 has a broken pselect > implementation: > > http://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=43625 Yikes. Do I understand correctly that they effectively just mapped pselect to select? > What I'm inclined to do is to revert the pselect change but not the other, > to see if that fixes these two animals. If it does, we could look into > blacklisting these particular platforms when choosing pselect. Seems sensible. - Andres
Andres Freund <andres@anarazel.de> writes: > On 2017-04-24 18:14:41 -0400, Tom Lane wrote: >> A bit of googling establishes that NetBSD 5.1 has a broken pselect >> implementation: >> >> http://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=43625 > Yikes. Do I understand correctly that they effectively just mapped > pselect to select? That's what it sounds like. Probably not intentionally, but in effect. >> What I'm inclined to do is to revert the pselect change but not the other, >> to see if that fixes these two animals. If it does, we could look into >> blacklisting these particular platforms when choosing pselect. > Seems sensible. Will do. regards, tom lane
I wrote: > What I'm inclined to do is to revert the pselect change but not the other, > to see if that fixes these two animals. If it does, we could look into > blacklisting these particular platforms when choosing pselect. It looks like coypu is going to need manual intervention (ie, kill -9 on the leftover postmaster) to get unwedged :-(. That's particularly disturbing because it implies that ServerLoop isn't iterating at all; otherwise, it'd have noticed by now that the buildfarm script deleted its data directory out from under it. Even if NetBSD's pselect had forgotten to unblock signals, you'd figure it'd time out after a minute ... so it's even more broken than that. regards, tom lane
> On 25 Apr 2017, at 01:47, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I wrote: >> What I'm inclined to do is to revert the pselect change but not the other, >> to see if that fixes these two animals. If it does, we could look into >> blacklisting these particular platforms when choosing pselect. > > It looks like coypu is going to need manual intervention (ie, kill -9 > on the leftover postmaster) to get unwedged :-(. That's particularly > disturbing because it implies that ServerLoop isn't iterating at all; > otherwise, it'd have noticed by now that the buildfarm script deleted > its data directory out from under it. Even if NetBSD's pselect had > forgotten to unblock signals, you'd figure it'd time out after a > minute ... so it's even more broken than that. > Hi, coypu was not stuck (no buildfarm related process running), but failed to clean-up shared memory and semaphores. I've done the clean-up. Regards, Rémi
Rémi Zara <remi_zara@mac.com> writes: >> Le 25 avr. 2017 à 01:47, Tom Lane <tgl@sss.pgh.pa.us> a écrit : >> It looks like coypu is going to need manual intervention (ie, kill -9 >> on the leftover postmaster) to get unwedged :-(. That's particularly >> disturbing because it implies that ServerLoop isn't iterating at all; >> otherwise, it'd have noticed by now that the buildfarm script deleted >> its data directory out from under it. > coypu was not stuck (no buildfarm related process running), but failed to clean-up shared memory and semaphores. > I’ve done the clean-up. Huh, that's even more interesting. Looking at the code, what ServerLoop actually does when it notices that the postmaster.pid file has been removed is kill(MyProcPid, SIGQUIT); So if our hypothesis is that pselect() failed to unblock signals, then failure to quit is easily explained: the postmaster never received/acted on its own signal. But that should have left you with a running postmaster holding the shared memory and semaphores. Seems like if it is gone but it failed to remove those, somebody must've kill -9'd it ... but who? I see nothing in the buildfarm script that would. regards, tom lane
I wrote: > Rémi Zara <remi_zara@mac.com> writes: >> coypu was not stuck (no buildfarm related process running), but failed to clean-up shared memory and semaphores. >> I’ve done the clean-up. > Huh, that's even more interesting. I installed NetBSD 5.1.5 on an old Mac G4; I believe this is a reasonable approximation to coypu's environment. With the pselect patch installed, I can replicate the behavior we saw in the buildfarm of connections immediately failing with "the database system is starting up". Investigation shows that pselect reports ready sockets correctly (which is what allows connections to get in at all), and it does stop waiting either for a signal or for a timeout. What it forgets to do is to actually service the signal. The observed behavior is caused by the fact that reaper() is never called so the postmaster never realizes that the startup process has finished. I experimented with putting PG_SETMASK(&UnBlockSig); PG_SETMASK(&BlockSig); immediately after the pselect() call, and found that indeed that lets signals get serviced, and things work pretty much normally. However, closer inspection finds that pselect only stops waiting when a signal arrives *while it's waiting*, not if there was a signal already pending. So this is actually even more broken than the so called "non atomic" behavior we had expected to see --- at least with that, the pending signal would have gotten serviced promptly, even if ServerLoop itself didn't iterate. This is all giving me less than warm fuzzy feelings about the state of pselect support out in the real world. So at this point we seem to have three plausible alternatives: 1. Let HEAD stand as it is. We have a problem with slow response to bgworker start requests that arrive while ServerLoop is active, but that's a pretty tight window usually (although I believe I've seen it hit at least once in testing). 2. Reinstall the pselect patch, blacklisting NetBSD and HPUX and whatever else we find to be flaky. Then only the blacklisted platforms have the problem. 3. Go ahead with converting the postmaster to use WaitEventSet, a la the draft patch I posted earlier. I'd be happy to do this if we were at the start of a devel cycle, but right now seems a bit late --- not to mention that we really need to fix 9.6 as well. We could substantially ameliorate the slow-response problem by allowing maybe_start_bgworker to launch multiple workers per call, which is something I think we should do regardless. (I have a patch written to allow it to launch up to N workers per call, but have held off committing that till after the dust settles in ServerLoop.) I'm leaning to doing #1 plus the maybe_start_bgworker change. There's certainly room for difference of opinion here, though. Thoughts? regards, tom lane
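The pending-signal behavior described above is easy to probe with a few lines: block the signal, make it pending with kill(), then call pselect() with a mask that unblocks it. A conforming implementation runs the handler and returns immediately with EINTR; the broken ones sleep out the timeout. A quick test sketch (not proposed for the tree):

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/select.h>
#include <unistd.h>

static volatile sig_atomic_t handler_ran = 0;

static void
handler(int signo)
{
    (void) signo;
    handler_ran = 1;
}

int
main(void)
{
    struct sigaction act;
    sigset_t    blockmask,
                waitmask;
    struct timespec timeout;
    int         rc;

    act.sa_handler = handler;
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;
    sigaction(SIGUSR1, &act, NULL);

    /* Block SIGUSR1, then make it pending while it is still blocked. */
    sigemptyset(&blockmask);
    sigaddset(&blockmask, SIGUSR1);
    sigprocmask(SIG_BLOCK, &blockmask, NULL);
    kill(getpid(), SIGUSR1);

    /* Wait with a mask that unblocks everything, including SIGUSR1. */
    sigemptyset(&waitmask);
    timeout.tv_sec = 10;
    timeout.tv_nsec = 0;
    rc = pselect(0, NULL, NULL, NULL, &timeout, &waitmask);

    /* Conforming: rc == -1, errno == EINTR, handler_ran == 1, no 10s delay. */
    printf("pselect rc=%d errno=%s handler_ran=%d\n",
           rc, (rc == -1) ? strerror(errno) : "none", (int) handler_ran);
    return 0;
}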
Andres Freund <andres@anarazel.de> writes: > I'd still like to get something like your CLOEXEC patch applied > independently however. Here's an updated version of that, which makes use of our previous conclusion that F_SETFD/FD_CLOEXEC are available everywhere except Windows, and fixes some sloppy thinking about the EXEC_BACKEND case. I went ahead and changed the call to epoll_create into epoll_create1. I'm not too concerned about loss of portability there --- it seems unlikely that many people are still using ten-year-old glibc, and even less likely that any of them would be interested in running current Postgres on their stable-unto-death platform. We could add a configure test for epoll_create1 if you feel one's needed, but I think it'd just be a waste of cycles. I propose to push this into HEAD and 9.6 too. regards, tom lane diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c index 0d0701a..946fbff 100644 *** a/src/backend/storage/ipc/latch.c --- b/src/backend/storage/ipc/latch.c *************** static volatile sig_atomic_t waiting = f *** 118,123 **** --- 118,126 ---- static int selfpipe_readfd = -1; static int selfpipe_writefd = -1; + /* Process owning the self-pipe --- needed for checking purposes */ + static int selfpipe_owner_pid = 0; + /* Private function prototypes */ static void sendSelfPipeByte(void); static void drainSelfPipe(void); *************** InitializeLatchSupport(void) *** 146,152 **** #ifndef WIN32 int pipefd[2]; ! Assert(selfpipe_readfd == -1); /* * Set up the self-pipe that allows a signal handler to wake up the --- 149,189 ---- #ifndef WIN32 int pipefd[2]; ! if (IsUnderPostmaster) ! { ! /* ! * We might have inherited connections to a self-pipe created by the ! * postmaster. It's critical that child processes create their own ! * self-pipes, of course, and we really want them to close the ! * inherited FDs for safety's sake. ! */ ! if (selfpipe_owner_pid != 0) ! { ! /* Assert we go through here but once in a child process */ ! Assert(selfpipe_owner_pid != MyProcPid); ! /* Release postmaster's pipe FDs; ignore any error */ ! (void) close(selfpipe_readfd); ! (void) close(selfpipe_writefd); ! /* Clean up, just for safety's sake; we'll set these below */ ! selfpipe_readfd = selfpipe_writefd = -1; ! selfpipe_owner_pid = 0; ! } ! else ! { ! /* ! * Postmaster didn't create a self-pipe ... or else we're in an ! * EXEC_BACKEND build, in which case it doesn't matter since the ! * postmaster's pipe FDs were closed by the action of FD_CLOEXEC. ! */ ! Assert(selfpipe_readfd == -1); ! } ! } ! else ! { ! /* In postmaster or standalone backend, assert we do this but once */ ! Assert(selfpipe_readfd == -1); ! Assert(selfpipe_owner_pid == 0); ! } /* * Set up the self-pipe that allows a signal handler to wake up the *************** InitializeLatchSupport(void) *** 154,176 **** * that SetLatch won't block if the event has already been set many times * filling the kernel buffer. Make the read-end non-blocking too, so that * we can easily clear the pipe by reading until EAGAIN or EWOULDBLOCK. */ if (pipe(pipefd) < 0) elog(FATAL, "pipe() failed: %m"); if (fcntl(pipefd[0], F_SETFL, O_NONBLOCK) == -1) ! elog(FATAL, "fcntl() failed on read-end of self-pipe: %m"); if (fcntl(pipefd[1], F_SETFL, O_NONBLOCK) == -1) ! elog(FATAL, "fcntl() failed on write-end of self-pipe: %m"); selfpipe_readfd = pipefd[0]; selfpipe_writefd = pipefd[1]; #else /* currently, nothing to do here for Windows */ #endif } /* ! * Initialize a backend-local latch. 
*/ void InitLatch(volatile Latch *latch) --- 191,220 ---- * that SetLatch won't block if the event has already been set many times * filling the kernel buffer. Make the read-end non-blocking too, so that * we can easily clear the pipe by reading until EAGAIN or EWOULDBLOCK. + * Also, make both FDs close-on-exec, since we surely do not want any + * child processes messing with them. */ if (pipe(pipefd) < 0) elog(FATAL, "pipe() failed: %m"); if (fcntl(pipefd[0], F_SETFL, O_NONBLOCK) == -1) ! elog(FATAL, "fcntl(F_SETFL) failed on read-end of self-pipe: %m"); if (fcntl(pipefd[1], F_SETFL, O_NONBLOCK) == -1) ! elog(FATAL, "fcntl(F_SETFL) failed on write-end of self-pipe: %m"); ! if (fcntl(pipefd[0], F_SETFD, FD_CLOEXEC) == -1) ! elog(FATAL, "fcntl(F_SETFD) failed on read-end of self-pipe: %m"); ! if (fcntl(pipefd[1], F_SETFD, FD_CLOEXEC) == -1) ! elog(FATAL, "fcntl(F_SETFD) failed on write-end of self-pipe: %m"); selfpipe_readfd = pipefd[0]; selfpipe_writefd = pipefd[1]; + selfpipe_owner_pid = MyProcPid; #else /* currently, nothing to do here for Windows */ #endif } /* ! * Initialize a process-local latch. */ void InitLatch(volatile Latch *latch) *************** InitLatch(volatile Latch *latch) *** 181,187 **** #ifndef WIN32 /* Assert InitializeLatchSupport has been called in this process */ ! Assert(selfpipe_readfd >= 0); #else latch->event = CreateEvent(NULL, TRUE, FALSE, NULL); if (latch->event == NULL) --- 225,231 ---- #ifndef WIN32 /* Assert InitializeLatchSupport has been called in this process */ ! Assert(selfpipe_readfd >= 0 && selfpipe_owner_pid == MyProcPid); #else latch->event = CreateEvent(NULL, TRUE, FALSE, NULL); if (latch->event == NULL) *************** InitLatch(volatile Latch *latch) *** 199,204 **** --- 243,252 ---- * containing the latch with ShmemInitStruct. (The Unix implementation * doesn't actually require that, but the Windows one does.) Because of * this restriction, we have no concurrency issues to worry about here. + * + * Note that other handles created in this module are never marked as + * inheritable. Thus we do not need to worry about cleaning up child + * process references to postmaster-private latches or WaitEventSets. */ void InitSharedLatch(volatile Latch *latch) *************** OwnLatch(volatile Latch *latch) *** 244,250 **** #ifndef WIN32 /* Assert InitializeLatchSupport has been called in this process */ ! Assert(selfpipe_readfd >= 0); #endif if (latch->owner_pid != 0) --- 292,298 ---- #ifndef WIN32 /* Assert InitializeLatchSupport has been called in this process */ ! Assert(selfpipe_readfd >= 0 && selfpipe_owner_pid == MyProcPid); #endif if (latch->owner_pid != 0) *************** DisownLatch(volatile Latch *latch) *** 277,283 **** * is incurred when WL_TIMEOUT is given, so avoid using a timeout if possible. * * The latch must be owned by the current process, ie. it must be a ! * backend-local latch initialized with InitLatch, or a shared latch * associated with the current process by calling OwnLatch. * * Returns bit mask indicating which condition(s) caused the wake-up. Note --- 325,331 ---- * is incurred when WL_TIMEOUT is given, so avoid using a timeout if possible. * * The latch must be owned by the current process, ie. it must be a ! * process-local latch initialized with InitLatch, or a shared latch * associated with the current process by calling OwnLatch. * * Returns bit mask indicating which condition(s) caused the wake-up. 
Note *************** CreateWaitEventSet(MemoryContext context *** 517,525 **** set->nevents_space = nevents; #if defined(WAIT_USE_EPOLL) ! set->epoll_fd = epoll_create(nevents); if (set->epoll_fd < 0) ! elog(ERROR, "epoll_create failed: %m"); #elif defined(WAIT_USE_WIN32) /* --- 565,573 ---- set->nevents_space = nevents; #if defined(WAIT_USE_EPOLL) ! set->epoll_fd = epoll_create1(EPOLL_CLOEXEC); if (set->epoll_fd < 0) ! elog(ERROR, "epoll_create1 failed: %m"); #elif defined(WAIT_USE_WIN32) /* *************** CreateWaitEventSet(MemoryContext context *** 540,545 **** --- 588,599 ---- /* * Free a previously created WaitEventSet. + * + * Note: preferably, this shouldn't have to free any resources that could be + * inherited across an exec(). If it did, we'd likely leak those resources in + * many scenarios. For the epoll case, we ensure that by setting FD_CLOEXEC + * when the FD is created. For the Windows case, we assume that the handles + * involved are non-inheritable. */ void FreeWaitEventSet(WaitEventSet *set) *************** FreeWaitEventSet(WaitEventSet *set) *** 586,592 **** * used to modify previously added wait events using ModifyWaitEvent(). * * In the WL_LATCH_SET case the latch must be owned by the current process, ! * i.e. it must be a backend-local latch initialized with InitLatch, or a * shared latch associated with the current process by calling OwnLatch. * * In the WL_SOCKET_READABLE/WRITEABLE case, EOF and error conditions are --- 640,646 ---- * used to modify previously added wait events using ModifyWaitEvent(). * * In the WL_LATCH_SET case the latch must be owned by the current process, ! * i.e. it must be a process-local latch initialized with InitLatch, or a * shared latch associated with the current process by calling OwnLatch. * * In the WL_SOCKET_READABLE/WRITEABLE case, EOF and error conditions are -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
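For anyone worried about glibc versions that predate epoll_create1(), the usual arrangement is a guarded fallback to epoll_create() plus fcntl(); in this Linux-only sketch, HAVE_EPOLL_CREATE1 stands in for the hypothetical configure symbol mentioned above:

#include <fcntl.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

int
main(void)
{
    int         epfd;

#ifdef HAVE_EPOLL_CREATE1           /* hypothetical configure symbol */
    /* One call creates the descriptor already marked close-on-exec. */
    epfd = epoll_create1(EPOLL_CLOEXEC);
#else
    /* Pre-2.6.27-kernel/glibc-2.9 fallback: create, then mark close-on-exec. */
    epfd = epoll_create(64);        /* size hint; ignored by modern kernels */
    if (epfd >= 0 && fcntl(epfd, F_SETFD, FD_CLOEXEC) == -1)
    {
        perror("fcntl(F_SETFD)");
        close(epfd);
        epfd = -1;
    }
#endif

    if (epfd < 0)
    {
        perror("epoll_create");
        return 1;
    }
    printf("epoll fd %d is close-on-exec\n", epfd);
    close(epfd);
    return 0;
}

Compiling with -DHAVE_EPOLL_CREATE1 exercises the one-call path; without it, the fallback branch is used.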
On 2017-04-26 11:42:38 -0400, Tom Lane wrote:
> 1. Let HEAD stand as it is. We have a problem with slow response to
> bgworker start requests that arrive while ServerLoop is active, but that's
> a pretty tight window usually (although I believe I've seen it hit at
> least once in testing).
>
> 2. Reinstall the pselect patch, blacklisting NetBSD and HPUX and whatever
> else we find to be flaky. Then only the blacklisted platforms have the
> problem.

That seems unattractive at this point. I'm not looking forward to having to
debug more random platforms that implement this badly in yet another weird
way.

> 3. Go ahead with converting the postmaster to use WaitEventSet, a la
> the draft patch I posted earlier. I'd be happy to do this if we were
> at the start of a devel cycle, but right now seems a bit late --- not
> to mention that we really need to fix 9.6 as well.

Yea, backpatching this to 9.6 seems like a bigger hammer than appropriate.
I'm on the fence WRT master; I think there's an argument to be made that
this is going to become a bigger and bigger problem, and that we'll wish in
a year or two that we had fewer releases with parallelism etc. that don't
use WaitEventSets.

> I'm leaning to doing #1 plus the maybe_start_bgworker change. There's
> certainly room for difference of opinion here, though. Thoughts?

I see you did the bgworker thing - that seems good to me.

Thanks,

Andres
On Wed, Apr 26, 2017 at 8:37 PM, Andres Freund <andres@anarazel.de> wrote:
>> 3. Go ahead with converting the postmaster to use WaitEventSet, a la
>> the draft patch I posted earlier. I'd be happy to do this if we were
>> at the start of a devel cycle, but right now seems a bit late --- not
>> to mention that we really need to fix 9.6 as well.
>
> Yea, backpatching this to 9.6 seems like a bigger hammer than
> appropriate. I'm on the fence WRT master, I think there's an argument
> to be made that this is going to become a bigger and bigger problem, and
> that we'll wish in a year or two that we had fewer releases with
> parallelism etc that don't use WaitEventSets.

I think changing this might be wise. This problem isn't going away for real
until we do this, right?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2017-04-26 17:05:39 -0400, Tom Lane wrote:
> Here's an updated version of that, which makes use of our previous
> conclusion that F_SETFD/FD_CLOEXEC are available everywhere except
> Windows, and fixes some sloppy thinking about the EXEC_BACKEND case.
>
> I went ahead and changed the call to epoll_create into epoll_create1.
> I'm not too concerned about loss of portability there --- it seems
> unlikely that many people are still using ten-year-old glibc, and
> even less likely that any of them would be interested in running
> current Postgres on their stable-unto-death platform. We could add
> a configure test for epoll_create1 if you feel one's needed, but
> I think it'd just be a waste of cycles.

Yea, I think we can live with that. If we find it's a problem, we can add a
configure test later.

> I propose to push this into HEAD and 9.6 too.

Cool. Change looks good to me.

Greetings,

Andres Freund
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Apr 26, 2017 at 8:37 PM, Andres Freund <andres@anarazel.de> wrote:
>>> 3. Go ahead with converting the postmaster to use WaitEventSet, a la
>>> the draft patch I posted earlier. I'd be happy to do this if we were
>>> at the start of a devel cycle, but right now seems a bit late --- not
>>> to mention that we really need to fix 9.6 as well.

>> Yea, backpatching this to 9.6 seems like a bigger hammer than
>> appropriate. I'm on the fence WRT master, I think there's an argument
>> to be made that this is going to become a bigger and bigger problem, and
>> that we'll wish in a year or two that we had fewer releases with
>> parallelism etc that don't use WaitEventSets.

> I think changing this might be wise. This problem isn't going away
> for real until we do this, right?

Sure, but we have a lot of other problems that aren't going away until we
fix them, either. This patch smells like new development to me --- and it's
not even very complete, because really what Andres wants to do (and I
concur) is to get rid of the postmaster's habit of doing interesting things
in signal handlers. I'm definitely not on board with doing that for v10 at
this point. But if we apply this patch, and then do that in v11, then v10
will look like neither earlier nor later branches with respect to the
postmaster's event wait mechanisms. I think that's a recipe for undue
maintenance pain.

I believe that the already-committed patches represent a sufficient-for-now
response to the known performance problems here, and so I'm thinking we
should stop here for v10. I'm okay with pushing the latch.c changes I just
proposed, because those provide a useful safeguard against extension modules
doing something exciting in the postmaster process. But I don't think we
should go much further than that for v10.

			regards, tom lane
Andres Freund <andres@anarazel.de> writes:
> On 2017-04-26 17:05:39 -0400, Tom Lane wrote:
>> I went ahead and changed the call to epoll_create into epoll_create1.
>> I'm not too concerned about loss of portability there --- it seems
>> unlikely that many people are still using ten-year-old glibc, and
>> even less likely that any of them would be interested in running
>> current Postgres on their stable-unto-death platform. We could add
>> a configure test for epoll_create1 if you feel one's needed, but
>> I think it'd just be a waste of cycles.

> Yea, I think we can live with that. If we find it's a problem, we can
> add a configure test later.

Well, according to the buildfarm, "later" is "now" :-(. If RHEL5 is too old
to have epoll_create1, I think your dates for it might be a bit off.

Anyway, I'll go do something about that in a little bit. It looks like it
might be sufficient to do "#ifdef EPOLL_CLOEXEC" in latch.c, rather than
bothering with a full-blown configure check.

			regards, tom lane
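For illustration, here is a sketch of what such an "#ifdef EPOLL_CLOEXEC" fallback could look like. This is an assumption about the general shape of the fix, not the actual latch.c change; create_epoll_fd is an invented name, and the fcntl fallback path is one plausible way to handle headers too old to declare epoll_create1()/EPOLL_CLOEXEC (e.g. glibc < 2.9, as on RHEL5).

#include <fcntl.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Create an epoll instance with close-on-exec set.  Use epoll_create1()
 * where the headers provide EPOLL_CLOEXEC; otherwise fall back to plain
 * epoll_create() plus an explicit fcntl(FD_CLOEXEC).
 */
static int
create_epoll_fd(int size_hint)
{
	int			fd;

#ifdef EPOLL_CLOEXEC
	(void) size_hint;			/* size hint is ignored by epoll_create1() */
	fd = epoll_create1(EPOLL_CLOEXEC);
#else
	fd = epoll_create(size_hint);
	if (fd >= 0 && fcntl(fd, F_SETFD, FD_CLOEXEC) == -1)
	{
		close(fd);
		fd = -1;
	}
#endif
	return fd;
}

int
main(void)
{
	int			fd = create_epoll_fd(16);

	printf("epoll fd: %d\n", fd);
	if (fd >= 0)
		close(fd);
	return 0;
}

The appeal of keying on the EPOLL_CLOEXEC macro rather than a configure probe is that the header either declares both the flag and epoll_create1() or neither, so no extra build-time machinery is needed.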
On 2017-04-27 16:35:29 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2017-04-26 17:05:39 -0400, Tom Lane wrote:
> >> I went ahead and changed the call to epoll_create into epoll_create1.
> >> I'm not too concerned about loss of portability there --- it seems
> >> unlikely that many people are still using ten-year-old glibc, and
> >> even less likely that any of them would be interested in running
> >> current Postgres on their stable-unto-death platform. We could add
> >> a configure test for epoll_create1 if you feel one's needed, but
> >> I think it'd just be a waste of cycles.
>
> > Yea, I think we can live with that. If we find it's a problem, we can
> > add a configure test later.
>
> Well, according to the buildfarm, "later" is "now" :-(.

Too bad.

> If RHEL5 is too old to have epoll_create1, I think your dates for it
> might be a bit off. Anyway, I'll go do something about that in a
> little bit.

2008-11-13 is when glibc 2.9 was released. It appears that RHEL5 still ships
with glibc 2.5, released 2006-09-29. Given that RHEL 5 was originally
released on 2007-03-05, that's not too surprising.

> It looks like it might be sufficient to do "#ifdef EPOLL_CLOEXEC"
> in latch.c, rather than bothering with a full-blown configure check.

Yea, that sounds worth trying. I wonder if we need to care about kernels not
supporting it, but glibc having support? I'd be ok skimping on that for now.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2017-04-27 16:35:29 -0400, Tom Lane wrote:
>> It looks like it might be sufficient to do "#ifdef EPOLL_CLOEXEC"
>> in latch.c, rather than bothering with a full-blown configure check.

> Yea, that sounds worth trying. Wonder if we need to care about kernels
> not supporting it, but glibc having support? I'd be ok skimping on that
> for now.

On my RHEL6 box, <sys/epoll.h> is provided by glibc, not the kernel:

$ rpm -qf /usr/include/sys/epoll.h
glibc-headers-2.12-1.209.el6_9.1.x86_64

So I think it's probably safe to assume that the header is in sync with what
glibc can do. As for a kernel (much) older than glibc, I'd rather expect
glibc to paper over that, though I've not looked at the source code to be
sure.

			regards, tom lane
Hi,

On 2017-04-26 11:42:38 -0400, Tom Lane wrote:
> 3. Go ahead with converting the postmaster to use WaitEventSet, a la
> the draft patch I posted earlier. I'd be happy to do this if we were
> at the start of a devel cycle, but right now seems a bit late --- not
> to mention that we really need to fix 9.6 as well.

Btw, recent-ish versions of
http://man7.org/linux/man-pages/man7/signal-safety.7.html have

    *  POSIX.1-2003 clarified that if an application calls fork(2) from a
       signal handler and any of the fork handlers registered by
       pthread_atfork(3) calls a function that is not async-signal-safe, the
       behavior is undefined. A future revision of the standard is likely to
       remove fork(2) from the list of async-signal-safe functions.

Greetings,

Andres Freund
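To make the hazard concrete, here is a small hypothetical example (not taken from PostgreSQL) of the pattern that clause warns about: a pthread_atfork() handler that calls non-async-signal-safe functions, combined with a signal handler that calls fork(). Per the quoted wording, the behavior is undefined, which is part of why a postmaster that forks workers from inside its signal handlers is on shaky ground.

#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Fork handler: malloc() and printf() are not async-signal-safe. */
static void
atfork_prepare(void)
{
	void	   *p = malloc(64);

	printf("prepare handler ran in pid %d\n", (int) getpid());
	free(p);
}

/* Signal handler that forks, which runs the atfork handlers in signal context. */
static void
sigusr1_handler(int signo)
{
	pid_t		pid;

	(void) signo;
	pid = fork();				/* undefined behavior per the POSIX.1-2003 wording above */
	if (pid == 0)
		_exit(0);				/* child exits immediately */
}

int
main(void)
{
	pthread_atfork(atfork_prepare, NULL, NULL);
	signal(SIGUSR1, sigusr1_handler);
	raise(SIGUSR1);				/* deliver the signal; the handler forks */
	return 0;
}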