Thread: [HACKERS] Unportable implementation of background worker start
I chanced to notice that on gaur/pademelon, the "select_parallel" regression
test sometimes takes a great deal longer than normal, for no obvious reason.
It does eventually terminate, but sometimes only after 10 to 20 minutes rather
than the couple dozen seconds that are typical on that slow machine.

After a fair amount of hair-pulling, I traced the problem to the fact that
maybe_start_bgworker() will only start at most one worker per call; after
that, it sets StartWorkerNeeded = true and returns, opining in a comment that
ServerLoop will make another call shortly.  Unfortunately, that's hogwash.
It happens to work reliably on Linux, because according to signal(7)

    The following interfaces are never restarted after being interrupted by
    a signal handler, regardless of the use of SA_RESTART; they always fail
    with the error EINTR when interrupted by a signal handler:
    ... select(2) ...

However, that's a Linux-ism.  What POSIX 2008 says about it, in the select(2)
reference page, is that

    If SA_RESTART has been set for the interrupting signal, it is
    implementation-defined whether the function restarts or returns
    with [EINTR].

HPUX apparently adopts the "restart" definition, and as we've previously
found out, not only does select(2) not return immediately but it actually
seems to reset the timeout timer to its original value.  (Some googling
suggests that restarting occurs on SVR4-derived kernels but not BSD-derived
ones.)

So what happens, if several RegisterDynamicBackgroundWorker requests arrive
in a single iteration of the postmaster's sigusr1_handler, is that the first
one is serviced thanks to the maybe_start_bgworker call appearing in
sigusr1_handler, and then we return to the select() call and sleep.  The next
start request is serviced only when the typically-60-second select timeout
expires, or more usually when some other interrupt arrives at the postmaster.
In the regression tests, what seems to happen is that we get a
PMSIGNAL_START_AUTOVAC_WORKER from the autovac launcher every thirty seconds,
allowing one more bgworker to get launched each time we go through
sigusr1_handler.
Here's an actual trace of one select_parallel.sql query trying to launch
four parallel workers; I added a bunch of elog(LOG) calls to help diagnose
what was going on:

2017-04-19 17:25:33.448 EDT [8827] LOG: signaling postmaster for startup of slot 1
2017-04-19 17:25:33.448 EDT [8827] STATEMENT: select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0;
2017-04-19 17:25:33.448 EDT [8827] LOG: signaling postmaster for startup of slot 2
2017-04-19 17:25:33.448 EDT [8827] STATEMENT: select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0;
2017-04-19 17:25:33.448 EDT [8827] LOG: signaling postmaster for startup of slot 3
2017-04-19 17:25:33.448 EDT [8827] STATEMENT: select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0;
2017-04-19 17:25:33.448 EDT [8827] LOG: signaling postmaster for startup of slot 4
2017-04-19 17:25:33.448 EDT [8827] STATEMENT: select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0;
2017-04-19 17:25:33.563 EDT [6456] LOG: entering sigusr1_handler
2017-04-19 17:25:33.563 EDT [6456] LOG: registering background worker "parallel worker for PID 8827"
2017-04-19 17:25:33.563 EDT [6456] LOG: registering background worker "parallel worker for PID 8827"
2017-04-19 17:25:33.563 EDT [6456] LOG: registering background worker "parallel worker for PID 8827"
2017-04-19 17:25:33.563 EDT [6456] LOG: registering background worker "parallel worker for PID 8827"
2017-04-19 17:25:33.563 EDT [6456] LOG: entered maybe_start_bgworker, StartWorkerNeeded=1, HaveCrashedWorker=1
2017-04-19 17:25:33.563 EDT [6456] LOG: starting background worker process "parallel worker for PID 8827"
2017-04-19 17:25:33.566 EDT [8871] LOG: starting parallel worker 3
2017-04-19 17:25:33.642 EDT [8871] LOG: exiting parallel worker 3
2017-04-19 17:25:33.642 EDT [8871] STATEMENT: select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0;
2017-04-19 17:25:33.647 EDT [6456] LOG: leaving sigusr1_handler
2017-04-19 17:25:33.647 EDT [6456] LOG: entering reaper
2017-04-19 17:25:33.647 EDT [6456] LOG: unregistering background worker "parallel worker for PID 8827"
2017-04-19 17:25:33.647 EDT [6456] LOG: reaped bgworker pid 8871 status 0
2017-04-19 17:25:33.647 EDT [6456] LOG: leaving reaper
2017-04-19 17:26:03.114 EDT [6456] LOG: entering sigusr1_handler
2017-04-19 17:26:03.115 EDT [6456] LOG: entered maybe_start_bgworker, StartWorkerNeeded=1, HaveCrashedWorker=1
2017-04-19 17:26:03.115 EDT [6456] LOG: starting background worker process "parallel worker for PID 8827"
2017-04-19 17:26:03.118 EDT [8874] LOG: starting parallel worker 2
2017-04-19 17:26:03.164 EDT [8874] LOG: exiting parallel worker 2
2017-04-19 17:26:03.164 EDT [8874] STATEMENT: select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0;
2017-04-19 17:26:03.185 EDT [6456] LOG: leaving sigusr1_handler
2017-04-19 17:26:03.185 EDT [6456] LOG: entering reaper
2017-04-19 17:26:03.186 EDT [6456] LOG: unregistering background worker "parallel worker for PID 8827"
2017-04-19 17:26:03.186 EDT [6456] LOG: reaped bgworker pid 8874 status 0
2017-04-19 17:26:03.186 EDT [6456] LOG: leaving reaper
2017-04-19 17:26:03.284 EDT [6456] LOG: entering reaper
2017-04-19 17:26:03.284 EDT [6456] LOG: leaving reaper
2017-04-19 17:26:33.378 EDT [6456] LOG: entering sigusr1_handler
2017-04-19 17:26:33.378 EDT [6456] LOG: entered maybe_start_bgworker, StartWorkerNeeded=1, HaveCrashedWorker=1
2017-04-19 17:26:33.378 EDT [6456] LOG: starting background worker process "parallel worker for PID 8827"
2017-04-19 17:26:33.382 EDT [8876] LOG: starting parallel worker 1
2017-04-19 17:26:33.428 EDT [8876] LOG: exiting parallel worker 1
2017-04-19 17:26:33.428 EDT [8876] STATEMENT: select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0;
2017-04-19 17:26:33.452 EDT [6456] LOG: leaving sigusr1_handler
2017-04-19 17:26:33.453 EDT [6456] LOG: entering reaper
2017-04-19 17:26:33.453 EDT [6456] LOG: unregistering background worker "parallel worker for PID 8827"
2017-04-19 17:26:33.453 EDT [6456] LOG: reaped bgworker pid 8876 status 0
2017-04-19 17:26:33.453 EDT [6456] LOG: leaving reaper
2017-04-19 17:26:33.560 EDT [6456] LOG: entering reaper
2017-04-19 17:26:33.560 EDT [6456] LOG: leaving reaper
2017-04-19 17:27:03.114 EDT [6456] LOG: entering sigusr1_handler
2017-04-19 17:27:03.115 EDT [6456] LOG: entered maybe_start_bgworker, StartWorkerNeeded=1, HaveCrashedWorker=1
2017-04-19 17:27:03.115 EDT [6456] LOG: starting background worker process "parallel worker for PID 8827"
2017-04-19 17:27:03.118 EDT [8879] LOG: starting parallel worker 0
2017-04-19 17:27:03.167 EDT [8879] LOG: exiting parallel worker 0
2017-04-19 17:27:03.167 EDT [8879] STATEMENT: select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0;
2017-04-19 17:27:03.174 EDT [6456] LOG: leaving sigusr1_handler
2017-04-19 17:27:03.174 EDT [6456] LOG: entering reaper
2017-04-19 17:27:03.175 EDT [6456] LOG: unregistering background worker "parallel worker for PID 8827"
2017-04-19 17:27:03.184 EDT [6456] LOG: reaped bgworker pid 8879 status 0
2017-04-19 17:27:03.184 EDT [6456] LOG: leaving reaper

While I haven't yet tested it, it seems like a fix might be as simple as
deleting these lines in maybe_start_bgworker:

            /*
             * Have ServerLoop call us again.  Note that there might not
             * actually *be* another runnable worker, but we don't care all
             * that much; we will find out the next time we run.
             */
            StartWorkerNeeded = true;
            return;

So I'm wondering what the design rationale was for only starting one
bgworker per invocation.

It also appears to me that do_start_bgworker's treatment of fork failure is
completely brain dead.  Did anyone really think about that case?

            regards, tom lane
Tom Lane wrote:

> While I haven't yet tested it, it seems like a fix might be as simple
> as deleting these lines in maybe_start_bgworker:
>
>             /*
>              * Have ServerLoop call us again.  Note that there might not
>              * actually *be* another runnable worker, but we don't care all
>              * that much; we will find out the next time we run.
>              */
>             StartWorkerNeeded = true;
>             return;
>
> So I'm wondering what the design rationale was for only starting one
> bgworker per invocation.

The rationale was that there may be other tasks waiting for postmaster
attention, and if there are many bgworkers needing to be started, the
other work may be delayed for a long time.  This is not the first time
that this rationale has been challenged, but so far there hasn't been
any good reason to change it.  One option is to just remove it as you
propose, but a different one is to stop using select(2) in ServerLoop,
because those behavior differences seem to make it rather unusable.

> It also appears to me that do_start_bgworker's treatment of fork
> failure is completely brain dead.  Did anyone really think about
> that case?

Hmm, I probably modelled it on autovacuum without giving that case much
additional consideration.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> So I'm wondering what the design rationale was for only starting one
>> bgworker per invocation.

> The rationale was that there may be other tasks waiting for postmaster
> attention, and if there are many bgworkers needing to be started, the
> other work may be delayed for a long time.  This is not the first time
> that this rationale has been challenged, but so far there hasn't been
> any good reason to change it.  One option is to just remove it as you
> propose, but a different one is to stop using select(2) in ServerLoop,
> because those behavior differences seem to make it rather unusable.

Hm.  Do you have a more-portable alternative?

            regards, tom lane
Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > Tom Lane wrote:
> >> So I'm wondering what the design rationale was for only starting one
> >> bgworker per invocation.
>
> > The rationale was that there may be other tasks waiting for postmaster
> > attention, and if there are many bgworkers needing to be started, the
> > other work may be delayed for a long time.  This is not the first time
> > that this rationale has been challenged, but so far there hasn't been
> > any good reason to change it.  One option is to just remove it as you
> > propose, but a different one is to stop using select(2) in ServerLoop,
> > because those behavior differences seem to make it rather unusable.
>
> Hm.  Do you have a more-portable alternative?

I was thinking in a WaitEventSet from latch.c.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> Hm.  Do you have a more-portable alternative?

> I was thinking in a WaitEventSet from latch.c.

Yeah, some googling turns up the suggestion that a self-pipe is a portable
way to get consistent semantics from select(); latch.c has already done
that.  I suppose that using latch.c would be convenient in that we'd have
to write little new code, but it's a tad annoying to add the overhead of a
self-pipe on platforms where we don't need it (which seems to be most).

            regards, tom lane
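[For readers unfamiliar with the trick being referred to: the self-pipe
technique routes signal wakeups through a file descriptor, so any fd-based
wait primitive sees them regardless of EINTR/restart semantics.  A minimal
sketch -- a simplified illustration, not the actual latch.c code -- looks
like this.]

    #include <errno.h>
    #include <fcntl.h>
    #include <signal.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* read and write ends of the self-pipe */
    static int selfpipe_readfd = -1;
    static int selfpipe_writefd = -1;

    /* signal handler: make the pipe readable so any select()/poll() wakes up */
    static void
    wakeup_handler(int signo)
    {
        int     save_errno = errno;
        char    dummy = 0;

        /* if the write fails with EAGAIN, a wakeup byte is already queued */
        (void) write(selfpipe_writefd, &dummy, 1);
        errno = save_errno;
    }

    static void
    init_self_pipe(void)
    {
        int     pipefd[2];

        if (pipe(pipefd) < 0)
            exit(1);
        /* non-blocking on both ends, so neither side can get stuck */
        (void) fcntl(pipefd[0], F_SETFL, O_NONBLOCK);
        (void) fcntl(pipefd[1], F_SETFL, O_NONBLOCK);
        selfpipe_readfd = pipefd[0];
        selfpipe_writefd = pipefd[1];
    }

[The wait loop then includes selfpipe_readfd in its read fd set and drains it
with read() until EAGAIN before re-checking whatever flags the handlers may
have set, so a signal arriving at any point makes the next wait return
promptly.]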
On 2017-04-19 18:56:26 -0400, Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > Tom Lane wrote:
> >> Hm.  Do you have a more-portable alternative?
>
> > I was thinking in a WaitEventSet from latch.c.
>
> Yeah, some googling turns up the suggestion that a self-pipe is a portable
> way to get consistent semantics from select(); latch.c has already done
> that.  I suppose that using latch.c would be convenient in that we'd have
> to write little new code, but it's a tad annoying to add the overhead of a
> self-pipe on platforms where we don't need it (which seems to be most).

FWIW, I'd wished before that we used something a bit more modern than
select() if available...  It's nice to be able to listen to a larger number
of sockets without repeated O(sockets) overhead.

BTW, we IIRC had discussed removing the select() backed latch
implementation in this release.  I'll try to dig up that discussion.

- Andres
Andres Freund <andres@anarazel.de> writes:
> FWIW, I'd wished before that we used something a bit more modern than
> select() if available...  It's nice to be able to listen to a larger
> number of sockets without repeated O(sockets) overhead.

[ raised eyebrow... ]  Is anyone really running postmasters with enough
listen sockets for that to be meaningful?

> BTW, we IIRC had discussed removing the select() backed latch
> implementation in this release.  I'll try to dig up that discussion.

Might be sensible.  Even my pet dinosaurs have poll(2).  We should check
the buildfarm to see if the select() implementation is being tested at all.

            regards, tom lane
I wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> Tom Lane wrote:
>>> Hm.  Do you have a more-portable alternative?

>> I was thinking in a WaitEventSet from latch.c.

My first reaction was that that sounded like a lot more work than removing
two lines from maybe_start_bgworker and adjusting some comments.  But on
closer inspection, the slow-bgworker-start issue isn't the only problem
here.  On a machine that restarts select()'s timeout after an interrupt, as
(at least) HPUX does, the postmaster will actually never iterate
ServerLoop's loop except immediately after receiving a new connection
request.  The stream of interrupts from the autovac launcher is alone
sufficient to prevent the initial 60-second timeout from ever elapsing.  So
if there are no new connection requests for awhile, none of the
housekeeping actions in ServerLoop get done.

Most of those actions are concerned with restarting failed background
tasks, which is something we could get by without --- it's unlikely that
those tasks would fail without causing a database-wide restart, and then
there really isn't going to be much need for them until at least one new
connection request has arrived.  But the last step in that loop is
concerned with touching the sockets and lock files to prevent aggressive
/tmp cleaners from removing them, and that's something that can't be let
slide, or we might as well not have the logic at all.  I've confirmed
experimentally that on my HPUX box, a postmaster not receiving new
connections for an hour or more in fact fails to update the mod times on
the sockets and lock files.  So that problem isn't hypothetical.

So it's looking to me like we do need to do something about this, and
ideally back-patch it all the way.  But WaitEventSet doesn't exist before
9.6.  Do we have the stomach for back-patching that into stable branches?

            regards, tom lane
I wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: >> Tom Lane wrote: >>> Hm. Do you have a more-portable alternative? >> I was thinking in a WaitEventSet from latch.c. > My first reaction was that that sounded like a lot more work than removing > two lines from maybe_start_bgworker and adjusting some comments. But on > closer inspection, the slow-bgworker-start issue isn't the only problem > here. On a machine that restarts select()'s timeout after an interrupt, > as (at least) HPUX does, the postmaster will actually never iterate > ServerLoop's loop except immediately after receiving a new connection > request. I had a go at making the postmaster use a WaitEventSet, and immediately ran into problems: latch.c is completely unprepared to be used anywhere except a postmaster child process. I think we can get around the issue for the self-pipe, as per the attached untested patch. But there remains a problem: we should do a FreeWaitEventSet() after forking a child process to ensure that postmaster children aren't running around with open FDs for the postmaster's stuff. This is no big problem in a regular Unix build; we can give ClosePostmasterPorts() the responsibility. It *is* a problem in EXEC_BACKEND children, which won't have inherited the WaitEventSet data structure. Maybe we could ignore the problem for Unix EXEC_BACKEND builds, since we consider those to be for debug purposes only, not for production. But I don't think we can get away with it for Windows --- or are the HANDLEs in a Windows WaitEventSet not inheritable resources? regards, tom lane diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c index 4798370..d4ac54c 100644 *** a/src/backend/storage/ipc/latch.c --- b/src/backend/storage/ipc/latch.c *************** static volatile sig_atomic_t waiting = f *** 129,134 **** --- 129,136 ---- /* Read and write ends of the self-pipe */ static int selfpipe_readfd = -1; static int selfpipe_writefd = -1; + /* Process owning the self-pipe --- needed for checking purposes */ + static int selfpipe_owner_pid = 0; /* Private function prototypes */ static void sendSelfPipeByte(void); *************** InitializeLatchSupport(void) *** 158,164 **** #ifndef WIN32 int pipefd[2]; ! Assert(selfpipe_readfd == -1); /* * Set up the self-pipe that allows a signal handler to wake up the --- 160,202 ---- #ifndef WIN32 int pipefd[2]; ! if (IsUnderPostmaster) ! { ! /* ! * We might have inherited connections to a self-pipe created by the ! * postmaster. It's critical that child processes create their own ! * self-pipes, of course, and we really want them to close the ! * inherited FDs for safety's sake. ! */ ! if (selfpipe_owner_pid != 0) ! { ! /* Assert we go through here but once in a child process */ ! Assert(selfpipe_owner_pid != MyProcPid); ! /* Release postmaster's pipe */ ! close(selfpipe_readfd); ! close(selfpipe_writefd); ! /* Clean up, just for safety's sake; we'll set these in a bit */ ! selfpipe_readfd = selfpipe_writefd = -1; ! selfpipe_owner_pid = 0; ! } ! else ! { ! /* ! * Postmaster didn't create a self-pipe ... or else we're in an ! * EXEC_BACKEND build, and don't know because we didn't inherit ! * values for the static variables. (In that case we'll fail to ! * close the inherited FDs, but that seems acceptable since ! * EXEC_BACKEND builds aren't meant for production on Unix.) ! */ ! Assert(selfpipe_readfd == -1); ! } ! } ! else ! { ! /* In postmaster or standalone backend, assert we do this but once */ ! Assert(selfpipe_readfd == -1); ! 
Assert(selfpipe_owner_pid == 0); ! } /* * Set up the self-pipe that allows a signal handler to wake up the *************** InitializeLatchSupport(void) *** 176,188 **** selfpipe_readfd = pipefd[0]; selfpipe_writefd = pipefd[1]; #else /* currently, nothing to do here for Windows */ #endif } /* ! * Initialize a backend-local latch. */ void InitLatch(volatile Latch *latch) --- 214,227 ---- selfpipe_readfd = pipefd[0]; selfpipe_writefd = pipefd[1]; + selfpipe_owner_pid = MyProcPid; #else /* currently, nothing to do here for Windows */ #endif } /* ! * Initialize a process-local latch. */ void InitLatch(volatile Latch *latch) *************** InitLatch(volatile Latch *latch) *** 193,199 **** #ifndef WIN32 /* Assert InitializeLatchSupport has been called in this process */ ! Assert(selfpipe_readfd >= 0); #else latch->event = CreateEvent(NULL, TRUE, FALSE, NULL); if (latch->event == NULL) --- 232,238 ---- #ifndef WIN32 /* Assert InitializeLatchSupport has been called in this process */ ! Assert(selfpipe_readfd >= 0 && selfpipe_owner_pid == MyProcPid); #else latch->event = CreateEvent(NULL, TRUE, FALSE, NULL); if (latch->event == NULL) *************** OwnLatch(volatile Latch *latch) *** 256,262 **** #ifndef WIN32 /* Assert InitializeLatchSupport has been called in this process */ ! Assert(selfpipe_readfd >= 0); #endif if (latch->owner_pid != 0) --- 295,301 ---- #ifndef WIN32 /* Assert InitializeLatchSupport has been called in this process */ ! Assert(selfpipe_readfd >= 0 && selfpipe_owner_pid == MyProcPid); #endif if (latch->owner_pid != 0) *************** DisownLatch(volatile Latch *latch) *** 289,295 **** * is incurred when WL_TIMEOUT is given, so avoid using a timeout if possible. * * The latch must be owned by the current process, ie. it must be a ! * backend-local latch initialized with InitLatch, or a shared latch * associated with the current process by calling OwnLatch. * * Returns bit mask indicating which condition(s) caused the wake-up. Note --- 328,334 ---- * is incurred when WL_TIMEOUT is given, so avoid using a timeout if possible. * * The latch must be owned by the current process, ie. it must be a ! * process-local latch initialized with InitLatch, or a shared latch * associated with the current process by calling OwnLatch. * * Returns bit mask indicating which condition(s) caused the wake-up. Note *************** FreeWaitEventSet(WaitEventSet *set) *** 597,603 **** * used to modify previously added wait events using ModifyWaitEvent(). * * In the WL_LATCH_SET case the latch must be owned by the current process, ! * i.e. it must be a backend-local latch initialized with InitLatch, or a * shared latch associated with the current process by calling OwnLatch. * * In the WL_SOCKET_READABLE/WRITEABLE case, EOF and error conditions are --- 636,642 ---- * used to modify previously added wait events using ModifyWaitEvent(). * * In the WL_LATCH_SET case the latch must be owned by the current process, ! * i.e. it must be a process-local latch initialized with InitLatch, or a * shared latch associated with the current process by calling OwnLatch. * * In the WL_SOCKET_READABLE/WRITEABLE case, EOF and error conditions are -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On 2017-04-20 19:23:28 -0400, Tom Lane wrote:
> or are the HANDLEs in a Windows WaitEventSet not inheritable
> resources?

I think we have control over that.  According to
https://msdn.microsoft.com/en-us/library/windows/desktop/ms724466(v=vs.85).aspx
CreateProcess() has to be called with bInheritHandles = true (which we do
for backends), and SECURITY_ATTRIBUTES.bInheritHandle has to be true too.
The latter we already only do for InitSharedLatch(), but not for
InitLatch(), nor for the WSACreateEvent's created for sockets - those
apparently can never be inherited.

So that kind of sounds like it should be doable.

- Andres
Andres Freund <andres@anarazel.de> writes:
> On 2017-04-20 19:23:28 -0400, Tom Lane wrote:
>> or are the HANDLEs in a Windows WaitEventSet not inheritable
>> resources?

> I think we have control over that.  According to
> https://msdn.microsoft.com/en-us/library/windows/desktop/ms724466(v=vs.85).aspx
> CreateProcess() has to be called with bInheritHandles = true (which we
> do for backends), and SECURITY_ATTRIBUTES.bInheritHandle has to be true
> too.  The latter we already only do for InitSharedLatch(), but not for
> InitLatch(), nor for the WSACreateEvent's created for sockets - those
> apparently can never be inherited.

> So that kind of sounds like it should be doable.

Ah, good.  I'll add a comment about that and press on.

            regards, tom lane
On 2017-04-20 00:50:13 -0400, Tom Lane wrote:
> I wrote:
> > Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> >> Tom Lane wrote:
> >>> Hm.  Do you have a more-portable alternative?
>
> >> I was thinking in a WaitEventSet from latch.c.
>
> My first reaction was that that sounded like a lot more work than removing
> two lines from maybe_start_bgworker and adjusting some comments.  But on
> closer inspection, the slow-bgworker-start issue isn't the only problem
> here.  On a machine that restarts select()'s timeout after an interrupt,
> as (at least) HPUX does, the postmaster will actually never iterate
> ServerLoop's loop except immediately after receiving a new connection
> request.  The stream of interrupts from the autovac launcher is alone
> sufficient to prevent the initial 60-second timeout from ever elapsing.
> So if there are no new connection requests for awhile, none of the
> housekeeping actions in ServerLoop get done.

FWIW, I vaguely remember somewhat related issues on x86/linux too.  On busy
machines in autovacuum_freeze_max_age territory (pretty frequent thing
these days), the signalling frequency caused problems in postmaster, but I
unfortunately don't remember the symptoms very well.

> So it's looking to me like we do need to do something about this, and
> ideally back-patch it all the way.  But WaitEventSet doesn't exist
> before 9.6.  Do we have the stomach for back-patching that into
> stable branches?

Hm, that's not exactly no code - on the other hand it's (excepting the
select(2) path) reasonably well exercised.  What we could do is to convert
master, see how the beta process likes it, and then backpatch?  These don't
seem like super pressing issues.

- Andres
I wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On 2017-04-20 19:23:28 -0400, Tom Lane wrote:
>>> or are the HANDLEs in a Windows WaitEventSet not inheritable
>>> resources?

>> So that kind of sounds like it should be doable.

> Ah, good.  I'll add a comment about that and press on.

So ... what would you say to replacing epoll_create() with
epoll_create1(EPOLL_CLOEXEC) ?  Then a WaitEventSet would not represent
inheritable-across-exec resources on any platform, making it a lot easier
to deal with the EXEC_BACKEND case.

AFAIK, both APIs are Linux-only, and epoll_create1() is not much newer than
epoll_create(), so it seems like we'd not be giving up much portability if
we insist on epoll_create1.

            regards, tom lane
On 2017-04-20 19:53:02 -0400, Tom Lane wrote:
> I wrote:
> > Andres Freund <andres@anarazel.de> writes:
> >> On 2017-04-20 19:23:28 -0400, Tom Lane wrote:
> >>> or are the HANDLEs in a Windows WaitEventSet not inheritable
> >>> resources?
>
> >> So that kind of sounds like it should be doable.
>
> > Ah, good.  I'll add a comment about that and press on.
>
> So ... what would you say to replacing epoll_create() with
> epoll_create1(EPOLL_CLOEXEC) ?  Then a WaitEventSet would not
> represent inheritable-across-exec resources on any platform,
> making it a lot easier to deal with the EXEC_BACKEND case.
>
> AFAIK, both APIs are Linux-only, and epoll_create1() is not much
> newer than epoll_create(), so it seems like we'd not be giving up
> much portability if we insist on epoll_create1.

I'm generally quite in favor of using CLOEXEC as much as possible in our
tree.  I'm a bit concerned with epoll_create1's availability tho - the
glibc support for it was introduced in 2.9, whereas epoll_create is in
2.3.2.  On the other hand 2.9 was released 2008-11-13.

If we remain concerned we could just fcntl(fd, F_SETFD, FD_CLOEXEC) instead
- that should only be like three lines more code or such, and should be
available for a lot longer.

- Andres
Andres Freund <andres@anarazel.de> writes:
> On 2017-04-20 19:53:02 -0400, Tom Lane wrote:
>> So ... what would you say to replacing epoll_create() with
>> epoll_create1(EPOLL_CLOEXEC) ?  Then a WaitEventSet would not
>> represent inheritable-across-exec resources on any platform,
>> making it a lot easier to deal with the EXEC_BACKEND case.

> I'm generally quite in favor of using CLOEXEC as much as possible in our
> tree.  I'm a bit concerned with epoll_create1's availability tho - the
> glibc support for it was introduced in 2.9, whereas epoll_create is in
> 2.3.2.  On the other hand 2.9 was released 2008-11-13.

Also, if it's not there we'd fall back to using plain poll(), which is not
so awful that we need to work hard to avoid it.  I'd just as soon keep the
number of combinations down.

            regards, tom lane
On 2017-04-20 20:05:02 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2017-04-20 19:53:02 -0400, Tom Lane wrote:
> >> So ... what would you say to replacing epoll_create() with
> >> epoll_create1(EPOLL_CLOEXEC) ?  Then a WaitEventSet would not
> >> represent inheritable-across-exec resources on any platform,
> >> making it a lot easier to deal with the EXEC_BACKEND case.
>
> > I'm generally quite in favor of using CLOEXEC as much as possible in our
> > tree.  I'm a bit concerned with epoll_create1's availability tho - the
> > glibc support for it was introduced in 2.9, whereas epoll_create is in
> > 2.3.2.  On the other hand 2.9 was released 2008-11-13.
>
> Also, if it's not there we'd fall back to using plain poll(), which is
> not so awful that we need to work hard to avoid it.  I'd just as soon
> keep the number of combinations down.

Just using fcntl(SET, CLOEXEC) wouldn't increase the number of
combinations?

- Andres
Andres Freund <andres@anarazel.de> writes:
> On 2017-04-20 20:05:02 -0400, Tom Lane wrote:
>> Also, if it's not there we'd fall back to using plain poll(), which is
>> not so awful that we need to work hard to avoid it.  I'd just as soon
>> keep the number of combinations down.

> Just using fcntl(SET, CLOEXEC) wouldn't increase the number of
> combinations?

True, if you just did it that way unconditionally.  But doesn't that
require an extra kernel call per CreateWaitEventSet()?

            regards, tom lane
On 2017-04-20 20:10:41 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2017-04-20 20:05:02 -0400, Tom Lane wrote:
> >> Also, if it's not there we'd fall back to using plain poll(), which is
> >> not so awful that we need to work hard to avoid it.  I'd just as soon
> >> keep the number of combinations down.
>
> > Just using fcntl(SET, CLOEXEC) wouldn't increase the number of
> > combinations?
>
> True, if you just did it that way unconditionally.  But doesn't that
> require an extra kernel call per CreateWaitEventSet()?

It does - the question is whether that matters much.  FE/BE uses a
persistent wait set, but unfortunately many of the other latch users don't.
And some of them can be somewhat frequent - so I guess that'd possibly be
measurable.  Ok, so I'm on board with epoll1.

If somebody were to change more frequent latch users to use persistent wait
sets, that'd be good too.

- Andres
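[To make the tradeoff concrete, here is a sketch -- illustrative only, not
the committed code -- of the two ways an epoll descriptor can be made
close-on-exec.  The epoll_create1() form sets the flag atomically at
creation; the fallback spends an extra fcntl() kernel call per
CreateWaitEventSet().]

    #include <fcntl.h>
    #include <sys/epoll.h>

    /* create an epoll fd that child processes won't inherit across exec() */
    static int
    create_epoll_fd_cloexec(int nevents)
    {
        int     fd;

    #if defined(EPOLL_CLOEXEC)
        /* one syscall: flag is set atomically at creation (glibc >= 2.9) */
        fd = epoll_create1(EPOLL_CLOEXEC);
    #else
        /* fallback: plain epoll_create() plus an extra fcntl() call */
        fd = epoll_create(nevents);
        if (fd >= 0)
            (void) fcntl(fd, F_SETFD, FD_CLOEXEC);
    #endif
        return fd;
    }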
Andres Freund <andres@anarazel.de> writes:
> On 2017-04-20 00:50:13 -0400, Tom Lane wrote:
>> My first reaction was that that sounded like a lot more work than removing
>> two lines from maybe_start_bgworker and adjusting some comments.  But on
>> closer inspection, the slow-bgworker-start issue isn't the only problem
>> here.

> FWIW, I vaguely remember somewhat related issues on x86/linux too.

After sleeping and thinking more, I've realized that the
slow-bgworker-start issue actually exists on *every* platform, it's just
harder to hit when select() is interruptible.  But consider the case where
multiple bgworker-start requests arrive while ServerLoop is actively
executing (perhaps because a connection request just came in).  The
postmaster has signals blocked, so nothing happens for the moment.  When we
go around the loop and reach

    PG_SETMASK(&UnBlockSig);

the pending SIGUSR1 is delivered, and sigusr1_handler reads all the
bgworker start requests, and services just one of them.  Then control
returns and proceeds to

    selres = select(nSockets, &rmask, NULL, NULL, &timeout);

But now there's no interrupt pending.  So the remaining start requests do
not get serviced until (a) some other postmaster interrupt arrives, or (b)
the one-minute timeout elapses.  They could be waiting awhile.

Bottom line is that any request for more than one bgworker at a time faces
a non-negligible risk of suffering serious latency.

I'm coming back to the idea that at least in the back branches, the thing
to do is allow maybe_start_bgworker to start multiple workers.

Is there any actual evidence for the claim that that might have bad side
effects?

            regards, tom lane
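[The window described above can be demonstrated with a stand-alone program
-- hypothetical test code, not postmaster.c.  A signal delivered between
unblocking and the subsequent select() is consumed by the handler, so
nothing remains to interrupt the sleep even though the handler left work
pending.]

    #include <signal.h>
    #include <stdio.h>
    #include <sys/select.h>
    #include <unistd.h>

    static volatile sig_atomic_t start_worker_needed = 0;

    /* stands in for sigusr1_handler: note the request, "start" only one worker */
    static void
    usr1_handler(int signo)
    {
        start_worker_needed = 1;
    }

    int
    main(void)
    {
        struct sigaction sa;
        sigset_t    blockmask,
                    unblockmask;
        struct timeval timeout = {10, 0};

        sa.sa_handler = usr1_handler;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = SA_RESTART;
        sigaction(SIGUSR1, &sa, NULL);

        sigemptyset(&blockmask);
        sigaddset(&blockmask, SIGUSR1);
        sigprocmask(SIG_BLOCK, &blockmask, NULL);

        raise(SIGUSR1);         /* "start request" arrives while signals are blocked */

        sigemptyset(&unblockmask);
        sigprocmask(SIG_SETMASK, &unblockmask, NULL);   /* handler runs right here */

        /* start_worker_needed is already set, yet nothing interrupts this wait */
        printf("pending work flag = %d, entering select()...\n",
               (int) start_worker_needed);
        (void) select(0, NULL, NULL, NULL, &timeout);
        printf("select() returned only after the full 10-second timeout\n");

        return 0;
    }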
Tom Lane wrote:

> After sleeping and thinking more, I've realized that the
> slow-bgworker-start issue actually exists on *every* platform, it's just
> harder to hit when select() is interruptible.  But consider the case
> where multiple bgworker-start requests arrive while ServerLoop is
> actively executing (perhaps because a connection request just came in).
> The postmaster has signals blocked, so nothing happens for the moment.
> When we go around the loop and reach
>
>     PG_SETMASK(&UnBlockSig);
>
> the pending SIGUSR1 is delivered, and sigusr1_handler reads all the
> bgworker start requests, and services just one of them.  Then control
> returns and proceeds to
>
>     selres = select(nSockets, &rmask, NULL, NULL, &timeout);
>
> But now there's no interrupt pending.  So the remaining start requests
> do not get serviced until (a) some other postmaster interrupt arrives,
> or (b) the one-minute timeout elapses.  They could be waiting awhile.
>
> Bottom line is that any request for more than one bgworker at a time
> faces a non-negligible risk of suffering serious latency.

Interesting.  It's hard to hit, for sure.

> I'm coming back to the idea that at least in the back branches, the
> thing to do is allow maybe_start_bgworker to start multiple workers.
>
> Is there any actual evidence for the claim that that might have
> bad side effects?

Well, I ran tests with a few dozen thousand sample workers and the neglect
for other things (such as connection requests) was visible, but that's
probably not a scenario many servers run often currently.

I don't strongly object to the idea of removing the "return" in older
branches, since it's evidently a problem.  However, as bgworkers start to
be used more, I think we should definitely have some protection.  In a
system with a large number of workers available for parallel queries, it
seems possible for a high velocity server to get stuck in the loop for some
time.  (I haven't actually verified this, though.  My experiments were with
the early kind, static bgworkers.)

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attached is a lightly-tested draft patch that converts the postmaster to use a WaitEventSet for waiting in ServerLoop. I've got mixed emotions about whether this is the direction to proceed, though. It adds at least a couple of kernel calls per postmaster signal delivery, and probably to every postmaster connection acceptance (ServerLoop iteration), to fix problems that are so corner-casey that we've never even known we had them till now. I'm looking longingly at pselect(2), ppoll(2), and epoll_pwait(2), which would solve the problem without need for a self-pipe. But the latter two are Linux-only, and while pselect does exist in recent POSIX editions, it's nonetheless got portability issues. Googling suggests that on a number of platforms, pselect is non-atomic, ie it's nothing but a wrapper for sigprocmask/select/sigprocmask, which would mean that the race condition I described in my previous message still exists. Despite that, a pretty attractive proposal is to do, essentially, #ifdef HAVE_PSELECT selres = pselect(nSockets, &rmask, NULL, NULL, &timeout, &UnBlockSig); #else PG_SETMASK(&UnBlockSig); selres = select(nSockets, &rmask, NULL, NULL, &timeout); PG_SETMASK(&BlockSig); #endif This fixes the race on platforms where pselect exists and is correctly implemented, and we're no worse off than before where that's not true. The other component of the problem is the possibility that select() will restart if the signal is marked SA_RESTART. (Presumably that would apply to pselect too.) I am thinking that maybe the answer is "if it hurts, don't do it" --- that is, in the postmaster maybe we shouldn't use SA_RESTART, at least not for these signals. A different line of thought is to try to provide a bulletproof solution, but push the implementation problems down into latch.c --- that is, the goal would be to provide a pselect-like variant of WaitEventSetWait that is guaranteed to return if interrupted, as opposed to the current behavior where it's guaranteed not to. But that seems like quite a bit of work. Whether or not we decide to change over the postmaster.c code, I think it'd likely be a good idea to apply most or all of the attached changes in latch.c. Setting CLOEXEC on the relevant FDs is clearly a good thing, and the other changes will provide some safety if some preloaded extension decides to create a latch in the postmaster process. regards, tom lane diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index 6831342..658ba73 100644 *** a/src/backend/postmaster/postmaster.c --- b/src/backend/postmaster/postmaster.c *************** *** 79,88 **** #include <netdb.h> #include <limits.h> - #ifdef HAVE_SYS_SELECT_H - #include <sys/select.h> - #endif - #ifdef USE_BONJOUR #include <dns_sd.h> #endif --- 79,84 ---- *************** static bool LoadedSSL = false; *** 379,384 **** --- 375,386 ---- static DNSServiceRef bonjour_sdref = NULL; #endif + /* WaitEventSet that ServerLoop uses to wait for connections */ + static WaitEventSet *ServerWaitSet = NULL; + + /* Latch used to ensure that signals interrupt the wait in ServerLoop */ + static Latch PostmasterLatch; + /* * postmaster.c - function prototypes */ *************** static int ServerLoop(void); *** 409,415 **** static int BackendStartup(Port *port); static int ProcessStartupPacket(Port *port, bool SSLdone); static void processCancelRequest(Port *port, void *pkt); ! 
static int initMasks(fd_set *rmask); static void report_fork_failure_to_client(Port *port, int errnum); static CAC_state canAcceptConnections(void); static bool RandomCancelKey(int32 *cancel_key); --- 411,417 ---- static int BackendStartup(Port *port); static int ProcessStartupPacket(Port *port, bool SSLdone); static void processCancelRequest(Port *port, void *pkt); ! static void initServerWaitSet(void); static void report_fork_failure_to_client(Port *port, int errnum); static CAC_state canAcceptConnections(void); static bool RandomCancelKey(int32 *cancel_key); *************** checkDataDir(void) *** 1535,1541 **** } /* ! * Determine how long should we let ServerLoop sleep. * * In normal conditions we wait at most one minute, to ensure that the other * background tasks handled by ServerLoop get done even when no requests are --- 1537,1543 ---- } /* ! * Determine how long should we let ServerLoop sleep (in msec). * * In normal conditions we wait at most one minute, to ensure that the other * background tasks handled by ServerLoop get done even when no requests are *************** checkDataDir(void) *** 1543,1551 **** * we don't actually sleep so that they are quickly serviced. Other exception * cases are as shown in the code. */ ! static void ! DetermineSleepTime(struct timeval * timeout) { TimestampTz next_wakeup = 0; /* --- 1545,1554 ---- * we don't actually sleep so that they are quickly serviced. Other exception * cases are as shown in the code. */ ! static long ! DetermineSleepTime(void) { + long timeout; TimestampTz next_wakeup = 0; /* *************** DetermineSleepTime(struct timeval * time *** 1558,1582 **** if (AbortStartTime != 0) { /* time left to abort; clamp to 0 in case it already expired */ ! timeout->tv_sec = SIGKILL_CHILDREN_AFTER_SECS - (time(NULL) - AbortStartTime); ! timeout->tv_sec = Max(timeout->tv_sec, 0); ! timeout->tv_usec = 0; } else ! { ! timeout->tv_sec = 60; ! timeout->tv_usec = 0; ! } ! return; } if (StartWorkerNeeded) ! { ! timeout->tv_sec = 0; ! timeout->tv_usec = 0; ! return; ! } if (HaveCrashedWorker) { --- 1561,1577 ---- if (AbortStartTime != 0) { /* time left to abort; clamp to 0 in case it already expired */ ! timeout = SIGKILL_CHILDREN_AFTER_SECS - (time(NULL) - AbortStartTime); ! timeout = Max(timeout, 0) * 1000; } else ! timeout = 60 * 1000; ! return timeout; } if (StartWorkerNeeded) ! return 0; if (HaveCrashedWorker) { *************** DetermineSleepTime(struct timeval * time *** 1617,1639 **** long secs; int microsecs; TimestampDifference(GetCurrentTimestamp(), next_wakeup, &secs, µsecs); ! timeout->tv_sec = secs; ! timeout->tv_usec = microsecs; /* Ensure we don't exceed one minute */ ! if (timeout->tv_sec > 60) ! { ! timeout->tv_sec = 60; ! timeout->tv_usec = 0; ! } } else ! { ! timeout->tv_sec = 60; ! timeout->tv_usec = 0; ! } } /* --- 1612,1630 ---- long secs; int microsecs; + /* XXX this is stupidly inefficient */ TimestampDifference(GetCurrentTimestamp(), next_wakeup, &secs, µsecs); ! timeout = secs * 1000 + microsecs / 1000; /* Ensure we don't exceed one minute */ ! if (timeout > 60 * 1000) ! timeout = 60 * 1000; } else ! timeout = 60 * 1000; ! ! return timeout; } /* *************** DetermineSleepTime(struct timeval * time *** 1644,1662 **** static int ServerLoop(void) { - fd_set readmask; - int nSockets; time_t last_lockfile_recheck_time, last_touch_time; last_lockfile_recheck_time = last_touch_time = time(NULL); ! nSockets = initMasks(&readmask); for (;;) { ! fd_set rmask; ! 
int selres; time_t now; /* --- 1635,1651 ---- static int ServerLoop(void) { time_t last_lockfile_recheck_time, last_touch_time; last_lockfile_recheck_time = last_touch_time = time(NULL); ! initServerWaitSet(); for (;;) { ! WaitEvent event; ! int nevents; time_t now; /* *************** ServerLoop(void) *** 1667,1742 **** * do nontrivial work. * * If we are in PM_WAIT_DEAD_END state, then we don't want to accept ! * any new connections, so we don't call select(), and just sleep. */ - memcpy((char *) &rmask, (char *) &readmask, sizeof(fd_set)); - if (pmState == PM_WAIT_DEAD_END) { PG_SETMASK(&UnBlockSig); pg_usleep(100000L); /* 100 msec seems reasonable */ ! selres = 0; PG_SETMASK(&BlockSig); } else { ! /* must set timeout each time; some OSes change it! */ ! struct timeval timeout; /* Needs to run with blocked signals! */ ! DetermineSleepTime(&timeout); PG_SETMASK(&UnBlockSig); ! selres = select(nSockets, &rmask, NULL, NULL, &timeout); PG_SETMASK(&BlockSig); } - /* Now check the select() result */ - if (selres < 0) - { - if (errno != EINTR && errno != EWOULDBLOCK) - { - ereport(LOG, - (errcode_for_socket_access(), - errmsg("select() failed in postmaster: %m"))); - return STATUS_ERROR; - } - } - /* ! * New connection pending on any of our sockets? If so, fork a child ! * process to deal with it. */ ! if (selres > 0) { ! int i; ! ! for (i = 0; i < MAXLISTEN; i++) { ! if (ListenSocket[i] == PGINVALID_SOCKET) ! break; ! if (FD_ISSET(ListenSocket[i], &rmask)) ! { ! Port *port; ! port = ConnCreate(ListenSocket[i]); ! if (port) ! { ! BackendStartup(port); ! /* ! * We no longer need the open socket or port structure ! * in this process ! */ ! StreamClose(port->sock); ! ConnFree(port); ! } } } } /* If we have lost the log collector, try to start a new one */ --- 1656,1719 ---- * do nontrivial work. * * If we are in PM_WAIT_DEAD_END state, then we don't want to accept ! * any new connections, so we don't call WaitEventSetWait(), and just ! * sleep. XXX not ideal */ if (pmState == PM_WAIT_DEAD_END) { PG_SETMASK(&UnBlockSig); pg_usleep(100000L); /* 100 msec seems reasonable */ ! nevents = 0; PG_SETMASK(&BlockSig); } else { ! long timeout; /* Needs to run with blocked signals! */ ! timeout = DetermineSleepTime(); PG_SETMASK(&UnBlockSig); ! nevents = WaitEventSetWait(ServerWaitSet, timeout, &event, 1, 0); PG_SETMASK(&BlockSig); } /* ! * Deal with detected event, if any. */ ! if (nevents > 0) { ! /* ! * New connection pending on any of our sockets? If so, fork a ! * child process to deal with it. ! */ ! if (event.events & WL_SOCKET_READABLE) { ! Port *port; ! port = ConnCreate(event.fd); ! if (port) ! { ! BackendStartup(port); ! /* ! * We no longer need the open socket or port structure in ! * this process ! */ ! StreamClose(port->sock); ! ConnFree(port); } } + + /* + * Latch set? If so, clear it. + */ + else if (event.events & WL_LATCH_SET) + ResetLatch(&PostmasterLatch); } /* If we have lost the log collector, try to start a new one */ *************** ServerLoop(void) *** 1874,1903 **** } /* ! * Initialise the masks for select() for the ports we are listening on. ! * Return the number of sockets to listen on. */ ! static int ! initMasks(fd_set *rmask) { - int maxsock = -1; int i; ! FD_ZERO(rmask); for (i = 0; i < MAXLISTEN; i++) { ! int fd = ListenSocket[i]; ! if (fd == PGINVALID_SOCKET) break; ! FD_SET(fd, rmask); ! ! if (fd > maxsock) ! maxsock = fd; } ! return maxsock + 1; } --- 1851,1879 ---- } /* ! * Create a WaitEventSet for ServerLoop() to wait on. This includes the ! 
* sockets we are listening on, plus the PostmasterLatch. */ ! static void ! initServerWaitSet(void) { int i; ! ServerWaitSet = CreateWaitEventSet(PostmasterContext, MAXLISTEN + 1); for (i = 0; i < MAXLISTEN; i++) { ! pgsocket sock = ListenSocket[i]; ! if (sock == PGINVALID_SOCKET) break; ! AddWaitEventToSet(ServerWaitSet, WL_SOCKET_READABLE, sock, NULL, NULL); } ! InitializeLatchSupport(); ! InitLatch(&PostmasterLatch); ! AddWaitEventToSet(ServerWaitSet, WL_LATCH_SET, PGINVALID_SOCKET, ! &PostmasterLatch, NULL); } *************** ClosePostmasterPorts(bool am_syslogger) *** 2436,2441 **** --- 2412,2431 ---- postmaster_alive_fds[POSTMASTER_FD_OWN] = -1; #endif + /* + * Close the postmaster's WaitEventSet, if any, to be sure that child + * processes don't accidentally mess with underlying file descriptors. + * (Note: in an EXEC_BACKEND build, this won't do anything because we + * didn't inherit the static pointer, much less the data structure itself. + * But it's OK because WaitEventSets don't contain any resources that can + * be inherited across exec().) + */ + if (ServerWaitSet) + { + FreeWaitEventSet(ServerWaitSet); + ServerWaitSet = NULL; + } + /* Close the listen sockets */ for (i = 0; i < MAXLISTEN; i++) { *************** SIGHUP_handler(SIGNAL_ARGS) *** 2553,2558 **** --- 2543,2551 ---- #endif } + /* Force ServerLoop to iterate */ + SetLatch(&PostmasterLatch); + PG_SETMASK(&UnBlockSig); errno = save_errno; *************** pmdie(SIGNAL_ARGS) *** 2724,2729 **** --- 2717,2725 ---- break; } + /* Force ServerLoop to iterate */ + SetLatch(&PostmasterLatch); + PG_SETMASK(&UnBlockSig); errno = save_errno; *************** reaper(SIGNAL_ARGS) *** 3037,3042 **** --- 3033,3041 ---- */ PostmasterStateMachine(); + /* Force ServerLoop to iterate */ + SetLatch(&PostmasterLatch); + /* Done with signal handler */ PG_SETMASK(&UnBlockSig); *************** sigusr1_handler(SIGNAL_ARGS) *** 5078,5083 **** --- 5077,5085 ---- signal_child(StartupPID, SIGUSR2); } + /* Force ServerLoop to iterate */ + SetLatch(&PostmasterLatch); + PG_SETMASK(&UnBlockSig); errno = save_errno; diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c index 4798370..92d2ff0 100644 *** a/src/backend/storage/ipc/latch.c --- b/src/backend/storage/ipc/latch.c *************** *** 62,67 **** --- 62,71 ---- #include "storage/pmsignal.h" #include "storage/shmem.h" + #ifndef FD_CLOEXEC + #define FD_CLOEXEC 1 + #endif + /* * Select the fd readiness primitive to use. Normally the "most modern" * primitive supported by the OS will be used, but for testing it can be *************** static volatile sig_atomic_t waiting = f *** 130,135 **** --- 134,142 ---- static int selfpipe_readfd = -1; static int selfpipe_writefd = -1; + /* Process owning the self-pipe --- needed for checking purposes */ + static int selfpipe_owner_pid = 0; + /* Private function prototypes */ static void sendSelfPipeByte(void); static void drainSelfPipe(void); *************** InitializeLatchSupport(void) *** 158,164 **** #ifndef WIN32 int pipefd[2]; ! Assert(selfpipe_readfd == -1); /* * Set up the self-pipe that allows a signal handler to wake up the --- 165,211 ---- #ifndef WIN32 int pipefd[2]; ! if (IsUnderPostmaster) ! { ! /* ! * We might have inherited connections to a self-pipe created by the ! * postmaster. It's critical that child processes create their own ! * self-pipes, of course, and we really want them to close the ! * inherited FDs for safety's sake. On most platforms we can use ! 
* F_SETFD/FD_CLOEXEC to make sure that happens, but if we lack that ! * facility, close the FDs the hard way. ! */ ! if (selfpipe_owner_pid != 0) ! { ! /* Assert we go through here but once in a child process */ ! Assert(selfpipe_owner_pid != MyProcPid); ! #ifndef F_SETFD ! /* Release postmaster's pipe FDs, if we lack FD_CLOEXEC */ ! close(selfpipe_readfd); ! close(selfpipe_writefd); ! #endif ! /* Clean up, just for safety's sake; we'll set these below */ ! selfpipe_readfd = selfpipe_writefd = -1; ! selfpipe_owner_pid = 0; ! } ! else ! { ! /* ! * Postmaster didn't create a self-pipe ... or else we're in an ! * EXEC_BACKEND build, and don't know because we didn't inherit ! * values for the static variables. (If we lack FD_CLOEXEC we'll ! * fail to close the inherited FDs, but that seems acceptable ! * since EXEC_BACKEND builds aren't meant for production on Unix.) ! */ ! Assert(selfpipe_readfd == -1); ! } ! } ! else ! { ! /* In postmaster or standalone backend, assert we do this but once */ ! Assert(selfpipe_readfd == -1); ! Assert(selfpipe_owner_pid == 0); ! } /* * Set up the self-pipe that allows a signal handler to wake up the *************** InitializeLatchSupport(void) *** 166,188 **** * SetLatch won't block if the event has already been set many times * filling the kernel buffer. Make the read-end non-blocking too, so that * we can easily clear the pipe by reading until EAGAIN or EWOULDBLOCK. */ if (pipe(pipefd) < 0) elog(FATAL, "pipe() failed: %m"); if (fcntl(pipefd[0], F_SETFL, O_NONBLOCK) < 0) ! elog(FATAL, "fcntl() failed on read-end of self-pipe: %m"); if (fcntl(pipefd[1], F_SETFL, O_NONBLOCK) < 0) ! elog(FATAL, "fcntl() failed on write-end of self-pipe: %m"); selfpipe_readfd = pipefd[0]; selfpipe_writefd = pipefd[1]; #else /* currently, nothing to do here for Windows */ #endif } /* ! * Initialize a backend-local latch. */ void InitLatch(volatile Latch *latch) --- 213,244 ---- * SetLatch won't block if the event has already been set many times * filling the kernel buffer. Make the read-end non-blocking too, so that * we can easily clear the pipe by reading until EAGAIN or EWOULDBLOCK. + * Also, if possible, make both FDs close-on-exec, since we surely do not + * want any child processes messing with them. */ if (pipe(pipefd) < 0) elog(FATAL, "pipe() failed: %m"); if (fcntl(pipefd[0], F_SETFL, O_NONBLOCK) < 0) ! elog(FATAL, "fcntl(F_SETFL) failed on read-end of self-pipe: %m"); if (fcntl(pipefd[1], F_SETFL, O_NONBLOCK) < 0) ! elog(FATAL, "fcntl(F_SETFL) failed on write-end of self-pipe: %m"); ! #ifdef F_SETFD ! if (fcntl(pipefd[0], F_SETFD, FD_CLOEXEC) < 0) ! elog(FATAL, "fcntl(F_SETFD) failed on read-end of self-pipe: %m"); ! if (fcntl(pipefd[1], F_SETFD, FD_CLOEXEC) < 0) ! elog(FATAL, "fcntl(F_SETFD) failed on write-end of self-pipe: %m"); ! #endif selfpipe_readfd = pipefd[0]; selfpipe_writefd = pipefd[1]; + selfpipe_owner_pid = MyProcPid; #else /* currently, nothing to do here for Windows */ #endif } /* ! * Initialize a process-local latch. */ void InitLatch(volatile Latch *latch) *************** InitLatch(volatile Latch *latch) *** 193,199 **** #ifndef WIN32 /* Assert InitializeLatchSupport has been called in this process */ ! Assert(selfpipe_readfd >= 0); #else latch->event = CreateEvent(NULL, TRUE, FALSE, NULL); if (latch->event == NULL) --- 249,255 ---- #ifndef WIN32 /* Assert InitializeLatchSupport has been called in this process */ ! 
Assert(selfpipe_readfd >= 0 && selfpipe_owner_pid == MyProcPid); #else latch->event = CreateEvent(NULL, TRUE, FALSE, NULL); if (latch->event == NULL) *************** InitLatch(volatile Latch *latch) *** 211,216 **** --- 267,276 ---- * containing the latch with ShmemInitStruct. (The Unix implementation * doesn't actually require that, but the Windows one does.) Because of * this restriction, we have no concurrency issues to worry about here. + * + * Note that other handles created in this module are never marked as + * inheritable. Thus we do not need to worry about cleaning up child + * process references to postmaster-private latches or WaitEventSets. */ void InitSharedLatch(volatile Latch *latch) *************** OwnLatch(volatile Latch *latch) *** 256,262 **** #ifndef WIN32 /* Assert InitializeLatchSupport has been called in this process */ ! Assert(selfpipe_readfd >= 0); #endif if (latch->owner_pid != 0) --- 316,322 ---- #ifndef WIN32 /* Assert InitializeLatchSupport has been called in this process */ ! Assert(selfpipe_readfd >= 0 && selfpipe_owner_pid == MyProcPid); #endif if (latch->owner_pid != 0) *************** DisownLatch(volatile Latch *latch) *** 289,295 **** * is incurred when WL_TIMEOUT is given, so avoid using a timeout if possible. * * The latch must be owned by the current process, ie. it must be a ! * backend-local latch initialized with InitLatch, or a shared latch * associated with the current process by calling OwnLatch. * * Returns bit mask indicating which condition(s) caused the wake-up. Note --- 349,355 ---- * is incurred when WL_TIMEOUT is given, so avoid using a timeout if possible. * * The latch must be owned by the current process, ie. it must be a ! * process-local latch initialized with InitLatch, or a shared latch * associated with the current process by calling OwnLatch. * * Returns bit mask indicating which condition(s) caused the wake-up. Note *************** CreateWaitEventSet(MemoryContext context *** 531,536 **** --- 591,600 ---- set->epoll_fd = epoll_create(nevents); if (set->epoll_fd < 0) elog(ERROR, "epoll_create failed: %m"); + /* Attempt to make the FD non-inheritable; if fail, no big deal */ + #ifdef F_SETFD + (void) fcntl(set->epoll_fd, F_SETFD, FD_CLOEXEC); + #endif #elif defined(WAIT_USE_WIN32) /* *************** CreateWaitEventSet(MemoryContext context *** 551,556 **** --- 615,626 ---- /* * Free a previously created WaitEventSet. + * + * Note: preferably, this shouldn't have to free any resources that could be + * inherited across an exec(). If it did, we'd likely leak those resources in + * many scenarios. For the epoll() case, we attempt to ensure that by setting + * FD_CLOEXEC when the FD is created. For the Windows case, we assume that + * the handles involved are non-inheritable. */ void FreeWaitEventSet(WaitEventSet *set) *************** FreeWaitEventSet(WaitEventSet *set) *** 597,603 **** * used to modify previously added wait events using ModifyWaitEvent(). * * In the WL_LATCH_SET case the latch must be owned by the current process, ! * i.e. it must be a backend-local latch initialized with InitLatch, or a * shared latch associated with the current process by calling OwnLatch. * * In the WL_SOCKET_READABLE/WRITEABLE case, EOF and error conditions are --- 667,673 ---- * used to modify previously added wait events using ModifyWaitEvent(). * * In the WL_LATCH_SET case the latch must be owned by the current process, ! * i.e. 
it must be a process-local latch initialized with InitLatch, or a * shared latch associated with the current process by calling OwnLatch. * * In the WL_SOCKET_READABLE/WRITEABLE case, EOF and error conditions are -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Hi,

On 2017-04-21 12:50:04 -0400, Tom Lane wrote:
> Attached is a lightly-tested draft patch that converts the postmaster to
> use a WaitEventSet for waiting in ServerLoop.  I've got mixed emotions
> about whether this is the direction to proceed, though.  It adds at least
> a couple of kernel calls per postmaster signal delivery, and probably to
> every postmaster connection acceptance (ServerLoop iteration), to fix
> problems that are so corner-casey that we've never even known we had them
> till now.

I'm not concerned much about the signal delivery paths, and I can't quite
imagine that another syscall in the accept path is going to be measurable -
worth ensuring though.  I do agree that it's a bit of a big stick for the
back-branches...

> A different line of thought is to try to provide a bulletproof solution,
> but push the implementation problems down into latch.c --- that is, the
> goal would be to provide a pselect-like variant of WaitEventSetWait that
> is guaranteed to return if interrupted, as opposed to the current behavior
> where it's guaranteed not to.  But that seems like quite a bit of work.

Seems like a sane idea to me.  The use of latches has already grown due to
parallelism and it'll likely grow further - some of them seem likely to
also have to deal with such concerns.  I'd much rather centralize things
down to a common place.

On the other hand most types of our processes do SetLatch() in just nearly
all the relevant signal handlers anyway, so they're pretty close to that
behaviour already.

> Whether or not we decide to change over the postmaster.c code, I think
> it'd likely be a good idea to apply most or all of the attached changes
> in latch.c.  Setting CLOEXEC on the relevant FDs is clearly a good thing,
> and the other changes will provide some safety if some preloaded extension
> decides to create a latch in the postmaster process.

On the principle, I agree.  Reading through the changes now.

> @@ -1667,76 +1656,64 @@ ServerLoop(void)
>  	 * do nontrivial work.
>  	 *
>  	 * If we are in PM_WAIT_DEAD_END state, then we don't want to accept
> -	 * any new connections, so we don't call select(), and just sleep.
> +	 * any new connections, so we don't call WaitEventSetWait(), and just
> +	 * sleep.  XXX not ideal
>  	 */

Couldn't we just deactivate the sockets in the set instead?

>  /*
> - * Initialise the masks for select() for the ports we are listening on.
> - * Return the number of sockets to listen on.
> + * Create a WaitEventSet for ServerLoop() to wait on.  This includes the
> + * sockets we are listening on, plus the PostmasterLatch.
>   */
> -static int
> -initMasks(fd_set *rmask)
> +static void
> +initServerWaitSet(void)
>  {
> -	int			maxsock = -1;
>  	int			i;
>
> -	FD_ZERO(rmask);
> +	ServerWaitSet = CreateWaitEventSet(PostmasterContext, MAXLISTEN + 1);

Why are we using MAXLISTEN, rather than the actual number of things to
listen to?  The only benefit of this seems to be that we could
theoretically allow dynamic reconfiguration of the sockets a bit more
easily in the future, but that could just as well be done by recreating
the set.

Random note: Do we actually have any code that errors out if too many
sockets are being listened to?
> @@ -2553,6 +2543,9 @@ SIGHUP_handler(SIGNAL_ARGS) > #endif > } > > + /* Force ServerLoop to iterate */ > + SetLatch(&PostmasterLatch); > + > PG_SETMASK(&UnBlockSig); > > errno = save_errno; > @@ -2724,6 +2717,9 @@ pmdie(SIGNAL_ARGS) > break; > } > > + /* Force ServerLoop to iterate */ > + SetLatch(&PostmasterLatch); > + > PG_SETMASK(&UnBlockSig); > > errno = save_errno; > @@ -3037,6 +3033,9 @@ reaper(SIGNAL_ARGS) > */ > PostmasterStateMachine(); > > + /* Force ServerLoop to iterate */ > + SetLatch(&PostmasterLatch); > + > /* Done with signal handler */ > PG_SETMASK(&UnBlockSig); > > @@ -5078,6 +5077,9 @@ sigusr1_handler(SIGNAL_ARGS) > signal_child(StartupPID, SIGUSR2); > } > > + /* Force ServerLoop to iterate */ > + SetLatch(&PostmasterLatch); > + > PG_SETMASK(&UnBlockSig); > > errno = save_errno; I kind of would like, in master, take a chance of replace all the work done in signal handlers, by just a SetLatch(), and do it outside of signal handlers instead. Forking from signal handlers is just plain weird. > diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c index 4798370..92d2ff0 100644 > --- a/src/backend/storage/ipc/latch.c > +++ b/src/backend/storage/ipc/latch.c > @@ -62,6 +62,10 @@ > #include "storage/pmsignal.h" > #include "storage/shmem.h" > > +#ifndef FD_CLOEXEC > +#define FD_CLOEXEC 1 > +#endif Hm? Are we sure this is portable? Is there really cases that have F_SETFD, but not CLOEXEC? Greetings, Andres Freund
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > Tom Lane wrote: >> I'm coming back to the idea that at least in the back branches, the >> thing to do is allow maybe_start_bgworker to start multiple workers. >> >> Is there any actual evidence for the claim that that might have >> bad side effects? > Well, I ran tests with a few dozen thousand sample workers and the > neglect for other things (such as connection requests) was visible, but > that's probably not a scenario many servers run often currently. Indeed. I'm pretty skeptical that that's an interesting case, and if it is, the current coding is broken anyway, because with that many workers you are going to start noticing that running maybe_start_bgworker over again for each worker is an O(N^2) proposition. Admittedly, iterating the loop in maybe_start_bgworker is really cheap compared to a fork(), but eventually the big-O problem is going to eat your lunch. > I don't strongly object to the idea of removing the "return" in older > branches, since it's evidently a problem. However, as bgworkers start > to be used more, I think we should definitely have some protection. In > a system with a large number of workers available for parallel queries, > it seems possible for a high velocity server to get stuck in the loop > for some time. (I haven't actually verified this, though. My > experiments were with the early kind, static bgworkers.) It might be sensible to limit the number of workers launched per call, but I think the limit should be quite a bit higher than 1 ... something like 100 or 1000 might be appropriate. regards, tom lane
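For illustration, the kind of per-call cap being discussed might look like the following standalone sketch; the cap value, the toy pending[] array, and all of the names here are hypothetical stand-ins rather than actual postmaster code:

#include <stdbool.h>
#include <stdio.h>

#define MAX_WORKERS_PER_CALL 100    /* hypothetical cap, per the discussion */
#define NUM_SLOTS 1000

static bool pending[NUM_SLOTS];     /* toy stand-in for the registered-worker list */
static bool start_worker_needed;    /* tells the caller to come back soon */

/*
 * Service pending start requests, but launch at most MAX_WORKERS_PER_CALL
 * of them before handing control back to the caller.
 */
static void
maybe_start_workers_sketch(void)
{
    int         launched = 0;
    int         i;

    start_worker_needed = false;
    for (i = 0; i < NUM_SLOTS; i++)
    {
        if (!pending[i])
            continue;
        pending[i] = false;         /* a real version would fork() the worker here */
        if (++launched >= MAX_WORKERS_PER_CALL)
        {
            /* More may be waiting; ask to be called again soon. */
            start_worker_needed = true;
            return;
        }
    }
}

int
main(void)
{
    int         i;

    for (i = 0; i < 250; i++)
        pending[i] = true;

    maybe_start_workers_sketch();
    printf("launch another batch soon? %s\n", start_worker_needed ? "yes" : "no");
    return 0;
}

The point is simply that one pass can absorb a burst of simultaneous requests in a single scan of the list, while still bounding how long the loop monopolizes the caller.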
Andres Freund <andres@anarazel.de> writes: > On 2017-04-21 12:50:04 -0400, Tom Lane wrote: >> Attached is a lightly-tested draft patch that converts the postmaster to >> use a WaitEventSet for waiting in ServerLoop. I've got mixed emotions >> about whether this is the direction to proceed, though. It adds at least >> a couple of kernel calls per postmaster signal delivery, and probably to >> every postmaster connection acceptance (ServerLoop iteration), to fix >> problems that are so corner-casey that we've never even known we had them >> till now. > I'm not concerned much about the signal delivery paths, and I can't > quite imagine that another syscall in the accept path is going to be > measurable - worth ensuring though. > ... > On the other hand most types of our processes do SetLatch() in just > nearly all the relevant signal handlers anyway, so they're pretty close > to behaviour already. True. Maybe I'm being too worried. >> * If we are in PM_WAIT_DEAD_END state, then we don't want to accept >> - * any new connections, so we don't call select(), and just sleep. >> + * any new connections, so we don't call WaitEventSetWait(), and just >> + * sleep. XXX not ideal >> */ > Couldn't we just deactive the sockets in the set instead? Yeah, I think it'd be better to do something like that. The pg_usleep call has the same issue of possibly not responding to interrupts. The risks are a lot less, since it's a much shorter wait, but I would rather eliminate the separate code path in favor of doing it honestly. Didn't seem like something to fuss over in the first draft though. >> + ServerWaitSet = CreateWaitEventSet(PostmasterContext, MAXLISTEN + 1); > Why are we using MAXLISTEN, rather than the actual number of things to > listen to? It'd take more code (ie, an additional scan of the array) to predetermine that. I figured the space-per-item in the WaitEventSet wasn't enough to worry about ... do you think differently? > Random note: Do we actually have any code that errors out if too many > sockets are being listened to? Yes, see StreamServerPort, about line 400. > I kind of would like, in master, take a chance of replace all the work > done in signal handlers, by just a SetLatch(), and do it outside of > signal handlers instead. Forking from signal handlers is just plain > weird. Yeah, maybe it's time. But in v11, and not for back-patch. >> +#ifndef FD_CLOEXEC >> +#define FD_CLOEXEC 1 >> +#endif > Hm? Are we sure this is portable? Is there really cases that have > F_SETFD, but not CLOEXEC? Copied-and-pasted from our only existing use of FD_CLOEXEC, in libpq. Might well be obsolete but I see no particular reason not to do it. regards, tom lane
I wrote: > Andres Freund <andres@anarazel.de> writes: >> On 2017-04-21 12:50:04 -0400, Tom Lane wrote: >>> +#ifndef FD_CLOEXEC >>> +#define FD_CLOEXEC 1 >>> +#endif >> Hm? Are we sure this is portable? Is there really cases that have >> F_SETFD, but not CLOEXEC? > Copied-and-pasted from our only existing use of FD_CLOEXEC, in libpq. Looking closer, that code dates to Author: Tom Lane <tgl@sss.pgh.pa.us> Branch: master Release: REL8_0_BR [7627b91cd] 2004-10-21 20:23:19 +0000 Set the close-on-exec flag for libpq's socket to the backend, to avoid any possible problems from child programs executedby the client app. Per suggestion from Elliot Lee of Red Hat. and while the public discussion about it https://www.postgresql.org/message-id/flat/18172.1098382248%40sss.pgh.pa.us doesn't really say, I suspect the specific coding was Elliot's suggestion as well. It probably had some value at one time ... but I see that SUSv2 mandates that fcntl.h provide both F_SETFD and FD_CLOEXEC, so by our own coding rules it ought to be okay to assume they're there. I'm tempted to rip out the quoted bit, as well as the #ifdef F_SETFD, from libpq and see if anything in the buildfarm complains. regards, tom lane
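For reference, the close-on-exec dance that this boils down to is just a couple of fcntl() calls; here is a self-contained sketch (an ordinary socket stands in for libpq's connection descriptor, and this is not the libpq code itself):

#include <fcntl.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int
main(void)
{
    int         sock = socket(AF_INET, SOCK_STREAM, 0);
    int         flags;

    if (sock == -1)
    {
        perror("socket");
        return 1;
    }

    /* SUSv2/POSIX guarantee that <fcntl.h> provides F_GETFD, F_SETFD, FD_CLOEXEC */
    flags = fcntl(sock, F_GETFD);
    if (flags == -1 ||
        fcntl(sock, F_SETFD, flags | FD_CLOEXEC) == -1)
    {
        perror("fcntl");
        return 1;
    }

    printf("close-on-exec set on fd %d\n", sock);
    close(sock);
    return 0;
}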
On 2017-04-21 14:08:21 -0400, Tom Lane wrote: > but I see that SUSv2 > mandates that fcntl.h provide both F_SETFD and FD_CLOEXEC, so by our own > coding rules it ought to be okay to assume they're there. I'm tempted to > rip out the quoted bit, as well as the #ifdef F_SETFD, from libpq and see > if anything in the buildfarm complains. +1 - Andres
Tom, On 2017-04-21 13:49:27 -0400, Tom Lane wrote: > >> * If we are in PM_WAIT_DEAD_END state, then we don't want to accept > >> - * any new connections, so we don't call select(), and just sleep. > >> + * any new connections, so we don't call WaitEventSetWait(), and just > >> + * sleep. XXX not ideal > >> */ > > > Couldn't we just deactive the sockets in the set instead? > > Yeah, I think it'd be better to do something like that. The pg_usleep > call has the same issue of possibly not responding to interrupts. The > risks are a lot less, since it's a much shorter wait, but I would rather > eliminate the separate code path in favor of doing it honestly. Didn't > seem like something to fuss over in the first draft though. Ok, cool. > >> + ServerWaitSet = CreateWaitEventSet(PostmasterContext, MAXLISTEN + 1); > > > Why are we using MAXLISTEN, rather than the actual number of things to > > listen to? > > It'd take more code (ie, an additional scan of the array) to predetermine > that. I figured the space-per-item in the WaitEventSet wasn't enough to > worry about ... do you think differently? I'm not sure. We do create an epoll handler with enough space, and that has some overhead. Don't know whether that's worthwhile to care about. > > I kind of would like, in master, take a chance of replace all the work > > done in signal handlers, by just a SetLatch(), and do it outside of > > signal handlers instead. Forking from signal handlers is just plain > > weird. > > Yeah, maybe it's time. But in v11, and not for back-patch. Agreed. - Andres
Andres Freund <andres@anarazel.de> writes: > On 2017-04-21 14:08:21 -0400, Tom Lane wrote: >> but I see that SUSv2 >> mandates that fcntl.h provide both F_SETFD and FD_CLOEXEC, so by our own >> coding rules it ought to be okay to assume they're there. I'm tempted to >> rip out the quoted bit, as well as the #ifdef F_SETFD, from libpq and see >> if anything in the buildfarm complains. > +1 Done, we'll soon see what happens. In the same area, I noticed that POSIX does not say that the success result for fcntl(F_SETFD) and related cases is 0. It says that the failure result is -1 and the success result is some other value. We seem to have this right in most places, but e.g. port/noblock.c gets it wrong. The lack of field complaints implies that just about everybody actually does return 0 on success, but I still think it would be a good idea to run around and make all the calls test specifically for -1. regards, tom lane
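Concretely, the distinction is between testing for the documented failure value and assuming a particular success value; a trivial sketch (generic illustration, not a proposed patch):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    int         fds[2];

    if (pipe(fds) == -1)
        return 1;

    /* Correct: POSIX only promises -1 on failure; success may be nonzero. */
    if (fcntl(fds[0], F_SETFL, O_NONBLOCK) == -1)
        perror("fcntl(F_SETFL)");
    else
        printf("set O_NONBLOCK\n");

    /*
     * Dubious: assumes the success result is exactly 0, which POSIX does
     * not guarantee for fcntl().  On most implementations it happens to be
     * 0, which is why this style rarely gets noticed.
     */
    if (fcntl(fds[1], F_SETFL, O_NONBLOCK) != 0)
        printf("this can fire even though the call succeeded\n");

    close(fds[0]);
    close(fds[1]);
    return 0;
}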
Tom Lane <tgl@sss.pgh.pa.us> writes: > Andres Freund <andres@anarazel.de> writes: >> On 2017-04-21 14:08:21 -0400, Tom Lane wrote: >>> but I see that SUSv2 >>> mandates that fcntl.h provide both F_SETFD and FD_CLOEXEC, so by our own >>> coding rules it ought to be okay to assume they're there. I'm tempted to >>> rip out the quoted bit, as well as the #ifdef F_SETFD, from libpq and see >>> if anything in the buildfarm complains. >> +1 > Done, we'll soon see what happens. Should have seen this coming, I guess: some of the Windows critters are falling over, apparently because they lack fcntl() altogether. So the #ifdef F_SETFD was really acting as a proxy for "#ifdef HAVE_FCNTL". There's no HAVE_FCNTL test in configure ATM, and I'm thinking it would be pretty silly to add one, since surely it would succeed on anything Unix-y enough to run the configure script. I'm guessing the best thing to do is put back #ifdef F_SETFD; alternatively we might spell it like "#ifndef WIN32", but I'm unsure if that'd do what we want on Cygwin or MinGW. In non-Windows code paths in latch.c, we probably wouldn't need to bother with #ifdef F_SETFD. Hopefully we can leave in the removal of "#define FD_CLOEXEC". Will wait a bit longer for more results. regards, tom lane
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > Tom Lane wrote: >> It also appears to me that do_start_bgworker's treatment of fork >> failure is completely brain dead. Did anyone really think about >> that case? > Hmm, I probably modelled it on autovacuum without giving that case much > additional consideration. Attached is a proposed patch that should make fork failure behave sanely, ie it works much the same as a worker crash immediately after launch. I also refactored things a bit to make do_start_bgworker fully responsible for updating the RegisteredBgWorker's state, rather than doing just some of it as before. I tested this by hot-wiring the fork_process call to fail some of the time, which showed that the postmaster now seems to recover OK, but parallel.c's logic is completely innocent of the idea that worker-startup failure is possible. The leader backend just freezes, and nothing short of kill -9 on that backend will get you out of it. Fixing that seems like material for a separate patch though. regards, tom lane diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index c382345..8461c24 100644 *** a/src/backend/postmaster/postmaster.c --- b/src/backend/postmaster/postmaster.c *************** static void TerminateChildren(int signal *** 420,425 **** --- 420,426 ---- #define SignalChildren(sig) SignalSomeChildren(sig, BACKEND_TYPE_ALL) static int CountChildren(int target); + static bool assign_backendlist_entry(RegisteredBgWorker *rw); static void maybe_start_bgworker(void); static bool CreateOptsFile(int argc, char *argv[], char *fullprogname); static pid_t StartChildProcess(AuxProcType type); *************** bgworker_forkexec(int shmem_slot) *** 5531,5543 **** * Start a new bgworker. * Starting time conditions must have been checked already. * * This code is heavily based on autovacuum.c, q.v. */ ! static void do_start_bgworker(RegisteredBgWorker *rw) { pid_t worker_pid; ereport(DEBUG1, (errmsg("starting background worker process \"%s\"", rw->rw_worker.bgw_name))); --- 5532,5564 ---- * Start a new bgworker. * Starting time conditions must have been checked already. * + * Returns true on success, false on failure. + * In either case, update the RegisteredBgWorker's state appropriately. + * * This code is heavily based on autovacuum.c, q.v. */ ! static bool do_start_bgworker(RegisteredBgWorker *rw) { pid_t worker_pid; + Assert(rw->rw_pid == 0); + + /* + * Allocate and assign the Backend element. Note we must do this before + * forking, so that we can handle out of memory properly. + * + * Treat failure as though the worker had crashed. That way, the + * postmaster will wait a bit before attempting to start it again; if it + * tried again right away, most likely it'd find itself repeating the + * out-of-memory or fork failure condition. + */ + if (!assign_backendlist_entry(rw)) + { + rw->rw_crashed_at = GetCurrentTimestamp(); + return false; + } + ereport(DEBUG1, (errmsg("starting background worker process \"%s\"", rw->rw_worker.bgw_name))); *************** do_start_bgworker(RegisteredBgWorker *rw *** 5549,5557 **** #endif { case -1: ereport(LOG, (errmsg("could not fork worker process: %m"))); ! return; #ifndef EXEC_BACKEND case 0: --- 5570,5586 ---- #endif { case -1: + /* in postmaster, fork failed ... */ ereport(LOG, (errmsg("could not fork worker process: %m"))); ! /* undo what assign_backendlist_entry did */ ! ReleasePostmasterChildSlot(rw->rw_child_slot); ! rw->rw_child_slot = 0; ! free(rw->rw_backend); ! rw->rw_backend = NULL; ! 
/* mark entry as crashed, so we'll try again later */ ! rw->rw_crashed_at = GetCurrentTimestamp(); ! break; #ifndef EXEC_BACKEND case 0: *************** do_start_bgworker(RegisteredBgWorker *rw *** 5575,5588 **** PostmasterContext = NULL; StartBackgroundWorker(); break; #endif default: rw->rw_pid = worker_pid; rw->rw_backend->pid = rw->rw_pid; ReportBackgroundWorkerPID(rw); ! break; } } /* --- 5604,5627 ---- PostmasterContext = NULL; StartBackgroundWorker(); + + exit(1); /* should not get here */ break; #endif default: + /* in postmaster, fork successful ... */ rw->rw_pid = worker_pid; rw->rw_backend->pid = rw->rw_pid; ReportBackgroundWorkerPID(rw); ! /* add new worker to lists of backends */ ! dlist_push_head(&BackendList, &rw->rw_backend->elem); ! #ifdef EXEC_BACKEND ! ShmemBackendArrayAdd(rw->rw_backend); ! #endif ! return true; } + + return false; } /* *************** bgworker_should_start_now(BgWorkerStartT *** 5629,5634 **** --- 5668,5675 ---- * Allocate the Backend struct for a connected background worker, but don't * add it to the list of backends just yet. * + * On failure, return false without changing any worker state. + * * Some info from the Backend is copied into the passed rw. */ static bool *************** assign_backendlist_entry(RegisteredBgWor *** 5647,5654 **** ereport(LOG, (errcode(ERRCODE_INTERNAL_ERROR), errmsg("could not generate random cancel key"))); - - rw->rw_crashed_at = GetCurrentTimestamp(); return false; } --- 5688,5693 ---- *************** assign_backendlist_entry(RegisteredBgWor *** 5658,5671 **** ereport(LOG, (errcode(ERRCODE_OUT_OF_MEMORY), errmsg("out of memory"))); - - /* - * The worker didn't really crash, but setting this nonzero makes - * postmaster wait a bit before attempting to start it again; if it - * tried again right away, most likely it'd find itself under the same - * memory pressure. - */ - rw->rw_crashed_at = GetCurrentTimestamp(); return false; } --- 5697,5702 ---- *************** assign_backendlist_entry(RegisteredBgWor *** 5687,5692 **** --- 5718,5728 ---- * As a side effect, the bgworker control variables are set or reset whenever * there are more workers to start after this one, and whenever the overall * system state requires it. + * + * The reason we start at most one worker per call is to avoid consuming the + * postmaster's attention for too long when many such requests are pending. + * As long as StartWorkerNeeded is true, ServerLoop will not block and will + * call this function again after dealing with any other issues. */ static void maybe_start_bgworker(void) *************** maybe_start_bgworker(void) *** 5694,5706 **** slist_mutable_iter iter; TimestampTz now = 0; if (FatalError) { StartWorkerNeeded = false; HaveCrashedWorker = false; ! return; /* not yet */ } HaveCrashedWorker = false; slist_foreach_modify(iter, &BackgroundWorkerList) --- 5730,5748 ---- slist_mutable_iter iter; TimestampTz now = 0; + /* + * During crash recovery, we have no need to be called until the state + * transition out of recovery. + */ if (FatalError) { StartWorkerNeeded = false; HaveCrashedWorker = false; ! return; } + /* Don't need to be called again unless we find a reason for it below */ + StartWorkerNeeded = false; HaveCrashedWorker = false; slist_foreach_modify(iter, &BackgroundWorkerList) *************** maybe_start_bgworker(void) *** 5709,5719 **** rw = slist_container(RegisteredBgWorker, rw_lnode, iter.cur); ! /* already running? */ if (rw->rw_pid != 0) continue; ! /* marked for death? 
*/ if (rw->rw_terminate) { ForgetBackgroundWorker(&iter); --- 5751,5761 ---- rw = slist_container(RegisteredBgWorker, rw_lnode, iter.cur); ! /* ignore if already running */ if (rw->rw_pid != 0) continue; ! /* if marked for death, clean up and remove from list */ if (rw->rw_terminate) { ForgetBackgroundWorker(&iter); *************** maybe_start_bgworker(void) *** 5735,5746 **** --- 5777,5790 ---- continue; } + /* read system time only when needed */ if (now == 0) now = GetCurrentTimestamp(); if (!TimestampDifferenceExceeds(rw->rw_crashed_at, now, rw->rw_worker.bgw_restart_time * 1000)) { + /* Set flag to remember that we have workers to start later */ HaveCrashedWorker = true; continue; } *************** maybe_start_bgworker(void) *** 5748,5782 **** if (bgworker_should_start_now(rw->rw_worker.bgw_start_time)) { ! /* reset crash time before calling assign_backendlist_entry */ rw->rw_crashed_at = 0; /* ! * Allocate and assign the Backend element. Note we must do this ! * before forking, so that we can handle out of memory properly. */ ! if (!assign_backendlist_entry(rw)) return; ! ! do_start_bgworker(rw); /* sets rw->rw_pid */ ! ! dlist_push_head(&BackendList, &rw->rw_backend->elem); ! #ifdef EXEC_BACKEND ! ShmemBackendArrayAdd(rw->rw_backend); ! #endif /* ! * Have ServerLoop call us again. Note that there might not ! * actually *be* another runnable worker, but we don't care all ! * that much; we will find out the next time we run. */ StartWorkerNeeded = true; return; } } - - /* no runnable worker found */ - StartWorkerNeeded = false; } /* --- 5792,5826 ---- if (bgworker_should_start_now(rw->rw_worker.bgw_start_time)) { ! /* reset crash time before trying to start worker */ rw->rw_crashed_at = 0; /* ! * Try to start the worker. ! * ! * On failure, give up processing workers for now, but set ! * StartWorkerNeeded so we'll come back here on the next iteration ! * of ServerLoop to try again. (We don't want to wait, because ! * there might be additional ready-to-run workers.) We could set ! * HaveCrashedWorker as well, since this worker is now marked ! * crashed, but there's no need because the next run of this ! * function will do that. */ ! if (!do_start_bgworker(rw)) ! { ! StartWorkerNeeded = true; return; ! } /* ! * Quit, but have ServerLoop call us again to look for additional ! * ready-to-run workers. There might not be any, but we'll find ! * out the next time we run. */ StartWorkerNeeded = true; return; } } } /* -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
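Reduced to its essentials, the behavior the patch aims for on fork failure is: undo whatever was set up before the fork, record a pseudo-crash time, and let the normal restart machinery retry after a delay instead of hammering fork() while the system is out of resources. A generic standalone sketch of that shape (the names and structures here are illustrative, not the actual RegisteredBgWorker code):

#include <stdio.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

typedef struct
{
    pid_t       pid;            /* 0 if not running */
    time_t      crashed_at;     /* 0, or when the last start attempt failed */
} worker_slot;

/*
 * Try to start one worker.  On fork failure, record a "crash" time so the
 * caller waits a bit before retrying this slot.
 */
static int
start_worker(worker_slot *w)
{
    pid_t       pid = fork();

    if (pid == -1)
    {
        perror("fork");
        w->crashed_at = time(NULL); /* retry later, not immediately */
        return 0;
    }
    if (pid == 0)
    {
        /* child: a real worker would run its main loop or exec() here */
        _exit(0);
    }
    w->pid = pid;                   /* parent: remember the child */
    return 1;
}

int
main(void)
{
    worker_slot w = {0, 0};

    if (start_worker(&w))
        printf("started worker pid %ld\n", (long) w.pid);
    else
        printf("start failed; will retry after a delay\n");
    return 0;
}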
I wrote: > Attached is a lightly-tested draft patch that converts the postmaster to > use a WaitEventSet for waiting in ServerLoop. I've got mixed emotions > about whether this is the direction to proceed, though. Attached are a couple of patches that represent a plausible Plan B. The first one changes the postmaster to run its signal handlers without specifying SA_RESTART. I've confirmed that that seems to fix the select_parallel-test-takes-a-long-time problem on gaur/pademelon. The second one uses pselect, if available, to replace the unblock-signals/ select()/block-signals dance in ServerLoop. On platforms where pselect exists and works properly, that should fix the race condition I described previously. On platforms where it doesn't, we're no worse off than before. As mentioned in the comments for the second patch, even if we don't have working pselect(), the only problem is that ServerLoop's response to an interrupt might be delayed by as much as the up-to-1-minute timeout. The only existing case where that's really bad is launching multiple bgworkers. I would therefore advocate also changing maybe_start_bgworker to start up to N bgworkers per call, where N is large enough to pretty much always satisfy simultaneously-arriving requests. I'd pick 100 or so, but am willing to negotiate. I think that these patches represent something we could back-patch without a lot of trepidation, unlike the WaitEventSet-based approach. Therefore, my proposal is to apply and backpatch these changes, and call it good for v10. For v11, we could work on changing the postmaster to not do work in signal handlers, as discussed upthread. That would supersede these two patches completely, though I'd still advocate for keeping the change in maybe_start_bgworker. Note: for testing purposes, these patches are quite independent; just ignore the hunk in the second patch that changes a comment added by the first one. regards, tom lane diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index c382345..52b5e2c 100644 *** a/src/backend/postmaster/postmaster.c --- b/src/backend/postmaster/postmaster.c *************** PostmasterMain(int argc, char *argv[]) *** 610,615 **** --- 610,624 ---- /* * Set up signal handlers for the postmaster process. * + * In the postmaster, we want to install non-ignored handlers *without* + * SA_RESTART. This is because they'll be blocked at all times except + * when ServerLoop is waiting for something to happen, and during that + * window, we want signals to exit the select(2) wait so that ServerLoop + * can respond if anything interesting happened. On some platforms, + * signals marked SA_RESTART would not cause the select() wait to end. + * Child processes will generally want SA_RESTART, but we expect them to + * set up their own handlers before unblocking signals. + * * CAUTION: when changing this list, check for side-effects on the signal * handling setup of child processes. See tcop/postgres.c, * bootstrap/bootstrap.c, postmaster/bgwriter.c, postmaster/walwriter.c, *************** PostmasterMain(int argc, char *argv[]) *** 620,635 **** pqinitmask(); PG_SETMASK(&BlockSig); ! pqsignal(SIGHUP, SIGHUP_handler); /* reread config file and have ! * children do same */ ! pqsignal(SIGINT, pmdie); /* send SIGTERM and shut down */ ! pqsignal(SIGQUIT, pmdie); /* send SIGQUIT and die */ ! pqsignal(SIGTERM, pmdie); /* wait for children and shut down */ pqsignal(SIGALRM, SIG_IGN); /* ignored */ pqsignal(SIGPIPE, SIG_IGN); /* ignored */ ! 
pqsignal(SIGUSR1, sigusr1_handler); /* message from child process */ ! pqsignal(SIGUSR2, dummy_handler); /* unused, reserve for children */ ! pqsignal(SIGCHLD, reaper); /* handle child termination */ pqsignal(SIGTTIN, SIG_IGN); /* ignored */ pqsignal(SIGTTOU, SIG_IGN); /* ignored */ /* ignore SIGXFSZ, so that ulimit violations work like disk full */ --- 629,648 ---- pqinitmask(); PG_SETMASK(&BlockSig); ! pqsignal_no_restart(SIGHUP, SIGHUP_handler); /* reread config file ! * and have children do ! * same */ ! pqsignal_no_restart(SIGINT, pmdie); /* send SIGTERM and shut down */ ! pqsignal_no_restart(SIGQUIT, pmdie); /* send SIGQUIT and die */ ! pqsignal_no_restart(SIGTERM, pmdie); /* wait for children and shut ! * down */ pqsignal(SIGALRM, SIG_IGN); /* ignored */ pqsignal(SIGPIPE, SIG_IGN); /* ignored */ ! pqsignal_no_restart(SIGUSR1, sigusr1_handler); /* message from child ! * process */ ! pqsignal_no_restart(SIGUSR2, dummy_handler); /* unused, reserve for ! * children */ ! pqsignal_no_restart(SIGCHLD, reaper); /* handle child termination */ pqsignal(SIGTTIN, SIG_IGN); /* ignored */ pqsignal(SIGTTOU, SIG_IGN); /* ignored */ /* ignore SIGXFSZ, so that ulimit violations work like disk full */ diff --git a/src/include/port.h b/src/include/port.h index c6937e5..52910ed 100644 *** a/src/include/port.h --- b/src/include/port.h *************** extern int pg_mkdir_p(char *path, int om *** 469,474 **** --- 469,479 ---- /* port/pqsignal.c */ typedef void (*pqsigfunc) (int signo); extern pqsigfunc pqsignal(int signo, pqsigfunc func); + #ifndef WIN32 + extern pqsigfunc pqsignal_no_restart(int signo, pqsigfunc func); + #else + #define pqsignal_no_restart(signo, func) pqsignal(signo, func) + #endif /* port/quotes.c */ extern char *escape_single_quotes_ascii(const char *src); diff --git a/src/port/pqsignal.c b/src/port/pqsignal.c index 5d366da..e7e4451 100644 *** a/src/port/pqsignal.c --- b/src/port/pqsignal.c *************** *** 32,38 **** #if !defined(WIN32) || defined(FRONTEND) /* ! * Set up a signal handler for signal "signo" * * Returns the previous handler. */ --- 32,38 ---- #if !defined(WIN32) || defined(FRONTEND) /* ! * Set up a signal handler, with SA_RESTART, for signal "signo" * * Returns the previous handler. */ *************** pqsignal(int signo, pqsigfunc func) *** 58,61 **** --- 58,90 ---- #endif } + /* + * Set up a signal handler, without SA_RESTART, for signal "signo" + * + * Returns the previous handler. + * + * On Windows, this would be identical to pqsignal(), so don't bother. + */ + #ifndef WIN32 + + pqsigfunc + pqsignal_no_restart(int signo, pqsigfunc func) + { + struct sigaction act, + oact; + + act.sa_handler = func; + sigemptyset(&act.sa_mask); + act.sa_flags = 0; + #ifdef SA_NOCLDSTOP + if (signo == SIGCHLD) + act.sa_flags |= SA_NOCLDSTOP; + #endif + if (sigaction(signo, &act, &oact) < 0) + return SIG_ERR; + return oact.sa_handler; + } + + #endif /* !WIN32 */ + #endif /* !defined(WIN32) || defined(FRONTEND) */ diff --git a/configure b/configure index 99d05bf..42ff097 100755 *** a/configure --- b/configure *************** fi *** 12892,12898 **** LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` ! 
for ac_func in cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat pthread_is_threaded_npreadlink setproctitle setsid shm_open symlink sync_file_range towlower utime utimes wcstombs wcstombs_l do : as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh` ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var" --- 12892,12898 ---- LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` ! for ac_func in cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pselect pstatpthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range towlower utime utimes wcstombswcstombs_l do : as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh` ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var" diff --git a/configure.in b/configure.in index c36c503..871edd8 100644 *** a/configure.in --- b/configure.in *************** PGAC_FUNC_WCSTOMBS_L *** 1429,1435 **** LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` ! AC_CHECK_FUNCS([cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat pthread_is_threaded_npreadlink setproctitle setsid shm_open symlink sync_file_range towlower utime utimes wcstombs wcstombs_l]) AC_REPLACE_FUNCS(fseeko) case $host_os in --- 1429,1435 ---- LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` ! AC_CHECK_FUNCS([cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pselect pstatpthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range towlower utime utimes wcstombswcstombs_l]) AC_REPLACE_FUNCS(fseeko) case $host_os in diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index 52b5e2c..382e6a8 100644 *** a/src/backend/postmaster/postmaster.c --- b/src/backend/postmaster/postmaster.c *************** PostmasterMain(int argc, char *argv[]) *** 613,621 **** * In the postmaster, we want to install non-ignored handlers *without* * SA_RESTART. This is because they'll be blocked at all times except * when ServerLoop is waiting for something to happen, and during that ! * window, we want signals to exit the select(2) wait so that ServerLoop * can respond if anything interesting happened. On some platforms, ! * signals marked SA_RESTART would not cause the select() wait to end. * Child processes will generally want SA_RESTART, but we expect them to * set up their own handlers before unblocking signals. * --- 613,621 ---- * In the postmaster, we want to install non-ignored handlers *without* * SA_RESTART. This is because they'll be blocked at all times except * when ServerLoop is waiting for something to happen, and during that ! * window, we want signals to exit the pselect(2) wait so that ServerLoop * can respond if anything interesting happened. On some platforms, ! * signals marked SA_RESTART would not cause the pselect() wait to end. * Child processes will generally want SA_RESTART, but we expect them to * set up their own handlers before unblocking signals. * *************** ServerLoop(void) *** 1669,1674 **** --- 1669,1676 ---- for (;;) { fd_set rmask; + fd_set *rmask_p; + struct timeval timeout; int selres; time_t now; *************** ServerLoop(void) *** 1678,1714 **** * We block all signals except while sleeping. 
That makes it safe for * signal handlers, which again block all signals while executing, to * do nontrivial work. - * - * If we are in PM_WAIT_DEAD_END state, then we don't want to accept - * any new connections, so we don't call select(), and just sleep. */ - memcpy((char *) &rmask, (char *) &readmask, sizeof(fd_set)); - if (pmState == PM_WAIT_DEAD_END) { ! PG_SETMASK(&UnBlockSig); ! ! pg_usleep(100000L); /* 100 msec seems reasonable */ ! selres = 0; ! ! PG_SETMASK(&BlockSig); } else { ! /* must set timeout each time; some OSes change it! */ ! struct timeval timeout; /* Needs to run with blocked signals! */ DetermineSleepTime(&timeout); PG_SETMASK(&UnBlockSig); ! selres = select(nSockets, &rmask, NULL, NULL, &timeout); PG_SETMASK(&BlockSig); } ! /* Now check the select() result */ if (selres < 0) { if (errno != EINTR && errno != EWOULDBLOCK) --- 1680,1743 ---- * We block all signals except while sleeping. That makes it safe for * signal handlers, which again block all signals while executing, to * do nontrivial work. */ if (pmState == PM_WAIT_DEAD_END) { ! /* ! * If we are in PM_WAIT_DEAD_END state, then we don't want to ! * accept any new connections, so pass a null rmask. ! */ ! rmask_p = NULL; ! timeout.tv_sec = 0; ! timeout.tv_usec = 100000; /* 100 msec seems reasonable */ } else { ! /* Normal case: check sockets, and compute a suitable timeout */ ! memcpy(&rmask, &readmask, sizeof(fd_set)); ! rmask_p = &rmask; /* Needs to run with blocked signals! */ DetermineSleepTime(&timeout); + } + /* + * We prefer to wait with pselect(2) if available, as using that, + * together with *not* using SA_RESTART for signals, guarantees that + * we will get kicked off the wait if a signal occurs. + * + * If we lack pselect(2), fake it with select(2). This has a race + * condition: a signal that was already pending will be delivered + * before we reach the select(), and therefore the select() will wait, + * even though we might wish to do something in response. Therefore, + * beware of putting any time-critical signal response logic into + * ServerLoop rather than into the signal handler itself. It will run + * eventually, but maybe not till after a timeout delay. + * + * It is rumored that some implementations of pselect() are not + * atomic, making the first alternative here functionally equivalent + * to the second. Not much we can do about that though. + */ + { + #ifdef HAVE_PSELECT + /* pselect uses a randomly different timeout API, sigh */ + struct timespec ptimeout; + + ptimeout.tv_sec = timeout.tv_sec; + ptimeout.tv_nsec = timeout.tv_usec * 1000; + + selres = pselect(nSockets, rmask_p, NULL, NULL, + &ptimeout, &UnBlockSig); + #else PG_SETMASK(&UnBlockSig); ! selres = select(nSockets, rmask_p, NULL, NULL, &timeout); PG_SETMASK(&BlockSig); + #endif } ! /* Now check the select()/pselect() result */ if (selres < 0) { if (errno != EINTR && errno != EWOULDBLOCK) diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in index 03e9803..9bbf1b1 100644 *** a/src/include/pg_config.h.in --- b/src/include/pg_config.h.in *************** *** 400,405 **** --- 400,408 ---- /* Define to 1 if the assembler supports PPC's LWARX mutex hint bit. */ #undef HAVE_PPC_LWARX_MUTEX_HINT + /* Define to 1 if you have the `pselect' function. */ + #undef HAVE_PSELECT + /* Define to 1 if you have the `pstat' function. 
*/ #undef HAVE_PSTAT diff --git a/src/include/pg_config.h.win32 b/src/include/pg_config.h.win32 index f9588b0..5d36768 100644 *** a/src/include/pg_config.h.win32 --- b/src/include/pg_config.h.win32 *************** *** 267,272 **** --- 267,275 ---- /* Define to 1 if you have the <poll.h> header file. */ /* #undef HAVE_POLL_H */ + /* Define to 1 if you have the `pselect' function. */ + /* #undef HAVE_PSELECT */ + /* Define to 1 if you have the `pstat' function. */ /* #undef HAVE_PSTAT */ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
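Stripped of the postmaster specifics, the second patch relies on the classic pselect() idiom: keep the signal blocked except while waiting, and let pselect() swap the mask and start waiting in one atomic step. A self-contained sketch of that idiom follows (not the patch itself; SIGUSR1 and the 60-second timeout are arbitrary choices for the example):

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/select.h>

static volatile sig_atomic_t wake_requested = 0;

static void
handler(int signo)
{
    (void) signo;
    wake_requested = 1;
}

int
main(void)
{
    struct sigaction act;
    sigset_t    blockmask,
                origmask;

    /* Install the handler *without* SA_RESTART, so waits get interrupted. */
    act.sa_handler = handler;
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;
    sigaction(SIGUSR1, &act, NULL);

    /* Keep the signal blocked except while actually waiting. */
    sigemptyset(&blockmask);
    sigaddset(&blockmask, SIGUSR1);
    sigprocmask(SIG_BLOCK, &blockmask, &origmask);

    for (;;)
    {
        struct timespec timeout;

        /* The signal is blocked here, so this check cannot race the handler. */
        if (wake_requested)
        {
            wake_requested = 0;
            printf("woken by signal\n");
            break;
        }

        timeout.tv_sec = 60;
        timeout.tv_nsec = 0;

        /*
         * pselect() installs origmask (signal unblocked) and begins waiting
         * as one atomic step, so a signal sent at any point after the check
         * above interrupts the wait.
         */
        if (pselect(0, NULL, NULL, NULL, &timeout, &origmask) == -1 &&
            errno != EINTR)
        {
            perror("pselect");
            return 1;
        }
    }
    return 0;
}

Sending the process a SIGUSR1 from another terminal wakes it immediately. With the plain unblock/select()/block sequence, a signal that lands just before select() is serviced first and the wait then sleeps out its timeout anyway, which is exactly the race described above.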
On 2017-04-21 23:50:41 -0400, Tom Lane wrote: > Attached are a couple of patches that represent a plausible Plan B. > The first one changes the postmaster to run its signal handlers without > specifying SA_RESTART. I've confirmed that that seems to fix the > select_parallel-test-takes-a-long-time problem on gaur/pademelon. > The second one uses pselect, if available, to replace the unblock-signals/ > select()/block-signals dance in ServerLoop. On platforms where pselect > exists and works properly, that should fix the race condition I described > previously. On platforms where it doesn't, we're no worse off than > before. We probably should note somewhere prominently that pselect isn't actually race-free on a number of platforms. > I think that these patches represent something we could back-patch > without a lot of trepidation, unlike the WaitEventSet-based approach. > Therefore, my proposal is to apply and backpatch these changes, and > call it good for v10. For v11, we could work on changing the postmaster > to not do work in signal handlers, as discussed upthread. That would > supersede these two patches completely, though I'd still advocate for > keeping the change in maybe_start_bgworker. Not yet having looked at your patches, that sounds like a reasonable plan. I'd still like to get something like your CLOEXEC patch applied independently however. < patches > Looks reasonable on a quick skim. Greetings, Andres Freund
On 2017-04-21 23:50:41 -0400, Tom Lane wrote: > I wrote: > > Attached is a lightly-tested draft patch that converts the postmaster to > > use a WaitEventSet for waiting in ServerLoop. I've got mixed emotions > > about whether this is the direction to proceed, though. > > Attached are a couple of patches that represent a plausible Plan B. > The first one changes the postmaster to run its signal handlers without > specifying SA_RESTART. I've confirmed that that seems to fix the > select_parallel-test-takes-a-long-time problem on gaur/pademelon. > The second one uses pselect, if available, to replace the unblock-signals/ > select()/block-signals dance in ServerLoop. On platforms where pselect > exists and works properly, that should fix the race condition I described > previously. On platforms where it doesn't, we're no worse off than > before. > > As mentioned in the comments for the second patch, even if we don't > have working pselect(), the only problem is that ServerLoop's response > to an interrupt might be delayed by as much as the up-to-1-minute timeout. > The only existing case where that's really bad is launching multiple > bgworkers. I would therefore advocate also changing maybe_start_bgworker > to start up to N bgworkers per call, where N is large enough to pretty > much always satisfy simultaneously-arriving requests. I'd pick 100 or > so, but am willing to negotiate. > > I think that these patches represent something we could back-patch > without a lot of trepidation, unlike the WaitEventSet-based approach. > Therefore, my proposal is to apply and backpatch these changes, and > call it good for v10. For v11, we could work on changing the postmaster > to not do work in signal handlers, as discussed upthread. That would > supersede these two patches completely, though I'd still advocate for > keeping the change in maybe_start_bgworker. > > Note: for testing purposes, these patches are quite independent; just > ignore the hunk in the second patch that changes a comment added by > the first one. Unclear if related, but https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2017-04-24%2019%3A30%3A42 has a suspicious timing of failing in a weird way. - Andres
On 2017-04-24 13:16:44 -0700, Andres Freund wrote: > Unclear if related, but > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2017-04-24%2019%3A30%3A42 > has a suspicious timing of failing in a weird way. Given that gharial is also failing on 9.6 (same set of commits) and coypu fails (again same set) on 9.6 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=coypu&dt=2017-04-24%2018%3A20%3A33 I'm afraid it's more likely to be related. gharial & coypu owners, any chance you could try starting postgres with log_min_messages=debug5 on one of the affected machines? - Andres
Andres Freund <andres@anarazel.de> writes: > On 2017-04-24 13:16:44 -0700, Andres Freund wrote: >> Unclear if related, but >> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2017-04-24%2019%3A30%3A42 >> has a suspicious timing of failing in a weird way. > Given that gharial is also failing on 9.6 (same set of commits) and > coypu fails (again same set) on 9.6 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=coypu&dt=2017-04-24%2018%3A20%3A33 coypu's problem is unrelated: running bootstrap script ... 2017-04-24 23:04:53.084 CEST [21114] FATAL: could not create semaphores: No space left on device 2017-04-24 23:04:53.084 CEST [21114] DETAIL: Failed system call was semget(1, 17, 03600). but it does seem likely that one of these patches broke gharial. That's pretty annoying :-( regards, tom lane
On 2017-04-24 17:33:39 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2017-04-24 13:16:44 -0700, Andres Freund wrote: > >> Unclear if related, but > >> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2017-04-24%2019%3A30%3A42 > >> has a suspicious timing of failing in a weird way. > > > Given that gharial is also failing on 9.6 (same set of commits) and > > coypu fails (again same set) on 9.6 > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=coypu&dt=2017-04-24%2018%3A20%3A33 > > coypu's problem is unrelated: > > running bootstrap script ... 2017-04-24 23:04:53.084 CEST [21114] FATAL: could not create semaphores: No space left on device > 2017-04-24 23:04:53.084 CEST [21114] DETAIL: Failed system call was semget(1, 17, 03600). Note I was linking the 9.6 report from coypu, not HEAD. Afaics the 9.6 failure is the same as gharial's mode of failure. - Andres
Andres Freund <andres@anarazel.de> writes: > On 2017-04-24 17:33:39 -0400, Tom Lane wrote: >> coypu's problem is unrelated: > Note I was linking the 9.6 report form coypu, not HEAD. Afaics the 9.6 > failure is the same as gharial's mode of failure. [ looks closer... ] Oh: the 9.6 run occurred first, and the failures on HEAD and 9.5 are presumably follow-on damage because the stuck postmaster hasn't released semaphores. A bit of googling establishes that NetBSD 5.1 has a broken pselect implementation: http://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=43625 That says they fixed it in later versions but not 5.1 :-( I can't find any similar smoking gun on the web for HPUX, but I'd fully expect their bug database to be behind a paywall. What I'm inclined to do is to revert the pselect change but not the other, to see if that fixes these two animals. If it does, we could look into blacklisting these particular platforms when choosing pselect. regards, tom lane
On 2017-04-24 18:14:41 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2017-04-24 17:33:39 -0400, Tom Lane wrote: > >> coypu's problem is unrelated: > > > Note I was linking the 9.6 report form coypu, not HEAD. Afaics the 9.6 > > failure is the same as gharial's mode of failure. > > [ looks closer... ] Oh: the 9.6 run occurred first, and the failures on > HEAD and 9.5 are presumably follow-on damage because the stuck postmaster > hasn't released semaphores. > > A bit of googling establishes that NetBSD 5.1 has a broken pselect > implementation: > > http://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=43625 Yikes. Do I understand correctly that they effectively just mapped pselect to select? > What I'm inclined to do is to revert the pselect change but not the other, > to see if that fixes these two animals. If it does, we could look into > blacklisting these particular platforms when choosing pselect. Seems sensible. - Andres
Andres Freund <andres@anarazel.de> writes: > On 2017-04-24 18:14:41 -0400, Tom Lane wrote: >> A bit of googling establishes that NetBSD 5.1 has a broken pselect >> implementation: >> >> http://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=43625 > Yikes. Do I understand correctly that they effectively just mapped > pselect to select? That's what it sounds like. Probably not intentionally, but in effect. >> What I'm inclined to do is to revert the pselect change but not the other, >> to see if that fixes these two animals. If it does, we could look into >> blacklisting these particular platforms when choosing pselect. > Seems sensible. Will do. regards, tom lane
I wrote: > What I'm inclined to do is to revert the pselect change but not the other, > to see if that fixes these two animals. If it does, we could look into > blacklisting these particular platforms when choosing pselect. It looks like coypu is going to need manual intervention (ie, kill -9 on the leftover postmaster) to get unwedged :-(. That's particularly disturbing because it implies that ServerLoop isn't iterating at all; otherwise, it'd have noticed by now that the buildfarm script deleted its data directory out from under it. Even if NetBSD's pselect had forgotten to unblock signals, you'd figure it'd time out after a minute ... so it's even more broken than that. regards, tom lane
> On 25 Apr 2017, at 01:47, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I wrote: >> What I'm inclined to do is to revert the pselect change but not the other, >> to see if that fixes these two animals. If it does, we could look into >> blacklisting these particular platforms when choosing pselect. > > It looks like coypu is going to need manual intervention (ie, kill -9 > on the leftover postmaster) to get unwedged :-(. That's particularly > disturbing because it implies that ServerLoop isn't iterating at all; > otherwise, it'd have noticed by now that the buildfarm script deleted > its data directory out from under it. Even if NetBSD's pselect had > forgotten to unblock signals, you'd figure it'd time out after a > minute ... so it's even more broken than that. > Hi, coypu was not stuck (no buildfarm related process running), but failed to clean-up shared memory and semaphores. I've done the clean-up. Regards, Rémi
Rémi Zara <remi_zara@mac.com> writes: >> Le 25 avr. 2017 à 01:47, Tom Lane <tgl@sss.pgh.pa.us> a écrit : >> It looks like coypu is going to need manual intervention (ie, kill -9 >> on the leftover postmaster) to get unwedged :-(. That's particularly >> disturbing because it implies that ServerLoop isn't iterating at all; >> otherwise, it'd have noticed by now that the buildfarm script deleted >> its data directory out from under it. > coypu was not stuck (no buildfarm related process running), but failed to clean-up shared memory and semaphores. > I’ve done the clean-up. Huh, that's even more interesting. Looking at the code, what ServerLoop actually does when it notices that the postmaster.pid file has been removed is kill(MyProcPid, SIGQUIT); So if our hypothesis is that pselect() failed to unblock signals, then failure to quit is easily explained: the postmaster never received/acted on its own signal. But that should have left you with a running postmaster holding the shared memory and semaphores. Seems like if it is gone but it failed to remove those, somebody must've kill -9'd it ... but who? I see nothing in the buildfarm script that would. regards, tom lane
I wrote: > Rémi Zara <remi_zara@mac.com> writes: >> coypu was not stuck (no buildfarm related process running), but failed to clean-up shared memory and semaphores. >> I’ve done the clean-up. > Huh, that's even more interesting. I installed NetBSD 5.1.5 on an old Mac G4; I believe this is a reasonable approximation to coypu's environment. With the pselect patch installed, I can replicate the behavior we saw in the buildfarm of connections immediately failing with "the database system is starting up". Investigation shows that pselect reports ready sockets correctly (which is what allows connections to get in at all), and it does stop waiting either for a signal or for a timeout. What it forgets to do is to actually service the signal. The observed behavior is caused by the fact that reaper() is never called so the postmaster never realizes that the startup process has finished. I experimented with putting PG_SETMASK(&UnBlockSig); PG_SETMASK(&BlockSig); immediately after the pselect() call, and found that indeed that lets signals get serviced, and things work pretty much normally. However, closer inspection finds that pselect only stops waiting when a signal arrives *while it's waiting*, not if there was a signal already pending. So this is actually even more broken than the so called "non atomic" behavior we had expected to see --- at least with that, the pending signal would have gotten serviced promptly, even if ServerLoop itself didn't iterate. This is all giving me less than warm fuzzy feelings about the state of pselect support out in the real world. So at this point we seem to have three plausible alternatives: 1. Let HEAD stand as it is. We have a problem with slow response to bgworker start requests that arrive while ServerLoop is active, but that's a pretty tight window usually (although I believe I've seen it hit at least once in testing). 2. Reinstall the pselect patch, blacklisting NetBSD and HPUX and whatever else we find to be flaky. Then only the blacklisted platforms have the problem. 3. Go ahead with converting the postmaster to use WaitEventSet, a la the draft patch I posted earlier. I'd be happy to do this if we were at the start of a devel cycle, but right now seems a bit late --- not to mention that we really need to fix 9.6 as well. We could substantially ameliorate the slow-response problem by allowing maybe_start_bgworker to launch multiple workers per call, which is something I think we should do regardless. (I have a patch written to allow it to launch up to N workers per call, but have held off committing that till after the dust settles in ServerLoop.) I'm leaning to doing #1 plus the maybe_start_bgworker change. There's certainly room for difference of opinion here, though. Thoughts? regards, tom lane
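The pending-signal behavior described above is easy to probe with a few lines: block the signal, make it pending with kill(), then call pselect() with a mask that unblocks it. A conforming implementation runs the handler and returns immediately with EINTR; the broken ones sleep out the timeout. A quick test sketch (not proposed for the tree):

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/select.h>
#include <unistd.h>

static volatile sig_atomic_t handler_ran = 0;

static void
handler(int signo)
{
    (void) signo;
    handler_ran = 1;
}

int
main(void)
{
    struct sigaction act;
    sigset_t    blockmask,
                waitmask;
    struct timespec timeout;
    int         rc;

    act.sa_handler = handler;
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;
    sigaction(SIGUSR1, &act, NULL);

    /* Block SIGUSR1, then make it pending while it is still blocked. */
    sigemptyset(&blockmask);
    sigaddset(&blockmask, SIGUSR1);
    sigprocmask(SIG_BLOCK, &blockmask, NULL);
    kill(getpid(), SIGUSR1);

    /* Wait with a mask that unblocks everything, including SIGUSR1. */
    sigemptyset(&waitmask);
    timeout.tv_sec = 10;
    timeout.tv_nsec = 0;
    rc = pselect(0, NULL, NULL, NULL, &timeout, &waitmask);

    /* Conforming: rc == -1, errno == EINTR, handler_ran == 1, no 10s delay. */
    printf("pselect rc=%d errno=%s handler_ran=%d\n",
           rc, (rc == -1) ? strerror(errno) : "none", (int) handler_ran);
    return 0;
}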
Andres Freund <andres@anarazel.de> writes: > I'd still like to get something like your CLOEXEC patch applied > independently however. Here's an updated version of that, which makes use of our previous conclusion that F_SETFD/FD_CLOEXEC are available everywhere except Windows, and fixes some sloppy thinking about the EXEC_BACKEND case. I went ahead and changed the call to epoll_create into epoll_create1. I'm not too concerned about loss of portability there --- it seems unlikely that many people are still using ten-year-old glibc, and even less likely that any of them would be interested in running current Postgres on their stable-unto-death platform. We could add a configure test for epoll_create1 if you feel one's needed, but I think it'd just be a waste of cycles. I propose to push this into HEAD and 9.6 too. regards, tom lane diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c index 0d0701a..946fbff 100644 *** a/src/backend/storage/ipc/latch.c --- b/src/backend/storage/ipc/latch.c *************** static volatile sig_atomic_t waiting = f *** 118,123 **** --- 118,126 ---- static int selfpipe_readfd = -1; static int selfpipe_writefd = -1; + /* Process owning the self-pipe --- needed for checking purposes */ + static int selfpipe_owner_pid = 0; + /* Private function prototypes */ static void sendSelfPipeByte(void); static void drainSelfPipe(void); *************** InitializeLatchSupport(void) *** 146,152 **** #ifndef WIN32 int pipefd[2]; ! Assert(selfpipe_readfd == -1); /* * Set up the self-pipe that allows a signal handler to wake up the --- 149,189 ---- #ifndef WIN32 int pipefd[2]; ! if (IsUnderPostmaster) ! { ! /* ! * We might have inherited connections to a self-pipe created by the ! * postmaster. It's critical that child processes create their own ! * self-pipes, of course, and we really want them to close the ! * inherited FDs for safety's sake. ! */ ! if (selfpipe_owner_pid != 0) ! { ! /* Assert we go through here but once in a child process */ ! Assert(selfpipe_owner_pid != MyProcPid); ! /* Release postmaster's pipe FDs; ignore any error */ ! (void) close(selfpipe_readfd); ! (void) close(selfpipe_writefd); ! /* Clean up, just for safety's sake; we'll set these below */ ! selfpipe_readfd = selfpipe_writefd = -1; ! selfpipe_owner_pid = 0; ! } ! else ! { ! /* ! * Postmaster didn't create a self-pipe ... or else we're in an ! * EXEC_BACKEND build, in which case it doesn't matter since the ! * postmaster's pipe FDs were closed by the action of FD_CLOEXEC. ! */ ! Assert(selfpipe_readfd == -1); ! } ! } ! else ! { ! /* In postmaster or standalone backend, assert we do this but once */ ! Assert(selfpipe_readfd == -1); ! Assert(selfpipe_owner_pid == 0); ! } /* * Set up the self-pipe that allows a signal handler to wake up the *************** InitializeLatchSupport(void) *** 154,176 **** * that SetLatch won't block if the event has already been set many times * filling the kernel buffer. Make the read-end non-blocking too, so that * we can easily clear the pipe by reading until EAGAIN or EWOULDBLOCK. */ if (pipe(pipefd) < 0) elog(FATAL, "pipe() failed: %m"); if (fcntl(pipefd[0], F_SETFL, O_NONBLOCK) == -1) ! elog(FATAL, "fcntl() failed on read-end of self-pipe: %m"); if (fcntl(pipefd[1], F_SETFL, O_NONBLOCK) == -1) ! elog(FATAL, "fcntl() failed on write-end of self-pipe: %m"); selfpipe_readfd = pipefd[0]; selfpipe_writefd = pipefd[1]; #else /* currently, nothing to do here for Windows */ #endif } /* ! * Initialize a backend-local latch. 
*/ void InitLatch(volatile Latch *latch) --- 191,220 ---- * that SetLatch won't block if the event has already been set many times * filling the kernel buffer. Make the read-end non-blocking too, so that * we can easily clear the pipe by reading until EAGAIN or EWOULDBLOCK. + * Also, make both FDs close-on-exec, since we surely do not want any + * child processes messing with them. */ if (pipe(pipefd) < 0) elog(FATAL, "pipe() failed: %m"); if (fcntl(pipefd[0], F_SETFL, O_NONBLOCK) == -1) ! elog(FATAL, "fcntl(F_SETFL) failed on read-end of self-pipe: %m"); if (fcntl(pipefd[1], F_SETFL, O_NONBLOCK) == -1) ! elog(FATAL, "fcntl(F_SETFL) failed on write-end of self-pipe: %m"); ! if (fcntl(pipefd[0], F_SETFD, FD_CLOEXEC) == -1) ! elog(FATAL, "fcntl(F_SETFD) failed on read-end of self-pipe: %m"); ! if (fcntl(pipefd[1], F_SETFD, FD_CLOEXEC) == -1) ! elog(FATAL, "fcntl(F_SETFD) failed on write-end of self-pipe: %m"); selfpipe_readfd = pipefd[0]; selfpipe_writefd = pipefd[1]; + selfpipe_owner_pid = MyProcPid; #else /* currently, nothing to do here for Windows */ #endif } /* ! * Initialize a process-local latch. */ void InitLatch(volatile Latch *latch) *************** InitLatch(volatile Latch *latch) *** 181,187 **** #ifndef WIN32 /* Assert InitializeLatchSupport has been called in this process */ ! Assert(selfpipe_readfd >= 0); #else latch->event = CreateEvent(NULL, TRUE, FALSE, NULL); if (latch->event == NULL) --- 225,231 ---- #ifndef WIN32 /* Assert InitializeLatchSupport has been called in this process */ ! Assert(selfpipe_readfd >= 0 && selfpipe_owner_pid == MyProcPid); #else latch->event = CreateEvent(NULL, TRUE, FALSE, NULL); if (latch->event == NULL) *************** InitLatch(volatile Latch *latch) *** 199,204 **** --- 243,252 ---- * containing the latch with ShmemInitStruct. (The Unix implementation * doesn't actually require that, but the Windows one does.) Because of * this restriction, we have no concurrency issues to worry about here. + * + * Note that other handles created in this module are never marked as + * inheritable. Thus we do not need to worry about cleaning up child + * process references to postmaster-private latches or WaitEventSets. */ void InitSharedLatch(volatile Latch *latch) *************** OwnLatch(volatile Latch *latch) *** 244,250 **** #ifndef WIN32 /* Assert InitializeLatchSupport has been called in this process */ ! Assert(selfpipe_readfd >= 0); #endif if (latch->owner_pid != 0) --- 292,298 ---- #ifndef WIN32 /* Assert InitializeLatchSupport has been called in this process */ ! Assert(selfpipe_readfd >= 0 && selfpipe_owner_pid == MyProcPid); #endif if (latch->owner_pid != 0) *************** DisownLatch(volatile Latch *latch) *** 277,283 **** * is incurred when WL_TIMEOUT is given, so avoid using a timeout if possible. * * The latch must be owned by the current process, ie. it must be a ! * backend-local latch initialized with InitLatch, or a shared latch * associated with the current process by calling OwnLatch. * * Returns bit mask indicating which condition(s) caused the wake-up. Note --- 325,331 ---- * is incurred when WL_TIMEOUT is given, so avoid using a timeout if possible. * * The latch must be owned by the current process, ie. it must be a ! * process-local latch initialized with InitLatch, or a shared latch * associated with the current process by calling OwnLatch. * * Returns bit mask indicating which condition(s) caused the wake-up. 
Note *************** CreateWaitEventSet(MemoryContext context *** 517,525 **** set->nevents_space = nevents; #if defined(WAIT_USE_EPOLL) ! set->epoll_fd = epoll_create(nevents); if (set->epoll_fd < 0) ! elog(ERROR, "epoll_create failed: %m"); #elif defined(WAIT_USE_WIN32) /* --- 565,573 ---- set->nevents_space = nevents; #if defined(WAIT_USE_EPOLL) ! set->epoll_fd = epoll_create1(EPOLL_CLOEXEC); if (set->epoll_fd < 0) ! elog(ERROR, "epoll_create1 failed: %m"); #elif defined(WAIT_USE_WIN32) /* *************** CreateWaitEventSet(MemoryContext context *** 540,545 **** --- 588,599 ---- /* * Free a previously created WaitEventSet. + * + * Note: preferably, this shouldn't have to free any resources that could be + * inherited across an exec(). If it did, we'd likely leak those resources in + * many scenarios. For the epoll case, we ensure that by setting FD_CLOEXEC + * when the FD is created. For the Windows case, we assume that the handles + * involved are non-inheritable. */ void FreeWaitEventSet(WaitEventSet *set) *************** FreeWaitEventSet(WaitEventSet *set) *** 586,592 **** * used to modify previously added wait events using ModifyWaitEvent(). * * In the WL_LATCH_SET case the latch must be owned by the current process, ! * i.e. it must be a backend-local latch initialized with InitLatch, or a * shared latch associated with the current process by calling OwnLatch. * * In the WL_SOCKET_READABLE/WRITEABLE case, EOF and error conditions are --- 640,646 ---- * used to modify previously added wait events using ModifyWaitEvent(). * * In the WL_LATCH_SET case the latch must be owned by the current process, ! * i.e. it must be a process-local latch initialized with InitLatch, or a * shared latch associated with the current process by calling OwnLatch. * * In the WL_SOCKET_READABLE/WRITEABLE case, EOF and error conditions are -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
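For anyone worried about glibc versions that predate epoll_create1(), the usual arrangement is a guarded fallback to epoll_create() plus fcntl(); in this Linux-only sketch, HAVE_EPOLL_CREATE1 stands in for the hypothetical configure symbol mentioned above:

#include <fcntl.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

int
main(void)
{
    int         epfd;

#ifdef HAVE_EPOLL_CREATE1           /* hypothetical configure symbol */
    /* One call creates the descriptor already marked close-on-exec. */
    epfd = epoll_create1(EPOLL_CLOEXEC);
#else
    /* Pre-2.6.27-kernel/glibc-2.9 fallback: create, then mark close-on-exec. */
    epfd = epoll_create(64);        /* size hint; ignored by modern kernels */
    if (epfd >= 0 && fcntl(epfd, F_SETFD, FD_CLOEXEC) == -1)
    {
        perror("fcntl(F_SETFD)");
        close(epfd);
        epfd = -1;
    }
#endif

    if (epfd < 0)
    {
        perror("epoll_create");
        return 1;
    }
    printf("epoll fd %d is close-on-exec\n", epfd);
    close(epfd);
    return 0;
}

Compiling with -DHAVE_EPOLL_CREATE1 exercises the one-call path; without it, the fallback branch is used.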
On 2017-04-26 11:42:38 -0400, Tom Lane wrote:
> 1. Let HEAD stand as it is. We have a problem with slow response to
> bgworker start requests that arrive while ServerLoop is active, but that's
> a pretty tight window usually (although I believe I've seen it hit at
> least once in testing).
>
> 2. Reinstall the pselect patch, blacklisting NetBSD and HPUX and whatever
> else we find to be flaky. Then only the blacklisted platforms have the
> problem.

That seems unattractive at this point. I'm not looking forward to having to
debug more random platforms that implement this badly in yet another weird
way.

> 3. Go ahead with converting the postmaster to use WaitEventSet, a la
> the draft patch I posted earlier. I'd be happy to do this if we were
> at the start of a devel cycle, but right now seems a bit late --- not
> to mention that we really need to fix 9.6 as well.

Yea, backpatching this to 9.6 seems like a bigger hammer than appropriate.
I'm on the fence WRT master; I think there's an argument to be made that
this is going to become a bigger and bigger problem, and that we'll wish in
a year or two that we had fewer releases with parallelism etc. that don't
use WaitEventSets.

> I'm leaning to doing #1 plus the maybe_start_bgworker change. There's
> certainly room for difference of opinion here, though. Thoughts?

I see you did the bgworker thing - that seems good to me.

Thanks,

Andres
On Wed, Apr 26, 2017 at 8:37 PM, Andres Freund <andres@anarazel.de> wrote:
>> 3. Go ahead with converting the postmaster to use WaitEventSet, a la
>> the draft patch I posted earlier. I'd be happy to do this if we were
>> at the start of a devel cycle, but right now seems a bit late --- not
>> to mention that we really need to fix 9.6 as well.
>
> Yea, backpatching this to 9.6 seems like a bigger hammer than
> appropriate. I'm on the fence WRT master, I think there's an argument
> to be made that this is going to become a bigger and bigger problem, and
> that we'll wish in a year or two that we had fewer releases with
> parallelism etc that don't use WaitEventSets.

I think changing this might be wise. This problem isn't going away for real
until we do this, right?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2017-04-26 17:05:39 -0400, Tom Lane wrote:
> Here's an updated version of that, which makes use of our previous
> conclusion that F_SETFD/FD_CLOEXEC are available everywhere except
> Windows, and fixes some sloppy thinking about the EXEC_BACKEND case.
>
> I went ahead and changed the call to epoll_create into epoll_create1.
> I'm not too concerned about loss of portability there --- it seems
> unlikely that many people are still using ten-year-old glibc, and
> even less likely that any of them would be interested in running
> current Postgres on their stable-unto-death platform. We could add
> a configure test for epoll_create1 if you feel one's needed, but
> I think it'd just be a waste of cycles.

Yea, I think we can live with that. If we find it's a problem, we can add a
configure test later.

> I propose to push this into HEAD and 9.6 too.

Cool. Change looks good to me.

Greetings,

Andres Freund
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Apr 26, 2017 at 8:37 PM, Andres Freund <andres@anarazel.de> wrote:
>>> 3. Go ahead with converting the postmaster to use WaitEventSet, a la
>>> the draft patch I posted earlier. I'd be happy to do this if we were
>>> at the start of a devel cycle, but right now seems a bit late --- not
>>> to mention that we really need to fix 9.6 as well.

>> Yea, backpatching this to 9.6 seems like a bigger hammer than
>> appropriate. I'm on the fence WRT master, I think there's an argument
>> to be made that this is going to become a bigger and bigger problem, and
>> that we'll wish in a year or two that we had fewer releases with
>> parallelism etc that don't use WaitEventSets.

> I think changing this might be wise. This problem isn't going away
> for real until we do this, right?

Sure, but we have a lot of other problems that aren't going away until we
fix them, either. This patch smells like new development to me --- and it's
not even very complete, because really what Andres wants to do (and I
concur) is to get rid of the postmaster's habit of doing interesting things
in signal handlers. I'm definitely not on board with doing that for v10 at
this point. But if we apply this patch, and then do that in v11, then v10
will look like neither earlier nor later branches with respect to the
postmaster's event wait mechanisms. I think that's a recipe for undue
maintenance pain.

I believe that the already-committed patches represent a sufficient-for-now
response to the known performance problems here, and so I'm thinking we
should stop here for v10. I'm okay with pushing the latch.c changes I just
proposed, because those provide a useful safeguard against extension modules
doing something exciting in the postmaster process. But I don't think we
should go much further than that for v10.

			regards, tom lane
Andres Freund <andres@anarazel.de> writes:
> On 2017-04-26 17:05:39 -0400, Tom Lane wrote:
>> I went ahead and changed the call to epoll_create into epoll_create1.
>> I'm not too concerned about loss of portability there --- it seems
>> unlikely that many people are still using ten-year-old glibc, and
>> even less likely that any of them would be interested in running
>> current Postgres on their stable-unto-death platform. We could add
>> a configure test for epoll_create1 if you feel one's needed, but
>> I think it'd just be a waste of cycles.

> Yea, I think we can live with that. If we find it's a problem, we can
> add a configure test later.

Well, according to the buildfarm, "later" is "now" :-(. If RHEL5 is too old
to have epoll_create1, I think your dates for it might be a bit off.

Anyway, I'll go do something about that in a little bit. It looks like it
might be sufficient to do "#ifdef EPOLL_CLOEXEC" in latch.c, rather than
bothering with a full-blown configure check.

			regards, tom lane
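For illustration, here is a sketch of what such an "#ifdef EPOLL_CLOEXEC" fallback could look like. This is an assumption about the general shape of the fix, not the actual latch.c change; create_epoll_fd is an invented name, and the fcntl fallback path is one plausible way to handle headers too old to declare epoll_create1()/EPOLL_CLOEXEC (e.g. glibc < 2.9, as on RHEL5).

#include <fcntl.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Create an epoll instance with close-on-exec set.  Use epoll_create1()
 * where the headers provide EPOLL_CLOEXEC; otherwise fall back to plain
 * epoll_create() plus an explicit fcntl(FD_CLOEXEC).
 */
static int
create_epoll_fd(int size_hint)
{
	int			fd;

#ifdef EPOLL_CLOEXEC
	(void) size_hint;			/* size hint is ignored by epoll_create1() */
	fd = epoll_create1(EPOLL_CLOEXEC);
#else
	fd = epoll_create(size_hint);
	if (fd >= 0 && fcntl(fd, F_SETFD, FD_CLOEXEC) == -1)
	{
		close(fd);
		fd = -1;
	}
#endif
	return fd;
}

int
main(void)
{
	int			fd = create_epoll_fd(16);

	printf("epoll fd: %d\n", fd);
	if (fd >= 0)
		close(fd);
	return 0;
}

The appeal of keying on the EPOLL_CLOEXEC macro rather than a configure probe is that the header either declares both the flag and epoll_create1() or neither, so no extra build-time machinery is needed.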
On 2017-04-27 16:35:29 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2017-04-26 17:05:39 -0400, Tom Lane wrote:
> >> I went ahead and changed the call to epoll_create into epoll_create1.
> >> I'm not too concerned about loss of portability there --- it seems
> >> unlikely that many people are still using ten-year-old glibc, and
> >> even less likely that any of them would be interested in running
> >> current Postgres on their stable-unto-death platform. We could add
> >> a configure test for epoll_create1 if you feel one's needed, but
> >> I think it'd just be a waste of cycles.
>
> > Yea, I think we can live with that. If we find it's a problem, we can
> > add a configure test later.
>
> Well, according to the buildfarm, "later" is "now" :-(.

Too bad.

> If RHEL5 is too old to have epoll_create1, I think your dates for it
> might be a bit off. Anyway, I'll go do something about that in a
> little bit.

2008-11-13 is when glibc 2.9 was released. It appears that RHEL5 still ships
with glibc 2.5, released 2006-09-29. Given that RHEL 5 was originally
released on 2007-03-05, that's not too surprising.

> It looks like it might be sufficient to do "#ifdef EPOLL_CLOEXEC"
> in latch.c, rather than bothering with a full-blown configure check.

Yea, that sounds worth trying. I wonder if we need to care about kernels not
supporting it, but glibc having support? I'd be ok skimping on that for now.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2017-04-27 16:35:29 -0400, Tom Lane wrote:
>> It looks like it might be sufficient to do "#ifdef EPOLL_CLOEXEC"
>> in latch.c, rather than bothering with a full-blown configure check.

> Yea, that sounds worth trying. Wonder if we need to care about kernels
> not supporting it, but glibc having support? I'd be ok skimping on that
> for now.

On my RHEL6 box, <sys/epoll.h> is provided by glibc, not the kernel:

$ rpm -qf /usr/include/sys/epoll.h
glibc-headers-2.12-1.209.el6_9.1.x86_64

So I think it's probably safe to assume that the header is in sync with what
glibc can do. As for a kernel (much) older than glibc, I'd rather expect
glibc to paper over that, though I've not looked at the source code to be
sure.

			regards, tom lane
Hi,

On 2017-04-26 11:42:38 -0400, Tom Lane wrote:
> 3. Go ahead with converting the postmaster to use WaitEventSet, a la
> the draft patch I posted earlier. I'd be happy to do this if we were
> at the start of a devel cycle, but right now seems a bit late --- not
> to mention that we really need to fix 9.6 as well.

Btw, recent-ish versions of
http://man7.org/linux/man-pages/man7/signal-safety.7.html have

    *  POSIX.1-2003 clarified that if an application calls fork(2) from a
       signal handler and any of the fork handlers registered by
       pthread_atfork(3) calls a function that is not async-signal-safe, the
       behavior is undefined. A future revision of the standard is likely to
       remove fork(2) from the list of async-signal-safe functions.

Greetings,

Andres Freund
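To make the hazard concrete, here is a small hypothetical example (not taken from PostgreSQL) of the pattern that clause warns about: a pthread_atfork() handler that calls non-async-signal-safe functions, combined with a signal handler that calls fork(). Per the quoted wording, the behavior is undefined, which is part of why a postmaster that forks workers from inside its signal handlers is on shaky ground.

#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Fork handler: malloc() and printf() are not async-signal-safe. */
static void
atfork_prepare(void)
{
	void	   *p = malloc(64);

	printf("prepare handler ran in pid %d\n", (int) getpid());
	free(p);
}

/* Signal handler that forks, which runs the atfork handlers in signal context. */
static void
sigusr1_handler(int signo)
{
	pid_t		pid;

	(void) signo;
	pid = fork();				/* undefined behavior per the POSIX.1-2003 wording above */
	if (pid == 0)
		_exit(0);				/* child exits immediately */
}

int
main(void)
{
	pthread_atfork(atfork_prepare, NULL, NULL);
	signal(SIGUSR1, sigusr1_handler);
	raise(SIGUSR1);				/* deliver the signal; the handler forks */
	return 0;
}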