Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests - Mailing list pgsql-hackers
From | Noah Misch |
---|---|
Subject | Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests |
Date | |
Msg-id | 20170803034740.GA2641942@rfd.leadboat.com |
In response to | Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses | Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests |
List | pgsql-hackers |
On Wed, Jun 21, 2017 at 06:44:09PM -0400, Tom Lane wrote:
> Today, lorikeet failed with a new variant on the bgworker start crash:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2017-06-21%2020%3A29%3A10
>
> This one is even more exciting than the last one, because it sure looks
> like the crashing bgworker took the postmaster down with it. That is
> Not Supposed To Happen.
>
> Wondering if we broke something here recently, I tried to reproduce it
> on a Linux machine by adding a randomized Assert failure in
> shm_mq_set_sender. I don't see any such problem, even with EXEC_BACKEND;
> we recover from the crash as-expected.
>
> So I'm starting to get a distinct feeling that there's something wrong
> with the cygwin port. But I dunno what.

I think signal blocking broke on Cygwin.

On a system (gcc 5.4.0, CYGWIN_NT-10.0 2.7.0(0.306/5/3) 2017-02-12 13:18
x86_64) that reproduces lorikeet's symptoms, I instrumented the postmaster as
attached. The patch's small_parallel.sql is a subset of select_parallel.sql
sufficient to reproduce the mq_sender Assert failure and the postmaster silent
exit. (It occasionally needed hundreds of iterations to do so.) The parallel
query normally starts four bgworkers; when the mq_sender Assert fired, the
test had started five workers in response to four registrations.
The postmaster.c instrumentation regularly detects sigusr1_handler() calls
while another sigusr1_handler() is already on the stack:

6328 2017-08-02 07:25:42.788 GMT LOG:  forbid signals @ sigusr1_handler
6328 2017-08-02 07:25:42.788 GMT DEBUG:  saw slot-0 registration, want 0
6328 2017-08-02 07:25:42.788 GMT DEBUG:  saw slot-0 registration, want 1
6328 2017-08-02 07:25:42.788 GMT DEBUG:  slot 1 not yet registered
6328 2017-08-02 07:25:42.789 GMT DEBUG:  registering background worker "parallel worker for PID 4776" (slot 1)
6328 2017-08-02 07:25:42.789 GMT DEBUG:  saw slot-1 registration, want 2
6328 2017-08-02 07:25:42.789 GMT DEBUG:  saw slot-0 registration, want 2
6328 2017-08-02 07:25:42.789 GMT DEBUG:  slot 2 not yet registered
6328 2017-08-02 07:25:42.789 GMT DEBUG:  registering background worker "parallel worker for PID 4776" (slot 2)
6328 2017-08-02 07:25:42.789 GMT DEBUG:  saw slot-2 registration, want 3
6328 2017-08-02 07:25:42.789 GMT DEBUG:  saw slot-1 registration, want 3
6328 2017-08-02 07:25:42.789 GMT DEBUG:  saw slot-0 registration, want 3
6328 2017-08-02 07:25:42.789 GMT DEBUG:  slot 3 not yet registered
6328 2017-08-02 07:25:42.789 GMT DEBUG:  registering background worker "parallel worker for PID 4776" (slot 3)
6328 2017-08-02 07:25:42.789 GMT DEBUG:  saw slot-3 registration, want 4
6328 2017-08-02 07:25:42.789 GMT DEBUG:  saw slot-2 registration, want 4
6328 2017-08-02 07:25:42.789 GMT DEBUG:  saw slot-1 registration, want 4
6328 2017-08-02 07:25:42.789 GMT DEBUG:  saw slot-0 registration, want 4
6328 2017-08-02 07:25:42.789 GMT DEBUG:  slot 4 not yet registered
6328 2017-08-02 07:25:42.789 GMT DEBUG:  registering background worker "parallel worker for PID 4776" (slot 4)
6328 2017-08-02 07:25:42.789 GMT DEBUG:  starting background worker process "parallel worker for PID 4776"
6328 2017-08-02 07:25:42.790 GMT LOG:  forbid signals @ sigusr1_handler
6328 2017-08-02 07:25:42.790 GMT WARNING:  signals already forbidden @ sigusr1_handler
6328 2017-08-02 07:25:42.790 GMT LOG:  permit signals @ sigusr1_handler

postmaster algorithms rely on the PG_SETMASK() calls preventing that. Without
such protection, duplicate bgworkers are an understandable result. I caught
several other assertions; the PMChildFlags failure is another case of
duplicate postmaster children:

      6 TRAP: FailedAssertion("!(entry->trans == ((void *)0))", File: "pgstat.c", Line: 871)
      3 TRAP: FailedAssertion("!(PMSignalState->PMChildFlags[slot] == 1)", File: "pmsignal.c", Line: 229)
     20 TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c", Line: 2523)
     21 TRAP: FailedAssertion("!(vmq->mq_sender == ((void *)0))", File: "shm_mq.c", Line: 221)

Also, got a few "select() failed in postmaster: Bad address"

I suspect a Cygwin signals bug. I'll try to distill a self-contained test case
for the Cygwin hackers. The lack of failures on buildfarm member brolga argues
that older Cygwin is not affected.
Attachment