Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?) - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?) |
Date | |
Msg-id | 20140929185229.GP16581@awork2.anarazel.de |
In response to | Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?) (Robert Haas <robertmhaas@gmail.com>) |
Responses | Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?) |
List | pgsql-hackers |
On 2014-09-29 14:46:20 -0400, Robert Haas wrote:
> On Fri, May 9, 2014 at 10:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Sat, May 3, 2014 at 4:31 AM, Dave Page <dave.page@enterprisedb.com> wrote:
> >> Hamid@EDB; Can you please have someone configure anole to build git
> >> head as well as the other branches? Thanks.
> >
> > The test_shm_mq regression tests hung on this machine this morning.
> > Hamid was able to give me access to log in and troubleshoot.
> > Unfortunately, I wasn't able to completely track down the problem
> > before accidentally killing off the running cluster, but it looks like
> > test_shm_mq_pipelined() tried to start 3 background workers and the
> > postmaster only ever launched one of them, so the test just sat there
> > and waited for the other two workers to start. At this point, I have
> > no idea what could cause the postmaster to be asleep at the switch
> > like this, but it seems clear that's what happened.
>
> This happened again, and I investigated further. It looks like the
> postmaster knows full well that it's supposed to start more bgworkers:
> the ones that never get started are in the postmaster's
> BackgroundWorkerList, and StartWorkerNeeded is true. But it only
> starts the first one, not all three. Why?
>
> Here's my theory. When I did a backtrace inside the postmaster, it
> was stuck inside select(), within ServerLoop(). I think that's
> just where it was when the backend that wanted to run test_shm_mq
> requested that a few background workers get launched. Each
> registration would have sent the postmaster a separate SIGUSR1, but
> for some reason the postmaster only received one, which I think is
> legit behavior, though possibly not typical on modern Linux systems.
> When the SIGUSR1 arrived, the postmaster jumped into
> sigusr1_handler(). sigusr1_handler() calls maybe_start_bgworker(),
> which launched the first background worker. Then it returned, and the
> arrival of the signal did NOT interrupt the pending select().
>
> This chain of events can't occur if an arriving SIGUSR1 causes
> select() to return EINTR or EWOULDBLOCK, nor can it happen if the
> signal handler is entered three separate times, once for each SIGUSR1.
> That combination of explanations seems likely sufficient to explain
> why this doesn't occur on other machines.
>
> The code seems to have been this way since the commit that introduced
> background workers (da07a1e856511dca59cbb1357616e26baa64428e),
> although the function was called StartOneBackgroundWorker back then.

If that theory is true, wouldn't things get unstuck every time a new
connection comes in? Or once 60 seconds have passed? That's not to say
this isn't wrong, but still?

Greetings,

Andres Freund

--
 Andres Freund	                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
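For reference, the "three SIGUSR1s sent, handler entered once" part of the theory can be reproduced outside the postmaster. What follows is only a minimal sketch, not PostgreSQL code: count_sigusr1() and count_coalesced_deliveries() are hypothetical names made up for this illustration, and the signal is blocked explicitly so the coalescing is deterministic instead of depending on timing. It assumes a single-threaded process on a platform that does not queue standard signals (Linux documents this behavior; POSIX leaves it implementation-defined).

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <signal.h>
#include <unistd.h>

/* Hypothetical counter: incremented once per entry into the handler. */
static volatile sig_atomic_t handler_entries = 0;

static void
count_sigusr1(int signo)
{
	(void) signo;
	handler_entries++;
}

/*
 * Block SIGUSR1, send it to ourselves nsignals times, unblock it, and
 * return how many times the handler actually ran.  Returns -1 if the
 * setup calls fail.
 */
int
count_coalesced_deliveries(int nsignals)
{
	struct sigaction sa;
	struct sigaction old_sa;
	sigset_t	block;
	sigset_t	old_mask;
	int			i;
	int			entries;

	/* Install the counting handler; SA_RESTART restarts interrupted calls. */
	sa.sa_handler = count_sigusr1;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = SA_RESTART;
	if (sigaction(SIGUSR1, &sa, &old_sa) != 0)
		return -1;

	/* Hold SIGUSR1 pending while the signals are being sent. */
	sigemptyset(&block);
	sigaddset(&block, SIGUSR1);
	if (sigprocmask(SIG_BLOCK, &block, &old_mask) != 0)
		return -1;

	handler_entries = 0;
	for (i = 0; i < nsignals; i++)
		kill(getpid(), SIGUSR1);

	/*
	 * Unblock.  In a single-threaded process the pending signal is
	 * delivered before sigprocmask() returns.
	 */
	sigprocmask(SIG_UNBLOCK, &block, NULL);
	entries = (int) handler_entries;

	/* Restore the caller's signal mask and handler. */
	sigprocmask(SIG_SETMASK, &old_mask, NULL);
	sigaction(SIGUSR1, &old_sa, NULL);

	return entries;
}
```

On Linux, count_coalesced_deliveries(3) returns 1: the second and third kill() find SIGUSR1 already pending and add nothing. That would be consistent with a handler that starts only one worker per invocation, as described above. The sketch says nothing about whether select() gets interrupted afterwards, which is the other half of the theory.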