Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?) - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?) |
Date | |
Msg-id | CA+TgmoZJ7z9_1Jwvq8GeANBfjEjUcmNCgNzeHKVp42dSWd2SWA@mail.gmail.com |
In response to | Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?) (Robert Haas <robertmhaas@gmail.com>) |
Responses | Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?) |
List | pgsql-hackers |
On Fri, May 9, 2014 at 10:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, May 3, 2014 at 4:31 AM, Dave Page <dave.page@enterprisedb.com> wrote:
>> Hamid@EDB; Can you please have someone configure anole to build git
>> head as well as the other branches? Thanks.
>
> The test_shm_mq regression tests hung on this machine this morning.
> Hamid was able to give me access to log in and troubleshoot.
> Unfortunately, I wasn't able to completely track down the problem
> before accidentally killing off the running cluster, but it looks like
> test_shm_mq_pipelined() tried to start 3 background workers and the
> postmaster only ever launched one of them, so the test just sat there
> and waited for the other two workers to start. At this point, I have
> no idea what could cause the postmaster to be asleep at the switch
> like this, but it seems clear that's what happened.

This happened again, and I investigated further. It looks like the
postmaster knows full well that it's supposed to start more bgworkers:
the ones that never get started are in the postmaster's
BackgroundWorkerList, and StartWorkerNeeded is true. But it only
starts the first one, not all three. Why?

Here's my theory. When I did a backtrace inside the postmaster, it
was stuck inside select(), within ServerLoop(). I think that's just
where it was when the backend that wanted to run test_shm_mq requested
that a few background workers get launched. Each registration would
have sent the postmaster a separate SIGUSR1, but for some reason the
postmaster only received one, which I think is legit behavior, though
possibly not typical on modern Linux systems. When the SIGUSR1
arrived, the postmaster jumped into sigusr1_handler().
sigusr1_handler() calls maybe_start_bgworker(), which launched the
first background worker. Then it returned, and the arrival of the
signal did NOT interrupt the pending select().
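
The point that several SIGUSR1s can collapse into a single handler entry can be reproduced outside the postmaster. What follows is a minimal standalone sketch, not PostgreSQL code: the function name count_coalesced_deliveries and the handler count_sigusr1 are hypothetical, and the signal is blocked while it is raised only to make the coalescing deterministic. POSIX leaves it implementation-defined whether a standard (non-realtime) signal that is already pending gets delivered more than once; common implementations such as Linux deliver it once.

```c
#define _POSIX_C_SOURCE 200809L

#include <signal.h>
#include <stddef.h>

/* Number of times the handler has been entered (hypothetical counter). */
static volatile sig_atomic_t handler_entries = 0;

/* Stand-in for a handler such as sigusr1_handler(): it only counts entries. */
static void
count_sigusr1(int signo)
{
    (void) signo;
    handler_entries++;
}

/*
 * Raise SIGUSR1 nsignals times while it is blocked, then unblock it and
 * report how many times the handler actually ran.  Returns -1 on error.
 */
int
count_coalesced_deliveries(int nsignals)
{
    struct sigaction sa;
    struct sigaction old_sa;
    sigset_t    usr1_set;
    sigset_t    old_mask;
    int         i;

    sa.sa_handler = count_sigusr1;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGUSR1, &sa, &old_sa) != 0)
        return -1;

    /* Block SIGUSR1 so that every raise() below just marks it pending. */
    sigemptyset(&usr1_set);
    sigaddset(&usr1_set, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &usr1_set, &old_mask) != 0)
        return -1;

    handler_entries = 0;
    for (i = 0; i < nsignals; i++)
        raise(SIGUSR1);

    /* Unblocking delivers whatever is pending before sigprocmask() returns. */
    sigprocmask(SIG_UNBLOCK, &usr1_set, NULL);

    /* Restore the caller's signal mask and previous SIGUSR1 disposition. */
    sigprocmask(SIG_SETMASK, &old_mask, NULL);
    sigaction(SIGUSR1, &old_sa, NULL);

    return (int) handler_entries;
}
```

On Linux, calling count_coalesced_deliveries(3) from a small main() returns 1: three signals sent, one handler entry. A handler that does one unit of work per entry would therefore handle only one of the three requests unless something else notices the remaining work.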

This chain of events can't occur if an arriving SIGUSR1 causes
select() to return EINTR or EWOULDBLOCK, nor can it happen if the
signal handler is entered three separate times, once for each SIGUSR1.
That combination of explanations seems likely sufficient to explain
why this doesn't occur on other machines.

The code seems to have been this way since the commit that introduced
background workers (da07a1e856511dca59cbb1357616e26baa64428e),
although the function was called StartOneBackgroundWorker back then.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company