Apparent race condition in standby promotion - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Apparent race condition in standby promotion |
Date | |
Msg-id | 9542.1461607767@sss.pgh.pa.us Whole thread Raw |
Responses |
Re: Apparent race condition in standby promotion
|
List | pgsql-hackers |
I'm looking at buildfarm member tern's recent failure: http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2016-04-25%2001%3A08%3A08 The perl script thinks it's started the postmaster and sent a promote signal, then it's waited 90 seconds for the postmaster to come out of standby: ### Starting node "standby" # Running: pg_ctl -w -D /home/nm/farm/gcc32/HEAD/pgsql.build/src/bin/pg_rewind/tmp_check/data_standby_fTzy/pgdata -l /home/nm/farm/gcc32/HEAD/pgsql.build/src/bin/pg_rewind/tmp_check/log/001_basic_standby.logstart waiting for server to start.... done server started # Postmaster PID for node "standby" is 14025120 # Running: psql -q --no-psqlrc -d port=59882 host=/tmp/hr4cDmbIX0 dbname=postgres -c INSERT INTO tbl1 values ('in master,before promotion') # Running: psql -q --no-psqlrc -d port=59882 host=/tmp/hr4cDmbIX0 dbname=postgres -c INSERT INTO trunc_tbl values ('in master,before promotion') # Running: psql -q --no-psqlrc -d port=59882 host=/tmp/hr4cDmbIX0 dbname=postgres -c INSERT INTO tail_tbl SELECT g, 'in master,before promotion: ' || g FROM generate_series(1, 10000) g # Running: psql -q --no-psqlrc -d port=59882 host=/tmp/hr4cDmbIX0 dbname=postgres -c CHECKPOINT ### Promoting node "standby" # Running: pg_ctl -D /home/nm/farm/gcc32/HEAD/pgsql.build/src/bin/pg_rewind/tmp_check/data_standby_fTzy/pgdata -l /home/nm/farm/gcc32/HEAD/pgsql.build/src/bin/pg_rewind/tmp_check/log/001_basic_standby.logpromote server promoting # Timed out while waiting for promotion of standby at RewindTest.pm line 166. But look at what's in the standby's log: LOG: database system was interrupted; last known up at 2016-04-25 01:58:39 UTC LOG: entering standby mode LOG: redo starts at 0/2000028 LOG: consistent recovery state reached at 0/2000C00 LOG: database system is ready to accept read only connections LOG: started streaming WAL from primary at 0/3000000 on timeline 1 LOG: statement: SELECT NOT pg_is_in_recovery() ... 88 occurrences of that removed ... LOG: statement: SELECT NOT pg_is_in_recovery() LOG: received promote request FATAL: terminating walreceiver process due to administrator command The standby server didn't notice the promote request until AFTER the perl script had probed 90 times, once a second, for promotion to finish. What's up with that? One theory is that the postmaster handles promotion requests improperly. Its only consideration of the problem is in sigusr1_handler: if (CheckPromoteSignal() && StartupPID != 0 && (pmState == PM_STARTUP || pmState == PM_RECOVERY || pmState== PM_HOT_STANDBY || pmState == PM_WAIT_READONLY)) { /* Tell startup process to finish recovery */ signal_child(StartupPID,SIGUSR2); } This means that if pg_ctl sends a SIGUSR1 after creating a promotion trigger file, and the system state is not such that the signal can be forwarded on to the startup process at that very instant, the signal is simply forgotten. We will reconsider sending it to the startup process only when the postmaster next gets a SIGUSR1, which might not be for a long time. Now, I do not immediately see a way for this to happen: the postmaster should create the startup process before it first enables interrupts, and that list of states seems to cover every possibility. But it seems fragile as heck. Maybe we need to check for promotion in PostmasterStateMachine or related places, not in (or not just in) the signal handler. The other theory is that the startup process received the SIGUSR2 but failed to act on it for a long time. It checks for that only in CheckForStandbyTrigger(), and the calls to that function are in assorted rather unprincipled places, which makes it hard to convince oneself that such a call will always happen soon. This example sure looks like no such call happened. We've seen previous examples of the same thing, eg http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2016-02-09%2019%3A16%3A09 Noah raised the test's timeout from 30 seconds to 90 in response to that failure, but it looks to me like this was a mis-diagnosis of the problem. What I suspect is happening is that if the signal arrives at the startup process at the right time, probably around the time it's opening up the streaming connection from the primary, it doesn't get handled promptly. I'm not planning to pursue this further, but someone who's more familiar with the WAL-receiving logic in startup.c ought to. regards, tom lane
pgsql-hackers by date: