Re: Apparent race condition in standby promotion - Mailing list pgsql-hackers

From: Noah Misch
Subject: Re: Apparent race condition in standby promotion
Msg-id: 20160504061042.GC2409136@tornado.leadboat.com
In response to: Apparent race condition in standby promotion (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
On Mon, Apr 25, 2016 at 02:09:27PM -0400, Tom Lane wrote:
> I'm looking at buildfarm member tern's recent failure:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2016-04-25%2001%3A08%3A08
>
> # Running: pg_ctl -D /home/nm/farm/gcc32/HEAD/pgsql.build/src/bin/pg_rewind/tmp_check/data_standby_fTzy/pgdata -l /home/nm/farm/gcc32/HEAD/pgsql.build/src/bin/pg_rewind/tmp_check/log/001_basic_standby.log promote
> server promoting
> #
> Timed out while waiting for promotion of standby at RewindTest.pm line 166.
>
> But look at what's in the standby's log:
>
> LOG:  database system was interrupted; last known up at 2016-04-25 01:58:39 UTC
> LOG:  entering standby mode
> LOG:  redo starts at 0/2000028
> LOG:  consistent recovery state reached at 0/2000C00
> LOG:  database system is ready to accept read only connections
> LOG:  started streaming WAL from primary at 0/3000000 on timeline 1
> LOG:  statement: SELECT NOT pg_is_in_recovery()
> ... 88 occurrences of that removed ...
> LOG:  statement: SELECT NOT pg_is_in_recovery()
> LOG:  received promote request
> FATAL:  terminating walreceiver process due to administrator command
>
> The standby server didn't notice the promote request until AFTER
> the perl script had probed 90 times, once a second, for promotion
> to finish.  What's up with that?

This resembles symptoms I studied last September.

> The other theory is that the startup process received the SIGUSR2
> but failed to act on it for a long time.  It checks for that only
> in CheckForStandbyTrigger(), and the calls to that function are
> in assorted rather unprincipled places, which makes it hard to
> convince oneself that such a call will always happen soon.

Right.  In every case, I caught the startup process taking too long in a
single XLOG_DBASE_CREATE replay.  I don't suspect a bug; that is a costly
record to replay, especially given the machine's slow filesystem metadata
operations.
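For readers unfamiliar with the test: the once-per-second probing Tom quotes
amounts to a loop like the sketch below.  This is illustrative only;
check_promoted is a hypothetical stand-in for the test's real
`psql -c "SELECT NOT pg_is_in_recovery()"` probe, and PROMOTED_FLAG is an
invented variable used to simulate the standby's state.

```shell
# Sketch of the TAP test's promotion wait (simplified, not the actual code).
check_promoted() {
    # Stand-in predicate: the real test queries the standby via psql.
    [ -e "${PROMOTED_FLAG:-/nonexistent}" ]
}

wait_for_promotion() {
    timeout=${1:-90}   # seconds; the value previously raised from 30 to 90
    i=0
    while [ "$i" -lt "$timeout" ]; do
        if check_promoted; then
            return 0   # standby reports it has left recovery
        fi
        sleep 1
        i=$((i + 1))
    done
    return 1           # timed out, as in the tern failure
}
```

The point of the failure analysis is that nothing in this loop is wrong; the
startup process simply spent the whole window inside one replay record.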
I could reproduce such symptoms reliably, on any system, with the attached
demonstration patch.  While we could hack things until the message prints
earlier, that won't make promotion finish earlier.

> We've seen previous examples of the same thing, eg
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2016-02-09%2019%3A16%3A09
> Noah raised the test's timeout from 30 seconds to 90 in response
> to that failure, but it looks to me like this was a mis-diagnosis
> of the problem.

I can understand you guessing the 2016-02-09 failure inspired that change,
but it was either a coincidence or merely reminded me to commit an
already-written change.  I had performed burn-in runs, with timeouts raised
to 900s, totalling 25177 executions of src/bin/pg_rewind/t/002_databases.pl.
The longest promote duration among those runs was 43s, so I doubled that
quantity and rounded up.

For the 2016-04-25 failure, note this bit of 001_basic_master.log:

LOG:  statement: CHECKPOINT
LOG:  statement: SELECT pg_current_xlog_location() = write_location FROM pg_stat_replication WHERE application_name = 'rewind_standby';
... 39 duplicate lines removed ...
LOG:  statement: SELECT pg_current_xlog_location() = write_location FROM pg_stat_replication WHERE application_name = 'rewind_standby';
LOG:  received immediate shutdown request

That's 41s just to copy WAL.  The slowest of those 002_databases.pl runs
used 7s, and the worst I've seen in 2579 runs of 001_basic.pl is 22s.  The
failed run likely coincided with exceptional system load.  If this happens
again anytime soon, I should triple (or so) that 90s timeout.
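Spelled out, the timeout arithmetic from the message looks like this.  The
rounding rule (up to the next multiple of 10) is my assumption; the message
only says the 43s worst case was doubled and rounded up to reach 90s.

```shell
# Illustrative arithmetic only, restating figures from the message.
worst=43                                # longest promote time seen in burn-in (s)
doubled=$((worst * 2))                  # 86
timeout=$(( (doubled + 9) / 10 * 10 ))  # assumed round-up to multiple of 10 -> 90
tripled=$((timeout * 3))                # the proposed "triple (or so)" bump -> 270
echo "$timeout $tripled"
```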
Attachment