BUG #8686: Standby could not restart. - Mailing list pgsql-bugs
From:       katsumata.tomonari@po.ntts.co.jp
Subject:    BUG #8686: Standby could not restart.
Date:
Msg-id:     E1VtTni-00082E-Jv@wrigleys.postgresql.org
Responses:  Re: BUG #8686: Standby could not restart.
            Re: BUG #8686: Standby could not restart.
List:       pgsql-bugs
The following bug has been logged on the website:

Bug reference:      8686
Logged by:          Tomonari Katsumata
Email address:      katsumata.tomonari@po.ntts.co.jp
PostgreSQL version: 9.2.6
Operating system:   Red Hat Enterprise Linux 6.2 x86_64

Description:

Hi,

I'm testing whether a standby can restart under asynchronous replication.
The test case is very simple:

(A) start asynchronous replication
(B) stop the standby with "-m f"
(C) start the standby

I repeated (B) and (C) periodically. In most cases it worked fine, but one
time the standby could not start because of a PANIC.

---- Log -----
[Standby] 2013-12-09 10:42:30 JST LOG: 00000: database system was shut down in recovery at 2013-12-09 10:42:29 JST
[Standby] 2013-12-09 10:42:30 JST LOCATION: StartupXLOG, xlog.c:6273
cp: cannot stat `../arc/00000002.history': No such file or directory
[Standby] 2013-12-09 10:42:30 JST LOG: 00000: entering standby mode
[Standby] 2013-12-09 10:42:30 JST LOCATION: StartupXLOG, xlog.c:6359
cp: cannot stat `../arc/000000010000000100000008': No such file or directory
[Standby] 2013-12-09 10:42:30 JST LOG: 00000: consistent recovery state reached at 1/8E7F110
[Standby] 2013-12-09 10:42:30 JST LOCATION: CheckRecoveryConsistency, xlog.c:7366
[Standby] 2013-12-09 10:42:30 JST LOG: 00000: restored log file "000000010000000100000007" from archive
[Standby] 2013-12-09 10:42:30 JST LOCATION: RestoreArchivedFile, xlog.c:3273
[Standby] 2013-12-09 10:42:30 JST LOG: 00000: redo starts at 1/783B230
[Standby] 2013-12-09 10:42:30 JST LOCATION: StartupXLOG, xlog.c:6827
[Standby] 2013-12-09 10:42:30 JST WARNING: 01000: page 1365 of relation base/16384/16388 does not exist
[Standby] 2013-12-09 10:42:30 JST CONTEXT: xlog redo hot_update: rel 1663/16384/16388; tid 1365/152; new 1365/153
[Standby] 2013-12-09 10:42:30 JST LOCATION: report_invalid_page, xlogutils.c:66
[Standby] 2013-12-09 10:42:30 JST PANIC: XX000: WAL contains references to invalid pages
[Standby] 2013-12-09 10:42:30 JST CONTEXT: xlog redo hot_update: rel 1663/16384/16388; tid 1365/152; new 1365/153
[Standby] 2013-12-09 10:42:30 JST LOCATION: log_invalid_page, xlogutils.c:91
[Standby] 2013-12-09 10:42:30 JST LOG: 00000: startup process (PID 12560) was terminated by signal 6: Aborted
[Standby] 2013-12-09 10:42:30 JST LOCATION: LogChildExit, postmaster.c:2895
[Standby] 2013-12-09 10:42:30 JST LOG: 00000: terminating any other active server processes
[Standby] 2013-12-09 10:42:30 JST LOCATION: HandleChildCrash, postmaster.c:2682
[Standby] 2013-12-09 11:10:12 JST LOG: 00000: database system was interrupted while in recovery at log time 2013-12-09 10:32:14 JST
[Standby] 2013-12-09 11:10:12 JST HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
[Standby] 2013-12-09 11:10:12 JST LOCATION: StartupXLOG, xlog.c:6289
---------

I tried to track this problem down. At first I suspected that the recovery
state was reported "consistent" before redo started, so I checked pg_control
and the related WAL. The WAL sequence looks like this:

WAL--(a)--(b)--(c)--(d)--(e)-->
================================================
(a) Latest checkpoint's REDO location   1/783B230
(b) hot_update                          1/7842010
(c) truncate                            1/8E7E5C8
(d) Latest checkpoint location          1/8E7F0B0
(e) Minimum recovery ending location    1/8E7F110
================================================
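Note that the hot_update record (b), the one that triggers the PANIC, lies well
before the minimum recovery ending location (e). To illustrate the interplay I
am describing, here is a small standalone C model of the two checks involved.
It is NOT the 9.2 source: the helper names, the folded 64-bit LSN arithmetic
and the constants are my own, and they only loosely mirror
CheckRecoveryConsistency() in xlog.c and log_invalid_page() in xlogutils.c and
the reachedConsistency/EndRecPtr variables discussed in the scenario below.

----------
/* Simplified standalone model, not PostgreSQL source.  LSNs such as
 * "1/8E7F110" are folded into single 64-bit values as
 * (xlogid << 32) | xrecoff so they can be compared directly. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef uint64_t XLogRecPtr;

#define LSN(hi, lo) (((XLogRecPtr) (hi) << 32) | (lo))

/* positions taken from the pg_control / WAL analysis above */
static const XLogRecPtr redo_start      = LSN(1, 0x783B230);  /* (a) */
static const XLogRecPtr hot_update_rec  = LSN(1, 0x7842010);  /* (b) */
static const XLogRecPtr min_recovery_pt = LSN(1, 0x8E7F110);  /* (e) */

static int        reachedConsistency = 0;
static XLogRecPtr EndRecPtr = 0;

/* roughly what the consistency check does: once EndRecPtr has passed the
 * minimum recovery ending location, the standby is declared consistent */
static void check_recovery_consistency(void)
{
    if (!reachedConsistency && EndRecPtr >= min_recovery_pt)
    {
        reachedConsistency = 1;
        printf("LOG:  consistent recovery state reached at %X/%X\n",
               (unsigned) (EndRecPtr >> 32), (unsigned) EndRecPtr);
    }
}

/* roughly what happens on a reference to a missing page: before consistency
 * it is only remembered (later WAL, such as the truncate, would resolve it),
 * afterwards it is treated as fatal */
static void invalid_page_reference(XLogRecPtr rec)
{
    if (reachedConsistency)
    {
        printf("PANIC: WAL contains references to invalid pages (record %X/%X)\n",
               (unsigned) (rec >> 32), (unsigned) rec);
        exit(1);
    }
    printf("LOG:  invalid page reference at %X/%X noted for later resolution\n",
           (unsigned) (rec >> 32), (unsigned) rec);
}

int main(void)
{
    /* while locating the checkpoint record, EndRecPtr ends up just past it,
     * i.e. already at the minimum recovery ending location ... */
    EndRecPtr = min_recovery_pt;
    check_recovery_consistency();   /* ... so consistency is claimed here */

    /* redo then starts back at the checkpoint's REDO pointer and replays the
     * hot_update whose target page only vanishes with the later truncate */
    EndRecPtr = redo_start;
    invalid_page_reference(hot_update_rec);   /* PANICs, as in the log above */
    return 0;
}
----------

Compiled with a plain C compiler, this prints the "consistent recovery state
reached" line first and then the PANIC, which is the same ordering as in the
log above.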
From the pg_control values and the WAL sequence above, I found that the
failure happens through the following scenario:

----------
(1) the standby starts
(2) it seeks the checkpoint record at 1/8E7F0B0, because no backup_label is
    present
(3) reachedConsistency is set to true at 1/8E7F110 in CheckRecoveryConsistency
(4) redo starts from 1/783B230
(5) PANIC at 1/7842010, because reachedConsistency has already been set and
    the record touches a block that is only removed later, by the truncate
    at (c)
----------

At step (2), EndRecPtr is set to 1/8E7F110 (just past 1/8E7F0B0), so
reachedConsistency is set to true at step (3). I think there is no need to
advance EndRecPtr while seeking the checkpoint location. I revised it that
way and it worked fine.

I think this is a very narrow window, but it can happen. I'm not sure whether
this problem occurs with synchronous replication or with other versions of
PostgreSQL (i.e. 9.3/9.1/9.0), but at least we need to fix it.

regards,
----------------
Tomonari Katsumata