Archive recovery won't be completed on some situation. - Mailing list pgsql-hackers
From | Kyotaro HORIGUCHI |
---|---|
Subject | Archive recovery won't be completed on some situation. |
Date | |
Msg-id | 20140314.193220.123692229.horiguchi.kyotaro@lab.ntt.co.jp |
Responses |
Re: Archive recovery won't be completed on some situation.
Re: Archive recovery won't be completed on some situation. |
List | pgsql-hackers |
Hello, we found that PostgreSQL never completes archive recovery in some situations. This occurs on HEAD, 9.3.3, 9.2.7, and 9.1.12.

If the server is killed with SIGKILL after pg_start_backup and some WAL writes but before pg_stop_backup, restarting it with archive recovery fails as follows:

| FATAL: WAL ends before end of online backup
| HINT: Online backup started with pg_start_backup() must be
| ended with pg_stop_backup(), and all WAL up to that point must
| be available at recovery.

The trouble is that, once the server has entered this state, I could find no formal operation to get out of it.

In this situation, 'Backup start location' in controldata holds a valid location, but the corresponding 'end of backup' WAL record will never arrive. However, I think PG cannot clearly tell whether the 'end of backup' record does not exist at all or will come later, especially when the server starts as a streaming replication hot standby.

One solution would be a new parameter in recovery.conf which tells the server that, when this situation arises, the operator wants it to start as if there had never been a backup label. It looks ugly and somewhat dangerous, but seems necessary.

The first attached file is a script to reproduce the problem, and the second is a patch that tries to do what is described above.

After applying this patch on HEAD and uncommenting 'cancel_backup_label_on_failure = true' in test.sh, the test script runs as follows:

| LOG: record with zero length at 0/2010F40
| WARNING: backup_label was canceled.
| HINT: server might have crashed during backup mode.
| LOG: consistent recovery state reached at 0/2010F40
| LOG: redo done at 0/2010DA0

What do you think about this?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

#! /bin/sh
killall postgres
rm -rf $PGDATA/*
initdb
cat >> $PGDATA/postgresql.conf <<EOF
wal_level = hot_standby
EOF
pg_ctl start -w
sleep 1
psql postgres -c "select pg_start_backup('hoge');"
psql postgres -c "create table t (a int);"
killall -9 postgres
# pg_ctl stop -m f
cat > $PGDATA/recovery.conf <<EOF
# standby_mode = on
restore_command = '/bin/true'
recovery_target_timeline = 'latest'
# cancel_backup_label_on_failure = true
EOF
pg_ctl start
sleep 5
pg_ctl stop -w
pg_ctl start
sleep 5
pg_ctl stop -w

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index cdbe305..d1f93bb 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -230,6 +230,7 @@ static TimestampTz recoveryDelayUntilTime;
 static bool StandbyModeRequested = false;
 static char *PrimaryConnInfo = NULL;
 static char *PrimarySlotName = NULL;
+static bool cancelBackupLabelOnFailure = false;
 static char *TriggerFile = NULL;
 
 /* are we currently in standby mode? */
@@ -5569,6 +5570,16 @@ readRecoveryCommandFile(void)
             ereport(DEBUG2,
                     (errmsg("min_recovery_apply_delay = '%s'", item->value)));
         }
+        else if (strcmp(item->name, "cancel_backup_label_on_failure") == 0)
+        {
+            if (!parse_bool(item->value, &cancelBackupLabelOnFailure))
+                ereport(ERROR,
+                        (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                         errmsg("parameter \"%s\" requires a Boolean value",
+                                "cancel_backup_label_on_failure")));
+            ereport(DEBUG2,
+                    (errmsg_internal("cancel_backup_label_on_failure = '%s'", item->value)));
+        }
         else
             ereport(FATAL,
                     (errmsg("unrecognized recovery parameter \"%s\"",
@@ -7111,6 +7122,21 @@ StartupXLOG(void)
                 record = ReadRecord(xlogreader, InvalidXLogRecPtr, LOG, false);
             } while (record != NULL);
 
+            if (cancelBackupLabelOnFailure &&
+                ControlFile->backupStartPoint != InvalidXLogRecPtr)
+            {
+                /*
+                 * Try to force recovery to complete when backup_label was
+                 * found but the end-of-backup record has not been found.
+                 */
+
+                ControlFile->backupStartPoint = InvalidXLogRecPtr;
+
+                ereport(WARNING,
+                        (errmsg("backup_label was canceled."),
+                         errhint("server might have crashed during backup mode.")));
+                CheckRecoveryConsistency();
+            }
             /*
              * end of main redo apply loop
              */
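
To check whether a data directory is in the state described above, one could look at the 'Backup start location' line that pg_controldata prints. The following is only a rough sketch, not part of the patch: the function names backup_start_location and backup_start_pending are made up for illustration, and it assumes that pg_controldata labels the field "Backup start location:" and shows 0/0 when no backup start point is recorded.

```shell
#!/bin/sh
# Sketch only: both helper names below are hypothetical, not PostgreSQL tools.

# Read pg_controldata-style text on stdin and print the value of the
# 'Backup start location' field (prints nothing if the field is absent).
backup_start_location() {
    sed -n 's/^Backup start location:[[:space:]]*//p'
}

# Succeed (exit status 0) when stdin reports a backup start location other
# than 0/0, i.e. recovery would still expect an end-of-backup record.
# Assumption: 0/0 is how an unset backup start point is displayed.
backup_start_pending() {
    loc=$(backup_start_location)
    [ -n "$loc" ] && [ "$loc" != "0/0" ]
}
```

With these defined, something like `pg_controldata "$PGDATA" | backup_start_pending && echo "backup start point is still set"` could be run between the two `pg_ctl start` attempts in the test script. It only reports what the control file says and cannot tell whether the end-of-backup record will still arrive later.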