pg_basebackup from cascading standby after timeline switch - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | pg_basebackup from cascading standby after timeline switch |
Date | |
Msg-id | 50CF2929.5070603@vmware.com |
Responses |
Re: pg_basebackup from cascading standby after timeline switch
Re: pg_basebackup from cascading standby after timeline switch |
List | pgsql-hackers |
pg_basebackup -x is supposed to include all the required WAL files in the backup, so that you have everything needed to restore a consistent database. However, it's not including the timeline history files. Usually that's not a problem, because normally you don't need to follow any old timelines when restoring, but there is one scenario where it causes a failure to restore:

Create a master, a standby, and a cascading standby. Kill the master server, promote the standby to become new master, bumping the timeline. After the cascading standby has followed the timeline switch (either through the archive, which also works on 9.2, or directly via streaming replication, which only works on 9.3devel), take a base backup from the cascading standby using pg_basebackup -x. When you try to start the server from the new backup (without setting up a restore_command or streaming replication), you get an error about "unexpected timeline ID 1 in log segment ...":

C 2012-12-17 15:55:25.732 EET 534 LOG: database system was interrupted while in recovery at log time 2012-12-17 15:55:15 EET
C 2012-12-17 15:55:25.732 EET 534 HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
C 2012-12-17 15:55:25.732 EET 534 LOG: creating missing WAL directory "pg_xlog/archive_status"
C 2012-12-17 15:55:25.732 EET 534 LOG: unexpected timeline ID 1 in log segment 000000020000000000000003, offset 0
C 2012-12-17 15:55:25.732 EET 534 LOG: invalid checkpoint record
C 2012-12-17 15:55:25.733 EET 534 FATAL: could not locate required checkpoint record
C 2012-12-17 15:55:25.733 EET 534 HINT: If you are not restoring from a backup, try removing the file "/home/heikki/pgsql.master/data-standbyC/backup_label".
C 2012-12-17 15:55:25.733 EET 533 LOG: startup process (PID 534) exited with exit code 1
C 2012-12-17 15:55:25.733 EET 533 LOG: aborting startup due to startup process failure

The timeline was bumped within the log segment 000000020000000000000003, so the beginning of the file uses timeline 1, up to the checkpoint record that changes the timeline. Normally, recovery accepts that because timeline 1 is an ancestor of timeline 2, but because the backup does not include the timeline history file, it does not know that.

This does not happen if you run pg_basebackup against the master server, because in the master it forces an xlog switch, which ensures that the new xlog file only contains pages with the latest timeline ID. There are even comments in pg_start_backup explaining that that's the reason for the xlog switch:

> /*
>  * Force an XLOG file switch before the checkpoint, to ensure that the
>  * WAL segment the checkpoint is written to doesn't contain pages with
>  * old timeline IDs. That would otherwise happen if you called
>  * pg_start_backup() right after restoring from a PITR archive: the
>  * first WAL segment containing the startup checkpoint has pages in
>  * the beginning with the old timeline ID. That can cause trouble at
>  * recovery: we won't have a history file covering the old timeline if
>  * pg_xlog directory was not included in the base backup and the WAL
>  * archive was cleared too before starting the backup.
>  *
>  * This also ensures that we have emitted a WAL page header that has
>  * XLP_BKP_REMOVABLE off before we emit the checkpoint record.
>  * Therefore, if a WAL archiver (such as pglesslog) is trying to
>  * compress out removable backup blocks, it won't remove any that
>  * occur after this point.
>  *
>  * During recovery, we skip forcing XLOG file switch, which means that
>  * the backup taken during recovery is not available for the special
>  * recovery case described above.
>  */
> if (!backup_started_in_recovery)
>     RequestXLogSwitch();

I'm not happy with the fact that we just ignore the problem in a backup taken from a standby, silently giving the user a backup that won't start up. Why not include the timeline history file in the backup? That seems like a good idea regardless of this issue.

I also wonder if pg_basebackup should include *all* timeline history files in the backup, not just the latest one strictly required to restore. They're fairly small, so our approach has generally been to try to include them all in the archive, and not try to prune them, so the same might make sense here.

- Heikki