Thread: PG 9.0 EBS Snapshot Backups on Slave
Hello,

I am playing with a script that implements physical backups by snapshotting the EBS-backed software RAID. My basic workflow is this:

1. Stop PG on the slave
2. pg_start_backup on the master
3. On the slave:
   A. unmount the PG RAID
   B. snapshot each disk in the RAID
   C. mount the PG RAID
4. pg_stop_backup
5. Restart PG on the slave

Step 3 is actually quite fast; however, on the master I end up seeing the following warning:

WARNING: transaction log file "00000001000000CC00000076" could not be archived: too many failures

I am guessing (I will confirm with timestamps later) that this warning happens during steps 3A-3C; however, my questions below stand regardless of when the failure occurs. It is worth noting that the slave (seemingly) catches up eventually, recovering later log files with streaming replication current. Can I trust this state?

Should I be concerned about this warning? Is it a simple blip that can safely be ignored, or have I lost data? From googling, it looks like the number of retry attempts is not a configurable parameter (it appears to have retried a handful of times).

If this is indeed a real problem, am I best off changing my archive_command to retain logs in a transient location while I am in "snapshot mode", and then ship them in bulk once the snapshot has completed? Are there any other remedies that I am missing?

Thank you very much for your time,

Andrew Hannon
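For reference, here is a minimal sketch of the workflow above as a shell script. All names in it (mount point, data directory, EBS volume IDs, master host) are hypothetical placeholders, the ec2-create-snapshot call stands in for whatever snapshot tooling is actually in use, and error handling is omitted:

#!/bin/bash
# Sketch of the snapshot-backup workflow described above (PostgreSQL 9.0).
# All paths, hosts, and volume IDs are placeholders.
set -e

MASTER="postgres@master.example.com"
PG_MOUNT="/pgdata"
PGDATA="$PG_MOUNT/9.0/data"
EBS_VOLUMES="vol-aaaa1111 vol-bbbb2222 vol-cccc3333 vol-dddd4444"

# 1. Stop PG on the slave
pg_ctl -D "$PGDATA" -m fast stop

# 2. Start the backup on the master
ssh "$MASTER" "psql -c \"SELECT pg_start_backup('ebs-snapshot');\""

# 3A. Unmount the PG RAID on the slave
umount "$PG_MOUNT"

# 3B. Snapshot each disk in the RAID
for vol in $EBS_VOLUMES; do
    ec2-create-snapshot "$vol" -d "pg-backup $(date +%Y%m%d)"
done

# 3C. Remount the PG RAID (assumes an /etc/fstab entry for the mount point)
mount "$PG_MOUNT"

# 4. Stop the backup on the master
ssh "$MASTER" "psql -c 'SELECT pg_stop_backup();'"

# 5. Restart PG on the slave
pg_ctl -D "$PGDATA" start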
On Monday, January 23, 2012 07:54:16 PM Andrew Hannon wrote:
> It is worth noting that the slave (seemingly) catches up eventually,
> recovering later log files with streaming replication current. Can I trust
> this state?

Should be able to. The master will also retry the logs and eventually ship them all, in my experience.
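One quick way to confirm that the master's archive backlog has in fact cleared is to look in the archive status directory on the master; each WAL segment still waiting for archive_command has a corresponding .ready file. The PGDATA path below is just an example:

# Run on the master; adjust PGDATA to your installation.
PGDATA=/pgdata/9.0/data

# Segments not yet successfully archived show up as .ready files.
ls "$PGDATA/pg_xlog/archive_status/" | grep -c '\.ready$'

# A count of 0 means the archiver has caught up after the snapshot window.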
On Mon, Jan 23, 2012 at 8:02 PM, Alan Hodgson <ahodgson@simkin.ca> wrote:
> On Monday, January 23, 2012 07:54:16 PM Andrew Hannon wrote:
>> It is worth noting that the slave (seemingly) catches up eventually,
>> recovering later log files with streaming replication current. Can I trust
>> this state?
>
> Should be able to. The master will also retry the logs and eventually
> ship them all, in my experience.

Right, as long as the failure is temporary, the master should retry and things should work themselves out. It's good to have some level of monitoring in place for such operations to make sure replay doesn't get stalled.

That said, have you tested this backup? I'm a little concerned you'll end up with something unusable, because you aren't capturing the xlog files that are generated during the snapshot window. It's possible that you won't need them in most cases (we have a script called "zbackup" [1] which goes through similar motions using ZFS, though on ZFS the snapshot really is instantaneous), and I can't remember a time when we got stuck by that, but that might just be faulty memory.

A better approach would probably be to take the omnipitr code [2], which already has provisions for building slaves from backups and catching the appropriate WAL files, and rewrite the rsync bits to use snapshots instead; that would give you some assurance against missing files.

[1] This script is old and crufty, but provides a good example: http://labs.omniti.com/labs/pgtreats/browser/trunk/tools/zbackup.sh
[2] https://github.com/omniti-labs/omnipitr

Robert Treat
conjecture: xzilla.net
consulting: omniti.com
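On the monitoring point, a rough sketch of a streaming-replication lag check for 9.0 follows; the host names and connection options are placeholders, and the queries use the standard 9.0 monitoring functions (pg_current_xlog_location, pg_last_xlog_receive_location, pg_last_xlog_replay_location):

#!/bin/bash
# Rough streaming-replication lag check for PostgreSQL 9.0.
# Hosts and credentials are placeholders.

MASTER_CONN="-h master.example.com -U postgres"
SLAVE_CONN="-h slave.example.com -U postgres"

# Current WAL write position on the master.
master_loc=$(psql $MASTER_CONN -At -c "SELECT pg_current_xlog_location();")

# Last WAL positions received and replayed on the slave.
slave_recv=$(psql $SLAVE_CONN -At -c "SELECT pg_last_xlog_receive_location();")
slave_replay=$(psql $SLAVE_CONN -At -c "SELECT pg_last_xlog_replay_location();")

echo "master current: $master_loc"
echo "slave receive:  $slave_recv"
echo "slave replay:   $slave_replay"

# If the slave's replay position stays well behind the master across several
# runs, replay has probably stalled and is worth investigating.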