Thread: PG 9.0 EBS Snapshot Backups on Slave
Hello,

I am playing with a script that implements physical backups by snapshotting the EBS-backed software RAID. My basic workflow is this:

1. Stop PG on the slave
2. pg_start_backup on the master
3. On the slave:
   A. unmount the PG RAID
   B. snapshot each disk in the RAID
   C. mount the PG RAID
4. pg_stop_backup
5. Restart PG on the slave

Step 3 is actually quite fast; however, on the master I end up seeing the following warning:

WARNING: transaction log file "00000001000000CC00000076" could not be archived: too many failures

I am guessing (I will confirm with timestamps later) that this warning happens during steps 3A-3C; however, my questions below stand regardless of when the failure occurs. It is worth noting that the slave (seemingly) catches up eventually, recovering later log files with streaming replication current. Can I trust this state?

Should I be concerned about this warning? Is it a simple blip that can safely be ignored, or have I lost data? From googling, it looks like the number of retry attempts is not a configurable parameter (it appears to have retried a handful of times).

If this is indeed a real problem, am I best off changing my archive_command to retain logs in a transient location while I am in "snapshot mode", and then ship them in bulk once the snapshot has completed? Are there any other remedies that I am missing?

Thank you very much for your time,

Andrew Hannon
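For reference, here is a minimal sketch of the workflow above as a shell script. All names in it (mount point, data directory, EBS volume IDs, master host) are hypothetical placeholders, the ec2-create-snapshot call stands in for whatever snapshot tooling is actually in use, and error handling is omitted:

#!/bin/bash
# Sketch of the snapshot-backup workflow described above (PostgreSQL 9.0).
# All paths, hosts, and volume IDs are placeholders.
set -e

MASTER="postgres@master.example.com"
PG_MOUNT="/pgdata"
PGDATA="$PG_MOUNT/9.0/data"
EBS_VOLUMES="vol-aaaa1111 vol-bbbb2222 vol-cccc3333 vol-dddd4444"

# 1. Stop PG on the slave
pg_ctl -D "$PGDATA" -m fast stop

# 2. Start the backup on the master
ssh "$MASTER" "psql -c \"SELECT pg_start_backup('ebs-snapshot');\""

# 3A. Unmount the PG RAID on the slave
umount "$PG_MOUNT"

# 3B. Snapshot each disk in the RAID
for vol in $EBS_VOLUMES; do
    ec2-create-snapshot "$vol" -d "pg-backup $(date +%Y%m%d)"
done

# 3C. Remount the PG RAID (assumes an /etc/fstab entry for the mount point)
mount "$PG_MOUNT"

# 4. Stop the backup on the master
ssh "$MASTER" "psql -c 'SELECT pg_stop_backup();'"

# 5. Restart PG on the slave
pg_ctl -D "$PGDATA" start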
On Monday, January 23, 2012 07:54:16 PM Andrew Hannon wrote:
> It is worth noting that the slave (seemingly) catches up eventually,
> recovering later log files with streaming replication current. Can I trust
> this state?

Should be able to. The master will also retry the logs and eventually ship them all, in my experience.
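One quick way to confirm that the master's archive backlog has in fact cleared is to look in the archive status directory on the master; each WAL segment still waiting for archive_command has a corresponding .ready file. The PGDATA path below is just an example:

# Run on the master; adjust PGDATA to your installation.
PGDATA=/pgdata/9.0/data

# Segments not yet successfully archived show up as .ready files.
ls "$PGDATA/pg_xlog/archive_status/" | grep -c '\.ready$'

# A count of 0 means the archiver has caught up after the snapshot window.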
On Mon, Jan 23, 2012 at 8:02 PM, Alan Hodgson <ahodgson@simkin.ca> wrote:
> On Monday, January 23, 2012 07:54:16 PM Andrew Hannon wrote:
>> It is worth noting that the slave (seemingly) catches up eventually,
>> recovering later log files with streaming replication current. Can I trust
>> this state?
>
> Should be able to. The master will also retry the logs and eventually
> ship them all, in my experience.

Right, as long as the failure is temporary, the master should retry and things should work themselves out. It's good to have some level of monitoring in place for such operations to make sure replay doesn't get stalled.

That said, have you tested this backup? I'm a little concerned you'll end up with something unusable, because you aren't capturing the xlog files that are generated during the snapshot window. It's possible that you won't need them in most cases (we have a script called "zbackup" [1] which goes through similar motions using ZFS, though on ZFS the snapshot really is instantaneous), and I can't remember a time when we got stuck by that, but that might just be faulty memory.

A better approach would probably be to take the omnipitr code [2], which already has provisions for building slaves from backups and catching the appropriate WAL files, and rewrite the rsync bits to use snapshots instead; that would give you some assurance against missing files.

[1] This script is old and crufty, but provides a good example: http://labs.omniti.com/labs/pgtreats/browser/trunk/tools/zbackup.sh
[2] https://github.com/omniti-labs/omnipitr

Robert Treat
conjecture: xzilla.net
consulting: omniti.com
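On the monitoring point, a rough sketch of a streaming-replication lag check for 9.0 follows; the host names and connection options are placeholders, and the queries use the standard 9.0 monitoring functions (pg_current_xlog_location, pg_last_xlog_receive_location, pg_last_xlog_replay_location):

#!/bin/bash
# Rough streaming-replication lag check for PostgreSQL 9.0.
# Hosts and credentials are placeholders.

MASTER_CONN="-h master.example.com -U postgres"
SLAVE_CONN="-h slave.example.com -U postgres"

# Current WAL write position on the master.
master_loc=$(psql $MASTER_CONN -At -c "SELECT pg_current_xlog_location();")

# Last WAL positions received and replayed on the slave.
slave_recv=$(psql $SLAVE_CONN -At -c "SELECT pg_last_xlog_receive_location();")
slave_replay=$(psql $SLAVE_CONN -At -c "SELECT pg_last_xlog_replay_location();")

echo "master current: $master_loc"
echo "slave receive:  $slave_recv"
echo "slave replay:   $slave_replay"

# If the slave's replay position stays well behind the master across several
# runs, replay has probably stalled and is worth investigating.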