Re: Streaming a base backup from master - Mailing list pgsql-hackers
From: Greg Stark
Subject: Re: Streaming a base backup from master
Msg-id: AANLkTi=+FuGHf7BfQPZrvGWDdU1geG+JBgzA5TCPyOsa@mail.gmail.com
In response to: Re: Streaming a base backup from master (Martijn van Oosterhout <kleptog@svana.org>)
List: pgsql-hackers
On Fri, Sep 3, 2010 at 8:30 PM, Martijn van Oosterhout <kleptog@svana.org> wrote:
> rsync is not rocket science. All you need is for the receiving end to
> send a checksum for each block it has. The server side does the same
> checksum and for each block sends back "same" or "new data".

Well, rsync is closer to rocket science than that. It does rolling checksums and can handle data being moved around, which vacuum does do, so it's probably worthwhile.

*However*, I think you're all headed in the wrong direction here. I don't think rsync is what anyone should be doing with their backups at all. It still requires scanning through *all* your data even if you've only changed a small percentage (which seems to be the use case you're concerned about), and it corrupts your backup while the rsync is in progress, leaving a window with no usable backup. You could address that with rsync --compare-dest, but then you're back to needing the space and I/O for a whole backup every time, even if you're only changing small parts of the database.

The industry-standard solution that we're missing, and that we *should* be figuring out how to implement, is incremental backups.

I've actually been thinking about this recently, and I think we could do it fairly easily with our existing infrastructure. I was planning on doing it as an external utility, but it would be tempting to be able to request an incremental backup via the streaming protocol, so maybe it would be better a bit more integrated.

The way I see it there are two alternatives. You need to start by figuring out which blocks have been modified since the last backup (or a selected reference point). You can do this either by scanning every data file and picking every block with an LSN greater than the reference LSN, or by scanning the WAL since that point and accumulating a list of block numbers.

Either way, you then need to archive all those blocks into a special file format whose meta-information records which file and which block number each block represents. It would also be useful to include the reference LSN and the beginning and ending LSNs of the backup, so that when restoring we can verify that we're starting with a recent enough database and that we've replayed the right range of WAL to bring it to a consistent state.

It's tempting to make the incremental backup file format just a regular WAL file containing a series of special WAL records that each carry a backup block. That might be a bit confusing, since it would be a second, unrelated LSN series, but I like the idea of being able to use the same bits of code to handle the "holes" and maybe other things. On the whole, though, I think it would be just a little too weird.

--
greg
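A minimal sketch of the first alternative described above: scan a data file block by block and report every block whose page LSN is newer than the reference LSN. It assumes the standard 8 KB block size and that the page LSN occupies the first 8 bytes of each page header as an {xlogid, xrecoff} pair in host byte order (as in PostgreSQL's PageHeaderData of that era); the program itself and its command-line interface are hypothetical, not anything in the tree.

```c
/*
 * Hypothetical sketch: list blocks of a relation file whose page LSN
 * is newer than a reference LSN.  Assumes BLCKSZ = 8192 and that the
 * page LSN is the first 8 bytes of each page, as {xlogid, xrecoff}.
 */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BLCKSZ 8192

/* Reconstruct a 64-bit LSN from the on-page {xlogid, xrecoff} pair. */
static uint64_t
page_lsn(const unsigned char *page)
{
    uint32_t xlogid, xrecoff;

    memcpy(&xlogid, page, sizeof(xlogid));
    memcpy(&xrecoff, page + sizeof(xlogid), sizeof(xrecoff));
    return ((uint64_t) xlogid << 32) | xrecoff;
}

int
main(int argc, char **argv)
{
    unsigned char page[BLCKSZ];
    uint64_t    ref_lsn;
    uint32_t    blkno = 0;
    FILE       *fp;

    if (argc != 3)
    {
        fprintf(stderr, "usage: %s datafile reference-lsn-hex\n", argv[0]);
        return 1;
    }
    /* Reference LSN taken as a plain 64-bit hex value for simplicity. */
    ref_lsn = strtoull(argv[2], NULL, 16);

    if ((fp = fopen(argv[1], "rb")) == NULL)
    {
        perror(argv[1]);
        return 1;
    }

    /* Any block touched since the reference point carries a newer LSN. */
    while (fread(page, 1, BLCKSZ, fp) == BLCKSZ)
    {
        if (page_lsn(page) > ref_lsn)
            printf("block %u: LSN %016llX, include in incremental backup\n",
                   blkno, (unsigned long long) page_lsn(page));
        blkno++;
    }

    fclose(fp);
    return 0;
}
```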
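And one possible on-disk layout for the special archive format the mail describes: a header carrying the reference LSN and the beginning/ending LSNs of the backup, followed by one (file, block number, page image) record per modified block. All of these names are invented for illustration; no such format exists in the tree.

```c
/*
 * Hypothetical incremental-backup file layout, per the proposal above.
 * Header first, then a sequence of block records, each followed by
 * BLCKSZ bytes of raw page image.
 */
#include <stdint.h>

#define BLCKSZ 8192

typedef struct IncrBackupHeader
{
    uint32_t magic;      /* identifies the file format */
    uint64_t ref_lsn;    /* only blocks with LSN > ref_lsn are included */
    uint64_t start_lsn;  /* backup began here ... */
    uint64_t stop_lsn;   /* ... and ended here; WAL spanning
                          * [start_lsn, stop_lsn] must be replayed to
                          * reach a consistent state */
} IncrBackupHeader;

typedef struct IncrBackupBlock
{
    char     relpath[64];  /* data file this block belongs to */
    uint32_t blkno;        /* block number within that file */
    /* BLCKSZ bytes of page image follow immediately */
} IncrBackupBlock;
```

On restore, a tool built around this format could check that the target base backup's checkpoint LSN is at least ref_lsn before overlaying blocks, which is exactly the "recent enough database" verification the mail asks for.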