Thread: Replication/cloning: rsync vs modification dates?
I'm speccing up a three-node database for reliability, making use of streaming replication, and it's all working but I have a bit of a performance concern. Suppose a node dies and is removed from the cluster, but then returns (say, a day or two later). I could, of course, utterly wipe the existing data on that node and take a fresh copy from the master, but that would entail transferring the entire content of the database. The recommended option appears to be rsync, which saves on network traffic, but still has to read and hash every byte of data. Can the individual files' modification timestamps be relied upon? If so, it'd potentially mean a lot of savings, as the directory entries can be read fairly efficiently. I could still then use rsync to transfer those files (so if it's only a small part that's changed, we take advantage of its optimizations too). This may be digging too deep into the internals to be dependable for future versions. If so, I'd rather put the extra load on the servers than risk a future upgrade breaking replication subtly. Chris Angelico
On 7/16/12, Chris Angelico <rosuav@gmail.com> wrote: > I'm speccing up a three-node database for reliability, making use of > streaming replication, and it's all working but I have a bit of a > performance concern. > > > Can the individual files' modification timestamps be relied upon? If > so, it'd potentially mean a lot of savings, as the directory entries > can be read fairly efficiently. I could still then use rsync to > transfer those files (so if it's only a small part that's changed, we > take advantage of its optimizations too). I did several weeks of tests on 9.1.3 using mod time and file size rather than checksumming the files, that did not appear to cause any problems and it sped up the rsync considerably. (This was about a 40 GB database.) -- Mike Nolan
On Tue, Jul 17, 2012 at 1:40 AM, Michael Nolan <htfoot@gmail.com> wrote: > I did several weeks of tests on 9.1.3 using mod time and file size > rather than checksumming the files, that did not appear to cause any problems > and it sped up the rsync considerably. (This was about a 40 GB database.) Thanks! Is file size a necessary part of the check, or can mod time alone cover it? I'm looking at having my monitoring application automatically bring database nodes up, so it looks like the simplest way to handle it will be to have the new slave mandatorily do the backup/rsync, even if it's been down for only a couple of minutes. With a mod time check, I could hopefully do this without too much hassle. ChrisA
On 7/16/12, Chris Angelico <rosuav@gmail.com> wrote: > On Tue, Jul 17, 2012 at 1:40 AM, Michael Nolan <htfoot@gmail.com> wrote: >> I did several weeks of tests on 9.1.3 using mod time and file size >> rather than checksumming the files, that did not appear to cause any >> problems >> and it sped up the rsync considerably. (This was about a 40 GB >> database.) > > Thanks! Is file size a necessary part of the check, or can mod time > alone cover it? > > I'm looking at having my monitoring application automatically bring > database nodes up, so it looks like the simplest way to handle it will > be to have the new slave mandatorially do the backup/rsync, even if > it's been down for only a couple of minutes. With a mod time check, I > could hopefully do this without too much hassle. As I understand the docs for rsync, it will use both mod time and file size if told not to do checksums. -- Mike Nolan
On Tue, Jul 17, 2012 at 1:58 AM, Michael Nolan <htfoot@gmail.com> wrote: > As I understand the docs for rsync, it will use both mod time and file size > if told not to do checksums. Oh, so it does, I misread. Thanks! Time+size it is. ChrisA
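For readers following along, here is a rough sketch of what a size-plus-mod-time comparison looks like. It only illustrates the idea discussed above and is not rsync's actual implementation; the function name `quick_check_differs` and the choice to compare whole seconds are assumptions made for this example.

```python
import os
import tempfile


def quick_check_differs(src_path, dst_path):
    """Return True if dst should be re-transferred under a size+mtime test.

    Hypothetical helper: no file contents are read, only directory metadata.
    """
    if not os.path.exists(dst_path):
        return True
    src = os.stat(src_path)
    dst = os.stat(dst_path)
    # Assumption for this sketch: timestamps are compared at one-second
    # granularity. Real tools and filesystems may be finer or coarser.
    same_size = src.st_size == dst.st_size
    same_mtime = int(src.st_mtime) == int(dst.st_mtime)
    return not (same_size and same_mtime)


def make_file(path, payload, mtime):
    """Write payload to path and force its access/modification time."""
    with open(path, "wb") as handle:
        handle.write(payload)
    os.utime(path, (mtime, mtime))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        master_copy = os.path.join(workdir, "master_segment")
        slave_copy = os.path.join(workdir, "slave_segment")
        make_file(master_copy, b"a" * 8192, 1000)
        # Same size and mtime, different bytes: the metadata test sees no change.
        make_file(slave_copy, b"b" * 8192, 1000)
        print("needs transfer:", quick_check_differs(master_copy, slave_copy))
```

The demo deliberately uses two files with different contents but identical size and mod time, which is exactly the case a metadata-only check cannot see.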
On Mon, Jul 16, 2012 at 8:01 PM, Chris Angelico <rosuav@gmail.com> wrote: > On Tue, Jul 17, 2012 at 1:58 AM, Michael Nolan <htfoot@gmail.com> wrote: >> As I understand the docs for rsync, it will use both mod time and file size >> if told not to do checksums. I wonder if it is correct in general to use mtime and size to perform these checks from the point of view of PostgreSQL. If it works with the current version then is there a guaranty that it will work with the future versions? > > Oh, so it does, I misread. Thanks! Time+size it is. > > ChrisA -- Sergey Konoplev a database architect, software developer at PostgreSQL-Consulting.com http://www.postgresql-consulting.com Jabber: gray.ru@gmail.com Skype: gray-hemp Phone: +79160686204
On 7/16/12, Sergey Konoplev <sergey.konoplev@postgresql-consulting.com> wrote: > On Mon, Jul 16, 2012 at 8:01 PM, Chris Angelico <rosuav@gmail.com> wrote: >> On Tue, Jul 17, 2012 at 1:58 AM, Michael Nolan <htfoot@gmail.com> wrote: >>> As I understand the docs for rsync, it will use both mod time and file >>> size >>> if told not to do checksums. > > I wonder if it is correct in general to use mtime and size to perform > these checks from the point of view of PostgreSQL. > > If it works with the current version then is there a guaranty that it > will work with the future versions? There are many things for which there is no guarantee of future compatibility (or sufficiency). For that matter, there's really no assurance that timestamp+size is sufficient NOW. But checksums aren't 100% reliable, either. Without doing a byte-by-byte comparison of two files, there's no way to ensure they are identical. -- Mike Nolan
On Tue, Jul 17, 2012 at 4:35 AM, Sergey Konoplev <sergey.konoplev@postgresql-consulting.com> wrote: > On Mon, Jul 16, 2012 at 8:01 PM, Chris Angelico <rosuav@gmail.com> wrote: >> On Tue, Jul 17, 2012 at 1:58 AM, Michael Nolan <htfoot@gmail.com> wrote: >>> As I understand the docs for rsync, it will use both mod time and file size >>> if told not to do checksums. > > I wonder if it is correct in general to use mtime and size to perform > these checks from the point of view of PostgreSQL. > > If it works with the current version then is there a guaranty that it > will work with the future versions? That was my exact question. Ideally, I'd like to hear from someone who works with the Postgres internals, but the question may not even be possible to answer. ChrisA
I think it's pretty easy to show that timestamp+size isn't good enough to do this 100% reliably. Imagine that your timestamps have a millisecond resolution. I assume this will vary based on OS / filesystem, but the point remains the same no matter what the resolution is. You can have multiple writes occur in the same quantized "instant". If the prior rsync just happened to catch the first write (at T+0.1ms) in that instant but not the second (which happened at T+0.4ms), the second may not be transferred. But the modification time is the same for the two writes. All that said, I think the chance of this actually happening is vanishingly small. I personally use rsync without checksums and have had no problems. On Jul 16, 2012, at 2:42 PM, Chris Angelico wrote: > On Tue, Jul 17, 2012 at 4:35 AM, Sergey Konoplev > <sergey.konoplev@postgresql-consulting.com> wrote: >> On Mon, Jul 16, 2012 at 8:01 PM, Chris Angelico <rosuav@gmail.com> wrote: >>> On Tue, Jul 17, 2012 at 1:58 AM, Michael Nolan <htfoot@gmail.com> wrote: >>>> As I understand the docs for rsync, it will use both mod time and file size >>>> if told not to do checksums. >> >> I wonder if it is correct in general to use mtime and size to perform >> these checks from the point of view of PostgreSQL. >> >> If it works with the current version then is there a guaranty that it >> will work with the future versions? > > That was my exact question. Ideally, I'd like to hear from someone who > works with the Postgres internals, but the question may not even be > possible to answer. > > ChrisA
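The quantization argument in this message can be worked through with a few lines of integer arithmetic. This is only a toy model: the function names, the microsecond units, and the 8192-byte size are assumptions picked for illustration, and nothing here touches a real filesystem.

```python
def stored_mtime_us(write_time_us, resolution_us):
    """Round a write time down to the timestamp resolution (all in microseconds)."""
    return (write_time_us // resolution_us) * resolution_us


def signature(write_time_us, size_bytes, resolution_us=1000):
    """The (mtime, size) pair a metadata-only comparison would see.

    The default resolution of 1000 us models the millisecond timestamps
    from the example in the message above.
    """
    return (stored_mtime_us(write_time_us, resolution_us), size_bytes)


# Two in-place writes to a fixed-size file within the same millisecond:
# one at T+0.1 ms (caught by the earlier copy) and one at T+0.4 ms (missed).
first_write = signature(100, 8192)
second_write = signature(400, 8192)

# With millisecond timestamps the two states are indistinguishable by
# mtime+size, so the second write would not trigger a transfer.
indistinguishable = first_write == second_write

# A finer (hypothetical) 100 us resolution would tell these two apart,
# but the same collision is still possible inside each 100 us window.
distinguishable_at_100us = signature(100, 8192, 100) != signature(400, 8192, 100)

if __name__ == "__main__":
    print("collide at 1 ms resolution:", indistinguishable)
    print("differ at 100 us resolution:", distinguishable_at_100us)
```

Shrinking the resolution narrows the window but never closes it, which is why the later replies lean on WAL replay rather than on timestamps alone.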
On 07/16/12 2:42 PM, Chris Angelico wrote: > On Tue, Jul 17, 2012 at 4:35 AM, Sergey Konoplev > <sergey.konoplev@postgresql-consulting.com> wrote: >> I wonder if it is correct in general to use mtime and size to perform >> these checks from the point of view of PostgreSQL. >> >> If it works with the current version then is there a guaranty that it >> will work with the future versions? > That was my exact question. Ideally, I'd like to hear from someone who > works with the Postgres internals, but the question may not even be > possible to answer. as much as anything else, this is dependent on your OS properly updating mtime on an open file that's getting random writes. -- john r pierce N 37, W 122 santa cruz ca mid-left coast
On 7/16/12, Steven Schlansker <steven@likeness.com> wrote: > I think it's pretty easy to show that timestamp+size isn't good enough to do > this 100% reliably. That may not be a problem if the slave server synchronization code always starts to play back WAL entries at a time before the worst case for timestamp precision. I'm assuming here that the WAL playback process works something like this: Look at a WAL entry, see if the disk block it references matches the 'before' indicators for that block in the WAL. If so, update it to the 'after' data content. There are two non-matching conditions: If the disk block information indicates that it should match a later update, then that block does not need to be updated. But if the disk block information indicates that it should match an earlier update than the one in the WAL entry, then the synchronization fails.
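The three cases in the message above can be written out as a toy model. To be clear, this models only the poster's stated assumption about playback; it is not PostgreSQL's actual redo code, and every name here (`Block`, `WalEntry`, `replay`, the integer `version` counter) is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Block:
    version: int  # hypothetical per-block update counter
    data: str


@dataclass
class WalEntry:
    before_version: int  # block state the entry expects to find
    after_version: int   # block state after the entry is applied
    after_data: str


class ReplayError(Exception):
    """Raised when a block is older than the WAL entry can be applied to."""


def replay(block, entry):
    """Apply one entry to one block, following the three cases described."""
    if block.version == entry.before_version:
        # Matching case: the block is exactly in the 'before' state.
        block.version = entry.after_version
        block.data = entry.after_data
        return "applied"
    if block.version < entry.before_version:
        # The block predates the entry's starting point: playback cannot proceed.
        raise ReplayError("block is older than the WAL entry expects")
    # The block already reflects this update or a later one: nothing to do.
    return "skipped"


if __name__ == "__main__":
    entry = WalEntry(before_version=1, after_version=2, after_data="new")
    print(replay(Block(version=1, data="old"), entry))    # block matches 'before'
    print(replay(Block(version=3, data="newer"), entry))  # block is already ahead
```

Under this model, a block copied slightly "too new" by rsync is harmless, while a block that is too old is only safe if playback starts from an early enough WAL position, which is the condition the message describes.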
On Tue, Jul 17, 2012 at 07:42:38AM +1000, Chris Angelico wrote: > On Tue, Jul 17, 2012 at 4:35 AM, Sergey Konoplev > <sergey.konoplev@postgresql-consulting.com> wrote: > > On Mon, Jul 16, 2012 at 8:01 PM, Chris Angelico <rosuav@gmail.com> wrote: > >> On Tue, Jul 17, 2012 at 1:58 AM, Michael Nolan <htfoot@gmail.com> wrote: > >>> As I understand the docs for rsync, it will use both mod time and file size > >>> if told not to do checksums. > > > > I wonder if it is correct in general to use mtime and size to perform > > these checks from the point of view of PostgreSQL. > > > > If it works with the current version then is there a guaranty that it > > will work with the future versions? > > That was my exact question. Ideally, I'd like to hear from someone who > works with the Postgres internals, but the question may not even be > possible to answer. You might want to look at the hackers list thread I started about the same topic a week before your post: http://archives.postgresql.org/pgsql-hackers/2012-07/msg00416.php Basically, you can only use mtime/size if you are replaying WAL. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Fri, Jul 27, 2012 at 9:53 AM, Bruce Momjian <bruce@momjian.us> wrote: > You might want to look at the hackers list thread I started about the > same topic a week before your post: > > http://archives.postgresql.org/pgsql-hackers/2012-07/msg00416.php > > Basically, you can only use mtime/size if you are replaying WAL. I'll check that out in a bit; but hot standby includes replaying WAL, right? That's what we're doing - full live replication with possibility to "pg_ctl promote" a slave straight up to master. ChrisA
On Fri, Jul 27, 2012 at 09:57:55AM +1000, Chris Angelico wrote: > On Fri, Jul 27, 2012 at 9:53 AM, Bruce Momjian <bruce@momjian.us> wrote: > > You might want to look at the hackers list thread I started about the > > same topic a week before your post: > > > > http://archives.postgresql.org/pgsql-hackers/2012-07/msg00416.php > > > > Basically, you can only use mtime/size if you are replaying WAL. > > I'll check that out in a bit; but hot standby includes replaying WAL, > right? That's what we're doing - full live replication with > possibility to "pg_ctl promote" a slave straight up to master. Yes, WAL is replayed in that case and any sub-second changes are going to be replayed from the WAL log. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Fri, Jul 27, 2012 at 9:57 AM, Chris Angelico <rosuav@gmail.com> wrote: > On Fri, Jul 27, 2012 at 9:53 AM, Bruce Momjian <bruce@momjian.us> wrote: >> You might want to look at the hackers list thread I started about the >> same topic a week before your post: >> >> http://archives.postgresql.org/pgsql-hackers/2012-07/msg00416.php >> >> Basically, you can only use mtime/size if you are replaying WAL. > > I'll check that out in a bit; but hot standby includes replaying WAL, > right? That's what we're doing - full live replication with > possibility to "pg_ctl promote" a slave straight up to master. Hi, thanks for that link. Just got a chance to read through the thread. In this post[1] the script executes "checkpoint" before "pg_start_backup" - is that important? According to the docs[2]: "There is an optional second parameter of type boolean. If true, it specifies executing pg_start_backup as quickly as possible. This forces an immediate checkpoint which will cause a spike in I/O operations, slowing any concurrently executing queries." Is "checkpoint; select pg_start_backup('foo');" the same as "select pg_start_backup('foo',true);"? And what are the consequences of not calling for a checkpoint that way? My understanding of the docs is that the pg_start_backup call will hang until a checkpoint happens organically, ie delaying the backup rather than other clients, but I'm not really sure and haven't a sample database big or busy enough to test this on. Other than that, I think our current setup is fine. I have a script that, every time a computer attempts to join the cluster, redoes the "start backup, rsync, stop backup" sequence. I'm depending on (and assuming) the correct transfer of the last bit of log via the replication link, as soon as the new slave starts up - presumably this'll all be provided from wal_keep_segments. Again, thanks for the pointer! A good read. 
ChrisA [1] http://archives.postgresql.org/pgsql-hackers/2012-07/msg00417.php [2] http://www.postgresql.org/docs/9.1/static/functions-admin.html
On Fri, Jul 27, 2012 at 02:13:31PM +1000, Chris Angelico wrote: > On Fri, Jul 27, 2012 at 9:57 AM, Chris Angelico <rosuav@gmail.com> wrote: > > On Fri, Jul 27, 2012 at 9:53 AM, Bruce Momjian <bruce@momjian.us> wrote: > >> You might want to look at the hackers list thread I started about the > >> same topic a week before your post: > >> > >> http://archives.postgresql.org/pgsql-hackers/2012-07/msg00416.php > >> > >> Basically, you can only use mtime/size if you are replaying WAL. > > > > I'll check that out in a bit; but hot standby includes replaying WAL, > > right? That's what we're doing - full live replication with > > possibility to "pg_ctl promote" a slave straight up to master. > > Hi, thanks for that link. Just got a chance to read through the thread. > > In this post[1] the script executes "checkpoint" before > "pg_start_backup" - is that important? According to the docs[2]: > > "There is an optional second parameter of type boolean. If true, it > specifies executing pg_start_backup as quickly as possible. This > forces an immediate checkpoint which will cause a spike in I/O > operations, slowing any concurrently executing queries." A checkpoint is always issued by pg_start_backup(). The boolean controls whether the checkpoint is immediate or smoothed, meaning it can take a while to return a status of complete. > Is "checkpoint; select pg_start_backup('foo');" the same as "select > pg_start_backup('foo',true);"? And what are the consequences of not > calling for a checkpoint that way? My understanding of the docs is > that the pg_start_backup call will hang until a checkpoint happens > organically, ie delaying the backup rather than other clients, but I'm > not really sure and haven't a sample database big or busy enough to > test this on. Right, checkpoint is started by pg_start_backup() but is smoothed by default. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
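As a summary of the "start backup, rsync, stop backup" sequence discussed in this thread, here is a minimal sketch that only builds the commands and prints them, without running anything. The host name, data directory, backup label, and the rsync exclusion are placeholders chosen for this example, and the function `resync_plan` is hypothetical. The SQL function names are the ones in the 9.1 documentation linked above.

```python
import shlex


def resync_plan(master_host, data_dir, label="resync", immediate_checkpoint=True):
    """Return the three commands of the re-sync sequence as argv lists.

    Assumes the standby's server is stopped and that the commands are run
    on the standby, pulling from the master. Nothing is executed here.
    """
    # Second argument of pg_start_backup: true requests an immediate
    # checkpoint, false leaves it smoothed (the default behaviour).
    flag = "true" if immediate_checkpoint else "false"
    start_sql = "SELECT pg_start_backup('%s', %s);" % (label, flag)
    stop_sql = "SELECT pg_stop_backup();"
    return [
        ["psql", "-h", master_host, "-c", start_sql],
        # No --checksum option, so rsync falls back to its size and mod-time
        # comparison. Excluding postmaster.pid is an assumption of this sketch.
        ["rsync", "-a", "--delete", "--exclude=postmaster.pid",
         "%s:%s/" % (master_host, data_dir), "%s/" % data_dir],
        ["psql", "-h", master_host, "-c", stop_sql],
    ]


if __name__ == "__main__":
    # Placeholder host and path, for illustration only.
    for argv in resync_plan("master.example", "/var/lib/postgresql/9.1/main"):
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Passing `immediate_checkpoint=False` gives the smoothed checkpoint, in which case the first command may simply take longer to return, as explained in the reply above.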