Re: [BUG] pg_basebackup from disconnected standby fails - Mailing list pgsql-hackers
From | Kyotaro HORIGUCHI |
---|---|
Subject | Re: [BUG] pg_basebackup from disconnected standby fails |
Date | |
Msg-id | 20160614.203111.229211034.horiguchi.kyotaro@lab.ntt.co.jp Whole thread Raw |
In response to | Re: [BUG] pg_basebackup from disconnected standby fails (Michael Paquier <michael.paquier@gmail.com>) |
Responses |
Re: [BUG] pg_basebackup from disconnected standby fails
Re: [BUG] pg_basebackup from disconnected standby fails Re: [BUG] pg_basebackup from disconnected standby fails |
List | pgsql-hackers |
Hello, thank you for looking this. At Fri, 10 Jun 2016 17:39:59 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqTv5gmKQcNDoFGTGqoqXz2xLz4RRw247oqOJzZTVy6-7Q@mail.gmail.com> > On Thu, Jun 9, 2016 at 9:55 PM, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > Hello, I found that pg_basebackup from a replication standby > > fails after the following steps, on 9.3 and the master. > > > > - start a replication master > > - start a replication standby > > - stop the master in the mode other than immediate. > > > > pg_basebackup to the standby will fail with the following error. > > > >> pg_basebackup: could not get transaction log end position from server: > >> ERROR: could not find any WAL files > > Indeed, and you could just do the following to reproduce the failure > with the recovery test suite, so I would suggest adding this test in > the patch: > --- a/src/test/recovery/t/001_stream_rep.pl > +++ b/src/test/recovery/t/001_stream_rep.pl > @@ -24,6 +24,11 @@ $node_standby_1->start; > # pg_basebackup works on a standby). > $node_standby_1->backup($backup_name); > > +# Take a second backup of the standby while the master is offline. > +$node_master->stop; > +$node_standby_1->backup('my_backup_2'); > +$node_master->start; I'm not sure that adding the test case for a particular bug like this is appropriate. But it would be acceptable because it doesn't take long time and it is separate from standard checks. > > After looking more closely, I found that the minRecoveryPoint > > tends to be too small as the backup end point, and up to the > > record at the lastReplayedRecPtr can affect the pages on disk and > > they can go into the backup just taken. > > > > My conclusion here is that do_pg_stop_backup should return > > lastReplayedRecPtr, not minRecoveryPoint. > > I have been thinking quite a bit about this patch, and this logic > sounds quite right to me. When stopping the backup we need to let the > user know up to which point it needs to replay WAL, and relation pages > are touched up to lastReplayedEndRecPtr. Yes, but by design, the changes in buffers don't go into disk until buffer replacing occurs, which updates minRecoveryPoint. My understanding is that the problem is that a restart point that is not accompanied with buffer updates advances only the redo point of the last checkpoint and doesn't update minRecoveryPoint, which may be behind the redo point at the time. It seems to me that we could more agressively advance the minRecoveryPoint (but must not let it go too far..), but it is right for it to aim a bit smaller than the ideal location. So I proposed the patch as a solution instead of changing minRecoveryPoint's movement. As the result, the explanation for it is accurate enough, but obscuring the root cause. > This LSN could be greater > than the minimum recovery point as there is no record to mark the end > of the backup, and pg_control has normally, well surely, being backup > up last but that's not a new problem it would as well have been backup > up before the minimum recovery point has been reached... Yes. it would cause no new problem except that the amount of WAL files to be copied could be larger than ideal amount. > Still, perhaps we could improve the documentation regarding that? Say > it is recommended to enforce the minimum recovery point in pg_control > to the stop backup LSN to ensure that the backup recovers to a > consistent state when taking a backup from a standby. If I understand that correctly, I don't find a explicit mention to minimum recovery point (and should not be, I suppose) in PITR and pg_baseback pages in the documentaiton. Up to where we need to reach a consistent state from a backup is not the "Minimum recovery ending point" in pg_control. It holds the consistency point in the previous recovery. What we need to reach there for a backup is the WAL files up to the LSN returned by the (do_)pg_stop_backup(). All of the files should have been archived on master and should have been automatically transferred by pg_basebackup with -X/x option. (I don't know what to do when -X/x is not specified for pg_basebackup to standby, though..) > I am attaching an updated patch with the test btw. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
pgsql-hackers by date: