Re: [BUGS] BUG #14230: Wrong timeline returned by pg_stop_backup on a standby - Mailing list pgsql-hackers

From	Marco Nenciarini
Subject	Re: [BUGS] BUG #14230: Wrong timeline returned by pg_stop_backup on a standby
Date	July 8, 2016 09:40:50
Msg-id	d18fd50a-71a7-22f8-eb79-583e6b68a6f0@2ndquadrant.it
In response to	Re: [BUGS] BUG #14230: Wrong timeline returned by pg_stop_backup on a standby (Michael Paquier <michael.paquier@gmail.com>)
Responses	Re: [BUGS] BUG #14230: Wrong timeline returned by pg_stop_backup on a standby
List	pgsql-hackers

On 07/07/16 08:38, Michael Paquier wrote:
> On Thu, Jul 7, 2016 at 12:57 AM, Marco Nenciarini
> <marco.nenciarini@2ndquadrant.it> wrote:
>> After further analysis, the issue is that we retrieve the starttli from
>> the ControlFile structure, but it was using ThisTimeLineID when writing
>> the backup label.
>>
>> I've attached a very simple patch that fixes it.
>
> ThisTimeLineID is always set at 0 on purpose on a standby, so we
> cannot rely on it (well it is set temporarily when recycling old
> segments). At recovery when parsing the backup_label file there is no
> actual use of the start segment name, so that's only a cosmetic
> change. But surely it would be better to get that fixed, because
> that's useful for debugging.
>
> While looking at your patch, I thought that it would have been
> tempting to use GetXLogReplayRecPtr() to get the timeline ID when in
> recovery, but what we really want to know here is the timeline of the
> last REDO pointer, which is starttli, and that's more consistent with
> the fact that we use startpoint when writing the backup_label file. In
> short, +1 for this fix.
>
> I am adding that in the list of open items, adding Magnus in CC whose
> commit for non-exclusive backups is at the origin of this defect.
>

While we were testing the patch, we noticed another behavior that is not
strictly a bug, but can confuse backup tools.

To quickly produce some WAL files we ran a series of pg_switch_xlog()
calls, each followed by a CHECKPOINT, and we noticed that taking a
backup from a standby right after that yields a startpoint higher than
the stoppoint.

Let me show it on a brand new master/replica cluster (master is port
5496, replica is 6496). The script is attached.

-------------------------------------------------------------------
You are now connected to database "postgres" as user "postgres" via
socket in "/tmp" at port "5496".
SELECT pg_is_in_recovery();
-[ RECORD 1 ]-----+--
pg_is_in_recovery | f

CHECKPOINT;
CHECKPOINT
SELECT pg_switch_xlog();
-[ RECORD 1 ]--+----------
pg_switch_xlog | 0/30000E8

CHECKPOINT;
CHECKPOINT
SELECT pg_switch_xlog();
-[ RECORD 1 ]--+----------
pg_switch_xlog | 0/40000E8

You are now connected to database "postgres" as user "postgres" via
socket in "/tmp" at port "6496".
SELECT pg_is_in_recovery();
-[ RECORD 1 ]-----+--
pg_is_in_recovery | t

SELECT pg_start_backup('tst backup',TRUE,FALSE);
-[ RECORD 1 ]---+----------
pg_start_backup | 0/4000028

SELECT * FROM pg_stop_backup(FALSE);
-[ RECORD 1 ]-------------------------------------------------------------
lsn        | 0/20000F8
labelfile  | START WAL LOCATION: 0/4000028 (file 000000000000000000000004)+
           | CHECKPOINT LOCATION: 0/4000060                               +
           | BACKUP METHOD: streamed                                      +
           | BACKUP FROM: standby                                         +
           | START TIME: 2016-07-08 10:46:55 CEST                         +
           | LABEL: tst backup                                            +
           |
spcmapfile |

SELECT * FROM pg_control_checkpoint();
-[ RECORD 1 ]--------+-------------------------
checkpoint_location  | 0/4000060
prior_location       | 0/2000060
redo_location        | 0/4000028
redo_wal_file        | 000000010000000000000004
timeline_id          | 1
prev_timeline_id     | 1
full_page_writes     | t
next_xid             | 0:865
next_oid             | 12670
next_multixact_id    | 1
next_multi_offset    | 0
oldest_xid           | 858
oldest_xid_dbid      | 1
oldest_active_xid    | 865
oldest_multi_xid     | 1
oldest_multi_dbid    | 1
oldest_commit_ts_xid | 865
newest_commit_ts_xid | 865
checkpoint_time      | 2016-07-08 10:46:55+02

SELECT * FROM pg_control_recovery();
-[ RECORD 1 ]-----------------+----------
min_recovery_end_location     | 0/20000F8
min_recovery_end_timeline     | 1
backup_start_location         | 0/0
backup_end_location           | 0/0
end_of_backup_record_required | f

-------------------------------------------------------------------

In particular, the pg_start_backup LSN is 0/4000028, while the
pg_stop_backup LSN is 0/20000F8, which is lower than the start.


The same issue is present when you do a backup using pg_basebackup:

-------------------------------------------------------------------
transaction log start point: 0/8000028 on timeline 1
pg_basebackup: starting background WAL receiver
22244/22244 kB (100%), 1/1 tablespace
transaction log end point: 0/20000F8
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: base backup completed
-------------------------------------------------------------------

The resulting backup works perfectly, because Postgres has no use for
the pg_stop_backup LSN, but this can confuse any tool that uses the stop
LSN to figure out which WAL files are needed by the backup (in this case
the only file needed is the one containing the start checkpoint).
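
To make the confusion concrete, here is a minimal Python sketch of how a
backup tool might derive the list of required WAL segments from the
start and stop LSN. It is only an illustration: the helper names
(parse_lsn, wal_file_name, needed_wal_files) are hypothetical and not
taken from any existing tool, the default 16 MB segment size is assumed,
and the boundary handling is simplified.

```python
# Assumption: default WAL segment size of 16 MB.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024
SEGMENTS_PER_XLOGID = 0x100000000 // WAL_SEGMENT_SIZE


def parse_lsn(text):
    """Convert an LSN such as '0/4000028' into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def wal_file_name(timeline, segno):
    """Build the 24-character WAL file name for a timeline and segment."""
    return "%08X%08X%08X" % (
        timeline,
        segno // SEGMENTS_PER_XLOGID,
        segno % SEGMENTS_PER_XLOGID,
    )


def needed_wal_files(timeline, start_lsn, stop_lsn):
    """Hypothetical helper: list every segment from start to stop."""
    first = parse_lsn(start_lsn) // WAL_SEGMENT_SIZE
    last = parse_lsn(stop_lsn) // WAL_SEGMENT_SIZE
    return [wal_file_name(timeline, s) for s in range(first, last + 1)]


# With the values from the session above the range is empty, so a naive
# tool would conclude that the backup needs no WAL at all.
print(needed_wal_files(1, "0/4000028", "0/20000F8"))  # prints []
```

Note that wal_file_name(1, 4) gives 000000010000000000000004, the
redo_wal_file reported by pg_control_checkpoint(), while passing
timeline 0 gives the name shown in the labelfile above.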

After some discussion with Álvaro, my proposal is to avoid that by
returning as the stoppoint the maximum of the startpoint and the
min_recovery_end_location when the backup is taken from a standby.
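
The rule itself can be sketched in a few lines of Python. This is not
the patch (which is C code in the server), only an illustration of the
proposed behavior; the function names are hypothetical.

```python
def parse_lsn(text):
    """Convert an LSN such as '0/4000028' into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def format_lsn(value):
    """Render a 64-bit integer back into the 'X/X' LSN notation."""
    return "%X/%X" % (value >> 32, value & 0xFFFFFFFF)


def standby_stop_point(start_lsn, min_recovery_end):
    """Hypothetical helper: never report a stop point before the start."""
    return format_lsn(max(parse_lsn(start_lsn), parse_lsn(min_recovery_end)))


# Values from the session above: the stop point is raised to the start.
print(standby_stop_point("0/4000028", "0/20000F8"))  # prints 0/4000028
```

When min_recovery_end_location is already past the startpoint, the
function returns it unchanged, so the normal case is not affected.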

The patch is once again a very simple one-line diff.

I have attached both patches to this email, as in my opinion they should
go together, because the subject is the same: avoid giving misleading
information to backup tools.

Regards,
Marco

--
Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
marco.nenciarini@2ndQuadrant.it | www.2ndQuadrant.it