Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders - Mailing list pgsql-hackers
From | Thomas Munro |
---|---|
Subject | Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders |
Date | |
Msg-id | CAEepm=1KQyN+jd+r9X9L2D1zQhH+yofaoUQzVpTBhKm4NG90Yw@mail.gmail.com Whole thread Raw |
In response to | Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders
|
List | pgsql-hackers |
On Sun, Apr 23, 2017 at 6:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Thomas Munro <thomas.munro@enterprisedb.com> writes: >> On Sun, Apr 23, 2017 at 3:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> As for this patch itself, is it reasonable to try to assert that the >>> timeline has in fact changed? > >> The protocol doesn't include the timeline in reply messages, so it's >> not clear how the upstream server would know what timeline the standby >> thinks it's dealing with in any given reply message. The sending >> server has its own idea of the current timeline but it's not in sync >> with the stream of incoming replies. > > Fair enough. But I'd still like an explanation of why only about > half of the population is showing a failure here. Seems like every > machine should be seeing the LSN as moving backwards in this test. > So (a) why aren't they all failing, and (b) should we change the > test to make sure every platform sees that happening? Every machine sees the LSN moving backwards, but the code path that had the assertion only reached if it decides to interpolate, which is timing dependent: there needs to be a future sample in the lag tracking buffer, which I guess is not the case in those runs. For example, "jacana" is one of the animals that reported the assertion failure for 009_twophase.pl but not the one for 004_timeline_switch.pl, in this run: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2017-04-20%2021%3A00%3A20 This standby2 log fragment corresponds to the LSN zigzag manoeuvre: 2017-04-20 22:04:51.744 GMT [1984] LOG: fetching timeline history file for timeline 2 from primary server 2017-04-20 22:04:51.744 GMT [1984] LOG: started streaming WAL from primary at 0/3000000 on timeline 1 2017-04-20 22:04:51.744 GMT [1984] LOG: replication terminated by primary server 2017-04-20 22:04:51.744 GMT [1984] DETAIL: End of WAL reached on timeline 1 at 0/3028D50. 2017-04-20 22:04:51.744 GMT [1708] LOG: new target timeline is 2 2017-04-20 22:04:51.744 GMT [1984] LOG: restarted WAL streaming at 0/3000000 on timeline 2 It looks approximately the same on my development machine where I got the assertion failure until commit 84638808: 2017-04-24 09:04:46.563 NZST [95164] LOG: fetching timeline history file for timeline 2 from primary server 2017-04-24 09:04:46.563 NZST [95164] LOG: started streaming WAL from primary at 0/3000000 on timeline 1 2017-04-24 09:04:46.564 NZST [95164] LOG: replication terminated by primary server 2017-04-24 09:04:46.564 NZST [95164] DETAIL: End of WAL reached on timeline 1 at 0/3028470. 2017-04-24 09:04:46.565 NZST [95160] LOG: new target timeline is 2 2017-04-24 09:04:46.565 NZST [95164] LOG: restarted WAL streaming at 0/3000000 on timeline 2 -- Thomas Munro http://www.enterprisedb.com
pgsql-hackers by date: