Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)? - Mailing list pgsql-hackers
From | Kyotaro HORIGUCHI |
---|---|
Subject | Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)? |
Date | |
Msg-id | 20170328.155100.219725603.horiguchi.kyotaro@lab.ntt.co.jp |
In response to | Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)? (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>) |
Responses |
Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?
|
List | pgsql-hackers |
This conflicts with 6912acc (replication lag tracker) so just
rebased on a6f22e8.

At Fri, 17 Mar 2017 16:48:27 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170317.164827.46663014.horiguchi.kyotaro@lab.ntt.co.jp>
> Hello,
>
> At Mon, 13 Mar 2017 11:06:00 +1100, Venkata B Nagothi <nag1010@gmail.com> wrote in <CAEyp7J-4MmVwGoZSwvaSULZC80JDD_tL-9KsNiqF17+bNqiSBg@mail.gmail.com>
> > On Tue, Jan 17, 2017 at 9:36 PM, Kyotaro HORIGUCHI <
> > horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > > I managed to reproduce this. A little tweak as the first patch
> > > lets the standby to suicide as soon as walreceiver sees a
> > > contrecord at the beginning of a segment.
> > >
> > > - M(aster): createdb as a master with wal_keep_segments = 0
> > >   (default), log_min_messages = debug2
> > > - M: Create a physical repslot.
> > > - S(tandby): Setup a standby database.
> > > - S: Edit recovery.conf to use the replication slot above then
> > >   start it.
> > > - S: touch /tmp/hoge
> > > - M: Run pgbench ...
> > > - S: After a while, the standby stops.
> > > > LOG: #################### STOP THE SERVER
> > >
> > > - M: Stop pgbench.
> > > - M: Do 'checkpoint;' twice.
> > > - S: rm /tmp/hoge
> > > - S: Fails to catch up with the following error.
> > >
> > > > FATAL: could not receive data from WAL stream: ERROR: requested WAL
> > > segment 00000001000000000000002B has already been removed
> > >
> > >
> > I have been testing / reviewing the latest patch
> > "0001-Fix-a-bug-of-physical-replication-slot.patch" and i think, i might
> > need some more clarification on this.
> >
> > Before applying the patch, I tried re-producing the above error -
> >
> > - I had master->standby in streaming replication
> > - Took the backup of master
> >   - with a low max_wal_size and wal_keep_segments = 0
> > - Configured standby with recovery.conf
> > - Created replication slot on master
> > - Configured the replication slot on standby and started the standby
>
> I suppose "configure" here means setting primary_slot_name in
> recovery.conf.
>
> > - I got the below error
> >
> > >> 2017-03-10 11:58:15.704 AEDT [478] LOG: invalid record length at
> > 0/F2000140: wanted 24, got 0
> > >> 2017-03-10 11:58:15.706 AEDT [481] LOG: started streaming WAL from
> > primary at 0/F2000000 on timeline 1
> > >> 2017-03-10 11:58:15.706 AEDT [481] FATAL: could not receive data
> > from WAL stream: ERROR: requested WAL segment 0000000100000000000000F2 has
> > already been removed
>
> Maybe you created the master slot in non-reserve (default) mode
> and paused for several minutes after making the backup and before
> starting the standby. In that case the master slot doesn't keep
> WAL segments until the standby connects, so a couple of
> checkpoints can blow away the first segment required by the
> standby. This is quite reasonable behavior. The following steps
> make this more certain.
>
> > - Took the backup of master
> >   - with a low max_wal_size = 2 and wal_keep_segments = 0
> > - Configured standby with recovery.conf
> > - Created replication slot on master
> + - SELECT pg_switch_wal(); on master twice.
> + - checkpoint; on master twice.
> > - Configured the replication slot on standby and started the standby
>
> Creating the slot with the following command will save it.
>
> =# select pg_create_physical_replication_slot('s1', true);
>
> > and i could notice that the file "0000000100000000000000F2" was removed
> > from the master. This can be easily re-produced and this occurs
> > irrespective of configuring replication slots.
> >
> > As long as the file "0000000100000000000000F2" is available on the master,
> > standby continues to stream WALs without any issues.
> ...
> > If the scenario i created to reproduce the error is correct, then, applying
> > the patch is not making a difference.
>
> Yes, the patch is not for saving this case. The patch saves the
> case where the segment preceding the first segment required by
> the standby was removed, and that removed segment contains the
> first part of a record that continues into the first required
> segment. This case, on the other hand, is one where the segment
> at the standby's start point is simply removed.
>
> > I think, i need help in building a specific test case which will re-produce
> > the specific BUG related to physical replication slots as reported ?
> >
> > Will continue to review the patch, once i have any comments on this.
>
> Thanks a lot!

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
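
To make the distinction between the two cases above easier to follow, here is a minimal sketch (not part of the patch) that maps an LSN to its WAL segment file name and lists the segments needed to read one record. It assumes the default 16 MB segment size. The function names and the record boundaries in the usage note are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only; assumes the default 16 MB WAL segment size.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def parse_lsn(lsn: str) -> int:
    """Convert an LSN written as 'X/Y' (hex) into a 64-bit byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def segment_file_name(timeline: int, lsn: int) -> str:
    """Return the 24-hex-digit WAL segment file name that contains lsn."""
    segno = lsn // WAL_SEGMENT_SIZE
    segs_per_xlogid = 0x100000000 // WAL_SEGMENT_SIZE
    return "%08X%08X%08X" % (
        timeline,
        segno // segs_per_xlogid,
        segno % segs_per_xlogid,
    )


def segments_needed(timeline: int, record_start: int, record_end: int) -> list:
    """List every segment touched by a record occupying [record_start, record_end)."""
    first = record_start // WAL_SEGMENT_SIZE
    last = (record_end - 1) // WAL_SEGMENT_SIZE
    return [
        segment_file_name(timeline, segno * WAL_SEGMENT_SIZE)
        for segno in range(first, last + 1)
    ]
```

With these helpers, `segment_file_name(1, parse_lsn("0/F2000140"))` gives the segment named in the standby log above. A hypothetical record running from 0/2AFFFFE0 to 0/2B000040 needs both 00000001000000000000002A and 00000001000000000000002B, so a standby that restarts streaming at the beginning of segment 2B still needs the tail of 2A. That appears to be the situation the patch targets, as opposed to the start segment itself being removed.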