
From vignesh C
Subject Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Msg-id CALDaNm1u=QV1a7w48BWgkg6GnK20eE_y+raL43D46_R9Wnt8wg@mail.gmail.com
In response to Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly  ("Vitaly Davydov" <v.davydov@postgrespro.ru>)
List pgsql-hackers
On Sun, 29 Jun 2025 at 11:52, Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear hackers,
>
> Thanks to everyone who is working on the bug. IIUC the remaining task is
> to add code comments to avoid making the same mistake again, as described here:
>
> > Sounds reasonable. As per analysis till now, it seems removal of new
> > assert is correct and we just need to figure out the reason in all
> > failure cases as to why the physical slot's restart_lsn goes backward,
> > and then add a comment somewhere to ensure that we don't repeat a
> > similar mistake in the future.
>
> I've written a draft for that. What do you think?

I analyzed a scenario involving physical replication where the
restart_lsn appears to go backward by a few WAL files:
In my case, the current restart_lsn = 0/A721E68, which corresponds to a
WAL record in the file 00000001000000000000000A. However, the recovery
process is progressing slowly and has only reached 0/3058078.
Now, when primary_conninfo is changed, the walsender/walreceiver
processes are restarted. As a result, the standby starts requesting
WAL streaming from 0/3058078, the last applied WAL LSN, which is in the
file 000000010000000000000003.
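As a side note, the mapping from an LSN to a 24-character WAL segment
file name is simple arithmetic. A minimal standalone sketch (not
PostgreSQL source, just the same arithmetic; it assumes the default
16MB wal_segment_size and timeline 1):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    const uint64_t wal_segment_size = 16 * 1024 * 1024; /* default 16MB */
    const unsigned timeline = 1;
    const uint64_t lsns[] = {0x0A721E68, 0x03058078};   /* 0/A721E68, 0/3058078 */

    for (int i = 0; i < 2; i++)
    {
        uint64_t segno = lsns[i] / wal_segment_size;
        uint64_t segs_per_xlogid = UINT64_C(0x100000000) / wal_segment_size;

        /* prints 00000001000000000000000A and 000000010000000000000003 */
        printf("%X/%X -> %08X%08X%08X\n",
               (unsigned) (lsns[i] >> 32), (unsigned) lsns[i],
               timeline,
               (unsigned) (segno / segs_per_xlogid),
               (unsigned) (segno % segs_per_xlogid));
    }
    return 0;
}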

Since this WAL segment (000000010000000000000003) has already been
removed because the restart_lsn is far ahead, the primary throws an
error saying the requested WAL segment has already been removed. Even
though the apply pointer is at 0/3058078 (the record currently being
applied), the standby already has WAL segments up to 0/A721E68 written
and available locally. So, despite the walsender failing to start due
to the missing segment, the recovery process on the standby continues
applying WAL up to 0/A721E68 from the locally available files
(including 00000001000000000000000A). Once all local WAL has been
applied, the standby attempts to start streaming again, this time from
the latest WAL record (in my case, 0/A74000). Since this record still
exists on the primary, the walsender starts successfully, and physical
replication resumes normally.
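To make that sequence concrete, here is a small self-contained toy
simulation (not PostgreSQL code). The 0/A000000 removal horizon on the
primary and the rounding of the streaming start position down to a
segment boundary are assumptions based on the logs further below:

#include <stdio.h>
#include <stdint.h>

#define SEG_SIZE       UINT64_C(0x1000000)   /* 16MB segments */
#define LOCAL_WAL_END  UINT64_C(0x0A721E68)  /* WAL already in the standby's pg_wal */
#define PRIMARY_OLDEST UINT64_C(0x0A000000)  /* primary removed WAL before this (assumed) */

static int
try_streaming(uint64_t start)
{
    if (start < PRIMARY_OLDEST)
    {
        /* mirrors "requested WAL segment ... has already been removed" */
        printf("START_REPLICATION at %X/%X fails: segment %08X%08X%08X removed\n",
               (unsigned) (start >> 32), (unsigned) start,
               1u,                                    /* timeline 1 */
               (unsigned) (start / SEG_SIZE / 256),   /* 256 segments per "xlogid" at 16MB */
               (unsigned) (start / SEG_SIZE % 256));
        return 0;
    }
    printf("START_REPLICATION at %X/%X succeeds, replication resumes\n",
           (unsigned) (start >> 32), (unsigned) start);
    return 1;
}

int
main(void)
{
    uint64_t replay = UINT64_C(0x03058078);   /* current replay position */

    while (!try_streaming(replay - replay % SEG_SIZE))
    {
        /* streaming failed: keep applying the WAL already present in pg_wal */
        replay = LOCAL_WAL_END;
        printf("applied local WAL up to %X/%X from pg_wal\n",
               (unsigned) (replay >> 32), (unsigned) replay);
    }
    return 0;
}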

This confirms that even in physical replication, a backward-moving
restart_lsn does not cause issues, as the standby retains the necessary
WAL locally and can catch up without requiring the removed segments
from the primary.

I was able to reproduce this scenario using the attached
physical_replication_restart_lsn_backward patch; in it I changed the
settings so that checkpoints happen more frequently and slowed down the
recovery process so that WAL records are applied with a delay. For me
it reproduced on every run. The logs are attached in the logs.zip file,
and the test used to reproduce this is available in the
test_restart_lsn_backward.zip file.
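The slowdown just makes the gap between the standby's flush and replay
positions grow quickly, which is the precondition for the scenario
above. A toy model of that gap (the rates here are pure assumptions;
this is not what the attached patch does):

#include <stdio.h>

int
main(void)
{
    double gen_mb_per_s = 8.0;     /* assumed WAL generation rate on the primary */
    double apply_mb_per_s = 0.5;   /* assumed (artificially slowed) apply rate */

    for (int t = 5; t <= 30; t += 5)
    {
        double flush = gen_mb_per_s * t;     /* walreceiver keeps up with generation */
        double replay = apply_mb_per_s * t;  /* startup process falls behind */

        printf("after %2ds: flushed %.0f MB, replayed %.0f MB, gap %.0f MB\n",
               t, flush, replay, flush - replay);
    }
    return 0;
}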

Important log information from the attached logs:
The restart_lsn has moved to 0/A6E8000 on the primary node:
2025-07-02 22:34:44.563 IST walsender[330205] standby1 LOG:
PhysicalConfirmReceivedLocation replication slot "sb1_slot" set
restart_lsn to 0/A6E8000
2025-07-02 22:34:44.563 IST walsender[330205] standby1 STATEMENT:
START_REPLICATION SLOT "sb1_slot" 0/3000000 TIMELINE 1

When the walsender is restarted, it is asked for an older WAL segment
that has already been removed and throws an error:
2025-07-02 22:34:44.577 IST walsender[330428] standby1 STATEMENT:
START_REPLICATION SLOT "sb1_slot" 0/3000000 TIMELINE 1
2025-07-02 22:34:44.577 IST walsender[330428] standby1 ERROR:
requested WAL segment 000000010000000000000003 has already been
removed

The standby node has written WAL up to 0/A6E8000:
2025-07-02 22:34:44.563 IST walreceiver[330204] LOG:  XLogWalRcvFlush
LogstreamResult.Flush initialized to 0/A6E8000
2025-07-02 22:34:44.563 IST walreceiver[330204] LOG:  sending write
0/A6E8000 flush 0/A6E8000 apply 0/

Walreceiver process terminated:
2025-07-02 22:34:44.564 IST walreceiver[330204] FATAL:  terminating
walreceiver process due to administrator command

The walreceiver requests WAL from the 000000010000000000000003 WAL
file, which has been removed:
2025-07-02 22:34:44.577 IST walreceiver[330427] LOG:  started
streaming WAL from primary at 0/3000000 on timeline 1
2025-07-02 22:34:44.577 IST walreceiver[330427] LOG:  WalReceiverMain
LogstreamResult.Flush initialized to 0/3047958
2025-07-02 22:34:44.577 IST walreceiver[330427] LOG:  sending write
0/3047958 flush 0/3047958 apply 0/3047958
2025-07-02 22:34:44.577 IST walreceiver[330427] FATAL:  could not
receive data from WAL stream: ERROR:  requested WAL segment
000000010000000000000003 has already been removed

Even though the WAL file has been removed there is no issue; the
recovery process applies the WAL from the locally available WAL files:
2025-07-02 22:34:44.578 IST startup[330195] LOG:  Applied wal record
end rec 0/3047990, read rec 0/3047958
2025-07-02 22:34:44.580 IST checkpointer[330193] LOG:  recovery
restart point at 0/3047938

Once all the available WAL is applied, the walsender is started
again and replication proceeds smoothly:
2025-07-02 22:35:15.002 IST walsender[330922] standby1 LOG:  received
replication command: START_REPLICATION SLOT "sb1_slot" 0/A000000
TIMELINE 1
2025-07-02 22:35:15.002 IST walsender[330922] standby1 STATEMENT:
START_REPLICATION SLOT "sb1_slot" 0/A000000 TIMELINE 1

I'm ok with adding the comments.

Regards,
Vignesh

