Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly - Mailing list pgsql-hackers

From: Alexander Korotkov
Subject: Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Msg-id: CAPpHfdvk5RxdKZuFDFgDet6ZAzVW0ojxP-pjjqZPFZUW2N5gEA@mail.gmail.com
In response to: Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly  (Amit Kapila <amit.kapila16@gmail.com>)
List: pgsql-hackers
On Thu, Jun 19, 2025 at 1:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Jun 18, 2025 at 10:17 PM Alexander Korotkov
> <aekorotkov@gmail.com> wrote:
> >
> > On Wed, Jun 18, 2025 at 6:50 PM Vitaly Davydov <v.davydov@postgrespro.ru> wrote:
> > > > I think it is a good idea. Since we do not use the generated data, it is ok
> > > > just to generate WAL segments using the proposed function. I've tested this
> > > > function. The tests worked as expected with and without the fix. The attached
> > > > patch does the change.
> > >
> > > Sorry, forgot to attach the patch. It is created on the current master branch.
> > > It may conflict with your corrections. I hope, it could be useful.
> >
> > Thank you.  I've integrated this into a patch to improve these tests.
> >
> > Regarding assertion failure, I've found that assert in
> > PhysicalConfirmReceivedLocation() conflicts with restart_lsn
> > previously set by ReplicationSlotReserveWal().  As I can see,
> > ReplicationSlotReserveWal() just picks fresh XLogCtl->RedoRecPtr lsn.
> > So, it doesn't seem there is a guarantee that restart_lsn never goes
> > backward.  The comment in ReplicationSlotReserveWal() even states there
> > is a "chance that we have to retry".
> >
>
> I don't see how this theory can lead to a restart_lsn of a slot going
> backwards. The retry mentioned there is just a retry to reserve the
> slot's position again if the required WAL is already removed. Such a
> retry can only get the position later than the previous restart_lsn.

Yes, if a retry is needed, the new position is certainly later.
What I mean is that ReplicationSlotReserveWal() can reserve a position
later than the one the standby is going to read (and correspondingly
report via PhysicalConfirmReceivedLocation()), so the reported LSN can
be behind the restart_lsn already stored in the slot.
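
To illustrate the ordering I have in mind, here is a toy model in
Python.  It is only a sketch, not PostgreSQL code: the Slot class, the
function names and the LSN values are all hypothetical stand-ins, and
the monotonicity check is my simplified reading of the assertion under
discussion.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    restart_lsn: int = 0  # 0 plays the role of an invalid LSN here


def reserve_wal(slot: Slot, redo_rec_ptr: int) -> None:
    # Toy stand-in for ReplicationSlotReserveWal(): take the fresh redo pointer.
    slot.restart_lsn = redo_rec_ptr


def confirm_received(slot: Slot, reported_lsn: int, check_monotonic: bool) -> None:
    # Toy stand-in for PhysicalConfirmReceivedLocation().
    # The optional check models an assertion that restart_lsn never moves back.
    if check_monotonic and slot.restart_lsn != 0 and reported_lsn < slot.restart_lsn:
        raise AssertionError("restart_lsn would go backward")
    slot.restart_lsn = reported_lsn


def assertion_trips(redo_rec_ptr: int, standby_lsn: int) -> bool:
    # Reserve at the redo pointer, then let the standby report its position.
    slot = Slot()
    reserve_wal(slot, redo_rec_ptr)
    try:
        confirm_received(slot, standby_lsn, check_monotonic=True)
    except AssertionError:
        return True
    return False
```

With made-up values, assertion_trips(0x3000000, 0x2000000) models the
standby reporting a position behind the reserved one, while
assertion_trips(0x2000000, 0x3000000) models the usual case where the
report is ahead of the reservation.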

> >  Thus, I propose to remove the
> > assertion introduced by ca307d5cec90.
> >
>
> If what I said above is correct, then the following part of the commit
> message will be incorrect:
> "As stated in the ReplicationSlotReserveWal() comment, this is not
> always true. Additionally, this issue has been spotted by some
> buildfarm
> members."

I agree, this wording needs to be made clearer.

Meanwhile, I've pushed the patch for the TAP tests, which I believe
drew no objections.

------
Regards,
Alexander Korotkov
Supabase


