Re: [PATCH] Accept connections post recovery without waiting for RemoveOldXlogFiles - Mailing list pgsql-hackers

From Dilip Kumar
Subject Re: [PATCH] Accept connections post recovery without waiting for RemoveOldXlogFiles
Msg-id CAFiTN-vBtd2VbWAgFj2uPJeZtwwhCYRT67osnF9qZdMoj4nHZg@mail.gmail.com
In response to Re: [PATCH] Accept connections post recovery without waiting for RemoveOldXlogFiles  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Tue, Sep 9, 2025 at 12:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Sep 8, 2025 at 3:03 PM Nitin Motiani <nitinmotiani@google.com> wrote:
> >
> > I'd like to propose a patch to allow accepting connections post
> > recovery without waiting for the removal of old xlog files.
> >
> > Why: We have seen instances where crash recovery takes very long
> > (tens of minutes to hours) if a large number of accumulated WAL
> > files need to be cleaned up (e.g., cleaning up 2M old WAL files took
> > close to 4 hours).
> >
> > This WAL accumulation is usually caused by :
> >
> > 1. Inactive replication slot
> > 2. PITR failing to keep up
> >
> > In the above cases, when the resolution (deleting the inactive
> > slot/disabling PITR) is followed by a crash (before a checkpoint
> > could run), we see the recovery take a very long time. Note that in
> > these cases the actual WAL replay is done relatively quickly and
> > most of the delay is due to RemoveOldXlogFiles().
> >
>
> Isn't it better to fix the reasons for WAL accumulation? Because even
> without recovery, this can fill up the disk. For example, one can use
> idle_replication_slot_timeout for inactive slots. Similarly, we can
> see what leads to slow PITR and try to avoid that.

I agree that in an ideal world it's better if users set
'idle_replication_slot_timeout' correctly so that WAL doesn't
accumulate in the first place.  But users don't always do that, and
there are situations where WAL accumulates anyway.  In this context, the
goal is to address the problem after it has already happened,
minimizing additional downtime for the user.  I feel this is a
reasonable goal, although we can think more about whether it is worth
issuing the extra checkpoint to improve this situation.

--
Regards,
Dilip Kumar
Google


