Re: [PATCH] Accept connections post recovery without waiting for RemoveOldXlogFiles - Mailing list pgsql-hackers

From	Dilip Kumar
Subject	Re: [PATCH] Accept connections post recovery without waiting for RemoveOldXlogFiles
Date	September 9 10:12:00
Msg-id	CAFiTN-vBtd2VbWAgFj2uPJeZtwwhCYRT67osnF9qZdMoj4nHZg@mail.gmail.com
In response to	Re: [PATCH] Accept connections post recovery without waiting for RemoveOldXlogFiles (Amit Kapila <amit.kapila16@gmail.com>)
List	pgsql-hackers

On Tue, Sep 9, 2025 at 12:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Sep 8, 2025 at 3:03 PM Nitin Motiani <nitinmotiani@google.com> wrote:
> >
> > I'd like to propose a patch to allow accepting connections post recovery
> > without waiting for the removal of old xlog files.
> >
> > Why: We have seen instances where the crash recovery takes very long
> > (tens of minutes to hours) if a large number of accumulated WAL files
> > need to be cleaned up (e.g., cleaning up 2M old WAL files took close to
> > 4 hours).
> >
> > This WAL accumulation is usually caused by:
> >
> > 1. Inactive replication slot
> > 2. PITR failing to keep up
> >
> > In the above cases, when the resolution (deleting inactive slot/disabling
> > PITR) is followed by a crash (before checkpoint could run), we see the
> > recovery take a very long time. Note that in these cases the actual WAL
> > replay is done relatively quickly and most of the delay is due to
> > RemoveOldXlogFiles().
> >
>
> Isn't it better to fix the reasons for WAL accumulation? Because even
> without recovery, this can fill up the disk. For example, one can use
> idle_replication_slot_timeout for inactive slots. Similarly, we can
> see what leads to slow PITR and try to avoid that.

I agree that in an ideal world it's better if the user sets
'idle_replication_slot_timeout' correctly so that WAL never
accumulates in the first place.  But users don't always do that, and
there are situations where WAL accumulates anyway.  In this context, the
goal is to address the problem after it has already happened,
minimizing additional downtime for the user.  I feel this is a
reasonable goal, although we can think more about whether it is worth
issuing the extra checkpoint to improve this situation.

--
Regards,
Dilip Kumar
Google
