Re: [PATCH] Accept connections post recovery without waiting for RemoveOldXlogFiles - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: [PATCH] Accept connections post recovery without waiting for RemoveOldXlogFiles
Msg-id CAA4eK1LANwLdEhavTfTtmOD8LJ8uUoMY7FtPX_3YF7ge=Z7TcA@mail.gmail.com
In response to [PATCH] Accept connections post recovery without waiting for RemoveOldXlogFiles  (Nitin Motiani <nitinmotiani@google.com>)
Responses Re: [PATCH] Accept connections post recovery without waiting for RemoveOldXlogFiles
List pgsql-hackers
On Mon, Sep 8, 2025 at 3:03 PM Nitin Motiani <nitinmotiani@google.com> wrote:
>
> I'd like to propose a patch to allow accepting connections post recovery without waiting for the removal of old xlog files.
>
> Why : We have seen instances where the crash recovery takes very long (tens of minutes to hours) if a large number of accumulated WAL files need to be cleaned up (eg : Cleaning up 2M old WAL files took close to 4 hours).
>
> This WAL accumulation is usually caused by :
>
> 1. Inactive replication slot
> 2. PITR failing to keep up
>
> In the above cases when the resolution (deleting inactive slot/disabling PITR) is followed by a crash (before checkpoint could run), we see the recovery take a very long time. Note that in these cases the actual WAL replay is done relatively quickly and most of the delay is due to RemoveOldXlogFiles().
>

Isn't it better to fix the reasons for WAL accumulation? Because even
without recovery, this can fill up the disk. For example, one can use
idle_replication_slot_timeout for inactive slots. Similarly, we can
see what leads to slow PITR and try to avoid that.
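
For illustration only, the inactive-slot suggestion might look like the
following postgresql.conf fragment. This is a minimal sketch: it assumes
a server version that has idle_replication_slot_timeout (PostgreSQL 18
or later), and the one-day value is an arbitrary placeholder rather than
a recommendation.

```
# postgresql.conf
# Invalidate replication slots that have stayed inactive longer than
# this duration. 0 (the default) disables the mechanism.
# '1d' is an assumed example value; choose one that fits the workload.
idle_replication_slot_timeout = '1d'
```

Giving the value an explicit unit avoids depending on the parameter's
default unit. As far as I understand, idle slots are invalidated during
checkpoints, so the WAL they were holding back may not be released
immediately after the timeout expires.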

--
With Regards,
Amit Kapila.


