On Mon, Sep 8, 2025 at 3:03 PM Nitin Motiani <nitinmotiani@google.com> wrote:
>
> I'd like to propose a patch to allow accepting connections post recovery without waiting for the removal of old xlog files.
>
> Why : We have seen instances where the crash recovery takes very long (tens of minutes to hours) if a large number of accumulated WAL files need to be cleaned up (eg : Cleaning up 2M old WAL files took close to 4 hours).
>
> This WAL accumulation is usually caused by :
>
> 1. Inactive replication slot
> 2. PITR failing to keep up
>
> In the above cases when the resolution (deleting inactive slot/disabling PITR) is followed by a crash (before checkpoint could run), we see the recovery take a very long time. Note that in these cases the actual WAL replay is done relatively quickly and most of the delay is due to RemoveOldXlogFiles().
>
Isn't it better to fix the reasons for WAL accumulation? Because even
without recovery, this can fill up the disk. For example, one can use
idle_replication_slot_timeout for inactive slots. Similarly, we can
see what leads to slow PITR and try to avoid that.
--
With Regards,
Amit Kapila.