Re: How can end users know the cause of LR slot sync delays? - Mailing list pgsql-hackers

From shveta malik
Subject Re: How can end users know the cause of LR slot sync delays?
Date
Msg-id CAJpy0uD8fj4Z_SPdmKMrd+1hQJBcX_EbOeDPCAAx9xeZZGvE0g@mail.gmail.com
Whole thread Raw
In response to Re: How can end users know the cause of LR slot sync delays?  (shveta malik <shveta.malik@gmail.com>)
List pgsql-hackers
On Tue, Nov 4, 2025 at 3:17 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Mon, Nov 3, 2025 at 3:14 PM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > >
> > > I have attached the latest patch.
> > >
> >
> > Thanks. I have started going through it.
> >
> > I’m not sure if this has already been discussed; I couldn’t find any
> > mention of it in the thread. Why don’t we persist
> > 'slot_sync_skip_reason' (it is outside of
> > ReplicationSlotPersistentData)? If a slot wasn’t synced during the
> > last cycle and the server restarts, it would be helpful to know the
> > reason it wasn’t synced prior to the node restart.
> >
>
> Please find a few more comments:
>
> 1)
> last_slot_sync_skip
>
> Will 'last_slotsync_skip_at' be a better name?
>
> Since we refer worker as slotsync in docs, I feel slotsync seems a
> more natural choice than slot_sync. Also '_at' gives clarity that it
> is about time rather than a boolean (which currently it seems like).
>
> Same goes for slot_sync_skip_count and slot_sync_skip_reason. Shall
> these be slotsync_skip_count and slotsync_skip_reason.
>
> 2)
> +update_slot_sync_skip_stats(ReplicationSlot *slot, SlotSyncSkipReason
> skip_reason,
> + bool acquire_slot)
>
> It looks like there is only one caller that passes acquire_slot as
> true, while all others pass false. Instead of keeping the acquire_slot
> parameter, will it be better if we remove it and add an
> Assert(MyReplication) to ensure the slot is already acquired? We can
> add a comment stating that this function expects the slot to be
> acquired by the caller. The one caller that currently passes
> acquire_slot = true can acquire the slot explicitly before invoking
> this function. Thoughts?
>
> 3)
> In update_and_persist_local_synced_slot(), we get both
> 'found_consistent_snapshot' and 'remote_slot_precedes' from
> update_local_synced_slot(). But skipsync-reason for
> 'remote_slot_precedes' is updated inside update_local_synced_slot()
> while skipsync-reason for '!found_consistent_snapshot' is updated in
> caller update_and_persist_local_synced_slot.  Is there a reason for
> that?
>
> 4)
> What about the case where the slot is invalidated and sync is skipped?
> I do not see any stats for that. See 'Skip the sync of an invalidated
> slot' in synchronize_one_slot(). If it is already discussed and
> concluded, please add a comment.
>

Few more on 001:

5)
The name SS_SKIP_MISSING_WAL_RECORD doesn’t seem appropriate. It
sounds more like some WAL issue, rather than indicating that the WAL
hasn’t been flushed. A better name could be SS_SKIP_WAL_NOT_FLUSHED.

6)
Instead of calling 'update_slot_sync_skip_stats' at multiple places,
how about we just update the skip_reason everywhere and make a call to
'update_slot_sync_skip_stats' only in synchronize_one_slot(). IMO,
that will look cleaner. Thoughts?

thanks
Shveta



pgsql-hackers by date:

Previous
From: Paul A Jungwirth
Date:
Subject: GiST README typos
Next
From: Alexander Pyhalov
Date:
Subject: Re: Asynchronous MergeAppend