Re: How can end users know the cause of LR slot sync delays? - Mailing list pgsql-hackers
| From | shveta malik |
|---|---|
| Subject | Re: How can end users know the cause of LR slot sync delays? |
| Date | |
| Msg-id | CAJpy0uD8fj4Z_SPdmKMrd+1hQJBcX_EbOeDPCAAx9xeZZGvE0g@mail.gmail.com Whole thread Raw |
| In response to | Re: How can end users know the cause of LR slot sync delays? (shveta malik <shveta.malik@gmail.com>) |
| List | pgsql-hackers |
On Tue, Nov 4, 2025 at 3:17 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Mon, Nov 3, 2025 at 3:14 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > > > I have attached the latest patch. > > > > > > > Thanks. I have started going through it. > > > > I’m not sure if this has already been discussed; I couldn’t find any > > mention of it in the thread. Why don’t we persist > > 'slot_sync_skip_reason' (it is outside of > > ReplicationSlotPersistentData)? If a slot wasn’t synced during the > > last cycle and the server restarts, it would be helpful to know the > > reason it wasn’t synced prior to the node restart. > > > > Please find a few more comments: > > 1) > last_slot_sync_skip > > Will 'last_slotsync_skip_at' be a better name? > > Since we refer worker as slotsync in docs, I feel slotsync seems a > more natural choice than slot_sync. Also '_at' gives clarity that it > is about time rather than a boolean (which currently it seems like). > > Same goes for slot_sync_skip_count and slot_sync_skip_reason. Shall > these be slotsync_skip_count and slotsync_skip_reason. > > 2) > +update_slot_sync_skip_stats(ReplicationSlot *slot, SlotSyncSkipReason > skip_reason, > + bool acquire_slot) > > It looks like there is only one caller that passes acquire_slot as > true, while all others pass false. Instead of keeping the acquire_slot > parameter, will it be better if we remove it and add an > Assert(MyReplication) to ensure the slot is already acquired? We can > add a comment stating that this function expects the slot to be > acquired by the caller. The one caller that currently passes > acquire_slot = true can acquire the slot explicitly before invoking > this function. Thoughts? > > 3) > In update_and_persist_local_synced_slot(), we get both > 'found_consistent_snapshot' and 'remote_slot_precedes' from > update_local_synced_slot(). But skipsync-reason for > 'remote_slot_precedes' is updated inside update_local_synced_slot() > while skipsync-reason for '!found_consistent_snapshot' is updated in > caller update_and_persist_local_synced_slot. Is there a reason for > that? > > 4) > What about the case where the slot is invalidated and sync is skipped? > I do not see any stats for that. See 'Skip the sync of an invalidated > slot' in synchronize_one_slot(). If it is already discussed and > concluded, please add a comment. > Few more on 001: 5) The name SS_SKIP_MISSING_WAL_RECORD doesn’t seem appropriate. It sounds more like some WAL issue, rather than indicating that the WAL hasn’t been flushed. A better name could be SS_SKIP_WAL_NOT_FLUSHED. 6) Instead of calling 'update_slot_sync_skip_stats' at multiple places, how about we just update the skip_reason everywhere and make a call to 'update_slot_sync_skip_stats' only in synchronize_one_slot(). IMO, that will look cleaner. Thoughts? thanks Shveta
pgsql-hackers by date: