Re: Improve pg_sync_replication_slots() to wait for primary to advance - Mailing list pgsql-hackers
From | shveta malik |
---|---|
Subject | Re: Improve pg_sync_replication_slots() to wait for primary to advance |
Date | |
Msg-id | CAJpy0uC+e4D_f4DchkcvgarZ8WpjFSOxgVxbpiaafmEUGD6OXw@mail.gmail.com Whole thread Raw |
In response to | Re: Improve pg_sync_replication_slots() to wait for primary to advance (Ajin Cherian <itsajin@gmail.com>) |
Responses |
Re: Improve pg_sync_replication_slots() to wait for primary to advance
|
List | pgsql-hackers |
On Mon, Sep 15, 2025 at 6:17 PM Ajin Cherian <itsajin@gmail.com> wrote: > > On Wed, Sep 10, 2025 at 2:45 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > On Tue, Sep 9, 2025 at 5:37 PM Ajin Cherian <itsajin@gmail.com> wrote: > > > > > > Attached v11 patch addressing the above comments. > > > > > > > Please find a few comments: > > > > 1) > > > > + Retry is done after 2 > > + * sec wait. Exits early if promotion is triggered or certain critical > > > > We can say: Retry is done after SLOTSYNC_API_NAPTIME_MS wait. > > > > Changed. > > > 2) > > + /* > > + * Fetch remote slot info for the given slot_names. If slot_names is NIL, > > + * fetch all failover-enabled slots. Note that we reuse slot_names from > > + * the previous iteration; re-fetching all failover slots each time could > > + * cause an endless loop. > > + */ > > > > a) > > the previous iteration --> the first iteration. > > > > b) Also we can mention the reason why we take names from first > > iteration instead of going for pending ones alone, something like: > > > > Instead of reprocessing only the pending slots in each iteration, it's > > better to process all the slots received in the first iteration. > > This ensures that by the time we're done, all slots reflect the latest values. > > > > 3) > > + remote_slots = fetch_remote_slots(wrconn, slot_names); > > + > > + > > + /* Attempt to synchronize slots */ > > + synchronize_slots(wrconn, remote_slots, &slot_persistence_pending); > > > > One extra blank line can be removed > > > > Fixed. > > > 4) > > > > + /* Clean up slot_names if allocated in TopMemoryContext */ > > + if (slot_names) > > + list_free_deep(slot_names); > > > > Can we please move it before 'ReplicationSlotCleanup'. > > > > Fixed. > > > 5) > > In case of error in subsequent iteration, slot_names allocated from > > TopMemoryContext will be left unfreed? > > > > I've changed the logic so that even on error, slot_names are freed. I see that you have freed 'slot_names' in config-changed and promotion case but the ERROR can come from other flows as well. The idea was to somehow to free it (if possible) in slotsync_failure_callback() by passing it as an argument, like we do for 'wrconn'. > > 6) > > + ListCell *lc; > > + bool first_slot = true; > > > > Shall we move these two to concerned if-block: > > if (slot_names != NIL) > > > > Changed. > > > 7) > > * The slot_persistence_pending flag is used by the pg_sync_replication_slots > > * API to track if any slots could not be persisted and need to be retried. > > > > a) Instead of mentioning only about slot_persistence_pending argument > > in concerned function's header, we shall define all the arguments. > > > > b) We can remove the 'flag' term from the comments as it is a > > function-argument now. > > > > Changed. > > > 8) > > I think we should add briefly in the header of the file about the new > > behaviour of API. > > > > Added. > > Attaching patch v12 addressing these comments. > Thank You for the patch. Please find a few comments: 1) + bool slot_persistence_pending = false; We can move this declaration outside of the loop. And I think we don't need to initialize as we are resetting it to false before each iteration. 2) + /* Switch to long-lived TopMemoryContext to store slot names */ + oldcontext = MemoryContextSwitchTo(TopMemoryContext); + + /* Extract slot names from the remote slots */ + slot_names = extract_slot_names(remote_slots); + + MemoryContextSwitchTo(oldcontext); I think it will be better if we move 'MemoryContextSwitchTo' calls inside extract_slot_names() itself. The entire logic related to 'slot_names' will then be consolidated in one place 3) + * The slot_persistence_pending flag is used by the pg_sync_replication_slots + * API to track if any slots could not be persisted and need to be retried. Can we change it to below. We can have it started in a new line after a blank line (see how remote_slot_precedes, found_consistent_snapshot are defined) *slot_persistence_pending is set to true if any of the slots fail to persist. It is utilized by the pg_sync_replication_slots() API. Please change it in both synchronize_one_slot() and update_and_persist_local_synced_slot() 4) a) + Update the + * slot_persistence_pending flag, so the API can retry. */ b) /* update the flag, so that the API can retry */ It will be good if we can remove 'flag' usage from both occurrences in update_and_persist_local_synced_slot(). 5) Similar to ProcessSlotSyncInterrupts() for worker, shall we have one such function for API which can have all 3 things: { /* * If we've been promoted, then no point * continuing. */ if (SlotSyncCtx->stopSignaled) { ereport(ERROR, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("exiting from slot synchronization as" " promotion is triggered"))); } CHECK_FOR_INTERRUPTS(); if (ConfigReloadPending) slotsync_api_reread_config(); } And similar to the worker case, we can have it checked in the beginning of the loop. Thoughts? thanks Shveta
pgsql-hackers by date: