Re: Improve pg_sync_replication_slots() to wait for primary to advance - Mailing list pgsql-hackers
| From | Ajin Cherian |
|---|---|
| Subject | Re: Improve pg_sync_replication_slots() to wait for primary to advance |
| Date | |
| Msg-id | CAFPTHDbrcygRmgysH-GSww2-O4hyY+idWQZn53ogss96pL-Xxw@mail.gmail.com |
| In response to | Re: Improve pg_sync_replication_slots() to wait for primary to advance (Ashutosh Sharma <ashu.coek88@gmail.com>) |
| Responses | Re: Improve pg_sync_replication_slots() to wait for primary to advance |
| List | pgsql-hackers |
On Thu, Sep 4, 2025 at 4:35 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Wed, Sep 3, 2025 at 3:19 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> Thanks for the patch. Please find a few comments:
>
> 1)
> /* Clean up slot_names if allocated in TopMemoryContext */
> if (slot_names)
> {
> oldcontext = MemoryContextSwitchTo(TopMemoryContext);
> list_free_deep(slot_names);
> MemoryContextSwitchTo(oldcontext);
> }
>
> I think we can free slot_names without switching the context. Can you
> please check this?

Removed this.

> 2)
> We should add a comment for:
> a) why we are using the slot-names from the first cycle instead of
> fetching all failover slots in each cycle.
> b) why are we relocating remote_slot list everytime.

I have added a comment; let me know if more is required.

> 3)
> @@ -1130,7 +1180,7 @@ slotsync_reread_config(void)
>
> - Assert(sync_replication_slots);
> + Assert(!AmLogicalSlotSyncWorkerProcess() || sync_replication_slots);
>
> Do we still need this change after slotsync_api_reread_config?

Removed.

> 4)
> +static void ProcessSlotSyncInterrupts(void);
>
> This is not needed.

Removed.

> 5)
> + /* update flag, so that we retry */
> + slot_persistence_pending = true;
>
> Can we tweak it to: 'Update the flag so that the API can retry'

Updated.

> 6)
> SyncReplicationSlots():
> + /* Free the current remote_slots list */
> + if (remote_slots)
> + list_free_deep(remote_slots);
>
> Do we need a 'remote_slots' check, won't it manage it internally? We
> don't have it in ReplSlotSyncWorkerMain().

Changed.

> 7)
> slotsync_api_reread_config
>
> + ereport(ERROR,
> + (errcode(ERRCODE_CONFIG_FILE_ERROR),
> + errmsg("cannot continue slot synchronization due to parameter changes"),
> + errdetail("Critical replication parameters (primary_conninfo,
> primary_slot_name, or hot_standby_feedback) have changed since
> pg_sync_replication_slots() started."),
> + errhint("Retry pg_sync_replication_slots() to use the updated
> configuration.")));
>
> I am unsure if we need to mention '(primary_conninfo,
> primary_slot_name, or hot_standby_feedback)', but would like to know
> what others think.

Leaving this as is for now.

On Mon, Sep 8, 2025 at 2:20 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Sat, Sep 6, 2025 at 12:05 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> >
> > On Fri, Sep 5, 2025 at 6:52 PM Ashutosh Bapat
> > <ashutosh.bapat.oss@gmail.com> wrote:
> > >
> > > On Wed, Sep 3, 2025 at 11:58 AM Ajin Cherian <itsajin@gmail.com> wrote:
> > > >
> > > > I've tested this and I see that interrupts are being handled by
> > > > sending SIGQUIT and SIGINT to the backend process.
> > >
> > > Can you please point me to the code (the call to
> > > CHECK_FOR_INTERRUPTS()) which processes these interrupts while
> > > pg_sync_replication_slots() is executing, especially when the function
> > > is waiting while syncing a slot.
> >
> > I noticed that the function libpqrcv_processTuples, which is invoked
> > by fetch_remote_slots, includes a CHECK_FOR_INTERRUPTS call. This is
> > currently helping in processing interrupts while we are in an infinite
> > loop within SyncReplicationSlots(). I’m just pointing this out based
> > on my observation while reviewing the changes in this patch. Ajin,
> > please correct me if I’m mistaken. If not, can we always rely on this
> > particular check for interrupts.
>
> It doesn't seem good to rely on CHECK_FOR_INTERRUPTS from so far
> away. It's better to have one being called from SyncReplicationSlots(),
> which has the wait loop. That's how the other functions which have
> potentially long wait loops do it.

Ok, I agree. Added the check.
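In rough outline, the API-side loop is shaped like the sketch below. This is a simplified illustration rather than the exact v11 code, meant to live alongside the existing code in slotsync.c: fetch_and_sync_slots() is a stand-in for the real per-cycle helper and the WaitLatch wait-event constant is an assumption, while SLOTSYNC_API_NAPTIME_MS, the SlotSyncCtx->stopSignaled check, and the promotion error come from the fragments quoted in this thread.

```c
/*
 * Simplified sketch only -- not the actual v11 patch.
 * fetch_and_sync_slots() is a placeholder for the per-cycle sync helper and
 * the wait-event constant is assumed; the other names are from the quoted
 * fragments above.
 */
static bool fetch_and_sync_slots(WalReceiverConn *wrconn);	/* placeholder */

static void
wait_until_slots_sync_ready(WalReceiverConn *wrconn)
{
	for (;;)
	{
		bool		slot_persistence_pending;

		/* Let statement cancel / backend termination interrupt the wait */
		CHECK_FOR_INTERRUPTS();

		/* If we've been promoted, there is no point continuing */
		if (SlotSyncCtx->stopSignaled)
			ereport(ERROR,
					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
					 errmsg("exiting from slot synchronization as promotion is triggered")));

		/* One fetch-and-sync cycle against the primary */
		slot_persistence_pending = fetch_and_sync_slots(wrconn);

		/* Done once every slot is at least sync ready */
		if (!slot_persistence_pending)
			break;

		/* Wait before retrying instead of busy-looping */
		(void) WaitLatch(MyLatch,
						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
						 SLOTSYNC_API_NAPTIME_MS,
						 WAIT_EVENT_REPLICATION_SLOTSYNC_MAIN);
		ResetLatch(MyLatch);
	}
}
```

With the CHECK_FOR_INTERRUPTS() at the top of each iteration, SIGINT/SIGQUIT are handled directly in the wait loop instead of relying on the call buried inside libpqrcv_processTuples().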
On Mon, Sep 8, 2025 at 2:33 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Hi,
>
> Sharing some of my review comments, please look if these make sense to you.
>
> + /*
> + * If we've been promoted, then no point
> + * continuing.
> + */
> + if (SlotSyncCtx->stopSignaled)
> + {
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("exiting from slot synchronization as"
> + " promotion is triggered")));
> + break;
> + }
>
> "break" statement here looks redundant to me.

Removed.

> --
>
> + *
> + * Repeatedly fetches and updates replication slot information from the
> + * primary until all slots are at least "sync ready". Retry is done after 2
> + * sec wait. Exits early if promotion is triggered or certain critical
> + * configuration parameters have changed.
> */
>
> wait for 2 seconds before retrying - the constant is
> SLOTSYNC_API_NAPTIME_MS, so technically it may *not* always be 2s if
> the macro changes. Maybe reword to “wait for SLOTSYNC_API_NAPTIME_MS
> before retrying” would look better?

I've removed the reference to 2 seconds.

> --
>
> /* Retry until all slots are sync ready atleast */
>
> and
>
> /* Done if all slots are atleast sync ready */
>
> atleast -> "at least". I am just making this comment because at a few
> places in the same file I see "at least" and not "atleast".

Changed.

> --
>
> +static void ProcessSlotSyncInterrupts(void);
>
> Is this change related to this patch?

It was used earlier, and I forgot to change it. Removed now.

> --
>
> + <command>CREATE SUBSCRIPTION</command> during slot creation. After that,
> + synchronization can be be performed either manually by calling
> + <link linkend="pg-sync-replication-slots">
>
> double "be".

Fixed.

On Fri, Sep 5, 2025 at 11:21 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Wed, Sep 3, 2025 at 11:58 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
> AFAICS, it's a matter of expanding the scope of what's returned by
> those functions. The worker may not want to use the whole expanded
> scope but the API will use it. That shouldn't change the functionality
> of the worker, but it will help avoid the global variable - which have
> much wider scope and their maintenance can be prone to bugs.

Ok, I have added a new return parameter for these functions that reports
whether any slots are still pending persistence, and removed the global
variable.
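To illustrate that last change, here is a hypothetical sketch of what replacing the global flag with an output parameter can look like. The names are illustrative, not the actual v11 code: the per-cycle routine reports "some slot could not be persisted yet" to its caller; the worker can ignore the value, while pg_sync_replication_slots() keeps retrying until it comes back false.

```c
/*
 * Hypothetical illustration, not the actual v11 code: report pending
 * persistence through an output parameter instead of a global variable.
 */
static bool
synchronize_slots_once(WalReceiverConn *wrconn, bool *persistence_pending)
{
	bool		some_slot_updated = false;

	*persistence_pending = false;

	/*
	 * For each remote failover slot: try to sync it.  If the primary has
	 * not advanced far enough for the local copy to be made persistent,
	 * set *persistence_pending = true and move on to the next slot.
	 */

	return some_slot_updated;
}

/* API path: keep retrying until nothing is left pending. */
static void
sync_until_done(WalReceiverConn *wrconn)
{
	bool		pending;

	do
	{
		(void) synchronize_slots_once(wrconn, &pending);
		/* wait and interrupt handling elided; see the loop sketched earlier */
	} while (pending);
}
```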
Attached is the v11 patch addressing the above comments.

regards,
Ajin Cherian
Fujitsu Australia