Re: Improve pg_sync_replication_slots() to wait for primary to advance - Mailing list pgsql-hackers
From | shveta malik |
---|---|
Subject | Re: Improve pg_sync_replication_slots() to wait for primary to advance |
Date | |
Msg-id | CAJpy0uAMa6xjVsc9tA8+f0ij4L+gw2gw8F=R1gmeaLwEyHpTSw@mail.gmail.com Whole thread Raw |
In response to | Re: Improve pg_sync_replication_slots() to wait for primary to advance (Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>) |
List | pgsql-hackers |
On Fri, Sep 5, 2025 at 6:51 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote: > > On Wed, Sep 3, 2025 at 11:58 AM Ajin Cherian <itsajin@gmail.com> wrote: > > > > > > On Fri, Aug 29, 2025 at 6:50 PM Ashutosh Bapat > > <ashutosh.bapat.oss@gmail.com> wrote: > > > > > > On Fri, Aug 29, 2025 at 11:42 AM Ajin Cherian <itsajin@gmail.com> wrote: > > > > > +/* > > > + * Flag used by pg_sync_replication_slots() > > > + * to do retries if the slot did not persist while syncing. > > > + */ > > > +static bool slot_persistence_pending = false; > > > > > > I don't think we need to keep a global variable for this. The variable > > > is used only inside SyncReplicationSlots() and the call depth is not > > > more than a few calls. From synchronize_slots(), before which the > > > variable is reset and after which the variable is checked, to > > > update_and_persist_local_synced_slot() which sets the variable, all > > > the functions return bool. All of them can be made to return an > > > integer status instead indicating the result of the operation. If we > > > do so we could check the return value of synchronize_slots() to decide > > > whether to retry or not, isntead of maintaining a global variable > > > which has a much wider scope than required. It's difficult to keep it > > > updated over the time. > > > > > > > The problem is that all those calls synchronize_slots() and > > update_and_persist_local_synced_slot() are shared with the slotsync > > worker logic and API. Hence, changing this will affect slotsync_worker > > logic as well. While the API needs to spefically retry only if the > > initial sync fails, the slotsync worker will always be retrying. I > > feel using a global variable is a more convenient way of doing this. > > AFAICS, it's a matter of expanding the scope of what's returned by > those functions. The worker may not want to use the whole expanded > scope but the API will use it. That shouldn't change the functionality > of the worker, but it will help avoid the global variable - which have > much wider scope and their maintenance can be prone to bugs. > I think this can be done. > > > > > @@ -1276,7 +1331,7 @@ wait_for_slot_activity(bool some_slot_updated) > > > > > > The function is too cute to be useful. The code should be part of > > > ReplSlotSyncWorkerMain() just like other worker's main functions. > > > > > > > But this wouldn't be part of this feature. > > > > > void > > > SyncReplicationSlots(WalReceiverConn *wrconn) > > > { > > > PG_ENSURE_ERROR_CLEANUP(slotsync_failure_callback, PointerGetDatum(wrconn)); > > > { > > > > > > Shouldn't this function call CheckForInterrupts() somewhere in the > > > loop since it could be potentially an infinite loop? > > > > I've tested this and I see that interrupts are being handled by > > sending SIGQUIT and SIGINT to the backend process. > > Can you please point me to the code (the call to > CHECK_FOR_INTERRUPTS()) which processes these interrupts while > pg_sync_replication_slots() is executing, especially when the function > is waiting while syncing a slot. > > > I noticed that the function libpqrcv_processTuples, which is invoked > > by fetch_remote_slots, includes a CHECK_FOR_INTERRUPTS call. This is > > currently helping in processing interrupts while we are in an infinite > > loop within SyncReplicationSlots(). I’m just pointing this out based > > on my observation while reviewing the changes in this patch. Ajin, > > please correct me if I’m mistaken. If not, can we always rely on this > > particular check for interrupts. > > It doesn't seem good to rely on CHECKF_FOR_INTERRUPTS from so far > away. It's better to have one being called from SyncReplicationSlots() > which has the wait loop. That's how the other functions which have > potentially long wait loops do. +1 thanks Shveta
pgsql-hackers by date: