Re: Improve pg_sync_replication_slots() to wait for primary to advance - Mailing list pgsql-hackers

From shveta malik
Subject Re: Improve pg_sync_replication_slots() to wait for primary to advance
Date
Msg-id CAJpy0uAMa6xjVsc9tA8+f0ij4L+gw2gw8F=R1gmeaLwEyHpTSw@mail.gmail.com
Whole thread Raw
In response to Re: Improve pg_sync_replication_slots() to wait for primary to advance  (Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>)
List pgsql-hackers
On Fri, Sep 5, 2025 at 6:51 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Wed, Sep 3, 2025 at 11:58 AM Ajin Cherian <itsajin@gmail.com> wrote:
> >
> >
> > On Fri, Aug 29, 2025 at 6:50 PM Ashutosh Bapat
> > <ashutosh.bapat.oss@gmail.com> wrote:
> > >
> > > On Fri, Aug 29, 2025 at 11:42 AM Ajin Cherian <itsajin@gmail.com> wrote:
> >
> > > +/*
> > > + * Flag used by pg_sync_replication_slots()
> > > + * to do retries if the slot did not persist while syncing.
> > > + */
> > > +static bool slot_persistence_pending = false;
> > >
> > > I don't think we need to keep a global variable for this. The variable
> > > is used only inside SyncReplicationSlots() and the call depth is not
> > > more than a few calls. From synchronize_slots(), before which the
> > > variable is reset and after which the variable is checked, to
> > > update_and_persist_local_synced_slot() which sets the variable, all
> > > the functions return bool. All of them can be made to return an
> > > integer status instead indicating the result of the operation. If we
> > > do so we could check the return value of synchronize_slots() to decide
> > > whether to retry or not, isntead of maintaining a global variable
> > > which has a much wider scope than required. It's difficult to keep it
> > > updated over the time.
> > >
> >
> > The problem is that all those calls synchronize_slots() and
> > update_and_persist_local_synced_slot() are shared with the slotsync
> > worker logic and API. Hence, changing this will affect slotsync_worker
> > logic as well. While the API needs to spefically retry only if the
> > initial sync fails, the slotsync worker will always be retrying. I
> > feel using a global variable is a more convenient way of doing this.
>
> AFAICS, it's a matter of expanding the scope of what's returned by
> those functions. The worker may not want to use the whole expanded
> scope but the API will use it. That shouldn't change the functionality
> of the worker, but it will help avoid the global variable - which have
> much wider scope and their maintenance can be prone to bugs.
>

I think this can be done.

> >
> > > @@ -1276,7 +1331,7 @@ wait_for_slot_activity(bool some_slot_updated)
> > >
> > > The function is too cute to be useful. The code should be part of
> > > ReplSlotSyncWorkerMain() just like other worker's main functions.
> > >
> >
> > But this wouldn't be part of this feature.
> >
> > > void
> > > SyncReplicationSlots(WalReceiverConn *wrconn)
> > > {
> > > PG_ENSURE_ERROR_CLEANUP(slotsync_failure_callback, PointerGetDatum(wrconn));
> > > {
> > >
> > > Shouldn't this function call CheckForInterrupts() somewhere in the
> > > loop since it could be potentially an infinite loop?
> >
> > I've tested this and I see that interrupts are being handled by
> > sending SIGQUIT and SIGINT to the backend process.
>
> Can you please point me to the code (the call to
> CHECK_FOR_INTERRUPTS()) which processes these interrupts while
> pg_sync_replication_slots() is executing, especially when the function
> is waiting while syncing a slot.
>
> > I noticed that the function libpqrcv_processTuples, which is invoked
> > by fetch_remote_slots, includes a CHECK_FOR_INTERRUPTS call. This is
> > currently helping in processing interrupts while we are in an infinite
> > loop within SyncReplicationSlots(). I’m just pointing this out based
> > on my observation while reviewing the changes in this patch. Ajin,
> > please correct me if I’m mistaken. If not, can we always rely on this
> > particular check for interrupts.
>
> It doesn't seem good to rely on CHECKF_FOR_INTERRUPTS from so far
> away. It's better to have one being called from SyncReplicationSlots()
> which has the wait loop. That's how the other functions which have
> potentially long wait loops do.

+1

thanks
Shveta



pgsql-hackers by date:

Previous
From: Ashutosh Sharma
Date:
Subject: Re: Improve pg_sync_replication_slots() to wait for primary to advance
Next
From: Ashutosh Sharma
Date:
Subject: Re: Improve pg_sync_replication_slots() to wait for primary to advance