Re: Improve pg_sync_replication_slots() to wait for primary to advance - Mailing list pgsql-hackers
From | Ashutosh Sharma |
---|---|
Subject | Re: Improve pg_sync_replication_slots() to wait for primary to advance |
Date | |
Msg-id | CAE9k0PnRW3UZSVeC0DYSXBiWhzhq9CwkvFZ7GC2TFLmV+_OHhw@mail.gmail.com Whole thread Raw |
In response to | Re: Improve pg_sync_replication_slots() to wait for primary to advance (shveta malik <shveta.malik@gmail.com>) |
Responses |
Re: Improve pg_sync_replication_slots() to wait for primary to advance
|
List | pgsql-hackers |
On Tue, Sep 16, 2025 at 11:53 AM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Mon, Sep 15, 2025 at 6:17 PM Ajin Cherian <itsajin@gmail.com> wrote:
> >
> > On Wed, Sep 10, 2025 at 2:45 PM shveta malik <shveta.malik@gmail.com> wrote:
> > >
> > > On Tue, Sep 9, 2025 at 5:37 PM Ajin Cherian <itsajin@gmail.com> wrote:
> > > >
> > > > Attached v11 patch addressing the above comments.
> > > >
> > >
> > > Please find a few comments:
> > >
> > > 1)
> > >
> > > + Retry is done after 2
> > > + * sec wait. Exits early if promotion is triggered or certain critical
> > >
> > > We can say: Retry is done after SLOTSYNC_API_NAPTIME_MS wait.
> > >
> >
> > Changed.
> >
> > > 2)
> > > + /*
> > > + * Fetch remote slot info for the given slot_names. If slot_names is NIL,
> > > + * fetch all failover-enabled slots. Note that we reuse slot_names from
> > > + * the previous iteration; re-fetching all failover slots each time could
> > > + * cause an endless loop.
> > > + */
> > >
> > > a)
> > > the previous iteration --> the first iteration.
> > >
> > > b) Also we can mention the reason why we take names from first
> > > iteration instead of going for pending ones alone, something like:
> > >
> > > Instead of reprocessing only the pending slots in each iteration, it's
> > > better to process all the slots received in the first iteration.
> > > This ensures that by the time we're done, all slots reflect the latest values.
> > >
> > > 3)
> > > + remote_slots = fetch_remote_slots(wrconn, slot_names);
> > > +
> > > +
> > > + /* Attempt to synchronize slots */
> > > + synchronize_slots(wrconn, remote_slots, &slot_persistence_pending);
> > >
> > > One extra blank line can be removed
> > >
> >
> > Fixed.
> >
> > > 4)
> > >
> > > + /* Clean up slot_names if allocated in TopMemoryContext */
> > > + if (slot_names)
> > > + list_free_deep(slot_names);
> > >
> > > Can we please move it before 'ReplicationSlotCleanup'.
> > >
> >
> > Fixed.
> >
> > > 5)
> > > In case of error in subsequent iteration, slot_names allocated from
> > > TopMemoryContext will be left unfreed?
> > >
> >
> > I've changed the logic so that even on error, slot_names are freed.
>
> I see that you have freed 'slot_names' in config-changed and promotion
> case but the ERROR can come from other flows as well. The idea was to
> somehow to free it (if possible) in slotsync_failure_callback() by
> passing it as an argument, like we do for 'wrconn'.
>
Are you suggesting introducing a structure (for example, SlotSyncContext as shown below) that encapsulates both wrconn and slot_names, and then passing a pointer to this structure as the Datum argument to the slotsync_failure_callback cleanup function, so that the callback can handle freeing wrconn and slot_names and maybe some other members within the structure that allocate memory?
/*
* Extended structure that can hold both connection and slot_names info
*/
typedef struct SlotSyncContext
{
WalReceiverConn *wrconn; /* Must be first for compatibility */} SlotSyncContext;
List *slot_names; /* Pointer to slot_names list */
bool extended; /* Flag to indicate extended context */
SyncReplicationSlots(WalReceiverConn *wrconn)
{
SlotSyncContext sync_ctx;
...
...
/* Initialize extended context */
sync_ctx.wrconn = wrconn;
sync_ctx.slot_names_ptr = &slot_names;
sync_ctx.extended = true;
PG_ENSURE_ERROR_CLEANUP(slotsync_failure_callback, PointerGetDatum(&sync_ctx));
...
}
--
With Regards,
Ashutosh Sharma.
pgsql-hackers by date: