Re: POC: enable logical decoding when wal_level = 'replica' without a server restart - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: POC: enable logical decoding when wal_level = 'replica' without a server restart
Date
Msg-id CAA4eK1K0QUo-_cT__pdPUbwwab54Tnyg_3oOqOdudnwyhLyJmg@mail.gmail.com
Whole thread Raw
In response to Re: POC: enable logical decoding when wal_level = 'replica' without a server restart  (Masahiko Sawada <sawada.mshk@gmail.com>)
List pgsql-hackers
On Thu, Sep 11, 2025 at 11:16 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Sep 10, 2025 at 11:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Sat, Sep 6, 2025 at 3:46 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > I've attached the updated patch that incorporated all comments I got so far.
> > >
> >
> > *
> > + /*
> > + * While all processes are using the new status, there could be some
> > + * transactions that might have started with the old status. So wait
> > + * for the running transactions to complete so that logical decoding
> > + * doesn't include transactions that wrote WAL with insufficient
> > + * information.
> > + */
> > + running = GetRunningTransactionData();
> > + LWLockRelease(ProcArrayLock);
> > + LWLockRelease(XidGenLock);
> > +
> > + elog(DEBUG1, "waiting for %d transactions to complete", running->xcnt);
> > +
> > + for (int i = 0; i < running->xcnt; i++)
> > + {
> > + TransactionId xid = running->xids[i];
> > +
> > + if (TransactionIdIsCurrentTransactionId(xid))
> > + continue;
> > +
> > + XactLockTableWait(xid, NULL, NULL, XLTW_None);
> > + }
> >
> > When building a snapshot during the start of logical decoding, we
> > anyway wait for running transactions to finish via the snapbuild
> > machinery. So, why do we need it here? And if it is needed, can we
> > update the comments to explain why it is required in spite of
> > snapbuild machinery doing similar thing?
>
> Fair point. I don't see any reason we need to wait here. Will remove this step.
>

We can add a comment there explaining why we don't wait for
in-progress transactions. This will also be important if we miss
anything and later need to handle it similarly.

One thing related to this which needs a discussion is after this
change, it is possible that part of the transaction contains
additional logical_wal_info. I couldn't think of a problem due to this
but users using pg_waldump or other WAL reading utilities could
question this. One possibility is that we always start including
logical_wal_info for the next new transaction but not sure if that is
required. It would be good if other people involved in the discussion
or otherwise could share their opinion on this point.

> > * Is it a good idea to enable/disable decoding for temporary logical
> > slots? The temporary slots are released during ERROR or at session
> > end, is that a good time to do the disable processing that even
> > requires WAL writing.
>
> I think the same is true for slots with RS_EPEMERAL state. Since it
> could confuse users if automatic effective_wal_level change is
> supported only for non-temporary slots, I personally would like not to
> push aside temporary slots. I agree that it might not be a good time
> to disable processing during process shutdown time; in addition to
> requiring WAL record, it also requires waits for concurrent state
> change processings while it holds all interrupts, which could easily
> involve dead-locks.
>

Yes, all such processing during ERROR and shutdown sounds scary and a
source for problems.

> It might be worth considering doing the disable
> process in a lazy way. For example, other processes (like
> checkpointer) periodically checks the logical decoding status and
> disables it if necessary.
>

Yeah, doing lazily sounds reasonable to me. We need to do lazily only
for ERROR cases, otherwise, during a normal drop_slot, it may be okay.
But OTOH, while dropping the slot as a part of subscription drop, it
could be risky because if due to any reason, the disabling took more
time, the subscription drop operation would look like hang or in worse
case, the connection can time out.

For the shutdown sequence, can't we think of resetting effective_wal
after a restart?

--
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: shveta malik
Date:
Subject: Re: Clear logical slot's 'synced' flag on promotion of standby
Next
From: Michael Paquier
Date:
Subject: Re: Incorrect logic in XLogNeedsFlush()