Re: POC: enable logical decoding when wal_level = 'replica' without a server restart - Mailing list pgsql-hackers
From | shveta malik |
---|---|
Subject | Re: POC: enable logical decoding when wal_level = 'replica' without a server restart |
Date | |
Msg-id | CAJpy0uCSM9Go1mHp58w504MEfFEUqwu6vHfDwNnn-SobnQaZig@mail.gmail.com Whole thread Raw |
In response to | Re: POC: enable logical decoding when wal_level = 'replica' without a server restart (Masahiko Sawada <sawada.mshk@gmail.com>) |
List | pgsql-hackers |
On Fri, Aug 8, 2025 at 3:30 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Aug 5, 2025 at 11:23 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > Please find a few comments on v6: > > > > 1) > > +/* > > + * Initialize logical decoding status on shmem at server startup. This > > + * must be called ONCE during postmaster or standalone-backend startup, > > + * before initializing replication slots. > > + */ > > +void > > +StartupLogicalDecodingStatus(bool last_status) > > > > The comment says that it needs to be called 'before initializing > > replication slots' but instead it is called after initializing > > replication slots (i.e. after StartupReplicationSlots). > > Removed. > > > Also, can you please help me understand the need of > > 'StartupLogicalDecodingStatus' when we are doing > > 'UpdateLogicalDecodingStatusEndOfRecovery' later in StartupXLOG. Why > > do we need to set last_status temporarily when the new status can be > > different which will be set in > > UpdateLogicalDecodingStatusEndOfRecovery > > IIUC we need to initialize the logical decoding status with the status > we used to use when the server shutdown or when the basebackup was > taken. This status would be used during recovery and might be changed > by replaying the XLOG_LOGICAL_DECODING_STATUS_CHANGE record. At the > end of recovery, we update the status based on the server's wal_level > and the number of logical replication slots so the new status could be > different from the status used during the recovery. > Okay. > > > > > > 2) > > CreatePublication() has this: > > > > + errmsg("logical decoding needs to be enabled to publish logical changes"), > > + errhint("Set \"wal_level\" to \"logical\" or create a logical > > replication slot with \"replica\" \"wal_level\" before creating > > subscriptions."))); > > > > While rest of the places has this: > > > > + errhint("Set \"wal_level\" >= \"logical\" or create at least one > > logical slot on the primary."))); > > > > Shall we make these errhint consistent? Either all mention > > 'wal_level=replica' condition along with slot-creation part or none. > > Fixed. > > > > > > > 3) > > xlog_decode(): > > > > + case XLOG_LOGICAL_DECODING_STATUS_CHANGE: > > /* > > * This can occur only on a standby, as a primary would > > - * not allow to restart after changing wal_level < logical > > + * not allow to restart after changing wal_level < replica > > * if there is pre-existing logical slot. > > */ > > Assert(RecoveryInProgress()); > > ereport(ERROR, > > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > > - errmsg("logical decoding on standby requires \"wal_level\" >= > > \"logical\" on the primary"))); > > + errmsg("logical decoding must be enabled on the primary"))); > > > > Is the comment correct? > > > > a) > > XLOG_LOGICAL_DECODING_STATUS_CHANGE can be result of logical-slot drop > > on primary and not necessarily making wal_level < replica > > > > b) > > I see that even standby does not allow to restart when changing > > wal_level < replica as against what comment says. Have I understood > > the intent correctly? > > > > standby LOG: > > FATAL: logical replication slot "failover_slot_st" exists, but > > "wal_level" < "replica" > > HINT: Change "wal_level" to be "replica" or higher. > > I think I don't get your point. The comment looks correct to me. > > On a primary server, a logical slot can only decode WAL records that > were generated while that slot existed. A > XLOG_LOGICAL_DECODING_STATUS_CHANGE record with logical_decoding=false > is generated in two cases: after the last logical slot is dropped, or > when the server starts with no logical slot and wal_level<='replica'. > In either case, no logical slots can exist that would be able to > decode these WAL records. However, on standby servers, it is possible > to decode these XLOG_LOGICAL_DECODING_STATUS_CHANGE records with > logical_decoding=false, as standbys can decode WAL records > independently of the primary. > Okay, I see your point. Thanks for explaining. > > > > > > 4) > > start_logical_decoding_status_change(): > > + if (LogicalDecodingCtl->transition_in_progress) > > + { > > + LWLockRelease(LogicalDecodingControlLock); > > > > read_logical_decoding_status_transition() takes care of checking > > transition_in_progress, I think we missed to remove above from > > start_logical_decoding_status_change(). > > Fixed. > > > > > > > 5) > > + /* Return if we don't need to change the status */ > > + if (LogicalDecodingCtl->logical_decoding_enabled == new_status) > > + { > > > > Same with this code-logic in start_logical_decoding_status_change(), > > we shall remove it. > > Fixed. > > > > > 6) > > + * If we're in recovery and the startup process is still taking > > + * responsibility to update the status, we cannot change. > > + */ > > + if (!delay_status_change) > > + return false; > > + > > > > This comment is confusing as when in recovery, we can not change state > > otherwise as well even if delay_status_change is false. IIUC, the > > scenario can arise only during promotion, if so, shall we say: > > > > "If we're in recovery and a state transition (e.g., promotion) is in > > progress, wait for the transition to complete and retry on the new > > primary. Otherwise, disallow the status change entirely, as a standby > > cannot modify the logical decoding status." > > Fixed. > > > > > 7) > > The name 'delay_status_change' does not indicate which status or the > > intent of delay. More name options are: defer_logical_status_change, > > wait_for_recovery_transition/completion, > > recovery_transition_in_progress > > I think 'delay' is used in other similar examples in PostgreSQL code. > For instance, we have DELAY_CHKPT_START/COMPLETE/IN_COMMIT that are > set by transactions to delay the actual checkpoint process until these > transactions complete certain operations. In our case, the flag is set > by the startup process in order to delay the actual status change > process by other processes until the recovery completes. Which is a > very similar usage so I believe 'delay' is appropriate here. > > Regarding the 'status', I guess it's relatively obvious in this > context that the status indicates the logical decoding status so I'm > not sure that readers would confuse this name. > Okay, we can retain the same. > > > > 8) > > DisableLogicalDecodingIfNecessary(): > > + > > + /* Write the WAL to disable logical decoding on standbys too */ > > + if (XLogStandbyInfoActive() && !recoveryInProgress) > > + { > > > > Do we need 'recoveryInProgress' check here? > > start_logical_decoding_status_change() has taken care of that. > > Removed. > > > > > 9) > > Comments atop DisableLogicalDecodingIfNecessary: > > > > * This function expects to be called after dropping a possibly-last logical > > * replication slot. Logical decoding can be disabled only when wal_level is set > > * to 'replica' and there is no logical replication slot on the system. > > > > The comment is not completely true, shall we amend the comment to say > > something like: > > > > This function is called after a logical slot is dropped, but it only > > disables logical decoding on primary if it was the last remaining > > logical slot and wal_level < logical. Otherwise, it performs no > > action. > > Thank you for the suggestion. I modified the comment based on the suggestion > > > > > 10) > > When we try to create or drop a logical slot on standby, and if > > delay_status_change is false, shall we immediately exit? Currently it > > does a lot of checks including CheckLogicalSlotExists() which can be > > completely avoided. I think it is worth having a quick > > 'RecoveryInProgress() && !delay_status_change' check in the beginning. > > Yeah, we can simplify the start_logical_decoding_status_change() logic more. > > I've attached the updated patch. > > Regards, > > -- > Masahiko Sawada > Amazon Web Services: https://aws.amazon.com
pgsql-hackers by date: