Re: POC: enable logical decoding when wal_level = 'replica' without a server restart - Mailing list pgsql-hackers
From | Masahiko Sawada |
---|---|
Subject | Re: POC: enable logical decoding when wal_level = 'replica' without a server restart |
Date | |
Msg-id | CAD21AoAz1RkCfs-VD6Sm9bCFKiDC=9O-KAtcjxXeL76O3z8PaQ@mail.gmail.com Whole thread Raw |
In response to | RE: POC: enable logical decoding when wal_level = 'replica' without a server restart ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>) |
Responses |
Re: POC: enable logical decoding when wal_level = 'replica' without a server restart
RE: POC: enable logical decoding when wal_level = 'replica' without a server restart Re: POC: enable logical decoding when wal_level = 'replica' without a server restart RE: POC: enable logical decoding when wal_level = 'replica' without a server restart |
List | pgsql-hackers |
On Wed, Aug 27, 2025 at 7:45 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
>
> Dear Sawada-san,
>
> > > Assuming that logical_decoding written in the WAL is false here, and a logical
> > > replication slot is created just after that. In my experiments below happened:
> > >
> >
> > Let me clarify each step:
> >
> > > 1. startup process updated logical_decoding_enabled to false, at line 8652.
> >
> > I assume that logical_decoding_enabled was enabled before step 1.
>
> Right. Initially logical replication slot exist on both primary and standby.
> More detail; the standby slot was created by the slotsync worker.
>
> > > 2. slotsync worker started to sync. Surprisingly, it created a (second) logical
> > > slot and started logical decoding with fast_foward mode.
> >
> > I guess that the postmaster launched the slotsync worker before the
> > startup changes the status since logical decoding was enabled as I
> > mentioned above, which seems fine to me.
>
> As you said, the slotsync worker has already been launched when the status is
> changed. I felt logical slot() should not be created after the status on the shared
> memory is changed.
>
> > > 3. startup invalidated logical slots due to the wal_level. the slot created at
> > > step2 was automatically dropped, because it was not sync-readly yet.
> > > 4. startup process shut down the slotsync worker.
> > > 5. start process read the STATUS_CHANGE record again, which has the value "true".
> > > it requested to restart the sync worker.
> > > 6. restarted sync worker synchronize the slot again...
> > >
> > > For me it works well but it is bit a strange because 1) logical decoding is
> > > started even when effective_wal_level is false,
> >
> > I think it's a race condition between the postmaster and the startup,
> > it could happen even between the backend and the startup; the startup
> > disables logical decoding right after the backend passes
> > CheckLogicalDecodingRequirements() check.
> > I think it's technically
> > okay since all WAL records before the STATUS_CHANGE should have the
> > logical information. Even if it starts to do logical decoding, it
> > would end up decoding the STATUS_CHANGE record and with an error (see
> > xlog_decode()).

My understanding of where the synced slot starts to advance was wrong:
it starts from the remote slot's restart_lsn, which could be far ahead
of the STATUS_CHANGE record that the startup process is applying, but at
a point where logical decoding should be enabled. So the slotsync worker
never tries to decode non-logical WAL records, even if it advances the
slot after the startup process has disabled logical decoding.

> To clarify, are you thinking that it is no need to be fixed, because eventually
> the system becomes the appropriate state, right?

IIUC you're concerned that the slotsync worker could create or advance a
logical slot in the window between the startup process setting the
logical decoding status to false and sending the stop signal. TBH I have
no idea how to fix it efficiently. I've considered the simple idea of
having the slotsync worker check IsLogicalDecodingEnabled() before
syncing each logical slot. However, that doesn't solve the race
condition: the startup process can disable logical decoding right after
the slotsync worker passes the check, in which case users would see a
logical slot created after logical decoding is disabled.

Another race condition that we might need to deal with: the slotsync
worker is launched while logical decoding is still enabled, but if the
startup process sends the stop signal before the worker sets its pid in
SlotSyncCtx->pid, the worker will keep running. I've added a
!IsLogicalDecodingEnabled() check to the slotsync worker's
initialization.
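To make the second race concrete, here is a minimal single-threaded sketch of the interleaving described above. It is not code from the patch: SketchCtl, startup_disable_decoding() and worker_init() are hypothetical stand-ins for the shared-memory state, the startup process and the slotsync worker, and the ordering "register pid, then re-check the status" is an assumption about how such an initialization check could be arranged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the shared state; not PostgreSQL's structs. */
typedef struct SketchCtl
{
    bool logical_decoding_enabled; /* flipped by the "startup" side */
    int  worker_pid;               /* 0 until the worker registers itself */
    bool stop_signal_delivered;    /* true only if a registered pid was signaled */
} SketchCtl;

/*
 * Startup side: disable logical decoding, then signal the worker, but
 * only if its pid is already registered. Otherwise the stop request is
 * silently lost, which is the race described above.
 */
void
startup_disable_decoding(SketchCtl *ctl)
{
    assert(ctl != NULL);
    ctl->logical_decoding_enabled = false;
    if (ctl->worker_pid != 0)
        ctl->stop_signal_delivered = true;
}

/*
 * Worker side: register the pid first, then (optionally) re-check the
 * status. Returns true if the worker keeps running, false if it exits.
 * The ordering is an assumption made for this sketch.
 */
bool
worker_init(SketchCtl *ctl, int my_pid, bool recheck_status)
{
    assert(ctl != NULL && my_pid != 0);
    ctl->worker_pid = my_pid;
    if (recheck_status && !ctl->logical_decoding_enabled)
    {
        ctl->worker_pid = 0;    /* unregister before exiting */
        return false;
    }
    return true;
}
```

If startup_disable_decoding() runs before worker_init(), no signal is delivered; with recheck_status set to false the worker keeps running, while with it set to true the worker notices the disabled status and exits. Note that this only covers the missed-signal case: a check performed before syncing each slot remains a check-then-act pattern, so the status can still change right after the check passes.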
>
> > > and 2) the synced slot is
> > > dropped once with below message:
> > >
> > > ```
> > > LOG: terminating process 1474448 to release replication slot "test2"
> > > DETAIL: Logical decoding on standby requires "wal_level" >= "logical" or at least one logical slot on the primary server.
> > > CONTEXT: WAL redo at 0/030000B8 for XLOG/LOGICAL_DECODING_STATUS_CHANGE: false
> > > ERROR: canceling statement due to conflict with recovery
> > > DETAIL: User was using a logical replication slot that must be invalidated.
> > > ```
> > >
> > > Can we stop the sync worker before updating the status? IIUC this is one of the
> > > solution.
> >
> > I think it would lead to another race condition; the slotsync worker
> > can start again before updating the status.
>
> Hmm, okay.
>
> Another small comment: this data structure is not used in other files, no need to set extern.
>
> ```
> extern LogicalDecodingCtlData *LogicalDecodingCtl;
> ```

Removed.

I've attached the updated patch.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment