Thread: Re: Standby server with cascade logical replication could not be properly stopped under load

On Thu, May 22, 2025 at 7:51 AM Alexey Makhmutov
<a.makhmutov@postgrespro.ru> wrote:
>
> Assume the following configuration with three connected servers
> A->B->C: A (primary), B (physical standby) and C (logical replica
> connected to B). If server A is under load and B is applying incoming
> WAL records while also streaming data via logical replication to C,
> then an attempt to stop server B in 'fast' mode may be unsuccessful.
> In this case the server will remain in the PM_SHUTDOWN state
> indefinitely, with all 'walsender' processes running in an infinite
> busy loop (each consuming a full CPU core). To get the server out of
> this state it's necessary either to stop B using 'immediate' mode or
> to stop server C (which will cause the 'walsender' processes on
> server B to exit). This issue is reproducible on the latest 'master',
> as well as on the current PG 16/17 branches.
>
> Attached is a test scenario to reproduce the issue: 'test_scenario.zip'.
> This archive contains shell scripts to create the required environment
> (all three servers) and then to execute the steps required to get the
> server into the incorrect state. First, edit the 'test_env.sh' file to
> specify the path to the PG binaries in the PG_PATH variable and
> optionally the set of ports used by the test instances in the
> 'pg_port' array. Then execute the 'test_prepare.sh' script, which will
> create, configure and start all three PG instances. The servers can
> then be started and stopped using the corresponding start and stop
> scripts. To reproduce the issue, ensure that all three servers are
> running and execute the 'test_execute.sh' script. This script starts a
> 'pgbench' instance in the background for 30 seconds to create load on
> server A, waits for 20 seconds and then tries to stop the B instance
> using the default 'fast' mode. The expected behavior is a normal
> shutdown of B, while the observed behavior is different: the shutdown
> attempt fails and each remaining 'walsender' process consumes an
> entire CPU core. To get out of this state one can use the 'stop-C.sh'
> script to stop server C, as it will complete the shutdown process of
> the B instance as well.
>
> My understanding is that this issue seems to be caused by the logic in
> the 'GetStandbyFlushRecPtr' function, which returns the current flush
> point for received WAL data. This position is used in 'XLogSendLogical'
> to determine whether the current walsender is in the 'caught up' state
> (i.e. whether we have sent all the available data to the downstream
> instance). During shutdown, walsenders are allowed to continue their
> work until they are in the 'caught up' state, while 'postmaster' is
> waiting for their completion. Currently 'GetStandbyFlushRecPtr' returns
> the position of the last stored record, rather than the last applied
> record. This is correct for physical replication, as we can send data
> to a downstream instance without applying it on the local system.
> However, for logical replication this seems to be incorrect, as we
> cannot decode data until it has been applied on the current instance.
> So, if the current stored WAL position differs from the applied
> position while the server is being stopped, then the
> 'WalSndLoop'/'XLogSendLogical'/'XLogReadRecord' functions will spin in
> a busy loop, waiting for the applied position to advance. The recovery
> process is already stopped at this moment, so this becomes an infinite
> loop. Probably either the 'GetStandbyFlushRecPtr' or the
> 'WalSndLoop'/'XLogSendLogical' logic needs to be adjusted to take this
> logical replication case into consideration.
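>
> To illustrate, the caught-up check in 'XLogSendLogical' looks roughly
> like this (condensed from 'walsender.c'; error handling and some
> details are omitted):
>
>     static void
>     XLogSendLogical(void)
>     {
>         XLogRecord *record;
>         char       *errm;
>         /* cached, since GetFlushRecPtr() takes a contended spinlock */
>         static XLogRecPtr flushPtr = InvalidXLogRecPtr;
>
>         WalSndCaughtUp = false;
>
>         record = XLogReadRecord(logical_decoding_ctx->reader, &errm);
>         if (record != NULL)
>         {
>             LogicalDecodingProcessRecord(logical_decoding_ctx,
>                                          logical_decoding_ctx->reader);
>             sentPtr = logical_decoding_ctx->reader->EndRecPtr;
>         }
>
>         /* Refresh the target flush point once we appear to reach it. */
>         if (flushPtr == InvalidXLogRecPtr ||
>             logical_decoding_ctx->reader->EndRecPtr >= flushPtr)
>         {
>             if (am_cascading_walsender)
>                 flushPtr = GetStandbyFlushRecPtr(NULL); /* stored position */
>             else
>                 flushPtr = GetFlushRecPtr(NULL);
>         }
>
>         /* We are 'caught up' only once decoding reaches the flush point. */
>         if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr)
>             WalSndCaughtUp = true;
>     }
>
> During shutdown 'WalSndLoop' keeps calling this function until
> 'WalSndCaughtUp' becomes true, but on a standby 'EndRecPtr' cannot
> advance past the last applied record, so with 'flushPtr' ahead of it
> the condition is never met.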
>
> Attached is also a patch which aims to fix this issue:
> 0001-Use-only-replayed-position-as-target-flush-point-for.patch. It
> modifies the behavior of the 'GetStandbyFlushRecPtr' function to
> ensure that it returns only the applied position for logical
> replication. This function can also be invoked from the slot
> synchronization routines, and in this case it retains the current
> behavior by returning the last stored position.
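>
> The gist of the change (abridged; the exact check is in the attached
> patch):
>
>     XLogRecPtr
>     GetStandbyFlushRecPtr(TimeLineID *tli)
>     {
>         ...
>         receivePtr = GetWalRcvFlushRecPtr(NULL, &receiveTLI);
>         replayPtr = GetXLogReplayRecPtr(&replayTLI);
>
>         result = replayPtr;
>         if (receiveTLI == replayTLI && receivePtr > replayPtr)
>             result = receivePtr;        /* stored position */
>
>         /*
>          * A cascading logical walsender can only decode records that
>          * have already been applied, so report just the replayed
>          * position for it. The slot synchronization caller keeps
>          * using the stored position.
>          */
>         if (am_cascading_walsender && MyReplicationSlot != NULL &&
>             SlotIsLogical(MyReplicationSlot))
>             result = replayPtr;
>         ...
>     }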
>

The problem stated, with a logical walsender running on a physical
standby, looks genuine. I agree with the analysis for slot-sync as
well: slot-sync does not need the fix, as it deals only with the flush
position and does not care about the replay position. Since the
problem area falls under 'Allow logical decoding on standbys', I am
adding Bertrand for further comments on this fix.

thanks
Shveta



Hi,

On Thu, May 22, 2025 at 04:29:37PM +0300, Alexey Makhmutov wrote:
> Hi Ajin and Shveta, thank you for looking at this issue.
> 
> On 5/22/25 12:48, Ajin Cherian wrote:
> 
> > Just a small comment: can you explicitly mention in the comments that
> > "for logical replication we can only send records that have already
> > been replayed else we might get stuck in shutdown" or something to
> > that effect as that distinction is important for future developers in
> > this area.
> 
> Thank you. I've expanded the comment to clarify the expected logic in
> this case.
> Updated patch is attached as
> '0001-Use-only-replayed-position-as-target-flush-point-2.patch'.

Thanks for the report!

I agree with your analysis and I think that your proposed fix makes
sense. I wonder if the comment on top of GetStandbyFlushRecPtr() could
be updated a bit, though.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Hi Bertrand, thank you for looking at this issue.

On 5/22/25 18:31, Bertrand Drouvot wrote:
> I agree with your analysis and I think that your proposed fix makes
> sense. I wonder if the comment on top of GetStandbyFlushRecPtr() could
> be updated a bit, though.

I've tried to update the method comment as well in the updated patch:
0001-Use-only-replayed-position-as-target-flush-point-3-pg-master.patch.
The same patch can be applied on top of PG 17 as well.

PG 16 does not have the slot synchronization logic, so its
implementation of GetStandbyFlushRecPtr is slightly different, and the
comments need to be updated to reflect this discrepancy as well. So,
here is a variant of the same patch, updated to be compatible with PG 16:
0002-Use-only-replayed-position-as-target-flush-point-3-pg-16.patch
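
For reference, the reworked header comment explains the distinction
along these lines (abridged):

    /*
     * Returns the latest point in WAL that has been safely flushed to
     * disk and can be sent to a cascaded standby. This should only be
     * called when in recovery.
     *
     * For a cascading logical walsender only the replayed position is
     * returned, as logical decoding cannot process records that have
     * not yet been applied on this instance. The slot synchronization
     * caller keeps receiving the stored (flushed) position.
     *
     * As a side-effect, *tli is updated to the TLI of the last replayed
     * WAL record.
     */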

Thanks,
Alexey



On Thu, May 29, 2025 at 5:02 PM Michael Paquier <michael@paquier.xyz> wrote:
> [...] and master versions.
>
> Yeah, I think that this is a sensible move.  Documenting the reason
> why GetXLogReplayRecPtr() is called is important, but the comment you
> are suggesting could be better, IMO, say:
> "For cascading logical WAL senders, we use the replay LSN rather than
> the flush LSN, as decoding would only happen with records already
> replayed.  This distinction is important especially during the
> shutdown sequence, as cascading logical WAL senders can only catch up
> with records that have been already replayed, not flushed."
>
> That feels like I'm repeating myself a bit if formulated this way.
> If you have a better suggestion, feel free..


I think this new fix is much better and cleaner. A suggestion for the comment:
"For cascading logical WAL senders, we use the replay LSN instead of the flush LSN, since logical decoding on a standby only processes WAL that has been replayed. This distinction becomes particularly important during shutdown, as new WAL is no longer replayed and the last replayed LSN marks the furthest point up to which decoding can proceed."

regards,
Ajin Cherian
Fujitsu Australia


On Thu, May 29, 2025 at 08:28:01PM +1000, Ajin Cherian wrote:
> I think this new fix is much better and cleaner. A suggestion for the
> comment:
> "For cascading logical WAL senders, we use the replay LSN instead of the
> flush LSN, since logical decoding on a standby only processes WAL that has
> been replayed. This distinction becomes particularly important during
> shutdown, as new WAL is no longer replayed and the last replayed LSN marks
> the furthest point up to which decoding can proceed."

Yes, that sounds better than my previous suggestion.  Thanks.
--
Michael
