Re: BUG #19093: Behavioral change in walreceiver termination between PostgreSQL 14.17 and 14.18 - Mailing list pgsql-bugs

From: Xuneng Zhou
Subject: Re: BUG #19093: Behavioral change in walreceiver termination between PostgreSQL 14.17 and 14.18
Msg-id: CABPTF7VsyWXAQyc-qc8i1OkwsP5hWE6WoGVCH1zbiUhX0=rFCw@mail.gmail.com
In response to: Re: BUG #19093: Behavioral change in walreceiver termination between PostgreSQL 14.17 and 14.18 (Michael Paquier <michael@paquier.xyz>)
List: pgsql-bugs

Hi Michael, Noah,

Thanks for looking into this.

On Wed, Oct 29, 2025 at 1:01 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Sun, Oct 26, 2025 at 09:12:41PM -0700, Noah Misch wrote:
> > On Fri, Oct 24, 2025 at 02:20:39PM +0800, Xuneng Zhou wrote:
> > (Long-term, in master only, perhaps we should introduce another status like
> > 'connecting'.  Perhaps enact the connecting->streaming status transition just
> > before tendering the first byte of streamed WAL to the startup process.
> > Alternatively, enact that transition when the startup process accepts the
> > first streamed byte.  Then your application's health check would get what it
> > wants.)
>
> Having a "connecting" status here would make sense to me, yep.

+1, I am also interested in hacking on this.
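
To make that a bit more concrete, here is a rough sketch of what the new
state could look like in WalRcvState (src/include/replication/walreceiver.h).
The name WALRCV_CONNECTING, its position, and the comments are placeholders
(the existing entries are paraphrased from memory), not a concrete proposal:

/* Sketch only: names, position, and comments are placeholders. */
typedef enum
{
	WALRCV_STOPPED,			/* stopped and mustn't start up again */
	WALRCV_STARTING,		/* launched, but the process hasn't initialized yet */
	WALRCV_CONNECTING,		/* connected, but no streamed WAL handed to startup yet */
	WALRCV_STREAMING,		/* walreceiver is streaming */
	WALRCV_WAITING,			/* stopped streaming, waiting for orders */
	WALRCV_RESTARTING,		/* asked to restart streaming */
	WALRCV_STOPPING			/* requested to stop, but not yet dead */
} WalRcvState;

The connecting->streaming transition could then happen just before the first
streamed byte is handed to the startup process (or when the startup process
accepts it), along the lines Noah suggested above.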

> > This change would be wrong if WALRCV_STOPPING were a reachable state here.
> > That state is the startup process asking walreceiver to stop.  walreceiver may
> > then still be installing segments, so this location would want to call
> > XLogShutdownWalRcv() to wait for WALRCV_STOPPED.  That said, WALRCV_STOPPING
> > is a transient state while the startup process is in ShutdownWalRcv().  Hence,
> > I expect STOPPING never appears here, and there's no bug.  An assertion may be
> > in order.
>
> A WAL receiver marked as in STOPPING state should never be reached in
> this code path as far as I recall the matter.  An assertion sounds
> like a cheap insurance anyway.

Should we wrap this assertion in an #ifdef USE_ASSERT_CHECKING block, so the
spinlock is only taken in assert-enabled builds?  Something like:

#ifdef USE_ASSERT_CHECKING
	{
		WalRcvData *walrcv = WalRcv;
		WalRcvState state;

		/* Read the current state under the spinlock for a consistent view */
		SpinLockAcquire(&walrcv->mutex);
		state = walrcv->walRcvState;
		SpinLockRelease(&walrcv->mutex);

		Assert(state != WALRCV_STOPPING);
	}
#endif

> > Can you add a TAP test for this case?  Since it was wrong in v15+ for >3y and
> > wrong in v14 for 5mon before this report, clearly we had a blind spot.
>
> Hmm.  I think that we could tweak the recovery 004_timeline_switch.pl
> that has scenarios for TLI switches, which is what this is about.  And
> there is no need to rely on a failover of a previous primary to a
> promoted standby, we could do the same with a cascading standby.
>
> Based on the logs I can see at 14.18:
> 2025-10-29 11:09:36.964 JST walreceiver[711] LOG:  replication
> terminated by primary server
> 2025-10-29 11:09:36.964 JST walreceiver[711] DETAIL:  End of WAL
> reached on timeline 1 at 0/03024D18.
> 2025-10-29 11:09:36.964 JST walreceiver[711] FATAL:  terminating
> walreceiver process due to administrator command
> 2025-10-29 11:09:36.964 JST startup[710] LOG:  new target timeline is 2
>
> And on older versions, at 14.17:
> 2025-10-29 11:14:32.745 JST walreceiver[5857] LOG:  replication
> terminated by primary server
> 2025-10-29 11:14:32.745 JST walreceiver[5857] DETAIL:  End of WAL
> reached on timeline 1 at 0/03024D18.
> 2025-10-29 11:14:32.745 JST startup[5856] LOG:  new target timeline
> is 2
>
> I was thinking about two options to provide some coverage:
> - Check that the PID of the WAL receiver is still the same before and
> after the TLI switch, with a restart or a reload after changing
> primary_conninfo.  This cannot be made stable without an injection
> point or equivalent to stop the WAL receiver from doing any job before
> it does the TLI jump and trigger the code path we are discussing here.
> - Check for the logs that we never issue a "terminating walreceiver
> process due to administrator command".  This would be simpler in the
> long term.
>
> Hence I would suggest just the following addition:
> --- a/src/test/recovery/t/004_timeline_switch.pl
> +++ b/src/test/recovery/t/004_timeline_switch.pl
> @@ -66,6 +66,11 @@ my $result =
>    $node_standby_2->safe_psql('postgres', "SELECT count(*) FROM tab_int");
>  is($result, qq(2000), 'check content of standby 2');
>
> +# Check the logs, WAL receiver should not have been stopped.  There is no need
> +# to rely on a position in the logs: a new log file is used on node restart.
> +ok( !$node_standby_2->log_contains(
> +  "FATAL: .* terminating walreceiver process due to administrator command"),
> +  'WAL receiver should not be stopped across timeline jumps');
>
>  # Ensure that a standby is able to follow a primary on a newer timeline
>  # when WAL archiving is enabled.

I attempted to write a TAP test to reproduce this scenario. Since I'm
not very familiar with the recovery TAP framework, it was somewhat
difficult to replicate the exact conditions of the issue. Here are the
logs of the attached TAP test, run against both HEAD and the patched
version:

Patched:
2025-10-29 13:24:18.016 CST walreceiver[4244] LOG:  replication
terminated by primary server
2025-10-29 13:24:18.016 CST walreceiver[4244] DETAIL:  End of WAL
reached on timeline 1 at 0/030000A0.
2025-10-29 13:24:18.016 CST startup[4243] DEBUG:  transaction ID wrap
limit is 2147484395, limited by database with OID 1
2025-10-29 13:24:18.016 CST startup[4243] CONTEXT:  WAL redo at
0/03000028 for XLOG/CHECKPOINT_SHUTDOWN: redo 0/03000028; tli 1; prev
tli 1; fpw true; wal_level replica; xid 0:758; oid 16389; multi 1;
offset 0; oldest xid 748 in DB 1; oldest multi 1 in DB 1;
oldest/newest commit timestamp xid: 0/0; oldest running xid 0;
shutdown
2025-10-29 13:24:18.016 CST walreceiver[4244] DEBUG:  walreceiver
ended streaming and awaits new instructions
2025-10-29 13:24:18.016 CST startup[4243] LOG:  new target timeline is 2

HEAD (unpatched):
2025-10-29 13:29:03.697 CST walreceiver[7140] LOG:  replication
terminated by primary server
2025-10-29 13:29:03.697 CST walreceiver[7140] DETAIL:  End of WAL
reached on timeline 1 at 0/030000A0.
2025-10-29 13:29:03.697 CST startup[7139] DEBUG:  transaction ID wrap
limit is 2147484395, limited by database with OID 1
2025-10-29 13:29:03.697 CST startup[7139] CONTEXT:  WAL redo at
0/03000028 for XLOG/CHECKPOINT_SHUTDOWN: redo 0/03000028; tli 1; prev
tli 1; fpw true; wal_level replica; xid 0:758; oid 16389; multi 1;
offset 0; oldest xid 748 in DB 1; oldest multi 1 in DB 1;
oldest/newest commit timestamp xid: 0/0; oldest running xid 0;
shutdown
2025-10-29 13:29:03.697 CST walreceiver[7140] DEBUG:  walreceiver
ended streaming and awaits new instructions
2025-10-29 13:29:03.697 CST walreceiver[7140] FATAL:  terminating
walreceiver process due to administrator command
2025-10-29 13:29:03.697 CST startup[7139] LOG:  new target timeline is 2

With the prior patch applied, the walreceiver process remains alive as
expected. However, on unpatched HEAD I didn't observe the repeated
restart behavior shown in the original bug report logs. I suspect that
reproducing it precisely requires reaching a state similar to the
following log entry:

Oct 20 10:35:52 S2 postgres LOG:  new timeline 9 forked off current
database system timeline 7 before current recovery point 1/9A806380

At that point the sleep would kick in, giving us a window to observe
the walreceiver in its "waiting" state and to verify process stability
by checking that its PID stays the same.

However, I think Michael's approach is better: it's simpler and
captures the buggy behavior well. I plan to incorporate it into the
next version of the patch.

> Note: the issue is also reachable on REL_13_STABLE, through
> 69a498eb6465.  Are you confident that we would be able to get
> something into this branch by the next minor release?  Based on the
> state of the problem and the advanced analysis, it sounds to me that
> we should be able to conclude by the middle of next week, or something
> close to that.
>
> > postgr.es/m/YyACvP++zgDphlcm@paquier.xyz discusses a
> > "standby.signal+primary_conninfo" case.  How will this patch interact with
> > that case?
>
> This does not sound like an issue to me?  This mentions a case where
> we don't have a restore_command and where changes are pushed to the
> local pg_wal/ by an external source.  If a WAL receiver is waiting,
> the patch means that ResetInstallXLogFileSegmentActive() would be
> called and the InstallXLogFileSegmentActive flag reset.  I don't see
> why that's not OK.  Or are you foreseeing something I don't?

I agree that the patch would handle that case correctly as well. When
the walreceiver stops streaming, regardless of the reason,
ResetInstallXLogFileSegmentActive() is invoked and the
InstallXLogFileSegmentActive flag is cleared. This lets the startup
process safely read WAL files from pg_wal/ without conflicts, so the
assertion Assert(!IsInstallXLogFileSegmentActive()) in the archive
recovery code passes as expected.
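
To spell out the shape I have in mind at that spot (a sketch only, not the
actual patch hunk; the exact placement in the startup code is my assumption):

	/*
	 * Sketch only: when the startup process switches to reading WAL from
	 * the archive or pg_wal/, clear the flag instead of terminating a WAL
	 * receiver that has merely finished streaming, so that
	 * Assert(!IsInstallXLogFileSegmentActive()) in the archive recovery
	 * code holds and the receiver stays alive, waiting for new
	 * instructions.
	 */
	ResetInstallXLogFileSegmentActive();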

Best,
Xuneng
