Re: BUG #19093: Behavioral change in walreceiver termination between PostgreSQL 14.17 and 14.18 - Mailing list pgsql-bugs

From Michael Paquier
Subject Re: BUG #19093: Behavioral change in walreceiver termination between PostgreSQL 14.17 and 14.18
Msg-id aQGfoKGgmAbPATp5@paquier.xyz
In response to Re: BUG #19093: Behavioral change in walreceiver termination between PostgreSQL 14.17 and 14.18  (Noah Misch <noah@leadboat.com>)
Responses Re: BUG #19093: Behavioral change in walreceiver termination between PostgreSQL 14.17 and 14.18
List pgsql-bugs
On Sun, Oct 26, 2025 at 09:12:41PM -0700, Noah Misch wrote:
> On Fri, Oct 24, 2025 at 02:20:39PM +0800, Xuneng Zhou wrote:
> (Long-term, in master only, perhaps we should introduce another status like
> 'connecting'.  Perhaps enact the connecting->streaming status transition just
> before tendering the first byte of streamed WAL to the startup process.
> Alternatively, enact that transition when the startup process accepts the
> first streamed byte.  Then your application's health check would get what it
> wants.)

Having a "connecting" status here would make sense to me, yep.
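
To make the quoted idea concrete, here is a toy model of the proposed transition. It is only a sketch under assumptions: the "connecting" status does not exist today, the class and method names are hypothetical, and nothing here mirrors the actual walreceiver code. It implements the first variant Noah describes, where the status flips to "streaming" only when the first byte of streamed WAL is handed over.

```python
from enum import Enum


class WalRcvStatus(Enum):
    STOPPED = "stopped"
    CONNECTING = "connecting"  # hypothetical status, not in any release
    STREAMING = "streaming"


class WalReceiverModel:
    """Toy model; names are illustrative, not PostgreSQL internals."""

    def __init__(self) -> None:
        self.status = WalRcvStatus.STOPPED

    def start(self) -> None:
        # The connection attempt has begun, but no WAL has been received yet.
        self.status = WalRcvStatus.CONNECTING

    def deliver(self, wal_bytes: bytes) -> None:
        # Flip to "streaming" only once at least one byte is handed over.
        if wal_bytes and self.status is WalRcvStatus.CONNECTING:
            self.status = WalRcvStatus.STREAMING


if __name__ == "__main__":
    rcv = WalReceiverModel()
    rcv.start()
    print(rcv.status.value)
    rcv.deliver(b"\x00")
    print(rcv.status.value)
```

With such a model, a health check that looks for "streaming" would not report success until WAL has actually started to flow.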

> This change would be wrong if WALRCV_STOPPING were a reachable state here.
> That state is the startup process asking walreceiver to stop.  walreceiver may
> then still be installing segments, so this location would want to call
> XLogShutdownWalRcv() to wait for WALRCV_STOPPED.  That said, WALRCV_STOPPING
> is a transient state while the startup process is in ShutdownWalRcv().  Hence,
> I expect STOPPING never appears here, and there's no bug.  An assertion may be
> in order.

As far as I recall, a WAL receiver in the STOPPING state should never
be seen in this code path.  An assertion sounds like cheap insurance
anyway.

> Can you add a TAP test for this case?  Since it was wrong in v15+ for >3y and
> wrong in v14 for 5mon before this report, clearly we had a blind spot.

Hmm.  I think that we could tweak the recovery test
004_timeline_switch.pl, which has scenarios for TLI switches, and that
is what this is about.  There is no need to rely on a failover from a
previous primary to a promoted standby; we could do the same with a
cascading standby.

Based on the logs I can see at 14.18:
2025-10-29 11:09:36.964 JST walreceiver[711] LOG:  replication terminated by primary server
2025-10-29 11:09:36.964 JST walreceiver[711] DETAIL:  End of WAL reached on timeline 1 at 0/03024D18.
2025-10-29 11:09:36.964 JST walreceiver[711] FATAL:  terminating walreceiver process due to administrator command
2025-10-29 11:09:36.964 JST startup[710] LOG:  new target timeline is 2

And on older versions, at 14.17:
2025-10-29 11:14:32.745 JST walreceiver[5857] LOG:  replication terminated by primary server
2025-10-29 11:14:32.745 JST walreceiver[5857] DETAIL:  End of WAL reached on timeline 1 at 0/03024D18.
2025-10-29 11:14:32.745 JST startup[5856] LOG:  new target timeline is 2

I was thinking about two options to provide some coverage:
- Check that the PID of the WAL receiver is still the same before and
after the TLI switch, with a restart or a reload after changing
primary_conninfo.  This cannot be made stable without an injection
point or equivalent to stop the WAL receiver from doing any job before
it does the TLI jump and trigger the code path we are discussing here.
- Check in the logs that we never issue a "terminating walreceiver
process due to administrator command".  This would be simpler in the
long term.

Hence I would suggest just the following addition:
--- a/src/test/recovery/t/004_timeline_switch.pl
+++ b/src/test/recovery/t/004_timeline_switch.pl
@@ -66,6 +66,11 @@ my $result =
   $node_standby_2->safe_psql('postgres', "SELECT count(*) FROM tab_int");
 is($result, qq(2000), 'check content of standby 2');

+# Check the logs, WAL receiver should not have been stopped.  There is no need
+# to rely on a position in the logs: a new log file is used on node restart.
+ok( !$node_standby_2->log_contains(
+  "FATAL: .* terminating walreceiver process due to administrator command"),
+  'WAL receiver should not be stopped across timeline jumps');

 # Ensure that a standby is able to follow a primary on a newer timeline
 # when WAL archiving is enabled.
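
As a quick sanity check of the pattern used in the addition above, here is a minimal Python sketch (not part of the patch) that applies the same regular expression to the two log excerpts quoted earlier in this message, with the mail wrapping undone. The function name is hypothetical, and Python's re.search is only assumed to stand in for the match done by log_contains().

```python
import re

# Pattern copied verbatim from the proposed log_contains() check.
PATTERN = re.compile(
    r"FATAL: .* terminating walreceiver process due to administrator command"
)

# Log excerpt from 14.18, as quoted in this message.
LOG_14_18 = "\n".join([
    "2025-10-29 11:09:36.964 JST walreceiver[711] LOG:  replication terminated by primary server",
    "2025-10-29 11:09:36.964 JST walreceiver[711] DETAIL:  End of WAL reached on timeline 1 at 0/03024D18.",
    "2025-10-29 11:09:36.964 JST walreceiver[711] FATAL:  terminating walreceiver process due to administrator command",
    "2025-10-29 11:09:36.964 JST startup[710] LOG:  new target timeline is 2",
])

# Log excerpt from 14.17, as quoted in this message.
LOG_14_17 = "\n".join([
    "2025-10-29 11:14:32.745 JST walreceiver[5857] LOG:  replication terminated by primary server",
    "2025-10-29 11:14:32.745 JST walreceiver[5857] DETAIL:  End of WAL reached on timeline 1 at 0/03024D18.",
    "2025-10-29 11:14:32.745 JST startup[5856] LOG:  new target timeline is 2",
])


def walreceiver_was_terminated(log_text: str) -> bool:
    """Return True if the log shows the walreceiver being terminated."""
    return PATTERN.search(log_text) is not None


if __name__ == "__main__":
    print("14.18:", walreceiver_was_terminated(LOG_14_18))  # True
    print("14.17:", walreceiver_was_terminated(LOG_14_17))  # False
```

The pattern matches the 14.18 excerpt even though the server writes two spaces after "FATAL:", because ".*" can match an empty string between them.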

Note: the issue is also reachable on REL_13_STABLE, through
69a498eb6465.  Are you confident that we would be able to get
something into this branch by the next minor release?  Based on the
state of the problem and the advanced analysis, it sounds to me that
we should be able to conclude by the middle of next week, or something
close to that.

> postgr.es/m/YyACvP++zgDphlcm@paquier.xyz discusses a
> "standby.signal+primary_conninfo" case.  How will this patch interact with
> that case?

This does not sound like an issue to me?  This mentions a case where
we don't have a restore_command and where changes are pushed to the
local pg_wal/ by an external source.  If a WAL receiver is waiting,
the patch means that ResetInstallXLogFileSegmentActive() would be
called and the InstallXLogFileSegmentActive flag reset.  I don't see
why that's not OK.  Or are you foreseeing something I don't?
--
Michael
