Re: [PATCH] Fix fragile walreceiver test. - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: [PATCH] Fix fragile walreceiver test.
Msg-id aQrzs1VFGz6cF2bN@paquier.xyz
In response to [PATCH] Fix fragile walreceiver test.  (Bryan Green <dbryan.green@gmail.com>)
Responses Re: [PATCH] Fix fragile walreceiver test.
List pgsql-hackers
On Wed, Nov 05, 2025 at 12:03:29AM -0600, Bryan Green wrote:
> Problem: restart() kills the walreceiver (as it should), which writes
> that exact FATAL message to the log. The test then searches the log and
> finds it.

A timing issue, then.  The buildfarm has not been complaining about this
one AFAIK; no recoveryCheck failures have been reported:
https://buildfarm.postgresql.org/cgi-bin/show_failures.pl

> The test has a comment claiming "a new log file is used on node
> restart". TAP tests use pg_ctl with a fixed filename that gets reused
> across restarts. No log rotation.

I've fat-fingered this assumption, indeed, missing that one would need
to do an extra rotate_logfile() before the restart.
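
To make the log-reuse problem concrete, here is a minimal toy model in Python.  It is only a sketch, not the TAP framework: ToyNode, its methods, and the marker text are hypothetical stand-ins, and it assumes that a server being shut down keeps writing to the file it opened at startup while the new server opens whatever file name is current.

```python
import os
import tempfile

# Placeholder text: the mail does not quote the real FATAL message.
STALE_MARKER = "FATAL: walreceiver terminated (placeholder)"


class ToyNode:
    """Toy stand-in for a test node; NOT the real TAP test framework."""

    def __init__(self, directory, name):
        self.directory = directory
        self.name = name
        self.generation = 0
        # File the "running server" currently writes to.
        self.active_log = self.logfile

    @property
    def logfile(self):
        # File name that the next start, and any log search, will use.
        return os.path.join(
            self.directory, f"{self.name}_{self.generation}.log")

    def rotate_logfile(self):
        # Switch the file name used from the next start onwards.
        self.generation += 1

    def _write(self, line):
        with open(self.active_log, "a") as f:
            f.write(line + "\n")

    def restart(self):
        # Shutdown: the old server logs to the file it already has open.
        self._write(STALE_MARKER)
        # Startup: the new server opens the current file name.
        self.active_log = self.logfile
        self._write("LOG: ready")

    def log_contains(self, text):
        if not os.path.exists(self.logfile):
            return False
        with open(self.logfile) as f:
            return text in f.read()


def demo():
    """Return (stale hit without rotation, stale hit with rotation)."""
    with tempfile.TemporaryDirectory() as d:
        reused = ToyNode(d, "reused")
        reused.restart()

        rotated = ToyNode(d, "rotated")
        rotated.rotate_logfile()
        rotated.restart()

        return (reused.log_contains(STALE_MARKER),
                rotated.log_contains(STALE_MARKER))
```

In this model demo() returns (True, False): without rotation the search hits the message written by the restart itself, while rotating first leaves that message behind in the old file.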

> The fix is obvious: check that the walreceiver PID stays constant.
> That's what we actually care about anyway.

Hmm.  The reason why I didn't use a PID matching check (mentioned at
[1]) is that this is not entirely bullet-proof.  On a very slow
machine, one could assume that standby_1 generates some records and
that these are replayed by standby_2 *before* the PID of the WAL
receiver is retrieved.  This could lead to false positives in some
cases, and a bunch of buildfarm members are very slow.  You have a
point that these are unlikely to happen in normal runs, so a PID
matching check would be relevant most of the time anyway, even if the
original PID has been fetched after the TLI jump has been processed in
standby_2.  I'd rather keep the log check, TBH, bypassing it with an
extra rotate_logfile() before the restart of standby_2.
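
The race with the PID check can be sketched as a toy timeline in Python.  This is an illustration only: the step numbers, the PIDs, and both function names are hypothetical, and it simply assumes that a WAL receiver restart shows up as a PID change from some step onwards.

```python
def walreceiver_pid_at(step, restart_step, old_pid=1000, new_pid=2000):
    """PID visible at `step` if the process restarts at `restart_step`.

    The PIDs are made-up values used only to tell the two processes apart.
    """
    return old_pid if step < restart_step else new_pid


def pid_check_passes(sample_step, restart_step, final_step=10):
    """Compare the PID sampled early with the PID seen at the end."""
    first = walreceiver_pid_at(sample_step, restart_step)
    last = walreceiver_pid_at(final_step, restart_step)
    return first == last


# Sampling before the restart: the PIDs differ, so the restart is caught.
caught = not pid_check_passes(sample_step=1, restart_step=5)

# Sampling after the restart (slow machine): the PIDs match, so the
# check passes even though a restart did happen.
missed = pid_check_passes(sample_step=7, restart_step=5)
```

Both caught and missed end up True here: the comparison only covers restarts that happen after the first PID has been fetched.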

> This matters because changes to I/O behavior elsewhere in the code can
> make this test fail spuriously. I hit it while working on O_CLOEXEC
> handling for Windows.

Fun.  So the WAL receiver never stops after the restart of standby_2,
and the log entry the test finds is the one generated before the
restart, right?

[1]: https://www.postgresql.org/message-id/aQGfoKGgmAbPATp5@paquier.xyz
--
Michael
