On Wed, Nov 05, 2025 at 12:03:29AM -0600, Bryan Green wrote:
> Problem: restart() kills the walreceiver (as it should), which writes
> that exact FATAL message to the log. The test then searches the log and
> finds it.
Timing issue, then. The buildfarm has not been complaining about this
one AFAIK, and no recoveryCheck failures have been reported:
https://buildfarm.postgresql.org/cgi-bin/show_failures.pl
> The test has a comment claiming "a new log file is used on node
> restart". TAP tests use pg_ctl with a fixed filename that gets reused
> across restarts. No log rotation.
I've fat-fingered this assumption, indeed, missing that one would need
to do an extra rotate_logfile() before the restart.
> The fix is obvious: check that the walreceiver PID stays constant.
> That's what we actually care about anyway.
Hmm. The reason why I didn't use a PID matching check (mentioned at
[1]) is that this is not entirely bullet-proof. On a very slow
machine, one could imagine that standby_1 generates some records and
that these are replayed by standby_2 *before* the PID of the WAL
receiver is retrieved. This could lead to false positives in some
cases, and a bunch of buildfarm members are very slow. You have a
point that these would be unlikely to happen in normal runs, so a PID
matching check would be relevant most of the time anyway, even if the
original PID has been fetched after the TLI jump has been processed in
standby_2. I'd rather keep the log check, TBH, and work around the
stale entry with an extra rotate_logfile() before the restart of
standby_2.
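To illustrate why the log check misfires on a reused file and why
scoping it to post-restart output helps, here is a toy model in Python.
It is only a sketch, not the TAP test and not the proposed patch. The
log lines, the FATAL text used in the pattern, and the helper names
log_contains() and simulate() are assumptions made for the example, and
a byte offset stands in for the effect of switching to a fresh log file.

```python
import re
import tempfile
from pathlib import Path

# Assumed wording of the message the test looks for (illustrative only).
FATAL = re.compile(
    r"FATAL:\s+terminating walreceiver process due to administrator command"
)


def log_contains(logfile: Path, pattern: re.Pattern, offset: int = 0) -> bool:
    """Return True if pattern appears in logfile at or after byte offset."""
    with logfile.open("rb") as f:
        f.seek(offset)
        return pattern.search(f.read().decode("utf-8", "replace")) is not None


def simulate() -> tuple[bool, bool]:
    """Model one restart writing into a log file that is reused."""
    with tempfile.TemporaryDirectory() as tmp:
        log = Path(tmp) / "standby_2.log"
        # Output from before the restart.
        log.write_text("LOG:  started streaming WAL from primary\n")
        # The restart stops the walreceiver, which logs a FATAL in the same file.
        with log.open("a") as f:
            f.write(
                "FATAL:  terminating walreceiver process due to "
                "administrator command\n"
            )
        # Stand-in for a fresh log file: post-restart output begins here.
        offset = log.stat().st_size
        with log.open("a") as f:
            f.write("LOG:  started streaming WAL from primary\n")
        naive = log_contains(log, FATAL)           # scans the whole file
        scoped = log_contains(log, FATAL, offset)  # scans post-restart output only
        return naive, scoped
```

In this model the whole-file search matches the entry written by the
restart itself, while the scoped search does not, which is the behavior
the log check is meant to have.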
> This matters because changes to I/O behavior elsewhere in the code can
> make this test fail spuriously. I hit it while working on O_CLOEXEC
> handling for Windows.
Fun. So the WAL receiver never stops after the restart of standby_2,
and the log entry found is one generated before the restart, right?
[1]: https://www.postgresql.org/message-id/aQGfoKGgmAbPATp5@paquier.xyz
--
Michael