Re: [PATCH] Fix fragile walreceiver test. - Mailing list pgsql-hackers

From Xuneng Zhou
Subject Re: [PATCH] Fix fragile walreceiver test.
Date
Msg-id CABPTF7WOqRKu6Agb6rFgZZ=X2mtaVbM7G=GMYVBx=F2L2TPQsw@mail.gmail.com
Whole thread Raw
In response to Re: [PATCH] Fix fragile walreceiver test.  (Michael Paquier <michael@paquier.xyz>)
Responses Re: [PATCH] Fix fragile walreceiver test.
List pgsql-hackers
Hi,

On Wed, Nov 5, 2025 at 2:50 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, Nov 05, 2025 at 12:03:29AM -0600, Bryan Green wrote:
> > Problem: restart() kills the walreceiver (as it should), which writes
> > that exact FATAL message to the log. The test then searches the log and
> > finds it.
>
> Timing issue then, the buildfarm has not been complaining on this one
> AFAIK, there have been no recoveryCheck failures reported:
> https://buildfarm.postgresql.org/cgi-bin/show_failures.pl
>
> > The test has a comment claiming "a new log file is used on node
> > restart". TAP tests use pg_ctl with a fixed filename that gets reused
> > across restarts. No log rotation.
>
> I've fat-fingered this assumption, indeed, missing that one would need
> to do an extra rotate_logfile() before the restart.
>
> > The fix is obvious: check that the walreceiver PID stays constant.
> > That's what we actually care about anyway.

Thanks for catching and analyzing this!

> Hmm.  The reason why I didn't use a PID matching check (mentioned at
> [1]) is that this is not entirely bullet-proof.  On a very slow
> machine, one could assume that standby_1 generates some records and
> that these are replayed by standby_2 *before* the PID of the WAL
> receiver is retrieved.  This could lead to false positives in some
> cases, and a bunch of buildfarm members are very slow.  You have a
> point that these would unlikely happen in normal runs, so a PID
> matching check would be relevant most of the time anyway, even if the
> original PID has been fetched after the TLI jump has been processed in
> standby_2.  I'd rather keep the log check, TBH, bypassing it with an
> extra rotate_logfile() before the restart of standby_2.

I’ve also prepared a patch for this method.

> > This matters because changes to I/O behavior elsewhere in the code can
> > make this test fail spuriously. I hit it while working on O_CLOEXEC
> > handling for Windows.
>
> Fun.  And the WAL receiver never stops after the restart of standby_2
> with the log entry present in the server logs generated before the
> restart, right?
>

Best,
Xuneng

Attachment

pgsql-hackers by date:

Previous
From: Antonin Houska
Date:
Subject: Re: Adding REPACK [concurrently]
Next
From: Nishant Sharma
Date:
Subject: Re: [PATCH] Add pg_get_tablespace_ddl() function to reconstruct CREATE TABLESPACE statement