Home > mailing lists

Re: [PATCH] Fix fragile walreceiver test. - Mailing list pgsql-hackers

From	Xuneng Zhou
Subject	Re: [PATCH] Fix fragile walreceiver test.
Date	November 5 10:30:30
Msg-id	CABPTF7WOqRKu6Agb6rFgZZ=X2mtaVbM7G=GMYVBx=F2L2TPQsw@mail.gmail.com Whole thread Raw
In response to	Re: [PATCH] Fix fragile walreceiver test. (Michael Paquier <michael@paquier.xyz>)
Responses	Re: [PATCH] Fix fragile walreceiver test.
List	pgsql-hackers

Tree view

Hi,

On Wed, Nov 5, 2025 at 2:50 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, Nov 05, 2025 at 12:03:29AM -0600, Bryan Green wrote:
> > Problem: restart() kills the walreceiver (as it should), which writes
> > that exact FATAL message to the log. The test then searches the log and
> > finds it.
>
> Timing issue then, the buildfarm has not been complaining on this one
> AFAIK, there have been no recoveryCheck failures reported:
> https://buildfarm.postgresql.org/cgi-bin/show_failures.pl
>
> > The test has a comment claiming "a new log file is used on node
> > restart". TAP tests use pg_ctl with a fixed filename that gets reused
> > across restarts. No log rotation.
>
> I've fat-fingered this assumption, indeed, missing that one would need
> to do an extra rotate_logfile() before the restart.
>
> > The fix is obvious: check that the walreceiver PID stays constant.
> > That's what we actually care about anyway.

Thanks for catching and analyzing this!

> Hmm.  The reason why I didn't use a PID matching check (mentioned at
> [1]) is that this is not entirely bullet-proof.  On a very slow
> machine, one could assume that standby_1 generates some records and
> that these are replayed by standby_2 *before* the PID of the WAL
> receiver is retrieved.  This could lead to false positives in some
> cases, and a bunch of buildfarm members are very slow.  You have a
> point that these would unlikely happen in normal runs, so a PID
> matching check would be relevant most of the time anyway, even if the
> original PID has been fetched after the TLI jump has been processed in
> standby_2.  I'd rather keep the log check, TBH, bypassing it with an
> extra rotate_logfile() before the restart of standby_2.

I’ve also prepared a patch for this method.

> > This matters because changes to I/O behavior elsewhere in the code can
> > make this test fail spuriously. I hit it while working on O_CLOEXEC
> > handling for Windows.
>
> Fun.  And the WAL receiver never stops after the restart of standby_2
> with the log entry present in the server logs generated before the
> restart, right?
>

Best,
Xuneng

Attachment

v2-0001-Fix-timing-issue-in-recovery-004_timeline_switch-.patch

pgsql-hackers by date:

From: Antonin Houska
Date: 05 November, 10:12:19
Subject: Re: Adding REPACK [concurrently]

From: Nishant Sharma
Date: 05 November, 10:32:07
Subject: Re: [PATCH] Add pg_get_tablespace_ddl() function to reconstruct CREATE TABLESPACE statement

Re: [PATCH] Fix fragile walreceiver test. - Mailing list pgsql-hackers

Attachment

Previous

Next