Re: [PATCH] Fix fragile walreceiver test. - Mailing list pgsql-hackers

From Xuneng Zhou
Subject Re: [PATCH] Fix fragile walreceiver test.
Date
Msg-id CABPTF7WCWqQ2DrioSbUAShZk9Qm7Expf6NU6b9=97vQnNU7yGw@mail.gmail.com
In response to Re: [PATCH] Fix fragile walreceiver test.  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
Hi,

On Wed, Nov 5, 2025 at 3:56 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, Nov 05, 2025 at 03:30:30PM +0800, Xuneng Zhou wrote:
> > On Wed, Nov 5, 2025 at 2:50 PM Michael Paquier <michael@paquier.xyz> wrote:
> >> Timing issue then, the buildfarm has not been complaining on this one
> >> AFAIK, there have been no recoveryCheck failures reported:
> >> https://buildfarm.postgresql.org/cgi-bin/show_failures.pl
>
> drongo has just reported one failure, so I stand corrected:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2025-11-05%2003%3A50%3A50
>
> And one log rotation should be enough before the restart.
>
> >> Hmm.  The reason why I didn't use a PID matching check (mentioned at
> >> [1]) is that this is not entirely bullet-proof.  On a very slow
> >> machine, one could assume that standby_1 generates some records and
> >> that these are replayed by standby_2 *before* the PID of the WAL
> >> receiver is retrieved.  This could lead to false positives in some
> >> cases, and a bunch of buildfarm members are very slow.  You have a
> >> point that these would be unlikely to happen in normal runs, so a PID
> >> matching check would be relevant most of the time anyway, even if the
> >> original PID has been fetched after the TLI jump has been processed in
> >> standby_2.  I'd rather keep the log check, TBH, bypassing it with an
> >> extra rotate_logfile() before the restart of standby_2.
> >
> > I’ve also prepared a patch for this method.
>
> That's exactly what I have done a couple of minutes ago, and noticed
> your message before applying the fix so I've listed you as a
> co-author on this one.
>

Thanks.

> I have also kept the PID check after pondering a bit about it.  A TLI
> jump could be replayed before we grab the initial PID, but in most
> cases it should be able to do its work correctly.

Checking the PID seems straightforward and mostly makes sense to me.
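
For anyone reading along, here is a rough sketch of how the two checks
discussed above fit together. It is a toy model in Python, not the Perl TAP
test itself: FakeNode, RESTART_MARKER, and walreceiver_survived() are
hypothetical names, the log text is made up, and it assumes the fragility
came from an older log entry matching the log check.

```python
# Toy model of the two checks; not the actual TAP test code.
RESTART_MARKER = "walreceiver restarted"  # assumed log text, illustration only


class FakeNode:
    """Stand-in for a standby: a list of log files and a walreceiver PID."""

    def __init__(self, walreceiver_pid):
        self.logfiles = [[]]  # each log file is a list of lines
        self.walreceiver_pid = walreceiver_pid

    def log(self, line):
        self.logfiles[-1].append(line)

    def rotate_logfile(self):
        # Start a fresh log file so later checks ignore older entries.
        self.logfiles.append([])

    def current_log_contains(self, text):
        return any(text in line for line in self.logfiles[-1])


def walreceiver_survived(node, pid_before):
    """Both checks: no restart message in the current log, and same PID."""
    return (not node.current_log_contains(RESTART_MARKER)
            and node.walreceiver_pid == pid_before)


node = FakeNode(walreceiver_pid=4242)
node.log(RESTART_MARKER + " (from an earlier phase of the test)")
node.rotate_logfile()              # done before restarting the standby
pid_before = node.walreceiver_pid  # may be sampled late on a slow machine
node.log("timeline jump replayed")
print(walreceiver_survived(node, pid_before))
```

Without the rotation, the older entry would still be in the current log and
the log check would report a restart that never happened. The PID comparison
does not look at the log at all, but it only covers restarts that occur
after pid_before is sampled, which is the caveat raised upthread.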

Best,
Xuneng
