On Wed, Nov 05, 2025 at 03:30:30PM +0800, Xuneng Zhou wrote:
> On Wed, Nov 5, 2025 at 2:50 PM Michael Paquier <michael@paquier.xyz> wrote:
>> Timing issue then, the buildfarm has not been complaining on this one
>> AFAIK, there have been no recoveryCheck failures reported:
>> https://buildfarm.postgresql.org/cgi-bin/show_failures.pl
drongo has just reported one failure, so I stand corrected:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2025-11-05%2003%3A50%3A50
And one log rotation should be enough before the restart.
>> Hmm. The reason why I didn't use a PID matching check (mentioned at
>> [1]) is that this is not entirely bullet-proof. On a very slow
>> machine, one could assume that standby_1 generates some records and
>> that these are replayed by standby_2 *before* the PID of the WAL
>> receiver is retrieved. This could lead to false positives in some
>> cases, and a bunch of buildfarm members are very slow. You have a
>> point that these would be unlikely to happen in normal runs, so a PID
>> matching check would be relevant most of the time anyway, even if the
>> original PID has been fetched after the TLI jump has been processed in
>> standby_2. I'd rather keep the log check, TBH, bypassing it with an
>> extra rotate_logfile() before the restart of standby_2.
>
> I’ve also prepared a patch for this method.
That's exactly what I did a couple of minutes ago, and I noticed your
message before applying the fix, so I've listed you as a co-author on
this one.
I have also kept the PID check after pondering a bit about it. A TLI
jump could be replayed before we grab the initial PID, but in most
cases the check should still do its job correctly.
--
Michael