Re: pgsql: Improve runtime and output of tests for replication slots checkp - Mailing list pgsql-committers

From Tom Lane
Subject Re: pgsql: Improve runtime and output of tests for replication slots checkp
Date
Msg-id 2542023.1750456985@sss.pgh.pa.us
In response to Re: pgsql: Improve runtime and output of tests for replication slots checkp  (Melanie Plageman <melanieplageman@gmail.com>)
List pgsql-committers
Melanie Plageman <melanieplageman@gmail.com> writes:
> Quite a few animals have started failing since this commit (for example
> [1]). I haven't looked into why, but I suspect something is wrong.

It looks to me like it's being triggered by this questionable bit in
046_checkpoint_logical_slot.pl:

# Continue the checkpoint.
$node->safe_psql('postgres',
    q{select injection_points_wakeup('checkpoint-before-old-wal-removal')});

# Abruptly stop the server (1 second should be enough for the checkpoint
# to finish; it would be better).
$node->stop('immediate');

That second comment is pretty unintelligible, but I think it's
expecting that we'd give the checkpoint 1 second to complete,
which the code is *not* doing.  On my own machine it looks like the
checkpoint does manage to complete within about 1ms, just barely
before the shutdown arrives:

2025-06-20 17:52:25.599 EDT [2538690] 046_checkpoint_logical_slot.pl LOG:  statement: select pg_replication_slot_advance('slot_physical',pg_current_wal_lsn())
2025-06-20 17:52:25.602 EDT [2538692] 046_checkpoint_logical_slot.pl LOG:  statement: select injection_points_wakeup('checkpoint-before-old-wal-removal')
2025-06-20 17:52:25.603 EDT [2538557] LOG:  checkpoint complete: wrote 1 buffers (0.0%), wrote 0 SLRU buffers; 0 WAL file(s) added, 0 removed, 0 recycled; write=0.003 s, sync=0.001 s, total=1.074 s; sync files=0, longest=0.000 s, average=0.000 s; distance=327688 kB, estimate=327688 kB; lsn=0/290020C0, redo lsn=0/29002068
2025-06-20 17:52:25.604 EDT [2538553] LOG:  received immediate shutdown request

But in the buildfarm failures I don't see any 'checkpoint complete'
before the shutdown.
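
The check described above (did 'checkpoint complete' get logged between the wakeup call and the shutdown request?) can be applied mechanically to a server log. The following is only a rough sketch: the helper name checkpoint_finished_before_shutdown is hypothetical, not part of the PostgreSQL test framework, and it assumes each log entry arrives as one unwrapped line.

```perl
use strict;
use warnings;

# Hypothetical helper: scan server log lines in order.  Once the wakeup of
# the injection point has been seen, report whether "checkpoint complete"
# was logged before the immediate shutdown request.
# Returns 1 (completed), 0 (shutdown arrived first), or undef if no
# shutdown request follows the wakeup.
sub checkpoint_finished_before_shutdown
{
    my @lines = @_;
    my ($woken, $complete) = (0, 0);

    foreach my $line (@lines)
    {
        if (index($line,
                "injection_points_wakeup('checkpoint-before-old-wal-removal')") >= 0)
        {
            $woken = 1;
        }
        elsif ($woken && $line =~ /LOG:\s+checkpoint complete:/)
        {
            $complete = 1;
        }
        elsif ($woken && $line =~ /LOG:\s+received immediate shutdown request/)
        {
            return $complete;
        }
    }
    return undef;
}

# Abbreviated lines modeled on the excerpt above.
my @local_run = (
    "LOG:  statement: select injection_points_wakeup('checkpoint-before-old-wal-removal')",
    "LOG:  checkpoint complete: wrote 1 buffers (0.0%)",
    "LOG:  received immediate shutdown request",
);
print checkpoint_finished_before_shutdown(@local_run)
  ? "completed\n"
  : "interrupted\n";    # prints "completed"
```

A log in which the shutdown request follows the wakeup directly, as in the buildfarm failures, would make the helper return 0.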

If this is an accurate diagnosis then it indicates both a test bug
(it should delay here, or else the comment needs to be fixed to explain
what we're actually testing) and a backend bug, because an immediate
stop a/k/a crash before completing the checkpoint should not lead to
failure to function after the next restart.
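
If the intent is for the checkpoint to finish first, one way the test-side delay might look is sketched below. This is a fragment meant to replace the two steps quoted earlier inside the test file, not a standalone script: $node is the test's existing PostgreSQL::Test::Cluster object, $log_offset is a name introduced here for illustration, and the sketch assumes log_checkpoints is enabled so that the completion message is actually written. It would only address the test bug; the backend problem would remain.

```perl
# Sketch only: $node is the PostgreSQL::Test::Cluster object already
# set up by the test.

# Remember the current end of the server log, so that only messages
# written after this point are considered.
my $log_offset = -s $node->logfile;

# Continue the checkpoint.
$node->safe_psql('postgres',
    q{select injection_points_wakeup('checkpoint-before-old-wal-removal')});

# Wait for the checkpointer to report completion instead of relying
# on timing (assumes log_checkpoints = on).
$node->wait_for_log(qr/checkpoint complete/, $log_offset);

# Abruptly stop the server, now that the checkpoint has finished.
$node->stop('immediate');
```

Waiting on the log message rather than sleeping for a fixed interval keeps the test from depending on machine speed.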

            regards, tom lane


