Thread: pgsql: Improve runtime and output of tests for replication slots checkp

pgsql: Improve runtime and output of tests for replication slots checkp

From
Alexander Korotkov
Date:
Improve runtime and output of tests for replication slots checkpointing.

The TAP tests that verify logical and physical replication slot behavior
during checkpoints (046_checkpoint_logical_slot.pl and
047_checkpoint_physical_slot.pl) inserted two batches of 2 million rows each,
generating approximately 520 MB of WAL.  On slow machines, or when compiled
with '-DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE', this caused the
tests to run for 8-9 minutes and occasionally time out, as seen on the
buildfarm animal prion.

This commit modifies the mentioned tests to use the $node->advance_wal()
function, thereby reducing runtime.  Since the generated data is never read,
advancing the WAL directly is a good alternative and cuts the total
wall-clock run time.
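
For illustration, a minimal sketch of that substitution (the node name, the
segment count, and the commented-out insert are placeholders, not the exact
committed values):

use strict;
use warnings;
use PostgreSQL::Test::Cluster;
use Test::More;

my $node = PostgreSQL::Test::Cluster->new('wal_demo');
$node->init;
$node->start;

# Previously the tests generated ~520 MB of WAL by inserting two batches of
# 2 million rows; since that data is never read back, advancing the WAL by a
# fixed number of segments is enough.
$node->advance_wal(10);

$node->stop;
done_testing();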

While here, remove superfluous '\n' characters from several note() calls;
these appeared literally in the build-farm logs and looked odd.  Also, remove
the unnecessary 'shared_preload_libraries' GUC from the config and add a check
for 'injection_points' extension availability.
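
A minimal sketch of that availability check, assuming the existing
$node->check_extension() helper (the node name and skip message are
illustrative):

use strict;
use warnings;
use PostgreSQL::Test::Cluster;
use Test::More;

my $node = PostgreSQL::Test::Cluster->new('ext_check_demo');
$node->init;
$node->start;

# Skip the whole test when the injection_points extension is not installed,
# e.g. when running against a build that does not ship it.
if (!$node->check_extension('injection_points'))
{
	plan skip_all => 'Extension injection_points not installed';
}

ok(1, 'injection_points extension is available');
$node->stop;
done_testing();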

Reported-by: Alexander Lakhin <exclusion@gmail.com>
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Vitaly Davydov <v.davydov@postgrespro.ru>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Discussion: https://postgr.es/m/fbc5d94e-6fbd-4a64-85d4-c9e284a58eb2%40gmail.com
Backpatch-through: 17

Branch
------
REL_17_STABLE

Details
-------
https://git.postgresql.org/pg/commitdiff/5ed50f9386f04d9b8a88af8902b06cb747db7b54

Modified Files
--------------
src/test/recovery/t/046_checkpoint_logical_slot.pl | 34 +++++++++++-----------
.../recovery/t/047_checkpoint_physical_slot.pl     | 26 +++++++++--------
2 files changed, 31 insertions(+), 29 deletions(-)



Re: pgsql: Improve runtime and output of tests for replication slots checkp

From
Melanie Plageman
Date:
On Thu, Jun 19, 2025 at 7:31 PM Alexander Korotkov <akorotkov@postgresql.org> wrote:
> Improve runtime and output of tests for replication slots checkpointing.
>
> The TAP tests that verify logical and physical replication slot behavior
> during checkpoints (046_checkpoint_logical_slot.pl and
> 047_checkpoint_physical_slot.pl) inserted two batches of 2 million rows each,
> generating approximately 520 MB of WAL.  On slow machines, or when compiled
> with '-DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE', this caused the
> tests to run for 8-9 minutes and occasionally time out, as seen on the
> buildfarm animal prion.

Quite a few animals have started failing since this commit (for example [1]). I haven't looked into why, but I suspect something is wrong.
stderr:
#   Failed test 'Logical slot still valid'
#   at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/046_checkpoint_logical_slot.pl line 134.
#          got: 'death by signal at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm line 181.
# '
#     expected: ''

I see this failing on my fork's CI, so it seems like it could have been caught earlier?

- Melanie
 

Re: pgsql: Improve runtime and output of tests for replication slots checkp

From
Alexander Korotkov
Date:
On Fri, Jun 20, 2025, 19:10 Melanie Plageman <melanieplageman@gmail.com> wrote:

> On Thu, Jun 19, 2025 at 7:31 PM Alexander Korotkov <akorotkov@postgresql.org> wrote:
> > Improve runtime and output of tests for replication slots checkpointing.
> >
> > The TAP tests that verify logical and physical replication slot behavior
> > during checkpoints (046_checkpoint_logical_slot.pl and
> > 047_checkpoint_physical_slot.pl) inserted two batches of 2 million rows each,
> > generating approximately 520 MB of WAL.  On slow machines, or when compiled
> > with '-DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE', this caused the
> > tests to run for 8-9 minutes and occasionally time out, as seen on the
> > buildfarm animal prion.
>
> Quite a few animals have started failing since this commit (for example [1]). I haven't looked into why, but I suspect something is wrong.
> stderr:
> #   Failed test 'Logical slot still valid'
> #   at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/046_checkpoint_logical_slot.pl line 134.
> #          got: 'death by signal at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm line 181.
> # '
> #     expected: ''
>
> I see this failing on my fork's CI, so it seems like it could have been caught earlier?

Melanie, thank you for pointing this out.  I'll take a look at this later today.

------
Regards,
Alexander Korotkov


Re: pgsql: Improve runtime and output of tests for replication slots checkp

From
Tom Lane
Date:
Melanie Plageman <melanieplageman@gmail.com> writes:
> Quite a few animals have started failing since this commit (for example
> [1]) . I haven't looked into why, but I suspect something is wrong.

It looks to me like it's being triggered by this questionable bit in
046_checkpoint_logical_slot.pl:

# Continue the checkpoint.
$node->safe_psql('postgres',
    q{select injection_points_wakeup('checkpoint-before-old-wal-removal')});

# Abruptly stop the server (1 second should be enough for the checkpoint
# to finish; it would be better).
$node->stop('immediate');

That second comment is pretty unintelligible, but I think it's
expecting that we'd give the checkpoint 1 second to complete,
which the code is *not* doing.  On my own machine it looks like the
checkpoint does manage to complete within about 1ms, just barely
before the shutdown arrives:

2025-06-20 17:52:25.599 EDT [2538690] 046_checkpoint_logical_slot.pl LOG:  statement: select pg_replication_slot_advance('slot_physical',pg_current_wal_lsn())
2025-06-20 17:52:25.602 EDT [2538692] 046_checkpoint_logical_slot.pl LOG:  statement: select injection_points_wakeup('checkpoint-before-old-wal-removal')
2025-06-20 17:52:25.603 EDT [2538557] LOG:  checkpoint complete: wrote 1 buffers (0.0%), wrote 0 SLRU buffers; 0 WAL file(s) added, 0 removed, 0 recycled; write=0.003 s, sync=0.001 s, total=1.074 s; sync files=0, longest=0.000 s, average=0.000 s; distance=327688 kB, estimate=327688 kB; lsn=0/290020C0, redo lsn=0/29002068
2025-06-20 17:52:25.604 EDT [2538553] LOG:  received immediate shutdown request

But in the buildfarm failures I don't see any 'checkpoint complete'
before the shutdown.

If this is an accurate diagnosis then it indicates both a test bug
(it should delay here, or else the comment needs to be fixed to explain
what we're actually testing) and a backend bug, because an immediate
stop a/k/a crash before completing the checkpoint should not lead to
failure to function after the next restart.
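
For illustration, a minimal sketch of making that wait explicit, extending the
snippet quoted above (untested; the use of Cluster.pm's wait_for_log() and the
log pattern are assumptions, not the eventual fix):

# Remember where the server log currently ends.
my $log_offset = -s $node->logfile;

# Continue the checkpoint.
$node->safe_psql('postgres',
	q{select injection_points_wakeup('checkpoint-before-old-wal-removal')});

# Wait for the checkpointer to report completion instead of racing it.
$node->wait_for_log(qr/checkpoint complete/, $log_offset);

# Only then stop the server abruptly.
$node->stop('immediate');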

            regards, tom lane



Re: pgsql: Improve runtime and output of tests for replication slots checkp

From
Alexander Korotkov
Date:
On Sat, Jun 21, 2025 at 1:25 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
> > But in the buildfarm failures I don't see any 'checkpoint complete'
> > before the shutdown.
>
> Ooops, I lied: we have at least one case where the checkpoint does
> finish but then it hangs up anyway:
>
> https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=melonworm&dt=2025-06-20%2019%3A59%3A31&stg=recovery-check
>
> 2025-06-20 20:05:32.548 UTC [3171474][client backend][61/2:0] LOG:  statement: select injection_points_wakeup('checkpoint-before-old-wal-removal')
> 2025-06-20 20:05:32.550 UTC [3162113][checkpointer][:0] LOG:  checkpoint complete: wrote 1 buffers (0.0%), wrote 0 SLRU buffers; 0 WAL file(s) added, 0 removed, 0 recycled; write=0.045 s, sync=0.001 s, total=1.507 s; sync files=1, longest=0.001 s, average=0.001 s; distance=327688 kB, estimate=327688 kB; lsn=0/290020C0, redo lsn=0/29002068
> 2025-06-20 20:05:32.551 UTC [3171474][client backend][:0] LOG:  disconnection: session time: 0:00:00.073 user=bf database=postgres host=[local]
> 2025-06-20 20:05:32.552 UTC [3170563][client backend][:0] LOG:  disconnection: session time: 0:00:01.538 user=bf database=postgres host=[local]
> 2025-06-20 20:05:32.564 UTC [3163446][client backend][7/146:0] LOG:  statement: SELECT 1
> 2025-06-20 20:05:32.564 UTC [3161873][postmaster][:0] LOG:  received immediate shutdown request
> 2025-06-20 20:05:32.604 UTC [3161873][postmaster][:0] LOG:  database system is shut down
>
> I'm still confused about whether the 046 script intends to
> sometimes test cases where the checkpoint doesn't have time to
> complete.  But it seems that whatever is going on here
> is a bit subtle and platform-dependent.

I think this indicates an unfinished intention to wait for checkpoint
completion.  But I think both cases (checkpoint finished and
unfinished) should work correctly.  So, I believe there is a backend
problem.  I'm trying to reproduce this locally.  Sorry for the
confusion.

------
Regards,
Alexander Korotkov
Supabase



Re: pgsql: Improve runtime and output of tests for replication slots checkp

From
Alexander Korotkov
Date:
On Sat, Jun 21, 2025 at 1:40 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Sat, Jun 21, 2025 at 1:25 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I wrote:
> > > But in the buildfarm failures I don't see any 'checkpoint complete'
> > > before the shutdown.
> >
> > Ooops, I lied: we have at least one case where the checkpoint does
> > finish but then it hangs up anyway:
> >
> > https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=melonworm&dt=2025-06-20%2019%3A59%3A31&stg=recovery-check
> >
> > 2025-06-20 20:05:32.548 UTC [3171474][client backend][61/2:0] LOG:  statement: select injection_points_wakeup('checkpoint-before-old-wal-removal')
> > 2025-06-20 20:05:32.550 UTC [3162113][checkpointer][:0] LOG:  checkpoint complete: wrote 1 buffers (0.0%), wrote 0 SLRU buffers; 0 WAL file(s) added, 0 removed, 0 recycled; write=0.045 s, sync=0.001 s, total=1.507 s; sync files=1, longest=0.001 s, average=0.001 s; distance=327688 kB, estimate=327688 kB; lsn=0/290020C0, redo lsn=0/29002068
> > 2025-06-20 20:05:32.551 UTC [3171474][client backend][:0] LOG:  disconnection: session time: 0:00:00.073 user=bf database=postgres host=[local]
> > 2025-06-20 20:05:32.552 UTC [3170563][client backend][:0] LOG:  disconnection: session time: 0:00:01.538 user=bf database=postgres host=[local]
> > 2025-06-20 20:05:32.564 UTC [3163446][client backend][7/146:0] LOG:  statement: SELECT 1
> > 2025-06-20 20:05:32.564 UTC [3161873][postmaster][:0] LOG:  received immediate shutdown request
> > 2025-06-20 20:05:32.604 UTC [3161873][postmaster][:0] LOG:  database system is shut down
> >
> > I'm still confused about whether the 046 script intends to
> > sometimes test cases where the checkpoint doesn't have time to
> > complete.  But it seems that whatever is going on here
> > is a bit subtle and platform-dependent.
>
> I think this indicates unfinished intention to wait for checkpoint
> completion.  But I think both cases (checkpoint finished and
> unfinished) should work correctly.  So, I believe there is a backend
> problem.  I'm trying to reproduce this locally.  Sorry for the
> confusion.

I managed to reproduce the bug locally.  I've got the following backtrace.

* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x000000019e6fd3c8 libsystem_kernel.dylib`__semwait_signal + 8
    frame #1: 0x000000019e5de568 libsystem_c.dylib`nanosleep + 220
    frame #2: 0x0000000103009628 postgres`pg_usleep(microsec=1000) at pgsleep.c:50:10
    frame #3: 0x00000001027bd248 postgres`read_local_xlog_page_guts(state=0x000000011383ed58, targetPagePtr=687882240, reqLen=277, targetRecPtr=687874360, cur_page="\U00000018\xd1\U00000005", wait_for_wal=true) at xlogutils.c:950:4
    frame #4: 0x00000001027bd130 postgres`read_local_xlog_page(state=0x000000011383ed58, targetPagePtr=687882240, reqLen=277, targetRecPtr=687874360, cur_page="\U00000018\xd1\U00000005") at xlogutils.c:848:9
    frame #5: 0x00000001027ae184 postgres`ReadPageInternal(state=0x000000011383ed58, pageptr=687882240, reqLen=277) at xlogreader.c:1061:12
    frame #6: 0x00000001027acd14 postgres`XLogDecodeNextRecord(state=0x000000011383ed58, nonblocking=false) at xlogreader.c:727:14
    frame #7: 0x00000001027ac6bc postgres`XLogReadAhead(state=0x000000011383ed58, nonblocking=false) at xlogreader.c:974:11
    frame #8: 0x00000001027ac5c0 postgres`XLogReadRecord(state=0x000000011383ed58, errormsg=0x000000016d798dd0) at xlogreader.c:405:3
    frame #9: 0x0000000102c1d1d0 postgres`pg_logical_slot_get_changes_guts(fcinfo=0x0000000113830700, confirm=true, binary=false) at logicalfuncs.c:257:13
    frame #10: 0x0000000102c1ca00 postgres`pg_logical_slot_get_changes(fcinfo=0x0000000113830700) at logicalfuncs.c:333:9
    frame #11: 0x00000001029e1adc postgres`ExecMakeTableFunctionResult(setexpr=0x0000000113825440, econtext=0x00000001138252e0, argContext=0x0000000113830600, expectedDesc=0x000000011382ecb0, randomAccess=false) at execSRF.c:234:13

While looking at frame 5, I found the following code being executed.

targetPagePtr = RecPtr - (RecPtr % XLOG_BLCKSZ);
targetRecOff = RecPtr % XLOG_BLCKSZ;

/*
 * Read the page containing the record into state->readBuf. Request enough
 * byte to cover the whole record header, or at least the part of it that
 * fits on the same page.
 */
readOff = ReadPageInternal(state, targetPagePtr,
                           Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ));

And I see the following variable values.

(lldb) p/x targetPagePtr
(XLogRecPtr) 0x0000000029004000
(lldb) p/x RecPtr
(XLogRecPtr) 0x0000000029002138
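
For reference, a quick sketch of the arithmetic in question (assuming the
default XLOG_BLCKSZ of 8192 and the RecPtr printed above): rounding RecPtr
down to a block boundary should give 0x29002000, one block before the
0x29004000 seen in the debugger.

use strict;
use warnings;

my $XLOG_BLCKSZ   = 8192;          # default WAL block size (0x2000)
my $RecPtr        = 0x29002138;    # record pointer from the lldb output
my $targetPagePtr = $RecPtr - ($RecPtr % $XLOG_BLCKSZ);

printf "computed targetPagePtr = 0x%08X\n", $targetPagePtr;   # 0x29002000
printf "observed targetPagePtr = 0x%08X\n", 0x29004000;       # from lldb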

I hardly understand how this is possible, given it was compiled with "-O0".  I'm planning to continue investigating this tomorrow.

------
Regards,
Alexander Korotkov
Supabase