Thread: pgsql: Improve runtime and output of tests for replication slots checkp
pgsql: Improve runtime and output of tests for replication slots checkp
From
Alexander Korotkov
Date:
Improve runtime and output of tests for replication slots checkpointing.

The TAP tests that verify logical and physical replication slot behavior
during checkpoints (046_checkpoint_logical_slot.pl and
047_checkpoint_physical_slot.pl) inserted two batches of 2 million rows each,
generating approximately 520 MB of WAL. On slow machines, or when compiled
with '-DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE', this caused the
tests to run for 8-9 minutes and occasionally time out, as seen on the
buildfarm animal prion.

This commit modifies the mentioned tests to utilize the $node->advance_wal()
function, thereby reducing runtime. Since the tests never read the generated
data, the proposed function is a good alternative that cuts the total
wall-clock run time.

While here, remove superfluous '\n' characters from several note() calls;
these appeared literally in the buildfarm logs and looked odd. Also, remove
the excessive 'shared_preload_libraries' GUC from the config and add a check
for 'injection_points' extension availability.

Reported-by: Alexander Lakhin <exclusion@gmail.com>
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Vitaly Davydov <v.davydov@postgrespro.ru>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Discussion: https://postgr.es/m/fbc5d94e-6fbd-4a64-85d4-c9e284a58eb2%40gmail.com
Backpatch-through: 17

Branch
------
REL_17_STABLE

Details
-------
https://git.postgresql.org/pg/commitdiff/5ed50f9386f04d9b8a88af8902b06cb747db7b54

Modified Files
--------------
src/test/recovery/t/046_checkpoint_logical_slot.pl | 34 +++++++++++-----------
.../recovery/t/047_checkpoint_physical_slot.pl     | 26 +++++++++--------
2 files changed, 31 insertions(+), 29 deletions(-)
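For readers skimming the commit, the core of the change follows this pattern (a minimal sketch only; the segment count here is illustrative, not the value used in the actual tests):

    # Generate a comparable amount of WAL by switching through whole WAL
    # segments, instead of inserting millions of rows that the tests never
    # read back.
    $node->advance_wal(20);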
Re: pgsql: Improve runtime and output of tests for replication slots checkp
From
Melanie Plageman
Date:
On Thu, Jun 19, 2025 at 7:31 PM Alexander Korotkov <akorotkov@postgresql.org> wrote:
Improve runtime and output of tests for replication slots checkpointing.
The TAP tests that verify logical and physical replication slot behavior
during checkpoints (046_checkpoint_logical_slot.pl and
047_checkpoint_physical_slot.pl) inserted two batches of 2 million rows each,
generating approximately 520 MB of WAL. On slow machines, or when compiled
with '-DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE', this caused the
tests to run for 8-9 minutes and occasionally time out, as seen on the
buildfarm animal prion.
Quite a few animals have started failing since this commit (for example [1]). I haven't looked into why, but I suspect something is wrong.

stderr:
# Failed test 'Logical slot still valid'
# at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/046_checkpoint_logical_slot.pl line 134.
# got: 'death by signal at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm line 181.
# '
# expected: ''

I see this failing on my fork's CI, so it seems like it could have been caught earlier?
- Melanie
Re: pgsql: Improve runtime and output of tests for replication slots checkp
From
Alexander Korotkov
Date:
On Fri, Jun 20, 2025, 19:10 Melanie Plageman <melanieplageman@gmail.com> wrote:
On Thu, Jun 19, 2025 at 7:31 PM Alexander Korotkov <akorotkov@postgresql.org> wrote:
Improve runtime and output of tests for replication slots checkpointing.
The TAP tests that verify logical and physical replication slot behavior
during checkpoints (046_checkpoint_logical_slot.pl and
047_checkpoint_physical_slot.pl) inserted two batches of 2 million rows each,
generating approximately 520 MB of WAL. On slow machines, or when compiled
with '-DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE', this caused the
tests to run for 8-9 minutes and occasionally time out, as seen on the
buildfarm animal prion.

Quite a few animals have started failing since this commit (for example [1]). I haven't looked into why, but I suspect something is wrong.

stderr:
# Failed test 'Logical slot still valid'
# at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/046_checkpoint_logical_slot.pl line 134.
# got: 'death by signal at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm line 181.
# '
# expected: ''

I see this failing on my fork's CI, so it seems like it could have been caught earlier?
Melanie, thank you for pointing this out. I'll take a look at this later today.
------
Regards,
Alexander Korotkov
Re: pgsql: Improve runtime and output of tests for replication slots checkp
From
Tom Lane
Date:
Melanie Plageman <melanieplageman@gmail.com> writes:
> Quite a few animals have started failing since this commit (for example
> [1]). I haven't looked into why, but I suspect something is wrong.

It looks to me like it's being triggered by this questionable bit in
046_checkpoint_logical_slot.pl:

# Continue the checkpoint.
$node->safe_psql('postgres',
	q{select injection_points_wakeup('checkpoint-before-old-wal-removal')});

# Abruptly stop the server (1 second should be enough for the checkpoint
# to finish; it would be better).
$node->stop('immediate');

That second comment is pretty unintelligible, but I think it's expecting
that we'd give the checkpoint 1 second to complete, which the code is
*not* doing. On my own machine it looks like the checkpoint does manage
to complete within about 1ms, just barely before the shutdown arrives:

2025-06-20 17:52:25.599 EDT [2538690] 046_checkpoint_logical_slot.pl LOG: statement: select pg_replication_slot_advance('slot_physical',pg_current_wal_lsn())
2025-06-20 17:52:25.602 EDT [2538692] 046_checkpoint_logical_slot.pl LOG: statement: select injection_points_wakeup('checkpoint-before-old-wal-removal')
2025-06-20 17:52:25.603 EDT [2538557] LOG: checkpoint complete: wrote 1 buffers (0.0%), wrote 0 SLRU buffers; 0 WAL file(s) added, 0 removed, 0 recycled; write=0.003 s, sync=0.001 s, total=1.074 s; sync files=0, longest=0.000 s, average=0.000 s; distance=327688 kB, estimate=327688 kB; lsn=0/290020C0, redo lsn=0/29002068
2025-06-20 17:52:25.604 EDT [2538553] LOG: received immediate shutdown request

But in the buildfarm failures I don't see any 'checkpoint complete'
before the shutdown.

If this is an accurate diagnosis then it indicates both a test bug (it
should delay here, or else the comment needs to be fixed to explain what
we're actually testing) and a backend bug, because an immediate stop
a/k/a crash before completing the checkpoint should not lead to failure
to function after the next restart.

			regards, tom lane
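If the intent really is to let the checkpoint finish before the crash, one way to make that explicit would be to wait for the checkpointer's completion message before the immediate stop, e.g. with Cluster.pm's wait_for_log(). This is only a sketch of that idea, not a proposed patch; whether the test should wait at all is exactly the open question above:

    # Remember where the server log currently ends.
    my $log_offset = -s $node->logfile;

    # Continue the checkpoint.
    $node->safe_psql('postgres',
        q{select injection_points_wakeup('checkpoint-before-old-wal-removal')});

    # Wait for the checkpointer to report completion rather than racing it,
    # then stop the server abruptly.
    $node->wait_for_log(qr/checkpoint complete/, $log_offset);
    $node->stop('immediate');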
Re: pgsql: Improve runtime and output of tests for replication slots checkp
From
Alexander Korotkov
Date:
On Sat, Jun 21, 2025 at 1:25 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
> > But in the buildfarm failures I don't see any 'checkpoint complete'
> > before the shutdown.
>
> Ooops, I lied: we have at least one case where the checkpoint does
> finish but then it hangs up anyway:
>
> https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=melonworm&dt=2025-06-20%2019%3A59%3A31&stg=recovery-check
>
> 2025-06-20 20:05:32.548 UTC [3171474][client backend][61/2:0] LOG: statement: select injection_points_wakeup('checkpoint-before-old-wal-removal')
> 2025-06-20 20:05:32.550 UTC [3162113][checkpointer][:0] LOG: checkpoint complete: wrote 1 buffers (0.0%), wrote 0 SLRU buffers; 0 WAL file(s) added, 0 removed, 0 recycled; write=0.045 s, sync=0.001 s, total=1.507 s; sync files=1, longest=0.001 s, average=0.001 s; distance=327688 kB, estimate=327688 kB; lsn=0/290020C0, redo lsn=0/29002068
> 2025-06-20 20:05:32.551 UTC [3171474][client backend][:0] LOG: disconnection: session time: 0:00:00.073 user=bf database=postgres host=[local]
> 2025-06-20 20:05:32.552 UTC [3170563][client backend][:0] LOG: disconnection: session time: 0:00:01.538 user=bf database=postgres host=[local]
> 2025-06-20 20:05:32.564 UTC [3163446][client backend][7/146:0] LOG: statement: SELECT 1
> 2025-06-20 20:05:32.564 UTC [3161873][postmaster][:0] LOG: received immediate shutdown request
> 2025-06-20 20:05:32.604 UTC [3161873][postmaster][:0] LOG: database system is shut down
>
> I'm still confused about whether the 046 script intends to
> sometimes test cases where the checkpoint doesn't have time to
> complete. But it seems that whatever is going on here
> is a bit subtle and platform-dependent.

I think this indicates an unfinished intention to wait for checkpoint
completion. But I think both cases (checkpoint finished and unfinished)
should work correctly. So, I believe there is a backend problem. I'm
trying to reproduce this locally. Sorry for the confusion.

------
Regards,
Alexander Korotkov
Supabase
Re: pgsql: Improve runtime and output of tests for replication slots checkp
From
Alexander Korotkov
Date:
On Sat, Jun 21, 2025 at 1:40 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Sat, Jun 21, 2025 at 1:25 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I wrote:
> > > But in the buildfarm failures I don't see any 'checkpoint complete'
> > > before the shutdown.
> >
> > Ooops, I lied: we have at least one case where the checkpoint does
> > finish but then it hangs up anyway:
> >
> > https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=melonworm&dt=2025-06-20%2019%3A59%3A31&stg=recovery-check
> >
> > 2025-06-20 20:05:32.548 UTC [3171474][client backend][61/2:0] LOG: statement: select injection_points_wakeup('checkpoint-before-old-wal-removal')
> > 2025-06-20 20:05:32.550 UTC [3162113][checkpointer][:0] LOG: checkpoint complete: wrote 1 buffers (0.0%), wrote 0 SLRU buffers; 0 WAL file(s) added, 0 removed, 0 recycled; write=0.045 s, sync=0.001 s, total=1.507 s; sync files=1, longest=0.001 s, average=0.001 s; distance=327688 kB, estimate=327688 kB; lsn=0/290020C0, redo lsn=0/29002068
> > 2025-06-20 20:05:32.551 UTC [3171474][client backend][:0] LOG: disconnection: session time: 0:00:00.073 user=bf database=postgres host=[local]
> > 2025-06-20 20:05:32.552 UTC [3170563][client backend][:0] LOG: disconnection: session time: 0:00:01.538 user=bf database=postgres host=[local]
> > 2025-06-20 20:05:32.564 UTC [3163446][client backend][7/146:0] LOG: statement: SELECT 1
> > 2025-06-20 20:05:32.564 UTC [3161873][postmaster][:0] LOG: received immediate shutdown request
> > 2025-06-20 20:05:32.604 UTC [3161873][postmaster][:0] LOG: database system is shut down
> >
> > I'm still confused about whether the 046 script intends to
> > sometimes test cases where the checkpoint doesn't have time to
> > complete. But it seems that whatever is going on here
> > is a bit subtle and platform-dependent.
>
> I think this indicates unfinished intention to wait for checkpoint
> completion. But I think both cases (checkpoint finished and
> unfinished) should work correctly. So, I believe there is a backend
> problem. I'm trying to reproduce this locally. Sorry for the
> confusion.
I managed to reproduce the bug locally. I've got the following backtrace.
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
* frame #0: 0x000000019e6fd3c8 libsystem_kernel.dylib`__semwait_signal + 8
frame #1: 0x000000019e5de568 libsystem_c.dylib`nanosleep + 220
frame #2: 0x0000000103009628 postgres`pg_usleep(microsec=1000) at pgsleep.c:50:10
frame #3: 0x00000001027bd248 postgres`read_local_xlog_page_guts(state=0x000000011383ed58, targetPagePtr=687882240, reqLen=277, targetRecPtr=687874360, cur_page="\U00000018\xd1\U00000005", wait_for_wal=true) at xlogutils.c:950:4
frame #4: 0x00000001027bd130 postgres`read_local_xlog_page(state=0x000000011383ed58, targetPagePtr=687882240, reqLen=277, targetRecPtr=687874360, cur_page="\U00000018\xd1\U00000005") at xlogutils.c:848:9
frame #5: 0x00000001027ae184 postgres`ReadPageInternal(state=0x000000011383ed58, pageptr=687882240, reqLen=277) at xlogreader.c:1061:12
frame #6: 0x00000001027acd14 postgres`XLogDecodeNextRecord(state=0x000000011383ed58, nonblocking=false) at xlogreader.c:727:14
frame #7: 0x00000001027ac6bc postgres`XLogReadAhead(state=0x000000011383ed58, nonblocking=false) at xlogreader.c:974:11
frame #8: 0x00000001027ac5c0 postgres`XLogReadRecord(state=0x000000011383ed58, errormsg=0x000000016d798dd0) at xlogreader.c:405:3
frame #9: 0x0000000102c1d1d0 postgres`pg_logical_slot_get_changes_guts(fcinfo=0x0000000113830700, confirm=true, binary=false) at logicalfuncs.c:257:13
frame #10: 0x0000000102c1ca00 postgres`pg_logical_slot_get_changes(fcinfo=0x0000000113830700) at logicalfuncs.c:333:9
frame #11: 0x00000001029e1adc postgres`ExecMakeTableFunctionResult(setexpr=0x0000000113825440, econtext=0x00000001138252e0, argContext=0x0000000113830600, expectedDesc=0x000000011382ecb0, randomAccess=false) at execSRF.c:234:13
While looking at frame 5, I found the following code being executed.
	targetPagePtr = RecPtr - (RecPtr % XLOG_BLCKSZ);
	targetRecOff = RecPtr % XLOG_BLCKSZ;

	/*
	 * Read the page containing the record into state->readBuf. Request enough
	 * byte count to cover the whole record header, or at least the part of it
	 * that fits on the same page.
	 */
	readOff = ReadPageInternal(state, targetPagePtr,
							   Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ));
And I see the following variable values.
(lldb) p/x targetPagePtr
(XLogRecPtr) 0x0000000029004000
(lldb) p/x RecPtr
(XLogRecPtr) 0x0000000029002138
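For reference, assuming the default XLOG_BLCKSZ of 8192 (0x2000), the quoted expression would give:

    RecPtr % XLOG_BLCKSZ            = 0x29002138 % 0x2000 = 0x138
    RecPtr - (RecPtr % XLOG_BLCKSZ) = 0x29002138 - 0x138  = 0x29002000

i.e. exactly one page before the targetPagePtr of 0x29004000 seen in frame 5.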
I hardly understand how this is possible, given that it was compiled with "-O0". I'm planning to continue investigating this tomorrow.
------
Regards,
Alexander Korotkov
Supabase