Thread: pg_logical_slot_get_changes waits continuously for a partial WAL record spanning across 2 pages

Hi,

The buildfarm identified an issue at [1] where the system could not
detect that a partial WAL record spanning two pages had been written
due to an immediate shutdown.
Consider a scenario where a WAL record is split across multiple WAL
pages. If the server crashes before the entire WAL record is written,
the recovery process detects the incomplete (broken) record. In such
cases, a special flag, XLP_FIRST_IS_OVERWRITE_CONTRECORD, is set in
the page header to indicate this condition. When we attempt to
retrieve changes using a logical replication slot, the system reads
WAL data based on the record header. In one specific instance, a total
of 8133 bytes were required to reconstruct the WAL record. Of this,
277 bytes were expected to be present in the subsequent WAL page.
However, only 248 bytes were available, because no new WAL records had
been generated as there were no operations performed on the system
after the crash. As a result, the slot_get_changes function
continuously checks whether the current WAL flush position has reached
0x29004115 (the offset corresponding to the required 277 bytes in the
next WAL page). Since no new WAL is being generated, the flush
position always returns 0x290040F8 (the 248-byte offset), causing the
function to wait indefinitely at read_local_xlog_page_guts function.
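
To illustrate the hang, here is a minimal sketch (not the actual
PostgreSQL code) of the wait that read_local_xlog_page_guts()
effectively ends up in; the variable names follow the page-read
callback arguments, and the LSNs are the ones from the failure above:

```
/*
 * Minimal sketch, assuming the page-read callback arguments
 * targetPagePtr and reqLen: the reader needs WAL up to
 * targetPagePtr + reqLen (0x29004115 here), but the flush position
 * stays at 0x290040F8 because no new WAL is generated, so the loop
 * never terminates.
 */
XLogRecPtr  loc = targetPagePtr + reqLen;   /* 0x29004115 */

while (GetFlushRecPtr(NULL) < loc)          /* stays at 0x290040F8 */
{
    CHECK_FOR_INTERRUPTS();
    pg_usleep(1000L);       /* keep waiting for WAL that never arrives */
}
```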

Currently, the logic attempts to read the complete WAL record based on
the size obtained before the crash, even though only a partial record
was written. XLogDecodeNextRecord() checks the page header for the
XLP_FIRST_IS_OVERWRITE_CONTRECORD flag only after reading the complete
WAL record, but since that much WAL data is not available we never get
a chance to check the header. To address this, a more robust approach
would be to read the page header first, check whether the
XLP_FIRST_IS_OVERWRITE_CONTRECORD flag is set, and only then proceed to
read the rest of the WAL record if it is not marked as an overwritten
contrecord. This prevents the system from waiting for WAL data that
will never arrive. The attached partial_wal_record_fix.patch implements
this.
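
For illustration, a rough sketch of that idea (not the exact contents
of the attached patch, just the shape of the extra short-header check
in the contrecord loop of XLogDecodeNextRecord()):

```
/*
 * Rough sketch only: read just the short page header of the
 * continuation page first, and give up immediately if it says the
 * rest of the record was overwritten, instead of requesting the full
 * remaining record length.
 */
readOff = ReadPageInternal(state, targetPagePtr, SizeOfXLogShortPHD);
if (readOff < 0)
    goto err;

Assert(SizeOfXLogShortPHD <= readOff);
pageHeader = (XLogPageHeader) state->readBuf;

if (pageHeader->xlp_info & XLP_FIRST_IS_OVERWRITE_CONTRECORD)
{
    /* The missing part of this record will never arrive. */
    state->overwrittenRecPtr = RecPtr;
    RecPtr = targetPagePtr;
    goto restart;
}

/* Otherwise it is safe to ask for the rest of the record as before. */
```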

I don't have a consistent test to reproduce this issue; currently it
can be reproduced by running the 046_checkpoint_logical_slot test
about 50 times. This test was reverted recently after it caused a few
buildfarm failures discussed at [2]. The same test is also attached as
reverted_test_046_checkpoint_logical_slot.patch.

Added members who were involved in the discussion of this issue.

Thoughts?

[1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2025-06-19+23%3A47%3A08
[2] - https://www.postgresql.org/message-id/CAPpHfdsO9s5he3xHWNFtwvXtvsftu3nNz%3DPV9fdN32UOh-Z3tA%40mail.gmail.com

Regards,
Vignesh

On Wed, Jun 25, 2025 at 10:19:55PM +0530, vignesh C wrote:
> Currently, the logic attempts to read the complete WAL record based on
> the size obtained before the crash, even though only a partial record
> was written. XLogDecodeNextRecord() checks the page header for the
> XLP_FIRST_IS_OVERWRITE_CONTRECORD flag only after reading the complete
> WAL record, but since that much WAL data is not available we never get
> a chance to check the header. To address this, a more robust approach
> would be to read the page header first, check whether the
> XLP_FIRST_IS_OVERWRITE_CONTRECORD flag is set, and only then proceed to
> read the rest of the WAL record if it is not marked as an overwritten
> contrecord. This prevents the system from waiting for WAL data that
> will never arrive. The attached partial_wal_record_fix.patch implements
> this.

So you are suggesting the addition of an extra ReadPageInternal() call
that forces a read of only the page header, performs the checks on the
header, then reads the rest.  After reading SizeOfXLogShortPHD worth of
data, shouldn't the checks on xlp_rem_len be done a bit earlier than
what you are proposing in this patch?

> I don't have a consistent test to reproduce this issue; currently it
> can be reproduced by running the 046_checkpoint_logical_slot test
> about 50 times. This test was reverted recently after it caused a few
> buildfarm failures discussed at [2]. The same test is also attached as
> reverted_test_046_checkpoint_logical_slot.patch.

Seeing the noise that this originally created in the CFbot and the
buildfarm, even if the issue would only be triggered after a timeout,
I'd vote for relying on this test being good enough for the purpose of
this race condition.  Another reliable approach would be to make the
code wait before reading the record in the internal loop of
ReadPageInternal() with an injection point when we know that we have a
contrecord, but I'm not really excited about this prospect in
xlogreader.c, which can also be used in the frontend.
--
Michael

On Thu, Jun 26, 2025 at 09:40:05AM +0000, Hayato Kuroda (Fujitsu) wrote:
> I've spent some time on how we could use an injection point to cause the same
> situation, which generates the OVERWRITE_CONTRECORD in the middle of a page,
> but it seems difficult.

Hmm.  I was wondering first about detecting an OVERWRITE_CONTRECORD
while looping in XLogDecodeNextRecord() with an
IS_INJECTION_POINT_ATTACHED(), then doing a wait.  But this method
means that the injection point is linked to the fix we are trying to
bundle here, because it would need to be placed between the two
ReadPageInternal() calls as we need to read the header first.  Usually
I try to place points so that they are generic enough, but that's a
vague philosophy.

> So... I have no idea how to create a deterministic reproducer; it is OK for
> me to use the 046 test for this purpose.

Let's reuse 046 and call it a day.  It would not be the first
reproducer that fails on timeout if a problem arises.  The important
fact is to be informed about the failure, while keeping tests fast in
the "normal" cases.  We have a few tests like that already in the
tree, where a timeout is the best thing we can rely on when the code
fails its test.
--
Michael

On Thu, Jun 26, 2025 at 05:25:42PM +0530, vignesh C wrote:
> On Thu, 26 Jun 2025 at 06:22, Michael Paquier <michael@paquier.xyz> wrote:
>> So you are suggesting the addition of an extra ReadPageInternal() call
>> that forces a read of only the page header, performs the checks on the
>> header, then reads the rest.  After reading SizeOfXLogShortPHD worth of
>> data, shouldn't the checks on xlp_rem_len be done a bit earlier than
>> what you are proposing in this patch?
>
> Modified

It seems to me that this assert can be moved after the second page
read:
Assert(SizeOfXLogShortPHD <= readOff);

Coming back to the point of Kuroda-san about performance, did you do
some checks related to that and did you measure any difference?  I
suspect none of that because in most cases we are just going to fetch
the next page and we would trigger the fast-exit path of
ReadPageInternal() on the second call when fetching the rest.  I still
need to get an idea of all that by myself, probably with various
lengths of logical message records.
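
For context, the fast-exit path mentioned above is, paraphrased from
memory, the check at the top of ReadPageInternal() that returns
immediately when the requested range is already buffered:

```
/*
 * Approximate shape of the fast-exit at the top of ReadPageInternal()
 * (paraphrased): if the caller asks for a range that is already in
 * the reader's buffer, return at once without reading anything.
 */
if (pageptr == state->readPagePtr && reqLen <= state->readLen)
    return state->readLen;
```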

Perhaps this code could be improved in the future with less page
reads.  Anyway, what you are doing here is simple enough that it is a
no-brainer for the back-branches because we are just forcing our way
through with a new short header validation, so logically that's sound
as far as I can see.
--
Michael

On Fri, 27 Jun 2025 at 07:05, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Jun 26, 2025 at 05:25:42PM +0530, vignesh C wrote:
> > On Thu, 26 Jun 2025 at 06:22, Michael Paquier <michael@paquier.xyz> wrote:
> >> So you are suggesting the addition of an extra ReadPageInternal() call
> >> that forces a read of only the page header, performs the checks on the
> >> header, then reads the rest.  After reading SizeOfXLogShortPHD worth of
> >> data, shouldn't the checks on xlp_rem_len be done a bit earlier than
> >> what you are proposing in this patch?
> >
> > Modified
>
> It seems to me that this assert can be moved after the second page
> read:
> Assert(SizeOfXLogShortPHD <= readOff);
>
> Coming back to the point of Kuroda-san about performance, did you do
> some checks related to that and did you measure any difference?  I
> suspect none of that because in most cases we are just going to fetch
> the next page and we would trigger the fast-exit path of
> ReadPageInternal() on the second call when fetching the rest.  I still
> need to get an idea of all that by myself, probably with various
> lengths of logical message records.
>
> Perhaps this code could be improved in the future with less page
> reads.  Anyway, what you are doing here is simple enough that it is a
> no-brainer for the back-branches because we are just forcing our way
> through with a new short header validation, so logically that's sound
> as far as I can see.

I was unable to reproduce the same test failure on the PG17 branch,
even after running the test around 500 times. However, on the master
branch, the failure consistently reproduces approximately once in
every 50 runs. I also noticed that while the buildfarm has reported
multiple failures for this test for the master branch, none of them
appear to be on the PG17 branch. I'm not yet sure why this discrepancy
exists.

Regards,
Vignesh



On Fri, Jun 27, 2025 at 9:29 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Fri, 27 Jun 2025 at 07:05, Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Thu, Jun 26, 2025 at 05:25:42PM +0530, vignesh C wrote:
> > > On Thu, 26 Jun 2025 at 06:22, Michael Paquier <michael@paquier.xyz> wrote:
> > >> So you are suggesting the addition of an extra ReadPageInternal() call
> > >> that forces a read of only the page header, performs the checks on the
> > >> header, then reads the rest.  After reading SizeOfXLogShortPHD worth of
> > >> data, shouldn't the checks on xlp_rem_len be done a bit earlier than
> > >> what you are proposing in this patch?
> > >
> > > Modified
> >
> > It seems to me that this assert can be moved after the second page
> > read:
> > Assert(SizeOfXLogShortPHD <= readOff);
>
> I felt this Assert can stay here to ensure we have read SizeOfXLogShortPHD
> bytes before checking the page header contents. But for the second page
> read, the following existing Assert should be enough:
> Assert(pageHeaderSize <= readOff);

Thanks, that makes sense.

> >
> > Coming back to the point of Kuroda-san about performance, did you do
> > some checks related to that and did you measure any difference?  I
> > suspect none of that because in most cases we are just going to fetch
> > the next page and we would trigger the fast-exit path of
> > ReadPageInternal() on the second call when fetching the rest.  I still
> > need to get an idea of all that by myself, probably with various
> > lengths of logical message records.
>
> The test execution times are measured in microseconds. In the results
> table below, the first row indicates the message size, and each value
> represents the median of 5 test runs.
> The attached script was used to run these tests. The MSG_SIZE variable
> in the script should be changed to 1000 (1 page), 10000 (approx 2
> pages), 25000 (approx 3 pages), 50000 (approx 6 pages), 100000 (approx
> 12 pages), 1000000 (approx 122 pages), 10000000 (approx 1220 pages) and
> 100000000 (approx 12207 pages) for each run. The test execution times
> can be taken from the run_*.dat files that will be generated.
>
> Size  | 1000   | 10000  | 25000   | 50000   | 100000  | 1000000
> ------|--------|--------|---------|---------|---------|--------
> HEAD  | 9297.1 | 9895.4 | 10844.2 | 12946.5 | 16945.1 | 86187.1
> Patch | 9222.7 | 9889   | 10897.1 | 12904.2 | 16858.4 | 87115.5
>
> Size  | 10000000 | 100000000
> ------|----------|----------
> HEAD  | 804965.6 | 331639.7
> Patch | 804942.6 | 321198.6
>
> The performance results show that the patch does not introduce any
> noticeable overhead across varying message sizes; I felt there was no
> impact from the additional page header read.

Yeah, that doesn't seem like a regression.

--
Regards,
Dilip Kumar
Google



Dear Vignesh,

> I was unable to reproduce the same test failure on the PG17 branch,
> even after running the test around 500 times. However, on the master
> branch, the failure consistently reproduces approximately once in
> every 50 runs. I also noticed that while the buildfarm has reported
> multiple failures for this test for the master branch, none of them
> appear to be on the PG17 branch. I'm not yet sure why this discrepancy
> exists.

I was also not able to reproduce it as-is. After analyzing a bit more, I found
that on PG17 the workload cannot generate an FPI_FOR_HINT record. On HEAD, that
WAL record is long enough to cross a page boundary, so there was a possibility
that it could be flushed partially. Since it cannot be generated on PG17, the
OVERWRITE_CONTRECORD does not appear there.

I modified the test code as in [1] and confirmed that the same hang can happen
on PG17. It generates a long record which can cross a page boundary and can be
flushed partially.

[1]: 
```
--- a/src/test/recovery/t/046_checkpoint_logical_slot.pl
+++ b/src/test/recovery/t/046_checkpoint_logical_slot.pl
@@ -123,6 +123,10 @@ $node->safe_psql('postgres',
 $node->safe_psql('postgres',
        q{select injection_points_wakeup('checkpoint-before-old-wal-removal')});
 
+# Generate a long WAL record
+$node->safe_psql('postgres',
+       q{select pg_logical_emit_message(false, '', repeat('123456789', 1000))});
```

Best regards,
Hayato Kuroda
FUJITSU LIMITED