Thread: Re: Logical replication 'invalid memory alloc request size 1585837200' after upgrading to 17.5

On Mon, May 19, 2025 at 8:08 PM Duncan Sands
<duncan.sands@deepbluecap.com> wrote:
>
> PostgreSQL v17.5 (Ubuntu 17.5-1.pgdg24.04+1); Ubuntu 24.04.2 LTS (kernel
> 6.8.0); x86-64
>
> Good morning from DeepBlueCapital.  Soon after upgrading to 17.5 from 17.4, we
> started seeing logical replication failures with publisher errors like this:
>
>    ERROR:  invalid memory alloc request size 1196493216
>
> (the exact size varies).  Here is a typical log extract from the publisher:
>
> 2025-05-19 10:30:14 CEST [1348336-465] remote_production_user@blue DEBUG:
> 00000: write FB03/349DEF90 flush FB03/349DEF90 apply FB03/349DEF90 reply_time
> 2025-05-19 10:30:07.467048+02
> 2025-05-19 10:30:14 CEST [1348336-466] remote_production_user@blue LOCATION:
>   ProcessStandbyReplyMessage, walsender.c:2431
> 2025-05-19 10:30:14 CEST [1348336-467] remote_production_user@blue DEBUG:
> 00000: skipped replication of an empty transaction with XID: 207637565
> 2025-05-19 10:30:14 CEST [1348336-468] remote_production_user@blue CONTEXT:
> slot "jnb_production", output plugin "pgoutput", in the commit callback,
> associated LSN FB03/349FF938
> 2025-05-19 10:30:14 CEST [1348336-469] remote_production_user@blue LOCATION:
>   pgoutput_commit_txn, pgoutput.c:629
> 2025-05-19 10:30:14 CEST [1348336-470] remote_production_user@blue DEBUG:
> 00000: UpdateDecodingStats: updating stats 0x5ae1616c17a8 0 0 0 0 1 0 1 191
> 2025-05-19 10:30:14 CEST [1348336-471] remote_production_user@blue LOCATION:
>   UpdateDecodingStats, logical.c:1943
> 2025-05-19 10:30:14 CEST [1348336-472] remote_production_user@blue DEBUG:
> 00000: found top level transaction 207637519, with catalog changes
> 2025-05-19 10:30:14 CEST [1348336-473] remote_production_user@blue LOCATION:
>   SnapBuildCommitTxn, snapbuild.c:1150
> 2025-05-19 10:30:14 CEST [1348336-474] remote_production_user@blue DEBUG:
> 00000: adding a new snapshot and invalidations to 207616976 at FB03/34A1AAE0
> 2025-05-19 10:30:14 CEST [1348336-475] remote_production_user@blue LOCATION:
>   SnapBuildDistributeSnapshotAndInval, snapbuild.c:915
> 2025-05-19 10:30:14 CEST [1348336-476] remote_production_user@blue ERROR:
> XX000: invalid memory alloc request size 1196493216
>
> If I'm reading it right, things go wrong on the publisher while preparing the
> message, i.e. it's not a subscriber problem.
>

Right, I also think so.

> This particular instance was triggered by a large number of catalog
> invalidations: I dumped what I think is the relevant WAL with "pg_waldump -s
> FB03/34A1AAE0 -p 17/main/ --xid=207637519" and the output was a single long line:
>
...
...
>
> While it is long, it doesn't seem to merit allocating anything like 1GB of
> memory.  So I'm guessing that postgres is miscalculating the required size somehow.
>

We fixed a bug in commit 4909b38af0 to distribute invalidations at the
end of a transaction to avoid data loss in certain cases, and that
change could cause such a problem. However, even prior to that commit,
we would eventually end up allocating the memory required for all of a
transaction's invalidations because of the repalloc in
ReorderBufferAddInvalidations, so why does it only matter with this
commit? One possibility is that we now need such allocations for
multiple in-progress transactions. I'll think more about this. It would
be helpful if you could share more details about the workload or, if
possible, a test case or script with which we can reproduce this
problem.

--
With Regards,
Amit Kapila.



On Wed, May 21, 2025 at 11:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, May 19, 2025 at 8:08 PM Duncan Sands
> <duncan.sands@deepbluecap.com> wrote:
> >
> > While it is long, it doesn't seem to merit allocating anything like 1GB of
> > memory.  So I'm guessing that postgres is miscalculating the required size somehow.
> >
>
> We fixed a bug in commit 4909b38af0 to distribute invalidations at the
> end of a transaction to avoid data loss in certain cases, and that
> change could cause such a problem. However, even prior to that commit,
> we would eventually end up allocating the memory required for all of a
> transaction's invalidations because of the repalloc in
> ReorderBufferAddInvalidations, so why does it only matter with this
> commit? One possibility is that we now need such allocations for
> multiple in-progress transactions.
>

I think the problem here is that when we are distributing
invalidations to a concurrent transaction, in addition to queuing the
invalidations as a change, we also copy the distributed invalidations
along with the original transaction's invalidations via repalloc in
ReorderBufferAddInvalidations. So, when there are many in-progress
transactions, each would try to copy all its accumulated invalidations
to the remaining in-progress transactions. This could lead to such an
increase in allocation request size. However, after queuing the
change, we don't need to copy it along with the original transaction's
invalidations. This is because the copy is only required when we don't
process any changes in cases like ReorderBufferForget(). I have
analyzed all such cases, and my analysis is as follows:

ReorderBufferForget()
------------------------------
It is okay not to perform the invalidations that we got from other
concurrent transactions during ReorderBufferForget. This is because
ReorderBufferForget executes invalidations when we skip the
transaction being decoded, as it is not from a database of interest.
So, we execute them only to invalidate shared catalogs (see the comment
at the caller of ReorderBufferForget). It is sufficient to execute such
invalidations in the source transaction only because the transaction
being skipped wouldn't have loaded anything in the shared catalog.

ReorderBufferAbort()
-----------------------------
ReorderBufferAbort() processes invalidations when it has already streamed
some changes. Whenever it would have streamed the change, it would
have processed the concurrent transactions' invalidation messages that
happened before the statement that led to streaming. That should be
sufficient for us.

Consider the following variant of the original case that required the
distribution of invalidations:
1) S1: CREATE TABLE d(data text not null);
2) S1: INSERT INTO d VALUES('d1');
3) S2: BEGIN; INSERT INTO d VALUES('d2');
4) S1: ALTER PUBLICATION pb ADD TABLE d;
5) S2: INSERT INTO unrelated_tab VALUES(1);
6) S2: ROLLBACK;
7) S2: INSERT INTO d VALUES('d3');
8) S1: INSERT INTO d VALUES('d4');

The problem with the sequence is that the insert from 3) could be
decoded *after* 4) in step 5) due to streaming, and that, to decode the
insert (which happened before the ALTER), the catalog snapshot and
cache state used are from *before* the ALTER TABLE. Because the
transaction started in 3) doesn't modify any catalogs, no invalidations
are executed after decoding it. The result could be that the cache
looks like it did at 3), not like after 4). However, this won't create
a problem, because while streaming at 5) we would execute the
invalidation from S1 due to the REORDER_BUFFER_CHANGE_INVALIDATION
change added in ReorderBufferAddInvalidations.

ReorderBufferInvalidate
--------------------------------
The reasoning is the same as for ReorderBufferForget(), as it executes
invalidations for the same purpose, but with a different function to
avoid cleaning up the buffer at the end.

XLOG_XACT_INVALIDATIONS
-------------------------------------------
While processing XLOG_XACT_INVALIDATIONS, we don't need invalidations
accumulated from other xacts because this is a special case to execute
invalidations from a particular command (DDL) in a transaction. It
won't build any cache, so it can't create any invalid state.

--
With Regards,
Amit Kapila.



Hi Amit and Shlok, thanks for thinking about this issue.  We are working on 
reproducing it in our test environment.  Since it seems likely to be related to 
our primary database being very busy with lots of concurrency and large 
transactions, we are starting by creating a streaming replication copy of our 
primary server (this copy to run 17.5, with the primary on 17.4), with the idea 
of then doing logical replication from the standby to see if we hit the same 
issue.  If so, that gives us something to poke at, and we can move on to 
something better from there.

Best wishes, Duncan.

On 21/05/2025 07:48, Amit Kapila wrote:
> On Mon, May 19, 2025 at 8:08 PM Duncan Sands
> <duncan.sands@deepbluecap.com> wrote:
>>
>> PostgreSQL v17.5 (Ubuntu 17.5-1.pgdg24.04+1); Ubuntu 24.04.2 LTS (kernel
>> 6.8.0); x86-64
>>
>> Good morning from DeepBlueCapital.  Soon after upgrading to 17.5 from 17.4, we
>> started seeing logical replication failures with publisher errors like this:
>>
>>     ERROR:  invalid memory alloc request size 1196493216
>>
>> (the exact size varies).  Here is a typical log extract from the publisher:
>>
>> 2025-05-19 10:30:14 CEST [1348336-465] remote_production_user@blue DEBUG:
>> 00000: write FB03/349DEF90 flush FB03/349DEF90 apply FB03/349DEF90 reply_time
>> 2025-05-19 10:30:07.467048+02
>> 2025-05-19 10:30:14 CEST [1348336-466] remote_production_user@blue LOCATION:
>>    ProcessStandbyReplyMessage, walsender.c:2431
>> 2025-05-19 10:30:14 CEST [1348336-467] remote_production_user@blue DEBUG:
>> 00000: skipped replication of an empty transaction with XID: 207637565
>> 2025-05-19 10:30:14 CEST [1348336-468] remote_production_user@blue CONTEXT:
>> slot "jnb_production", output plugin "pgoutput", in the commit callback,
>> associated LSN FB03/349FF938
>> 2025-05-19 10:30:14 CEST [1348336-469] remote_production_user@blue LOCATION:
>>    pgoutput_commit_txn, pgoutput.c:629
>> 2025-05-19 10:30:14 CEST [1348336-470] remote_production_user@blue DEBUG:
>> 00000: UpdateDecodingStats: updating stats 0x5ae1616c17a8 0 0 0 0 1 0 1 191
>> 2025-05-19 10:30:14 CEST [1348336-471] remote_production_user@blue LOCATION:
>>    UpdateDecodingStats, logical.c:1943
>> 2025-05-19 10:30:14 CEST [1348336-472] remote_production_user@blue DEBUG:
>> 00000: found top level transaction 207637519, with catalog changes
>> 2025-05-19 10:30:14 CEST [1348336-473] remote_production_user@blue LOCATION:
>>    SnapBuildCommitTxn, snapbuild.c:1150
>> 2025-05-19 10:30:14 CEST [1348336-474] remote_production_user@blue DEBUG:
>> 00000: adding a new snapshot and invalidations to 207616976 at FB03/34A1AAE0
>> 2025-05-19 10:30:14 CEST [1348336-475] remote_production_user@blue LOCATION:
>>    SnapBuildDistributeSnapshotAndInval, snapbuild.c:915
>> 2025-05-19 10:30:14 CEST [1348336-476] remote_production_user@blue ERROR:
>> XX000: invalid memory alloc request size 1196493216
>>
>> If I'm reading it right, things go wrong on the publisher while preparing the
>> message, i.e. it's not a subscriber problem.
>>
> 
> Right, I also think so.
> 
>> This particular instance was triggered by a large number of catalog
>> invalidations: I dumped what I think is the relevant WAL with "pg_waldump -s
>> FB03/34A1AAE0 -p 17/main/ --xid=207637519" and the output was a single long line:
>>
> ...
> ...
>>
>> While it is long, it doesn't seem to merit allocating anything like 1GB of
>> memory.  So I'm guessing that postgres is miscalculating the required size somehow.
>>
> 
> We fixed a bug in commit 4909b38af0 to distribute invalidations at the
> end of a transaction to avoid data loss in certain cases, and that
> change could cause such a problem. However, even prior to that commit,
> we would eventually end up allocating the memory required for all of a
> transaction's invalidations because of the repalloc in
> ReorderBufferAddInvalidations, so why does it only matter with this
> commit? One possibility is that we now need such allocations for
> multiple in-progress transactions. I'll think more about this. It would
> be helpful if you could share more details about the workload or, if
> possible, a test case or script with which we can reproduce this problem.
> 




On Wed, May 21, 2025 at 4:12 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, May 21, 2025 at 11:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, May 19, 2025 at 8:08 PM Duncan Sands
> > <duncan.sands@deepbluecap.com> wrote:
> > >
> > > While it is long, it doesn't seem to merit allocating anything like 1GB of
> > > memory.  So I'm guessing that postgres is miscalculating the required size somehow.
> > >
> >
> > We fixed a bug in commit 4909b38af0 to distribute invalidations at the
> > end of a transaction to avoid data loss in certain cases, and that
> > change could cause such a problem. However, even prior to that commit,
> > we would eventually end up allocating the memory required for all of a
> > transaction's invalidations because of the repalloc in
> > ReorderBufferAddInvalidations, so why does it only matter with this
> > commit? One possibility is that we now need such allocations for
> > multiple in-progress transactions.
> >
>
> I think the problem here is that when we are distributing
> invalidations to a concurrent transaction, in addition to queuing the
> invalidations as a change, we also copy the distributed invalidations
> along with the original transaction's invalidations via repalloc in
> ReorderBufferAddInvalidations. So, when there are many in-progress
> transactions, each would try to copy all its accumulated invalidations
> to the remaining in-progress transactions. This could lead to such an
> increase in allocation request size.

I agree with this analysis.

> However, after queuing the
> change, we don't need to copy it along with the original transaction's
> invalidations. This is because the copy is only required when we don't
> process any changes in cases like ReorderBufferForget().

It seems that we use the accumulated invalidation message also after
replaying or concurrently aborting a transaction via
ReorderBufferExecuteInvalidations(). Do we need to consider such cases
too?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Wed, 21 May 2025 at 17:18, Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear hackers,
>
> > I think the problem here is that when we are distributing
> > invalidations to a concurrent transaction, in addition to queuing the
> > invalidations as a change, we also copy the distributed invalidations
> > along with the original transaction's invalidations via repalloc in
> > ReorderBufferAddInvalidations. So, when there are many in-progress
> > transactions, each would try to copy all its accumulated invalidations
> > to the remaining in-progress transactions. This could lead to such an
> > increase in allocation request size. However, after queuing the
> > change, we don't need to copy it along with the original transaction's
> > invalidations. This is because the copy is only required when we don't
> > process any changes in cases like ReorderBufferForget(). I have
> > analyzed all such cases, and my analysis is as follows:
>
> Based on the analysis, I created a PoC which avoids the repalloc().
> Invalidation messages distributed by SnapBuildDistributeSnapshotAndInval() are
> not added to the list, only queued, so the repalloc can be skipped. Also, the
> function distributes only the messages in the list, so received messages won't
> be sent again.
> 
> For now, a patch for PG17 is attached for testing purposes. Duncan, can you
> apply it and confirm whether it solves the issue?
>
Hi,

I was able to reproduce the issue with the following test:

1. First, begin 9 concurrent transactions (BEGIN; INSERT into t1 values(11);).
2. In a 10th concurrent transaction, perform 1000 DDLs (ALTER PUBLICATION ADD/DROP TABLE).
3. For each of the 9 concurrent transactions, perform:
    i. 1000 DDLs
    ii. COMMIT;
    iii. BEGIN; INSERT into t1 values(11);
4. Repeat steps 2 and 3 in a loop.

These steps reproduced the error:
2025-05-22 19:03:35.111 JST [63150] sub1 ERROR:  invalid memory alloc
request size 1555752832
2025-05-22 19:03:35.111 JST [63150] sub1 STATEMENT:  START_REPLICATION
SLOT "sub1" LOGICAL 0/0 (proto_version '4', streaming 'parallel',
origin 'any', publication_names '"pub1"')

I have also attached the test script for the same.
Also, I tried to run the test with Kuroda-san's patch and it did not
reproduce the issue.
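
For intuition about why the allocation request balloons so quickly with this
workload, below is a rough, purely illustrative model of the pre-fix behaviour.
It only assumes what is described above (10 sessions, ~1000 invalidation
messages per transaction, ~16 bytes per SharedInvalidationMessage); it is not
PostgreSQL code, just a back-of-the-envelope simulation of how the repalloc'd
per-transaction arrays grow when every committing transaction copies its
accumulated invalidations into every other in-progress transaction:

#include <stdio.h>

#define NTXN     10     /* concurrent sessions, as in the test above */
#define OWN_MSGS 1000   /* invalidation messages produced by each txn's DDLs */
#define MSG_SIZE 16     /* approx. sizeof(SharedInvalidationMessage) in bytes */

int main(void)
{
    unsigned long long invals[NTXN];    /* per-txn accumulated message count */
    unsigned long long biggest = 0;     /* largest array seen so far */

    for (int i = 0; i < NTXN; i++)
        invals[i] = OWN_MSGS;

    for (int round = 1; round <= 2; round++)
    {
        for (int i = 0; i < NTXN; i++)
        {
            /*
             * txn i commits: its accumulated invalidations are appended
             * (the repalloc'd copy) to every other in-progress txn
             */
            for (int j = 0; j < NTXN; j++)
            {
                if (j == i)
                    continue;
                invals[j] += invals[i];
                if (invals[j] > biggest)
                    biggest = invals[j];
            }
            invals[i] = OWN_MSGS;       /* session i starts a fresh txn */
        }
        printf("after round %d: largest array ~%.1f MB\n",
               round, (double) biggest * MSG_SIZE / (1024 * 1024));
    }
    return 0;
}

In this toy model the largest request already exceeds 1 GB partway through the
second pass over the sessions, which is consistent with the reproducer hitting
the invalid memory alloc request limit after only a few iterations of the loop.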

Thanks and Regards,
Shlok Kyal

On Thu, May 22, 2025 at 6:29 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Amit, Sawada-san,
>
> > Good point. After replaying the transaction, it doesn't matter because
> > we would have already relayed the required invalidation while
> > processing REORDER_BUFFER_CHANGE_INVALIDATION messages. However, for
> > the concurrent abort case it could matter. See my analysis for the
> > same below:
> >
> > Simulation of concurrent abort
> > ------------------------------------------
> > 1) S1: CREATE TABLE d(data text not null);
> > 2) S1: INSERT INTO d VALUES('d1');
> > 3) S2: BEGIN; INSERT INTO d VALUES('d2');
> > 4) S2: INSERT INTO unrelated_tab VALUES(1);
> > 5) S1: ALTER PUBLICATION pb ADD TABLE d;
> > 6) S2: INSERT INTO unrelated_tab VALUES(2);
> > 7) S2: ROLLBACK;
> > 8) S2: INSERT INTO d VALUES('d3');
> > 9) S1: INSERT INTO d VALUES('d4');
>
> > The problem with the sequence is that the insert from 3) could be
> > decoded *after* 5) in step 6) due to streaming, and that, to decode the
> > insert (which happened before the ALTER), the catalog snapshot and
> > cache state used are from *before* the ALTER TABLE. Because the
> > transaction started in 3) doesn't actually modify any catalogs, no
> > invalidations are executed after decoding it. Now, assume that, while
> > decoding the insert from 4), we detect a concurrent abort; then the
> > distributed invalidations won't be executed, and if we don't have the
> > accumulated messages in txn->invalidations, the invalidation from step
> > 5) won't be performed. Data loss can occur in steps 8 and 9. This is
> > just a theory, so I could be missing something.
>
> I verified whether this is real, and succeeded in reproducing it. See the
> appendix for the detailed steps.
>
> > If the above turns out to be a problem, one idea for fixing it is that
> > for the concurrent abort case (both during streaming and for prepared
> > transaction's processing), we still check all the remaining changes
> > and process only the changes related to invalidations. This has to be
> > done before the current txn changes are freed via
> > ReorderBufferResetTXN->ReorderBufferTruncateTXN.
>
> I roughly implemented that part; PSA the updated version. One concern is whether
> we should consider the case where invalidations can cause an ereport(ERROR). If
> that happens, the walsender will exit at that time.
>

But, in the catch part, we are already executing invalidations:
...
 /* make sure there's no cache pollution */
ReorderBufferExecuteInvalidations(txn->ninvalidations,  txn->invalidations);
...

So, the behaviour should be the same.

--
With Regards,
Amit Kapila.



Dear Hayato Kuroda, thank you so much for working on this problem.  Your patch 
PG17-0001-Avoid-distributing-invalidation-messages-several-tim.patch solves the 
issue for me.  Without it I get an invalid memory alloc request error within 
about twenty minutes.  With your patch, 24 hours have passed with no errors.

Best wishes, Duncan.

On 21/05/2025 13:48, Hayato Kuroda (Fujitsu) wrote:
> Dear hackers,
> 
>> I think the problem here is that when we are distributing
>> invalidations to a concurrent transaction, in addition to queuing the
>> invalidations as a change, we also copy the distributed invalidations
>> along with the original transaction's invalidations via repalloc in
>> ReorderBufferAddInvalidations. So, when there are many in-progress
>> transactions, each would try to copy all its accumulated invalidations
>> to the remaining in-progress transactions. This could lead to such an
>> increase in allocation request size. However, after queuing the
>> change, we don't need to copy it along with the original transaction's
>> invalidations. This is because the copy is only required when we don't
>> process any changes in cases like ReorderBufferForget(). I have
>> analyzed all such cases, and my analysis is as follows:
> 
> Based on the analysis, I created a PoC which avoids the repalloc().
> Invalidation messages distributed by SnapBuildDistributeSnapshotAndInval() are
> not added to the list, only queued, so the repalloc can be skipped. Also, the
> function distributes only the messages in the list, so received messages won't
> be sent again.
> 
> For now, a patch for PG17 is attached for testing purposes. Duncan, can you
> apply it and confirm whether it solves the issue?
> 
> Best regards,
> Hayato Kuroda
> FUJITSU LIMITED
> 




On Mon, May 26, 2025 at 4:19 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, May 26, 2025 at 2:52 PM Hayato Kuroda (Fujitsu)
> <kuroda.hayato@fujitsu.com> wrote:
> >
> > > If the above hypothesis is true, we need to consider another idea so
> > > that we can execute invalidation messages in both cases.
> >
> > The straightforward fix is to check the change queue as well when the transaction
> > has invalidation messages. 0003 implemented that. One downside is that traversing
> > changes can affect performance. Currently we iterate over all of the changes even
> > for a single REORDER_BUFFER_CHANGE_INVALIDATION. I cannot find a better solution for now.
> >
>
> It can impact the performance of large transactions with few
> invalidations, especially ones that have spilled changes, because
> it needs to traverse the entire list of changes again at the end.

Agreed.

> The
> other idea would be to add new member(s) to ReorderBufferTXN to
> receive distributed invalidations. For adding the new member to
> ReorderBufferTXN: (a) in HEAD, it should be okay; (b) for
> back branches, we may be able to add it at the end, but we should
> check whether any extensions use sizeof(ReorderBufferTXN) and, if
> they do, what we need to do.

If we can make sure that that change won't break the existing
extensions, I think this would be the most reasonable solution.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Mon, May 26, 2025 at 4:19 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, May 26, 2025 at 2:52 PM Hayato Kuroda (Fujitsu)
> <kuroda.hayato@fujitsu.com> wrote:
> >
> > > If the above hypothesis is true, we need to consider another idea so
> > > that we can execute invalidation messages in both cases.
> >
> > The straightforward fix is to check the change queue as well when the transaction
> > has invalidation messages. 0003 implemented that. One downside is that traversing
> > changes can affect performance. Currently we iterate over all of the changes even
> > for a single REORDER_BUFFER_CHANGE_INVALIDATION. I cannot find a better solution for now.
> >
>
> It can impact the performance of large transactions with few
> invalidations, especially ones that have spilled changes, because
> it needs to traverse the entire list of changes again at the end.

What if we remember all executed REORDER_BUFFER_CHANGE_INVALIDATION in
a queue while replaying the transaction so that we can execute them at
the end in a non-error path, instead of re-traversing the entire list
of changes to execute the inval messages? As for concurrent abort
paths, probably we can consider re-traversing the entire list,
unconditionally invalidating all caches (using
InvalidateSystemCaches()), or somehow traversing the list of changes
only when there might be any REORDER_BUFFER_CHANGE_INVALIDATION in the
rest of changes?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Thu, May 29, 2025 at 12:22 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, May 26, 2025 at 4:19 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, May 26, 2025 at 2:52 PM Hayato Kuroda (Fujitsu)
> > <kuroda.hayato@fujitsu.com> wrote:
> > >
> > > > If the above hypothesis is true, we need to consider another idea so
> > > > that we can execute invalidation messages in both cases.
> > >
> > > The straightforward fix is to check the change queue as well when the transaction
> > > has invalidation messages. 0003 implemented that. One downside is that traversing
> > > changes can affect performance. Currently we iterate over all of the changes even
> > > for a single REORDER_BUFFER_CHANGE_INVALIDATION. I cannot find a better solution for now.
> > >
> >
> > It can impact the performance of large transactions with few
> > invalidations, especially ones that have spilled changes, because
> > it needs to traverse the entire list of changes again at the end.
>
> What if we remember all executed REORDER_BUFFER_CHANGE_INVALIDATION in
> a queue while replaying the transaction so that we can execute them at
> the end in a non-error path, instead of re-traversing the entire list
> of changes to execute the inval messages?
>

The currently proposed patch (v4) also traverses only the required
inval messages, as it maintains a separate queue for that. So,
what would be the advantage of forming such a queue during the
processing of changes? Are you imagining a local queue instead of one
at the ReorderBufferTXN level? I feel we still need it at the
ReorderBufferTXN level to ensure that we can execute those changes
across streaming blocks; otherwise, the cleanup of such a local queue
would be tricky and add to the maintenance effort.

One disadvantage of the approach you suggest is that the changes in
the new queue won't be accounted for in logical_decoding_work_mem
computation, which can be done in the proposed approach, although the
patch hasn't implemented it as of now.

A few comments on v4:
===================
1.
+static void
+ReorderBufferExecuteInvalidationsInQueue(ReorderBuffer *rb,
ReorderBufferTXN *txn)
+{
...
...
+ /* Skip other changes because the transaction was aborted */
+ case REORDER_BUFFER_CHANGE_INSERT:
+ case REORDER_BUFFER_CHANGE_UPDATE:
+ case REORDER_BUFFER_CHANGE_DELETE:
+ case REORDER_BUFFER_CHANGE_MESSAGE:
+ case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
+ case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
+ case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+ case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
+ case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+ case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
+ case REORDER_BUFFER_CHANGE_TRUNCATE:
+ break;

These cases should never happen, so we should have either an Assert or
an elog here.

2.
+void
+ReorderBufferAddInvalidationsExtended(ReorderBuffer *rb, TransactionId xid,
+   XLogRecPtr lsn, Size nmsgs,
+   SharedInvalidationMessage *msgs,
+   bool needs_distribute)
{
...
+ * XXX: IIUC This must be done only to the toptxns, but is it right?
+ */
+ if (!needs_distribute && !TransactionIdIsValid(txn->toplevel_xid))
+ {
+ ReorderBufferChange *inval_change;
+
+ /* Duplicate the inval change to queue it */
+ inval_change = ReorderBufferAllocChange(rb);
+ inval_change->action = REORDER_BUFFER_CHANGE_INVALIDATION;

The name needs_distribute seems confusing because when this function
is invoked from SnapBuildDistributeSnapshotAndInval(), the parameter
is passed as false; it should be true in that case, and the meaning of
this parameter in the function should be reversed.

--
With Regards,
Amit Kapila.



On Wed, May 28, 2025 at 10:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, May 29, 2025 at 12:22 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, May 26, 2025 at 4:19 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, May 26, 2025 at 2:52 PM Hayato Kuroda (Fujitsu)
> > > <kuroda.hayato@fujitsu.com> wrote:
> > > >
> > > > > If the above hypothesis is true, we need to consider another idea so
> > > > > that we can execute invalidation messages in both cases.
> > > >
> > > > The straightforward fix is to check the change queue as well when the transaction
> > > > has invalidation messages. 0003 implemented that. One downside is that traversing
> > > > changes can affect performance. Currently we iterate over all of the changes even
> > > > for a single REORDER_BUFFER_CHANGE_INVALIDATION. I cannot find a better solution for now.
> > > >
> > >
> > > It can impact the performance of large transactions with few
> > > invalidations, especially ones that have spilled changes, because
> > > it needs to traverse the entire list of changes again at the end.
> >
> > What if we remember all executed REORDER_BUFFER_CHANGE_INVALIDATION in
> > a queue while replaying the transaction so that we can execute them at
> > the end in a non-error path, instead of re-traversing the entire list
> > of changes to execute the inval messages?
> >
>
> The currently proposed patch (v4) also traverses only the required
> inval messages, as it maintains a separate queue for that. So,
> what would be the advantage of forming such a queue during the
> processing of changes? Are you imagining a local queue instead of one
> at the ReorderBufferTXN level? I feel we still need it at the
> ReorderBufferTXN level to ensure that we can execute those changes
> across streaming blocks; otherwise, the cleanup of such a local queue
> would be tricky and add to the maintenance effort.

Hmm, right. It seems that we keep accumulating inval messages across
streaming blocks.

> One disadvantage of the approach you suggest is that the changes in
> the new queue won't be accounted for in logical_decoding_work_mem
> computation, which can be done in the proposed approach, although the
> patch hasn't implemented it as of now.

If we serialize the new queue to disk, we would need to restore it in
the PG_CATCH() block in order to execute all inval messages, which is
something that I'd like to avoid as it would involve many operations
that could themselves end up in an error.

If each ReorderBufferTXN keeps only non-distributed inval messages in
txn->invalidations and distributes only txn->invalidations to other
transactions, the scope of influence of a single inval message is
limited to the transactions that are being decoded at the same time.
How likely is it that the size of txn->invalidations reaches 1GB? Given
that the size of SharedInvalidationMessage is 16 bytes, we would need
about 67 million inval messages generated across concurrent
transactions.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Dear Sawada-san,

> What if we remember all executed REORDER_BUFFER_CHANGE_INVALIDATION
> in
> a queue while replaying the transaction so that we can execute them at
> the end in a non-error path, instead of re-traversing the entire list
> of changes to execute the inval messages?

I think this idea is similar to the v4 patch [1]. It adds another queue which stores
the inval messages when they are distributed, and that queue is consumed at commit time.

> As for concurrent abort
> paths, probably we can consider re-traversing the entire list,
> unconditionally invalidating all caches (using
> InvalidateSystemCaches()), or somehow traversing the list of changes
> only when there might be any REORDER_BUFFER_CHANGE_INVALIDATION in
> the
> rest of changes?

In my v4 patch, it is OK to just consume all inval messages in the other queue,
because all the needed messages are stored before we try to process the txn.
Based on that, I feel v4 is a somewhat simpler approach.

[1]:
https://www.postgresql.org/message-id/OSCPR01MB149667B316377CE0615E15138F567A%40OSCPR01MB14966.jpnprd01.prod.outlook.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Dear Amit,

I created an updated patch after doing some self-review.

> One disadvantage of the approach you suggest is that the changes in
> the new queue won't be accounted for in logical_decoding_work_mem
> computation, which can be done in the proposed approach, although the
> patch hasn't implemented it as of now.

This part is not implemented yet because it is under discussion. Roughly speaking,
we must consider several points - it might be complex:

1. What is the ordering of serialization? Should we prioritize spilling
   invalidation messages, or normal changes?
2. What is the filename? Can we use the same one as for changes?
3. IIUC we shouldn't stream inval messages even in the stream=on/parallel case.
   Isn't that hacky?

> A few comments on v4:
> ===================
> 1.
> +static void
> +ReorderBufferExecuteInvalidationsInQueue(ReorderBuffer *rb,
> ReorderBufferTXN *txn)
> +{
> ...
> ...
> + /* Skip other changes because the transaction was aborted */
> + case REORDER_BUFFER_CHANGE_INSERT:
> + case REORDER_BUFFER_CHANGE_UPDATE:
> + case REORDER_BUFFER_CHANGE_DELETE:
> + case REORDER_BUFFER_CHANGE_MESSAGE:
> + case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
> + case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
> + case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
> + case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
> + case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
> + case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
> + case REORDER_BUFFER_CHANGE_TRUNCATE:
> + break;
> 
> These cases should never happen, so we should have either an Assert or
> an elog here.

Fixed. The switch was removed and an Assert() was added.

> 2.
> +void
> +ReorderBufferAddInvalidationsExtended(ReorderBuffer *rb, TransactionId xid,
> +   XLogRecPtr lsn, Size nmsgs,
> +   SharedInvalidationMessage *msgs,
> +   bool needs_distribute)
> {
> ...
> + * XXX: IIUC This must be done only to the toptxns, but is it right?
> + */
> + if (!needs_distribute && !TransactionIdIsValid(txn->toplevel_xid))
> + {
> + ReorderBufferChange *inval_change;
> +
> + /* Duplicate the inval change to queue it */
> + inval_change = ReorderBufferAllocChange(rb);
> + inval_change->action = REORDER_BUFFER_CHANGE_INVALIDATION;
> 
> The name needs_distribute seems confusing because when this function
> is invoked from SnapBuildDistributeSnapshotAndInval(), the parameter
> is passed as false; it should be true in that case, and the meaning of
> this parameter in the function should be reversed.

Fixed.

Best regards,
Hayato Kuroda
FUJITSU LIMITED


On Thu, 29 May 2025 at 22:57, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > I agree that chances are much lower than current if txn->invalidations
> > doesn't contain invalidations from other transactions, but it is not
> > clear what exactly you are trying to advocate by it. Are you trying to
> > advocate that we should maintain a member similar to txn->invalidation
> > (say txn->distributed_invals) instead of a queue?
>
> Yes, because I guess it's much simpler. I think it would not be a good
> idea to introduce a new concept of accounting for the memory usage of
> the distributed inval messages too and serializing them, at least on
> back branches. I think that in the case where txn->distributed_invals
> is about to overflow (it doesn't have to be 1GB), we can invalidate all
> caches instead.

To identify overflow scenarios, I’m considering the following options:
a) Introduce a new txn_flags value, such as RBTXN_INVAL_ALL_CACHE, to
explicitly mark transactions that require full cache invalidation.
b) Add a dedicated parameter to indicate an overflow scenario.
c) Set the newly added nentries_distr to -1 to indicate an
overflow scenario.

Do you have any preference or thoughts on which of these approaches
would be cleaner?

Regards,
Vignesh



On Fri, May 30, 2025 at 9:31 AM vignesh C <vignesh21@gmail.com> wrote:
>
> On Thu, 29 May 2025 at 22:57, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > > I agree that chances are much lower than current if txn->invalidations
> > > doesn't contain invalidations from other transactions, but it is not
> > > clear what exactly you are trying to advocate by it. Are you trying to
> > > advocate that we should maintain a member similar to txn->invalidation
> > > (say txn->distributed_invals) instead of a queue?
> >
> > Yes, because I guess it's much simpler. I think it would not be a good
> > idea to introduce a new concept of accounting for the memory usage of
> > the distributed inval messages too and serializing them, at least on
> > back branches. I think that in the case where txn->distributed_invals
> > is about to overflow (it doesn't have to be 1GB), we can invalidate all
> > caches instead.
>

I agree that it would be simpler, and to avoid invalid memory
allocation requests even for rare cases, we can have the backup logic
to invalidate all caches.

> To identify overflow scenarios, I’m considering the following options:
> a) Introduce a new txn_flags value, such as RBTXN_INVAL_ALL_CACHE, to
> explicitly mark transactions that require full cache invalidation.
> b) Add a dedicated parameter to indicate an overflow scenario.
> c) setting the newly added nentries_distr to -1, to indicate an
> overflow scenario.
>
> Do you have any preference or thoughts on which of these approaches
> would be cleaner?
>

I would prefer (a) as that is an explicit way to indicate that we need
to invalidate all caches. But let us see if Sawada-san has something
else in mind.
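
To make the idea concrete, here is a rough, self-contained toy model of option
(a). The names (RBTXN_INVAL_ALL_CACHE, distributed_invalidations, a fixed cap)
mirror what the later patch versions in this thread use, but the struct and
helper below are simplified stand-ins, not the real reorderbuffer code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct { char pad[16]; } SharedInvalidationMessage;   /* stand-in */

#define RBTXN_INVAL_ALL_CACHE       0x0001   /* illustrative bit value */
#define MAX_DISTR_INVAL_MSG_PER_TXN 4        /* tiny cap, just for the demo */

typedef struct
{
    unsigned int txn_flags;
    size_t       ninvalidations_distributed;
    SharedInvalidationMessage *distributed_invalidations;
} ModelTXN;                                  /* stand-in for ReorderBufferTXN */

#define rbtxn_inval_all_cache(txn) \
    (((txn)->txn_flags & RBTXN_INVAL_ALL_CACHE) != 0)

/*
 * Model of the proposed behaviour: accumulate distributed messages until the
 * cap is hit, then give up tracking and remember to invalidate everything.
 */
static void
add_distributed_invalidations(ModelTXN *txn, size_t nmsgs,
                              const SharedInvalidationMessage *msgs)
{
    if (rbtxn_inval_all_cache(txn))
        return;                 /* already overflowed, nothing more to track */

    if (txn->ninvalidations_distributed + nmsgs > MAX_DISTR_INVAL_MSG_PER_TXN)
    {
        txn->txn_flags |= RBTXN_INVAL_ALL_CACHE;
        free(txn->distributed_invalidations);
        txn->distributed_invalidations = NULL;
        txn->ninvalidations_distributed = 0;
        return;
    }

    txn->distributed_invalidations =
        realloc(txn->distributed_invalidations,
                (txn->ninvalidations_distributed + nmsgs) * sizeof(*msgs));
    memcpy(txn->distributed_invalidations + txn->ninvalidations_distributed,
           msgs, nmsgs * sizeof(*msgs));
    txn->ninvalidations_distributed += nmsgs;
}

int main(void)
{
    ModelTXN txn = {0, 0, NULL};
    SharedInvalidationMessage batch[3];

    memset(batch, 0, sizeof(batch));

    add_distributed_invalidations(&txn, 3, batch);   /* fits under the cap  */
    add_distributed_invalidations(&txn, 3, batch);   /* overflows: flag set */

    if (rbtxn_inval_all_cache(&txn))
        printf("overflowed: would call InvalidateSystemCaches() at replay end\n");
    else
        printf("tracked %zu messages\n", txn.ninvalidations_distributed);

    free(txn.distributed_invalidations);
    return 0;
}

The point is only the shape of the control flow: once the flag is set, nothing
more is accumulated for the transaction, and replay ends with a full
InvalidateSystemCaches() instead of replaying individual messages.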

--
With Regards,
Amit Kapila.



On Fri, 30 May 2025 at 10:12, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, May 30, 2025 at 9:31 AM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Thu, 29 May 2025 at 22:57, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > > I agree that chances are much lower than current if txn->invalidations
> > > > doesn't contain invalidations from other transactions, but it is not
> > > > clear what exactly you are trying to advocate by it. Are you trying to
> > > > advocate that we should maintain a member similar to txn->invalidation
> > > > (say txn->distributed_invals) instead of a queue?
> > >
> > > Yes, because I guess it's much simpler. I think it would not be a good
> > > idea to introduce a new concept of accounting for the memory usage of
> > > the distributed inval messages too and serializing them, at least on
> > > back branches. I think that in the case where txn->distributed_invals
> > > is about to overflow (it doesn't have to be 1GB), we can invalidate all
> > > caches instead.
> >
>
> I agree that it would be simpler, and to avoid invalid memory
> allocation requests even for rare cases, we can have the backup logic
> to invalidate all caches.
>
> > To identify overflow scenarios, I’m considering the following options:
> > a) Introduce a new txn_flags value, such as RBTXN_INVAL_ALL_CACHE, to
> > explicitly mark transactions that require full cache invalidation.
> > b) Add a dedicated parameter to indicate an overflow scenario.
> > c) setting the newly added nentries_distr to -1, to indicate an
> > overflow scenario.
> >
> > Do you have any preference or thoughts on which of these approaches
> > would be cleaner?
> >
>
> I would prefer (a) as that is an explicit way to indicate that we need
> to invalidate all caches. But let us see if Sawada-san has something
> else in mind.

The attached v6 version patch has the changes for the same.
Thoughts?

Regards,
Vignesh

On Thu, May 29, 2025 at 9:42 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, May 30, 2025 at 9:31 AM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Thu, 29 May 2025 at 22:57, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > > I agree that chances are much lower than current if txn->invalidations
> > > > doesn't contain invalidations from other transactions, but it is not
> > > > clear what exactly you are trying to advocate by it. Are you trying to
> > > > advocate that we should maintain a member similar to txn->invalidation
> > > > (say txn->distributed_invals) instead of a queue?
> > >
> > > Yes, because I guess it's much simpler. I think it would not be a good
> > > idea to introduce a new concept of accounting for the memory usage of
> > > the distributed inval messages too and serializing them, at least on
> > > back branches. I think that in the case where txn->distributed_invals
> > > is about to overflow (it doesn't have to be 1GB), we can invalidate all
> > > caches instead.
> >
>
> I agree that it would be simpler, and to avoid invalid memory
> allocation requests even for rare cases, we can have the backup logic
> to invalidate all caches.
>
> > To identify overflow scenarios, I’m considering the following options:
> > a) Introduce a new txn_flags value, such as RBTXN_INVAL_ALL_CACHE, to
> > explicitly mark transactions that require full cache invalidation.
> > b) Add a dedicated parameter to indicate an overflow scenario.
> > c) setting the newly added nentries_distr to -1, to indicate an
> > overflow scenario.
> >
> > Do you have any preference or thoughts on which of these approaches
> > would be cleaner?
> >
>
> I would prefer (a) as that is an explicit way to indicate that we need
> to invalidate all caches. But let us see if Sawada-san has something
> else in mind.

(a) makes sense to me too. One concern in back branches is that unused
bits in txn_flags might be used by extensions, but it's unlikely.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Sat, 31 May 2025 at 10:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, May 30, 2025 at 11:00 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Thank you for updating the patch. Here are some review comments:
> >
> > ---
> > +           /*
> > +            * Make sure there's no cache pollution. Unlike the PG_TRY part,
> > +            * this must be done unconditionally because the processing might
> > +            * fail before we reach invalidation messages.
> > +            */
> > +           if (rbtxn_inval_all_cache(txn))
> > +               InvalidateSystemCaches();
> > +           else
> > +               ReorderBufferExecuteInvalidations(txn->ninvalidations_distr,
> > +
> > txn->distributed_invalidations);
> > +
> >
> > If we don't need to execute the distributed inval message in an error
> > path other than detecting concurrent abort, we should describe the
> > reason.
> >
>
> The concurrent abort handling is done for streaming and prepared
> transactions, where we send the transaction changes to the client
> before we read COMMIT/COMMIT PREPARED/ROLLBACK/ROLLBACK PREPARED. Now,
> among these, the COMMIT/ROLLBACK PREPARED cases are handled as part of
> the prepared transaction case. For ROLLBACK, we will never perform any
> changes from the current transaction, so we don't need distributed
> invalidations to be executed. For COMMIT, if we encounter any errors
> while processing changes (this is when we reach the ERROR path, which
> is not a concurrent abort), then we will reprocess all changes and, at
> the end, execute both the current transaction and distributed
> invalidations. Now, one possibility is that if, after ERROR, the
> caller does slot_advance to skip the ERROR, then we will probably miss
> executing the distributed invalidations, leading to data loss
> afterwards. If the above theory is correct, then it is better to
> execute distributed invalidation even in non-concurrent-abort cases in
> the ERROR path.

One possible reason this scenario may not occur is that
pg_logical_slot_get_changes_guts uses a PG_CATCH block to handle
exceptions, during which it calls InvalidateSystemCaches to clear the
system cache. Because of this, I believe the scenario might not
actually happen.
@Sawada-san / others — Are there any other cases where this could still occur?

Regards,
Vignesh



Dear Hayato Kuroda, I tried 
v4-PG17-0001-Avoid-distributing-invalidation-messages-sev.patch and I can 
confirm that it also resolves my original issue.

Best wishes, Duncan.

On 28/05/2025 14:27, Hayato Kuroda (Fujitsu) wrote:
> Dear Sawada-san, Amit,
> 
>>> It can impact the performance of large transactions with few
>>> invalidations, especially ones that have spilled changes, because
>>> it needs to traverse the entire list of changes again at the end.
>>
>> Agreed.
>>
>>> The
>>> other idea would be to add new member(s) to ReorderBufferTXN to
>>> receive distributed invalidations. For adding the new member to
>>> ReorderBufferTXN: (a) in HEAD, it should be okay; (b) for
>>> back branches, we may be able to add it at the end, but we should
>>> check whether any extensions use sizeof(ReorderBufferTXN) and, if
>>> they do, what we need to do.
>>
>> If we can make sure that that change won't break the existing
>> extensions, I think this would be the most reasonable solution.
> 
> Based on the discussion, I created a PoC for master/PG17. Please see attached.
> The basic idea is to introduce a new queue which only contains distributed inval
> messages. Its contents are consumed at the end of the transaction. I feel some of
> the code can be re-used, so internal functions are introduced. At least, it passes
> the regression tests and the workloads discussed here.
> 
> Best regards,
> Hayato Kuroda
> FUJITSU LIMITED
> 




On Mon, Jun 2, 2025 at 3:22 AM vignesh C <vignesh21@gmail.com> wrote:
>
> On Sat, 31 May 2025 at 13:27, vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Fri, 30 May 2025 at 23:00, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > Thank you for updating the patch. Here are some review comments:
> > >
> > > @@ -3439,9 +3464,27 @@ ReorderBufferAddInvalidations(ReorderBuffer
> > > *rb, TransactionId xid,
> > >                               XLogRecPtr lsn, Size nmsgs,
> > >                               SharedInvalidationMessage *msgs)
> > >  {
> > > -   ReorderBufferTXN *txn;
> > > +   ReorderBufferAddInvalidationsExtended(rb, xid, lsn, nmsgs, msgs, false);
> > > +}
> > >
> > > If the patch is the one for master, do we need to have an extended
> > > version of ReorderBufferAddInvalidations()?
> >
> > This has been removed now and ReorderBufferAddDistributedInvalidations
> > has been added.
> >
> > > ---
> > > +           /*
> > > +            * Make sure there's no cache pollution. Unlike the PG_TRY part,
> > > +            * this must be done unconditionally because the processing might
> > > +            * fail before we reach invalidation messages.
> > > +            */
> > > +           if (rbtxn_inval_all_cache(txn))
> > > +               InvalidateSystemCaches();
> > > +           else
> > > +               ReorderBufferExecuteInvalidations(txn->ninvalidations_distr,
> > > +
> > > txn->distributed_invalidations);
> > > +
> > >
> > > If we don't need to execute the distributed inval message in an error
> > > path other than detecting concurrent abort, we should describe the
> > > reason.
> >
> > Removed it to keep it in the common error path
> >
> > > ---
> > > Given that we don't account for the memory usage of both
> > > txn->invalidations and txn->distributed_invalidations, probably we can
> > > have a lower limit, say 8MB (or lower?), to avoid memory exhaustion.
> >
> > Modified
> >
> > > ---
> > > +   if ((for_inval && !AllocSizeIsValid(req_mem_size)) ||
> > > +       rbtxn_inval_all_cache(txn))
> > >     {
> > > -       txn->ninvalidations = nmsgs;
> > > -       txn->invalidations = (SharedInvalidationMessage *)
> > > -           palloc(sizeof(SharedInvalidationMessage) * nmsgs);
> > > -       memcpy(txn->invalidations, msgs,
> > > -              sizeof(SharedInvalidationMessage) * nmsgs);
> > > +       txn->txn_flags |= RBTXN_INVAL_ALL_CACHE;
> > > +
> > > +       if (*invalidations)
> > > +       {
> > > +           pfree(*invalidations);
> > > +           *invalidations = NULL;
> > > +           *ninvalidations = 0;
> > > +       }
> > >
> > > RBTXN_INVAL_ALL_CACHE seems to have an effect only on the distributed
> > > inval messages. One question is: do we need to care about the overflow
> > > of txn->invalidations as well? If not, does it make sense to have a
> > > separate function like ReorderBufferAddDistributedInvalidations()
> > > instead of having an extended version of
> > > ReorderBufferAddInvalidations()? Some common routines can also be
> > > declared as a static function if needed.
> >
> > Modified
> >
> > The attached v7 version patch has the changes for the same.
>
> Here is the patch, including updates for the back branches.
> The main difference from master is that the newly added structure
> members are appended at the end in the back branches to preserve
> compatibility. The invalidation addition logic remains consistent with
> master: a new function, ReorderBufferAddDistributedInvalidations, has
> been introduced to handle distributed invalidations. Shared logic
> between ReorderBufferAddInvalidations and
> ReorderBufferAddDistributedInvalidations has been factored out into a
> common helper, ReorderBufferAddInvalidationsCommon. This approach
> simplifies future merges and, in my assessment, does not introduce any
> backward compatibility issues.
>

Thank you for updating the patch. Here are some review comments:

+   req_mem_size = sizeof(SharedInvalidationMessage) *
(txn->ninvalidations_distr + nmsgs);
+
+   /*
+    * If the number of invalidation messages is larger than 8MB, it's more
+    * efficient to invalidate the entire cache rather than processing each
+    * message individually.
+    */
+   if (req_mem_size > (8 * 1024 * 1024) || rbtxn_inval_all_cache(txn))

It's better to define the maximum number of distributed inval messages
per transaction as a macro instead of calculating the memory size
every time.

---
+static void
+ReorderBufferAddInvalidationsCommon(ReorderBuffer *rb, TransactionId xid,
+                                   XLogRecPtr lsn, Size nmsgs,
+                                   SharedInvalidationMessage *msgs,
+                                   ReorderBufferTXN *txn,
+                                   bool for_inval)

This function is quite confusing to me. For instance,
ReorderBufferAddDistributedInvalidations() needs to call this function
with for_inval=false despite actually adding inval messages. Also,
the following condition is not intuitive, but there is no comment:

    if (!for_inval || (for_inval && !rbtxn_inval_all_cache(txn)))

Instead of having ReorderBufferAddInvalidationsCommon(), I think we
can have a function, say ReorderBufferQueueInvalidations(), where we
enqueue the given inval messages as a
REORDER_BUFFER_CHANGE_INVALIDATION change.
ReorderBufferAddInvalidations() adds inval messages to
txn->invalidations and calls that function, while
ReorderBufferAddDistributedInvalidations() adds inval messages to
txn->distributed_invalidations and calls that function if the array is
not full.

BTW if we need to invalidate all accumulated caches at the end of
transaction replay anyway, we don't need to add inval messages to
txn->invalidations once txn->distributed_invalidations gets full?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Mon, Jun 2, 2025 at 11:39 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Jun 2, 2025 at 4:35 AM Hayato Kuroda (Fujitsu)
> <kuroda.hayato@fujitsu.com> wrote:
> >
> > Dear Amit, Duncan, Vignesh, Sawada-san,
> >
> > I searched on GitHub and confirmed that no extensions use sizeof(ReorderBufferTXN).
> > Based on that, can we proceed with the fix for the back branches?
> > Note that I checked only repositories that are opened on GitHub. Proprietary
> > ones and hosting on other services cannot be guaranteed.
> >
> > How I search
> > ===========
> > I searched on github with "sizeof(ReorderBufferTXN)", and 259 files were found.
> > All of them were either "reorderbuffer.c" or "reorderbuffer.cpp". reorderbuffer.c
> > came from the forked postgres repos, and "reorderbuffer.cpp" came from the
> > openGauss project, which tries to enhance postgres. In both cases, they are
> > server-side code and will rebase community's commits when we update the header
> > file and sizeof(ReorderBufferTXN).
> > In this check, I could not find extensions that refer to the size; only
> > server-side codes were found. Based on that, we could extend the struct
> > ReorderBufferTXN.
>
> Thank you for checking the source code on GitHub. I personally think
> the chance that extensions depend on the ReorderBufferTXN size is low.
> Without that, we would need more complex logic to store inval messages
> per transaction, which would introduce risk. So I agree with the
> current solution.
>

It is difficult to predict whether proprietary extensions rely on
sizeof(ReorderBufferTXN), but I can't think of a better fix for the back
branches than this one (where we add the new members at the end of
ReorderBufferTXN). I think we should explicitly mention this in the
commit message, and in the worst case we would need to ask extension
owners that rely on sizeof(ReorderBufferTXN) to rebuild their extensions.

--
With Regards,
Amit Kapila.



On Tue, Jun 3, 2025 at 12:07 AM vignesh C <vignesh21@gmail.com> wrote:
>
> On Mon, 2 Jun 2025 at 22:49, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Thank you for updating the patch. Here are some review comments:
> >
> > +   req_mem_size = sizeof(SharedInvalidationMessage) *
> > (txn->ninvalidations_distr + nmsgs);
> > +
> > +   /*
> > +    * If the number of invalidation messages is larger than 8MB, it's more
> > +    * efficient to invalidate the entire cache rather than processing each
> > +    * message individually.
> > +    */
> > +   if (req_mem_size > (8 * 1024 * 1024) || rbtxn_inval_all_cache(txn))
> >
> > It's better to define the maximum number of distributed inval messages
> > per transaction as a macro instead of calculating the memory size
> > every time.
>
> Modified
>
> > ---
> > +static void
> > +ReorderBufferAddInvalidationsCommon(ReorderBuffer *rb, TransactionId xid,
> > +                                   XLogRecPtr lsn, Size nmsgs,
> > +                                   SharedInvalidationMessage *msgs,
> > +                                   ReorderBufferTXN *txn,
> > +                                   bool for_inval)
> >
> > This function is quite confusing to me. For instance,
> > ReorderBufferAddDistributedInvalidations() needs to call this function
> > with for_inval=false despite actually adding inval messages. Also,
> > the following condition is not intuitive, but there is no comment:
> >
> >     if (!for_inval || (for_inval && !rbtxn_inval_all_cache(txn)))
> >
> > Instead of having ReorderBufferAddInvalidationsCommon(), I think we
> > can have a function say ReorderBufferQueueInvalidations() where we
> > enqueue the given inval messages as a
> > REORDER_BUFFER_CHANGE_INVALIDATION change.
> > ReorderBufferAddInvalidations() adds inval messages to
> > txn->invalidations and calls that function, while
> > ReorderBufferAddDistributedInvalidations() adds inval messages to
> > txn->distributed_invalidations and calls that function if the array is
> > not full.
>
> Modified
>
> > BTW if we need to invalidate all accumulated caches at the end of
> > transaction replay anyway, we don't need to add inval messages to
> > txn->invalidations once txn->distributed_invalidations gets full?
>
> Yes, there is no need to add invalidation messages to txn->invalidations once
> RBTXN_INVAL_ALL_CACHE is set. This is handled now.
>
> The attached v9 version patch has the changes for the same.

Thank you for updating the patch. Here are review comments on v9 patch:

+/*
+ * Maximum number of distributed invalidation messages per transaction.
+ * Each message is ~16 bytes, this allows up to 8 MB of invalidation
+ * message data.
+ */
+#define MAX_DISTR_INVAL_MSG_PER_TXN 524288

The size of SharedInvalidationMessage could change in the future so we
should calculate it at compile time.
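
As a minimal stand-alone sketch of that suggestion (with a stand-in 16-byte
type in place of the real SharedInvalidationMessage, and keeping the 8MB
budget from the patch), the cap could be derived at compile time like this:

#include <stdio.h>

typedef struct { char pad[16]; } SharedInvalidationMessage;   /* stand-in */

/* budget for distributed inval messages per transaction (8 MB, as above) */
#define MAX_DISTR_INVAL_MSG_SIZE    (8 * 1024 * 1024)

/* derive the message count from the type's size instead of hard-coding it */
#define MAX_DISTR_INVAL_MSG_PER_TXN \
    (MAX_DISTR_INVAL_MSG_SIZE / sizeof(SharedInvalidationMessage))

int main(void)
{
    printf("cap: %zu messages\n", (size_t) MAX_DISTR_INVAL_MSG_PER_TXN);
    return 0;
}

With a 16-byte message this prints 524288, matching the hard-coded value in
the quoted hunk, and it stays correct if the struct size ever changes.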

---
+   /*
+    * If the complete cache will be invalidated, we don't need to accumulate
+    * the invalidations.
+    */
+   if (!rbtxn_inval_all_cache(txn))
+       ReorderBufferAccumulateInvalidations(&txn->ninvalidations,
+                                            &txn->invalidations, nmsgs, msgs);

We need to explain why we don't check the number of invalidation
messages for txn->invalidations and mark it as inval-all-cache, unlike
ReorderBufferAddDistributedInvalidations().

---
+   /*
+    * If the number of invalidation messages is high, performing a full cache
+    * invalidation is more efficient than handling each message separately.
+    */
+   if (((nmsgs + txn->ninvalidations_distributed) >
MAX_DISTR_INVAL_MSG_PER_TXN) ||
+       rbtxn_inval_all_cache(txn))
    {
-       txn->invalidations = (SharedInvalidationMessage *)
-           repalloc(txn->invalidations, sizeof(SharedInvalidationMessage) *
-                    (txn->ninvalidations + nmsgs));
+       txn->txn_flags |= RBTXN_INVAL_ALL_CACHE;

I think we don't need to mark the transaction as RBTXN_INVAL_ALL_CACHE
again. I'd rewrite the logic as follows:

if (txn->ninvalidations_distributed + nmsgs >= MAX_DISTR_INVAL_MSG_PER_TXN)
{
    /* mark the txn as inval-all-cache */
    ....
    /* free the accumulated inval msgs */
    ....
}

if (!rbtxn_inval_all_cache(txn))
    ReorderBufferAccumulateInvalidations(...);

---
-               ReorderBufferAddInvalidations(builder->reorder, txn->xid, lsn,
-                                             ninvalidations, msgs);
+               ReorderBufferAddDistributedInvalidations(builder->reorder,
+                                                        txn->xid, lsn,
+                                                        ninvalidations, msgs);

I think we need some comments here to explain why we need to
distribute only inval messages coming from the current transaction.

---
+/* Should the complete cache be invalidated? */
+#define rbtxn_inval_all_cache(txn) \
+( \
+   ((txn)->txn_flags & RBTXN_INVAL_ALL_CACHE) != 0 \
+)

I find that if we rename the flag to something like
RBTXN_INVAL_OVERFLOWED, it would explain the state of the transaction
clearer.
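
With that rename, the macro above would read, for example:

```c
/*
 * Have the distributed inval messages overflowed, so that the complete
 * cache must be invalidated at the end of the transaction?
 */
#define rbtxn_inval_overflowed(txn) \
( \
	((txn)->txn_flags & RBTXN_INVAL_OVERFLOWED) != 0 \
)
```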

---
Can we have a reasonable test case that covers the inval message overflow cases?

I've attached a patch for some changes and adding more comments (note
that it still has XXX comments). Please include these changes that you
agreed with in the next version patch.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment
On Wed, 4 Jun 2025 at 01:14, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Jun 3, 2025 at 12:07 AM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Mon, 2 Jun 2025 at 22:49, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > Thank you for updating the patch. Here are some review comments:
> > >
> > > +   req_mem_size = sizeof(SharedInvalidationMessage) *
> > > (txn->ninvalidations_distr + nmsgs);
> > > +
> > > +   /*
> > > +    * If the number of invalidation messages is larger than 8MB, it's more
> > > +    * efficient to invalidate the entire cache rather than processing each
> > > +    * message individually.
> > > +    */
> > > +   if (req_mem_size > (8 * 1024 * 1024) || rbtxn_inval_all_cache(txn))
> > >
> > > It's better to define the maximum number of distributed inval messages
> > > per transaction as a macro instead of calculating the memory size
> > > every time.
> >
> > Modified
> >
> > > ---
> > > +static void
> > > +ReorderBufferAddInvalidationsCommon(ReorderBuffer *rb, TransactionId xid,
> > > +                                   XLogRecPtr lsn, Size nmsgs,
> > > +                                   SharedInvalidationMessage *msgs,
> > > +                                   ReorderBufferTXN *txn,
> > > +                                   bool for_inval)
> > >
> > > This function is quite confusing to me. For instance,
> > > ReorderBufferAddDistributedInvalidations() needs to call this function
> > > with for_inval=false in spite of adding inval messages actually. Also,
> > > the following condition seems not intuisive but there is no comment:
> > >
> > >     if (!for_inval || (for_inval && !rbtxn_inval_all_cache(txn)))
> > >
> > > Instead of having ReorderBufferAddInvalidationsCommon(), I think we
> > > can have a function say ReorderBufferQueueInvalidations() where we
> > > enqueue the given inval messages as a
> > > REORDER_BUFFER_CHANGE_INVALIDATION change.
> > > ReorderBufferAddInvalidations() adds inval messages to
> > > txn->invalidations and calls that function, while
> > > ReorderBufferQueueInvalidations() adds inval messages to
> > > txn->distributed_ivnalidations and calls that function if the array is
> > > not full.
> >
> > Modified
> >
> > > BTW if we need to invalidate all accumulated caches at the end of
> > > transaction replay anyway, we don't need to add inval messages to
> > > txn->invalidations once txn->distributed_invalidations gets full?
> >
> > yes, no need to add invalidation messages to txn->invalidation once
> > RBTXN_INVAL_ALL_CACHE is set. This is handled now.
> >
> > The attached v9 version patch has the changes for the same.
>
> Thank you for updating the patch. Here are review comments on v9 patch:
>
> +/*
> + * Maximum number of distributed invalidation messages per transaction.
> + * Each message is ~16 bytes, this allows up to 8 MB of invalidation
> + * message data.
> + */
> +#define MAX_DISTR_INVAL_MSG_PER_TXN 524288
>
> The size of SharedInvalidationMessage could change in the future so we
> should calculate it at compile time.

Modified

> ---
> +   /*
> +    * If the complete cache will be invalidated, we don't need to accumulate
> +    * the invalidations.
> +    */
> +   if (!rbtxn_inval_all_cache(txn))
> +       ReorderBufferAccumulateInvalidations(&txn->ninvalidations,
> +                                            &txn->invalidations, nmsgs, msgs);
>
> We need to explain why we don't check the number of invalidation
> messages for txn->invalidations and mark it as inval-all-cache, unlike
> ReorderBufferAddDistributedInvalidations().

Added comments

> ---
> +   /*
> +    * If the number of invalidation messages is high, performing a full cache
> +    * invalidation is more efficient than handling each message separately.
> +    */
> +   if (((nmsgs + txn->ninvalidations_distributed) >
> MAX_DISTR_INVAL_MSG_PER_TXN) ||
> +       rbtxn_inval_all_cache(txn))
>     {
> -       txn->invalidations = (SharedInvalidationMessage *)
> -           repalloc(txn->invalidations, sizeof(SharedInvalidationMessage) *
> -                    (txn->ninvalidations + nmsgs));
> +       txn->txn_flags |= RBTXN_INVAL_ALL_CACHE;
>
> I think we don't need to mark the transaction as RBTXN_INVAL_ALL_CACHE
> again. I'd rewrite the logic as follows:
>
> if (txn->ninvalidations_distributed + nmsgs >= MAX_DISTR_INVAL_MSG_PER_TXN)
> {
>     /* mark the txn as inval-all-cache */
>     ....
>     /* free the accumulated inval msgs */
>     ....
> }
>
> if (!rbtxn_inval_all_cache(txn))
>     ReorderBufferAccumulateInvalidations(...);

Modified

> ---
> -               ReorderBufferAddInvalidations(builder->reorder, txn->xid, lsn,
> -                                             ninvalidations, msgs);
> +               ReorderBufferAddDistributedInvalidations(builder->reorder,
> +                                                        txn->xid, lsn,
> +                                                        ninvalidations, msgs);
>
> I think we need some comments here to explain why we need to
> distribute only inval messages coming from the current transaction.

Added comments

> ---
> +/* Should the complete cache be invalidated? */
> +#define rbtxn_inval_all_cache(txn) \
> +( \
> +   ((txn)->txn_flags & RBTXN_INVAL_ALL_CACHE) != 0 \
> +)
>
> I find that if we rename the flag to something like
> RBTXN_INVAL_OVERFLOWED, it would explain the state of the transaction
> clearer.

Modified

> Can we have a reasonable test case that covers the inval message overflow cases?
One of us will work on this and post a separate patch

> I've attached a patch for some changes and adding more comments (note
> that it still has XXX comments). Please include these changes that you
> agreed with in the next version patch.

Thanks for the comments, I merged it.

The attached v10 version patch has the changes for the same.

Regards,
Vignesh

Attachment
On Tue, Jun 3, 2025 at 11:48 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Wed, 4 Jun 2025 at 01:14, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, Jun 3, 2025 at 12:07 AM vignesh C <vignesh21@gmail.com> wrote:
> > >
> > > On Mon, 2 Jun 2025 at 22:49, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > Thank you for updating the patch. Here are some review comments:
> > > >
> > > > +   req_mem_size = sizeof(SharedInvalidationMessage) *
> > > > (txn->ninvalidations_distr + nmsgs);
> > > > +
> > > > +   /*
> > > > +    * If the number of invalidation messages is larger than 8MB, it's more
> > > > +    * efficient to invalidate the entire cache rather than processing each
> > > > +    * message individually.
> > > > +    */
> > > > +   if (req_mem_size > (8 * 1024 * 1024) || rbtxn_inval_all_cache(txn))
> > > >
> > > > It's better to define the maximum number of distributed inval messages
> > > > per transaction as a macro instead of calculating the memory size
> > > > every time.
> > >
> > > Modified
> > >
> > > > ---
> > > > +static void
> > > > +ReorderBufferAddInvalidationsCommon(ReorderBuffer *rb, TransactionId xid,
> > > > +                                   XLogRecPtr lsn, Size nmsgs,
> > > > +                                   SharedInvalidationMessage *msgs,
> > > > +                                   ReorderBufferTXN *txn,
> > > > +                                   bool for_inval)
> > > >
> > > > This function is quite confusing to me. For instance,
> > > > ReorderBufferAddDistributedInvalidations() needs to call this function
> > > > with for_inval=false in spite of adding inval messages actually. Also,
> > > > the following condition seems not intuisive but there is no comment:
> > > >
> > > >     if (!for_inval || (for_inval && !rbtxn_inval_all_cache(txn)))
> > > >
> > > > Instead of having ReorderBufferAddInvalidationsCommon(), I think we
> > > > can have a function say ReorderBufferQueueInvalidations() where we
> > > > enqueue the given inval messages as a
> > > > REORDER_BUFFER_CHANGE_INVALIDATION change.
> > > > ReorderBufferAddInvalidations() adds inval messages to
> > > > txn->invalidations and calls that function, while
> > > > ReorderBufferQueueInvalidations() adds inval messages to
> > > > txn->distributed_ivnalidations and calls that function if the array is
> > > > not full.
> > >
> > > Modified
> > >
> > > > BTW if we need to invalidate all accumulated caches at the end of
> > > > transaction replay anyway, we don't need to add inval messages to
> > > > txn->invalidations once txn->distributed_invalidations gets full?
> > >
> > > yes, no need to add invalidation messages to txn->invalidation once
> > > RBTXN_INVAL_ALL_CACHE is set. This is handled now.
> > >
> > > The attached v9 version patch has the changes for the same.
> >
> > Thank you for updating the patch. Here are review comments on v9 patch:
> >
> > +/*
> > + * Maximum number of distributed invalidation messages per transaction.
> > + * Each message is ~16 bytes, this allows up to 8 MB of invalidation
> > + * message data.
> > + */
> > +#define MAX_DISTR_INVAL_MSG_PER_TXN 524288
> >
> > The size of SharedInvalidationMessage could change in the future so we
> > should calculate it at compile time.
>
> Modified
>
> > ---
> > +   /*
> > +    * If the complete cache will be invalidated, we don't need to accumulate
> > +    * the invalidations.
> > +    */
> > +   if (!rbtxn_inval_all_cache(txn))
> > +       ReorderBufferAccumulateInvalidations(&txn->ninvalidations,
> > +                                            &txn->invalidations, nmsgs, msgs);
> >
> > We need to explain why we don't check the number of invalidation
> > messages for txn->invalidations and mark it as inval-all-cache, unlike
> > ReorderBufferAddDistributedInvalidations().
>
> Added comments
>
> > ---
> > +   /*
> > +    * If the number of invalidation messages is high, performing a full cache
> > +    * invalidation is more efficient than handling each message separately.
> > +    */
> > +   if (((nmsgs + txn->ninvalidations_distributed) >
> > MAX_DISTR_INVAL_MSG_PER_TXN) ||
> > +       rbtxn_inval_all_cache(txn))
> >     {
> > -       txn->invalidations = (SharedInvalidationMessage *)
> > -           repalloc(txn->invalidations, sizeof(SharedInvalidationMessage) *
> > -                    (txn->ninvalidations + nmsgs));
> > +       txn->txn_flags |= RBTXN_INVAL_ALL_CACHE;
> >
> > I think we don't need to mark the transaction as RBTXN_INVAL_ALL_CACHE
> > again. I'd rewrite the logic as follows:
> >
> > if (txn->ninvalidations_distributed + nmsgs >= MAX_DISTR_INVAL_MSG_PER_TXN)
> > {
> >     /* mark the txn as inval-all-cache */
> >     ....
> >     /* free the accumulated inval msgs */
> >     ....
> > }
> >
> > if (!rbtxn_inval_all_cache(txn))
> >     ReorderBufferAccumulateInvalidations(...);
>
> Modified
>
> > ---
> > -               ReorderBufferAddInvalidations(builder->reorder, txn->xid, lsn,
> > -                                             ninvalidations, msgs);
> > +               ReorderBufferAddDistributedInvalidations(builder->reorder,
> > +                                                        txn->xid, lsn,
> > +                                                        ninvalidations, msgs);
> >
> > I think we need some comments here to explain why we need to
> > distribute only inval messages coming from the current transaction.
>
> Added comments
>
> > ---
> > +/* Should the complete cache be invalidated? */
> > +#define rbtxn_inval_all_cache(txn) \
> > +( \
> > +   ((txn)->txn_flags & RBTXN_INVAL_ALL_CACHE) != 0 \
> > +)
> >
> > I find that if we rename the flag to something like
> > RBTXN_INVAL_OVERFLOWED, it would explain the state of the transaction
> > clearer.
>
> Modified
>
> > Can we have a reasonable test case that covers the inval message overflow cases?
> One of us will work on this and post a separate patch
>
> > I've attached a patch for some changes and adding more comments (note
> > that it still has XXX comments). Please include these changes that you
> > agreed with in the next version patch.
>
> Thanks for the comments, I merged it.
>
> The attached v10 version patch has the changes for the same.

Thank you for updating the patch. I have some comments and questions:

In ReorderBufferAbort():

        /*
         * We might have decoded changes for this transaction that could load
         * the cache as per the current transaction's view (consider DDL's
         * happened in this transaction). We don't want the decoding of future
         * transactions to use those cache entries so execute invalidations.
         */
        if (txn->ninvalidations > 0)
            ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
                                               txn->invalidations);

I think that if the txn->invalidations_distributed is overflowed, we
would miss executing the txn->invalidations here. Probably the same is
true for ReorderBufferForget() and ReorderBufferInvalidate().

---
I'd like to make it clear again which case we need to execute
txn->invalidations as well as txn->invalidations_distributed (like in
ReorderBufferProcessTXN()) and which case we need to execute only
txn->invalidations (like in ReorderBufferForget() and
ReorderBufferAbort()). I think it might be worth putting some comments
about overall strategy somewhere.

---
BTW for back branches, a simple fix without ABI breakage would be to
introduce the RBTXN_INVAL_OVERFLOWED flag to limit the size of
txn->invalidations. That is, we accumulate inval messages both coming
from the current transaction and distributed by other transactions but
once the size reaches the threshold we invalidate all caches. Is it
worth considering for back branches?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Wed, Jun 4, 2025 at 2:21 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Sawada-san,
>
> > Can we have a reasonable test case that covers the inval message overflow
> > cases?
>
> I have been considering how we add tests, but it needs lots of invalidation
> messages and consumes resources so much. Instead of that, how do you feel
> to use injection_points? If it is enabled, the threshold for overflow is much
> smaller than usual. Attached patch implemented the idea.

Alternative idea is to lower the constant value when using an
assertion build. That way, we don't need to rely on injection points
being enabled.
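
A minimal sketch of that idea, with an arbitrary illustrative value for
the assertion-build case:

```c
#ifdef USE_ASSERT_CHECKING
/* Tiny threshold so ordinary regression tests hit the overflow path. */
#define MAX_DISTR_INVAL_MSG_PER_TXN	8
#else
#define MAX_DISTR_INVAL_MSG_PER_TXN \
	((8 * 1024 * 1024) / sizeof(SharedInvalidationMessage))
#endif
```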

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Thu, Jun 5, 2025 at 3:19 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Jun 3, 2025 at 11:48 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Wed, 4 Jun 2025 at 01:14, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Thank you for updating the patch. I have some comments and questions:
>
> In ReorderBufferAbort():
>
>         /*
>          * We might have decoded changes for this transaction that could load
>          * the cache as per the current transaction's view (consider DDL's
>          * happened in this transaction). We don't want the decoding of future
>          * transactions to use those cache entries so execute invalidations.
>          */
>         if (txn->ninvalidations > 0)
>             ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
>                                                txn->invalidations);
>
> I think that if the txn->invalidations_distributed is overflowed, we
> would miss executing the txn->invalidations here. Probably the same is
> true for ReorderBufferForget() and ReorderBufferInvalidate().
>

This is because of the following check "if
(!rbtxn_inval_overflowed(txn))" in function
ReorderBufferAddInvalidations(). What is the need of such a check in
this function? We don't need to execute distributed invalidations in
cases like ReorderBufferForget() when we haven't decoded any changes.

> ---
> I'd like to make it clear again which case we need to execute
> txn->invalidations as well as txn->invalidations_distributed (like in
> ReorderBufferProcessTXN()) and which case we need to execute only
> txn->invalidations (like in ReorderBufferForget() and
> ReorderBufferAbort()). I think it might be worth putting some comments
> about overall strategy somewhere.
>
> ---
> BTW for back branches, a simple fix without ABI breakage would be to
> introduce the RBTXN_INVAL_OVERFLOWED flag to limit the size of
> txn->invalidations. That is, we accumulate inval messages both coming
> from the current transaction and distributed by other transactions but
> once the size reaches the threshold we invalidate all caches. Is it
> worth considering for back branches?
>

It should work and is worth considering. The main concern would be
that it will hit sooner than we expect in the field, seeing the recent
reports. So, such a change has the potential to degrade the
performance. I feel that the number of people impacted due to
performance would be more than the number of people impacted due to
such an ABI change (adding the new members at the end of
ReorderBufferTXN). However, if we think we want to go safe w.r.t
extensions that can rely on the sizeof ReorderBufferTXN then your
proposal makes sense.

--
With Regards,
Amit Kapila.



Dear Sawada-san,

> Alternative idea is to lower the constant value when using an
> assertion build. That way, we don't need to rely on injection points
> being enabled.

Hmm, possible but I prefer current one. Two concerns:

1.
USE_ASSERT_CHECKING has not been used to change the value yet. The main usage is
to call debug functions in debug build.

2.
If we add tests which is usable only for debug build, it must be run only when it
is enabled. IIUC such test does not exist yet.

Best regards,
Hayato Kuroda
FUJITSU LIMITED


On Thu, 5 Jun 2025 at 03:19, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Thank you for updating the patch. I have some comments and questions:
>
> In ReorderBufferAbort():
>
>         /*
>          * We might have decoded changes for this transaction that could load
>          * the cache as per the current transaction's view (consider DDL's
>          * happened in this transaction). We don't want the decoding of future
>          * transactions to use those cache entries so execute invalidations.
>          */
>         if (txn->ninvalidations > 0)
>             ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
>                                                txn->invalidations);
>
> I think that if the txn->invalidations_distributed is overflowed, we
> would miss executing the txn->invalidations here. Probably the same is
> true for ReorderBufferForget() and ReorderBufferInvalidate().

I'm accumulating the invalidations in txn->invalidations irrespective
of whether the txn is marked RBTXN_INVAL_OVERFLOWED.

> ---
> I'd like to make it clear again which case we need to execute
> txn->invalidations as well as txn->invalidations_distributed (like in
> ReorderBufferProcessTXN()) and which case we need to execute only
> txn->invalidations (like in ReorderBufferForget() and
> ReorderBufferAbort()). I think it might be worth putting some comments
> about overall strategy somewhere.

I have added comments for this; feel free to reword them if any
changes are required.

The attached v11 version patch has the changes for the same.

Regards,
Vignesh

Attachment
Dear Amit,

> > ---
> > I'd like to make it clear again which case we need to execute
> > txn->invalidations as well as txn->invalidations_distributed (like in
> > ReorderBufferProcessTXN()) and which case we need to execute only
> > txn->invalidations (like in ReorderBufferForget() and
> > ReorderBufferAbort()). I think it might be worth putting some comments
> > about overall strategy somewhere.
> >
> > ---
> > BTW for back branches, a simple fix without ABI breakage would be to
> > introduce the RBTXN_INVAL_OVERFLOWED flag to limit the size of
> > txn->invalidations. That is, we accumulate inval messages both coming
> > from the current transaction and distributed by other transactions but
> > once the size reaches the threshold we invalidate all caches. Is it
> > worth considering for back branches?
> >
> 
> It should work and is worth considering. The main concern would be
> that it will hit sooner than we expect in the field, seeing the recent
> reports. So, such a change has the potential to degrade the
> performance. I feel that the number of people impacted due to
> performance would be more than the number of people impacted due to
> such an ABI change (adding the new members at the end of
> ReorderBufferTXN). However, if we think we want to go safe w.r.t
> extensions that can rely on the sizeof ReorderBufferTXN then your
> proposal makes sense.

While considering the approach, I found a doubtful point. Consider the below
workload:

0. S1: CREATE TABLE d(data text not null);
1. S1: BEGIN;
2. S1: INSERT INTO d VALUES ('d1')
3.                         S2: BEGIN;
4.                         S2: INSERT INTO d VALUES ('d2')
5. S1: ALTER PUBLICATION pb ADD TABLE d;
6. S1: ... lots of DDLs so overflow happens
7. S1: COMMIT;
8.                         S2: INSERT INTO d VALUES ('d3');
9.                         S2: COMMIT;
10.                        S2: INSERT INTO d VALUES ('d4');

In this case, the inval message generated by step 5 is discarded at step 6. No
invalidation messages are distributed in the  SnapBuildDistributeSnapshotAndInval().
While decoding S2, relcache cannot be discarded and tuples d3 and d4 won't be
replicated. Do you think this can happen?

Note that this won't happen with the v11 patch. That patch doesn't discard
txn->invalidations in case of overflow, so the needed inval messages can
still be distributed.

Best regards,
Hayato Kuroda
FUJITSU LIMITED


>0. S1: CREATE TABLE d(data text not null);
>1. S1: BEGIN;
>2. S1: INSERT INTO d VALUES ('d1')
>3.                         S2: BEGIN;
>4.                         S2: INSERT INTO d VALUES ('d2')
>5. S1: ALTER PUBLICATION pb ADD TABLE d;
>6. S1: ... lots of DDLs so overflow happens
>7. S1: COMMIT;
>8.                         S2: INSERT INTO d VALUES ('d3');
>9.                         S2: COMMIT;
>10.                        S2: INSERT INTO d VALUES ('d4');
>
> In this case, the inval message generated by step 5 is discarded at step 6. No
> invalidation messages are distributed in the
> SnapBuildDistributeSnapshotAndInval().
> While decoding S2, relcache cannot be discarded and tuples d3 and d4 won't be
> replicated. Do you think this can happen?

Before running the workload, pg_recvlogical was run to check the output.
At step 6, I created and dropped the same table several times until the debug log was output:

```
[650552] LOG:  RBTXN_INVAL_OVERFLOWED flag is set
[650552] STATEMENT:  START_REPLICATION SLOT "test" LOGICAL 0/0 ("proto_version" '4', "publication_names" 'pb')
```

After that, I executed the steps up to step 10, but no output was produced
by pg_recvlogical.

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment
On Wed, Jun 4, 2025 at 11:36 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Sawada-san,
>
> > Alternative idea is to lower the constant value when using an
> > assertion build. That way, we don't need to rely on injection points
> > being enabled.
>
> Hmm, possible but I prefer current one. Two concerns:
>
> 1.
> USE_ASSERT_CHECKING has not been used to change the value yet. The main usage is
> to call debug functions in debug build.

I think we have a similar precedent such as MT_NRELS_HASH to improve
the test coverages.

> 2.
> If we add tests which is usable only for debug build, it must be run only when it
> is enabled. IIUC such test does not exist yet.

I think we need to test cases not to check if we reach a specific code
point but to check if we can get the correct results even if we've
executed various code paths. As for this bug, it is better to check
that it works properly in a variety of cases. That way, we can check
overflow cases and non-overflow cases also in test cases added in the
future, improving the test coverage more.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Wed, Jun 4, 2025 at 11:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jun 5, 2025 at 3:19 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, Jun 3, 2025 at 11:48 PM vignesh C <vignesh21@gmail.com> wrote:
> > >
> > > On Wed, 4 Jun 2025 at 01:14, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Thank you for updating the patch. I have some comments and questions:
> >
> > In ReorderBufferAbort():
> >
> >         /*
> >          * We might have decoded changes for this transaction that could load
> >          * the cache as per the current transaction's view (consider DDL's
> >          * happened in this transaction). We don't want the decoding of future
> >          * transactions to use those cache entries so execute invalidations.
> >          */
> >         if (txn->ninvalidations > 0)
> >             ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
> >                                                txn->invalidations);
> >
> > I think that if the txn->invalidations_distributed is overflowed, we
> > would miss executing the txn->invalidations here. Probably the same is
> > true for ReorderBufferForget() and ReorderBufferInvalidate().
> >
>
> This is because of the following check "if
> (!rbtxn_inval_overflowed(txn))" in function
> ReorderBufferAddInvalidations(). What is the need of such a check in
> this function? We don't need to execute distributed invalidations in
> cases like ReorderBufferForget() when we haven't decoded any changes.



>
> > ---
> > I'd like to make it clear again which case we need to execute
> > txn->invalidations as well as txn->invalidations_distributed (like in
> > ReorderBufferProcessTXN()) and which case we need to execute only
> > txn->invalidations (like in ReorderBufferForget() and
> > ReorderBufferAbort()). I think it might be worth putting some comments
> > about overall strategy somewhere.
> >
> > ---
> > BTW for back branches, a simple fix without ABI breakage would be to
> > introduce the RBTXN_INVAL_OVERFLOWED flag to limit the size of
> > txn->invalidations. That is, we accumulate inval messages both coming
> > from the current transaction and distributed by other transactions but
> > once the size reaches the threshold we invalidate all caches. Is it
> > worth considering for back branches?
> >
>
> It should work and is worth considering. The main concern would be
> that it will hit sooner than we expect in the field, seeing the recent
> reports. So, such a change has the potential to degrade the
> performance. I feel that the number of people impacted due to
> performance would be more than the number of people impacted due to
> such an ABI change (adding the new members at the end of
> ReorderBufferTXN).

That's a fair point. I initially assumed that DDLs were not executed
often in practice, but analyzing this bug has made me realize this
assumption was misguided.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Thu, Jun 5, 2025 at 4:07 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Amit,
>
> > > ---
> > > I'd like to make it clear again which case we need to execute
> > > txn->invalidations as well as txn->invalidations_distributed (like in
> > > ReorderBufferProcessTXN()) and which case we need to execute only
> > > txn->invalidations (like in ReorderBufferForget() and
> > > ReorderBufferAbort()). I think it might be worth putting some comments
> > > about overall strategy somewhere.
> > >
> > > ---
> > > BTW for back branches, a simple fix without ABI breakage would be to
> > > introduce the RBTXN_INVAL_OVERFLOWED flag to limit the size of
> > > txn->invalidations. That is, we accumulate inval messages both coming
> > > from the current transaction and distributed by other transactions but
> > > once the size reaches the threshold we invalidate all caches. Is it
> > > worth considering for back branches?
> > >
> >
> > It should work and is worth considering. The main concern would be
> > that it will hit sooner than we expect in the field, seeing the recent
> > reports. So, such a change has the potential to degrade the
> > performance. I feel that the number of people impacted due to
> > performance would be more than the number of people impacted due to
> > such an ABI change (adding the new members at the end of
> > ReorderBufferTXN). However, if we think we want to go safe w.r.t
> > extensions that can rely on the sizeof ReorderBufferTXN then your
> > proposal makes sense.
>
> While considering the approach, I found a doubtful point. Consider the below
> workload:
>
> 0. S1: CREATE TABLE d(data text not null);
> 1. S1: BEGIN;
> 2. S1: INSERT INTO d VALUES ('d1')
> 3.                                              S2: BEGIN;
> 4.                                              S2: INSERT INTO d VALUES ('d2')
> 5. S1: ALTER PUBLICATION pb ADD TABLE d;
> 6. S1: ... lots of DDLs so overflow happens
> 7. S1: COMMIT;
> 8.                                              S2: INSERT INTO d VALUES ('d3');
> 9.                                              S2: COMMIT;
> 10.                                             S2: INSERT INTO d VALUES ('d4');
>
> In this case, the inval message generated by step 5 is discarded at step 6. No
> invalidation messages are distributed in the  SnapBuildDistributeSnapshotAndInval().
> While decoding S2, relcache cannot be discarded and tuples d3 and d4 won't be
> replicated. Do you think this can happen?

I think that once the S1's inval messages got overflowed, we should
mark other transactions as overflowed instead of distributing inval
messages.
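
Roughly, in SnapBuildDistributeSnapshotAndInval() that could look like the
sketch below, where curtxn stands for the committing transaction whose
messages overflowed and ReorderBufferMarkInvalOverflowed() is a
hypothetical helper that only sets the overflow flag on the concurrent
transaction:

```c
if (rbtxn_inval_overflowed(curtxn))
	/* The messages were discarded, so just force a full cache reset. */
	ReorderBufferMarkInvalOverflowed(builder->reorder, txn->xid);
else
	ReorderBufferAddInvalidations(builder->reorder, txn->xid, lsn,
								  ninvalidations, msgs);
```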

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Thu, Jun 5, 2025 at 3:28 AM vignesh C <vignesh21@gmail.com> wrote:
>
> On Thu, 5 Jun 2025 at 03:19, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Thank you for updating the patch. I have some comments and questions:
> >
> > In ReorderBufferAbort():
> >
> >         /*
> >          * We might have decoded changes for this transaction that could load
> >          * the cache as per the current transaction's view (consider DDL's
> >          * happened in this transaction). We don't want the decoding of future
> >          * transactions to use those cache entries so execute invalidations.
> >          */
> >         if (txn->ninvalidations > 0)
> >             ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
> >                                                txn->invalidations);
> >
> > I think that if the txn->invalidations_distributed is overflowed, we
> > would miss executing the txn->invalidations here. Probably the same is
> > true for ReorderBufferForget() and ReorderBufferInvalidate().
>
> I'm accumulating the invalidations in txn->invalidations irrespective
> of whether the txn is marked RBTXN_INVAL_OVERFLOWED.

Agreed with this change.

>
> > ---
> > I'd like to make it clear again which case we need to execute
> > txn->invalidations as well as txn->invalidations_distributed (like in
> > ReorderBufferProcessTXN()) and which case we need to execute only
> > txn->invalidations (like in ReorderBufferForget() and
> > ReorderBufferAbort()). I think it might be worth putting some comments
> > about overall strategy somewhere.
>
> I have added comments for this; feel free to reword them if any
> changes are required.
>
> The attached v11 version patch has the changes for the same.

I think the patch is getting into good shape. I've attached a patch
that includes changes I recommend. For example, it's better to rename
RBTXN_INVAL_OVERFLOWED to RBTXN_DISTR_INVAL_OVERFLOWED, and it
includes some comment updates. Please review them.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment
On Fri, Jun 6, 2025 at 12:51 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Jun 5, 2025 at 4:07 AM Hayato Kuroda (Fujitsu)
> <kuroda.hayato@fujitsu.com> wrote:
> >
> > Dear Amit,
> >
> > > > ---
> > > > I'd like to make it clear again which case we need to execute
> > > > txn->invalidations as well as txn->invalidations_distributed (like in
> > > > ReorderBufferProcessTXN()) and which case we need to execute only
> > > > txn->invalidations (like in ReorderBufferForget() and
> > > > ReorderBufferAbort()). I think it might be worth putting some comments
> > > > about overall strategy somewhere.
> > > >
> > > > ---
> > > > BTW for back branches, a simple fix without ABI breakage would be to
> > > > introduce the RBTXN_INVAL_OVERFLOWED flag to limit the size of
> > > > txn->invalidations. That is, we accumulate inval messages both coming
> > > > from the current transaction and distributed by other transactions but
> > > > once the size reaches the threshold we invalidate all caches. Is it
> > > > worth considering for back branches?
> > > >
> > >
> > > It should work and is worth considering. The main concern would be
> > > that it will hit sooner than we expect in the field, seeing the recent
> > > reports. So, such a change has the potential to degrade the
> > > performance. I feel that the number of people impacted due to
> > > performance would be more than the number of people impacted due to
> > > such an ABI change (adding the new members at the end of
> > > ReorderBufferTXN). However, if we think we want to go safe w.r.t
> > > extensions that can rely on the sizeof ReorderBufferTXN then your
> > > proposal makes sense.
> >
> > While considering the approach, I found a doubtful point. Consider the below
> > workload:
> >
> > 0. S1: CREATE TABLE d(data text not null);
> > 1. S1: BEGIN;
> > 2. S1: INSERT INTO d VALUES ('d1')
> > 3.                                              S2: BEGIN;
> > 4.                                              S2: INSERT INTO d VALUES ('d2')
> > 5. S1: ALTER PUBLICATION pb ADD TABLE d;
> > 6. S1: ... lots of DDLs so overflow happens
> > 7. S1: COMMIT;
> > 8.                                              S2: INSERT INTO d VALUES ('d3');
> > 9.                                              S2: COMMIT;
> > 10.                                             S2: INSERT INTO d VALUES ('d4');
> >
> > In this case, the inval message generated by step 5 is discarded at step 6. No
> > invalidation messages are distributed in the  SnapBuildDistributeSnapshotAndInval().
> > While decoding S2, relcache cannot be discarded and tuples d3 and d4 won't be
> > replicated. Do you think this can happen?
>
> I think that once the S1's inval messages got overflowed, we should
> mark other transactions as overflowed instead of distributing inval
> messages.
>

Yeah, this should work, but are you still advocating that we go with
this approach (marking txn->invalidations also as overflowed) for
back branches? In the previous email, you seemed to agree about the
performance impact due to DDLs, so it is not clear which approach you
prefer.

--
With Regards,
Amit Kapila.



On Thu, Jun 5, 2025 at 11:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Jun 4, 2025 at 11:36 PM Hayato Kuroda (Fujitsu)
> <kuroda.hayato@fujitsu.com> wrote:
> >
> > Dear Sawada-san,
> >
> > > Alternative idea is to lower the constant value when using an
> > > assertion build. That way, we don't need to rely on injection points
> > > being enabled.
> >
> > Hmm, possible but I prefer current one. Two concerns:
> >
> > 1.
> > USE_ASSERT_CHECKING has not been used to change the value yet. The main usage is
> > to call debug functions in debug build.
>
> I think we have a similar precedent such as MT_NRELS_HASH to improve
> the test coverages.
>
> > 2.
> > If we add tests which is usable only for debug build, it must be run only when it
> > is enabled. IIUC such test does not exist yet.
>
> I think we need to test cases not to check if we reach a specific code
> point but to check if we can get the correct results even if we've
> executed various code paths. As for this bug, it is better to check
> that it works properly in a variety of cases. That way, we can check
> overflow cases and non-overflow cases also in test cases added in the
> future, improving the test coverage more.
>

This makes sense, but we should see whether some existing tests cover
this code path after lowering the limit in the overflow code path. One
minor point to consider is that at the time, the value MT_NRELS_HASH
was used to cover cases in a debug build, but we didn't have the
injection_point framework.

BTW, I noticed that you removed the following comments in your suggestions:
  /*
  * Stores cache invalidation messages distributed by other transactions.
- *
- * It is acceptable to skip invalidations received from concurrent
- * transactions during ReorderBufferForget and ReorderBufferInvalidate,
- * because the transaction being discarded wouldn't have loaded any shared

IIUC, you only mentioned having some comments like this for ease of
understanding, and now you are suggesting to remove those.

--
With Regards,
Amit Kapila.



Dear Sawada-san,

> > 1.
> > USE_ASSERT_CHECKING has not been used to change the value yet. The main
> usage is
> > to call debug functions in debug build.
> 
> I think we have a similar precedent such as MT_NRELS_HASH to improve
> the test coverages.

Oh, good find; that seems to be a typical approach.

> > 2.
> > If we add tests which is usable only for debug build, it must be run only when it
> > is enabled. IIUC such test does not exist yet.
> 
> I think we need to test cases not to check if we reach a specific code
> point but to check if we can get the correct results even if we've
> executed various code paths. As for this bug, it is better to check
> that it works properly in a variety of cases. That way, we can check
> overflow cases and non-overflow cases also in test cases added in the
> future, improving the test coverage more.



Best regards,
Hayato Kuroda
FUJITSU LIMITED


Dear Sawada-san,

Sorry, I mistakenly sent a partial version of the post. Let me continue the reply to point 2.

> > 2.
> > If we add tests which is usable only for debug build, it must be run only when it
> > is enabled. IIUC such test does not exist yet.
> 
> I think we need to test cases not to check if we reach a specific code
> point but to check if we can get the correct results even if we've
> executed various code paths. As for this bug, it is better to check
> that it works properly in a variety of cases. That way, we can check
> overflow cases and non-overflow cases also in test cases added in the
> future, improving the test coverage more.

You meant that 1) we do not have to verify that the overflow path was
reached by checking the actual log output, and 2) it should be covered
by the existing tests.

Based on your advice, I updated the patch set.

0001 contains the changes raised in [1]. I checked them and they looked good.
0002 reduces the limit to an extremely low value. By adding a debug log,
I confirmed that 7 test cases can cause the overflow.

Appendix
========
Added debug log:
```
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -3607,6 +3607,8 @@ ReorderBufferAddDistributedInvalidations(ReorderBuffer *rb, TransactionId xid,
                         */
                        txn->txn_flags |= RBTXN_DISTR_INVAL_OVERFLOWED;
 
+                       elog(LOG, "RBTXN_DISTR_INVAL_OVERFLOWED is set to the transaction");
```

How I counted the test cases:
```
testrun$ grep -rI "RBTXN_DISTR_INVAL_OVERFLOWED" | awk '{print $1 $6}' | sort -u
subscription/100_bugs/log/100_bugs_twoways.log:2025-06-06LOG:
test_decoding/isolation/log/postmaster.log:2025-06-06isolation/catalog_change_snapshot/s1
test_decoding/isolation/log/postmaster.log:2025-06-06isolation/concurrent_ddl_dml/s2
test_decoding/isolation/log/postmaster.log:2025-06-06isolation/concurrent_stream/s1
test_decoding/isolation/log/postmaster.log:2025-06-06isolation/invalidation_distribution/s2
test_decoding/isolation/log/postmaster.log:2025-06-06isolation/oldest_xmin/s0
test_decoding/isolation/log/postmaster.log:2025-06-06isolation/snapshot_transfer/s0
```

[1]: https://www.postgresql.org/message-id/CAD21AoDaCL9X4E8VAe%3DfYa%3DzjqGTKRJW13dTPazqAuOAEEykOg%40mail.gmail.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment
On Thu, Jun 5, 2025 at 8:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jun 5, 2025 at 11:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Jun 4, 2025 at 11:36 PM Hayato Kuroda (Fujitsu)
> > <kuroda.hayato@fujitsu.com> wrote:
> > >
> > > Dear Sawada-san,
> > >
> > > > Alternative idea is to lower the constant value when using an
> > > > assertion build. That way, we don't need to rely on injection points
> > > > being enabled.
> > >
> > > Hmm, possible but I prefer current one. Two concerns:
> > >
> > > 1.
> > > USE_ASSERT_CHECKING has not been used to change the value yet. The main usage is
> > > to call debug functions in debug build.
> >
> > I think we have a similar precedent such as MT_NRELS_HASH to improve
> > the test coverages.
> >
> > > 2.
> > > If we add tests which is usable only for debug build, it must be run only when it
> > > is enabled. IIUC such test does not exist yet.
> >
> > I think we need to test cases not to check if we reach a specific code
> > point but to check if we can get the correct results even if we've
> > executed various code paths. As for this bug, it is better to check
> > that it works properly in a variety of cases. That way, we can check
> > overflow cases and non-overflow cases also in test cases added in the
> > future, improving the test coverage more.
> >
>
> This makes sense, but we should see whether some existing tests cover
> this code path after lowering the limit in the overflow code path. One
> minor point to consider is that at the time, the value MT_NRELS_HASH
> was used to cover cases in a debug build, but we didn't have the
> injection_point framework.

True.

After thinking about it more, perhaps my proposal would not be a good
idea for this case. I think that the cases where we selectively
invalidate caches are more complex and error-prone than the cases where
we invalidate the complete cache. If we invalidated all caches after
decoding each transaction, we wouldn't have had the original data-loss
issue. Having a lower MAX_DISTR_INVAL_MSG_PER_TXN value when using an
assertion build means that we're going to test the cases using the
simpler invalidation mechanism, while production systems, which have a
higher MAX_DISTR_INVAL_MSG_PER_TXN value, would end up executing the
complex cases, which is not great. What do you think?

BTW, as for a new test case, it might be worth having a case I
mentioned before[1]:

1)  S1: CREATE TABLE d (data text not null);
2)  S1: INSERT INTO d VALUES ('d1');
3)      S2: BEGIN; INSERT INTO d VALUES ('d2');
4)          S3: BEGIN; INSERT INTO d VALUES ('d3');
5)  S1: ALTER PUBLICATION pb ADD TABLE d;
6)      S2: INSERT INTO d VALUES ('d4');
7)      S2: COMMIT;
8)          S3: COMMIT;
9)    S2: INSERT INTO d VALUES('d5');
10) S1: INSERT INTO d VALUES ('d6');

With this case, we can test if we need to execute the distributed
invalidations as well in the non-error path in
ReorderBufferProcessTXN().

>
> BTW, I noticed that you removed the following comments in your suggestions:
>   /*
>   * Stores cache invalidation messages distributed by other transactions.
> - *
> - * It is acceptable to skip invalidations received from concurrent
> - * transactions during ReorderBufferForget and ReorderBufferInvalidate,
> - * because the transaction being discarded wouldn't have loaded any shared
>
> IIUC, you only mentioned having some comments like this for ease of
> understanding, and now you are suggesting to remove those.

I forgot to mention the reason. I thought we need either a
comprehensive comment in one place about in which cases we need to
execute both the current transaction's inval messages and the
distributed inval messages and in which cases we need to execute only
the current transaction's inval messages, or to put comments at the
individual places where they are needed. The v11 patch added the
comprehensive comment to the declarations of ninvalidations_distributed
and invalidations_distributed in ReorderBufferTXN, but I'm not sure
that was the right place for such a comment as it goes beyond the
description of these fields. So in my suggestion, I tried to clarify
that we execute only the current transaction's inval messages in
ReorderBufferForget() and ReorderBufferAbort(), as they already seem
to have enough of a comment explaining why we need to execute the
inval messages there.

Regards,

[1] https://www.postgresql.org/message-id/CAD21AoCeM2nni1P7Z6KXzLM%3D6zCdShC82sOvuvu0_hBuJkm9Qw%40mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Thu, Jun 5, 2025 at 8:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jun 6, 2025 at 12:51 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Jun 5, 2025 at 4:07 AM Hayato Kuroda (Fujitsu)
> > <kuroda.hayato@fujitsu.com> wrote:
> > >
> > > Dear Amit,
> > >
> > > > > ---
> > > > > I'd like to make it clear again which case we need to execute
> > > > > txn->invalidations as well as txn->invalidations_distributed (like in
> > > > > ReorderBufferProcessTXN()) and which case we need to execute only
> > > > > txn->invalidations (like in ReorderBufferForget() and
> > > > > ReorderBufferAbort()). I think it might be worth putting some comments
> > > > > about overall strategy somewhere.
> > > > >
> > > > > ---
> > > > > BTW for back branches, a simple fix without ABI breakage would be to
> > > > > introduce the RBTXN_INVAL_OVERFLOWED flag to limit the size of
> > > > > txn->invalidations. That is, we accumulate inval messages both coming
> > > > > from the current transaction and distributed by other transactions but
> > > > > once the size reaches the threshold we invalidate all caches. Is it
> > > > > worth considering for back branches?
> > > > >
> > > >
> > > > It should work and is worth considering. The main concern would be
> > > > that it will hit sooner than we expect in the field, seeing the recent
> > > > reports. So, such a change has the potential to degrade the
> > > > performance. I feel that the number of people impacted due to
> > > > performance would be more than the number of people impacted due to
> > > > such an ABI change (adding the new members at the end of
> > > > ReorderBufferTXN). However, if we think we want to go safe w.r.t
> > > > extensions that can rely on the sizeof ReorderBufferTXN then your
> > > > proposal makes sense.
> > >
> > > While considering the approach, I found a doubtful point. Consider the below
> > > workload:
> > >
> > > 0. S1: CREATE TABLE d(data text not null);
> > > 1. S1: BEGIN;
> > > 2. S1: INSERT INTO d VALUES ('d1')
> > > 3.                                              S2: BEGIN;
> > > 4.                                              S2: INSERT INTO d VALUES ('d2')
> > > 5. S1: ALTER PUBLICATION pb ADD TABLE d;
> > > 6. S1: ... lots of DDLs so overflow happens
> > > 7. S1: COMMIT;
> > > 8.                                              S2: INSERT INTO d VALUES ('d3');
> > > 9.                                              S2: COMMIT;
> > > 10.                                             S2: INSERT INTO d VALUES ('d4');
> > >
> > > In this case, the inval message generated by step 5 is discarded at step 6. No
> > > invalidation messages are distributed in the  SnapBuildDistributeSnapshotAndInval().
> > > While decoding S2, relcache cannot be discarded and tuples d3 and d4 won't be
> > > replicated. Do you think this can happen?
> >
> > I think that once the S1's inval messages got overflowed, we should
> > mark other transactions as overflowed instead of distributing inval
> > messages.
> >
>
> Yeah, this should work, but are you still advocating that we go with
> this approach (marking txn->invalidations also as overflowed) for
> back branches? In the previous email, you seemed to agree about the
> performance impact due to DDLs, so it is not clear which approach you
> prefer.

No, I just wanted to make it clear that this idea is possible. But I
agree with using the approach of having both invalidations_distributed
and ninvalidations_distributed in ReorderBufferTXN in all branches.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Fri, Jun 6, 2025 at 10:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> After thinking about it more, perhaps my proposal would not be a good
> idea for this case. I think that the cases where we selectively
> invalidate caches are more complex and error-prone than the cases where
> we invalidate the complete cache. If we invalidated all caches after
> decoding each transaction, we wouldn't have had the original data-loss
> issue. Having a lower MAX_DISTR_INVAL_MSG_PER_TXN value when using an
> assertion build means that we're going to test the cases using the
> simpler invalidation mechanism, while production systems, which have a
> higher MAX_DISTR_INVAL_MSG_PER_TXN value, would end up executing the
> complex cases, which is not great. What do you think?
>

Your reasoning makes sense to me. The other thing is that it would be
better if we don't add more cases that rely on a debug build for testing.
Going forward, it can become difficult to decide which cases are good
to test only in debug mode.

> BTW, as for a new test case, it might be worth having a case I
> mentioned before[1]:
>
> 1)  S1: CREATE TABLE d (data text not null);
> 2)  S1: INSERT INTO d VALUES ('d1');
> 3)      S2: BEGIN; INSERT INTO d VALUES ('d2');
> 4)          S3: BEGIN; INSERT INTO d VALUES ('d3');
> 5)  S1: ALTER PUBLICATION pb ADD TABLE d;
> 6)      S2: INSERT INTO d VALUES ('d4');
> 7)      S2: COMMIT;
> 8)          S3: COMMIT;
> 9)    S2: INSERT INTO d VALUES('d5');
> 10) S1: INSERT INTO d VALUES ('d6');
>
> With this case, we can test if we need to execute the distributed
> invalidations as well in the non-error path in
> ReorderBufferProcessTXN().
>

+1.

> >
> > BTW, I noticed that you removed the following comments in your suggestions:
> >   /*
> >   * Stores cache invalidation messages distributed by other transactions.
> > - *
> > - * It is acceptable to skip invalidations received from concurrent
> > - * transactions during ReorderBufferForget and ReorderBufferInvalidate,
> > - * because the transaction being discarded wouldn't have loaded any shared
> >
> > IIUC, you only mentioned having some comments like this for ease of
> > understanding, and now you are suggesting to remove those.
>
> I forgot to mention the reason. I thought we need either a
> comprehensive comment in one place about in which cases we need to
> execute both the current transaction's inval messages and the
> distributed inval messages and in which cases we need to execute only
> the current transaction's inval messages, or to put comments at the
> individual places where they are needed. The v11 patch added the
> comprehensive comment to the declarations of ninvalidations_distributed
> and invalidations_distributed in ReorderBufferTXN, but I'm not sure
> that was the right place for such a comment as it goes beyond the
> description of these fields. So in my suggestion, I tried to clarify
> that we execute only the current transaction's inval messages in
> ReorderBufferForget() and ReorderBufferAbort(), as they already seem
> to have enough of a comment explaining why we need to execute the
> inval messages there.
>

Fair enough. I see one more case, namely the call to
ReorderBufferExecuteInvalidations() in ReorderBufferFinishPrepared().
I think we should have already executed the required invalidations at
the end of prepare, so why do we need to execute them in
ReorderBufferFinishPrepared()? It might be kept as a general cleanup
call, and if that is the case, we might want to adjust the comments
there as well. What do you think?

--
With Regards,
Amit Kapila.



On Fri, 6 Jun 2025 at 22:51, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Jun 5, 2025 at 8:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Jun 5, 2025 at 11:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Wed, Jun 4, 2025 at 11:36 PM Hayato Kuroda (Fujitsu)
> > > <kuroda.hayato@fujitsu.com> wrote:
> > > >
> > > > Dear Sawada-san,
> > > >
> > > > > Alternative idea is to lower the constant value when using an
> > > > > assertion build. That way, we don't need to rely on injection points
> > > > > being enabled.
> > > >
> > > > Hmm, possible but I prefer current one. Two concerns:
> > > >
> > > > 1.
> > > > USE_ASSERT_CHECKING has not been used to change the value yet. The main usage is
> > > > to call debug functions in debug build.
> > >
> > > I think we have a similar precedent such as MT_NRELS_HASH to improve
> > > the test coverages.
> > >
> > > > 2.
> > > > If we add tests which is usable only for debug build, it must be run only when it
> > > > is enabled. IIUC such test does not exist yet.
> > >
> > > I think we need to test cases not to check if we reach a specific code
> > > point but to check if we can get the correct results even if we've
> > > executed various code paths. As for this bug, it is better to check
> > > that it works properly in a variety of cases. That way, we can check
> > > overflow cases and non-overflow cases also in test cases added in the
> > > future, improving the test coverage more.
> > >
> >
> > This makes sense, but we should see whether some existing tests cover
> > this code path after lowering the limit in the overflow code path. One
> > minor point to consider is that at the time, the value MT_NRELS_HASH
> > was used to cover cases in a debug build, but we didn't have the
> > injection_point framework.
>
> True.
>
> After thinking about it more, perhaps my proposal would not be a good
> idea for this case. I think that the cases where we selectively
> invalidate caches are more complex and error-prone than the cases where
> we invalidate the complete cache. If we invalidated all caches after
> decoding each transaction, we wouldn't have had the original data-loss
> issue. Having a lower MAX_DISTR_INVAL_MSG_PER_TXN value when using an
> assertion build means that we're going to test the cases using the
> simpler invalidation mechanism, while production systems, which have a
> higher MAX_DISTR_INVAL_MSG_PER_TXN value, would end up executing the
> complex cases, which is not great. What do you think?
>
> BTW, as for a new test case, it might be worth having a case I
> mentioned before[1]:
>
> 1)  S1: CREATE TABLE d (data text not null);
> 2)  S1: INSERT INTO d VALUES ('d1');
> 3)      S2: BEGIN; INSERT INTO d VALUES ('d2');
> 4)          S3: BEGIN; INSERT INTO d VALUES ('d3');
> 5)  S1: ALTER PUBLICATION pb ADD TABLE d;
> 6)      S2: INSERT INTO d VALUES ('d4');
> 7)      S2: COMMIT;
> 8)          S3: COMMIT;
> 9)    S2: INSERT INTO d VALUES('d5');
> 10) S1: INSERT INTO d VALUES ('d6');
>
> With this case, we can test if we need to execute the distributed
> invalidations as well in the non-error path in
> ReorderBufferProcessTXN().

The attached v13 version patch has the change to include this test case.

Regards,
Vignesh

Attachment