Thread: BUG #18641: Logical decoding of two-phase commit fails with TOASTed default values
BUG #18641: Logical decoding of two-phase commit fails with TOASTed default values
From
PG Bug reporting form
Date:
The following bug has been logged on the website: Bug reference: 18641 Logged by: Alexander Lakhin Email address: exclusion@gmail.com PostgreSQL version: 17.0 Operating system: Ubuntu 22.04 Description: Logical decoding of a transaction like: BEGIN; INSERT INTO test_tab VALUES(1); PREPARE TRANSACTION 'pt'; where test_tab defined as: CREATE TABLE test_tab(a int primary key, t text DEFAULT 'Some TOASTable value'; for a subscription created with two_phase = on, fails as below: 2024-09-28 06:44:50.708 UTC [3741774:6][client backend][6/2:740] LOG: statement: PREPARE TRANSACTION 'pt'; 2024-09-28 06:44:50.709 UTC [3741774:7][client backend][:0] LOG: disconnection: session time: 0:00:00.006 user=law database=postgres host=[local] 2024-09-28 06:44:50.713 UTC [3741741:17][walsender][25/0:0] ERROR: unexpected table_index_fetch_tuple call during logical decoding 2024-09-28 06:44:50.713 UTC [3741741:18][walsender][25/0:0] BACKTRACE: table_index_fetch_tuple at tableam.h:1253:3 index_fetch_heap at indexam.c:637:10 index_getnext_slot at indexam.c:697:6 systable_getnext_ordered at genam.c:717:5 heap_fetch_toast_slice at heaptoast.c:698:17 table_relation_fetch_toast_slice at tableam.h:1924:1 toast_fetch_datum at detoast.c:379:2 detoast_attr at detoast.c:123:10 pg_detoast_datum_packed at fmgr.c:1867:10 text_to_cstring at varlena.c:220:23 AttrDefaultFetch at relcache.c:4537:17 RelationBuildTupleDesc at relcache.c:697:4 RelationBuildDesc at relcache.c:1188:24 RelationIdGetRelation at relcache.c:2116:7 ReorderBufferProcessTXN at reorderbuffer.c:2246:17 ReorderBufferReplay at reorderbuffer.c:2725:2 ReorderBufferPrepare at reorderbuffer.c:2822:2 DecodePrepare at decode.c:826:2 xact_decode at decode.c:347:5 LogicalDecodingProcessRecord at decode.c:123:1 XLogSendLogical at walsender.c:3413:33 WalSndLoop at walsender.c:2814:4 StartLogicalReplication at walsender.c:1523:2 exec_replication_command at walsender.c:2148:5 PostgresMain at postgres.c:4763:11 BackendInitialize at backend_startup.c:123:1 postmaster_child_launch at launch_backend.c:281:9 BackendStartup at postmaster.c:3593:8 ServerLoop at postmaster.c:1677:10 PostmasterMain at postmaster.c:1372:11 startup_hacks at main.c:217:1 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f30a90d7e40] postgres: node_publisher: walsender law postgres [local] START_REPLICATION(_start+0x25) [0x5647b6023565] 2024-09-28 06:44:50.713 UTC [3741741:19][walsender][25/0:0] LOG: released logical replication slot "test_sub" 2024-09-28 06:44:50.713 UTC [3741741:20][walsender][25/0:0] LOG: could not send data to client: Broken pipe 2024-09-28 06:44:50.713 UTC [3741741:21][walsender][25/0:0] FATAL: connection to client lost The issue can be easily reproduced with 022_twophase_cascade.pl modified like this: @@ -67,0 +68 @@ $node_A->safe_psql('postgres', +my $default = join('', map {chr(65 + rand 26)} (1 .. 10000)); @@ -69 +70 @@ $node_B->safe_psql('postgres', - "CREATE TABLE test_tab (a int primary key, b bytea, c timestamptz DEFAULT now(), d bigint DEFAULT 999)" + "CREATE TABLE test_tab (a int primary key, b bytea, c timestamptz DEFAULT now(), d bigint DEFAULT 999, t text DEFAULT '$default')" @@ -72 +73 @@ $node_C->safe_psql('postgres', - "CREATE TABLE test_tab (a int primary key, b bytea, c timestamptz DEFAULT now(), d bigint DEFAULT 999)" + "CREATE TABLE test_tab (a int primary key, b bytea, c timestamptz DEFAULT now(), d bigint DEFAULT 999, t text DEFAULT '$default')" Reproduced on REL_15_STABLE (starting from a8fd13cab) .. master.
Re: BUG #18641: Logical decoding of two-phase commit fails with TOASTed default values
From
Amit Kapila
Date:
On Mon, Sep 30, 2024 at 6:46 AM Takeshi Ideriha <iderihatakeshi@gmail.com> wrote: > > Patch is attached. > Thanks for the patch. I'll look into it. -- With Regards, Amit Kapila.
Re: BUG #18641: Logical decoding of two-phase commit fails with TOASTed default values
From
Amit Kapila
Date:
On Mon, Sep 30, 2024 at 6:46 AM Takeshi Ideriha <iderihatakeshi@gmail.com> wrote: > > Thank you for reporting the issue. > I was able to reproduce the issue by modifying 022_twophase_cascade.pl > accordingly. > > The scan for toast index is actually done under systable_getnext_ordered, > where HandleConcurrentAbort() is called. So it seems to me that this > scan is actually safe for concurrent abort in logical decoding. > Logic around HandleConcurrentAbort is intorduced > https://github.com/postgres/postgres/commit/7259736a6e5b7c7588fff9578370736a6648acbb. > > Though I may not understand the logic around HandleConcurrentAbort > fully and I am not sure not-setting bsysscan at > systable_beginscan_ordered is intentional, > it seems to me setting and unsetting a bsysscan flag in > systable_beginscan_ordered and systable_endscan_ordered would resolve > the issue. > We forgot to set/unset the flag in functions systable_beginscan_ordered and systable_endscan_ordered. BTW, shouldn't this occur even without prepare transaction? If so, we need to backpatch this till 14. Also, it is better to have a test for this, and let's ensure that the new test doesn't increase the regression time too much if possible. One minor point: + + /* + * If CheckXidAlive is set then set a flag to indicate that system table The indentation in the first comment line seems off. -- With Regards, Amit Kapila.
Re: BUG #18641: Logical decoding of two-phase commit fails with TOASTed default values
From
Amit Kapila
Date:
On Tue, Oct 1, 2024 at 7:07 PM Takeshi Ideriha <iderihatakeshi@gmail.com> wrote: > Sending it again to avoid getting it stuck because the original was replied to both -hackers and -bugs. > > >We forgot to set/unset the flag in functions > >systable_beginscan_ordered and systable_endscan_ordered. BTW, > > Thank you for the clarification. > > >shouldn't this occur even without prepare transaction? If so, we need > >to backpatch this till 14. > > Yes, it occurred also at PG14. > I did some tests using 015_stream.pl with some modification like > below, which tests the subscription about stream is on but two-phase > is off. > The same issue occurred at both current head source code and PG14. > > ``` > --- a/src/test/subscription/t/015_stream.pl > +++ b/src/test/subscription/t/015_stream.pl > @@ -134,9 +134,11 @@ my $node_subscriber = > PostgreSQL::Test::Cluster->new('subscriber'); > $node_subscriber->init; > $node_subscriber->start; > > +my $default = join('', map {chr(65 + rand 26)} (1 .. 10000)); > + > # Create some preexisting content on publisher > $node_publisher->safe_psql('postgres', > - "CREATE TABLE test_tab (a int primary key, b bytea)"); > + "CREATE TABLE test_tab (a int primary key, b bytea, t text > DEFAULT '$default')"); > $node_publisher->safe_psql('postgres', > "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')"); > > @@ -144,7 +146,7 @@ $node_publisher->safe_psql('postgres', "CREATE > TABLE test_tab_2 (a int)"); > > # Setup structure on subscriber > $node_subscriber->safe_psql('postgres', > - "CREATE TABLE test_tab (a int primary key, b bytea, c > timestamptz DEFAULT now(), d bigint DEFAULT 999)" > + "CREATE TABLE test_tab (a int primary key, b bytea, t text > DEFAULT '$default', c timestamptz DEFAULT now(), d bigint DEFAULT > 999)" > ); > ``` > > Logs from head source: > ``` > 2024-10-01 12:34:56.694 UTC [575202] LOG: starting PostgreSQL 18devel > on x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.4.1 20230605 (Red Hat > 11.4.1-2), 64-bit > ... > 2024-10-01 12:34:57.506 UTC [575258] 015_stream.pl LOG: statement: BEGIN; > 2024-10-01 12:34:57.506 UTC [575258] 015_stream.pl LOG: statement: > INSERT INTO test_tab SELECT i, sha256(i::text::bytea) FROM > generate_series(3, 5000) s(i); > 2024-10-01 12:34:57.524 UTC [575242] tap_sub ERROR: unexpected > table_index_fetch_tuple call during logical decoding > 2024-10-01 12:34:57.524 UTC [575242] tap_sub STATEMENT: > START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '4', > streaming 'on', origin 'any', publication_names '"tap_pub"') > 2024-10-01 12:34:57.525 UTC [575242] tap_sub LOG: released logical > replication slot "tap_sub" > 2024-10-01 12:34:57.525 UTC [575242] tap_sub LOG: could not send data > to client: Broken pipe > 2024-10-01 12:34:57.525 UTC [575242] tap_sub FATAL: connection to client lost > ... > 2024-10-01 12:34:57.829 UTC [575260] tap_sub STATEMENT: > START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '4', > streaming 'on', origin 'any', publication_names '"tap_pub"') > 2024-10-01 12:34:57.829 UTC [575260] tap_sub LOG: starting logical > decoding for slot "tap_sub" > 2024-10-01 12:34:57.829 UTC [575260] tap_sub DETAIL: Streaming > transactions committing after 0/1583A68, reading WAL from 0/1583A30. > 2024-10-01 12:34:57.829 UTC [575260] tap_sub STATEMENT: > START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '4', > streaming 'on', origin 'any', publication_names '"tap_pub"') > 2024-10-01 12:34:57.829 UTC [575260] tap_sub LOG: logical decoding > found consistent point at 0/1583A30 > 2024-10-01 12:34:57.829 UTC [575260] tap_sub DETAIL: There are no > running transactions. > 2024-10-01 12:34:57.829 UTC [575260] tap_sub STATEMENT: > START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '4', > streaming 'on', origin 'any', publication_names '"tap_pub"') > 2024-10-01 12:34:57.831 UTC [575260] tap_sub ERROR: unexpected > table_index_fetch_tuple call during logical decoding > 2024-10-01 12:34:57.831 UTC [575260] tap_sub STATEMENT: > START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '4', > streaming 'on', origin 'any', publication_names '"tap_pub"') > 2024-10-01 12:34:57.832 UTC [575260] tap_sub LOG: released logical > replication slot "tap_sub" > 2024-10-01 12:34:57.832 UTC [575260] tap_sub LOG: could not send data > to client: Broken pipe > 2024-10-01 12:34:57.832 UTC [575260] tap_sub FATAL: connection to client lost > ``` > > Logs from PG14 source: > > ``` > 2024-10-01 13:20:29.409 UTC [593696] LOG: starting PostgreSQL 14.9 on > x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.4.1 20230605 (Red Hat > 11.4.1-2), 64-bit > ... > 2024-10-01 13:20:31.285 UTC [593750] 015_stream.pl LOG: statement: BEGIN; > 2024-10-01 13:20:31.285 UTC [593750] 015_stream.pl LOG: statement: > INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, > 5000) s(i); > 2024-10-01 13:20:31.301 UTC [593740] tap_sub ERROR: unexpected > table_index_fetch_tuple call during logical decoding > 2024-10-01 13:20:31.301 UTC [593740] tap_sub STATEMENT: > START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2', > streaming 'on', publication_names '"tap_pub"') > 2024-10-01 13:20:31.302 UTC [593740] tap_sub LOG: could not send data > to client: Broken pipe > 2024-10-01 13:20:31.302 UTC [593740] tap_sub FATAL: connection to client lost > ... > 2024-10-01 13:20:31.830 UTC [593756] tap_sub STATEMENT: > START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2', > streaming 'on', publication_names '"tap_pub"') > 2024-10-01 13:20:31.830 UTC [593756] tap_sub LOG: logical decoding > found consistent point at 0/172C758 > 2024-10-01 13:20:31.830 UTC [593756] tap_sub DETAIL: There are no > running transactions. > 2024-10-01 13:20:31.830 UTC [593756] tap_sub STATEMENT: > START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2', > streaming 'on', publication_names '"tap_pub"') > 2024-10-01 13:20:31.834 UTC [593756] tap_sub ERROR: unexpected > table_index_fetch_tuple call during logical decoding > 2024-10-01 13:20:31.834 UTC [593756] tap_sub STATEMENT: > START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2', > streaming 'on', publication_names '"tap_pub"') > 2024-10-01 13:20:31.835 UTC [593756] tap_sub LOG: could not send data > to client: Broken pipe > 2024-10-01 13:20:31.835 UTC [593756] tap_sub FATAL: connection to client lost > ``` > > >Also, it is better to have a test for this, and let's ensure that the > >new test doesn't increase the regression time too much if possible. > > Sure. I'm going to add some test codes in a few days. > The above test shown in the email will work to test this issue. However, we should use 'debug_logical_replication_streaming' as that will help to reproduce the issue with just one write operation. Is it possible to add the test in 018_stream_subxact_abort where we already use it? -- With Regards, Amit Kapila.