Thread: BUG #18509: Logical decoding behaves badly when processing a change record for a table with altered column
BUG #18509: Logical decoding behaves badly when processing a change record for a table with altered column
From
PG Bug reporting form
Date:
The following bug has been logged on the website: Bug reference: 18509 Logged by: Alexander Lakhin Email address: exclusion@gmail.com PostgreSQL version: 17beta1 Operating system: Ubuntu 22.04 Description: The following script: numjobs=10 for ((j=1;j<=numjobs;j++)); do psql -c "CREATE TABLE t$j(t1 text, i2 int);" done for i in `seq 50`; do echo "iteration $i"; for ((j=1;j<=numjobs;j++)); do cat << EOF | psql >/dev/null & BEGIN; SELECT txid_current(); SELECT pg_sleep(0.01); EOF cat << EOF | psql >/dev/null & ALTER TABLE t$j RENAME COLUMN t1 TO i1; SELECT pg_sleep(0.001); ALTER TABLE t$j ALTER COLUMN i1 TYPE int4 USING (i1::int4); SELECT pg_sleep(0.001); INSERT INTO t$j(i1) VALUES (1); SELECT pg_sleep(0.001); INSERT INTO t$j(i1) VALUES (1); SELECT pg_sleep(0.001); INSERT INTO t$j(i1) VALUES (1); EOF cat << EOF | psql >/dev/null & SELECT 'init' FROM pg_create_logical_replication_slot( 'regression_slot$j', 'test_decoding'); SELECT data FROM pg_logical_slot_get_changes( 'regression_slot$j', NULL,NULL); EOF done wait for ((j=1;j<=numjobs;j++)); do cat << EOF | psql >/dev/null SELECT pg_drop_replication_slot('regression_slot$j'); ALTER TABLE t$j ALTER COLUMN i1 TYPE text USING (i1::text); ALTER TABLE t$j RENAME COLUMN i1 TO t1; EOF done grep 'TRAP' server.log && { break; } done produces errors like: 2024-06-14 17:10:58.349 UTC|law|regression|666c79a2.3fe3a4|ERROR: could not map filenumber "base/16384/16453" to relation OID 2024-06-14 17:10:58.349 UTC|law|regression|666c79a2.3fe3a4|BACKTRACE: ReorderBufferProcessTXN at reorderbuffer.c:2218:7 ReorderBufferReplay at reorderbuffer.c:2701:2 ReorderBufferCommit at reorderbuffer.c:2725:2 DecodeCommit at decode.c:733:3 xact_decode at decode.c:243:5 LogicalDecodingProcessRecord at decode.c:123:1 pg_logical_slot_get_changes_guts at logicalfuncs.c:269:7 pg_logical_slot_get_changes at logicalfuncs.c:334:1 ... 2024-06-14 17:10:58.349 UTC|law|regression|666c79a2.3fe3a4|STATEMENT: SELECT data FROM pg_logical_slot_get_changes( 'regression_slot3', NULL,NULL); and an assertion failure: TRAP: failed Assert("TupleDescAttr(relation->rd_att, i)->attcacheoff == -1"), File: "relcache.c", Line: 673, PID: 4187156 ... #3 0x00007f03c690e476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26 #4 0x00007f03c68f47f3 in __GI_abort () at ./stdlib/abort.c:79 #5 0x000055c3c86d95f1 in ExceptionalCondition ( conditionName=0x55c3c8930198 "TupleDescAttr(relation->rd_att, i)->attcacheoff == -1", fileName=0x55c3c8930108 "relcache.c", lineNumber=673) at assert.c:66 #6 0x000055c3c86c2dd3 in RelationBuildTupleDesc (relation=0x7f03b75dd1a8) at relcache.c:673 #7 0x000055c3c86c3ce6 in RelationBuildDesc (targetRelId=16430, insertIt=true) at relcache.c:1185 #8 0x000055c3c86c5922 in RelationIdGetRelation (relationId=16430) at relcache.c:2116 #9 0x000055c3c8428276 in ReorderBufferProcessTXN (...) at reorderbuffer.c:2222 #10 0x000055c3c84290c5 in ReorderBufferReplay (...) at reorderbuffer.c:2701 #11 0x000055c3c8429147 in ReorderBufferCommit (...) at reorderbuffer.c:2725 #12 0x000055c3c8414350 in DecodeCommit (...) at decode.c:733 #13 0x000055c3c84135a1 in xact_decode (...) at decode.c:242 #14 0x000055c3c84130fb in LogicalDecodingProcessRecord (...) at decode.c:116 #15 0x000055c3c841ce60 in pg_logical_slot_get_changes_guts (...) at logicalfuncs.c:266 #16 0x000055c3c841cfd1 in pg_logical_slot_get_changes (...) at logicalfuncs.c:333 ... Also, with tables created as "CREATE TABLE t$j(t1 text);" it produces: WARNING: problem in alloc set text conversion context: bad block offset for chunk 0x55bdb51ced18 in block 0x55bdb51ce0c8 (and other "problem in alloc" warnings) and TRAP: failed Assert("!HdrMaskIsExternal(chunk->hdrmask) || HdrMaskCheckMagic(chunk->hdrmask)"), File: "../../../../src/include/utils/memutils_memorychunk.h", Line: 210, PID: 4188259 Reproduced on REL_12_STABLE .. master.
Re: BUG #18509: Logical decoding behaves badly when processing a change record for a table with altered column
From
Alexander Lakhin
Date:
14.06.2024 21:00, PG Bug reporting form wrote: > The following script: > ... > > produces errors like: > 2024-06-14 17:10:58.349 UTC|law|regression|666c79a2.3fe3a4|ERROR: could not > map filenumber "base/16384/16453" to relation OID I've simplified that script to the following: psql -c "CREATE TABLE t(t1 text, i2 int);" for ((i=1;i<=100;i++)); do echo "iteration $i"; cat << EOF | psql >/dev/null & BEGIN; SELECT txid_current(); SELECT pg_sleep(0.01); EOF cat << EOF | psql >/dev/null & ALTER TABLE t RENAME COLUMN t1 TO i1; ALTER TABLE t ALTER COLUMN i1 TYPE int4 USING (i1::int4); SELECT pg_sleep(0.001); INSERT INTO t VALUES (1); SELECT pg_sleep(0.001); INSERT INTO t VALUES (1); SELECT pg_sleep(0.001); INSERT INTO t VALUES (1); SELECT pg_sleep(0.001); INSERT INTO t VALUES (1); SELECT pg_sleep(0.001); INSERT INTO t VALUES (1); EOF cat << EOF | psql >/dev/null & SELECT 'init' FROM pg_create_logical_replication_slot( 'regression_slot_$i', 'test_decoding'); SELECT data FROM pg_logical_slot_get_changes( 'regression_slot_$i', NULL, NULL); EOF wait grep 'ERROR' server.log && { break; } cat << EOF | psql >/dev/null SELECT pg_drop_replication_slot('regression_slot_$i'); ALTER TABLE t ALTER COLUMN i1 TYPE text USING (i1::text); ALTER TABLE t RENAME COLUMN i1 TO t1; EOF done It fails deterministically (though not on each run) for me as below: ... iteration 93 ERROR: could not map filenumber "base/16384/17310" to relation OID an excerpt from server.log (generated with log_min_messages = DEBUG3): 2024-07-07 09:15:36.167 UTC [324743:33] psql LOG: starting logical decoding for slot "regression_slot_93" 2024-07-07 09:15:36.167 UTC [324743:34] psql DETAIL: Streaming transactions committing after 0/2006458, reading WAL from 0/1FFB820. 2024-07-07 09:15:36.167 UTC [324743:35] psql STATEMENT: SELECT data FROM pg_logical_slot_get_changes( 'regression_slot_93', NULL, NULL); ... 2024-07-07 09:15:36.169 UTC [324743:47] psql DEBUG: snapshot of 0+0 running transaction ids (lsn 0/20065A8 oldest xid 1666 latest complete 1665 next xid 1666) 2024-07-07 09:15:36.169 UTC [324743:48] psql DEBUG: purged committed transactions from 0 to 0, xmin: 1662, xmax: 1662 2024-07-07 09:15:36.169 UTC [324743:49] psql DEBUG: xmin: 1662, xmax: 1662, oldest running: 1662, oldest xmin: 1662 2024-07-07 09:15:36.169 UTC [324743:50] psql DEBUG: updated xmin: 1 restart: 0 2024-07-07 09:15:36.169 UTC [324743:51] psql DEBUG: xmin required by slots: data 0, catalog 1662 2024-07-07 09:15:36.169 UTC [324743:52] psql LOG: logical decoding found consistent point at 0/2006420 2024-07-07 09:15:36.169 UTC [324743:53] psql DETAIL: There are no running transactions. 2024-07-07 09:15:36.169 UTC [324743:54] psql STATEMENT: SELECT data FROM pg_logical_slot_get_changes( 'regression_slot_93', NULL, NULL); 2024-07-07 09:15:36.169 UTC [324743:55] psql DEBUG: aborting old transaction 1662 2024-07-07 09:15:36.170 UTC [324743:56] psql ERROR: could not map filenumber "base/16384/17310" to relation OID 2024-07-07 09:15:36.170 UTC [324743:57] psql STATEMENT: SELECT data FROM pg_logical_slot_get_changes( 'regression_slot_93', NULL, NULL); The corresponding fragment of WAL: rmgr: Transaction len (rec/tot): 353/ 353, tx: 1662, lsn: 0/02005B38, prev 0/02005B00, desc: INVALIDATION ; inval msgs: catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 relcache 17310 snapshot 2608 rmgr: Transaction len (rec/tot): 1605/ 1605, tx: 1662, lsn: 0/02005CA0, prev 0/02005B38, desc: COMMIT 2024-07-07 09:15:36.150647 UTC; rels: base/16384/17305 base/16384/17308 base/16384/17309; dropped stats: 2/16384/17309 2/16384/17308 2/16384/17310; inval msgs: catcache 7 ... relcache 17310 ... relcache 17310 snapshot 2608 rmgr: Transaction len (rec/tot): 34/ 34, tx: 1661, lsn: 0/02006300, prev 0/02005CA0, desc: ABORT 2024-07-07 09:15:36.157644 UTC rmgr: Standby len (rec/tot): 78/ 78, tx: 0, lsn: 0/02006328, prev 0/02006300, desc: LOCK xid 1662 db 16384 rel 17308 xid 1662 db 16384 rel 17309 xid 1662 db 16384 rel 16385 xid 1662 db 16384 rel 17310 rmgr: Standby len (rec/tot): 54/ 54, tx: 0, lsn: 0/02006378, prev 0/02006328, desc: RUNNING_XACTS nextXid 1663 latestCompletedXid 1661 oldestRunningXid 1662; 1 xacts: 1662 rmgr: Heap len (rec/tot): 59/ 59, tx: 1663, lsn: 0/020063B0, prev 0/02006378, desc: INSERT+INIT off: 1, flags: 0x08, blkref #0: rel 1663/16384/17310 blk 3 rmgr: Transaction len (rec/tot): 46/ 46, tx: 1663, lsn: 0/020063F0, prev 0/020063B0, desc: COMMIT 2024-07-07 09:15:36.162938 UTC (The transaction 1662 performs table rewrite.) What I'm observing with some debug logging added is: ReorderBufferProcessTXN(), called during processing the commit record 0/02006328, gets reloid = 0 from RelidByRelfilenumber() because systable_beginscan() chooses (via GetCatalogSnapshot()) HistoricSnapshot with xmin = 1662, xmax = 1662. That snapshot was set inside SnapBuildProcessChange(), via ReorderBufferSetBaseSnapshot(), from builder->snapshot, which is built inside SnapBuildProcessChange(), via SnapBuildBuildSnapshot(), from builder->xmin (builder->xmax), which is set by SnapBuildProcessRunningXacts from running->oldestRunningXid = 1662. Really, WAL contains RUNNING_XACTS oldestRunningXid 1662 before the record being decoded. So I wonder if the RUNNING_XACT record in WAL is correct or it should be decoded differently? Best regards, Alexander