Re: [HACKERS] logical decoding of two-phase transactions - Mailing list pgsql-hackers
From | Peter Smith |
---|---|
Subject | Re: [HACKERS] logical decoding of two-phase transactions |
Date | |
Msg-id | CAHut+PuEMk4SO8oGzxc_ftzPkGA8uC-y5qi-KRqHSy_P0i30DA@mail.gmail.com Whole thread Raw |
In response to | Re: [HACKERS] logical decoding of two-phase transactions (Amit Kapila <amit.kapila16@gmail.com>) |
Responses |
Re: [HACKERS] logical decoding of two-phase transactions
|
List | pgsql-hackers |
Hi. Using a tablesync debugging technique as described in another mail thread [1][2] I have caused the tablesync worker to handle (e.g. apply_dispatch) a 2PC PREPARE This exposes a problem with the current 2PC logic because if/when the PREPARE is processed by the tablesync worker then the txn will end up being COMMITTED, even though the 2PC PREPARE has not yet been COMMIT PREPARED by the publisher. For example, below is some logging (using my patch [2]) which shows this occurring: --- [postgres@CentOS7-x64 ~]$ psql -d test_sub -p 54321 -c "CREATE SUBSCRIPTION tap_sub CONNECTION 'host=localhost dbname=test_pub application_name=tap_sub' PUBLICATION tap_pub;" 2020-11-18 17:00:37.394 AEDT [15885] LOG: logical decoding found consistent point at 0/16EF840 2020-11-18 17:00:37.394 AEDT [15885] DETAIL: There are no running transactions. 2020-11-18 17:00:37.394 AEDT [15885] STATEMENT: CREATE_REPLICATION_SLOT "tap_sub" LOGICAL pgoutput NOEXPORT_SNAPSHOT NOTICE: created replication slot "tap_sub" on publisher CREATE SUBSCRIPTION 2020-11-18 17:00:37.407 AEDT [15886] LOG: logical replication apply worker for subscription "tap_sub" has started 2020-11-18 17:00:37.407 AEDT [15886] LOG: !!>> The apply worker process has PID = 15886 2020-11-18 17:00:37.415 AEDT [15887] LOG: starting logical decoding for slot "tap_sub" 2020-11-18 17:00:37.415 AEDT [15887] DETAIL: Streaming transactions committing after 0/16EF878, reading WAL from 0/16EF840. 2020-11-18 17:00:37.415 AEDT [15887] STATEMENT: START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2', publication_names '"tap_pub"') 2020-11-18 17:00:37.415 AEDT [15887] LOG: logical decoding found consistent point at 0/16EF840 2020-11-18 17:00:37.415 AEDT [15887] DETAIL: There are no running transactions. 2020-11-18 17:00:37.415 AEDT [15887] STATEMENT: START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2', publication_names '"tap_pub"') 2020-11-18 17:00:37.415 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:00:37.415 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:00:37.421 AEDT [15889] LOG: logical replication table synchronization worker for subscription "tap_sub", table "test_tab" has started 2020-11-18 17:00:37.421 AEDT [15889] LOG: !!>> The tablesync worker process has PID = 15889 2020-11-18 17:00:37.421 AEDT [15889] LOG: !!>> Sleeping 30 secs. For debugging, attach to process 15889 now! [postgres@CentOS7-x64 ~]$ 2020-11-18 17:00:38.431 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:00:38.431 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:00:39.433 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:00:39.433 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:00:40.437 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:00:40.437 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:00:41.439 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:00:41.439 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:00:42.441 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:00:42.441 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:00:43.442 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:00:43.442 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables -- etc. 2020-11-18 17:01:03.520 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:01:03.520 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:01:04.521 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:01:04.521 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:01:05.523 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:01:05.523 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:01:06.532 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:01:06.532 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:01:07.426 AEDT [15889] LOG: !!>> tablesync worker: About to call LogicalRepSyncTableStart to do initial syncing 2020-11-18 17:01:07.536 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:01:07.536 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:01:07.536 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:01:07.536 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:01:08.539 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:01:08.539 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:01:09.541 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:01:09.541 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables -- etc. 2020-11-18 17:01:23.583 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:01:23.583 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:01:24.584 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:01:24.584 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:01:25.586 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:01:25.586 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:01:26.586 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:01:26.586 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:01:27.454 AEDT [17456] LOG: logical decoding found consistent point at 0/16EF878 2020-11-18 17:01:27.454 AEDT [17456] DETAIL: There are no running transactions. 2020-11-18 17:01:27.454 AEDT [17456] STATEMENT: CREATE_REPLICATION_SLOT "tap_sub_24582_sync_16385" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT 2020-11-18 17:01:27.456 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:01:27.457 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:01:27.465 AEDT [15889] LOG: !!>> tablesync worker: wait for CATCHUP state notification 2020-11-18 17:01:27.465 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:01:27.465 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables #### Here, while the tablesync worker is paused in the debugger I execute the PREPARE txn on publisher psql -d test_pub -c "BEGIN;INSERT INTO test_tab VALUES(1, 'foo');PREPARE TRANSACTION 'test_prepared_tab';" PREPARE TRANSACTION 2020-11-18 17:01:54.732 AEDT [15887] LOG: !!>> pgoutput_begin_txn 2020-11-18 17:01:54.732 AEDT [15887] CONTEXT: slot "tap_sub", output plugin "pgoutput", in the begin callback, associated LSN 0/16EF8B0 2020-11-18 17:01:54.732 AEDT [15887] STATEMENT: START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2', publication_names '"tap_pub"') #### And then in the debugger I let the tablesync worker continue... 2020-11-18 17:02:02.788 AEDT [15889] LOG: !!>> tablesync worker: received CATCHUP state notification 2020-11-18 17:02:07.729 AEDT [15889] LOG: !!>> tablesync worker: Returned from LogicalRepSyncTableStart 2020-11-18 17:02:16.284 AEDT [17456] LOG: starting logical decoding for slot "tap_sub_24582_sync_16385" 2020-11-18 17:02:16.284 AEDT [17456] DETAIL: Streaming transactions committing after 0/16EF8B0, reading WAL from 0/16EF878. 2020-11-18 17:02:16.284 AEDT [17456] STATEMENT: START_REPLICATION SLOT "tap_sub_24582_sync_16385" LOGICAL 0/16EF8B0 (proto_version '2', publication_names '"tap_pub"') 2020-11-18 17:02:16.284 AEDT [17456] LOG: logical decoding found consistent point at 0/16EF878 2020-11-18 17:02:16.284 AEDT [17456] DETAIL: There are no running transactions. 2020-11-18 17:02:16.284 AEDT [17456] STATEMENT: START_REPLICATION SLOT "tap_sub_24582_sync_16385" LOGICAL 0/16EF8B0 (proto_version '2', publication_names '"tap_pub"') 2020-11-18 17:02:16.284 AEDT [17456] LOG: !!>> pgoutput_begin_txn 2020-11-18 17:02:16.284 AEDT [17456] CONTEXT: slot "tap_sub_24582_sync_16385", output plugin "pgoutput", in the begin callback, associated LSN 0/16EF8B0 2020-11-18 17:02:16.284 AEDT [17456] STATEMENT: START_REPLICATION SLOT "tap_sub_24582_sync_16385" LOGICAL 0/16EF8B0 (proto_version '2', publication_names '"tap_pub"') 2020-11-18 17:02:40.346 AEDT [15889] LOG: !!>> tablesync worker: LogicalRepApplyLoop #### The tablesync worker processes the replication messages.... 2020-11-18 17:02:47.992 AEDT [15889] LOG: !!>> tablesync worker: apply_dispatch for message kind 'B' 2020-11-18 17:02:54.858 AEDT [15889] LOG: !!>> tablesync worker: apply_dispatch for message kind 'R' 2020-11-18 17:02:56.082 AEDT [15889] LOG: !!>> tablesync worker: apply_dispatch for message kind 'I' 2020-11-18 17:02:56.082 AEDT [15889] LOG: !!>> tablesync worker: should_apply_changes_for_rel: true 2020-11-18 17:02:57.354 AEDT [15889] LOG: !!>> tablesync worker: apply_dispatch for message kind 'P' 2020-11-18 17:02:57.354 AEDT [15889] LOG: !!>> tablesync worker: called process_syncing_tables 2020-11-18 17:02:59.011 AEDT [15889] LOG: logical replication table synchronization worker for subscription "tap_sub", table "test_tab" has finished #### SInce the tablesync was "ahead", the apply worker now needs to skip those same messages #### Notice should_apply_changes_for_rel() is false #### Then apply worker just waits for next messages.... 2020-11-18 17:02:59.064 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:02:59.064 AEDT [15886] LOG: !!>> apply worker: apply_dispatch for message kind 'B' 2020-11-18 17:02:59.064 AEDT [15886] LOG: !!>> apply worker: apply_dispatch for message kind 'R' 2020-11-18 17:02:59.064 AEDT [15886] LOG: !!>> apply worker: apply_dispatch for message kind 'I' 2020-11-18 17:02:59.065 AEDT [15886] LOG: !!>> apply worker: should_apply_changes_for_rel: false 2020-11-18 17:02:59.065 AEDT [15886] LOG: !!>> apply worker: apply_dispatch for message kind 'P' 2020-11-18 17:02:59.067 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:02:59.067 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:03:00.071 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:03:00.071 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:03:01.073 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:03:01.073 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:03:02.075 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:03:02.075 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:03:03.080 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:03:03.080 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:03:04.081 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:03:04.082 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:03:05.103 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:03:05.103 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables etc ... #### At this point there is a problem because the tablesync worker has COMMITTED that PREPARED INSERT. #### See the subscriber node has ONE record but the publisher node has NONE! [postgres@CentOS7-x64 ~]$ psql -d test_pub -c "SELECT count(*) FROM test_tab;" count ------- 0 (1 row) [postgres@CentOS7-x64 ~]$ [postgres@CentOS7-x64 ~]$ psql -d test_sub -p 54321 -c "SELECT count(*) FROM test_tab;" count ------- 1 (1 row) [postgres@CentOS7-x64 ~]$ ----- [1] - https://www.postgresql.org/message-id/CAHut%2BPsprtsa4o89wtNnKLxxwXeDKAX9nNsdghT1Pv63siz%2BAA%40mail.gmail.com [2] - https://www.postgresql.org/message-id/CAHut%2BPt4PyKQCwqzQ%3DEFF%3DbpKKJD7XKt_S23F6L20ayQNxg77A%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
pgsql-hackers by date: