Re: BUG #18267: Logical replication bug: data is not synchronized after Alter Publication. - Mailing list pgsql-bugs
From: Dilip Kumar
Subject: Re: BUG #18267: Logical replication bug: data is not synchronized after Alter Publication.
Date:
Msg-id: CAFiTN-vsdWgthGJFOG74E94LAi5E5DmP0Ag616V62hftHq6Ldw@mail.gmail.com
In response to: RE: BUG #18267: Logical replication bug: data is not synchronized after Alter Publication. ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>)
Responses: RE: BUG #18267: Logical replication bug: data is not synchronized after Alter Publication.
List: pgsql-bugs
On Fri, Jan 5, 2024 at 9:25 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Song,
>
> > Hi hackers, I found that when inserting plenty of data into a table and
> > adding the table to a publication (through ALTER PUBLICATION) meanwhile,
> > it's likely that the incremental data cannot be synchronized to the
> > subscriber. Here is my test method:
>
> Good catch.
>
> > 1. On publisher and subscriber, create a table for the test:
> >    CREATE TABLE tab_1 (a int);
> >
> > 2. Set up logical replication:
> >    on publisher:
> >    SELECT pg_create_logical_replication_slot('slot1', 'pgoutput', false, false);
> >    CREATE PUBLICATION tap_pub;
> >    on subscriber:
> >    CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION
> >    tap_pub WITH (enabled = true, create_slot = false, slot_name = 'slot1');
> >
> > 3. Perform inserts:
> >    for (my $i = 1; $i <= 1000; $i++) {
> >        $node_publisher->safe_psql('postgres',
> >            "INSERT INTO tab_1 SELECT generate_series(1, 1000)");
> >    }
> >    Each transaction contains 1000 insertions, and there are 1000
> >    transactions in total.
> >
> > 4. While performing step 3, add table tab_1 to the publication:
> >    ALTER PUBLICATION tap_pub ADD TABLE tab_1;
> >    ALTER SUBSCRIPTION tap_sub REFRESH PUBLICATION;
>
> I could reproduce the failure. PSA the script.
>
> In the script, ALTER PUBLICATION was executed while the initial data sync
> was in progress. (The workload is almost the same as what you reported,
> but the number of rows is reduced.)
>
> In total, 40000 tuples are inserted on the publisher. However, after some
> time, only 25000 tuples are replicated.
>
> ```
> publisher=# SELECT count(*) FROM tab_1;
>  count
> -------
>  40000
> (1 row)
>
> subscriber=# SELECT count(*) FROM tab_1;
>  count
> -------
>  25000
> (1 row)
> ```
>
> Is it the same failure you saw?

With your attached script I was able to see this gap. I didn't dig deeper,
but from an initial investigation I could see that even after ALTER
PUBLICATION, pgoutput_change continues to see
'relentry->pubactions.pubinsert' as false, even after re-fetching the
relation entry after the invalidation. That shows the invalidation
framework might be working fine, but we are using an older snapshot to
fetch the entry. I did not debug further why it is not getting the updated
snapshot that can see the change in the publication, because I assume
Yutao Song has already analyzed that, as per his first email, so I will
wait for his patch.
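To make that concrete, the check being hit is roughly the following (a
simplified paraphrase of pgoutput.c, not the exact source):

```c
/*
 * Simplified paraphrase of pgoutput_change() in
 * src/backend/replication/pgoutput/pgoutput.c.  On every decoded change,
 * pgoutput looks up the cached RelationSyncEntry for the relation and
 * bails out early when the action is not published for it.
 */
static void
pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
                Relation relation, ReorderBufferChange *change)
{
    PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
    RelationSyncEntry *relentry;

    /*
     * After an invalidation the entry is marked invalid and rebuilt here
     * by querying the publication catalogs -- but with the *historic*
     * snapshot of the transaction being decoded.  If that snapshot
     * predates the ALTER PUBLICATION ... ADD TABLE commit, the rebuilt
     * entry still says "not published".
     */
    relentry = get_rel_sync_entry(data, relation);

    switch (change->action)
    {
        case REORDER_BUFFER_CHANGE_INSERT:
            if (!relentry->pubactions.pubinsert)
                return;         /* the early exit we keep hitting */
            break;
        /* ... UPDATE and DELETE are filtered the same way ... */
        default:
            break;
    }

    /* ... otherwise the change is sent to the subscriber ... */
}
```

So the invalidation does force a rebuild of the entry; the rebuild just
answers the membership question with a snapshot that cannot yet see the
new pg_publication_rel row.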
> > The root cause of the problem is as follows:
> > pgoutput relies on the invalidation mechanism to validate publications.
> > When the walsender decodes an ALTER PUBLICATION transaction, catalog
> > caches are invalidated at once. Furthermore, since pg_publication_rel is
> > modified, snapshot changes are added to all transactions currently being
> > decoded. For those other transactions, the catalog caches have been
> > invalidated, but it is likely that the snapshot changes have not yet
> > been decoded. In the pgoutput implementation, these transactions query
> > the system table pg_publication_rel to determine whether to publish the
> > changes they made. In this case, the catalog tuples are not found because
> > the snapshot has not been updated. As a result, the changes in those
> > transactions are considered not to be published, and subsequent data
> > cannot be synchronized.
> >
> > I think it's necessary to add invalidations to other transactions after
> > adding a snapshot change to them. Therefore, I submitted a patch for
> > this bug.
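If I read that proposal correctly, the change would sit next to the
existing logic in snapbuild.c that already hands a new catalog snapshot to
every transaction being decoded. A sketch of the idea follows; it is not
the actual patch, and the invalidation-message arguments are assumptions
for illustration:

```c
/*
 * Sketch of the proposed direction, modeled on
 * SnapBuildDistributeNewCatalogSnapshot() in
 * src/backend/replication/logical/snapbuild.c.  The ninvalidations/msgs
 * parameters are assumed here for illustration; the actual patch may
 * look different.
 */
static void
SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn,
                                      Size ninvalidations,
                                      SharedInvalidationMessage *msgs)
{
    dlist_iter  txn_i;
    ReorderBufferTXN *txn;

    /* Walk all top-level transactions currently being decoded. */
    dlist_foreach(txn_i, &builder->reorder->toplevel_by_lsn)
    {
        txn = dlist_container(ReorderBufferTXN, node, txn_i.cur);

        /* Subtransactions share their top-level transaction's state. */
        if (rbtxn_is_known_subxact(txn))
            continue;

        /* Existing behavior: queue the new catalog snapshot. */
        SnapBuildSnapIncRefcount(builder->snapshot);
        ReorderBufferAddSnapshot(builder->reorder, txn->xid, lsn,
                                 builder->snapshot);

        /*
         * Proposed addition: also queue the invalidation messages at the
         * same LSN, so that when the transaction replays past this point
         * it rebuilds its relation-sync entries with a snapshot that can
         * actually see the pg_publication_rel change.
         */
        ReorderBufferAddInvalidations(builder->reorder, txn->xid, lsn,
                                      ninvalidations, msgs);
    }
}
```

With the invalidations queued alongside the snapshot change, a transaction
would re-validate its relation-sync entries only once it can also see the
catalog rows that made them stale.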
> I cannot see your attachment, but I found that the proposed patch in [1]
> can solve the issue. After applying 0001 + 0002 + 0003 (open relations
> with ShareRowExclusiveLock in OpenTableList), the data gap was removed.
> Thoughts?

I'm not sure why 'open relations with ShareRowExclusiveLock' would help in
this case. Have you investigated that?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com