Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION - Mailing list pgsql-hackers
From | Michail Nikolaev |
---|---|
Subject | Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION |
Date | |
Msg-id | CANtu0ohgHM_7+M+3i2kXkZuLCsN_Dbio2pFfphrH+oOfzLfUhw@mail.gmail.com Whole thread Raw |
Responses |
Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION
|
List | pgsql-hackers |
Hello. Just a small story about small data-loss on logical replication. We were logically replicating a 4 TB database from > PostgreSQL 12.12 (Ubuntu 12.12-201-yandex.49163.d86383ed5b) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04)7.5.0, 64-bit to > PostgreSQL 14.5 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit Database includes many tables, but there are A, B and C tables. Tables A, B and are C changed in the same transaction (new tuples created in B and C with corresponding update of A). Table A was added to the PUBLICATION from the start. Once initial sync was done, tables B and C were added to PUBLICATION with REFRESH on subscription (to reduce the WAL collection on the source database). So, we see in logs: > -2022-12-13 13:19:55 UTC-63987bfb.2733-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4",table "A" has started > -2022-12-13 14:41:49 UTC-63987bfb.2733-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4",table "A" has finished > -2022-12-14 08:08:34 UTC-63998482.7d10-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4",table "B" has started > -2022-12-14 10:19:08 UTC-63998482.7d10-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4",table "B" has finished > -2022-12-14 10:37:47 UTC-6399a77b.1fc-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4",table "C" has started > -2022-12-14 10:48:46 UTC-6399a77b.1fc-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4",table "C" has finished Also, we had to reboot subscription server twice. Moreover, we have plenty of messages like: > -2022-12-13 15:53:30 UTC-639872fa.1-LOG: background worker "logical replication worker" (PID 47960) exited with exit code1 > -2022-12-13 21:04:31 UTC-6398e8df.4f7c-LOG: logical replication apply worker for subscription "cloud_production_main_sub_v4"has started > -2022-12-14 10:19:22 UTC-6398e8df.4f7c-ERROR: could not receive data from WAL stream: SSL SYSCALL error: EOF detected Additionally, our HA replica of subscriber was broken and recovered by support… And logs like this: > psql-2022-12-14 10:24:18 UTC-63999d2c.2020-WARNING: user requested cancel while waiting for synchronous replication ack. > The COMMIT record has already flushed to WAL locally and might not have been replicated to the standby. We must wait here. Furthermore, we were adding\removing another table D from publication few times. So, it was a little bit messy story. After all, we got streaming working for the whole database. But after some time we realized we have lost 203 records for table B created from 2022-12-14 09:21:25.705 to 2022-12-14 09:49:20.664 (after synchronization start, but before finish). And the most tricky part here - A, B and C are changed in the same transaction (related tuples). But tables A and C - are fine, only few records from B are lost. We have compared all other tables record to record - only 203 records from B are missing. We have restored the server from backup with point-in-time-recovery (to exclude case with application or human error) - the same results. Furthermore, we have tried different indexes in search (to exclude issue with broken btree) - the same results. So, yes, we understand our replication story was not a classical happy path even close. But the result feels a little bit scary. Currently, I have access to database and logs - so, feel free to ask for additional debugging information if you like. Thanks a lot, Michail.
pgsql-hackers by date: