long-standing data loss bug in initial sync of logical replication - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: long-standing data loss bug in initial sync of logical replication
Msg-id: de52b282-1166-1180-45a2-8d8917ca74c6@enterprisedb.com
List: pgsql-hackers
Hi,

It seems there's a long-standing data loss issue related to the initial sync of tables in the built-in logical replication (publications etc.). I can reproduce it fairly reliably, but I haven't figured out all the details yet and I'm a bit out of ideas, so I'm sharing what I know in the hope that someone takes a look and either spots the issue or has some other insight ...

On pgsql-bugs, Depesz reported [1] cases where tables are added to a publication but end up missing rows on the subscriber. I didn't know what the issue might be, but given his experience I decided to make some blind attempts to reproduce it. I'm not going to repeat all the details from the pgsql-bugs thread, but I ended up writing a script that does a randomized stress test of tablesync under concurrent load.

Attached are two scripts: crash-test.sh does the main work, while run.sh drives the test - it executes crash-test.sh in a loop and generates random parameters for it. run.sh generates the number of tables, the refresh interval (after how many tables we refresh the subscription) and how long to sleep between steps (to allow pgbench to do more work). crash-test.sh then does this:

1) initializes two clusters (expects $PATH to have pg_ctl etc.)

2) configures them for logical replication (wal_level, ...)

3) creates a publication and subscription on the nodes

4) creates a bunch of tables

5) starts a pgbench that inserts data into the tables

6) adds the tables to the publication one by one, occasionally refreshing the subscription

7) waits for tablesync of all the tables to complete (so that the tables get into the 'r' state, thus replicating normally)

8) stops the pgbench

9) waits for the subscriber to fully catch up

10) compares the tables on the publisher/subscriber nodes

To run this, just make sure PATH includes pg, and do e.g.

  ./run.sh 10

which does 10 runs of crash-test.sh with random parameters.
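Condensed into SQL, the publication/subscription parts of the test (steps 3, 6 and 7) look roughly like the following sketch. The node, publication, subscription and table names (pub, sub, t1) are placeholders, not the ones the attached scripts actually use:

```sql
-- on the publisher (step 3): start with an empty publication
CREATE PUBLICATION pub;

-- on the subscriber (step 3): subscribe to it
CREATE SUBSCRIPTION sub
  CONNECTION 'host=publisher dbname=postgres'   -- placeholder connection string
  PUBLICATION pub;

-- on the publisher (step 6): add tables one by one, under concurrent pgbench load
ALTER PUBLICATION pub ADD TABLE t1;

-- on the subscriber (step 6): pick up the newly published table
ALTER SUBSCRIPTION sub REFRESH PUBLICATION;

-- on the subscriber (step 7): wait until every table reaches the 'r' (ready) state
SELECT srrelid::regclass, srsubstate FROM pg_subscription_rel;
```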
Each run can take a couple of minutes, depending on the parameters, hardware etc.

Obviously, we expect the tables to match on the two nodes, but the script regularly detects cases where the subscriber is missing some of the rows. The script dumps those tables, and the rows contain timestamps and LSNs to allow a "rough correlation" (imperfect thanks to concurrency).

Depesz reported "gaps" in the data, i.e. a missing chunk of rows with the following rows seemingly replicated. I did see such cases too, but most of the time I see a missing chunk of rows at the end (though maybe if the test continued a bit longer, it'd replicate some rows).

The report talks about replication between pg12->pg14, but I don't think the cross-version part is necessary - I'm able to reproduce the issue on individual versions since 12 (e.g. 12->12; I haven't tried 11, but I'd be surprised if it wasn't affected too).

The rows include pg_current_wal_lsn() to roughly track the LSN where each row was inserted, and the "gap" of missing rows for each table seems to match pg_subscription_rel.srsublsn, i.e. the LSN up to which tablesync copied data, after which the table should be replicated as usual.

Another interesting observation is that the issue only happens for "bulk insert" transactions, i.e.

  BEGIN;
  ... INSERT into all tables ...
  COMMIT;

but not when each insert is a separate transaction. A bit strange.

After quite a bit of debugging, I came to the conclusion this happens because we fail to invalidate caches on the publisher, so it does not realize it should start sending rows for that table. In particular, we initially build the RelationSyncEntry when the table is not yet included in the publication, so we end up with pubinsert=false, thus not replicating the inserts. Which makes sense, but we then seem to fail to invalidate the entry after it's added to the publication.

The other problem is that even if we happen to invalidate the entry, we call GetRelationPublications().
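The correlation with srsublsn described above can be eyeballed with queries like these; this assumes the test tables have an lsn column filled from pg_current_wal_lsn() at insert time (as the script's tables do), and the literal LSN is of course a placeholder for the value read from the catalog:

```sql
-- subscriber: the LSN up to which tablesync copied each table
SELECT srrelid::regclass AS tab, srsubstate, srsublsn
  FROM pg_subscription_rel;

-- publisher: rows at or below that LSN should have been copied by
-- tablesync, rows above it streamed by the apply worker; the missing
-- chunk on the subscriber starts right around srsublsn
SELECT count(*) FROM t1 WHERE lsn <= '0/12345678'::pg_lsn;  -- placeholder LSN
```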
But even if that call happens long after the table gets added to the publication (both in time and LSN terms), it still returns NIL as if the table had no publications. So we end up with pubinsert=false again, skipping the inserts.

Attached are three patches against master. 0001 adds some debug logging that I found useful when investigating the issue. 0002 illustrates the issue by forcefully invalidating the entry for each change and implementing a non-syscache variant of GetRelationPublications(). This makes the code unbearably slow, but with both changes in place I can no longer reproduce the issue. Undoing either of the two changes makes it reproducible again. (I'll talk about 0003 later.)

I suppose timing matters, so it's possible it gets "fixed" simply because of that, but I find that unlikely given the number of runs I did without observing any failure.

Overall, this looks, walks and quacks like a cache invalidation issue, likely a missing invalidation somewhere in the ALTER PUBLICATION code. If we fail to invalidate the pg_publication_rel syscache somewhere, that would obviously explain why GetRelationPublications() returns stale data, but it would also explain why the RelationSyncEntry is not invalidated, as that happens in a syscache callback. But I tried doing various crazy things in the ALTER PUBLICATION code, and none of that worked, so I'm a bit confused/lost.

However, while randomly poking at different things, I realized that if I change the lock obtained on the relation in OpenTableList() from ShareUpdateExclusiveLock to ShareRowExclusiveLock, the issue goes away. I don't know why it works, and I don't even recall what exactly led me to the idea of changing it. This is what 0003 does - it reverts 0002 and changes the lock level.

AFAIK the logical decoding code doesn't actually acquire locks on the decoded tables, so why would this change matter? The only place that does lock the relation is the tablesync, which gets RowExclusiveLock on it.
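The difference between the two lock levels is easy to see from two psql sessions; this is only an illustration of the lock-mode semantics, not of the bug itself, and t1 is a placeholder table:

```sql
-- session 1: simulate the tablesync worker holding its lock
BEGIN;
LOCK TABLE t1 IN ROW EXCLUSIVE MODE;

-- session 2: the level OpenTableList currently takes - it does NOT
-- conflict with RowExclusiveLock, so this succeeds immediately, i.e.
-- ALTER PUBLICATION does not wait for a concurrent tablesync
BEGIN;
LOCK TABLE t1 IN SHARE UPDATE EXCLUSIVE MODE NOWAIT;
ROLLBACK;

-- session 2: the level 0003 switches to - it DOES conflict with
-- RowExclusiveLock, so NOWAIT fails with a "could not obtain lock" error
BEGIN;
LOCK TABLE t1 IN SHARE ROW EXCLUSIVE MODE NOWAIT;
ROLLBACK;
```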
And it's interesting that RowExclusiveLock does not conflict with ShareUpdateExclusiveLock, but does conflict with ShareRowExclusiveLock. But why would this even matter, when the tablesync can only touch the table after it gets added to the publication?

regards

[1] https://www.postgresql.org/message-id/ZTu8GTDajCkZVjMs@depesz.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company