BUG #18155: Logical Apply Worker Timeout During TableSync Causes Either Stuckness or Data Loss - Mailing list pgsql-bugs
From | PG Bug reporting form |
---|---|
Subject | BUG #18155: Logical Apply Worker Timeout During TableSync Causes Either Stuckness or Data Loss |
Date | |
Msg-id | 18155-33f555bcc7e1105e@postgresql.org |
Responses | Re: BUG #18155: Logical Apply Worker Timeout During TableSync Causes Either Stuckness or Data Loss |
List | pgsql-bugs |
The following bug has been logged on the website:

Bug reference:      18155
Logged by:          Drew Callahan
Email address:      callaan@amazon.com
PostgreSQL version: 16.0
Operating system:   Ubuntu
Description:

Depending on the major version, an untimely timeout or termination of the main apply worker while the table apply worker is waiting for the subscription relstate to change to SUBREL_STATE_CATCHUP can lead to one of two really painful experiences.

If on Rel14+, the untimely exit can lead to the main apply worker becoming indefinitely stuck while it waits for a table apply worker, which has already exited and won't be launched again, to change the subscription relstate to SUBREL_STATE_SYNCDONE. In order to unwedge, a system restart is required to clear the corrupted transient subscription relstate data.

If on Rel13-, the untimely exit can lead to silent data loss. This will occur if the table apply worker performed a copy at LSN X while the main apply worker is now at LSN Y > X; the system then requires the table sync worker to apply, in a catch-up phase, all changes between X & Y that were skipped by the main apply worker. Due to the untimely exit, the table apply worker will assume that the main apply worker was actually behind, skip the catch-up work, and exit. As a result, all data between X & Y will be lost for that table.

The cause of both issues is that wait_for_worker_state_change() is handled like a void function by the table apply worker. However, if the main apply worker does not currently exist on the system, due to some issue such as a timeout triggering it to safely exit, then the function will return false *before* the state change has occurred.

The fix for this issue is relatively simple: we just need to validate that the state actually did change. If it did not, we can circle back later and check again, which is relatively cheap on Rel14+. A potential solution is pasted below. A solution on Rel13- would instead want to introduce a retry loop in wait_for_worker_state_change(), to avoid having to redo the entire COPY, which is very expensive on large tables; a rough sketch of that approach follows the patch.

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index e2cee92cf2..25be978e15 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -1529,10 +1529,30 @@ copy_table_done:
 	SpinLockRelease(&MyLogicalRepWorker->relmutex);
 
 	/*
-	 * Finally, wait until the leader apply worker tells us to catch up and
-	 * then return to let LogicalRepApplyLoop do it.
+	 * Finally, wait until the main apply worker tells us to catch up and then
+	 * return to let LogicalRepApplyLoop do it. If this wait fails due to the
+	 * apply worker not being up, we fail this attempt and force a retry.
 	 */
-	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
+	if (!wait_for_worker_state_change(SUBREL_STATE_CATCHUP))
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_INTERNAL_ERROR),
+				 errmsg("logical replication table synchronization worker for subscription \"%s\", relid \"%u\" "
+						"unable to finish due to main apply worker being unavailable.",
+						MySubscription->name, MyLogicalRepWorker->relid),
+				 errhint("this worker will attempt to sync the table again once logical replication has resumed")));
+	}
+
+	if (MyLogicalRepWorker->relstate != SUBREL_STATE_CATCHUP)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_INTERNAL_ERROR),
+				 errmsg("logical replication table synchronization worker for subscription \"%s\", relid \"%u\" "
+						"unable to make progress while apply worker is down",
+						MySubscription->name, MyLogicalRepWorker->relid),
+				 errhint("this worker will attempt to sync the table again once logical replication has resumed")));
+	}
+
 	return slotname;
 }
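For the Rel13- branches, where failing out of the tablesync worker means redoing the entire COPY, the retry loop mentioned above could look roughly like the following. This is only an untested sketch modelled on the shape of the existing wait_for_worker_state_change() loop in tablesync.c; the function name, the missing_count variable, and the 60 x 1s budget are placeholders, not part of the proposed patch.

/*
 * Sketch only (would live next to wait_for_worker_state_change() in
 * src/backend/replication/logical/tablesync.c): keep waiting for a bounded
 * time when the main apply worker is missing, instead of bailing out
 * immediately, so the launcher has a chance to restart it and we avoid
 * redoing the whole COPY.  The 60 * 1s budget is a placeholder and could
 * just as well be unbounded.
 */
static bool
wait_for_worker_state_change_with_retry(char expected_state)
{
	int			missing_count = 0;

	for (;;)
	{
		LogicalRepWorker *worker;

		CHECK_FOR_INTERRUPTS();

		/* Done once the main apply worker has moved us to the new state. */
		if (MyLogicalRepWorker->relstate == expected_state)
			return true;

		/*
		 * Look up the main apply worker and poke it so it knows we are
		 * waiting.  The stock code breaks out of the loop as soon as the
		 * worker is not found; here we only give up after it has been
		 * missing for ~60 seconds.
		 */
		LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
		worker = logicalrep_worker_find(MyLogicalRepWorker->subid,
										InvalidOid, false);
		if (worker && worker->proc)
			logicalrep_worker_wakeup_ptr(worker);
		LWLockRelease(LogicalRepWorkerLock);

		if (worker == NULL)
		{
			if (++missing_count > 60)
				return false;
		}
		else
			missing_count = 0;

		/* Sleep until the apply worker pokes us back, or 1s at most. */
		(void) WaitLatch(MyLatch,
						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
						 1000L, WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE);
		ResetLatch(MyLatch);
	}
}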