From: Amit Kapila
Subject: Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION
Msg-id: CAA4eK1LHD3ZHqyK2GcSw=fqCPr26yP2GL9niQbY8ems68D1Z9A@mail.gmail.com
In response to: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List: pgsql-hackers
On Tue, Nov 21, 2023 at 5:17 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> I decided to do some stress-testing of the built-in logical replication,
> as part of the sequence decoding work. And I soon ran into an undetected
> deadlock related to ALTER SUBSCRIPTION ... REFRESH PUBLICATION :-(
>
> The attached bash script triggers that in a couple of seconds for me. The
> script looks complicated, but most of the code is waiting for sync to
> complete, catchup, and that sort of thing.
>
> What the script does is pretty simple:
>
> 1) initialize two clusters, set them up as a publisher/subscriber pair
>
> 2) create some number of tables, add them to the publication and wait for
>    the sync to complete
>
> 3) start two pgbench runs in the background, modifying the publication
>    (one removes+adds all tables in a single transaction, one does that
>    with a transaction per table)
>
> 4) run refresh.sh which does ALTER SUBSCRIPTION ... REFRESH PUBLICATION
>    in a loop (now that I think about it, could be another pgbench
>    script, but well ...)
>
> 5) some consistency checks, but the lockup happens earlier so this does
>    not really matter
>
> After a small number of refresh cycles (for me it's usually a couple
> dozen), we end up with a couple of stuck locks (I shortened the backend
> type string a bit, for formatting reasons):
>
>   test=# select a.pid, classid, objid, backend_type, query
>            from pg_locks l join pg_stat_activity a on (a.pid = l.pid)
>           where not granted;
>
>      pid   | classid | objid |   backend_type   |        query
>   ---------+---------+-------+------------------+----------------------
>    2691941 |    6100 | 16785 | client backend   | ALTER SUBSCRIPTION s
>                                                    REFRESH PUBLICATION
>    2691837 |    6100 | 16785 | tablesync worker |
>    2691936 |    6100 | 16785 | tablesync worker |
>   (3 rows)
>
> All these backends wait for 6100/16785, which is the subscription row in
> pg_subscription. The tablesync workers are requesting AccessShareLock,
> the client backend however asks for AccessExclusiveLock.
>
> The entry is currently locked by:
>
>   test=# select a.pid, mode, backend_type from pg_locks l
>            join pg_stat_activity a on (a.pid = l.pid)
>           where classid=6100 and objid=16785 and granted;
>
>      pid   |      mode       |           backend_type
>   ---------+-----------------+----------------------------------
>    2690477 | AccessShareLock | logical replication apply worker
>   (1 row)
>
> But the apply worker is not waiting for any locks, so what's going on?
>
> Well, the problem is that the apply worker is waiting for notification
> from the tablesync workers that the relation is synced, which happens by
> updating the pg_subscription_rel row. And that wait happens in
> wait_for_relation_state_change, which simply checks the row in a loop,
> with a sleep by WaitLatch().
>
> Unfortunately, the tablesync workers can't update the row because the
> client backend executing ALTER SUBSCRIPTION ... REFRESH PUBLICATION
> sneaked in, and waits for an AccessExclusiveLock. So the tablesync
> workers are stuck in the queue and can't proceed.
>
> The client backend can't proceed, because it's waiting for a lock held
> by the apply worker.
>

It seems there is some inconsistency in what you have written for the
client backend/tablesync workers vs. the apply worker. The above text
seems to be saying that the client backend and tablesync workers are
waiting on the "subscription row in pg_subscription" while the apply
worker is operating on "pg_subscription_rel". So, if that is true, then
they shouldn't get stuck.
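To make the catalog explicit instead of eyeballing raw classid values, a
query along these lines (an untested sketch in the spirit of your
pg_locks queries above, resolving classid through pg_class for object
locks) should show whether the ungranted requests are queued on
pg_subscription or on pg_subscription_rel:

  -- which catalog does each blocked lock request actually target?
  select a.pid, c.relname as catalog, l.objid, l.mode, a.backend_type
    from pg_locks l
    join pg_class c on (c.oid = l.classid)
    join pg_stat_activity a on (a.pid = l.pid)
   where l.locktype = 'object' and not l.granted;

That should make it unambiguous which of the two catalogs the waiters
are stuck behind.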
I think here the client backend and tablesync workers seem to be blocked
on a lock on pg_subscription_rel.

> The tablesync workers can't proceed because their lock request is stuck
> behind the AccessExclusiveLock request.
>
> And the apply worker can't proceed, because it's waiting for a status
> update from the tablesync workers.
>

This part is not clear to me because
wait_for_relation_state_change()->GetSubscriptionRelState() seems to be
releasing the lock while closing the relation. Am I missing something?

--
With Regards,
Amit Kapila.