From: Amit Kapila
Subject: Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION
Msg-id: CAA4eK1LHD3ZHqyK2GcSw=fqCPr26yP2GL9niQbY8ems68D1Z9A@mail.gmail.com
In response to: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List: pgsql-hackers
On Tue, Nov 21, 2023 at 5:17 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> I decided to do some stress-testing of the built-in logical replication,
> as part of the sequence decoding work. And I soon ran into an undetected
> deadlock related to ALTER SUBSCRIPTION ... REFRESH PUBLICATION :-(
>
> The attached bash script triggers that in a couple of seconds for me. The
> script looks complicated, but most of the code is waiting for sync to
> complete, catchup, and that sort of thing.
>
> What the script does is pretty simple:
>
> 1) initialize two clusters, set them up as a publisher/subscriber pair
>
> 2) create some number of tables, add them to the publication and wait for
>    the sync to complete
>
> 3) start two pgbench runs in the background, modifying the publication
>    (one removes+adds all tables in a single transaction, one does that
>    with a transaction per table)
>
> 4) run refresh.sh which does ALTER SUBSCRIPTION ... REFRESH PUBLICATION
>    in a loop (now that I think about it, could be another pgbench
>    script, but well ...)
>
> 5) some consistency checks, but the lockup happens earlier so this does
>    not really matter
>
> After a small number of refresh cycles (for me it's usually a couple
> dozen), we end up with a couple of stuck locks (I shortened the backend
> type string a bit, for formatting reasons):
>
>   test=# select a.pid, classid, objid, backend_type, query
>            from pg_locks l join pg_stat_activity a on (a.pid = l.pid)
>           where not granted;
>
>      pid   | classid | objid |   backend_type   |        query
>   ---------+---------+-------+------------------+----------------------
>    2691941 |    6100 | 16785 | client backend   | ALTER SUBSCRIPTION s
>                                                    REFRESH PUBLICATION
>    2691837 |    6100 | 16785 | tablesync worker |
>    2691936 |    6100 | 16785 | tablesync worker |
>   (3 rows)
>
> All these backends wait for 6100/16785, which is the subscription row in
> pg_subscription. The tablesync workers are requesting AccessShareLock,
> the client backend however asks for AccessExclusiveLock.
>
> The entry is currently locked by:
>
>   test=# select a.pid, mode, backend_type from pg_locks l
>            join pg_stat_activity a on (a.pid = l.pid)
>           where classid=6100 and objid=16785 and granted;
>
>      pid   |      mode       |           backend_type
>   ---------+-----------------+----------------------------------
>    2690477 | AccessShareLock | logical replication apply worker
>   (1 row)
>
> But the apply worker is not waiting for any locks, so what's going on?
>
> Well, the problem is that the apply worker is waiting for notification
> from the tablesync workers that the relation is synced, which happens by
> updating the pg_subscription_rel row. And that wait happens in
> wait_for_relation_state_change, which simply checks the row in a loop,
> with a sleep by WaitLatch().
>
> Unfortunately, the tablesync workers can't update the row because the
> client backend executing ALTER SUBSCRIPTION ... REFRESH PUBLICATION
> sneaked in, and waits for an AccessExclusiveLock. So the tablesync
> workers are stuck in the queue and can't proceed.
>
> The client backend can't proceed, because it's waiting for a lock held
> by the apply worker.
>

It seems there is some inconsistency in what you have written for the
client backend/tablesync workers vs. the apply worker. The above text
seems to be saying that the client backend and tablesync workers are
waiting on the "subscription row in pg_subscription" while the apply
worker is operating on "pg_subscription_rel". So, if that is true, then
they shouldn't get stuck.
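To make the catalog explicit instead of eyeballing raw classid values, a
query along these lines (an untested sketch in the spirit of your
pg_locks queries above, resolving classid through pg_class for object
locks) should show whether the ungranted requests are queued on
pg_subscription or on pg_subscription_rel:

  -- which catalog does each blocked lock request actually target?
  select a.pid, c.relname as catalog, l.objid, l.mode, a.backend_type
    from pg_locks l
    join pg_class c on (c.oid = l.classid)
    join pg_stat_activity a on (a.pid = l.pid)
   where l.locktype = 'object' and not l.granted;

That should make it unambiguous which of the two catalogs the waiters
are stuck behind.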
I think here the client backend and tablesync workers seem to be blocked
on a lock on pg_subscription_rel.

> The tablesync workers can't proceed because their lock request is stuck
> behind the AccessExclusiveLock request.
>
> And the apply worker can't proceed, because it's waiting for a status
> update from the tablesync workers.
>

This part is not clear to me because
wait_for_relation_state_change()->GetSubscriptionRelState() seems to be
releasing the lock while closing the relation. Am I missing something?

--
With Regards,
Amit Kapila.