Re: 024_add_drop_pub.pl might fail due to deadlock - Mailing list pgsql-hackers

From Ajin Cherian
Subject Re: 024_add_drop_pub.pl might fail due to deadlock
Msg-id CAFPTHDYucxiwZ-oVy0CV0Z0iviyy_vDWE=p+=csH66oo+8odDw@mail.gmail.com
In response to 024_add_drop_pub.pl might fail due to deadlock  (Alexander Lakhin <exclusion@gmail.com>)
List pgsql-hackers
On Mon, Jul 7, 2025 at 8:15 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Sun, Jul 6, 2025 at 2:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> >
> > --- a/src/backend/replication/logical/origin.c
> > +++ b/src/backend/replication/logical/origin.c
> > @@ -428,6 +428,7 @@ replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
> >           * the specific origin and then re-check if the origin still exists.
> >           */
> >          rel = table_open(ReplicationOriginRelationId, ExclusiveLock);
> > +pg_usleep(300000);
> >
> > Not reproduced on REL_16_STABLE (since f6c5edb8a), nor in v14- (because
> > 024_add_drop_pub.pl was added in v15).
> >
> > [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=petalura&dt=2025-07-01%2018%3A00%3A58
> >
> > Best regards,
> > Alexander
> >
>
> Hi Alexander,
>
> Yes, the problem can be reproduced by the changes you suggested. I
> will look into what is happening and how we can fix this.

The issue appears to be a deadlock caused by inconsistent lock
acquisition order between two processes:

Process A (executing ALTER SUBSCRIPTION tap_sub DROP PUBLICATION tap_pub_1):
In AlterSubscription_refresh(), it first acquires an
AccessExclusiveLock on SubscriptionRelRelationId (resource 1), then
later tries to acquire an ExclusiveLock on ReplicationOriginRelationId
(resource 2).

Process B (apply worker):
In process_syncing_tables_for_apply(), it first acquires an
ExclusiveLock on ReplicationOriginRelationId (resource 2), then calls
UpdateSubscriptionRelState(), which tries to acquire an AccessShareLock
on SubscriptionRelRelationId (resource 1).

This leads to a deadlock:
Process A holds a lock on resource 1 and waits for resource 2, while
Process B holds a lock on resource 2 and waits for resource 1.
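
To make the cycle concrete, here is a minimal sketch in C of the wait-for
relationship described above. It is a toy model, not PostgreSQL code: the
ProcState struct, the RES_* constants and two_way_deadlock() are all
hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical resource ids for illustration; not PostgreSQL identifiers. */
enum {
    RES_NONE = 0,
    RES_SUBSCRIPTION_REL = 1,   /* stands in for SubscriptionRelRelationId */
    RES_REPLICATION_ORIGIN = 2  /* stands in for ReplicationOriginRelationId */
};

/* Toy snapshot of one process: what it has locked and what it is requesting. */
typedef struct {
    const char *name;
    int holds;      /* resource currently locked, or RES_NONE */
    int waits_for;  /* resource being requested */
} ProcState;

/* Two processes deadlock when each waits for the resource the other holds. */
static bool
two_way_deadlock(const ProcState *a, const ProcState *b)
{
    /* In this model a process never waits for a resource it already holds. */
    assert(a->holds != a->waits_for);
    assert(b->holds != b->waits_for);

    return a->waits_for == b->holds && b->waits_for == a->holds;
}
```

With Process A as {holds 1, waits for 2} and Process B as {holds 2, waits
for 1} the function reports a deadlock. If Process B instead requests
resource 1 while holding nothing, it simply queues behind Process A and no
cycle forms.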

Proposed fix:
In process_syncing_tables_for_apply(), acquire an AccessExclusiveLock
on SubscriptionRelRelationId before acquiring the lock on
ReplicationOriginRelationId.
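
The idea behind the fix is the usual lock-ordering discipline: if every
process takes the two locks in the same order, the cycle cannot form. A
rough sketch of that rule in C follows. The LockRank enum and
follows_lock_order() are hypothetical and exist only for this
illustration; PostgreSQL has no such rank table, and this is not the code
in the attached patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ranks for illustration only: a lower rank is locked first. */
typedef enum {
    RANK_SUBSCRIPTION_REL = 1,   /* stands in for SubscriptionRelRelationId */
    RANK_REPLICATION_ORIGIN = 2  /* stands in for ReplicationOriginRelationId */
} LockRank;

/* Returns true when the sequence takes locks in strictly increasing rank. */
static bool
follows_lock_order(const LockRank *seq, size_t n)
{
    assert(seq != NULL || n == 0);

    for (size_t i = 1; i < n; i++)
    {
        if (seq[i] <= seq[i - 1])
            return false;   /* a lower-ranked lock was requested too late */
    }
    return true;
}
```

Under this model the ALTER SUBSCRIPTION path (rank 1, then rank 2) passes
the check, the current apply worker path (rank 2, then rank 1) fails it,
and the apply worker path with the proposed change (rank 1, then rank 2)
passes.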

Patch with fix attached.
I'll continue investigating whether this issue also affects HEAD.

regards,
Ajin Cherian
Fujitsu Australia.
