Re: Conflict detection for update_deleted in logical replication - Mailing list pgsql-hackers

From shveta malik
Subject Re: Conflict detection for update_deleted in logical replication
Date
Msg-id CAJpy0uDm1+JP=SBkuSp=513cK6BJYZn5Yd_00aLLxpHGvdGEtQ@mail.gmail.com
Whole thread Raw
In response to Re: Conflict detection for update_deleted in logical replication  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Mon, Sep 8, 2025 at 3:06 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Coverity is not happy with commit a850be2fe:
>
> /srv/coverity/git/pgsql-git/postgresql/src/backend/replication/logical/worker.c: 3276             in
FindDeletedTupleInLocalRel()
> 3270                     * maybe_advance_nonremovable_xid() for details).
> 3271                     */
> 3272                    LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
> 3273                    leader = logicalrep_worker_find(MyLogicalRepWorker->subid,
> 3274                                                                                    InvalidOid, false);
> 3275
> >>>     CID 1665367:         Null pointer dereferences  (NULL_RETURNS)
> >>>     Dereferencing a pointer that might be "NULL" "&leader->relmutex" when calling "tas".
> 3276                    SpinLockAcquire(&leader->relmutex);
> 3277                    oldestxmin = leader->oldest_nonremovable_xid;
> 3278                    SpinLockRelease(&leader->relmutex);
> 3279                    LWLockRelease(LogicalRepWorkerLock);
> 3280            }
>
> I think Coverity has a point.  AFAICS every other call of
> logicalrep_worker_find() guards against a NULL result,
> so why is it okay for this one to dump core on NULL?
>

Thanks for pointing it out. It was a miss.
I attempted to reproduce a SIGSEGV in this flow. It appears that a
SIGSEGV can occur when the tablesync worker is catching up and is in
FindDeletedTupleInLocalRel() and meanwhile drop-subscription is done
in another session. Here’s the sequence that triggers the issue:

1) Pause the tablesync worker while it's in FindDeletedTupleInLocalRel().
2) Issue a 'DROP SUBSCRIPTION sub'.
3) Allow DropSubscription to proceed to logicalrep_worker_stop() for
the apply worker, but block it using the debugger before it attempts
to stop the tablesync worker.
4) Simultaneously, hold the launcher process using the debugger before
it can restart the apply worker.
5) Now, resume the tablesync worker. It ends up with a NULL leader
worker and hits a SIGSEGV.

Since this issue can be reliably reproduced with a simple DROP
SUBSCRIPTION, I thought it would be appropriate to add the new error
as a user-facing error.

Additionally, the issue can also be reproduced  if the apply worker is
forcefully made to error out in wait_for_relation_state_change() while
the tablesync worker is applying changes and is in
FindDeletedTupleInLocalRel().

Attached a patch to address the issue.

thanks
Shveta

Attachment

pgsql-hackers by date:

Previous
From: David Geier
Date:
Subject: Re: Use merge-based matching for MCVs in eqjoinsel
Next
From: Daniel Gustafsson
Date:
Subject: Re: OAuth client code doesn't work with Google OAuth