On Mon, Sep 8, 2025 at 3:06 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Coverity is not happy with commit a850be2fe:
>
> /srv/coverity/git/pgsql-git/postgresql/src/backend/replication/logical/worker.c: 3276 in
FindDeletedTupleInLocalRel()
> 3270 * maybe_advance_nonremovable_xid() for details).
> 3271 */
> 3272 LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
> 3273 leader = logicalrep_worker_find(MyLogicalRepWorker->subid,
> 3274 InvalidOid, false);
> 3275
> >>> CID 1665367: Null pointer dereferences (NULL_RETURNS)
> >>> Dereferencing a pointer that might be "NULL" "&leader->relmutex" when calling "tas".
> 3276 SpinLockAcquire(&leader->relmutex);
> 3277 oldestxmin = leader->oldest_nonremovable_xid;
> 3278 SpinLockRelease(&leader->relmutex);
> 3279 LWLockRelease(LogicalRepWorkerLock);
> 3280 }
>
> I think Coverity has a point. AFAICS every other call of
> logicalrep_worker_find() guards against a NULL result,
> so why is it okay for this one to dump core on NULL?
>
Thanks for pointing it out. It was a miss.
I attempted to reproduce a SIGSEGV in this flow. It appears that a
SIGSEGV can occur when the tablesync worker is catching up and is in
FindDeletedTupleInLocalRel() and meanwhile drop-subscription is done
in another session. Here’s the sequence that triggers the issue:
1) Pause the tablesync worker while it's in FindDeletedTupleInLocalRel().
2) Issue a 'DROP SUBSCRIPTION sub'.
3) Allow DropSubscription to proceed to logicalrep_worker_stop() for
the apply worker, but block it using the debugger before it attempts
to stop the tablesync worker.
4) Simultaneously, hold the launcher process using the debugger before
it can restart the apply worker.
5) Now, resume the tablesync worker. It ends up with a NULL leader
worker and hits a SIGSEGV.
Since this issue can be reliably reproduced with a simple DROP
SUBSCRIPTION, I thought it would be appropriate to add the new error
as a user-facing error.
Additionally, the issue can also be reproduced if the apply worker is
forcefully made to error out in wait_for_relation_state_change() while
the tablesync worker is applying changes and is in
FindDeletedTupleInLocalRel().
Attached a patch to address the issue.
thanks
Shveta