Re: [PATCH] Fix orphaned backend processes on Windows using Job Objects - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: [PATCH] Fix orphaned backend processes on Windows using Job Objects
Date
Msg-id CA+hUKGJo6hu6GToiXarBRF+AqhFPnzMTW2Nksm0x-+9m2=dskQ@mail.gmail.com
In response to Re: [PATCH] Fix orphaned backend processes on Windows using Job Objects  (Bryan Green <dbryan.green@gmail.com>)
List pgsql-hackers
On Fri, Nov 7, 2025 at 3:13 AM Bryan Green <dbryan.green@gmail.com> wrote:
> The reason to still do this patch and clean up the handle inheritance
> mess is that there are states (suspended state, infinite loop, spinlock
> hold, whatever) that a process can be in that keeps it from processing
> the event.  We don't need to wait on the children to voluntarily exit
> when postmaster crashes.

Agreed on all points.  We'd recently come to the same conclusion on this thread:

https://www.postgresql.org/message-id/flat/B3C69B86-7F82-4111-B97F-0005497BB745%40yandex-team.ru

I think there might arguably be a sort of weak forward progress
guarantee in the existing design and it's been a while since we've had
problem reports AFAIR*: locks were released (which turns out to be
fundamentally unsafe at least while in a critical section as analysed
in that thread, but it does allow progress in blocked backends, so
that they can learn of the postmaster's demise), and no one should
enter WaitEventSetWait() while holding a spinlock, and infinite loops are
against the law, and it's previously been considered acceptable-ish
that a backend might continue to run a long query until completion
before exiting (without supporting auxiliary or worker backends, which
sounds potentially suspect, but at least you can't wait for another
backend without learning of the postmaster's demise assuming the only
possible waits are LWLocks or latches).  But clearly it's not good
enough.
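
To make the "learn of the postmaster's demise" part concrete, here is a
minimal sketch of a cooperative liveness check.  It assumes a POSIX
"death pipe" as a stand-in (the parent holds the write end open and never
writes; children inherit the read end), and parent_is_alive() is a
hypothetical name, not PostgreSQL's actual function or the Windows code
path:

```c
#include <poll.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * Hypothetical helper for illustration only.  Nothing is ever written to
 * the pipe, so any readiness on the read end means every writer has gone
 * away, i.e. the parent has exited.
 */
bool
parent_is_alive(int death_pipe_read_fd)
{
    struct pollfd pfd;
    int         rc;

    pfd.fd = death_pipe_read_fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    rc = poll(&pfd, 1, 0);      /* timeout 0: check without blocking */
    if (rc <= 0)
        return true;            /* not ready, or transient error: assume alive */

    /* EOF shows up as POLLHUP and/or POLLIN depending on the platform. */
    return (pfd.revents & (POLLIN | POLLHUP | POLLERR | POLLNVAL)) == 0;
}
```

The check only runs when the child gets around to calling it, so a child
that is suspended, spinning, or stuck outside a wait point never sees the
result.  That is the gap that preemptive termination closes.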

The fact that Windows backends are born in suspended state until the
postmaster resumes them is indeed a new and significant hole in that
theory.  Preemptive termination is the only thing that makes sense.

*We used to have places that waited but forgot to handle PM exit, and
I don't recall "manual orphan cleanup needed" reports since we
enforced a central handler.  But see also my earlier note about
systemd potentially hiding problems these days, if using "mixed" mode
to SIGKILL the whole cgroup.


