Re: [PATCH] Fix orphaned backend processes on Windows using Job Objects - Mailing list pgsql-hackers

From Bryan Green
Subject Re: [PATCH] Fix orphaned backend processes on Windows using Job Objects
Date
Msg-id f60065bd-bed3-4e78-beb6-7e81d5e3875d@gmail.com
Whole thread Raw
In response to Re: [PATCH] Fix orphaned backend processes on Windows using Job Objects  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers
On 11/6/2025 6:46 AM, Thomas Munro wrote:
> On Tue, Nov 4, 2025 at 4:12 AM Bryan Green <dbryan.green@gmail.com> wrote:
>> Current approaches (inherited event handles, shared memory flags) depend
>> on the postmaster running code during exit. A segfault or kill bypasses
>> all of that.
> 
> Huh.  I thought PostmasterHandle should be signalled by the kernel,
> not by any code run by the postmaster itself, when taskkill /f
> calls something like TerminateProcess().  Is that not the case?  Are
> you sure we haven't broken something on our side?
> 


Yes, that is true.  I may have been a bit sloppy with my wording here.
This patch is focused solely on quickly terminating the backend
processes when postmaster has crashed. Any children created in a
suspended state will not handle the signal.  Additionally, we have
handles being inherited by the children that keep them from terminating.
 See Fix Socket handle Inheritance on Windows
(https://commitfest.postgresql.org/patch/6207/) and O_CLOEXEC not
honored on Windows - handle inheritance chain
(https://commitfest.postgresql.org/patch/6197/).  When the children are
in a suspended state or have handles inherited from the parent process
then we have to wait-- which we shouldn't do when postmaster crashes.
The reason to still do this patch and clean up the handle inheritance
mess is that there are states (suspended state, infinite loop, spinlock
hold, whatever) that a process can be in that keeps it from processing
the event.  We don't need to wait on the children to voluntarily exit
when postmaster crashes.

>> My proposed solution is to use Windows Job Objects with KILL_ON_JOB_CLOSE.
> 
> I'm not a Windows guy but I had been researching this independently,
> and it does seem to be the standard approach, so +1 generally even
> though I'm still very curious to know *why* the existing coding
> doesn't work in such a simple case.
> 
> By coincidence, I'm getting close to posting a stack of patches for
> better postmaster death handling on Unix too, along with better
> subprocess and interrupt multiplexing and cleanup.  That does more
> stuff, with connections to multithreading, interrupts, critical
> sections, a bunch of existing bugs we have with subprocess management
> on Unix.  I will gladly delete my own attempt at Windows job objects
> from that effort and rebase on top of your patch, and see what review
> feedback ideas come up in that process.
> 

That sounds excellent. I'd be happy to coordinate - please feel free to
build on this work. The Windows Job Object approach is fairly
self-contained, so it should integrate cleanly with your broader
subprocess management improvements.

> In nearby threads that triggered my work on that, I was a bit worried
> about the change in behaviour on PM death in the syncrep and
> syslogger, but I'm beginning to suspect that the vast majority of
> Linux/systemd deployments probably just nuke the whole cgroup from
> orbit in this case, so it seems like exceptional behaviour is really
> just a recipe for fragility on rarer systems, not to mention that it
> probably has to be like that in a potential multithreaded mode anyway.
> We should probably just rip all such specialness out, which I'll show
> in my patch set with some explanations soon.
> 
> On the other hand, I was thinking of this as a v19 feature, where one
> can contemplate such changes, but you said:
> 
>> Job creation can fail if postgres runs under an existing job (service
>> managers, debuggers). Windows 7 disallows nested jobs. We detect this
>> with IsProcessInJob(), and if AssignProcessToJobObject() returns
>> ERROR_ACCESS_DENIED, we log and continue without orphan protection.
> 
> We currently require Windows 10 (itself recently EOL'd), but
> PostgreSQL 13-15 sort of claim to work on Windows all the way back to
> 7, so I'm guessing you're imagining back-patching this?
> 

No, I'm targeting v19 as a new feature. The Windows 7 comment was just
being defensive about the failure mode - since we allow job creation to
fail gracefully, there's no harm in having the code check for nested job
scenarios even though we don't officially support Win7 anymore. The code
degrades gracefully to "current behavior" if job creation fails for any
reason.

>> This patch does not include automated tests because the core
>> functionality (orphan prevention on crash) requires simulating process
>> termination, which is difficult to test reliably in CI.
> 
> Ah yes I ran into problems with that part too...

Thanks for the review, I look forward to working on this problem area
with you.  It's good to have more than one set of eyes looking at a problem.

-- 
Bryan Green
EDB: https://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Jakub Wartak
Date:
Subject: Re: Adding basic NUMA awareness
Next
From: Andrew Dunstan
Date:
Subject: Re: Spacing of options in getopt_long processing