Re: windows doesn't notice backend death - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: windows doesn't notice backend death |
Date | |
Msg-id | 29196.1241377467@sss.pgh.pa.us |
In response to | Re: windows doesn't notice backend death (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: windows doesn't notice backend death
Re: windows doesn't notice backend death |
List | pgsql-hackers |
I wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> Well, I can tell you that it is getting an exit code of 1, which is why
>> the postmaster isn't restarting.

> Blech.  Count on Windows to find a way to break things.

I reflected on this a bit more.  Even if we find a way around this
particular task-manager behavior, it seems to me there is a generic
problem here.  If some bit of clueless code does exit(0) or exit(1)
inside a backend session, the postmaster will think everything is fine,
but actually we have an un-cleaned-up session that's probably still
holding locks etc.  It's fairly easy to demonstrate the issue:

pl_regression=# create language plperlu;
CREATE LANGUAGE
pl_regression=# create or replace function trouble() returns void as
pl_regression-# $$ exit 0; $$ language plperlu;
CREATE FUNCTION
pl_regression=# select trouble();
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
pl_regression=# select * from pg_stat_activity;
 datid |    datname    | procpid | usesysid | usename |          current_query          | waiting |          xact_start           |          query_start          |         backend_start         | client_addr | client_port
-------+---------------+---------+----------+---------+---------------------------------+---------+-------------------------------+-------------------------------+-------------------------------+-------------+-------------
 40179 | pl_regression |   20847 |       10 | tgl     | select trouble();               | f       | 2009-05-03 14:46:10.170604-04 | 2009-05-03 14:46:10.170604-04 | 2009-05-03 14:45:10.911359-04 |             |          -1
 40179 | pl_regression |   20855 |       10 | tgl     | select * from pg_stat_activity; | f       | 2009-05-03 14:46:23.986909-04 | 2009-05-03 14:46:23.986909-04 | 2009-05-03 14:46:17.920486-04 |             |          -1
(2 rows)

Up to now we've always just dismissed the above possibility as
"superusers should know better", but I think there's a reasonable
case to be made that this is an obvious failure mode and we should
put a bit more effort into being robust against it.  With more and
more external code being routinely run in the backend, who wants to
swear that there is no "exit(1)" in the guts of libperl or libxml
or whatever?

The first idea that comes to mind is to have some sort of "dead man
switch" that flags an active backend and is reset by proc_exit() after
it's finished cleaning up everything else.  If the postmaster sees this
flag still set after backend exit, then it treats the backend as having
crashed regardless of what the reported exit code is.  We could
implement this via an array of sig_atomic_t in shared memory, so as
to minimize the postmaster's entanglement with shared memory (it'd be
no worse than the old WIN32-specific child pid arrays).  Or maybe
there's a better way.

Thoughts?

			regards, tom lane
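
To make the "dead man switch" idea concrete, here is a minimal standalone sketch for POSIX systems. It is not PostgreSQL code and not a proposed patch: the names switch_flags, switches_init, demo_proc_exit, run_backend and MAX_BACKENDS are all hypothetical, the shared array is an anonymous mmap() region rather than the real shared memory segment, and the parent process merely plays the role of the postmaster. Following the description above, it assumes exit codes 0 and 1 are the ones otherwise treated as a normal backend exit.

```c
#define _DEFAULT_SOURCE         /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_BACKENDS 8          /* hypothetical number of backend slots */

/* One flag per backend slot, in memory shared between parent and children. */
volatile sig_atomic_t *switch_flags = NULL;

/* "Postmaster" side: create the shared flag array before forking anything. */
int
switches_init(void)
{
    void *p = mmap(NULL, MAX_BACKENDS * sizeof(sig_atomic_t),
                   PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return -1;
    switch_flags = p;
    for (int i = 0; i < MAX_BACKENDS; i++)
        switch_flags[i] = 0;
    return 0;
}

/* Stand-in for proc_exit(): do all cleanup, then disarm the switch last. */
void
demo_proc_exit(int slot, int code)
{
    /* ... releasing locks and other shared state would happen here ... */
    switch_flags[slot] = 0;
    _exit(code);
}

/*
 * Fork one "backend" in the given slot and wait for it.
 * Returns 1 if it must be treated as crashed, 0 if it exited cleanly,
 * -1 if fork() or waitpid() failed.
 */
int
run_backend(int slot, int exit_cleanly, int code)
{
    pid_t pid;
    int   status;
    int   crashed;

    assert(slot >= 0 && slot < MAX_BACKENDS);

    switch_flags[slot] = 1;     /* arm the switch before the child runs */
    pid = fork();
    if (pid < 0)
    {
        switch_flags[slot] = 0;
        return -1;
    }
    if (pid == 0)
    {
        if (exit_cleanly)
            demo_proc_exit(slot, code);
        _exit(code);            /* plays the part of a stray exit() in a library */
    }

    if (waitpid(pid, &status, 0) < 0)
        return -1;

    /* A flag that is still set overrides whatever the exit status claims. */
    crashed = (switch_flags[slot] != 0);
    if (!WIFEXITED(status))
        crashed = 1;
    else if (WEXITSTATUS(status) != 0 && WEXITSTATUS(status) != 1)
        crashed = 1;

    switch_flags[slot] = 0;     /* slot can be reused */
    return crashed;
}
```

After switches_init(), a call such as run_backend(0, 1, 0) models a backend that goes through the cleanup path, while run_backend(1, 0, 0) models the plperlu "exit 0" case above: the exit status looks fine, but the flag is still set, so the parent reports a crash. The parent only ever reads and writes single sig_atomic_t values, which is the point of keeping its involvement with shared memory this small.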