Thread: Inconsistencies of service failure handling on Windows
Hi all,
While experimenting with services on Windows, I noticed inconsistent behavior in the way failures are handled when a Postgres instance runs as a service.
Let's assume that there is a service called postgres that has been registered:
$ psql -At -c 'select version()'
PostgreSQL 9.5devel, compiled by Visual C++ build 1600, 64-bit
$ tasklist.exe -svc -FI "SERVICES eq postgres"
Image Name PID Services
========================= ======== ============================================
pg_ctl.exe 556 postgres
When pg_ctl is killed directly, the service manager detects the failure correctly:
$ taskkill.exe -PID 556 -F
SUCCESS: The process with PID 556 has been terminated.
$ sc query postgres
SERVICE_NAME: postgres
TYPE : 10 WIN32_OWN_PROCESS
STATE : 1 STOPPED
WIN32_EXIT_CODE : 1067 (0x42b)
SERVICE_EXIT_CODE : 0 (0x0)
CHECKPOINT : 0x0
WAIT_HINT : 0x0
In this case, 1067 means that the process terminated unexpectedly. Note that at this point the Postgres instance is still running, but we can use the failure callback to run a script that does some cleanup before properly restarting the service.
However, when a backend process is killed directly, something different happens:
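To compare the two cases mechanically, here is a minimal sketch that reads the text printed by "sc query" and reports whether the service is stopped with a nonzero WIN32_EXIT_CODE. The function names parse_sc_query and stopped_with_error are made up for illustration, and the check only mirrors the distinction drawn in this message; it is not a statement of how the service manager itself decides that a failure occurred.

```python
import re


def parse_sc_query(text):
    """Collect the leading decimal value of each 'NAME : value' line.

    Lines whose value is not numeric (SERVICE_NAME, the flags line)
    are skipped.
    """
    fields = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\w+)\s*:\s*(\d+)", line)
        if m:
            fields[m.group(1)] = int(m.group(2))
    return fields


def stopped_with_error(text):
    """True if STATE is 1 (STOPPED) and WIN32_EXIT_CODE is nonzero.

    Assumption: this is only the comparison made in this thread,
    not the service manager's own failure logic.
    """
    fields = parse_sc_query(text)
    return fields.get("STATE") == 1 and fields.get("WIN32_EXIT_CODE", 0) != 0
```

Fed the output shown above (STATE 1, WIN32_EXIT_CODE 1067), stopped_with_error returns True; fed an output with STATE 1 and WIN32_EXIT_CODE 0, it returns False.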
$ tasklist.exe -FI "IMAGENAME eq postgres.exe"
Image Name PID Session Name Session# Mem Usage
========================= ======== ================ =========== ============
postgres.exe 2088 Services 0 17,380 K
postgres.exe 2132 Services 0 4,400 K
postgres.exe 2236 Services 0 5,064 K
postgres.exe 1524 Services 0 6,304 K
postgres.exe 2084 Services 0 9,200 K
postgres.exe 2384 Services 0 5,968 K
postgres.exe 2652 Services 0 4,500 K
postgres.exe 2116 Services 0 4,384 K
$ taskkill.exe -PID 2084 -F
SUCCESS: The process with PID 2084 has been terminated.
After that some processes remain:
$ tasklist.exe -FI "IMAGENAME eq postgres.exe"
Image Name PID Session Name Session# Mem Usage
========================= ======== ================ =========== ============
postgres.exe 2088 Services 0 5,708 K
postgres.exe 2132 Services 0 4,400 K
These remaining processes are taken down immediately when a connection to the server is attempted. Note that before any connection is attempted, the service is still considered to be running normally:
$ sc query postgres
SERVICE_NAME: postgres
TYPE : 10 WIN32_OWN_PROCESS
STATE : 4 RUNNING
(STOPPABLE, PAUSABLE, ACCEPTS_SHUTDOWN)
WIN32_EXIT_CODE : 0 (0x0)
SERVICE_EXIT_CODE : 0 (0x0)
CHECKPOINT : 0x0
WAIT_HINT : 0x0
$ psql
psql: could not connect to server: Connection refused (0x0000274D/10061)
Is the server running on host "localhost" (::1) and accepting
TCP/IP connections on port 5432?
could not connect to server: Connection refused (0x0000274D/10061)
Is the server running on host "localhost" (127.0.0.1) and accepting
TCP/IP connections on port 5432?
$ tasklist.exe -FI "IMAGENAME eq postgres.exe"
INFO: No tasks are running which match the specified criteria.
But now the service has stopped, and it is not considered to have failed:
$ sc query postgres
SERVICE_NAME: postgres
TYPE : 10 WIN32_OWN_PROCESS
STATE : 1 STOPPED
WIN32_EXIT_CODE : 0 (0x0)
SERVICE_EXIT_CODE : 0 (0x0)
CHECKPOINT : 0x0
WAIT_HINT : 0x0
This seems like inconsistent behavior in error detection.
--
Michael
Michael Paquier <michael.paquier@gmail.com> writes:
> While playing on Windows with services, I noticed an inconsistent behavior
> in the way failures are handled when using a service for a Postgres
> instance.
> ...
> However when a backend process is directly killed something different
> happens.
Was that a backend that you directly killed? Or the postmaster? The
subsequent connection failures suggest it was the postmaster. Killing
the postmaster is not a supported operation, not on Windows and not
anywhere else either. It's in the category of "doctor, it hurts when
I do this".
regards, tom lane
On Wed, Jul 23, 2014 at 11:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Was that a backend that you directly killed? Or the postmaster? The
> subsequent connection failures suggest it was the postmaster. Killing
> the postmaster is not a supported operation, not on Windows and not
> anywhere else either. It's in the category of "doctor, it hurts when
> I do this".
The headshots were done on random backends. Perhaps the postmaster was taken down in some of those tests, though :) I didn't check postmaster.pid every time.
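For what it's worth, that check is easy to script. Below is a minimal sketch, assuming the data directory path is known and relying on the first line of postmaster.pid holding the postmaster's PID. The helper names postmaster_pid and is_postmaster are hypothetical and were not part of the tests described here.

```python
from pathlib import Path


def postmaster_pid(data_dir):
    """Return the PID on the first line of postmaster.pid.

    Returns None if the file is missing or empty, for example
    when the server is not running.
    """
    pid_file = Path(data_dir) / "postmaster.pid"
    try:
        first_line = pid_file.read_text().splitlines()[0]
    except (FileNotFoundError, IndexError):
        return None
    return int(first_line.strip())


def is_postmaster(data_dir, pid):
    """True if pid is the postmaster recorded for this data directory."""
    return postmaster_pid(data_dir) == pid
```

Calling is_postmaster on the data directory with the PID about to be passed to taskkill.exe would tell whether the target is the postmaster or an ordinary backend.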
Michael