Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f - Mailing list pgsql-committers
| From | Andres Freund |
|---|---|
| Subject | Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f |
| Date | |
| Msg-id | 20220120192736.msyesvdlljn6tptw@alap3.anarazel.de Whole thread Raw |
| In response to | Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f (Tom Lane <tgl@sss.pgh.pa.us>) |
| Responses |
Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f
|
| List | pgsql-committers |
Hi,
On 2022-01-20 09:32:16 -0500, Tom Lane wrote:
> Seems like we need to do 'kill 0' on the PID we get from
> the file to verify that there's really a postmaster there.
> (I wonder how well that works on Windows? perlport claims
> it does, but ...)
We seem to successfully do that in pg_ctl's postmaster_is_alive().
> I wrote:
> > Hmm. I think the problem is that poll_start() thinks it can just call
> > start() a second time after a failure. If it wasn't a true failure
> > but a timeout, then _pid is now set and the second call complains.
>
> Oh, wait --- the case that is failing is after 017_shm.pl has
> intentionally kill -9'd a postmaster, so that its pidfile is
> left behind. The next attempted start fails on shmem id
> conflict, but it doesn't remove the old pidfile, and then
> the code I added to sub start erroneously picks that up
> as a live postmaster PID.
> I fear I still don't have the whole story though because
> per this theory it should fail everywhere, yet it doesn't.
The sequence in regress_log_* that hoverfly is getting and that I am seeing
locally differs. It's a bit hard to tell, because the test doesn't report any
test progress until quite a bit into the test...
For the first kill9, I see:
local:
### Killing node "gnat" using signal 9
...
### Starting node "gnat"
# Running: pg_ctl -w -D
/home/andres/build/postgres/dev-assert/vpath/src/test/recovery/tmp_check/t_017_shm_gnat_data/pgdata-l
/home/andres/build/postgres/dev>
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start.... done
server started
# Postmaster PID for node "gnat" is 2904505
hoverfly:
### Killing node "gnat" using signal 9
...
### Starting node "gnat"
# Running: pg_ctl -w -D
/scratch/nm/farm/xlc64v16/HEAD/pgsql.build/src/test/recovery/tmp_check/t_017_shm_gnat_data/pgdata-l
/scratch/nm/farm/xlc64v16/HEAD/pgsql.build/src/test/recovery/tmp_check/log/017_shm_gnat.log-o --cluster-name=gnat
start
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start.... stopped waiting
pg_ctl: could not start server
Examine the log output.
# pg_ctl start failed; logfile:
local log:
2022-01-20 11:02:40.109 PST [2905213] LOG: starting PostgreSQL 15devel on x86_64-pc-linux-gnu, compiled by gcc-12
(Debian12-20220106-1) 12.0.0 20220106 (experimental) [master r12-6277-g1935db29689], 64-bit
2022-01-20 11:02:40.109 PST [2905213] LOG: listening on Unix socket "/tmp/jAyaRMImQH/.s.PGSQL.53851"
2022-01-20 11:02:40.110 PST [2905216] LOG: database system was interrupted; last known up at 2022-01-20 11:02:39
PST
2022-01-20 11:02:40.111 PST [2905216] LOG: database system was not properly shut down; automatic recovery in
progress
hoverfly:
2022-01-20 18:04:53.087 UTC [50659690:1] LOG: starting PostgreSQL 15devel on powerpc-ibm-aix7.2.4.0, compiled by
/opt/IBM/xlc/16.1.0/bin/xlc_r-D_LARGE_FILES=1 -DRANDOMIZE_ALLOCATED_MEMORY, 64-bit
2022-01-20 18:04:53.088 UTC [50659690:2] LOG: listening on Unix socket
"/home/nm/farm/tmp/oR5PbI8A5R/.s.PGSQL.61673"
2022-01-20 18:04:53.092 UTC [393770:1] LOG: database system was shut down at 2022-01-20 18:04:52 UTC
2022-01-20 18:04:53.099 UTC [50659690:3] LOG: database system is ready to accept connections
2022-01-20 18:04:53.275 UTC [33751596:1] FATAL: lock file "postmaster.pid" already exists
2022-01-20 18:04:53.275 UTC [33751596:2] HINT: Is another postmaster (PID 50659690) running in data directory
"/scratch/nm/farm/xlc64v16/HEAD/pgsql.build/src/test/recovery/tmp_check/t_017_shm_gnat_data/pgdata"?
IOW, the test doesn't actually quite seem to be working in the "local" case?
Hence not seeing the problem of picking up some wrong pid.
I think it's a question of how quick the postmaster-death detection logic
is. At first I thought it's a question of USE_POSTMASTER_DEATH_SIGNAL, but I
see the same with the pipe based approach. So it's probably just a question of
event ordering.
I think this test could benefit from
$gnat->append_conf('postgresql.conf', 'log_min_messages=debug3');
or such, so that we see processes exiting.
I also think we ought to emit a debug message when detecting postmaster
death. It's unnecessarily hard to debug them right now.
Greetings,
Andres Freund
pgsql-committers by date: