Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f - Mailing list pgsql-committers
From | Andres Freund |
---|---|
Subject | Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f |
Date | |
Msg-id | 20220120192736.msyesvdlljn6tptw@alap3.anarazel.de Whole thread Raw |
In response to | Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f
|
List | pgsql-committers |
Hi, On 2022-01-20 09:32:16 -0500, Tom Lane wrote: > Seems like we need to do 'kill 0' on the PID we get from > the file to verify that there's really a postmaster there. > (I wonder how well that works on Windows? perlport claims > it does, but ...) We seem to successfully do that in pg_ctl's postmaster_is_alive(). > I wrote: > > Hmm. I think the problem is that poll_start() thinks it can just call > > start() a second time after a failure. If it wasn't a true failure > > but a timeout, then _pid is now set and the second call complains. > > Oh, wait --- the case that is failing is after 017_shm.pl has > intentionally kill -9'd a postmaster, so that its pidfile is > left behind. The next attempted start fails on shmem id > conflict, but it doesn't remove the old pidfile, and then > the code I added to sub start erroneously picks that up > as a live postmaster PID. > I fear I still don't have the whole story though because > per this theory it should fail everywhere, yet it doesn't. The sequence in regress_log_* that hoverfly is getting and that I am seeing locally differs. It's a bit hard to tell, because the test doesn't report any test progress until quite a bit into the test... For the first kill9, I see: local: ### Killing node "gnat" using signal 9 ... ### Starting node "gnat" # Running: pg_ctl -w -D /home/andres/build/postgres/dev-assert/vpath/src/test/recovery/tmp_check/t_017_shm_gnat_data/pgdata-l /home/andres/build/postgres/dev> pg_ctl: another server might be running; trying to start server anyway waiting for server to start.... done server started # Postmaster PID for node "gnat" is 2904505 hoverfly: ### Killing node "gnat" using signal 9 ... ### Starting node "gnat" # Running: pg_ctl -w -D /scratch/nm/farm/xlc64v16/HEAD/pgsql.build/src/test/recovery/tmp_check/t_017_shm_gnat_data/pgdata-l /scratch/nm/farm/xlc64v16/HEAD/pgsql.build/src/test/recovery/tmp_check/log/017_shm_gnat.log-o --cluster-name=gnat start pg_ctl: another server might be running; trying to start server anyway waiting for server to start.... stopped waiting pg_ctl: could not start server Examine the log output. # pg_ctl start failed; logfile: local log: 2022-01-20 11:02:40.109 PST [2905213] LOG: starting PostgreSQL 15devel on x86_64-pc-linux-gnu, compiled by gcc-12 (Debian12-20220106-1) 12.0.0 20220106 (experimental) [master r12-6277-g1935db29689], 64-bit 2022-01-20 11:02:40.109 PST [2905213] LOG: listening on Unix socket "/tmp/jAyaRMImQH/.s.PGSQL.53851" 2022-01-20 11:02:40.110 PST [2905216] LOG: database system was interrupted; last known up at 2022-01-20 11:02:39 PST 2022-01-20 11:02:40.111 PST [2905216] LOG: database system was not properly shut down; automatic recovery in progress hoverfly: 2022-01-20 18:04:53.087 UTC [50659690:1] LOG: starting PostgreSQL 15devel on powerpc-ibm-aix7.2.4.0, compiled by /opt/IBM/xlc/16.1.0/bin/xlc_r-D_LARGE_FILES=1 -DRANDOMIZE_ALLOCATED_MEMORY, 64-bit 2022-01-20 18:04:53.088 UTC [50659690:2] LOG: listening on Unix socket "/home/nm/farm/tmp/oR5PbI8A5R/.s.PGSQL.61673" 2022-01-20 18:04:53.092 UTC [393770:1] LOG: database system was shut down at 2022-01-20 18:04:52 UTC 2022-01-20 18:04:53.099 UTC [50659690:3] LOG: database system is ready to accept connections 2022-01-20 18:04:53.275 UTC [33751596:1] FATAL: lock file "postmaster.pid" already exists 2022-01-20 18:04:53.275 UTC [33751596:2] HINT: Is another postmaster (PID 50659690) running in data directory "/scratch/nm/farm/xlc64v16/HEAD/pgsql.build/src/test/recovery/tmp_check/t_017_shm_gnat_data/pgdata"? IOW, the test doesn't actually quite seem to be working in the "local" case? Hence not seeing the problem of picking up some wrong pid. I think it's a question of how quick the postmaster-death detection logic is. At first I thought it's a question of USE_POSTMASTER_DEATH_SIGNAL, but I see the same with the pipe based approach. So it's probably just a question of event ordering. I think this test could benefit from $gnat->append_conf('postgresql.conf', 'log_min_messages=debug3'); or such, so that we see processes exiting. I also think we ought to emit a debug message when detecting postmaster death. It's unnecessarily hard to debug them right now. Greetings, Andres Freund
pgsql-committers by date: