Thread: Is PQreset() proper?
Hi all,

I've encountered a database freeze and found that it is due to the reset of
the connection after an abort. The following is part of the postmaster log.
A new backend (pid=395) started immediately after a backend (pid=394)
aborted. Meanwhile, the postmaster tries to kill all backends to clean up
shared memory. However, process 395 ignored the SIGUSR1 signal and is
waiting for some lock which will never be released.

FATAL 2: elog: error during error recovery, giving up!
DEBUG: proc_exit(2)
DEBUG: shmem_exit(2)
postmaster: ServerLoop: handling reading 5
postmaster: ServerLoop: handling reading 5
postmaster: ServerLoop: handling writing 5
postmaster: BackendStartup: pid 395 user reindex db reindex socket 5
DEBUG: exit(2)
postmaster: reaping dead processes...
postmaster: CleanupProc: pid 394 exited with status 512
Server process (pid 394) exited with status 512 at Tue Dec 19 20:12:41 2000
Terminating any active server processes...
postmaster: CleanupProc: sending SIGUSR1 to process 395
postmaster child[395]: starting with (postgres -d2 -v131072 -p reindex )
FindExec: searching PATH ...
ValidateBinary: can't stat "/bin/postgres"
ValidateBinary: can't stat "/usr/bin/postgres"
ValidateBinary: can't stat "/usr/local/bin/postgres"
ValidateBinary: can't stat "/usr/bin/X11/postgres"
ValidateBinary: can't stat "/usr/lib/jdk1.2/bin/postgres"
ValidateBinary: can't stat "/home/freetools/bin/postgres"
FindExec: found "/home/freetools/reindex/bin/postgres" using PATH
DEBUG: connection: host=[local] user=reindex database=reindex
DEBUG: InitPostgres

Regards.
Hiroshi Inoue
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes: > postmaster: BackendStartup: pid 395 user reindex db reindex socket 5 > DEBUG: exit(2) > postmaster: reaping dead processes... > postmaster: CleanupProc: pid 394 exited with status 512 > Server process (pid 394) exited with status 512 at Tue Dec 19 20:12:41 2000 > Terminating any active server processes... > postmaster: CleanupProc: sending SIGUSR1 to process 395 > postmaster child[395]: starting with (postgres -d2 -v131072 -p reindex ) This isn't PQreset()'s fault that I can see. This is a race condition caused by bogosity in PostgresMain --- it enables SIGUSR1 before it's set up the correct signal handler for same. The postmaster should have started the child process with all signals blocked, so SIGUSR1 will be held off until the child explicitly enables it; but it does so a few lines too soon. Will fix. regards, tom lane
Tom Lane wrote:
>
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> > postmaster: BackendStartup: pid 395 user reindex db reindex socket 5
> > DEBUG: exit(2)
> > postmaster: reaping dead processes...
> > postmaster: CleanupProc: pid 394 exited with status 512
> > Server process (pid 394) exited with status 512 at Tue Dec 19 20:12:41 2000
> > Terminating any active server processes...
> > postmaster: CleanupProc: sending SIGUSR1 to process 395
> > postmaster child[395]: starting with (postgres -d2 -v131072 -p reindex )
>
> This isn't PQreset()'s fault that I can see.  This is a race condition
> caused by bogosity in PostgresMain --- it enables SIGUSR1 before it's
> set up the correct signal handler for same.  The postmaster should have
> started the child process with all signals blocked, so SIGUSR1 will be
> held off until the child explicitly enables it; but it does so a few
> lines too soon.  Will fix.
>

I once observed another case: the hang of the CheckPoint process while
the postmaster was in a backend crash recovery. I changed postmaster.c
to not invoke the CheckPoint process while the postmaster is in a
backend crash recovery, but it doesn't seem sufficient. The SIGUSR1
signal seems to be blocked all the way in the CheckPoint process.

Regards.
Hiroshi Inoue
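In case it helps pin down what Hiroshi is seeing, here is a small,
hypothetical check (not PostgreSQL code) that a child process such as the
CheckPoint process could run to see whether SIGUSR1 is still sitting in the
signal mask it inherited from the postmaster:

#include <signal.h>
#include <stdio.h>

int
main(void)
{
	sigset_t	cur;

	/* with a NULL new set, sigprocmask() only reports the current mask */
	if (sigprocmask(SIG_BLOCK, NULL, &cur) == 0)
	{
		if (sigismember(&cur, SIGUSR1))
			fprintf(stderr, "SIGUSR1 is blocked in this process\n");
		else
			fprintf(stderr, "SIGUSR1 is deliverable\n");
	}
	return 0;
}

If SIGUSR1 really stays blocked for the whole life of the CheckPoint
process, the postmaster's "Terminating any active server processes..." pass
would never take effect there.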
> Tom Lane wrote:
>> This isn't PQreset()'s fault that I can see.  This is a race condition
>> caused by bogosity in PostgresMain --- it enables SIGUSR1 before it's
>> set up the correct signal handler for same.  The postmaster should have
>> started the child process with all signals blocked, so SIGUSR1 will be
>> held off until the child explicitly enables it; but it does so a few
>> lines too soon.  Will fix.

Actually, it turns out the real problem is that backends were inheriting
a SIG_IGN setting for SIGUSR1 from the postmaster.  So a SIGUSR1
delivered before they got as far as setting up their own signal handling
would get lost.  Fixed now.

Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> I once observed another case: the hang of the CheckPoint process while
> the postmaster was in a backend crash recovery. I changed postmaster.c
> to not invoke the CheckPoint process while the postmaster is in a
> backend crash recovery, but it doesn't seem sufficient. The SIGUSR1
> signal seems to be blocked all the way in the CheckPoint process.

Hm.  Vadim, do you think it's safe to let CheckPoint be killed by
SIGUSR1?  If not, what will we do about this?

			regards, tom lane
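To spell out the distinction Tom is relying on: a signal that arrives while
its disposition is SIG_IGN is simply discarded, whereas one that arrives
while it is merely blocked stays pending and is delivered once a handler is
installed and the signal unblocked. A small illustrative C program
(standalone, names and output invented) showing both cases:

#include <signal.h>
#include <stdio.h>

static volatile sig_atomic_t got_usr1 = 0;

static void
handler(int signo)
{
	(void) signo;
	got_usr1 = 1;
}

int
main(void)
{
	sigset_t	set;
	struct sigaction sa;

	/* Case 1: ignored -- the signal is thrown away */
	signal(SIGUSR1, SIG_IGN);
	raise(SIGUSR1);				/* discarded on the spot */

	/* installing a real handler afterwards does not bring it back */
	sa.sa_handler = handler;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = 0;
	sigaction(SIGUSR1, &sa, NULL);
	printf("after SIG_IGN: got_usr1 = %d\n", (int) got_usr1);	/* prints 0 */

	/* Case 2: blocked -- the signal is held pending */
	sigemptyset(&set);
	sigaddset(&set, SIGUSR1);
	sigprocmask(SIG_BLOCK, &set, NULL);
	raise(SIGUSR1);				/* stays pending */
	sigprocmask(SIG_UNBLOCK, &set, NULL);	/* handler runs here */
	printf("after block/unblock: got_usr1 = %d\n", (int) got_usr1);	/* prints 1 */

	return 0;
}

So inheriting SIG_IGN from the postmaster loses the signal outright, while
starting the child with SIGUSR1 blocked (and a handler installed before
unblocking) only delays it.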