Thread: Is PQreset() proper?
Hi all,

I've encountered a database freeze and found that it is due to the reset of
the connection after an abort. The following is part of the postmaster log.
A new backend (pid=395) started immediately after a backend (pid=394)
aborted. Meanwhile, the postmaster tries to kill all backends to clean up
shared memory. However, process 395 ignored the SIGUSR1 signal and is
waiting for some lock which will never be released.

FATAL 2: elog: error during error recovery, giving up!
DEBUG: proc_exit(2)
DEBUG: shmem_exit(2)
postmaster: ServerLoop: handling reading 5
postmaster: ServerLoop: handling reading 5
postmaster: ServerLoop: handling writing 5
postmaster: BackendStartup: pid 395 user reindex db reindex socket 5
DEBUG: exit(2)
postmaster: reaping dead processes...
postmaster: CleanupProc: pid 394 exited with status 512
Server process (pid 394) exited with status 512 at Tue Dec 19 20:12:41 2000
Terminating any active server processes...
postmaster: CleanupProc: sending SIGUSR1 to process 395
postmaster child[395]: starting with (postgres -d2 -v131072 -p reindex )
FindExec: searching PATH ...
ValidateBinary: can't stat "/bin/postgres"
ValidateBinary: can't stat "/usr/bin/postgres"
ValidateBinary: can't stat "/usr/local/bin/postgres"
ValidateBinary: can't stat "/usr/bin/X11/postgres"
ValidateBinary: can't stat "/usr/lib/jdk1.2/bin/postgres"
ValidateBinary: can't stat "/home/freetools/bin/postgres"
FindExec: found "/home/freetools/reindex/bin/postgres" using PATH
DEBUG: connection: host=[local] user=reindex database=reindex
DEBUG: InitPostgres

Regards.
Hiroshi Inoue
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes: > postmaster: BackendStartup: pid 395 user reindex db reindex socket 5 > DEBUG: exit(2) > postmaster: reaping dead processes... > postmaster: CleanupProc: pid 394 exited with status 512 > Server process (pid 394) exited with status 512 at Tue Dec 19 20:12:41 2000 > Terminating any active server processes... > postmaster: CleanupProc: sending SIGUSR1 to process 395 > postmaster child[395]: starting with (postgres -d2 -v131072 -p reindex ) This isn't PQreset()'s fault that I can see. This is a race condition caused by bogosity in PostgresMain --- it enables SIGUSR1 before it's set up the correct signal handler for same. The postmaster should have started the child process with all signals blocked, so SIGUSR1 will be held off until the child explicitly enables it; but it does so a few lines too soon. Will fix. regards, tom lane
Tom Lane wrote:
>
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> > postmaster: BackendStartup: pid 395 user reindex db reindex socket 5
> > DEBUG: exit(2)
> > postmaster: reaping dead processes...
> > postmaster: CleanupProc: pid 394 exited with status 512
> > Server process (pid 394) exited with status 512 at Tue Dec 19 20:12:41 2000
> > Terminating any active server processes...
> > postmaster: CleanupProc: sending SIGUSR1 to process 395
> > postmaster child[395]: starting with (postgres -d2 -v131072 -p reindex )
>
> This isn't PQreset()'s fault that I can see.  This is a race condition
> caused by bogosity in PostgresMain --- it enables SIGUSR1 before it's
> set up the correct signal handler for same.  The postmaster should have
> started the child process with all signals blocked, so SIGUSR1 will be
> held off until the child explicitly enables it; but it does so a few
> lines too soon.  Will fix.
>

I once observed another case: the hang of the CheckPoint process while
the postmaster was in a backend crash recovery. I changed postmaster.c
to not invoke the CheckPoint process while the postmaster is in a
backend crash recovery, but it doesn't seem sufficient. The SIGUSR1
signal seems to be blocked all the way in the CheckPoint process.

Regards.
Hiroshi Inoue
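In case it helps pin down what Hiroshi is seeing, here is a small,
hypothetical check (not PostgreSQL code) that a child process such as the
CheckPoint process could run to see whether SIGUSR1 is still sitting in the
signal mask it inherited from the postmaster:

#include <signal.h>
#include <stdio.h>

int
main(void)
{
	sigset_t	cur;

	/* with a NULL new set, sigprocmask() only reports the current mask */
	if (sigprocmask(SIG_BLOCK, NULL, &cur) == 0)
	{
		if (sigismember(&cur, SIGUSR1))
			fprintf(stderr, "SIGUSR1 is blocked in this process\n");
		else
			fprintf(stderr, "SIGUSR1 is deliverable\n");
	}
	return 0;
}

If SIGUSR1 really stays blocked for the whole life of the CheckPoint
process, the postmaster's "Terminating any active server processes..." pass
would never take effect there.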
> Tom Lane wrote:
>> This isn't PQreset()'s fault that I can see.  This is a race condition
>> caused by bogosity in PostgresMain --- it enables SIGUSR1 before it's
>> set up the correct signal handler for same.  The postmaster should have
>> started the child process with all signals blocked, so SIGUSR1 will be
>> held off until the child explicitly enables it; but it does so a few
>> lines too soon.  Will fix.

Actually, it turns out the real problem is that backends were inheriting
a SIG_IGN setting for SIGUSR1 from the postmaster.  So a SIGUSR1
delivered before they got as far as setting up their own signal handling
would get lost.  Fixed now.

Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> I once observed another case: the hang of the CheckPoint process while
> the postmaster was in a backend crash recovery. I changed postmaster.c
> to not invoke the CheckPoint process while the postmaster is in a
> backend crash recovery, but it doesn't seem sufficient. The SIGUSR1
> signal seems to be blocked all the way in the CheckPoint process.

Hm.  Vadim, do you think it's safe to let CheckPoint be killed by
SIGUSR1?  If not, what will we do about this?

			regards, tom lane
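To spell out the distinction Tom is relying on: a signal that arrives while
its disposition is SIG_IGN is simply discarded, whereas one that arrives
while it is merely blocked stays pending and is delivered once a handler is
installed and the signal unblocked. A small illustrative C program
(standalone, names and output invented) showing both cases:

#include <signal.h>
#include <stdio.h>

static volatile sig_atomic_t got_usr1 = 0;

static void
handler(int signo)
{
	(void) signo;
	got_usr1 = 1;
}

int
main(void)
{
	sigset_t	set;
	struct sigaction sa;

	/* Case 1: ignored -- the signal is thrown away */
	signal(SIGUSR1, SIG_IGN);
	raise(SIGUSR1);				/* discarded on the spot */

	/* installing a real handler afterwards does not bring it back */
	sa.sa_handler = handler;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = 0;
	sigaction(SIGUSR1, &sa, NULL);
	printf("after SIG_IGN: got_usr1 = %d\n", (int) got_usr1);	/* prints 0 */

	/* Case 2: blocked -- the signal is held pending */
	sigemptyset(&set);
	sigaddset(&set, SIGUSR1);
	sigprocmask(SIG_BLOCK, &set, NULL);
	raise(SIGUSR1);				/* stays pending */
	sigprocmask(SIG_UNBLOCK, &set, NULL);	/* handler runs here */
	printf("after block/unblock: got_usr1 = %d\n", (int) got_usr1);	/* prints 1 */

	return 0;
}

So inheriting SIG_IGN from the postmaster loses the signal outright, while
starting the child with SIGUSR1 blocked (and a handler installed before
unblocking) only delays it.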