Re: VM corruption on standby - Mailing list pgsql-hackers

From Tom Lane
Subject Re: VM corruption on standby
Date
Msg-id 168715.1755441226@sss.pgh.pa.us
In response to Re: VM corruption on standby  (Kirill Reshke <reshkekirill@gmail.com>)
List pgsql-hackers
Kirill Reshke <reshkekirill@gmail.com> writes:
> [ v1-0001-Do-not-exit-on-postmaster-death-ever-inside-CRIT-.patch ]

I do not like this patch one bit: it will replace one set of problems
with another set, namely systems that fail to shut down.

I think the actual bug here is the use of proc_exit(1) after
observing postmaster death.  That is what creates the hazard,
because it releases the locks that are preventing other processes
from observing the inconsistent state in shared memory.
Compare this to what we do, for example, on receipt of SIGQUIT:

    /*
     * We DO NOT want to run proc_exit() or atexit() callbacks -- we're here
     * because shared memory may be corrupted, so we don't want to try to
     * clean up our transaction.  Just nail the windows shut and get out of
     * town.  The callbacks wouldn't be safe to run from a signal handler,
     * anyway.
     *
     * Note we do _exit(2) not _exit(0).  This is to force the postmaster into
     * a system reset cycle if someone sends a manual SIGQUIT to a random
     * backend.  This is necessary precisely because we don't clean up our
     * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
     * should ensure the postmaster sees this as a crash, too, but no harm in
     * being doubly sure.)
     */
    _exit(2);

So I think the correct fix here is s/proc_exit(1)/_exit(2)/ in the
places that are responding to postmaster death.  There might be
more than just WaitEventSetWaitBlock; I didn't look.
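
To illustrate the distinction being drawn here, the following is a minimal standalone sketch, not PostgreSQL code. It assumes that plain exit() with an atexit() handler is a fair stand-in for proc_exit() and its cleanup callbacks; the names run_child, cleanup_callback and ChildResult are hypothetical. A forked child registers a cleanup callback and then leaves either through exit(1) or through _exit(2), and the parent records whether the callback ran and which status it observed.

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical illustration only; none of these names exist in PostgreSQL. */

static int cleanup_fd = -1;

/*
 * Stand-in for the cleanup work (such as releasing locks) that runs on a
 * normal exit.  It reports that it ran by writing one byte to a pipe.
 */
static void
cleanup_callback(void)
{
    if (cleanup_fd >= 0 && write(cleanup_fd, "c", 1) != 1)
    {
        /* nothing useful to do on failure in this sketch */
    }
}

typedef struct ChildResult
{
    int         callback_ran;   /* 1 if the atexit callback executed */
    int         exit_code;      /* child's exit status, or -1 on error */
} ChildResult;

/*
 * Fork a child that registers cleanup_callback and then terminates.
 * immediate = 0: leave via exit(1), which runs atexit callbacks.
 * immediate = 1: leave via _exit(2), which skips them.
 */
static ChildResult
run_child(int immediate)
{
    ChildResult res = {0, -1};
    int         pipefd[2];
    pid_t       pid;
    int         status;
    char        buf;

    if (pipe(pipefd) != 0)
        return res;

    pid = fork();
    if (pid < 0)
    {
        close(pipefd[0]);
        close(pipefd[1]);
        return res;
    }

    if (pid == 0)
    {
        /* Child: arrange for the callback, then choose how to die. */
        close(pipefd[0]);
        cleanup_fd = pipefd[1];
        atexit(cleanup_callback);
        if (immediate)
            _exit(2);           /* no callbacks, nonzero status */
        exit(1);                /* callbacks run first */
    }

    /* Parent: one byte means the callback ran, EOF means it did not. */
    close(pipefd[1]);
    if (read(pipefd[0], &buf, 1) == 1)
        res.callback_ran = 1;
    close(pipefd[0]);

    if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        res.exit_code = WEXITSTATUS(status);
    return res;
}
```

On a POSIX system, run_child(0) should report that the callback ran and that the child exited with status 1, while run_child(1) should report that no callback ran and that the status was 2. The second case is the behavior argued for above: the process disappears without touching shared state, and the nonzero status is there for whoever reaps it.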

            regards, tom lane


