Home > mailing lists

BUG #13459: Replaying WAL logs can hang on startup - Mailing list pgsql-bugs

From	chris+postgresql@qwirx.com
Subject	BUG #13459: Replaying WAL logs can hang on startup
Date	June 22, 2015 14:19:41
Msg-id	20150622084201.3874.74819@wrigleys.postgresql.org Whole thread Raw
Responses	Re: BUG #13459: Replaying WAL logs can hang on startup
List	pgsql-bugs

Tree view

The following bug has been logged on the website:

Bug reference:      13459
Logged by:          Chris Wilson
Email address:      chris+postgresql@qwirx.com
PostgreSQL version: 9.4.1
Operating system:   Linux 2.6.32-220.30.1.el6.x86_64
Description:

Dear developers,

I restored a standby from a pg_basebackup made from the master the previous
morning. As a result, it had a lot of WAL logs to catch up on.

At one point it hangs while restoring logs. It normally takes a few seconds
to process a 16 MB WAL segment, but on this one, it was "recovering
0000000100000CEC00000025" for 7 minutes now with no log output at all.

I restarted it at least once and it reached the same point and hung again. I
attached gdb and saw that it was hung here:

(gdb) bt
#0  0x0000003f358e1433 in __select_nocancel () from /lib64/libc.so.6
#1  0x00000000008be748 in pg_usleep ()
#2  0x000000000074ef9f in ForgetRelationFsyncRequests ()
#3  0x000000000074d0b0 in mdunlink ()
#4  0x000000000074faa5 in smgrdounlink ()
#5  0x00000000004d5b3a in xact_redo_commit_internal ()
#6  0x00000000004d5c36 in xact_redo_commit ()
#7  0x00000000004d5f37 in xact_redo ()
#8  0x00000000004eb36a in StartupXLOG ()
#9  0x00000000006eac29 in StartupProcessMain ()
#10 0x00000000004f9eeb in AuxiliaryProcessMain ()
#11 0x00000000006e9ea1 in StartChildProcess ()
#12 0x00000000006e50ab in PostmasterMain ()
#13 0x000000000064e693 in main ()

Browsing the code, I saw this interesting comment in
ForgetRelationFsyncRequests:

void
ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
{
        if (pendingOpsTable)
        {
                /* standalone backend or startup process: fsync state is
local */
                RememberFsyncRequest(rnode, forknum,
FORGET_RELATION_FSYNC);
        }
        else if (IsUnderPostmaster)
        {
                /*
                 * Notify the checkpointer about it.  If we fail to queue
the cancel
                 * message, we have to sleep and try again ... ugly, but
hopefully
                 * won't happen often.
                 *
                 * XXX should we CHECK_FOR_INTERRUPTS in this loop?
Escaping with an
                 * error would leave the no-longer-used file still present
on disk,
                 * which would be bad, so I'm inclined to assume that the
checkpointer
                 * will always empty the queue soon.
                 */
                while (!ForwardFsyncRequest(rnode, forknum,
FORGET_RELATION_FSYNC))
                        pg_usleep(10000L);      /* 10 msec seems a good
number */

                /*
                 * Note we don't wait for the checkpointer to actually
absorb the
                 * cancel message; see mdsync() for the implications.
                 */
        }
}

My guess is that ForwardFsyncRequest() is continually returning false, and
this code is stuck forever. I noticed that it says that "I'm inclined to
assume that the checkpointer
will always empty the queue soon", but there is no checkpointer running
during recovery, is there?

Although I managed to recover the system by restoring from a different
backup the next day, bypassing this logfile, I think I still have the copy
that hangs on this logfile so I should be able to conduct further
investigations if required. Please let me know.

Cheers, Chris.

pgsql-bugs by date:

From: 1809563951@qq.com
Date: 22 June 2015, 14:19:40
Subject: BUG #13460: ERROR: could not find block containing chunk 0x2930072

From: Tom Lane
Date: 22 June 2015, 15:00:46
Subject: Re: BUG #13460: ERROR: could not find block containing chunk 0x2930072

BUG #13459: Replaying WAL logs can hang on startup - Mailing list pgsql-bugs

Previous

Next