BUG #13459: Replaying WAL logs can hang on startup - Mailing list pgsql-bugs
From | chris+postgresql@qwirx.com |
---|---|
Subject | BUG #13459: Replaying WAL logs can hang on startup |
Date | |
Msg-id | 20150622084201.3874.74819@wrigleys.postgresql.org Whole thread Raw |
Responses |
Re: BUG #13459: Replaying WAL logs can hang on startup
|
List | pgsql-bugs |
The following bug has been logged on the website: Bug reference: 13459 Logged by: Chris Wilson Email address: chris+postgresql@qwirx.com PostgreSQL version: 9.4.1 Operating system: Linux 2.6.32-220.30.1.el6.x86_64 Description: Dear developers, I restored a standby from a pg_basebackup made from the master the previous morning. As a result, it had a lot of WAL logs to catch up on. At one point it hangs while restoring logs. It normally takes a few seconds to process a 16 MB WAL segment, but on this one, it was "recovering 0000000100000CEC00000025" for 7 minutes now with no log output at all. I restarted it at least once and it reached the same point and hung again. I attached gdb and saw that it was hung here: (gdb) bt #0 0x0000003f358e1433 in __select_nocancel () from /lib64/libc.so.6 #1 0x00000000008be748 in pg_usleep () #2 0x000000000074ef9f in ForgetRelationFsyncRequests () #3 0x000000000074d0b0 in mdunlink () #4 0x000000000074faa5 in smgrdounlink () #5 0x00000000004d5b3a in xact_redo_commit_internal () #6 0x00000000004d5c36 in xact_redo_commit () #7 0x00000000004d5f37 in xact_redo () #8 0x00000000004eb36a in StartupXLOG () #9 0x00000000006eac29 in StartupProcessMain () #10 0x00000000004f9eeb in AuxiliaryProcessMain () #11 0x00000000006e9ea1 in StartChildProcess () #12 0x00000000006e50ab in PostmasterMain () #13 0x000000000064e693 in main () Browsing the code, I saw this interesting comment in ForgetRelationFsyncRequests: void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum) { if (pendingOpsTable) { /* standalone backend or startup process: fsync state is local */ RememberFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC); } else if (IsUnderPostmaster) { /* * Notify the checkpointer about it. If we fail to queue the cancel * message, we have to sleep and try again ... ugly, but hopefully * won't happen often. * * XXX should we CHECK_FOR_INTERRUPTS in this loop? Escaping with an * error would leave the no-longer-used file still present on disk, * which would be bad, so I'm inclined to assume that the checkpointer * will always empty the queue soon. */ while (!ForwardFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC)) pg_usleep(10000L); /* 10 msec seems a good number */ /* * Note we don't wait for the checkpointer to actually absorb the * cancel message; see mdsync() for the implications. */ } } My guess is that ForwardFsyncRequest() is continually returning false, and this code is stuck forever. I noticed that it says that "I'm inclined to assume that the checkpointer will always empty the queue soon", but there is no checkpointer running during recovery, is there? Although I managed to recover the system by restoring from a different backup the next day, bypassing this logfile, I think I still have the copy that hangs on this logfile so I should be able to conduct further investigations if required. Please let me know. Cheers, Chris.
pgsql-bugs by date: