Thread: Server Panic when trying to stop point in time recovery
I am running 8.2.4 on Solaris 10 x86.
I have setup WAL file shipping from a primary server to a warm standby. I am able to start the standby server using a backup from the primary and get it to apply the log files as they arrive. My problem comes when I want to trigger the standby server to come out of recovery mode.
Here is the log file starting from when the server comes up from the backup. Just prior to the error I have "triggered" the server to exit recovery mode by making my restore_command return 1 instead of 0.
2010-04-27 15:00:58 CDT :LOG: database system was interrupted at 2010-04-27 10:10:08 CDT
2010-04-27 15:00:58 CDT :LOG: starting archive recovery
2010-04-27 15:00:58 CDT :LOG: restore_command = "/opt/data/restore.sh /opt/wal/archwalremote/%f %p"
2010-04-27 15:00:58 CDT :LOG: restored log file "000000010000009F000000BA.00000278.backup" from archive
2010-04-27 15:00:59 CDT :LOG: restored log file "000000010000009F000000BA" from archive
2010-04-27 15:00:59 CDT :LOG: checkpoint record is at 9F/BA000278
2010-04-27 15:00:59 CDT :LOG: redo record is at 9F/BA000278; undo record is at 0/0; shutdown FALSE
2010-04-27 15:00:59 CDT :LOG: next transaction ID: 0/325985316; next OID: 823081
2010-04-27 15:00:59 CDT :LOG: next MultiXactId: 2127; next MultiXactOffset: 4278
2010-04-27 15:00:59 CDT :LOG: automatic recovery in progress
2010-04-27 15:00:59 CDT :LOG: redo starts at 9F/BA0002C0
2010-04-27 15:01:00 CDT :LOG: restored log file "000000010000009F000000BB" from archive
2010-04-27 15:01:02 CDT :LOG: restored log file "000000010000009F000000BC" from archive
<snip, many more files restored>
2010-04-27 15:03:19 CDT :LOG: restored log file "000000010000009F000000FE" from archive
2010-04-27 15:03:20 CDT :LOG: restored log file "00000001000000A000000000" from archive
2010-04-27 15:06:00 CDT :LOG: restored log file "00000001000000A000000001" from archive
2010-04-27 15:09:21 CDT :LOG: could not open file "pg_xlog/00000001000000A000000002" (log file 160, segment 2): No such file or directory
2010-04-27 15:09:21 CDT :LOG: redo done at A0/1000068
2010-04-27 15:09:21 CDT :PANIC: could not open file "pg_xlog/00000001000000A000000001" (log file 160, segment 1): No such file or directory
2010-04-27 15:09:26 CDT :LOG: startup process (PID 22490) was terminated by signal 6
2010-04-27 15:09:26 CDT :LOG: aborting startup due to startup process failure
2010-04-27 15:09:26 CDT :LOG: logger shutting down
--------------------
At this point the server will now enter a restart loop and constantly generate log files like this :
2010-04-27 15:09:26 CDT :LOG: database system was interrupted while in recovery at log time 2010-04-27 15:05:08 CDT
2010-04-27 15:09:26 CDT :HINT: If this has occurred more than once some data may be corrupted and you may need to choose an earlier recovery target.
2010-04-27 15:09:26 CDT :LOG: starting archive recovery
2010-04-27 15:09:26 CDT :LOG: restore_command = "/opt/data/restore.sh /opt/wal/archwalremote/%f %p"
2010-04-27 15:09:26 CDT :LOG: could not open file "pg_xlog/00000001000000A000000001" (log file 160, segment 1): No such file or directory
2010-04-27 15:09:26 CDT :LOG: invalid primary checkpoint record
2010-04-27 15:09:26 CDT :LOG: could not open file "pg_xlog/000000010000009F000000BA" (log file 159, segment 186): No such file or directory
2010-04-27 15:09:26 CDT :LOG: invalid secondary checkpoint record
2010-04-27 15:09:26 CDT :PANIC: could not locate a valid checkpoint record
2010-04-27 15:09:30 CDT :LOG: startup process (PID 24191) was terminated by signal 6
2010-04-27 15:09:30 CDT :LOG: aborting startup due to startup process failure
2010-04-27 15:09:30 CDT :LOG: logger shutting down
--------------------
Any help is greatly appreciated. Please let me know if I can provide any more information that will be helpful.
-Chris
I have setup WAL file shipping from a primary server to a warm standby. I am able to start the standby server using a backup from the primary and get it to apply the log files as they arrive. My problem comes when I want to trigger the standby server to come out of recovery mode.
Here is the log file starting from when the server comes up from the backup. Just prior to the error I have "triggered" the server to exit recovery mode by making my restore_command return 1 instead of 0.
2010-04-27 15:00:58 CDT :LOG: database system was interrupted at 2010-04-27 10:10:08 CDT
2010-04-27 15:00:58 CDT :LOG: starting archive recovery
2010-04-27 15:00:58 CDT :LOG: restore_command = "/opt/data/restore.sh /opt/wal/archwalremote/%f %p"
2010-04-27 15:00:58 CDT :LOG: restored log file "000000010000009F000000BA.00000278.backup" from archive
2010-04-27 15:00:59 CDT :LOG: restored log file "000000010000009F000000BA" from archive
2010-04-27 15:00:59 CDT :LOG: checkpoint record is at 9F/BA000278
2010-04-27 15:00:59 CDT :LOG: redo record is at 9F/BA000278; undo record is at 0/0; shutdown FALSE
2010-04-27 15:00:59 CDT :LOG: next transaction ID: 0/325985316; next OID: 823081
2010-04-27 15:00:59 CDT :LOG: next MultiXactId: 2127; next MultiXactOffset: 4278
2010-04-27 15:00:59 CDT :LOG: automatic recovery in progress
2010-04-27 15:00:59 CDT :LOG: redo starts at 9F/BA0002C0
2010-04-27 15:01:00 CDT :LOG: restored log file "000000010000009F000000BB" from archive
2010-04-27 15:01:02 CDT :LOG: restored log file "000000010000009F000000BC" from archive
<snip, many more files restored>
2010-04-27 15:03:19 CDT :LOG: restored log file "000000010000009F000000FE" from archive
2010-04-27 15:03:20 CDT :LOG: restored log file "00000001000000A000000000" from archive
2010-04-27 15:06:00 CDT :LOG: restored log file "00000001000000A000000001" from archive
2010-04-27 15:09:21 CDT :LOG: could not open file "pg_xlog/00000001000000A000000002" (log file 160, segment 2): No such file or directory
2010-04-27 15:09:21 CDT :LOG: redo done at A0/1000068
2010-04-27 15:09:21 CDT :PANIC: could not open file "pg_xlog/00000001000000A000000001" (log file 160, segment 1): No such file or directory
2010-04-27 15:09:26 CDT :LOG: startup process (PID 22490) was terminated by signal 6
2010-04-27 15:09:26 CDT :LOG: aborting startup due to startup process failure
2010-04-27 15:09:26 CDT :LOG: logger shutting down
--------------------
At this point the server will now enter a restart loop and constantly generate log files like this :
2010-04-27 15:09:26 CDT :LOG: database system was interrupted while in recovery at log time 2010-04-27 15:05:08 CDT
2010-04-27 15:09:26 CDT :HINT: If this has occurred more than once some data may be corrupted and you may need to choose an earlier recovery target.
2010-04-27 15:09:26 CDT :LOG: starting archive recovery
2010-04-27 15:09:26 CDT :LOG: restore_command = "/opt/data/restore.sh /opt/wal/archwalremote/%f %p"
2010-04-27 15:09:26 CDT :LOG: could not open file "pg_xlog/00000001000000A000000001" (log file 160, segment 1): No such file or directory
2010-04-27 15:09:26 CDT :LOG: invalid primary checkpoint record
2010-04-27 15:09:26 CDT :LOG: could not open file "pg_xlog/000000010000009F000000BA" (log file 159, segment 186): No such file or directory
2010-04-27 15:09:26 CDT :LOG: invalid secondary checkpoint record
2010-04-27 15:09:26 CDT :PANIC: could not locate a valid checkpoint record
2010-04-27 15:09:30 CDT :LOG: startup process (PID 24191) was terminated by signal 6
2010-04-27 15:09:30 CDT :LOG: aborting startup due to startup process failure
2010-04-27 15:09:30 CDT :LOG: logger shutting down
--------------------
Any help is greatly appreciated. Please let me know if I can provide any more information that will be helpful.
-Chris
On Tue, 2010-04-27 at 15:39 -0500, Chris Copeland wrote: > Any help is greatly appreciated. Please let me know if I can provide > any more information that will be helpful. Well the problem is that the files are missing. How are you copying them? Joshua D. Drake > > -Chris > -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564 Consulting, Training, Support, Custom Development, Engineering
On Tue, Apr 27, 2010 at 4:02 PM, Joshua D. Drake <jd@commandprompt.com> wrote: >> > Well the problem is that the files are missing. How are you copying > them? > > Joshua D. Drake > I am using rsync to copy the files from the master into a directory /opt/wal/archwalremote on the standby. My restore script copies the requested file from /opt/wal/archwalremote to wherever it is asked to (pg_xlog/RECOVERYXLOG). Here is the end of the log from my restore script. First, 00000001000000A000000001 is requested and copied. Then 00000001000000A000000002 is requested, but it has not been synced yet so the script is sleeping and polling for that file to exist. Before the file exists, the trigger is detected and the restore script starts returning 1 to all requests. Tue Apr 27 15:06:00 CDT 2010 > cp /opt/wal/archwalremote/00000001000000A000000001 pg_xlog/RECOVERYXLOG Tue Apr 27 15:06:00 CDT 2010 > cp returned 0, script returning 0 Tue Apr 27 15:06:00 CDT 2010 Starting up... From /opt/wal/archwalremote/00000001000000A000000002 TO pg_xlog/RECOVERYXLOG Tue Apr 27 15:06:00 CDT 2010 /opt/wal/archwalremote/00000001000000A000000002 not found Tue Apr 27 15:06:00 CDT 2010 Sleeping Tue Apr 27 15:09:11 CDT 2010 /opt/wal/archwalremote/00000001000000A000000002 not found Tue Apr 27 15:09:11 CDT 2010 Sleeping Tue Apr 27 15:09:21 CDT 2010 Triggered Tue Apr 27 15:09:21 CDT 2010 Starting up... From /opt/wal/archwalremote/00000001000000A000000001 TO pg_xlog/RECOVERYXLOG Tue Apr 27 15:09:21 CDT 2010 Triggered 00000001000000A000000001 is found and restored (see original log file). I guess the question is: why, after the server should be out of recovery mode, is it looking for 00000001000000A000000001 in the pg_xlog directory?
On Tue, 2010-04-27 at 15:39 -0500, Chris Copeland wrote: > Any help is greatly appreciated. Please let me know if I can provide > any more information that will be helpful. Well the problem is that the files are missing. How are you copying them? Joshua D. Drake > > -Chris > -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564 Consulting, Training, Support, Custom Development, Engineering