Thread: Server Panic when trying to stop point in time recovery

Server Panic when trying to stop point in time recovery

From

Chris Copeland

Date:

27 April 2010, 17:49:06

I am running 8.2.4 on Solaris 10 x86.

I have setup WAL file shipping from a primary server to a warm standby. I am able to start the standby server using a backup from the primary and get it to apply the log files as they arrive. My problem comes when I want to trigger the standby server to come out of recovery mode.

Here is the log file starting from when the server comes up from the backup. Just prior to the error I have "triggered" the server to exit recovery mode by making my restore_command return 1 instead of 0.

2010-04-27 15:00:58 CDT :LOG: database system was interrupted at 2010-04-27 10:10:08 CDT
2010-04-27 15:00:58 CDT :LOG: starting archive recovery
2010-04-27 15:00:58 CDT :LOG: restore_command = "/opt/data/restore.sh /opt/wal/archwalremote/%f %p"
2010-04-27 15:00:58 CDT :LOG: restored log file "000000010000009F000000BA.00000278.backup" from archive
2010-04-27 15:00:59 CDT :LOG: restored log file "000000010000009F000000BA" from archive
2010-04-27 15:00:59 CDT :LOG: checkpoint record is at 9F/BA000278
2010-04-27 15:00:59 CDT :LOG: redo record is at 9F/BA000278; undo record is at 0/0; shutdown FALSE
2010-04-27 15:00:59 CDT :LOG: next transaction ID: 0/325985316; next OID: 823081
2010-04-27 15:00:59 CDT :LOG: next MultiXactId: 2127; next MultiXactOffset: 4278
2010-04-27 15:00:59 CDT :LOG: automatic recovery in progress
2010-04-27 15:00:59 CDT :LOG: redo starts at 9F/BA0002C0
2010-04-27 15:01:00 CDT :LOG: restored log file "000000010000009F000000BB" from archive
2010-04-27 15:01:02 CDT :LOG: restored log file "000000010000009F000000BC" from archive

<snip, many more files restored>

2010-04-27 15:03:19 CDT :LOG: restored log file "000000010000009F000000FE" from archive
2010-04-27 15:03:20 CDT :LOG: restored log file "00000001000000A000000000" from archive
2010-04-27 15:06:00 CDT :LOG: restored log file "00000001000000A000000001" from archive
2010-04-27 15:09:21 CDT :LOG: could not open file "pg_xlog/00000001000000A000000002" (log file 160, segment 2): No such file or directory
2010-04-27 15:09:21 CDT :LOG: redo done at A0/1000068
2010-04-27 15:09:21 CDT :PANIC: could not open file "pg_xlog/00000001000000A000000001" (log file 160, segment 1): No such file or directory
2010-04-27 15:09:26 CDT :LOG: startup process (PID 22490) was terminated by signal 6
2010-04-27 15:09:26 CDT :LOG: aborting startup due to startup process failure
2010-04-27 15:09:26 CDT :LOG: logger shutting down

--------------------

At this point the server will now enter a restart loop and constantly generate log files like this :

2010-04-27 15:09:26 CDT :LOG: database system was interrupted while in recovery at log time 2010-04-27 15:05:08 CDT
2010-04-27 15:09:26 CDT :HINT: If this has occurred more than once some data may be corrupted and you may need to choose an earlier recovery target.
2010-04-27 15:09:26 CDT :LOG: starting archive recovery
2010-04-27 15:09:26 CDT :LOG: restore_command = "/opt/data/restore.sh /opt/wal/archwalremote/%f %p"
2010-04-27 15:09:26 CDT :LOG: could not open file "pg_xlog/00000001000000A000000001" (log file 160, segment 1): No such file or directory
2010-04-27 15:09:26 CDT :LOG: invalid primary checkpoint record
2010-04-27 15:09:26 CDT :LOG: could not open file "pg_xlog/000000010000009F000000BA" (log file 159, segment 186): No such file or directory
2010-04-27 15:09:26 CDT :LOG: invalid secondary checkpoint record
2010-04-27 15:09:26 CDT :PANIC: could not locate a valid checkpoint record
2010-04-27 15:09:30 CDT :LOG: startup process (PID 24191) was terminated by signal 6
2010-04-27 15:09:30 CDT :LOG: aborting startup due to startup process failure
2010-04-27 15:09:30 CDT :LOG: logger shutting down

--------------------

Any help is greatly appreciated. Please let me know if I can provide any more information that will be helpful.

-Chris

Re: Server Panic when trying to stop point in time recovery

From

"Joshua D. Drake"

Date:

27 April 2010, 18:03:31

On Tue, 2010-04-27 at 15:39 -0500, Chris Copeland wrote:

> Any help is greatly appreciated.  Please let me know if I can provide
> any more information that will be helpful.

Well the problem is that the files are missing. How are you copying
them?

Joshua D. Drake



>
> -Chris
>


--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering

Re: Server Panic when trying to stop point in time recovery

From

Chris Copeland

Date:

27 April 2010, 18:32:45

On Tue, Apr 27, 2010 at 4:02 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
>>
> Well the problem is that the files are missing. How are you copying
> them?
>
> Joshua D. Drake
>

I am using rsync to copy the files from the master into a directory
/opt/wal/archwalremote on the standby.

My restore script copies the requested file from
/opt/wal/archwalremote to wherever it is asked to
(pg_xlog/RECOVERYXLOG).

Here is the end of the log from my restore script.  First,
00000001000000A000000001 is requested and copied.  Then
00000001000000A000000002 is requested, but it has not been synced yet
so the script is sleeping and polling for that file to exist.  Before
the file exists, the trigger is detected and the restore script starts
returning 1 to all requests.

Tue Apr 27 15:06:00 CDT 2010 > cp
/opt/wal/archwalremote/00000001000000A000000001 pg_xlog/RECOVERYXLOG
Tue Apr 27 15:06:00 CDT 2010 > cp returned 0, script returning 0
Tue Apr 27 15:06:00 CDT 2010 Starting up... From
/opt/wal/archwalremote/00000001000000A000000002 TO
pg_xlog/RECOVERYXLOG
Tue Apr 27 15:06:00 CDT 2010
/opt/wal/archwalremote/00000001000000A000000002 not found
Tue Apr 27 15:06:00 CDT 2010 Sleeping
Tue Apr 27 15:09:11 CDT 2010
/opt/wal/archwalremote/00000001000000A000000002 not found
Tue Apr 27 15:09:11 CDT 2010 Sleeping
Tue Apr 27 15:09:21 CDT 2010 Triggered
Tue Apr 27 15:09:21 CDT 2010 Starting up... From
/opt/wal/archwalremote/00000001000000A000000001 TO
pg_xlog/RECOVERYXLOG
Tue Apr 27 15:09:21 CDT 2010 Triggered

00000001000000A000000001 is found and restored (see original log
file).  I guess the question is:  why, after the server should be out
of recovery mode, is it looking for 00000001000000A000000001 in the
pg_xlog directory?

Re: Server Panic when trying to stop point in timerecovery

From

"Joshua D. Drake"

Date:

28 April 2010, 01:02:28

On Tue, 2010-04-27 at 15:39 -0500, Chris Copeland wrote:

> Any help is greatly appreciated.  Please let me know if I can provide
> any more information that will be helpful.

Well the problem is that the files are missing. How are you copying
them?

Joshua D. Drake



>
> -Chris
>


--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering