PITR COPY Failure (was Point in Time Recovery) - Mailing list pgsql-hackers
From | Mark Kirkwood |
---|---|
Subject | PITR COPY Failure (was Point in Time Recovery) |
Date | |
Msg-id | 40FA0F93.8010607@coretech.co.nz Whole thread Raw |
In response to | Re: Point in Time Recovery (markw@osdl.org) |
Responses |
Re: PITR COPY Failure (was Point in Time Recovery)
Re: PITR COPY Failure (was Point in Time Recovery) Re: PITR COPY Failure (was Point in Time Recovery) |
List | pgsql-hackers |
I decided to produce a nice simple example, so that anyone could hopefully replicate what I am seeing. The scenario is the same as before (the 11 steps), but the CREATE TABLE and COPY step has been reduced to: CREATE TABLE test0 (filler VARCHAR(120)); COPY test0 FROM '/data0/dump/test0.dat' USING DELIMITERS ','; Now the file 'test0.dat' consists of (128293) identical lines, each of 109 'a' charactors (plus end of line) A script to run the whole business can be found here : http://homepages.paradise.net.nz/markir/download/pitr-bug.tar.gz (It will need a bit of editing for things like location of Pg, PGDATA, and you will need to make your own data file) The main points of interest are: - anything <=128392 rows in test0.dat results in 1 archived log, and the recovery succeeds - anything >=128393 rows in test0.dat results in 2 or more archived logs, and recovery fails on the second log (and gives the zero length redo at 0/1FFFFE0 message). Let me know if I can do any more legwork on this (I am considering re-compiling with WAL_DEBUG now that example is simpler) regards Mark Simon Riggs wrote: >On Thu, 2004-07-15 at 10:47, Mark Kirkwood wrote: > > >>I tried what I thought was a straightforward scenario, and seem to have >>broken it :-( >> >>Here is the little tale >> >>1) initdb >>2) set archive_mode and archive_dest in postgresql.conf >>3) startup >>4) create database called 'test' >>5) connect to 'test' and type 'checkpoint' >>6) backup PGDATA using 'tar -zcvf' >>7) create tables in 'test' and add data using COPY (exactly 2 logs worth) >>8) shutdown and remove PGDATA >>9) recover using 'tar -zxvf' >>10) copy recovery.conf into PGDATA >>11) startup >> >>This is what I get : >> >>LOG: database system was interrupted at 2004-07-15 21:24:04 NZST >>LOG: recovery command file found... >>LOG: restore_program = cp %s/%s %s >>LOG: recovery_target_inclusive = true >>LOG: recovery_debug_log = true >>LOG: starting archive recovery >>LOG: restored log file "0000000000000000" from archive >>LOG: checkpoint record is at 0/A48054 >>LOG: redo record is at 0/A48054; undo record is at 0/0; shutdown FALSE >>LOG: next transaction ID: 496; next OID: 25419 >>LOG: database system was not properly shut down; automatic recovery in >>progress >>LOG: redo starts at 0/A48094 >>LOG: restored log file "0000000000000001" from archive >>LOG: record with zero length at 0/1FFFFE0 >>LOG: redo done at 0/1FFFF30 >>LOG: restored log file "0000000000000001" from archive >>LOG: restored log file "0000000000000001" from archive >>PANIC: concurrent transaction log activity while database system is >>shutting down >>LOG: startup process (PID 13492) was terminated by signal 6 >>LOG: aborting startup due to startup process failure >> >>The concurrent access is a bit of a puzzle, as this is my home machine >>(i.e. I am *sure* noone else is connected!) >> >> >> > >I can see what is wrong now, but you'll have to help me on details your >end... > >The log shows that xlog 1 was restored from archive. It contains a zero >length record, which indicates that it isn't yet full (or thats what the >existing recovery code assumes it means). Which also indicates that it >should never have been archived in the first place, and should not >therefore be a candidate for a restore from archive. > >The double message "restored log file" can only occur after you've >retrieved a partially full file from archive - which as I say, shouldn't >be there. > >Other messages are essentially spurious in those circumstances. > >Either: >- somehow the files have been mixed up in the archive directory, which >is possible if the filing discipline is not strict - various ways, >unfortunately I would guess this to be the most likely, somehow >- the file that has been restored has been damaged in some way >- the archiver has archived a file too early (very unlikely, IMHO - >thats the most robust bit of the code) >- some aspect of the code has written a zero length record to WAL (which >is supposed to not be possible, but we musn't discount an error in >recent committed work) > >- there may also be an effect going on with checkpoints that I don't >understand...spurious checkpoint warning messages have already been >observed and reported, > >Best regards, Simon Riggs > > > > > >
pgsql-hackers by date: