Thread: db corruption/recovery help
Someone flipped a breaker switch, and evidently triggered corruption in
one of our major clusters:

$ cat server_log.Mon
postmaster successfully started
2005-06-06 14:31:11.950 [20124] LOG: database system was interrupted being in recovery at 2005-06-06 14:29:01 EDT
	This probably means that some data blocks are corrupted
	and you will have to use the last backup for recovery.
2005-06-06 14:31:11.950 [20124] LOG: checkpoint record is at EF/EBB7AFC8
2005-06-06 14:31:11.950 [20124] LOG: redo record is at EF/EBA91EF0; undo record is at 0/0; shutdown FALSE
2005-06-06 14:31:11.950 [20124] LOG: next transaction id: 577477594; next oid: 89750885
2005-06-06 14:31:11.951 [20124] LOG: database system was not properly shut down; automatic recovery in progress
2005-06-06 14:31:11.952 [20124] LOG: redo starts at EF/EBA91EF0
2005-06-06 14:31:11.984 [20124] PANIC: Invalid page header in block 22376 of 79189398
2005-06-06 14:31:12.275 [20121] LOG: startup process (pid 20124) was terminated by signal 6
2005-06-06 14:31:12.275 [20121] LOG: aborting startup due to startup process failure

We have backups from 10 hours earlier, but the obvious question: what,
if anything, can I do now to salvage those 10 hours of data from this?

I guess I could zero the block?  I'm a little uncertain since I don't
know what I'm zeroing (pg_database?  pg_class?), and can't start up to
see what that relfilenode maps to...

Going to look at it with pg_filedump, maybe oid2filename or whatever
that utility is...

Thanks,
Ed
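P.S. If I do end up zeroing the block, here's roughly what I have in
mind, once I know which relation 79189398 is.  Just a sketch, untested;
it assumes the default 8k block size, and the pg_class query only works
connected to the owning database on a postmaster that will actually
start (e.g. a copy restored elsewhere).  The <dboid> below is a
placeholder for whatever database subdirectory the file lives in:

    -- map the relfilenode back to a relation name
    SELECT relname FROM pg_class WHERE relfilenode = 79189398;

    # back up the file first, then zero out just block 22376
    dd if=/dev/zero of=$PGDATA/base/<dboid>/79189398 \
       bs=8192 seek=22376 count=1 conv=notrunc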
On Monday June 6 2005 2:16 pm, Ed L. wrote:
> Someone flipped a breaker switch, and evidently triggered
> corruption in one of our major clusters:

BTW, this is a 7.3.4 cluster ...

Ed
On Mon, 2005-06-06 at 15:16, Ed L. wrote:
> Someone flipped a breaker switch, and evidently triggered
> corruption in one of our major clusters:
>
> $ cat server_log.Mon
> postmaster successfully started
> 2005-06-06 14:31:11.950 [20124] LOG: database system was interrupted being in recovery at 2005-06-06 14:29:01 EDT
> 	This probably means that some data blocks are corrupted
> 	and you will have to use the last backup for recovery.
> 2005-06-06 14:31:11.950 [20124] LOG: checkpoint record is at EF/EBB7AFC8
> 2005-06-06 14:31:11.950 [20124] LOG: redo record is at EF/EBA91EF0; undo record is at 0/0; shutdown FALSE
> 2005-06-06 14:31:11.950 [20124] LOG: next transaction id: 577477594; next oid: 89750885
> 2005-06-06 14:31:11.951 [20124] LOG: database system was not properly shut down; automatic recovery in progress
> 2005-06-06 14:31:11.952 [20124] LOG: redo starts at EF/EBA91EF0
> 2005-06-06 14:31:11.984 [20124] PANIC: Invalid page header in block 22376 of 79189398
> 2005-06-06 14:31:12.275 [20121] LOG: startup process (pid 20124) was terminated by signal 6
> 2005-06-06 14:31:12.275 [20121] LOG: aborting startup due to startup process failure
>
> We have backups from 10 hours earlier, but the obvious
> question: what, if anything, can I do now to salvage those
> 10 hours of data from this?
>
> I guess I could zero the block?  I'm a little uncertain since
> I don't know what I'm zeroing (pg_database?  pg_class?), and
> can't start up to see what that relfilenode maps to...
>
> Going to look at it with pg_filedump, maybe oid2filename or
> whatever that utility is...

OK, if postgresql is running on hardware that does NOT lie about
fsyncing, and it is set to fsync, this should NEVER happen.  The fact
that it happened means either A: the hardware is lying / broken (like
default IDE drives or a RAID controller with a non-battery-backed
cache) or B: the database was set to not fsync.  Since the default IS
to fsync, this would require effort on your part to turn it off.

Now, once the database comes up like this, your data may or may not be
salvageable, and may or may not be coherent.  I.e. you may have
orphaned records where you shouldn't, or duplicate entries in some
unique index or something like that.

Look for pg_resetxlog.  There's a man page for it on my machine, but
I'm not sure about 7.3.x.

Before doing anything, make a file system level backup of all your data
so you have a pristine set to play about with.
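Something along these lines -- just a sketch, with the postmaster shut
down, and I'm going from memory on the pg_resetxlog invocation for 7.3,
so check the docs before running it.  Adjust the data directory path
for your install:

    # pristine copy of the whole cluster before touching anything
    tar cf - /path/to/data | gzip > /backup/pgdata-pristine.tar.gz

    # last resort: throw away the WAL so the postmaster can start.
    # This can lose or leave inconsistent any recently committed data,
    # so only do it after the backup above.
    pg_resetxlog /path/to/data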
On Monday June 6 2005 3:17 pm, Scott Marlowe wrote:
> On Mon, 2005-06-06 at 15:16, Ed L. wrote:
> > Someone flipped a breaker switch, and evidently triggered
> > corruption in one of our major clusters:
>
> OK, if postgresql is running on hardware that does NOT lie
> about fsyncing, and it is set to fsync, this should NEVER
> happen.

This is 7.3.4 running on an HP-UX 11.00 9000/800 PA-RISC box with
fsync = TRUE, built with gcc 3.2.2.  Database is entirely on a SAN.

We got very lucky: the corrupted database was expendable (essentially a
log database).  I was able to just move the data/base/NNNN directory
off to the side, restart, drop the corrupted db, and recreate schema...

Thanks,
Ed
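P.S. For the archives, the salvage sequence was essentially the
following.  NNNN is the OID-named subdirectory of the corrupted
database, and "logdb" / "schema.sql" are stand-in names here, not the
real ones.  Only sane because we could afford to lose that database:

    # postmaster stopped; keep the damaged files for later inspection
    mv $PGDATA/base/NNNN /elsewhere/base.NNNN.corrupt

    pg_ctl -D $PGDATA start
    dropdb logdb
    createdb logdb
    psql logdb < schema.sql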
On Monday June 6 2005 3:29 pm, Ed L. wrote:
> On Monday June 6 2005 3:17 pm, Scott Marlowe wrote:
> > On Mon, 2005-06-06 at 15:16, Ed L. wrote:
> > > Someone flipped a breaker switch, and evidently triggered
> > > corruption in one of our major clusters:
> >
> > OK, if postgresql is running on hardware that does NOT lie
> > about fsyncing, and it is set to fsync, this should NEVER
> > happen.
>
> This is 7.3.4 running on an HP-UX 11.00 9000/800 PA-RISC box
> with fsync = TRUE, built with gcc 3.2.2.  Database is entirely
> on a SAN.
>
> We got very lucky: the corrupted database was expendable
> (essentially a log database).  I was able to just move the
> data/base/NNNN directory off to the side, restart, drop the
> corrupted db, and recreate schema...

The SAN never lost power, only the system itself.  I'd really like to
chase this to the root if possible.  Ideas?

Ed
On Mon, 2005-06-06 at 16:39, Ed L. wrote:
> On Monday June 6 2005 3:29 pm, Ed L. wrote:
> > On Monday June 6 2005 3:17 pm, Scott Marlowe wrote:
> > > On Mon, 2005-06-06 at 15:16, Ed L. wrote:
> > > > Someone flipped a breaker switch, and evidently triggered
> > > > corruption in one of our major clusters:
> > >
> > > OK, if postgresql is running on hardware that does NOT lie
> > > about fsyncing, and it is set to fsync, this should NEVER
> > > happen.
> >
> > This is 7.3.4 running on an HP-UX 11.00 9000/800 PA-RISC box
> > with fsync = TRUE, built with gcc 3.2.2.  Database is entirely
> > on a SAN.
> >
> > We got very lucky: the corrupted database was expendable
> > (essentially a log database).  I was able to just move the
> > data/base/NNNN directory off to the side, restart, drop the
> > corrupted db, and recreate schema...
>
> The SAN never lost power, only the system itself.  I'd really
> like to chase this to the root if possible.  Ideas?

It sounds like somewhere between postgresql and the SAN connector going
out the back, something is lying about fsync.  I'm not that familiar
with lots of different SAN setups, so you might want to describe how
things are set up and see if anyone else knows more about them than me.
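In the meantime it's cheap to confirm the running server agrees with
the config file about fsync, even though you already said it's TRUE --
something like:

    psql -c "show fsync;" template1
    grep -i fsync /path/to/data/postgresql.conf

Beyond that, the usual suspect is a write cache somewhere in the path
(FC adapter or array controller) acknowledging writes before they're
really on stable storage, same idea as the non-battery-backed RAID
cache I mentioned.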
"Ed L." <pgsql@bluepolka.net> writes: > Someone flipped a breaker switch, and evidently triggered > corruption in one of our major clusters: > 2005-06-06 14:31:11.984 [20124] PANIC: Invalid page header in block 22376 of 79189398 It's possible that you are one minor release short of having dodged this problem, as I see in the 7.3.5 CVS log 2003-12-01 11:53 tgl * src/backend/storage/buffer/bufmgr.c (REL7_3_STABLE): Force zero_damaged_pages to be effectively ON during recovery from WAL, since there is no need to worry about damaged pages when we are going to overwrite them anyway from the WAL. Per recent discussion. This doesn't really address the question of how the page header got clobbered in the first place, though. Did you by any chance make a dump to see what data was in there? regards, tom lane
On Monday June 6 2005 11:15 pm, Tom Lane wrote:
> It's possible that you are one minor release short of having
> dodged this problem, as I see in the 7.3.5 CVS log
>
> 	* src/backend/storage/buffer/bufmgr.c (REL7_3_STABLE): Force
> 	zero_damaged_pages to be effectively ON during recovery from
> 	WAL, since there is no need to worry about damaged pages when
> 	we are going to overwrite them anyway from the WAL.  Per
> 	recent discussion.

I remember all too well.

> This doesn't really address the question of how the page
> header got clobbered in the first place, though.  Did you by
> any chance make a dump to see what data was in there?

Well, I do have a copy of the corrupted database/file, if that's what
you mean.  I rebuilt the 7.3.4 source (which I don't normally delete)
so I could build pg_filedump 2.0, but then pg_filedump wouldn't build.
Not sure why; I didn't have the luxury of digging deeper.  Maybe I can
find a pg_filedump lying around somewhere here, or sort out the build
issue...

Ed
On Monday June 6 2005 11:15 pm, Tom Lane wrote:
> This doesn't really address the question of how the page
> header got clobbered in the first place, though.  Did you by
> any chance make a dump to see what data was in there?

I couldn't start the postmaster at all with that data in the cluster,
so no dump.  But I do have the corrupted file.  Haven't gotten free yet
to set up pg_filedump on it.

The hardware setup is an HP RP5470 4-way running 11.00, with dual HP
Tachyon XL2 Fibre Channel Mass Storage Adapters going into dual Cisco
9509 SAN switches to dual EVA 5000 disk arrays, with CPQswsp
A.3.0B.01F.00F (patch upgrade for the Sanworks Secure Path device
driver and utilities).  Not sure that reveals much of the problem...

Ed
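P.S. When I do get pg_filedump built, the plan is roughly the
following, run against the saved copy of the file.  I'm going from
memory on the switches, so treat it as a sketch:

    # dump just the damaged block, formatted hex plus item interpretation
    pg_filedump -i -f -R 22376 22376 /elsewhere/base.NNNN.corrupt/79189398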