9.3.9 and pg_multixact corruption - Mailing list pgsql-hackers
From | Bernd Helmle |
---|---|
Subject | 9.3.9 and pg_multixact corruption |
Date | |
Msg-id | 7E3C7F8D210AC9A423E96F3A@eje.local |
Responses | Re: 9.3.9 and pg_multixact corruption |
List | pgsql-hackers |
A customer had a severe issue with a PostgreSQL 9.3.9/sparc64/Solaris 11 instance.

The database crashed with the following log messages:

2015-09-08 00:49:16 CEST [2912] PANIC: could not access status of transaction 1068235595
2015-09-08 00:49:16 CEST [2912] DETAIL: Could not open file "pg_multixact/members/FFFF5FC4": No such file or directory.
2015-09-08 00:49:16 CEST [2912] STATEMENT: delete from StockTransfer where oid = $1 and tanum = $2

When they called us later, it turned out that the crash happened during a base backup, leaving a backup_label behind, which prevented the database from coming up again because of an invalid checkpoint location. However, removing the backup_label still didn't let the database get through recovery; it failed again with the former error, this time during recovery:

2015-09-08 11:40:04 CEST [27047] LOG: database system was interrupted while in recovery at 2015-09-08 11:19:44 CEST
2015-09-08 11:40:04 CEST [27047] HINT: This probably means that some data is corrupted and you will have to use the last backup for recovery.
2015-09-08 11:40:04 CEST [27047] LOG: database system was not properly shut down; automatic recovery in progress
2015-09-08 11:40:05 CEST [27047] LOG: redo starts at 1A52/2313FEF8
2015-09-08 11:40:47 CEST [27082] FATAL: the database system is starting up
2015-09-08 11:40:59 CEST [27047] FATAL: could not access status of transaction 1068235595
2015-09-08 11:40:59 CEST [27047] DETAIL: Could not seek in file "pg_multixact/members/FFFF5FC4" to offset 4294950912: Invalid argument.
2015-09-08 11:40:59 CEST [27047] CONTEXT: xlog redo create mxid 1068235595 offset 2147483648 nmembers 2: 2896635220 (upd) 2896635510 (keysh)
2015-09-08 11:40:59 CEST [27045] LOG: startup process (PID 27047) exited with exit code 1
2015-09-08 11:40:59 CEST [27045] LOG: aborting startup due to startup process failure

Some side notes:

An additional recovery from a base backup plus archive recovery yielded the same error as soon as the affected tuple was touched with a DELETE. The affected table was fully dumpable via pg_dump, though.

We also have a core dump, but no direct access to the machine. If there's more information required (and I believe it is), let me know where to dig deeper. I would also like to request a backtrace from the existing core dump, but in the absence of a sparc64 machine here we need to ask the customer to get one.

--
Thanks

Bernd
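A note on the numbers in the log above: the segment name FFFF5FC4 and the seek offset 4294950912 are both arithmetically consistent with the member offset 2147483648 (2^31) from the redo record being treated as a negative signed 32-bit value. The sketch below only reproduces that arithmetic; it is not PostgreSQL code and not a confirmed diagnosis. The constants (8192-byte blocks, 1636 members per page, 32 pages per segment file) are assumptions about a default build, the helper names are made up for illustration, and the idea that the logged seek offset is an unsigned rendering of a negative value is likewise an assumption.

```python
# Assumed constants for a default build; none of them appear in the report.
BLCKSZ = 8192             # assumed block size in bytes
MEMBERS_PER_PAGE = 1636   # assumed multixact members per members page
PAGES_PER_SEGMENT = 32    # assumed pages per segment file


def to_int32(value):
    """Reinterpret the low 32 bits of value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def c_div(a, b):
    """Integer division that truncates toward zero, as C does."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def c_mod(a, b):
    """Remainder matching c_div, so the sign follows the dividend."""
    return a - b * c_div(a, b)


def locate_member(offset, signed):
    """Hypothetical helper: map a member offset to (segment name, byte offset)."""
    off = to_int32(offset) if signed else offset
    page = c_div(off, MEMBERS_PER_PAGE)
    segment = c_div(page, PAGES_PER_SEGMENT)
    byte_offset = c_mod(page, PAGES_PER_SEGMENT) * BLCKSZ
    # Render both results as unsigned 32-bit values, as the log appears to do.
    return "%04X" % (segment & 0xFFFFFFFF), byte_offset & 0xFFFFFFFF


signed_name, signed_seek = locate_member(2147483648, signed=True)
unsigned_name, unsigned_seek = locate_member(2147483648, signed=False)

print(signed_name, signed_seek)      # FFFF5FC4 4294950912
print(unsigned_name, unsigned_seek)  # A03C 16384
```

Under these assumptions, the signed interpretation reproduces exactly the file name and seek offset reported in the log, while the unsigned interpretation would point at segment A03C instead. That may be a useful starting point when looking at the core dump.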