9.3.9 and pg_multixact corruption - Mailing list pgsql-hackers

From	Bernd Helmle
Subject	9.3.9 and pg_multixact corruption
Date	September 10, 2015 21:26:56
Msg-id	7E3C7F8D210AC9A423E96F3A@eje.local
List	pgsql-hackers

A customer had a severe issue with a PostgreSQL 9.3.9/sparc64/Solaris 11
instance.

The database crashed with the following log messages:

2015-09-08 00:49:16 CEST [2912] PANIC:  could not access status of
transaction 1068235595
2015-09-08 00:49:16 CEST [2912] DETAIL:  Could not open file
"pg_multixact/members/FFFF5FC4": No such file or directory.
2015-09-08 00:49:16 CEST [2912] STATEMENT:  delete from StockTransfer
where oid = $1 and tanum = $2 

When they called us later, it turned out that the crash had happened during a
base backup, leaving behind a backup_label that prevented the database from
coming up again, failing with an invalid checkpoint location. However,
removing the backup_label still didn't let the database get through recovery.
It failed again with the same error as before, this time during recovery:

2015-09-08 11:40:04 CEST [27047] LOG:  database system was interrupted
while in recovery at 2015-09-08 11:19:44 CEST
2015-09-08 11:40:04 CEST [27047] HINT:  This probably means that some data
is corrupted and you will have to use the last backup for recovery.
2015-09-08 11:40:04 CEST [27047] LOG:  database system was not properly
shut down; automatic recovery in progress
2015-09-08 11:40:05 CEST [27047] LOG:  redo starts at 1A52/2313FEF8
2015-09-08 11:40:47 CEST [27082] FATAL:  the database system is starting up
2015-09-08 11:40:59 CEST [27047] FATAL:  could not access status of
transaction 1068235595
2015-09-08 11:40:59 CEST [27047] DETAIL:  Could not seek in file
"pg_multixact/members/FFFF5FC4" to offset 4294950912: Invalid argument.
2015-09-08 11:40:59 CEST [27047] CONTEXT:  xlog redo create mxid 1068235595
offset 2147483648 nmembers 2: 2896635220 (upd) 2896635510 (keysh) 
2015-09-08 11:40:59 CEST [27045] LOG:  startup process (PID 27047) exited
with exit code 1
2015-09-08 11:40:59 CEST [27045] LOG:  aborting startup due to startup
process failure
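
As a sanity check on the numbers in the two log excerpts, here is a small
sketch that maps a multixact member offset to a segment file name and a seek
position. It is not taken from the server source; the constants (8 kB block
size, 1636 members per page, 32 pages per segment) and the helper names are
assumptions for illustration. Under those assumptions, treating the offset
2147483648 as a signed 32-bit value reproduces both the file name FFFF5FC4
and the seek offset 4294950912 from the log. That is only an observation
about the arithmetic, not a confirmed diagnosis.

```python
# Assumed constants (not stated in the log): default 8 kB block size and the
# 9.3-era multixact member layout.
BLCKSZ = 8192
MEMBERS_PER_GROUP = 4
GROUP_SIZE = MEMBERS_PER_GROUP * 4 + 4  # four 4-byte xids plus 4 flag bytes
MEMBERS_PER_PAGE = (BLCKSZ // GROUP_SIZE) * MEMBERS_PER_GROUP  # 1636
PAGES_PER_SEGMENT = 32


def to_int32(value):
    """Reinterpret the low 32 bits of value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def c_div(a, b):
    """Integer division truncating toward zero, as in C."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q


def locate_member(offset, signed):
    """Return (segment file name, byte offset in the segment as uint32).

    Hypothetical helper: with signed=True the offset is first reinterpreted
    as a signed 32-bit integer before the page and segment are computed.
    """
    off = to_int32(offset) if signed else offset
    page = c_div(off, MEMBERS_PER_PAGE)
    segment = c_div(page, PAGES_PER_SEGMENT)
    page_in_segment = page - segment * PAGES_PER_SEGMENT  # C-style remainder
    name = "%04X" % (segment & 0xFFFFFFFF)
    seek = (page_in_segment * BLCKSZ) & 0xFFFFFFFF
    return name, seek


if __name__ == "__main__":
    print(locate_member(2147483648, signed=True))
    print(locate_member(2147483648, signed=False))
```

With signed=True the sketch returns ('FFFF5FC4', 4294950912), matching the
log. With signed=False it returns ('A03C', 16384).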

Some side notes:

An additional restore from a base backup followed by archive recovery led to
the same error as soon as the affected tuple was touched with a DELETE. The
affected table was fully dumpable via pg_dump, though.

We also have a core dump, but no direct access to the machine. If more
information is required (and I believe it is), let me know where to dig
deeper. I would also like to get a backtrace from the existing core
dump, but in the absence of a sparc64 machine here we need to ask the
customer for one.

-- 
Thanks
Bernd
