We are using Postgres 9.6.8 (planning to upgrade to 9.6.9 soon) on RHEL 6.9.
We recently experienced two similar outages on two different prod databases. The error messages from the logs were as follows:
    LOG: server process (PID 138529) was terminated by signal 6: Aborted
    LOG: terminating any other active server processes
    WARNING: terminating connection because of crash of another server process
    DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
We are still investigating what may have triggered these errors, since there were no recent changes to these databases. Unfortunately, core dumps were not configured correctly, so we may have to wait for the next outage before we can do a good root cause analysis.
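For next time, we have drafted something along these lines to get core dumps enabled (a rough sketch for RHEL 6; the service name and paths are assumptions based on a stock PGDG install, and abrt's ownership of core_pattern may differ on your boxes):

    # Allow the postgres user to write core files. This applies via
    # pam_limits when the init script starts the server with "su -l postgres".
    cat >> /etc/security/limits.conf <<'EOF'
    postgres  soft  core  unlimited
    postgres  hard  core  unlimited
    EOF

    # On RHEL 6, abrt typically owns kernel.core_pattern via a pipe to
    # abrt-hook-ccpp; point it at a plain directory instead. (Persist the
    # setting in /etc/sysctl.conf as well.)
    mkdir -p /var/lib/pgsql/cores && chown postgres: /var/lib/pgsql/cores
    sysctl -w kernel.core_pattern=/var/lib/pgsql/cores/core-%e-%p-%t

    # Restart so the new limit takes effect, then confirm it from the
    # postmaster (the first line of postmaster.pid is its PID).
    service postgresql-9.6 restart
    grep core /proc/$(head -1 /var/lib/pgsql/9.6/data/postmaster.pid)/limits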
My question, meanwhile, is around remedial actions to take when this happens.
In one case, the logs recorded:

    LOG: all server processes terminated; reinitializing
    LOG: incomplete data in "postmaster.pid": found only 1 newlines while trying to add line 7
    LOG: database system was not properly shut down; automatic recovery in progress
    LOG: redo starts at 365/FDFA738
    LOG: invalid record length at 365/12420978: wanted 24, got 0
    LOG: redo done at 365/12420950
    LOG: last completed transaction was at log time 2018-06-05 10:59:27.049443-05
    LOG: checkpoint starting: end-of-recovery immediate
    LOG: checkpoint complete: wrote 5343 buffers (0.5%); 0 transaction log file(s) added, 1 removed, 0 recycled; write=0.131 s, sync=0.009 s, total=0.164 s; sync files=142, longest=0.005 s, average=0.000 s; distance=39064 kB, estimate=39064 kB
    LOG: MultiXact member wraparound protections are now enabled
    LOG: autovacuum launcher started
    LOG: database system is ready to accept connections
In that case, the database restarted on its own, with only about 30 seconds of downtime.
In the other case, the logs recorded:

    LOG: all server processes terminated; reinitializing
    LOG: dynamic shared memory control segment is corrupt
    LOG: incomplete data in "postmaster.pid": found only 1 newlines while trying to add line 7
In that case, the database did not restart on its own. It was 5 a.m. on a Sunday, so the on-call SRE simply started the database back up manually, and it appears to have been running fine since.
My question is whether the corrupt dynamic shared memory control segment, and Postgres's failure to restart automatically, mean the database should not simply be started back up, and whether there is anything we should do before restarting.
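If there is, here is the sort of evidence we are thinking of capturing before the next manual restart (a minimal sketch; paths assume dynamic_shared_memory_type = 'posix', the Linux default in 9.6, and the evidence directory is our own choice):

    # POSIX dynamic shared memory segments live in /dev/shm; note any
    # leftover PostgreSQL.* entries before a restart recreates them.
    ls -l /dev/shm/PostgreSQL.*

    # The main (System V) shared memory segment is visible via ipcs.
    ipcs -m

    # Copy the DSM segments aside for later analysis; the numeric
    # suffix changes on every cluster start.
    mkdir -p /var/tmp/dsm-evidence
    cp /dev/shm/PostgreSQL.* /var/tmp/dsm-evidence/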
Do we potentially have corrupt data or indices as a result of our last outage? If so, what should we do to investigate?
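Absent better advice, our tentative checklist looks like this (a sketch only; pg_dumpall exercises table heaps but not indexes, and amcheck is not bundled with 9.6, so we would need the separately packaged amcheck_next extension, which we have not yet vetted):

    # Force a read of every table's heap pages in every database; any
    # page-level corruption in table data should surface as an error.
    # The dump output itself is discarded.
    pg_dumpall > /dev/null

    # Verify B-tree index structure. The amcheck contrib module ships
    # with Postgres 10+; for 9.6, the same bt_index_check() function
    # comes from the out-of-tree amcheck_next extension.
    psql -d mydb -c "CREATE EXTENSION IF NOT EXISTS amcheck_next;"
    psql -d mydb -c "
      SELECT c.relname, bt_index_check(c.oid)
      FROM pg_class c
      JOIN pg_am am ON am.oid = c.relam
      WHERE am.amname = 'btree' AND c.relkind = 'i';"

    # Rebuild any index that fails the check.
    psql -d mydb -c "REINDEX INDEX some_broken_index;"

(mydb and some_broken_index are placeholders for each of our databases and for whatever index turns up bad.) Is this a reasonable approach, or is there something better?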