Re: UTC4115FATAL: the database system is in recovery mode - Mailing list pgsql-general
From | Mathew Samuel |
---|---|
Subject | Re: UTC4115FATAL: the database system is in recovery mode |
Date | |
Msg-id | FBB3A126CE548B48A5CD4C867A33258603695E61DA@sottexch7.corp.ad.entrust.com Whole thread Raw |
In response to | Re: UTC4115FATAL: the database system is in recovery mode (Craig Ringer <craig@postnewspapers.com.au>) |
Responses |
Re: UTC4115FATAL: the database system is in recovery mode
|
List | pgsql-general |
Craig Ringer <craig(at)postnewspapers(dot)com(dot)au> writes: > On 05/30/2011 10:29 PM, Mathew Samuel wrote: > >> 2011-03-28 10:44:28 UTC3609HINT: Consider increasing the configuration >> parameter "checkpoint_segments". >> 2011-03-28 10:44:38 UTC3609LOG: checkpoints are occurring too >> frequently (10 seconds apart) >> 2011-03-28 10:44:38 UTC3609HINT: Consider increasing the configuration >> parameter "checkpoint_segments". >> 2011-03-28 10:44:42 UTC3932ERROR: canceling statement due to statement >> timeout >> 2011-03-28 10:44:42 UTC3932STATEMENT: vacuum full analyze >> _zamboni.sl_log_1 >> 2011-03-28 10:44:42 UTC3932PANIC: cannot abort transaction 1827110275, >> it was already committed >> 2011-03-28 10:44:42 UTC3566LOG: server process (PID 3932) was >> terminated by signal 6 > > Interesting. It almost looks like a VACUUM FULL ANALYZE was cancelled by statement_timeout, couldn't be aborted (assumingit was in fact > 1827110275) and then the backend crashed with a signal 6 (SIGABRT). > SIGABRT can be caused by an assertion failure, certain fatal aborts in the C library caused by memory allocation errors,etc. > > Alas, while PostgreSQL may have dumped a core file I doubt there's any debug information in your build. If you do finda core file for that process ID, it might be worth checking for a debuginfo rpm just in case. > >> In fact those last 3 lines are repeated over and over again repeatedly >> until "UTC4115FATAL: the database system is in recovery mode" is >> logged for 4 hours. At some point, 4 hours later of course, it appears >> that the system recovers. > > Wow. Four hours recovery with default checkpoint settings. > > Is it possible that the server was completely overloaded and was swapping heavily? That could explain why VACUUM timedout in the first place, and would explain why it took such a long time to recover. Check your system logs around thesame time for other indications of excessive load, and check your monitoring history if you have monitoring like Cactior the like active. > > See if there's anything interesting in the kernel logs too. > > Just for completeness, can you send all non-commented-out, non-blank lines in your postgresql.conf ? > > $ egrep '^[^#[:space:]]' postgresql.conf |cut -d '#' -f 1 > > -- > Craig Ringer Thanks for your help, here are the results from running that you provided to me: $ egrep '^[^#[:space:]]' postgresql.conf |cut -d '#' -f 1 listen_addresses='localhost' port=5432 max_connections=200 ssl=off shared_buffers = 24MB max_fsm_pages = 153600 log_line_prefix='%t%p' statement_timeout=60000 datestyle = 'iso, mdy' lc_messages = 'C' lc_monetary = 'C' lc_numeric = 'C' lc_time = 'C' And as you anticipated, no core file was dumped on that system. It is quite possible that the system was under heavy load at the time as it is a heavily used system that would be proneto periods of strong use. No monitoring (Cacti or otherwise) was active as far as I can gather. Not that this is any sort of solution but would one approach be to increase the statement_timeout to greater than 1 minute?I realize that this may hide the symptom but if the customer does hit this issue again I'm just seeing if there isan option I can change to help prevent a reoccurrence. Cheers, Matt
pgsql-general by date: