orangutan seizes up during isolation-check - Mailing list pgsql-hackers
From | Noah Misch |
---|---|
Subject | orangutan seizes up during isolation-check |
Date | |
Msg-id | 20140902013458.GB906981@tornado.leadboat.com |
Responses |
Re: orangutan seizes up during isolation-check
|
List | pgsql-hackers |
Buildfarm member orangutan has failed chronically on both of the branches for which it still reports, HEAD and REL9_1_STABLE, for over two years. The postmaster appears to jam during isolation-check. Dave, orangutan currently has one such jammed postmaster for each branch. Could you gather some information about the running processes? Specifically, it would be helpful to see the output of "ps -el" and a stack trace for each running PostgreSQL process. (If there are enough PostgreSQL processes to make stack traces tedious to acquire, it will be almost as good to have traces for each postmaster and one autovacuum worker per postmaster.) Thanks. Best not to kill the processes yet, in case we need more information.

The rest of this message is just a dump of my observations from the data already available. The jammed postmasters fail to complete fast shutdown requests. Beyond that, the symptoms are different on HEAD versus 9.1. The 2014-07-09 run is representative for HEAD. multiple-row-versions.spec failed like this after having run for almost 21 hours:

    --- 1,2 ----
      Parsed test spec with 4 sessions
    ! Connection 2 to database failed:
    \ No newline at end of file

I don't know what would cause PQconnectdb() to hang for 21 hours before failing with a blank error message. Note that the hang duration and the spec in which the hang falls vary from failure to failure. All subsequent specs then fail like this:

    --- 1,4 ----
      Parsed test spec with 2 sessions
    ! Connection 0 to database failed: could not connect to server: Connection refused
    !   Is the server running locally and accepting
    !   connections on Unix domain socket "/tmp/.s.PGSQL.5678"?

One can get ECONNREFUSED from a Unix-domain socket when the listen() backlog is full. At this point, we've made only two connection attempts since the last successful one and only about 40 attempts since the last postmaster startup. I have no good theories remaining at the moment.
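For anyone poking at a jammed instance from the client side, here is a minimal sketch of a probe that reports which errno a Unix-domain connect() returns. It does not reproduce the full-backlog case (behavior there is platform-dependent); it only distinguishes a missing socket file, a refused connection, and a successful connect. The function names are hypothetical, and the socket path is the one quoted in the error message above.

```python
import errno
import os
import socket
import tempfile


def probe_unix_socket(path, timeout=5.0):
    """Attempt one connect() to a Unix-domain stream socket.

    Returns "OK" on success, "TIMEOUT" if the attempt timed out,
    or the symbolic errno name (e.g. "ECONNREFUSED", "ENOENT").
    """
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect(path)
        return "OK"
    except socket.timeout:
        return "TIMEOUT"
    except OSError as e:
        return errno.errorcode.get(e.errno, str(e.errno))
    finally:
        s.close()


def make_stale_socket(directory):
    """Create a socket file that nothing listens on (for demonstration only)."""
    path = os.path.join(directory, "stale.sock")
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(path)
    srv.close()  # the file remains, but there is no listener behind it
    return path


if __name__ == "__main__":
    # Path taken from the error message; adjust for the port in use.
    print(probe_unix_socket("/tmp/.s.PGSQL.5678"))
    with tempfile.TemporaryDirectory() as d:
        print(probe_unix_socket(make_stale_socket(d)))
```

A stale socket file and a saturated backlog can both surface as ECONNREFUSED, so the probe by itself cannot tell them apart; it is only a quick way to confirm what the client actually sees.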
The postmaster log ends in 1211 copies of this message:

    WARNING: worker took too long to start; canceled

At the default autovacuum_naptime=1min, that represents 20:11:00 of autovacuum launch failures. The postmaster had been running about 20:55:42 by the time we collected that log, suggesting that autovacuum was healthy until 40-45 minutes into the doomed PQconnectdb() call. I'm hypothesizing that the postmaster ceased serving autovacuum launcher requests. A jammed postmaster tends to explain both the ECONNREFUSED symptom and the autovacuum symptom.

In REL9_1_STABLE, isolation-check completes, but the StopDb-C:2 step that follows isolation-check fails to stop the server. (If you go back far enough in the history, suites other than isolation-check occasionally jam the server.) The server log ends like this:

    LOG: received fast shutdown request
    LOG: aborting any active transactions
    LOG: autovacuum launcher shutting down

That suggests a postmaster stuck in PM_WAIT_BACKENDS. The process data should illuminate this situation.
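The autovacuum timing estimate above can be checked with a few lines of arithmetic. This sketch assumes, as the estimate does, exactly one "worker took too long to start" warning per autovacuum_naptime interval; the function names are made up for illustration.

```python
from datetime import timedelta


def launch_failure_span(warning_count, naptime):
    """Time covered by the warnings, assuming one warning per naptime interval."""
    return warning_count * naptime


def healthy_span(postmaster_uptime, failure_span):
    """Time the postmaster ran before the warnings started."""
    return postmaster_uptime - failure_span


# Figures taken from the log described above.
failing = launch_failure_span(1211, timedelta(minutes=1))
healthy = healthy_span(timedelta(hours=20, minutes=55, seconds=42), failing)

print(failing)  # 20:11:00
print(healthy)  # 0:44:42
```

The second figure falls inside the 40-45 minute window quoted above.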