Defunct postmasters - Mailing list pgsql-general
From | Gavin Scott |
---|---|
Subject | Defunct postmasters |
Msg-id | 1014670303.13536.95.camel@gavin.pokerpages.com |
List | pgsql-general |
Hi,

We have lately begun having problems with our production database running postgres 7.1 on linux kernel v 2.4.17. The system had run without incident for many months (there were occasional reboots). Since we upgraded to kernel 2.4.17 on Dec. 31 it ran non-stop without problem until Feb 13, when postmaster appeared to stop taking new incoming connections. We restarted and then the problem struck again Saturday night (Feb 23).

In both instances attempting to access the db via the psql commandline would just hang -- no error messages were printed. Also, we have two perl scripts running that connect to the database once every few minutes; one runs on a remote server, the other locally. Both create log files and appeared to be stuck trying to make a connection.

In the 2nd incident /var/log/postgresql.log contained:

    Sat Feb 23 23:41:00 CST 2002
    PacketReceiveFragment: read() failed: Connection reset by peer
    pq_recvbuf: recv() failed: Connection reset by peer
    pq_recvbuf: recv() failed: Connection reset by peer
    Sat Feb 23 23:51:00 CST 2002
    pq_recvbuf: recv() failed: Connection reset by peer
    pq_recvbuf: recv() failed: Connection reset by peer

23:40 appears to have been when the problem began. I added a cron job to put the date lines in the above; in the 1st incident I didn't have that, so it was difficult to tell what was happening when the problem began. The log did contain messages similar to the above, but I can't guarantee they were produced at the time of the problem.

dmesg, both on the postgres machine and on our remote server which accesses it via the script mentioned above, showed a couple of lines like:

    sending pkt_too_big to self
    sending pkt_too_big to self

Since there aren't any timestamps in dmesg I can't guarantee that those were produced at the time of the incident. Also, I did not check dmesg during the 1st incident.

In both incidents there were multiple zombies hanging around:

    postgres 21264 0.0 0.0 0 0 ? Z Feb23 0:00 [postmaster <defunct>]
    postgres 21266 0.0 0.0 0 0 ? Z Feb23 0:00 [postmaster <defunct>]

The system was mostly idle at the time I began investigating both incidents.

While searching the mailing list archives I did find 2 threads that seemed to reference similar problems. This one sounded like an exact match:

http://groups.google.com/groups?hl=en&frame=right&th=a52001dbca656ddc&seekm=Pine.GSO.4.10.10105111011390.27338-100000%40tigger.seis.sc.edu#s

There were similar elements mentioned here:

http://archives.postgresql.org/pgsql-hackers/2002-01/msg01142.php

I was especially intrigued by this quote from Tom Lane in the 2nd link:

"It sounds like the postmaster got into a state where it was not responding to SIGCHLD signals. We fixed one possible cause of that between 7.1 and 7.2, but without a more concrete report I have no way to know if you saw the same problem or a different one. I'd have expected connection attempts to unwedge the postmaster in any case."

Does anyone have any idea what might be causing our problem and whether or not upgrading to 7.2 might solve it? Also, does anyone know any reason to NOT upgrade to 7.2? I've only recently joined this list, so I may have overlooked outstanding known problems with 7.2.

Thanks,

Gavin Scott
gavin@pokerpages.com
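
The zombie check described above could be run from the same cron job that stamps the log, so that the first appearance of a defunct postmaster gets a timestamp. What follows is only a minimal sketch, assuming `ps aux`-style output with the usual eleven columns (USER, PID, %CPU, %MEM, VSZ, RSS, TTY, STAT, START, TIME, COMMAND). The function name `find_zombies` and the `SAMPLE` text are hypothetical names chosen for illustration; the sample lines are the two quoted in the message.

```python
# Sketch: pick out zombie (defunct) processes from `ps aux`-style text.
# Assumes the standard 11-column layout; names here are illustrative only.

SAMPLE = """\
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
postgres 21264 0.0 0.0 0 0 ? Z Feb23 0:00 [postmaster <defunct>]
postgres 21266 0.0 0.0 0 0 ? Z Feb23 0:00 [postmaster <defunct>]
"""


def find_zombies(ps_output, name="postmaster"):
    """Return PIDs of processes in state Z whose command mentions `name`."""
    zombies = []
    for line in ps_output.splitlines():
        # Split into at most 11 fields so a COMMAND containing spaces stays whole.
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header line or something we cannot parse
        stat, command = fields[7], fields[10]
        if stat.startswith("Z") and name in command:
            zombies.append(int(fields[1]))
    return zombies
```

Feeding it the live output of `ps aux` (for example via `subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout`) and appending any non-empty result to the log next to the date line would show when the zombies first appeared relative to the `recv() failed` messages.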