Re: Core dump - Mailing list pgsql-hackers
From | Dan Moschuk |
---|---|
Subject | Re: Core dump |
Date | |
Msg-id | 20001012164752.A3004@spirit.jaded.net |
In response to | Re: Core dump (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses | Re: Core dump |
List | pgsql-hackers |
| > Sparc solaris 2.7 with postgres 7.0.2
| > It seems to be reproducible, the server crashes on us at a rate of about
| > every few hours.
|
| That's a very bizarre backtrace.  Why the multiple levels of recursive
| entry to the quickdie() signal handler?  I wonder if you aren't looking
| at some kind of Solaris bug --- perhaps it's not able to cope with a
| signal handler turning around and issuing new kernel calls.

I'm not sure that is the issue, see below.

| The core file you are looking at is probably *not* from the original
| failure, whatever that is.  The sequence is probably
|
| 1. Some backend crashes for unknown reason, dumping core.
|
| 2. Postmaster observes messy death of a child, decides that mass suicide
|    followed by restart is called for.  Postmaster sends SIGUSR1 to all
|    remaining backends to make them commit hara-kiri.
|
| 3. One or more other backends crash trying to obey postmaster's command.
|    The corefile left for you to examine comes from whichever crashed
|    last.
|
| So there are at least two problems here, but we only have evidence of
| the second one.
|
| Since the problem is fairly reproducible, I'd suggest you temporarily
| dike out the elog(NOTICE) call in quickdie() (in
| src/backend/tcop/postgres.c), which will probably allow the backends
| to honor SIGUSR1 without dumping core.  Then you have a shot at seeing
| the core from the original failure.

I will try this; however, the database is currently running under light
load.  Only under high load does postgres start to choke and eventually die.

| Assuming that this works (ie, you find a core that's not got anything
| to do with quickdie()), I'd suggest an inquiry to Sun about whether
| their signal handler logic hasn't got a problem with write() issued
| from inside a signal handler.  Meanwhile let us know what the new
| backtrace shows.

I wrote a quick test program to test this theory.  Below is the code and
the output.
#include <sys/types.h>
#include <stdio.h>
#include <unistd.h>
#include <signal.h>

static void moo (int);

int
main (void)
{
        signal(SIGUSR1, moo);
        raise(SIGUSR1);
}

static void
moo (cow)
        int cow;
{
        printf("Getting ready for write()\n");
        write(STDOUT_FILENO, "Hello!\n", 7);
        printf("Done.\n");
}

eclipse% ./x
Getting ready for write()
Hello!
Done.
eclipse%

It would appear from that very rough test program that solaris doesn't mind
system calls from within a signal handler.

--
Man is a rational animal who always loses his temper when he is called upon
to act in accordance with the dictates of reason.
        -- Oscar Wilde