Thread: Re: Backend core dump, Please help, Urgent!
[ I'm redirecting this to pg-hackers since it doesn't look like an
interfaces problem ... ]

Matthew Hagerty <matthew@venux.net> writes:
> The app is written in PHP3-3.0.12 compiled as an Apache-1.3.6 module.  The
> OS is FreeBSD-3.1-Release with GCC-2.7.2.1 and a PostgreSQL-6.5.1 backend.

You should probably update to 6.5.3 for starters.  I'm not all that
hopeful that any of the bugfixes in 6.5.3 will fix this, but it'd be
pretty silly not to try it before investing a lot of work running down
the problem.

> The app went online on August 30, 1999 and has run without incident until
> yesterday.  At about 10am Dec, 13th, 1999 one of the programmers noticed
> that none of the forum messages would come up.  I went to the console of
> the server and saw this message about 10 or 15 times:

> Dec 13 10:35:56 redbox /kernel: pid 13856 (postgres), uid 1002: exited on
> signal 11 (core dumped)

> A ps -xa revealed about 15 or so postgres processes!  I did not think
> postgres made any child processes?!?!  So I stopped the web server and
> killed the main postgres process which seemed to kill all the other
> postgres processes.  I then tried to restart postgres and got an error
> message that was something like:

> IpcSemaphore??? - Key=54321234 Max

You could probably have recovered from this with "ipcclean" instead of
a reboot; it sounds like the postmaster failed to release the shared
semaphores before exiting.  Which it should have, unless maybe you used
kill -9 on it...

> At 9:36am on the 14th it happened again.  Again I was unable to recover the
> data and had to rebuild the data directory.  I did not delete the data
> directory this time, I just moved it to another directory so I would have
> it.  I also have the core dumps.  The only file I had to delete was the
> pg_log in the data directory.  What is this file?  It had grown to 700Meg
> in under 24 hours!!  Also, the core dump for the main app grew from 2.7Meg
> to over 80Meg while I was trying to dump the data.

Sure sounds like a corrupted-data problem.  Can you use gdb on the
corefiles to get a backtrace of what they were doing?

> My biggest hang-up is why all of a sudden?

Good question.  We'll probably know the answer when we find the problem.

			regards, tom lane
> Sure sounds like a corrupted-data problem.  Can you use gdb on the
> corefiles to get a backtrace of what they were doing?
>
> > My biggest hang-up is why all of a sudden?
>
> Good question.  We'll probably know the answer when we find the problem.

Besides the possibility Tom has pointed out, there is a known problem
with 6.5.x on FreeBSD. It is rather important, since it results in a
core dump as well. The problem occurs while a backend is waiting to
acquire a lock. Thus it tends to happen under relatively heavy load
(I observed the problem starting with 4 concurrent transactions). As
far as I know, Linux does not have the problem at all, but FreeBSD
does. I'm not sure about other platforms. Solaris does not seem to be
affected.

You could try the following patch. It was made for 6.5.3, but you could
apply it to 6.5.1 or 6.5.2 as well. Current has already been fixed
with a more complex, long-term solution. But I would prefer to
minimize the impact on existing releases. Keeping that in mind, I have
made the patch as simple as possible.
--
Tatsuo Ishii

---------------------------- cut here -----------------------------
*** postgresql-6.5.3/src/backend/storage/lmgr/lock.c~   Sat May 29 15:14:42 1999
--- postgresql-6.5.3/src/backend/storage/lmgr/lock.c    Mon Dec 13 16:45:47 1999
***************
*** 940,946 ****
  {
      PROC_QUEUE *waitQueue = &(lock->waitProcs);
      LOCKMETHODTABLE *lockMethodTable = LockMethodTable[lockmethod];
!     char        old_status[64],
                  new_status[64];
  
      Assert(lockmethod < NumLockMethods);
--- 940,946 ----
  {
      PROC_QUEUE *waitQueue = &(lock->waitProcs);
      LOCKMETHODTABLE *lockMethodTable = LockMethodTable[lockmethod];
!     static char old_status[64],
                  new_status[64];
  
      Assert(lockmethod < NumLockMethods);
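
[ Editorial note: the patch changes only the storage class of the two
status buffers. The thread does not say why that stops the crash. One
plausible reading, offered here purely as an assumption, is that some
platform code keeps a pointer to the buffer instead of copying it, so an
automatic (stack) array would leave that pointer dangling once the
function returns. The sketch below illustrates only that C-language
difference; status_ptr, enter_wait and current_status are hypothetical
names, not PostgreSQL identifiers. ]

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a process-title slot that stores only a
 * pointer to the caller's buffer, not a copy of its contents. */
static const char *status_ptr = NULL;

/* Hypothetical helper: builds a "<base> waiting" status string and
 * publishes a pointer to it. */
void enter_wait(const char *base)
{
    /* `static` gives the buffer static storage duration, so it outlives
     * this call. Without `static` it would be an automatic array, and
     * status_ptr would dangle as soon as the function returned. */
    static char new_status[64];

    assert(base != NULL);
    /* snprintf truncates rather than overflowing the 64-byte buffer. */
    snprintf(new_status, sizeof(new_status), "%s waiting", base);
    status_ptr = new_status;
}

/* Reads the published status after enter_wait() has returned. This is
 * well-defined only because the buffer above is static. */
const char *current_status(void)
{
    return status_ptr;
}
```

Calling enter_wait("postgres") and then reading current_status() after
the call has returned gives "postgres waiting". With a non-static buffer
the same read would be undefined behavior.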
Thanks for the patch.  I think I'm going to upgrade to FreeBSD-3.3 and
PG-6.5.3 tonight.  Will I still need the patch with 6.5.3?

I'm also going to do a connection test on another offline server to see
if it is indeed a load problem.  I'll post the results if anyone is
interested.

Thank you for the help,
Matthew

At 08:43 PM 12/15/99 +0900, Tatsuo Ishii wrote:
>> Sure sounds like a corrupted-data problem.  Can you use gdb on the
>> corefiles to get a backtrace of what they were doing?
>>
>> > My biggest hang-up is why all of a sudden?
>>
>> Good question.  We'll probably know the answer when we find the problem.
>
>Besides the possibility Tom has pointed out, there is a known problem
>with 6.5.x on FreeBSD. It is rather important, since it results in a
>core dump as well. The problem occurs while a backend is waiting to
>acquire a lock. Thus it tends to happen under relatively heavy load
>(I observed the problem starting with 4 concurrent transactions). As
>far as I know, Linux does not have the problem at all, but FreeBSD
>does. I'm not sure about other platforms. Solaris does not seem to be
>affected.
>
>You could try the following patch. It was made for 6.5.3, but you could
>apply it to 6.5.1 or 6.5.2 as well. Current has already been fixed
>with a more complex, long-term solution. But I would prefer to
>minimize the impact on existing releases. Keeping that in mind, I have
>made the patch as simple as possible.
>--
>Tatsuo Ishii
>
>---------------------------- cut here -----------------------------
>*** postgresql-6.5.3/src/backend/storage/lmgr/lock.c~   Sat May 29 15:14:42 1999
>--- postgresql-6.5.3/src/backend/storage/lmgr/lock.c    Mon Dec 13 16:45:47 1999
>***************
>*** 940,946 ****
>  {
>      PROC_QUEUE *waitQueue = &(lock->waitProcs);
>      LOCKMETHODTABLE *lockMethodTable = LockMethodTable[lockmethod];
>!     char        old_status[64],
>                  new_status[64];
>  
>      Assert(lockmethod < NumLockMethods);
>--- 940,946 ----
>  {
>      PROC_QUEUE *waitQueue = &(lock->waitProcs);
>      LOCKMETHODTABLE *lockMethodTable = LockMethodTable[lockmethod];
>!     static char old_status[64],
>                  new_status[64];
>  
>      Assert(lockmethod < NumLockMethods);