From: Robert Haas
Subject: Re: What is happening on buildfarm member crake?
Msg-id: CA+TgmoZBXXVMF-oH=JEv5BsshE7P8PTz25-1qjVWeMh8LbhRHg@mail.gmail.com
In response to: Re: What is happening on buildfarm member crake? (Andrew Dunstan <andrew@dunslane.net>)
Responses: Re: What is happening on buildfarm member crake?
List: pgsql-hackers
On Sun, Jan 19, 2014 at 7:53 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
> Also crake does produce backtraces on core dumps, and they are at the
> bottom of the buildfarm log. The latest failure backtrace is reproduced
> below.
>
> ================== stack trace: /home/bf/bfr/root/HEAD/inst/data-C/core.12584 ==================
> [New LWP 12584]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> Core was generated by `postgres: buildfarm contrib_regression_test_shm_mq'.
> Program terminated with signal 11, Segmentation fault.
> #0  SetLatch (latch=0x1c) at pg_latch.c:509
> 509         if (latch->is_set)
> #0  SetLatch (latch=0x1c) at pg_latch.c:509
> #1  0x000000000064c04e in procsignal_sigusr1_handler (postgres_signal_arg=<optimized out>) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/storage/ipc/procsignal.c:289
> #2  <signal handler called>
> #3  _dl_fini () at dl-fini.c:190
> #4  0x000000361ba39931 in __run_exit_handlers (status=0, listp=0x361bdb1668, run_list_atexit=true) at exit.c:78
> #5  0x000000361ba399b5 in __GI_exit (status=<optimized out>) at exit.c:100
> #6  0x00000000006485a6 in proc_exit (code=0) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/storage/ipc/ipc.c:143
> #7  0x0000000000663abb in PostgresMain (argc=<optimized out>, argv=<optimized out>, dbname=0x12b8170 "contrib_regression_test_shm_mq", username=<optimized out>) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/tcop/postgres.c:4225
> #8  0x000000000062220f in BackendRun (port=0x12d6bf0) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:4083
> #9  BackendStartup (port=0x12d6bf0) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:3772
> #10 ServerLoop () at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:1583
> #11 PostmasterMain (argc=<optimized out>, argv=<optimized out>) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:1238
> #12 0x000000000045e2e8 in main (argc=3, argv=0x12b7430) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/main/main.c:205

Hmm, that looks an awful lot like the SIGUSR1 signal handler is getting called after we've already completed shmem_exit. And indeed that seems like the sort of thing that would result in dying horribly in just this way.

The obvious fix seems to be to check proc_exit_inprogress before doing anything that might touch shared memory, but there are a lot of other SIGUSR1 handlers that don't do that either. However, in those cases, the likely cause of a SIGUSR1 would be a sinval catchup interrupt or a recovery conflict, which aren't likely to be so far delayed that they arrive after we've already disconnected from shared memory. But the dynamic background workers stuff adds a new possible cause of SIGUSR1: the postmaster letting us know that a child has started or died. And that could happen even after we've detached from shared memory.
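To make the proposed guard concrete, here is a minimal sketch, not the actual PostgreSQL code: the flag name proc_exit_inprogress matches the real one in ipc.c, but the handler body and the little driver are simplified for illustration.

```c
/*
 * Illustrative sketch only, not the real procsignal_sigusr1_handler:
 * once proc_exit() has begun, the handler returns without touching
 * anything that might live in shared memory.
 */
#include <errno.h>
#include <signal.h>
#include <stdbool.h>

/*
 * In PostgreSQL this flag is a bool set at the top of proc_exit()
 * (src/backend/storage/ipc/ipc.c); sig_atomic_t is used here for a
 * self-contained, signal-safe sketch.
 */
static volatile sig_atomic_t proc_exit_inprogress = false;

static void
sigusr1_handler_sketch(int signo)
{
    int save_errno = errno;     /* signal handlers must preserve errno */

    /*
     * If proc_exit() is already running, shared memory may be detached,
     * so a latch pointer fetched from it (like latch=0x1c in the
     * backtrace above) would be garbage.  Do nothing in that case.
     */
    if (!proc_exit_inprogress)
    {
        /* ...normal work: check ProcSignal flags, SetLatch(), etc... */
    }

    errno = save_errno;
}

int
main(void)
{
    signal(SIGUSR1, sigusr1_handler_sketch);

    raise(SIGUSR1);                 /* handled normally */
    proc_exit_inprogress = true;    /* simulate proc_exit() starting */
    raise(SIGUSR1);                 /* handler now bails out immediately */

    return 0;
}
```

The ordering is what makes this kind of guard work: proc_exit() sets the flag before the on_shmem_exit callbacks detach shared memory, so any signal that arrives later, including a postmaster notification about a background worker, finds the guard already up.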
-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company