BUG #9464: PANIC with 'failed to re-find shared lock object' - Mailing list pgsql-bugs
From | petela@gocougs.wsu.edu |
---|---|
Subject | BUG #9464: PANIC with 'failed to re-find shared lock object' |
Date | |
Msg-id | 20140307005623.1918.22473@wrigleys.postgresql.org Whole thread Raw |
Responses |
Re: BUG #9464: PANIC with 'failed to re-find shared lock object'
Re: BUG #9464: PANIC with 'failed to re-find shared lock object' |
List | pgsql-bugs |
The following bug has been logged on the website: Bug reference: 9464 Logged by: Peter LaDow Email address: petela@gocougs.wsu.edu PostgreSQL version: 9.3.3 Operating system: Linux (Ubuntu 12.04, PPC) Description: We are upgrading from Postgresql 8.3.7 to 9.3.3, and we are getting a database crash due to a failure in a C extension. In this extension we are trying to spawn an external program to run in the background. We do this with double fork(), with the first child exiting cleanly, and the second child execing the program. The database log outputs the following upon failure (myfoo_broken() is an SQL function that calls a C extension function called myfoo(); see the code below): 2014-03-07 00:23:00.764 GMT myfoodatabase psql_user 6753: NOTICE: Running myfoo_broken() 2014-03-07 00:23:00.764 GMT myfoodatabase psql_user 6753: STATEMENT: SELECT public.set_schema(); select myfoo_broken(); 2014-03-07 00:23:00.784 GMT myfoodatabase psql_user 6753: PANIC: failed to re-find shared lock object 2014-03-07 00:23:00.784 GMT myfoodatabase psql_user 6753: STATEMENT: SELECT public.set_schema(); select myfoo_broken(); 2014-03-07 00:23:05.600 GMT myfoodatabase sel_web 6557: LOG: duration: 2.651 ms statement: SELECT session_id FROM device.web_session; 2014-03-07 00:23:09.383 GMT 1100: LOG: server process (PID 6753) was terminated by signal 6: Aborted 2014-03-07 00:23:09.384 GMT 1100: LOG: terminating any other active server processes 2014-03-07 00:23:09.386 GMT 6534: WARNING: terminating connection because of crash of another server process 2014-03-07 00:23:09.386 GMT 6534: DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2014-03-07 00:23:09.386 GMT 6534: HINT: In a moment you should be able to reconnect to the database and repeat your command. 2014-03-07 00:23:09.387 GMT myfoodatabase sel_web 6557: WARNING: terminating connection because of crash of another server process 2014-03-07 00:23:09.387 GMT myfoodatabase sel_web 6557: DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2014-03-07 00:23:09.387 GMT myfoodatabase sel_web 6557: HINT: In a moment you should be able to reconnect to the database and repeat your command. 2014-03-07 00:23:09.384 GMT myfoodatabase psql_user 6556: WARNING: terminating connection because of crash of another server process 2014-03-07 00:23:09.384 GMT myfoodatabase psql_user 6556: DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2014-03-07 00:23:09.388 GMT myfoodatabase psql_user 6556: HINT: In a moment you should be able to reconnect to the database and repeat your command. 2014-03-07 00:23:09.388 GMT myfoodatabase psql_user 6556: DEBUG: shmem_exit(-1): 0 callbacks to make 2014-03-07 00:23:09.390 GMT myfoodatabase psql_user 6556: DEBUG: proc_exit(-1): 0 callbacks to make 2014-03-07 00:23:09.425 GMT 1100: LOG: all server processes terminated; reinitializing 2014-03-07 00:23:10.146 GMT myfoodatabase psql_user 6778: FATAL: the database system is in recovery mode 2014-03-07 00:23:10.156 GMT 6777: LOG: database system was interrupted; last known up at 2014-03-07 00:21:12 GMT 2014-03-07 00:23:10.162 GMT 6777: LOG: database system was not properly shut down; automatic recovery in progress 2014-03-07 00:23:10.218 GMT myfoodatabase psql_user 6781: FATAL: the database system is in recovery mode 2014-03-07 00:23:10.347 GMT myfoodatabase psql_user 6790: FATAL: the database system is in recovery mode 2014-03-07 00:23:10.464 GMT myfoodatabase psql_user 6798: FATAL: the database system is in recovery mode 2014-03-07 00:23:10.645 GMT myfoodatabase psql_user 6806: FATAL: the database system is in recovery mode 2014-03-07 00:23:10.722 GMT myfoodatabase psql_user 6809: FATAL: the database system is in recovery mode 2014-03-07 00:23:10.863 GMT myfoodatabase psql_user 6818: FATAL: the database system is in recovery mode 2014-03-07 00:23:10.985 GMT myfoodatabase psql_user 6826: FATAL: the database system is in recovery mode 2014-03-07 00:23:11.199 GMT myfoodatabase psql_user 6834: FATAL: the database system is in recovery mode 2014-03-07 00:23:11.253 GMT 6777: LOG: record with zero length at 0/3BB9AE0 2014-03-07 00:23:11.253 GMT 6777: LOG: redo is not required 2014-03-07 00:23:11.271 GMT myfoodatabase psql_user 6837: FATAL: the database system is in recovery mode 2014-03-07 00:23:11.377 GMT 6849: LOG: autovacuum launcher started 2014-03-07 00:23:11.381 GMT 1100: LOG: database system is ready to accept connections We fired up gdb and connected to the coredump, and get the following stack trace: # gdb -q -c /database/core.8073.sig6.1394152008s /usr/lib/postgresql/9.3/bin/postgres Reading symbols from /usr/lib/postgresql/9.3/bin/postgres...done. [New LWP 8073] warning: Can't read pathname for load map: Input/output error. Core was generated by `postgres: psql_user myfoodatabase [local] SELECT '. Program terminated with signal 6, Aborted. #0 0x1fc3a1e0 in raise () from /lib/powerpc-linux-gnu/libc.so.6 (gdb) bt #0 0x1fc3a1e0 in raise () from /lib/powerpc-linux-gnu/libc.so.6 #1 0x1fc3f004 in abort () from /lib/powerpc-linux-gnu/libc.so.6 #2 0x2081d578 in errfinish (dummy=0) at /hg/sel/sel3500_pg93/thirdparty/opensource/linux/postgresql/./src/backend/utils/error/elog.c:554 #3 0x2081fe24 in elog_finish (elevel=22, fmt=0x2098230c "failed to re-find shared lock object") at /hg/sel/sel3500_pg93/thirdparty/opensource/linux/postgresql/./src/backend/utils/error/elog.c:1318 #4 0x206725b4 in LockRefindAndRelease (lockMethodTable=0x209d0944, proc=0x50a86088, locktag=0x2101ccc8, lockmode=1, decrement_strong_lock_count=0 '\000') at /hg/sel/sel3500_pg93/thirdparty/opensource/linux/postgresql/./src/backend/storage/lmgr/lock.c:2904 #5 0x206703ac in LockReleaseAll (lockmethodid=1, allLocks=0 '\000') at /hg/sel/sel3500_pg93/thirdparty/opensource/linux/postgresql/./src/backend/storage/lmgr/lock.c:2077 #6 0x20676438 in ProcReleaseLocks (isCommit=1 '\001') at /hg/sel/sel3500_pg93/thirdparty/opensource/linux/postgresql/./src/backend/storage/lmgr/proc.c:752 #7 0x2085b324 in ResourceOwnerReleaseInternal (owner=0x21026230, phase=RESOURCE_RELEASE_LOCKS, isCommit=1 '\001', isTopLevel=1 '\001') at /hg/sel/sel3500_pg93/thirdparty/opensource/linux/postgresql/./src/backend/utils/resowner/resowner.c:286 #8 0x2085b07c in ResourceOwnerRelease (owner=0x21026230, phase=RESOURCE_RELEASE_LOCKS, isCommit=1 '\001', isTopLevel=1 '\001') at /hg/sel/sel3500_pg93/thirdparty/opensource/linux/postgresql/./src/backend/utils/resowner/resowner.c:206 #9 0x2036d23c in CommitTransaction () at /hg/sel/sel3500_pg93/thirdparty/opensource/linux/postgresql/./src/backend/access/transam/xact.c:1951 #10 0x2036de24 in CommitTransactionCommand () at /hg/sel/sel3500_pg93/thirdparty/opensource/linux/postgresql/./src/backend/access/transam/xact.c:2547 #11 0x20691254 in finish_xact_command () at /hg/sel/sel3500_pg93/thirdparty/opensource/linux/postgresql/./src/backend/tcop/postgres.c:2411 #12 0x2068e6a8 in exec_simple_query ( query_string=0x2108d8b0 "SELECT public.set_schema(); select myfoo_broken();") at /hg/sel/sel3500_pg93/thirdparty/opensource/linux/postgresql/./src/backend/tcop/postgres.c:1079 #13 0x20693d90 in PostgresMain (argc=1, argv=0x21024588, dbname=0x210244b8 "myfoodatabase", username=0x20ff32b8 "psql_user") at /hg/sel/sel3500_pg93/thirdparty/opensource/linux/postgresql/./src/backend/tcop/postgres.c:3997 #14 0x2060f4c4 in BackendRun (port=0x2100b920) at /hg/sel/sel3500_pg93/thirdparty/opensource/linux/postgresql/./src/backend/postmaster/postmaster.c:3996 #15 0x2060ea20 in BackendStartup (port=0x2100b920) at /hg/sel/sel3500_pg93/thirdparty/opensource/linux/postgresql/./src/backend/postmaster/postmaster.c:3685 #16 0x2060a000 in ServerLoop () at /hg/sel/sel3500_pg93/thirdparty/opensource/linux/postgresql/./src/backend/postmaster/postmaster.c:1586 #17 0x2060950c in PostmasterMain (argc=4, argv=0x20ff11b8) at /hg/sel/sel3500_pg93/thirdparty/opensource/linux/postgresql/./src/backend/postmaster/postmaster.c:1253 #18 0x20543e7c in main (argc=4, argv=0x20ff11b8) at /hg/sel/sel3500_pg93/thirdparty/opensource/linux/postgresql/./src/backend/main/main.c:206 (gdb) The offending elog() is in src/backend/storage/lmgr/lock.c, line 2904: /* * Re-find the lock object (it had better be there). */ lock = (LOCK *) hash_search_with_hash_value(LockMethodLockHash, (void *) locktag, hashcode, HASH_FIND, NULL); if (!lock) elog(PANIC, "failed to re-find shared lock object"); So, apparently we are getting a case where (I think) we aren't finding a lock in a lock hash table. Below is the body of an extension (the full extension is available if necessary) that exhibits the problem. In our actual case Datum run_foo(PG_FUNCTION_ARGS) { pid_t pid; int rc; int status; /* If 'true' is passed, take the broken path. */ bool broken = PG_GETARG_BOOL(0); pid = fork(); if ( pid < 0 ) { /* Failed to fork the first child (the child) process */ ereport( ERROR, (errcode(ERRCODE_RAISE_EXCEPTION), errmsg("fork error")) ); } /* In the child */ else if ( pid == 0 ) { struct rlimit rlim; int i; /* Make us the controlling terminal */ setsid(); /* Get the number of file descriptors */ if ( getrlimit(RLIMIT_NOFILE, &rlim) < 0 ) { exit(EXIT_FAILURE); } /* Close all our file descriptors */ for(i=0; i < rlim.rlim_cur; i++) { close(i); } /* Fork again to make sure that we are not a session leader. */ pid = fork(); if ( pid < 0 ) { /* Failed to fork the second child process (the grand child) */ exit(EXIT_FAILURE); } else if ( pid > 0 ) { /* This is the broken case. This causes the panic. */ if ( broken ) { /* The first child (the child) process is exiting. */ exit(EXIT_SUCCESS); } /* This execlp() "fixes" the problem. We do not see the panic * when taking this path. */ else { execlp("/bin/true", "/bin/true", NULL); /* Should never get here */ exit(EXIT_FAILURE); } } /* Note: This is where we normally execute the external process. This example below * is just for demonstration purposes. */ execlp("/bin/false", "/bin/false", NULL); // We shouldn't get here exit(EXIT_FAILURE); } // Wait for the first child (the child) to exit rc = waitpid(pid, &status, 0); // Check the return codes (both the call and the process) if( (rc < 0 ) || (WIFEXITED(status) && WEXITSTATUS(status) != 0) ) { ereport( ERROR, (errcode(ERRCODE_RAISE_EXCEPTION),errmsg("wait error")) ); } PG_RETURN_VOID(); } Take particular note of the "if (broken)" portion. If we call exit() from the grandchild of postgres, we get the panic. If we call exec(), things are fine. Note that calling this function alone does not cause a problem. Only when coupled with some other SQL that invokes this function do we get the panic. So, we have the following (contrived) example SQL to demonstrate the problem: -- Create a dummy table. The column 'schema_name' is the name of the schema we want to add to the search path. DROP TABLE IF EXISTS public.dummy_schema; CREATE TABLE public.dummy_schema(schema_name TEXT NOT NULL); INSERT INTO public.dummy_schema VALUES ('schema_16'); -- Add the schema to the search path. Note that selecting from a table, -- even if the result is unused, exhibits the problem. It seems to be -- a select coupled with a set_config() that causes problems (of course -- paired with our C extension). CREATE OR REPLACE FUNCTION public.set_schema( ) RETURNS VOID AS $$ DECLARE bar TEXT; BEGIN SELECT schema_name INTO bar FROM public.dummy_schema; PERFORM pg_catalog.set_config('search_path', 'schema_16,public', false); END; $$ LANGUAGE plpgsql VOLATILE STRICT SECURITY DEFINER; -- Create the linkage to the C extension. CREATE OR REPLACE FUNCTION schema_16.myfoo(IN broken BOOL) RETURNS VOID AS '/usr/lib/postgresql/libmyfoo.so.1', 'run_foo' LANGUAGE C VOLATILE STRICT; Here are some scenarios, and how they function. psql -c "select schema_16.myfoo(false);" # Works fine psql -c "select schema_16.myfoo(true);" # Works fine psql -c "select set_schema(); select myfoo(false);" # Works fine psql -c "select set_schema(); select myfoo(true);" # PANIC! # Note the examples below "inline" the function set_schema() psql -c "select pg_catalog.set_config('search_path', 'schema_16,public', false); select myfoo(false);" # Works fine psql -c "select pg_catalog.set_config('search_path', 'schema_16,public', false); select myfoo(true);" # Works fine So for some strange reason, the exec() in the non-"broken" case does not exhibit the problem. Are we violating some assumptions or has there been a change in behavior regarding fork/exec since 8.3? Thanks, Pete
pgsql-bugs by date: