Thread: PL/TCL Patch to prevent postgres from becoming multithreaded
There is a problem in PL/TCL that can cause the postgres backend to become multithreaded. Postgres is not designed to be multithreaded, so this causes downstream errors in signal handling. We have seen this cause a number of "unexpected state" errors associated with notification handling; however, unpredictable signal handling would be likely to cause other errors as well. Some sample scripts are attached which will reproduce this problem when running against a multithreaded version of TCL, but will work without error with single-threaded TCL library. The scripts are a combination of Unix shell, perl DBI, and SQL commands. The postgres process can be seen to have multiple threads using the Linux command "ps -Lwfu postgres". In this command the NLWP columns will be 2 for multithreaded backend processes. The threaded/non-threaded state of the TCL library can be ascertained on Linux using ldd to determine if libpthread.so is linked to the TCL library (e.g. "ldd /usr/lib/libtcl8.4.so"). The multithreaded behavior occurs the first time PL/TCL is used in a postgres backend, but only when postgres is linked against a multithread-enabled version of libtcl. Thus, this problem can be side-stepped by linking against the proper TCL library. However multithreaded TCL libraries are becoming the norm in Linux distributions and seems ubiquitous in the Windows world. Therefore a fix to the PL/TCL code is warrented. We determined that postgres became multithreaded during the creation of the TCL interpreter in a function called tcl_InitNotifier. This function is part of TCL's Notifier subsystem, which is used to monitor for events asynchronously from the TCL event loop. Although initialized when an interpreter is created, the Notifier subsystem is not used until a process enters the TCL event loop. This never happens within a postgres process, because postgres implements its own event loop. Therefore the initialization of the Notifier subsystem is not necessary within the context of PL/TCL. Our solution was to disable the Notifier subsystem by overriding the functions associated with it using the Tcl_SetNotifier function. This allows 8 functions related to the Notifier to overriden. Even though we found only two of the functions were ever called within postgres, we overrode 8 functions with no-op versions, just for completeness. A patch file containing the changes to pltcl.c from its 8.2.4 version is also attached. We tested this patch with PostgreSQL 8.2.4 on both RedHat Enterprise 4.0 usingTCL 8.4 (single threaded) and RHE 5.0 using TCL 8.4.13 (multithreaded). We expect this solution to work with Windows as well, although we have not tested it. There may be some problems using this solution with old versions of TCL that pre-date the Tcl_SetNotifier function. However this function has been around for quite a while; it was added in in the TCL 8.2 release, circa 2000. We hope this patch will be considered for a future PostgreSQL release. Steve Marshall Paul Bayer Doug Knight WSI Corporation *** pltcl.c.orig 2007-09-10 12:58:34.000000000 -0400 --- pltcl.c 2007-09-11 11:37:33.363222114 -0400 *************** *** 163,168 **** --- 163,258 ---- static void pltcl_build_tuple_argument(HeapTuple tuple, TupleDesc tupdesc, Tcl_DString *retval); + /********************************************************************** + * Declarations for functions overriden using Tcl_SetNotifier. + **********************************************************************/ + static int fakeThreadKey; /* To give valid address for ClientData */ + + static ClientData + pltcl_InitNotifier _ANSI_ARGS_((void)); + + static void + pltcl_FinalizeNotifier _ANSI_ARGS_((ClientData clientData)); + + static void + pltcl_SetTimer _ANSI_ARGS_((Tcl_Time *timePtr)); + + static void + pltcl_AlertNotifier _ANSI_ARGS_((ClientData clientData)); + + static void + pltcl_CreateFileHandler _ANSI_ARGS_((int fd, int mask, Tcl_FileProc *proc, ClientData clientData)); + + static void + pltcl_DeleteFileHandler _ANSI_ARGS_((int fd)); + + static void + pltcl_ServiceModeHook _ANSI_ARGS_((int mode)); + + static int + pltcl_WaitForEvent _ANSI_ARGS_((Tcl_Time *timePtr)); + + /********************************************************************** + * Definitions for functions overriden using Tcl_SetNotifier. + * These implementations effectively disable the TCL Notifier subsystem. + * This is okay because we never enter the TCL event loop from postgres, + * so the notifier capabilities are initialized, but never used. + * + * NOTE: Only InitNotifier and DeleteFileHandler ever seem to get called + * by postgres, but we implement all the functions for completeness. + **********************************************************************/ + + ClientData + pltcl_InitNotifier() + { + return (ClientData) &(fakeThreadKey); + } + + void + pltcl_FinalizeNotifier(clientData) + ClientData clientData; /* Not used. */ + { + } + + void + pltcl_SetTimer(timePtr) + Tcl_Time *timePtr; + { + } + + void + pltcl_AlertNotifier(clientData) + ClientData clientData; + { + } + + void + pltcl_CreateFileHandler(fd, mask, proc, clientData) + int fd; + int mask; + Tcl_FileProc *proc; + ClientData clientData; + { + } + + void + pltcl_DeleteFileHandler(fd) + int fd; + { + } + + void + pltcl_ServiceModeHook(mode) + int mode; + { + } + + int + pltcl_WaitForEvent(timePtr) + Tcl_Time *timePtr; /* Maximum block time, or NULL. */ + { + return 0; + } /* * This routine is a crock, and so is everyplace that calls it. The problem *************** *** 189,194 **** --- 279,287 ---- void _PG_init(void) { + /* Notifier structure used to override functions in Notifier subsystem*/ + Tcl_NotifierProcs notifier; + /* Be sure we do initialization only once (should be redundant now) */ if (pltcl_pm_init_done) return; *************** *** 199,204 **** --- 292,316 ---- #endif /************************************************************ + * Override the functions in the Notifier subsystem. + * + * We do this to prevent the postgres backend from becoming + * multithreaded, which happens in the default version of + * Tcl_InitNotifier if the TCL library has been compiled with + * multithreading support (i.e. when TCL_THREADS is defined + * under Unix, and in all cases under Windows). + ************************************************************/ + notifier.setTimerProc = pltcl_SetTimer; + notifier.waitForEventProc = pltcl_WaitForEvent; + notifier.createFileHandlerProc = pltcl_CreateFileHandler; + notifier.deleteFileHandlerProc = pltcl_DeleteFileHandler; + notifier.initNotifierProc = pltcl_InitNotifier; + notifier.finalizeNotifierProc = pltcl_FinalizeNotifier; + notifier.alertNotifierProc = pltcl_AlertNotifier; + notifier.serviceModeHookProc = pltcl_ServiceModeHook; + Tcl_SetNotifier(¬ifier); + + /************************************************************ * Create the dummy hold interpreter to prevent close of * stdout and stderr on DeleteInterp ************************************************************/
Attachment
This has been saved for the 8.4 release: http://momjian.postgresql.org/cgi-bin/pgpatches_hold --------------------------------------------------------------------------- Marshall, Steve wrote: > There is a problem in PL/TCL that can cause the postgres backend to > become multithreaded. Postgres is not designed to be multithreaded, so > this causes downstream errors in signal handling. We have seen this > cause a number of "unexpected state" errors associated with notification > handling; however, unpredictable signal handling would be likely to > cause other errors as well. > > Some sample scripts are attached which will reproduce this problem when > running against a multithreaded version of TCL, but will work without > error with single-threaded TCL library. The scripts are a combination > of Unix shell, perl DBI, and SQL commands. The postgres process can be > seen to have multiple threads using the Linux command "ps -Lwfu > postgres". In this command the NLWP columns will be 2 for multithreaded > backend processes. The threaded/non-threaded state of the TCL library > can be ascertained on Linux using ldd to determine if libpthread.so is > linked to the TCL library (e.g. "ldd /usr/lib/libtcl8.4.so"). > > The multithreaded behavior occurs the first time PL/TCL is used in a > postgres backend, but only when postgres is linked against a > multithread-enabled version of libtcl. Thus, this problem can be > side-stepped by linking against the proper TCL library. However > multithreaded TCL libraries are becoming the norm in Linux distributions > and seems ubiquitous in the Windows world. Therefore a fix to the > PL/TCL code is warrented. > > We determined that postgres became multithreaded during the creation of > the TCL interpreter in a function called tcl_InitNotifier. This > function is part of TCL's Notifier subsystem, which is used to monitor > for events asynchronously from the TCL event loop. Although initialized > when an interpreter is created, the Notifier subsystem is not used until > a process enters the TCL event loop. This never happens within a > postgres process, because postgres implements its own event loop. > Therefore the initialization of the Notifier subsystem is not necessary > within the context of PL/TCL. > > Our solution was to disable the Notifier subsystem by overriding the > functions associated with it using the Tcl_SetNotifier function. This > allows 8 functions related to the Notifier to overriden. Even though we > found only two of the functions were ever called within postgres, we > overrode 8 functions with no-op versions, just for completeness. A > patch file containing the changes to pltcl.c from its 8.2.4 version is > also attached. > > We tested this patch with PostgreSQL 8.2.4 on both RedHat Enterprise 4.0 > usingTCL 8.4 (single threaded) and RHE 5.0 using TCL 8.4.13 > (multithreaded). We expect this solution to work with Windows as well, > although we have not tested it. There may be some problems using this > solution with old versions of TCL that pre-date the Tcl_SetNotifier > function. However this function has been around for quite a while; it > was added in in the TCL 8.2 release, circa 2000. > > We hope this patch will be considered for a future PostgreSQL release. > > Steve Marshall > Paul Bayer > Doug Knight > WSI Corporation > > [ application/x-gzip is not supported, skipping... ] > *** pltcl.c.orig 2007-09-10 12:58:34.000000000 -0400 > --- pltcl.c 2007-09-11 11:37:33.363222114 -0400 > *************** > *** 163,168 **** > --- 163,258 ---- > static void pltcl_build_tuple_argument(HeapTuple tuple, TupleDesc tupdesc, > Tcl_DString *retval); > > + /********************************************************************** > + * Declarations for functions overriden using Tcl_SetNotifier. > + **********************************************************************/ > + static int fakeThreadKey; /* To give valid address for ClientData */ > + > + static ClientData > + pltcl_InitNotifier _ANSI_ARGS_((void)); > + > + static void > + pltcl_FinalizeNotifier _ANSI_ARGS_((ClientData clientData)); > + > + static void > + pltcl_SetTimer _ANSI_ARGS_((Tcl_Time *timePtr)); > + > + static void > + pltcl_AlertNotifier _ANSI_ARGS_((ClientData clientData)); > + > + static void > + pltcl_CreateFileHandler _ANSI_ARGS_((int fd, int mask, Tcl_FileProc *proc, ClientData clientData)); > + > + static void > + pltcl_DeleteFileHandler _ANSI_ARGS_((int fd)); > + > + static void > + pltcl_ServiceModeHook _ANSI_ARGS_((int mode)); > + > + static int > + pltcl_WaitForEvent _ANSI_ARGS_((Tcl_Time *timePtr)); > + > + /********************************************************************** > + * Definitions for functions overriden using Tcl_SetNotifier. > + * These implementations effectively disable the TCL Notifier subsystem. > + * This is okay because we never enter the TCL event loop from postgres, > + * so the notifier capabilities are initialized, but never used. > + * > + * NOTE: Only InitNotifier and DeleteFileHandler ever seem to get called > + * by postgres, but we implement all the functions for completeness. > + **********************************************************************/ > + > + ClientData > + pltcl_InitNotifier() > + { > + return (ClientData) &(fakeThreadKey); > + } > + > + void > + pltcl_FinalizeNotifier(clientData) > + ClientData clientData; /* Not used. */ > + { > + } > + > + void > + pltcl_SetTimer(timePtr) > + Tcl_Time *timePtr; > + { > + } > + > + void > + pltcl_AlertNotifier(clientData) > + ClientData clientData; > + { > + } > + > + void > + pltcl_CreateFileHandler(fd, mask, proc, clientData) > + int fd; > + int mask; > + Tcl_FileProc *proc; > + ClientData clientData; > + { > + } > + > + void > + pltcl_DeleteFileHandler(fd) > + int fd; > + { > + } > + > + void > + pltcl_ServiceModeHook(mode) > + int mode; > + { > + } > + > + int > + pltcl_WaitForEvent(timePtr) > + Tcl_Time *timePtr; /* Maximum block time, or NULL. */ > + { > + return 0; > + } > > /* > * This routine is a crock, and so is everyplace that calls it. The problem > *************** > *** 189,194 **** > --- 279,287 ---- > void > _PG_init(void) > { > + /* Notifier structure used to override functions in Notifier subsystem*/ > + Tcl_NotifierProcs notifier; > + > /* Be sure we do initialization only once (should be redundant now) */ > if (pltcl_pm_init_done) > return; > *************** > *** 199,204 **** > --- 292,316 ---- > #endif > > /************************************************************ > + * Override the functions in the Notifier subsystem. > + * > + * We do this to prevent the postgres backend from becoming > + * multithreaded, which happens in the default version of > + * Tcl_InitNotifier if the TCL library has been compiled with > + * multithreading support (i.e. when TCL_THREADS is defined > + * under Unix, and in all cases under Windows). > + ************************************************************/ > + notifier.setTimerProc = pltcl_SetTimer; > + notifier.waitForEventProc = pltcl_WaitForEvent; > + notifier.createFileHandlerProc = pltcl_CreateFileHandler; > + notifier.deleteFileHandlerProc = pltcl_DeleteFileHandler; > + notifier.initNotifierProc = pltcl_InitNotifier; > + notifier.finalizeNotifierProc = pltcl_FinalizeNotifier; > + notifier.alertNotifierProc = pltcl_AlertNotifier; > + notifier.serviceModeHookProc = pltcl_ServiceModeHook; > + Tcl_SetNotifier(¬ifier); > + > + /************************************************************ > * Create the dummy hold interpreter to prevent close of > * stdout and stderr on DeleteInterp > ************************************************************/ > > ---------------------------(end of broadcast)--------------------------- > TIP 3: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faq -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
"Bruce Momjian" <bruce@momjian.us> writes: > This has been saved for the 8.4 release: > > http://momjian.postgresql.org/cgi-bin/pgpatches_hold > > --------------------------------------------------------------------------- > > Marshall, Steve wrote: >> There is a problem in PL/TCL that can cause the postgres backend to >> become multithreaded. Postgres is not designed to be multithreaded, so >> this causes downstream errors in signal handling. Um, this is a bug fix. Unless you had some problem with it? Do we have anyone actively maintaining pltcl these days? I'm intentionally quite unfamiliar with Tcl or I would be happy to verify it's reasonable. But the explanation seems pretty convincing. If we don't have anyone maintaining it then we're pretty much at the mercy of applying whatever patches come in from people who are more familiar with it than us. In my experience that's how new maintainers for modules of free software are often found anyways. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes: > "Bruce Momjian" <bruce@momjian.us> writes: >>> There is a problem in PL/TCL that can cause the postgres backend to >>> become multithreaded. Postgres is not designed to be multithreaded, so >>> this causes downstream errors in signal handling. > Um, this is a bug fix. Unless you had some problem with it? I haven't reviewed that patch yet, but I concur we should consider it for 8.3. regards, tom lane
Gregory Stark wrote: > Do we have anyone actively maintaining pltcl these days? I'm intentionally > quite unfamiliar with Tcl or I would be happy to verify it's reasonable. But > the explanation seems pretty convincing. If we don't have anyone maintaining > it then we're pretty much at the mercy of applying whatever patches come in > from people who are more familiar with it than us. In my experience that's how > new maintainers for modules of free software are often found anyways. > > I was hoping that Jan or somebody else with some Tcl-fu would comment. I agree it probably shouldn't be deferred. It does bother me a bit that the Tcl engine startup just starts firing up threads without advertisement - I wondered if we shouldn't kick back and say it's their problem to fix. cheers andrew
OK, moved to patches list. --------------------------------------------------------------------------- Andrew Dunstan wrote: > > > Gregory Stark wrote: > > Do we have anyone actively maintaining pltcl these days? I'm intentionally > > quite unfamiliar with Tcl or I would be happy to verify it's reasonable. But > > the explanation seems pretty convincing. If we don't have anyone maintaining > > it then we're pretty much at the mercy of applying whatever patches come in > > from people who are more familiar with it than us. In my experience that's how > > new maintainers for modules of free software are often found anyways. > > > > > > I was hoping that Jan or somebody else with some Tcl-fu would comment. I > agree it probably shouldn't be deferred. > > It does bother me a bit that the Tcl engine startup just starts firing > up threads without advertisement - I wondered if we shouldn't kick back > and say it's their problem to fix. > > cheers > > andrew -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Tom Lane wrote: > Gregory Stark <stark@enterprisedb.com> writes: >> "Bruce Momjian" <bruce@momjian.us> writes: >>>> There is a problem in PL/TCL that can cause the postgres backend to >>>> become multithreaded. Postgres is not designed to be multithreaded, so >>>> this causes downstream errors in signal handling. > >> Um, this is a bug fix. Unless you had some problem with it? > > I haven't reviewed that patch yet, but I concur we should consider it > for 8.3. hmm i wonder if that could be related to: http://archives.postgresql.org/pgsql-hackers/2007-01/msg00377.php Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: > hmm i wonder if that could be related to: > http://archives.postgresql.org/pgsql-hackers/2007-01/msg00377.php I had forgotten that thread, but it sure does look related doesn't it? Do you want to try Steve's proposed patch and see if it fixes it? regards, tom lane
Tom Lane wrote: > Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: >> hmm i wonder if that could be related to: >> http://archives.postgresql.org/pgsql-hackers/2007-01/msg00377.php > > I had forgotten that thread, but it sure does look related doesn't it? > Do you want to try Steve's proposed patch and see if it fixes it? yeah testing that patch now (seems to apply just fine on -HEAD) but it seems that there is something strange going on because I just got: http://www.kaltenbrunner.cc/files/regression.diffs on the first manual buildfarm run on quagga... Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: > yeah testing that patch now (seems to apply just fine on -HEAD) but it > seems that there is something strange going on because I just got: ! ERROR: could not read block 2 of relation 1663/16384/2606: read only 0 of 8192 bytes Is that repeatable? What sort of filesystem are you testing on? (soft-mounted NFS by any chance?) regards, tom lane
Tom Lane wrote: > Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: >> yeah testing that patch now (seems to apply just fine on -HEAD) but it >> seems that there is something strange going on because I just got: > > ! ERROR: could not read block 2 of relation 1663/16384/2606: read only 0 of 8192 bytes > > Is that repeatable? What sort of filesystem are you testing on? that box is slow - still testing ... > (soft-mounted NFS by any chance?) no - local SATA disk with ext3 (no OS related errors and SMART is clean too) Stefan
Tom Lane wrote: > Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: >> yeah testing that patch now (seems to apply just fine on -HEAD) but it >> seems that there is something strange going on because I just got: > > ! ERROR: could not read block 2 of relation 1663/16384/2606: read only 0 of 8192 bytes > > Is that repeatable? What sort of filesystem are you testing on? > (soft-mounted NFS by any chance?) doesn't seem to be repeatable :-( on the postitive side - the pltcl patch does seem to fix the mentioned regression failures quagga triggered earlier this year. I did about a dozends of full buildfarm runs with that patch and about ten manual executions of the pltcl regression tests without any sign of misbehaviour so this seems like a clear candidate for at least -HEAD. Stefan
"Marshall, Steve" <smarshall@wsi.com> writes: > There is a problem in PL/TCL that can cause the postgres backend to > become multithreaded. Postgres is not designed to be multithreaded, so > this causes downstream errors in signal handling. We have seen this > cause a number of "unexpected state" errors associated with notification > handling; however, unpredictable signal handling would be likely to > cause other errors as well. I've applied this patch to CVS HEAD (8.3-to-be). I'm a bit hesitant to back-patch it however, at least not till it gets through some beta testing. Thanks for the detailed explanation, test case, and patch! regards, tom lane
I'm glad to see the patch making its way through the process. I'm also glad you guys do comprehensive testing before accepting it, since we are only able to test in a more limited range of environments. We have applied the patch to our 8.2.4 installations and are running it in a high transaction rate system (processing lots and lots of continually changing weather data). Let me know if there is any information we could provide that would be of help in making the back-patching decision. Yours, Steve Marshall -----Original Message----- From: Tom Lane [mailto:tgl@sss.pgh.pa.us] Sent: Thursday, September 20, 2007 8:33 PM To: Marshall, Steve Cc: pgsql-patches@postgresql.org Subject: Re: [PATCHES] PL/TCL Patch to prevent postgres from becoming multithreaded "Marshall, Steve" <smarshall@wsi.com> writes: > There is a problem in PL/TCL that can cause the postgres backend to > become multithreaded. Postgres is not designed to be multithreaded, so > this causes downstream errors in signal handling. We have seen this > cause a number of "unexpected state" errors associated with > notification handling; however, unpredictable signal handling would be > likely to cause other errors as well. I've applied this patch to CVS HEAD (8.3-to-be). I'm a bit hesitant to back-patch it however, at least not till it gets through some beta testing. Thanks for the detailed explanation, test case, and patch! regards, tom lane
Marshall, Steve wrote: > I'm glad to see the patch making its way through the process. I'm also > glad you guys do comprehensive testing before accepting it, since we are > only able to test in a more limited range of environments. > > We have applied the patch to our 8.2.4 installations and are running it > in a high transaction rate system (processing lots and lots of > continually changing weather data). Let me know if there is any > information we could provide that would be of help in making the > back-patching decision. I have re-enabled tcl builds(for -HEAD) on lionfish (mipsel) and quagga (arm) a few days ago so we should get a bit of additional coverage from boxes that definitly had problems with the tcl-threading behaviour. Stefan