Re: BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram - Mailing list pgsql-bugs
From | Robert Haas |
---|---|
Subject | Re: BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram |
Date | |
Msg-id | CA+TgmoZheGK5AvR8Nw0WPwwxzvfpzFENdCbP_2ennJSBnraEnA@mail.gmail.com |
In response to | BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram ("Luke Koops" <luke.koops@entrust.com>) |
Responses | Re: BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram |
List | pgsql-bugs |
On Fri, Jul 31, 2009 at 10:59 AM, Luke Koops <luke.koops@entrust.com> wrote:
> -- postgres.exe!mainCRTStartup --
> ntoskrnl.exe!KiSwapContext+0x26
> ntoskrnl.exe!KiSwapThread+0x2e5
> ntoskrnl.exe!KeWaitForSingleObject+0x346
> ntoskrnl.exe!KiSuspendThread+0x18
> ntoskrnl.exe!KiDeliverApc+0x117
> ntoskrnl.exe!KiSwapThread+0x300
> ntoskrnl.exe!KeWaitForMultipleObjects+0x3d7
> ntoskrnl.exe!ObpWaitForMultipleObjects+0x202
> ntoskrnl.exe!NtWaitForMultipleObjects+0xe9
> ntoskrnl.exe!KiFastCallEntry+0xfc
> ntdll.dll!KiFastSystemCallRet
> ntdll.dll!NtWaitForMultipleObjects+0xc
> kernel32.dll!WaitForMultipleObjectsEx+0x11a
> postgres.exe!pgwin32_waitforsinglesocket+0x1ed
> postgres.exe!pgwin32_recv+0x90
> postgres.exe!PgstatCollectorMain+0x17f
> postgres.exe!SubPostmasterMain+0x33a
> postgres.exe!main+0x168
> postgres.exe!__tmainCRTStartup+0x10f
> kernel32.dll!BaseProcessStart+0x23

We just had a customer hit a very similar problem on 9.1.3, running on
Windows Server 2008 SP2. They were able to extract the following
stack trace:

ntoskrnl.exe!KiSwapContext+0x7a
ntoskrnl.exe!KiCommitThreadWait+0x1d2
ntoskrnl.exe!KeWaitForMultipleObjects+0x271
ntoskrnl.exe!ObpWaitForMultipleObjects+0x294
ntoskrnl.exe!NtWaitForMultipleObjects+0xe5
ntoskrnl.exe!KiSystemServiceCopyEnd+0x13
ntdll.dll!ZwWaitForMultipleObjects+0xa
KERNELBASE.dll!WaitForMultipleObjectsEx+0xe8
kernel32.dll!WaitForMultipleObjectsExImplementation+0xb3
postgres.exe!pgwin32_waitforsinglesocket+0x26d
postgres.exe!pgwin32_recv+0xf0
postgres.exe!PgstatCollectorMain+0x1cc
postgres.exe!SubPostmasterMain+0x4c2
postgres.exe!main+0x1d0
postgres.exe!__tmainCRTStartup+0x11a
kernel32.dll!BaseThreadInitThunk+0xd
ntdll.dll!RtlUserThreadStart+0x1d

The customer finds that they can reproduce this on a variety of
systems under heavy load. However, removing the load doesn't fix the
problem; the system continues to spew pgstat wait timeout messages
into the logs. Autovacuum fails to DTRT due to lack of current stats
and things go downhill rapidly from there. Terminating the stats
collector process resolves the issue; the postmaster starts a new one
within 60 seconds and after that the pgstat wait timeout messages
cease and vacuuming consequently resumes.

Now, it looks to me like for this stack trace to happen,
PgstatCollectorMain() has got to call pgwin32_waitforsinglesocket (at
line 3002), and that function has to return true, so that got_data
gets set to true. Then PgstatCollectorMain() will call recv(), which
on Windows will really be pgwin32_recv, which will call
pgwin32_waitforsinglesocket, which must now hang. The fact that the
first pgwin32_waitforsinglesocket call returned true should mean that
the stats collector socket is ready for read, while the fact that the
second one did not return seems to imply that it's not ready for read,
close, or accept. So it almost looks like Windows can change its mind
about whether the socket is readable. Or maybe we're telling it to
change its mind.

This sounds an awful lot like something that could have been caused by
the oversights fixed in commit
b85427f2276d02756b558c0024949305ea65aca5. Was there a reason we
didn't back-patch that?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
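
For readers following the analysis above, the two-step sequence (wait for the socket to become readable, then receive) can be illustrated with a small, portable sketch. This is not the PostgreSQL code and does not use the Windows event-based wait. It is a hypothetical model in Python, using `select` and a non-blocking UDP socket, and the names `collector_step` and `demo` are invented for illustration. It only shows why a receive that blocks unconditionally can hang if the earlier readiness indication turns out to be stale, and how a non-blocking receive avoids that.

```python
import select
import socket


def collector_step(sock, wait_timeout=0.5, bufsize=1024):
    """Run one iteration of a wait-then-receive loop (hypothetical model).

    Returns the received datagram, or None if nothing arrived within the
    timeout or the readiness indication turned out to be stale.
    """
    # Step 1: wait for readiness, analogous to the first wait in the message.
    readable, _, _ = select.select([sock], [], [], wait_timeout)
    if not readable:
        return None

    # Step 2: receive. The socket is non-blocking, so if the socket is no
    # longer readable we get BlockingIOError instead of waiting forever.
    try:
        return sock.recv(bufsize)
    except BlockingIOError:
        return None


def demo():
    """Exercise collector_step over a loopback UDP socket pair."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 0))   # let the OS pick a free port
    receiver.setblocking(False)       # assumption: receive must never block
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        empty = collector_step(receiver, wait_timeout=0.1)  # nothing sent yet
        sender.sendto(b"stats", receiver.getsockname())
        got = collector_step(receiver, wait_timeout=2.0)
        return empty, got
    finally:
        sender.close()
        receiver.close()


if __name__ == "__main__":
    print(demo())
```

In this model the second step can never wait indefinitely, because the only blocking point is the bounded `select` call. Whether the same structure applies to the Windows socket emulation discussed in the thread is a separate question that the sketch does not answer.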