Thread: Why is citext/regress failing on hamerkop?
For example, 'i'::citext = 'İ'::citext fails to be true. It must now be using UTF-8 (unlike, say, Drongo) and a non-C ctype, given that the test isn't skipped. This isn't the first time we've noticed that Windows doesn't seem to know about İ→i (see [1]), but I don't think anyone has explained exactly why yet. It could be that it just doesn't know about that mapping in any locale, or that it is locale-dependent and would only do that for Turkish (the same reason we skip the test for ICU), or ...

Either way, it seems like we'll need to skip that test on Windows if we want hamerkop to be green. That can probably be cribbed from collate.windows.win1252.sql into contrib/citext/sql/citext_utf8.sql's prelude... I just don't know how to explain it in the comment 'cause I don't know why.

One new development in Windows-land is that the system now actually supports UTF-8 in the runtime libraries [2]. You can set it at a system level, for an application at build time, or by adding ".UTF-8" to a locale name when opening the locale (apparently much more like Unix systems, though I don't know exactly what it does). I wonder why we see this change now: did hamerkop have that ACP=UTF-8 setting applied on purpose, did computers in Japan start doing that by default instead of using Shift-JIS, or did it only start picking UTF-8 around the time of the Meson change somehow, or the initdb-template change? It's a little hard to tell from the logs.

[1] https://www.postgresql.org/message-id/CAC%2BAXB10p%2BmnJ6wrAEm6jb51%2B8%3DBfYzD%3Dw6ftHRbMjMuSFN3kQ%40mail.gmail.com
[2] https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page
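One way to narrow that down would be a tiny standalone probe run on a Windows box, something along these lines (not PostgreSQL code; the locale names are only examples and may not all exist on a given system):

/*
 * Hypothetical probe: does this C runtime fold U+0130 (LATIN CAPITAL
 * LETTER I WITH DOT ABOVE) to 'i' at all, and does the answer depend
 * on the locale?  Locale names below are examples only.
 */
#include <locale.h>
#include <stdio.h>
#include <wctype.h>

int
main(void)
{
    const char *names[] = {
        "English_United States.1252",
        "Turkish_Turkey.1254",
        "en-US.UTF-8",          /* needs a UTF-8-capable runtime */
        "tr-TR.UTF-8"
    };

    for (int i = 0; i < 4; i++)
    {
        if (setlocale(LC_CTYPE, names[i]) == NULL)
        {
            printf("%-28s: not available\n", names[i]);
            continue;
        }
        printf("%-28s: towlower(U+0130) = U+%04X\n",
               names[i], (unsigned) towlower(0x0130));
    }
    return 0;
}

If only the Turkish locales report U+0069, that would support the "locale-dependent, Turkish-only" theory; if none of them do, the runtime simply doesn't know about that mapping at all.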
On Sat, May 11, 2024 at 1:14 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Either way, it seems like we'll need to skip that test on Windows if
> we want hamerkop to be green. That can probably be cribbed from
> collate.windows.win1252.sql into contrib/citext/sql/citext_utf8.sql's
> prelude... I just don't know how to explain it in the comment 'cause I
> don't know why.

Here's a minimal patch like that.

I don't think it's worth back-patching. Only 15 and 16 could possibly be affected, I think, because the test wasn't enabled before that. I think this is all just a late-appearing blow-up predicted by the commit that enabled it:

commit c2e8bd27519f47ff56987b30eb34a01969b9a9e8
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Wed Jan 5 13:30:07 2022 -0500

    Enable routine running of citext's UTF8-specific test cases.

    These test cases have been commented out since citext was invented,
    because at the time we had no nice way to deal with tests that have
    restrictions such as requiring UTF8 encoding. But now we do have a
    convention for that, ie put them into a separate test file with an
    early-exit path. So let's enable these tests to run when their
    prerequisites are satisfied.

    (We may have to tighten the prerequisites beyond the "encoding = UTF8
    and locale != C" checks made here. But let's put it on the buildfarm
    and see what blows up.)

Hamerkop is already green on the 15 and 16 branches, apparently because it's using the pre-meson test stuff that I guess just didn't run the relevant test. In other words, nobody would notice the difference anyway, and a master-only fix would be enough to end this 44-day red streak.
Thomas Munro <thomas.munro@gmail.com> writes:
> On Sat, May 11, 2024 at 1:14 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> Either way, it seems like we'll need to skip that test on Windows if
>> we want hamerkop to be green. That can probably be cribbed from
>> collate.windows.win1252.sql into contrib/citext/sql/citext_utf8.sql's
>> prelude... I just don't know how to explain it in the comment 'cause I
>> don't know why.

> Here's a minimal patch like that.

WFM until some Windows person cares to probe more deeply.

BTW, I've also been wondering why hamerkop has been failing isolation-check in the 12 and 13 branches for the last six months or so. It is surely unrelated to this issue, and it looks like it must be due to some platform change rather than anything we committed at the time.

I'm not planning on looking into that question myself, but really somebody ought to. Or is Windows just as dead as AIX, in terms of anybody being willing to put effort into supporting it?

regards, tom lane
Hello Tom,

12.05.2024 08:34, Tom Lane wrote:
> BTW, I've also been wondering why hamerkop has been failing
> isolation-check in the 12 and 13 branches for the last six months
> or so. It is surely unrelated to this issue, and it looks like
> it must be due to some platform change rather than anything we
> committed at the time.
>
> I'm not planning on looking into that question myself, but really
> somebody ought to. Or is Windows just as dead as AIX, in terms of
> anybody being willing to put effort into supporting it?

I've reproduced the failure locally with GSS enabled, so I'll try to figure out what's going on here in the next few days.

Best regards,
Alexander
On 2024-05-12 Su 01:34, Tom Lane wrote:
> BTW, I've also been wondering why hamerkop has been failing isolation-check in the 12 and 13 branches for the last six months or so. It is surely unrelated to this issue, and it looks like it must be due to some platform change rather than anything we committed at the time.
Possibly. It looks like this might be the issue:
+Connection 2 failed: could not initiate GSSAPI security context: Unspecified GSS failure. Minor code may provide more information: Credential cache is empty
+FATAL: sorry, too many clients already

There are several questions here, including:

1. why isn't it failing on later branches?
2. why isn't it failing on drongo (which has more modern compiler and OS)?

I think we'll need the help of the animal owner to dig into the issue.
> I'm not planning on looking into that question myself, but really somebody ought to. Or is Windows just as dead as AIX, in terms of anybody being willing to put effort into supporting it?
Well, this is more or less where I came in back in about 2002 :-) I've been trying to help support it ever since, motivated more by stubborn persistence than anything else. Still, I agree that the lack of support for the Windows port from Microsoft over the years has been more than disappointing.
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
On 2024-05-12 Su 08:26, Andrew Dunstan wrote:
> On 2024-05-12 Su 01:34, Tom Lane wrote:
>> BTW, I've also been wondering why hamerkop has been failing isolation-check in the 12 and 13 branches for the last six months or so. It is surely unrelated to this issue, and it looks like it must be due to some platform change rather than anything we committed at the time.
> Possibly. It looks like this might be the issue:
> +Connection 2 failed: could not initiate GSSAPI security context: Unspecified GSS failure. Minor code may provide more information: Credential cache is empty
> +FATAL: sorry, too many clients already
>
> There are several questions here, including:
>
> 1. why isn't it failing on later branches?
> 2. why isn't it failing on drongo (which has more modern compiler and OS)?
>
> I think we'll need the help of the animal owner to dig into the issue.
Aha! drongo doesn't have GSSAPI enabled. Will work on that.
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
On Mon, May 13, 2024 at 12:26 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> Well, this is more or less where I came in back in about 2002 :-) I've been trying to help support it ever since, motivated more by stubborn persistence than anything else. Still, I agree that the lack of support for the Windows port from Microsoft over the years has been more than disappointing.

I think "state of the Windows port" would make a good discussion topic at pgconf.dev (with write-up for those who can't be there). If there is interest, I could organise that with a short presentation of the issues I am aware of so far and possible improvements and smaller-things-we-could-drop-instead-of-dropping-the-whole-port. I would focus on technical stuff, not who-should-be-doing-what, 'cause I can't make anyone do anything.

For citext_utf8, I pushed cff4e5a3. Hamerkop runs infrequently, so here's hoping for 100% green on master by Tuesday or so.
On 2024-05-12 Su 18:05, Thomas Munro wrote:
> On Mon, May 13, 2024 at 12:26 AM Andrew Dunstan <andrew@dunslane.net> wrote:
>> Well, this is more or less where I came in back in about 2002 :-) I've been trying to help support it ever since, motivated more by stubborn persistence than anything else. Still, I agree that the lack of support for the Windows port from Microsoft over the years has been more than disappointing.
> I think "state of the Windows port" would make a good discussion topic
> at pgconf.dev (with write-up for those who can't be there). If there
> is interest, I could organise that with a short presentation of the
> issues I am aware of so far and possible improvements and
> smaller-things-we-could-drop-instead-of-dropping-the-whole-port. I
> would focus on technical stuff, not who-should-be-doing-what, 'cause I
> can't make anyone do anything.

+1

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Thomas Munro <thomas.munro@gmail.com> writes:
> For citext_utf8, I pushed cff4e5a3. Hamerkop runs infrequently, so
> here's hoping for 100% green on master by Tuesday or so.

In the meantime, some off-list investigation by Alexander Lakhin has turned up a good deal of information about why we're seeing failures on hamerkop in the back branches. Summarizing, it appears that

1. In a GSS-enabled Windows build without any active Kerberos server, libpq's pg_GSS_have_cred_cache() succeeds, allowing libpq to try to open a GSSAPI connection, but then gss_init_sec_context() fails, leading to client-side reports like this:

+Connection 2 failed: could not initiate GSSAPI security context: Unspecified GSS failure. Minor code may provide more information: Credential cache is empty
+FATAL: sorry, too many clients already

(The first of these lines comes out during the attempted GSS connection, the second during the only-slightly-more-successful non-GSS connection.)

So that's problem number 1: how is it that gss_acquire_cred() succeeds but then gss_init_sec_context() disclaims knowledge of any credentials? Can we find a way to get this failure to be detected during pg_GSS_have_cred_cache()? It is mighty expensive to launch a backend connection that is doomed to fail, especially when this happens during *every single libpq connection attempt*.

2. Once gss_init_sec_context() fails, libpq abandons the connection and starts over; since it has already initiated a GSS handshake on the connection, there's not much choice. Although libpq faithfully issues closesocket() on the abandoned connection, Alexander found that the connected backend doesn't reliably see that: it may just sit there until the AuthenticationTimeout elapses (1 minute by default). That backend is still consuming a postmaster child slot, so if this happens on any sizable fraction of test connection attempts, it's little surprise that we soon get "sorry, too many clients already" failures.

3. We don't know exactly why hamerkop suddenly started seeing these failures, but a plausible theory emerges after noting that its reported time for the successful "make check" step dropped pretty substantially right when this started. In the v13 branch, "make check" was taking 2:18 or more in the several runs right before the first isolationcheck failure, but 1:40 or less just after. So it looks like the animal was moved onto faster hardware. That feeds into this problem because, with a successful isolationcheck run taking more than a minute, there was enough time for some of the earlier stuck sessions to time out and exit before their postmaster-child slots were needed. Alexander confirmed this theory by demonstrating that the main regression tests in v15 would pass if he limited their speed enough (by reducing the CPU allowed to a VM) but not at full speed. So the buildfarm results suggesting this is only an issue in <= v13 must be just a timing artifact; the problem is still there.

Of course, backends waiting till timeout is not a good behavior independently of what is triggering that, so we have two problems to solve here, not one.

I have no ideas about the gss_init_sec_context() failure, but I see a plausible theory about the failure to detect socket closure in Microsoft's documentation about closesocket() [1]:

    If the l_onoff member of the LINGER structure is zero on a stream
    socket, the closesocket call will return immediately and does not
    receive WSAEWOULDBLOCK whether the socket is blocking or nonblocking.
    However, any data queued for transmission will be sent, if possible,
    before the underlying socket is closed. This is also called a
    graceful disconnect or close. In this case, the Windows Sockets
    provider cannot release the socket and other resources for an
    arbitrary period, thus affecting applications that expect to use all
    available sockets. This is the default behavior for a socket.

I'm not sure whether we've got unsent data pending in the problematic condition, nor why it'd remain unsent if we do (shouldn't the backend consume it anyway?). But this has the right odor for an explanation.

I'm pretty hesitant to touch this area myself, because it looks an awful lot like commits 6051857fc and ed52c3707, which eventually had to be reverted. I think we need a deeper understanding of exactly what Winsock is doing or not doing before we try to fix it. I wonder if there are any Microsoft employees around here with access to the relevant source code.

In the short run, it might be a good idea to deprecate using --with-gssapi on Windows builds. A different stopgap idea could be to drastically reduce the default AuthenticationTimeout, perhaps only on Windows.

regards, tom lane

[1] https://learn.microsoft.com/en-us/windows/win32/api/winsock/nf-winsock-closesocket
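For reference, the lingering behaviour that documentation paragraph describes is governed by the SO_LINGER socket option; the following is only an illustration of the knob being talked about (the helper name is made up, and PostgreSQL does not currently set this):

/*
 * Illustration only: override the default "graceful" close described in
 * the closesocket() documentation with a hard (abortive) close.
 * l_onoff == 0 (the default) gives the graceful behaviour quoted above;
 * l_onoff == 1 with l_linger == 0 discards queued data and resets the
 * connection immediately.
 */
#include <winsock2.h>
#include <stdio.h>

void
set_hard_close(SOCKET s)
{
    LINGER      lng;

    lng.l_onoff = 1;            /* enable linger ... */
    lng.l_linger = 0;           /* ... with a zero timeout: abortive close */
    if (setsockopt(s, SOL_SOCKET, SO_LINGER,
                   (const char *) &lng, sizeof(lng)) == SOCKET_ERROR)
        fprintf(stderr, "setsockopt(SO_LINGER) failed: %d\n",
                WSAGetLastError());
}

That abortive close is roughly what "unclean socket close" means in the next message; libpq's plain closesocket() call gets the graceful default described in the quoted documentation.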
On Tue, May 14, 2024 at 8:17 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm not sure whether we've got unsent data pending in the problematic
> condition, nor why it'd remain unsent if we do (shouldn't the backend
> consume it anyway?). But this has the right odor for an explanation.
>
> I'm pretty hesitant to touch this area myself, because it looks an
> awful lot like commits 6051857fc and ed52c3707, which eventually
> had to be reverted. I think we need a deeper understanding of
> exactly what Winsock is doing or not doing before we try to fix it.

I was beginning to suspect that lingering odour myself. I haven't looked at the GSS code, but I was imagining that what we have here is perhaps not unsent data dropped on the floor due to linger policy (unclean socket close on process exit), but rather that the server didn't see the socket as ready to read because it lost track of the FD_CLOSE somewhere because the client closed it gracefully, and our server-side FD_CLOSE handling has always been a bit suspect. I wonder if the GSS code is somehow more prone to brokenness.

One thing we learned in earlier problems was that abortive/error disconnections generate FD_CLOSE repeatedly, while graceful ones give you only one. In other words, if the other end politely calls closesocket(), the server had better not miss the FD_CLOSE event, because it won't come again. That's what

https://commitfest.postgresql.org/46/3523/

is intended to fix. Does it help here? Unfortunately that's unpleasantly complicated and unbackpatchable (keeping a side-table of socket FDs and event handles, so we don't lose events between the cracks).
13.05.2024 23:17, Tom Lane wrote:
> 3. We don't know exactly why hamerkop suddenly started seeing these
> failures, but a plausible theory emerges after noting that its
> reported time for the successful "make check" step dropped pretty
> substantially right when this started. In the v13 branch, "make
> check" was taking 2:18 or more in the several runs right before the
> first isolationcheck failure, but 1:40 or less just after. So it
> looks like the animal was moved onto faster hardware. That feeds
> into this problem because, with a successful isolationcheck run
> taking more than a minute, there was enough time for some of the
> earlier stuck sessions to time out and exit before their
> postmaster-child slots were needed.

Yes, and one thing I can't explain yet is why REL_14_STABLE+ timings substantially differ from REL_13_STABLE-, say, for the check stage:

REL_14_STABLE: the oldest available test log (from 2021-10-30) shows check (00:03:47) and the newest one (from 2024-05-12): check (00:03:18).

Whilst on REL_13_STABLE the oldest available log (from 2021-08-06) shows check duration 00:03:00, then it decreased to 00:02:24 (2021-09-22) / 00:02:14 (2021-11-07), and now it's 1:40, as you said.

Locally I see more or less the same timings on REL_13_STABLE (34, 28, 27 secs) and on REL_14_STABLE (33, 29, 29 secs).

14.05.2024 03:38, Thomas Munro wrote:
> I was beginning to suspect that lingering odour myself. I haven't
> looked at the GSS code, but I was imagining that what we have here is
> perhaps not unsent data dropped on the floor due to linger policy
> (unclean socket close on process exit), but rather that the server
> didn't see the socket as ready to read because it lost track of the
> FD_CLOSE somewhere because the client closed it gracefully, and our
> server-side FD_CLOSE handling has always been a bit suspect. I wonder
> if the GSS code is somehow more prone to brokenness. One thing we
> learned in earlier problems was that abortive/error disconnections
> generate FD_CLOSE repeatedly, while graceful ones give you only one.
> In other words, if the other end politely calls closesocket(), the
> server had better not miss the FD_CLOSE event, because it won't come
> again. That's what
>
> https://commitfest.postgresql.org/46/3523/
>
> is intended to fix. Does it help here? Unfortunately that's
> unpleasantly complicated and unbackpatchable (keeping a side-table of
> socket FDs and event handles, so we don't lose events between the
> cracks).

Yes, that cure helps here too. I've tested it on b282fa88d~1 (the last state when that patch set can be applied).
An excerpt (all lines related to process 12500) from a failed run log without the patch set:

2024-05-14 05:57:29.526 UTC [8228:128] DEBUG: forked new backend, pid=12500 socket=5524
2024-05-14 05:57:29.534 UTC [12500:1] [unknown] LOG: connection received: host=::1 port=51394
2024-05-14 05:57:29.534 UTC [12500:2] [unknown] LOG: !!!BackendInitialize| before ProcessStartupPacket
2024-05-14 05:57:29.534 UTC [12500:3] [unknown] LOG: !!!ProcessStartupPacket| before secure_open_gssapi(), GSSok: G
2024-05-14 05:57:29.534 UTC [12500:4] [unknown] LOG: !!!secure_open_gssapi| before read_or_wait
2024-05-14 05:57:29.534 UTC [12500:5] [unknown] LOG: !!!read_or_wait| before secure_raw_read(); PqGSSRecvLength: 0, len:4
2024-05-14 05:57:29.534 UTC [12500:6] [unknown] LOG: !!!read_or_wait| after secure_raw_read: -1, errno: 10035
2024-05-14 05:57:29.534 UTC [12500:7] [unknown] LOG: !!!read_or_wait| before WaitLatchOrSocket()
2024-05-14 05:57:29.549 UTC [12500:8] [unknown] LOG: !!!read_or_wait| after WaitLatchOrSocket
2024-05-14 05:57:29.549 UTC [12500:9] [unknown] LOG: !!!read_or_wait| before secure_raw_read(); PqGSSRecvLength: 0, len:4
2024-05-14 05:57:29.549 UTC [12500:10] [unknown] LOG: !!!read_or_wait| after secure_raw_read: 0, errno: 10035
2024-05-14 05:57:29.549 UTC [12500:11] [unknown] LOG: !!!read_or_wait| before WaitLatchOrSocket()
...
2024-05-14 05:57:52.024 UTC [8228:3678] DEBUG: server process (PID 12500) exited with exit code 1    # at the end of the test run

And an excerpt (all lines related to process 11736) from a successful run log with the patch set applied:

2024-05-14 06:03:57.216 UTC [4524:130] DEBUG: forked new backend, pid=11736 socket=5540
2024-05-14 06:03:57.226 UTC [11736:1] [unknown] LOG: connection received: host=::1 port=51914
2024-05-14 06:03:57.226 UTC [11736:2] [unknown] LOG: !!!BackendInitialize| before ProcessStartupPacket
2024-05-14 06:03:57.226 UTC [11736:3] [unknown] LOG: !!!ProcessStartupPacket| before secure_open_gssapi(), GSSok: G
2024-05-14 06:03:57.226 UTC [11736:4] [unknown] LOG: !!!secure_open_gssapi| before read_or_wait
2024-05-14 06:03:57.226 UTC [11736:5] [unknown] LOG: !!!read_or_wait| before secure_raw_read(); PqGSSRecvLength: 0, len:4
2024-05-14 06:03:57.226 UTC [11736:6] [unknown] LOG: !!!read_or_wait| after secure_raw_read: -1, errno: 10035
2024-05-14 06:03:57.226 UTC [11736:7] [unknown] LOG: !!!read_or_wait| before WaitLatchOrSocket()
2024-05-14 06:03:57.240 UTC [11736:8] [unknown] LOG: !!!read_or_wait| after WaitLatchOrSocket
2024-05-14 06:03:57.240 UTC [11736:9] [unknown] LOG: !!!read_or_wait| before secure_raw_read(); PqGSSRecvLength: 0, len:4
2024-05-14 06:03:57.240 UTC [11736:10] [unknown] LOG: !!!read_or_wait| after secure_raw_read: 0, errno: 10035
2024-05-14 06:03:57.240 UTC [11736:11] [unknown] LOG: !!!read_or_wait| before WaitLatchOrSocket()
2024-05-14 06:03:57.240 UTC [11736:12] [unknown] LOG: !!!read_or_wait| after WaitLatchOrSocket
2024-05-14 06:03:57.240 UTC [11736:13] [unknown] LOG: !!!secure_open_gssapi| read_or_wait returned -1
2024-05-14 06:03:57.240 UTC [11736:14] [unknown] LOG: !!!ProcessStartupPacket| secure_open_gssapi() returned error
2024-05-14 06:03:57.240 UTC [11736:15] [unknown] LOG: !!!BackendInitialize| after ProcessStartupPacket
2024-05-14 06:03:57.240 UTC [11736:16] [unknown] LOG: !!!BackendInitialize| status: -1
2024-05-14 06:03:57.240 UTC [11736:17] [unknown] DEBUG: shmem_exit(0): 0 before_shmem_exit callbacks to make
2024-05-14 06:03:57.240 UTC [11736:18] [unknown] DEBUG: shmem_exit(0): 0 on_shmem_exit callbacks to make
2024-05-14 06:03:57.240 UTC [11736:19] [unknown] DEBUG: proc_exit(0): 1 callbacks to make
2024-05-14 06:03:57.240 UTC [11736:20] [unknown] DEBUG: exit(0)
2024-05-14 06:03:57.240 UTC [11736:21] [unknown] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2024-05-14 06:03:57.240 UTC [11736:22] [unknown] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2024-05-14 06:03:57.240 UTC [11736:23] [unknown] DEBUG: proc_exit(-1): 0 callbacks to make
2024-05-14 06:03:57.243 UTC [4524:132] DEBUG: forked new backend, pid=10536 socket=5540
2024-05-14 06:03:57.243 UTC [4524:133] DEBUG: server process (PID 11736) exited with exit code 0

Best regards,
Alexander
Alexander Lakhin <exclusion@gmail.com> writes:
> 13.05.2024 23:17, Tom Lane wrote:
>> 3. We don't know exactly why hamerkop suddenly started seeing these
>> failures, but a plausible theory emerges after noting that its
>> reported time for the successful "make check" step dropped pretty
>> substantially right when this started. In the v13 branch, "make
>> check" was taking 2:18 or more in the several runs right before the
>> first isolationcheck failure, but 1:40 or less just after. So it
>> looks like the animal was moved onto faster hardware.

> Yes, and one thing I can't explain yet is why REL_14_STABLE+ timings
> substantially differ from REL_13_STABLE-, say, for the check stage:

As I mentioned in our off-list discussion, I have a lingering feeling that this v14 commit could be affecting the results somehow:

Author: Tom Lane <tgl@sss.pgh.pa.us>
Branch: master Release: REL_14_BR [d5a9a661f] 2020-10-18 12:56:43 -0400

    Update the Winsock API version requested by libpq.

    According to Microsoft's documentation, 2.2 has been the current
    version since Windows 98 or so. Moreover, that's what the Postgres
    backend has been requesting since 2004 (cf commit 4cdf51e64).
    So there seems no reason for libpq to keep asking for 1.1.

I didn't believe at the time that that'd have any noticeable effect, but maybe it somehow made Winsock play a bit nicer with the GSS support?

regards, tom lane
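For context, the version being requested there is the argument to WSAStartup(), which every Winsock program must call before using sockets. A minimal, self-contained illustration (not libpq's actual code) of asking for 2.2:

#include <winsock2.h>
#include <stdio.h>

int
main(void)
{
    WSADATA     wsaData;

    /* Per the commit above, libpq used to ask for 1.1 here; both sides now ask for 2.2. */
    if (WSAStartup(MAKEWORD(2, 2), &wsaData) != 0)
    {
        fprintf(stderr, "WSAStartup failed\n");
        return 1;
    }
    printf("negotiated Winsock version: %d.%d\n",
           LOBYTE(wsaData.wVersion), HIBYTE(wsaData.wVersion));
    WSACleanup();
    return 0;
}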
14.05.2024 17:38, Tom Lane wrote:
> As I mentioned in our off-list discussion, I have a lingering feeling
> that this v14 commit could be affecting the results somehow:
>
> Author: Tom Lane <tgl@sss.pgh.pa.us>
> Branch: master Release: REL_14_BR [d5a9a661f] 2020-10-18 12:56:43 -0400
>
>     Update the Winsock API version requested by libpq.
>
>     According to Microsoft's documentation, 2.2 has been the current
>     version since Windows 98 or so. Moreover, that's what the Postgres
>     backend has been requesting since 2004 (cf commit 4cdf51e64).
>     So there seems no reason for libpq to keep asking for 1.1.
>
> I didn't believe at the time that that'd have any noticeable effect,
> but maybe it somehow made Winsock play a bit nicer with the GSS
> support?

Yes, probably, but maybe not nicer, as the test duration increased? Still, I can't see the difference locally to check that commit. Will try other VMs/configurations, maybe I could find a missing factor...

Best regards,
Alexander
On Tue, May 14, 2024 at 9:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> 14.05.2024 03:38, Thomas Munro wrote:
> > I was beginning to suspect that lingering odour myself. I haven't
> > looked at the GSS code, but I was imagining that what we have here is
> > perhaps not unsent data dropped on the floor due to linger policy
> > (unclean socket close on process exit), but rather that the server
> > didn't see the socket as ready to read because it lost track of the
> > FD_CLOSE somewhere because the client closed it gracefully, and our
> > server-side FD_CLOSE handling has always been a bit suspect. I wonder
> > if the GSS code is somehow more prone to brokenness. One thing we
> > learned in earlier problems was that abortive/error disconnections
> > generate FD_CLOSE repeatedly, while graceful ones give you only one.
> > In other words, if the other end politely calls closesocket(), the
> > server had better not miss the FD_CLOSE event, because it won't come
> > again. That's what
> >
> > https://commitfest.postgresql.org/46/3523/
> >
> > is intended to fix. Does it help here? Unfortunately that's
> > unpleasantly complicated and unbackpatchable (keeping a side-table of
> > socket FDs and event handles, so we don't lose events between the
> > cracks).
>
> Yes, that cure helps here too. I've tested it on b282fa88d~1 (the last
> state when that patch set can be applied).

Thanks for checking, and generally for your infinite patience with all these horrible Windows problems.

OK, so we know what the problem is here. Here is the simplest solution I know of for that problem. I have proposed this in the past and received negative feedback because it's a really gross hack. But I don't personally know what else to do about the back-branches (or even if that complex solution is the right way forward for master). The attached kludge at least has the [de]merit of being a mirror image of the kludge that follows it for the "opposite" event. Does this fix it?
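The gist of the kludge, as a simplified sketch rather than the actual patch: before blocking in the wait loop on behalf of WL_SOCKET_READABLE, ask the socket directly whether input or a graceful EOF is already pending, since the single FD_CLOSE from a graceful close may already have been consumed. The function name here is made up for illustration:

#include <winsock2.h>
#include <stdbool.h>

bool
socket_ready_to_read(SOCKET s)
{
    char        c;
    int         rc = recv(s, &c, 1, MSG_PEEK);

    if (rc > 0)
        return true;            /* data is waiting */
    if (rc == 0)
        return true;            /* graceful EOF: report readable now */
    /* rc < 0: only "would block" means we genuinely have to wait */
    return WSAGetLastError() != WSAEWOULDBLOCK;
}

If something like this returns true, the waiting code can report WL_SOCKET_READABLE immediately instead of sleeping until AuthenticationTimeout fires.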
15.05.2024 01:26, Thomas Munro wrote:
> OK, so we know what the problem is here. Here is the simplest
> solution I know of for that problem. I have proposed this in the past
> and received negative feedback because it's a really gross hack. But
> I don't personally know what else to do about the back-branches (or
> even if that complex solution is the right way forward for master).
> The attached kludge at least has the [de]merit of being a mirror image
> of the kludge that follows it for the "opposite" event. Does this fix
> it?

Yes, I see that abandoned GSS connections are closed immediately, as expected. I have also confirmed that `meson test` with the basic configuration passes on REL_16_STABLE. So from the outside, the fix looks good to me.

Thank you for working on this!

Best regards,
Alexander
On Wed, May 15, 2024 at 6:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> 15.05.2024 01:26, Thomas Munro wrote:
> > OK, so we know what the problem is here. Here is the simplest
> > solution I know of for that problem. I have proposed this in the past
> > and received negative feedback because it's a really gross hack. But
> > I don't personally know what else to do about the back-branches (or
> > even if that complex solution is the right way forward for master).
> > The attached kludge at least has the [de]merit of being a mirror image
> > of the kludge that follows it for the "opposite" event. Does this fix
> > it?
>
> Yes, I see that abandoned GSS connections are closed immediately, as
> expected. I have also confirmed that `meson test` with the basic
> configuration passes on REL_16_STABLE. So from the outside, the fix
> looks good to me.

Alright, unless anyone has an objection or ideas for improvements, I'm going to go ahead and back-patch this slightly tidied up version everywhere.
On Thu, May 16, 2024 at 9:46 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Alright, unless anyone has an objection or ideas for improvements, I'm
> going to go ahead and back-patch this slightly tidied up version
> everywhere.

Of course as soon as I wrote that I thought of a useful improvement myself: as far as I can tell, you only need to do the extra poll on the first wait for WL_SOCKET_READABLE for any given WaitEventSet. I don't think it's needed for later waits done by long-lived WaitEventSet objects, so we can track that with a flag. That avoids adding new overhead for regular backend socket waits after authentication; it's just in these code paths that do a bunch of WaitLatchOrSocket() calls that we need to consider FD_CLOSE events lost between the cracks.

I also don't know if the condition should include "&& received == 0". It probably doesn't make much difference, but by leaving that off we don't have to wonder how peeking interacts with events, ie if it's a problem that we didn't do the "reset" step. Thinking about that, I realised that I should probably set reset = true in this new return path, just like the "normal" WL_SOCKET_READABLE notification path, just to be paranoid. (Programming computers you don't have requires extra paranoia.)

Any chance you could test this version please Alexander?
On Thu, May 16, 2024 at 10:43 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Any chance you could test this version please Alexander?

Sorry, cancel that. v3 is not good. I assume it fixes the GSSAPI thing and is superficially better, but it doesn't handle code that calls twice in a row and ignores the first result (I know that PostgreSQL does that occasionally in a few places), and it's also broken if someone gets recv() = 0 (EOF), and then decides to wait anyway. The only ways I can think of to get full reliable poll()-like semantics is to do that peek every time, OR the complicated patch (per-socket-workspace + intercepting recv etc). So I'm back to v2.
Hello Thomas,

16.05.2024 04:32, Thomas Munro wrote:
> On Thu, May 16, 2024 at 10:43 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>> Any chance you could test this version please Alexander?
> Sorry, cancel that. v3 is not good. I assume it fixes the GSSAPI
> thing and is superficially better, but it doesn't handle code that
> calls twice in a row and ignores the first result (I know that
> PostgreSQL does that occasionally in a few places), and it's also
> broken if someone gets recv() = 0 (EOF), and then decides to wait
> anyway. The only ways I can think of to get full reliable poll()-like
> semantics is to do that peek every time, OR the complicated patch
> (per-socket-workspace + intercepting recv etc). So I'm back to v2.

I've tested v2 and can confirm that it works as v1, `vcregress check` passes with no failures on REL_16_STABLE, `meson test` with the basic configuration too.

By the way, hamerkop is not configured to enable gssapi for HEAD [1] and I could not enable gss locally yet (just passing extra_lib_dirs, extra_include_dirs doesn't work for me). It looks like we need to find a way to enable it for meson to continue testing v17+ with GSS on Windows.

[1] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hamerkop&dt=2024-05-12%2011%3A00%3A28&stg=configure

Best regards,
Alexander
Thomas Munro <thomas.munro@gmail.com> writes:
> For citext_utf8, I pushed cff4e5a3. Hamerkop runs infrequently, so
> here's hoping for 100% green on master by Tuesday or so.

Meanwhile, back at the ranch, it doesn't seem that changed anything:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamerkop&dt=2024-05-16%2011%3A00%3A32

... and now that I look more closely, the reason why it didn't change anything is that hamerkop is still building 0294df2 on HEAD. All its other branches are equally stuck at the end of March. So this is a flat-out-broken animal, and I plan to just ignore it until its owner un-sticks it. (In particular, I think we shouldn't be in a hurry to push the patch discussed downthread.)

Andrew: maybe the buildfarm server could be made to flag animals building exceedingly old commits? This is the second problem of this sort that I've noticed this month, and you really have to look closely to realize it's happening.

regards, tom lane
On 2024-05-16 Th 16:18, Tom Lane wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
>> For citext_utf8, I pushed cff4e5a3. Hamerkop runs infrequently, so
>> here's hoping for 100% green on master by Tuesday or so.
> Meanwhile, back at the ranch, it doesn't seem that changed anything:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamerkop&dt=2024-05-16%2011%3A00%3A32
>
> ... and now that I look more closely, the reason why it didn't
> change anything is that hamerkop is still building 0294df2
> on HEAD. All its other branches are equally stuck at the
> end of March. So this is a flat-out-broken animal, and I
> plan to just ignore it until its owner un-sticks it.
> (In particular, I think we shouldn't be in a hurry to push
> the patch discussed downthread.)
>
> Andrew: maybe the buildfarm server could be made to flag
> animals building exceedingly old commits? This is the second
> problem of this sort that I've noticed this month, and you
> really have to look closely to realize it's happening.

Yeah, that should be doable. Since we have the git ref these days we should be able to mark it as old, or maybe just reject builds for very old commits (the latter would be easier).

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes:
> On 2024-05-16 Th 16:18, Tom Lane wrote:
>> Andrew: maybe the buildfarm server could be made to flag
>> animals building exceedingly old commits? This is the second
>> problem of this sort that I've noticed this month, and you
>> really have to look closely to realize it's happening.

> Yeah, that should be doable. Since we have the git ref these days we
> should be able to mark it as old, or maybe just reject builds for very
> old commits (the latter would be easier).

I'd rather have some visible status on the BF dashboard. Invariably, with a problem like this, the animal's owner is unaware there's a problem. If it's just silently not reporting, then no one else will notice either, and we effectively lose an animal (despite it still burning electricity to perform those rejected runs).

regards, tom lane
On 2024-05-16 Th 17:15, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> On 2024-05-16 Th 16:18, Tom Lane wrote:
>>> Andrew: maybe the buildfarm server could be made to flag
>>> animals building exceedingly old commits? This is the second
>>> problem of this sort that I've noticed this month, and you
>>> really have to look closely to realize it's happening.
>> Yeah, that should be doable. Since we have the git ref these days we
>> should be able to mark it as old, or maybe just reject builds for very
>> old commits (the latter would be easier).
> I'd rather have some visible status on the BF dashboard. Invariably,
> with a problem like this, the animal's owner is unaware there's a
> problem. If it's just silently not reporting, then no one else will
> notice either, and we effectively lose an animal (despite it still
> burning electricity to perform those rejected runs).

Fair enough. That will mean some database changes and other stuff, so it will take a bit longer.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes:
> On 2024-05-16 Th 17:15, Tom Lane wrote:
>> I'd rather have some visible status on the BF dashboard. Invariably,
>> with a problem like this, the animal's owner is unaware there's a
>> problem. If it's just silently not reporting, then no one else will
>> notice either, and we effectively lose an animal (despite it still
>> burning electricity to perform those rejected runs).

> Fair enough. That will mean some database changes and other stuff, so it
> will take a bit longer.

Sure, I don't think it's urgent.

regards, tom lane
On 2024-05-16 Th 17:34, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> On 2024-05-16 Th 17:15, Tom Lane wrote:
>>> I'd rather have some visible status on the BF dashboard. Invariably,
>>> with a problem like this, the animal's owner is unaware there's a
>>> problem. If it's just silently not reporting, then no one else will
>>> notice either, and we effectively lose an animal (despite it still
>>> burning electricity to perform those rejected runs).
>> Fair enough. That will mean some database changes and other stuff, so it
>> will take a bit longer.
> Sure, I don't think it's urgent.

I've pushed a small change that should just mark with an asterisk any gitref that is more than 2 days older than the tip of the branch at the time of reporting.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes:
> I've pushed a small change that should just mark with an asterisk any
> gitref that is more than 2 days older than the tip of the branch at the
> time of reporting.

Thanks!

regards, tom lane
On Fri, May 17, 2024 at 12:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> I've tested v2 and can confirm that it works as v1, `vcregress check`
> passes with no failures on REL_16_STABLE, `meson test` with the basic
> configuration too.

Pushed, including back-branches. This is all not very nice code and I hope we can delete it all some day. Ideas include:

(1) Thinking small: change over to the WAIT_USE_POLL implementation of latch.c on this OS (Windows has poll() these days), using a socket pair for latch wakeup (i.e. give up trying to multiplex with native Windows event handles, even though they are a great fit for our latch abstraction, as the sockets are too different from Unix).

(2) Thinking big: use native completion-based asynchronous socket APIs, as part of a much larger cross-platform AIO socket reengineering project that would deliver higher performance networking on all OSes.

The thought of (2) puts me off investing time into (1), but on the other hand it would be nice if Windows could almost completely share code with some Unixen. I may be more inclined to actually try it if/when we can rip out the fake signal support, because it is tangled up with this stuff and does not spark joy.
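To make idea (1) slightly more concrete, here is a very rough sketch of multiplexing a latch-wakeup socket and a client socket with WSAPoll(); the wakeup socket stands in for one end of the hypothetical loopback socket pair, and the WL_* values are local stand-ins for the real latch flags, so none of this is actual PostgreSQL code:

#ifndef _WIN32_WINNT
#define _WIN32_WINNT 0x0600     /* WSAPoll() needs Vista+ declarations */
#endif
#include <winsock2.h>

#define WL_LATCH_SET        (1 << 0)
#define WL_SOCKET_READABLE  (1 << 1)

int
wait_latch_or_socket(SOCKET wakeup_sock, SOCKET client_sock, int timeout_ms)
{
    WSAPOLLFD   fds[2];
    int         result = 0;

    fds[0].fd = wakeup_sock;
    fds[0].events = POLLRDNORM;
    fds[1].fd = client_sock;
    fds[1].events = POLLRDNORM;

    if (WSAPoll(fds, 2, timeout_ms) > 0)
    {
        /* hangups and errors count as "readable", as with poll() on Unix */
        if (fds[0].revents & (POLLRDNORM | POLLHUP | POLLERR))
            result |= WL_LATCH_SET;
        if (fds[1].revents & (POLLRDNORM | POLLHUP | POLLERR))
            result |= WL_SOCKET_READABLE;
    }
    return result;
}

The appeal would be that a graceful EOF shows up as a level-triggered readable condition each time you ask, rather than as a one-shot FD_CLOSE event that can be lost.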
Thomas Munro wrote 2024-05-12 06:31:
> Hamerkop is already green on the 15 and 16 branches, apparently
> because it's using the pre-meson test stuff that I guess just didn't
> run the relevant test. In other words, nobody would notice the
> difference anyway, and a master-only fix would be enough to end this
> 44-day red streak.

Sorry for necroposting, but in our automated testing system we have found some failures of this test. The most recent one was a couple of days ago (see attached files) on PostgreSQL 15.7.

Also, I reported this bug some time ago [1], but provided an example only for PostgreSQL 17. Back then the bug was actually found on the 15 or 16 branch (no logs remain from a couple of months back), but I wanted to show that it was reproducible on 17.

I would appreciate if you would backpatch this change to 15 and 16 branches.

[1] https://www.postgresql.org/message-id/6885a0b52d06f7e5910d2b6276bbb4e8%40postgrespro.ru

Oleg Tselebrovskiy, Postgres Pro
On Fri, Aug 2, 2024 at 1:37 AM Oleg Tselebrovskiy <o.tselebrovskiy@postgrespro.ru> wrote:
> I would appreciate if you would backpatch this change to 15 and 16
> branches.

Done (e52a44b8, 91f498fd).

Any elucidation on how and why Windows machines have started using UTF-8 would be welcome.
On Aug 1, 2024, at 18:54, Thomas Munro <thomas.munro@gmail.com> wrote:
> Done (e52a44b8, 91f498fd).
>
> Any elucidation on how and why Windows machines have started using
> UTF-8 would be welcome.

Haven’t been following this thread, but this post reminded me of an issue I saw with locales on Windows[1]. Could it be that the introduction of Universal CRT[2] in Windows 10 has improved UTF-8 support?

Bit of a wild guess, but I assume worth bringing up at least.

D

[1]: https://github.com/shogo82148/actions-setup-perl/issues/1713
[2]: https://learn.microsoft.com/en-us/cpp/porting/upgrade-your-code-to-the-universal-crt?view=msvc-170
On Sat, Aug 3, 2024 at 2:11 AM David E. Wheeler <david@justatheory.com> wrote:
> Haven’t been following this thread, but this post reminded me of an issue I saw with locales on Windows[1]. Could it be that the introduction of Universal CRT[2] in Windows 10 has improved UTF-8 support?

Yeah. We have a few places that claim that Windows APIs can't do UTF-8 and they have to do extra wchar_t conversions, but that doesn't seem to be true on modern Windows. Example:

https://github.com/postgres/postgres/blob/7926a9a80f6daf0fcc1feb1bee5c51fd001bc173/src/backend/utils/adt/pg_locale.c#L1814

I suspect that at least when the locale name is "en-US.UTF-8", the regular POSIXoid strcoll_l() function should just work™ and we could delete all that stuff and save Windows users a lot of wasted CPU cycles.
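One quick way to test that suspicion would be a standalone probe like the following (not PostgreSQL code): open an ".UTF-8" locale through the regular C runtime and collate UTF-8 byte strings directly, with no wchar_t round trip. It assumes a UCRT-based toolchain and a source file compiled as UTF-8, and uses setlocale()/strcoll() for brevity instead of the _l variants:

#include <locale.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
    const char *a = "côte";     /* UTF-8 byte strings */
    const char *b = "coté";

    if (setlocale(LC_COLLATE, "en-US.UTF-8") == NULL)
    {
        fprintf(stderr, "en-US.UTF-8 locale not available\n");
        return 1;
    }
    printf("strcoll(\"%s\", \"%s\") = %d\n", a, b, strcoll(a, b));
    return 0;
}

If that produces a sensible linguistic ordering, the extra wchar_t round trip in the linked code really is just wasted work on such systems.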