Rare SSL failures on eelpout - Mailing list pgsql-hackers
From | Thomas Munro |
---|---|
Subject | Rare SSL failures on eelpout |
Date | |
Msg-id | CAEepm=2n6Nv+5tFfe8YnkUm1fXgvxR0Mm1FoD+QKG-vLNGLyKg@mail.gmail.com Whole thread Raw |
Responses |
Re: Rare SSL failures on eelpout
|
List | pgsql-hackers |
(Moving this discussion from the pgsql-committers thread "pgsql: Update ssl test certificates and keys", which is innocent.) On Wed, Jan 16, 2019 at 10:37 AM Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Fri, Jan 4, 2019 at 10:08 AM Thomas Munro > <thomas.munro@enterprisedb.com> wrote: > > Thanks. FWIW I've just updated eelpout (a Debian testing BF animal > > that runs all the extra tests including SSL) to use libssl-dev > > (instead of libssl1.0-dev), and cleared its accache files. Let's see > > if that works... > > Since that upgrade (to libssl 1.1.1a-1), there are have been a few > intermittent failures in the SSL tests on that animal (thanks Tom for > pointing that out off-list). I reproduced this manually. From time to time (less than 1% of the times I tried), the server hangs up without sending a goodbye message, thereby causing a failure to match the expected psql error message. From the kernel's perspective on the psql side, a working-as-expected case looks like this: sendto(3, "\27\3\3\2\356\313\334\372\201\353\201h\204\353\240A\4\355\232\340\375\0X\220\326V\374\253\222i\2724"..., 1115, MSG_NOSIGNAL, NULL, 0) = 1115 ppoll([{fd=3, events=POLLOUT|POLLERR}], 1, NULL, NULL, 0) = 1 ([{fd=3, revents=POLLOUT}]) sendto(3, "\27\3\3\0f\335\313\224\263\256r\371\215\177\273,N\345\342\200\257\r\321\6\323@\316\241\316\17\204\26"..., 107, MSG_NOSIGNAL, NULL, 0) = 107 ppoll([{fd=3, events=POLLIN|POLLERR}], 1, NULL, NULL, 0) = 1 ([{fd=3, revents=POLLIN}]) recvfrom(3, "\27\3\3\0\23", 5, 0, NULL, NULL) = 5 recvfrom(3, "I\"\t;\3006\276\347\344]7>\2\234m\340]B\241", 19, 0, NULL, NULL) = 19 close(3) = 0 write(2, "psql: SSL error: sslv3 alert cer"..., 49psql: SSL error: sslv3 alert certificate revoked ) = 49 ... and the unexpected failure case looks like this: sendto(3, "\27\3\3\2\356/\254\307\277\342G?&BB\321\f\314\245\22\246U\337\263;\203f\302s\306\37\276"..., 1115, MSG_NOSIGNAL, NULL, 0) = 1115 ppoll([{fd=3, events=POLLOUT|POLLERR}], 1, NULL, NULL, 0) = 1 ([{fd=3, revents=POLLOUT|POLLERR|POLLHUP}]) sendto(3, "\27\3\3\0f\177\335\20\305\353\234\306\211#\343\321\336\22111J\312\323F\210\6U\331\264GAr"..., 107, MSG_NOSIGNAL, NULL, 0) = -1 ECONNRESET (Connection reset by peer) write(2, "psql: server closed the connecti"..., 199psql: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. could not send startup packet: Connection reset by peer ) = 199 To the kernel on the server side, a good case looks like this: [pid 13740] sendto(8, "\27\3\3\0\23\252\21\227\2\232s\354\21\243\236\207\301\3B\341\253\306k\346", 24, 0, NULL, 0) = 24 [pid 13740] write(2, "2019-01-22 09:23:04.425 UTC [137"..., 1112019-01-22 09:23:04.425 UTC [13740] [unknown] LOG: could not accept SSL connection: certificate verify failed ) = 111 [pid 13740] exit_group(0) = ? [pid 13740] +++ exited with 0 +++ I didn't manage to trace a bad case on the server side (maybe the strace interferes with the required timing), but I noticed that SSL_shutdown()'s return code (which we don't check) is always -1, and I noticed that SSL_get_error() after that is always SSL_ERROR_SSL. I wondered if we might somehow be reaching exit() when the alert response is still buffered inside OpenSSL, but that doesn't seem right -- the port has noblock=0 at that stage. Hmm. Why is psql doing two sendto() calls without reading a response in between, when it's possible for the server to exit after the first, anyway? Seems like a protocol violation somewhere? -- Thomas Munro http://www.enterprisedb.com
pgsql-hackers by date: