Re: conchuela timeouts since 2021-10-09 system upgrade - Mailing list pgsql-bugs
From: Tom Lane
Subject: Re: conchuela timeouts since 2021-10-09 system upgrade
Date:
Msg-id: 52098.1635228234@sss.pgh.pa.us
In response to: Re: conchuela timeouts since 2021-10-09 system upgrade (Noah Misch <noah@leadboat.com>)
Responses: Re: conchuela timeouts since 2021-10-09 system upgrade
List: pgsql-bugs
Noah Misch <noah@leadboat.com> writes:
> On Mon, Oct 25, 2021 at 04:59:42PM -0400, Tom Lane wrote:
>> What I think we should do in these two tests is nuke the use of
>> background_pgbench entirely; that looks like a solution in search
>> of a problem, and it seems unnecessary here.  Why not run
>> the DROP/CREATE/bt_index_check transaction as one of three script
>> options in the main pgbench run?

> The author tried that and got deadlocks:
> https://postgr.es/m/5E041A70-4946-489C-9B6D-764DF627A92D@yandex-team.ru

Hmm, I guess that's because two concurrent CICs can deadlock against
each other.  I wonder if we could fix that ... or maybe we could teach
pgbench that it mustn't launch more than one instance of that script?
Or more practically, use advisory locks in that script to enforce that
only one runs at once.
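Something along these lines might do (an untested sketch only; the lock key
42 is arbitrary, and the statements are deliberately not wrapped in a
transaction, since CREATE INDEX CONCURRENTLY can't run inside a transaction
block):

    -- take a session-level advisory lock so only one pgbench client
    -- runs the DROP/CREATE/check sequence at a time
    SELECT pg_advisory_lock(42);
    DROP INDEX CONCURRENTLY idx;
    CREATE INDEX CONCURRENTLY idx ON tbl(i);
    SELECT bt_index_check('idx', true);
    SELECT pg_advisory_unlock(42);

With a session-level lock, any other client that picks this script simply
blocks in pg_advisory_lock() until the current run finishes, so two CICs
should never be in flight at once.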
> On prairiedog, the proximate trouble is pgbench getting stuck.  IPC::Run is
> behaving normally given a stuck pgbench.  When pgbench stops sending queries,
> does pg_stat_activity show anything at all running?  If so, are those backends
> waiting on locks?  If not, what's the pgbench stack trace at that time?

The connected backend is idle, waiting for client input:

#0  0x9002ec88 in kevent ()
#1  0x003e3fa8 in WaitEventSetWait (set=0x3802258, timeout=-1, occurred_events=0xbfffe098, nevents=1, wait_event_info=0) at latch.c:1601
#2  0x0028d6a0 in secure_read (port=0x2501f40, ptr=0x7ad1d0, len=8192) at be-secure.c:186
#3  0x002958a4 in pq_recvbuf () at pqcomm.c:957
#4  0x00295994 in pq_getbyte () at pqcomm.c:1000
#5  0x004196a0 in PostgresMain

The last few entries in the postmaster log are:

2021-10-26 01:19:31.122 EDT [1013] 002_cic.pl LOG:  statement: DROP INDEX CONCURRENTLY idx;
2021-10-26 01:19:31.136 EDT [1013] 002_cic.pl LOG:  statement: CREATE INDEX CONCURRENTLY idx ON tbl(i);
2021-10-26 01:19:31.167 EDT [1013] 002_cic.pl LOG:  statement: SELECT bt_index_check('idx',true);

which is consistent with pg_stat_activity:

-[ RECORD 1 ]----+-----------------------------------
datid            | 13019
datname          | postgres
pid              | 1013
leader_pid       |
usesysid         | 10
usename          | tgl
application_name | 002_cic.pl
client_addr      |
client_hostname  |
client_port      | -1
backend_start    | 2021-10-26 01:19:18.49282-04
xact_start       |
query_start      | 2021-10-26 01:19:31.167508-04
state_change     | 2021-10-26 01:19:31.190256-04
wait_event_type  | Client
wait_event       | ClientRead
state            | idle
backend_xid      |
backend_xmin     |
query_id         |
query            | SELECT bt_index_check('idx',true);
backend_type     | client backend

The pgbench process is sitting here:

#0  0x9001f488 in select ()
#1  0x0000758c in wait_on_socket_set (sa=0x300bb0, usecs=0) at pgbench.c:7098
#2  0x0000d514 in threadRun (arg=0x300420) at pgbench.c:6776
#3  0x00010c44 in main (argc=-1635084506, argv=0x0) at pgbench.c:6540

which means it thinks it's waiting for input.  The PGconn object is in
this state:

  status = CONNECTION_OK,
  asyncStatus = PGASYNC_BUSY,
  xactStatus = PQTRANS_IDLE,
  ...
  inBuffer = 0x6a000 "T",
  inBufSize = 16384,
  inStart = 0,
  inCursor = 0,
  inEnd = 0,
  outBuffer = 0x6e000 "Q",
  outBufSize = 16384,
  outCount = 0,
  outMsgStart = 1,
  outMsgEnd = 34,

The inBuffer is logically empty, but its contents look like the last input
was the result of the "SELECT bt_index_check" command, consistent with the
postmaster log.  The outBuffer is likewise logically empty, but what it
contains appears to be the next DROP INDEX command:

(gdb) x/64c state->con->outBuffer
0x6e000:  81 'Q'   0 '\0'   0 '\0'   0 '\0'   33 '!'   68 'D'   82 'R'   79 'O'
0x6e008:  80 'P'   32 ' '   73 'I'   78 'N'   68 'D'   69 'E'   88 'X'   32 ' '
0x6e010:  67 'C'   79 'O'   78 'N'   67 'C'   85 'U'   82 'R'   82 'R'   69 'E'
0x6e018:  78 'N'   84 'T'   76 'L'   89 'Y'   32 ' '   105 'i'  100 'd'  120 'x'
0x6e020:  59 ';'   0 '\0'   114 'r'  117 'u'  101 'e'  41 ')'   59 ';'   0 '\0'
0x6e028:  108 'l'  40 '('   105 'i'  41 ')'   59 ';'   0 '\0'   95 '_'   110 'n'
0x6e030:  97 'a'   109 'm'  101 'e'  0 '\0'   48 '0'   48 '0'   50 '2'   95 '_'
0x6e038:  99 'c'   105 'i'  99 'c'   46 '.'   112 'p'  108 'l'  0 '\0'   0 '\0'

So what we have is that libpq thinks it's sent the next DROP INDEX, but
the backend hasn't seen it.  It's fairly hard to blame that state of
affairs on the IPC::Run harness.

I'm wondering if we might be looking at some timing-dependent corner-case
bug in the new libpq pipelining code.  Pipelining isn't enabled:

  pipelineStatus = PQ_PIPELINE_OFF,

but that doesn't mean that the pipelining code hasn't been anywhere near
this command.  I can see

  cmd_queue_head = 0x300d40,
  cmd_queue_tail = 0x300d40,
  cmd_queue_recycle = 0x0,

(gdb) p *state->con->cmd_queue_head
$4 = {
  queryclass = PGQUERY_SIMPLE,
  query = 0x3004e0 "DROP INDEX CONCURRENTLY idx;",
  next = 0x0
}

The trouble with this theory, of course, is "if libpq is busted, why is
only this test case showing it?".  But AFAICS it would take some pretty
spooky action-at-a-distance for the Perl harness to have caused this.

			regards, tom lane