Re: conchuela timeouts since 2021-10-09 system upgrade - Mailing list pgsql-bugs
From | Noah Misch |
---|---|
Subject | Re: conchuela timeouts since 2021-10-09 system upgrade |
Date | |
Msg-id | 20211026134500.GA128912@rfd.leadboat.com |
In response to | Re: conchuela timeouts since 2021-10-09 system upgrade (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: conchuela timeouts since 2021-10-09 system upgrade
Re: conchuela timeouts since 2021-10-09 system upgrade |
List | pgsql-bugs |
On Tue, Oct 26, 2021 at 02:03:54AM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > On Mon, Oct 25, 2021 at 04:59:42PM -0400, Tom Lane wrote:
> >> What I think we should do in these two tests is nuke the use of
> >> background_pgbench entirely; that looks like a solution in search
> >> of a problem, and it seems unnecessary here. Why not run
> >> the DROP/CREATE/bt_index_check transaction as one of three script
> >> options in the main pgbench run?
>
> > The author tried that and got deadlocks:
> > https://postgr.es/m/5E041A70-4946-489C-9B6D-764DF627A92D@yandex-team.ru
>
> Hmm, I guess that's because two concurrent CICs can deadlock against each
> other. I wonder if we could fix that ... or maybe we could teach pgbench
> that it mustn't launch more than one instance of that script?

Both sound doable, but I don't expect either to fix prairiedog's trouble.

> Or more
> practically, use advisory locks in that script to enforce that only one
> runs at once.

The author did try that.

> So what we have is that libpq thinks it's sent the next DROP INDEX,
> but the backend hasn't seen it.

Thanks for isolating that.

> It's fairly hard to blame that state of affairs on the IPC::Run harness.
> I'm wondering if we might be looking at some timing-dependent corner-case
> bug in the new libpq pipelining code. Pipelining isn't enabled:
>
>   pipelineStatus = PQ_PIPELINE_OFF,
>
> but that doesn't mean that the pipelining code hasn't been anywhere
> near this command. I can see
>
>   cmd_queue_head = 0x300d40,
>   cmd_queue_tail = 0x300d40,
>   cmd_queue_recycle = 0x0,
>
> (gdb) p *state->con->cmd_queue_head
> $4 = {
>   queryclass = PGQUERY_SIMPLE,
>   query = 0x3004e0 "DROP INDEX CONCURRENTLY idx;",
>   next = 0x0
> }
>
> The trouble with this theory, of course, is "if libpq is busted, why is
> only this test case showing it?".

Agreed, it's not clear how the new tests would reveal a libpq bug that
src/bin/pgbench/t/001_pgbench_with_server.pl has been unable to reveal. Does
the problem reproduce on v13?

Grasping at straws, background_pgbench does differ by specifying stdin as a ref
to an empty scalar. I think that makes IPC::Run open a pipe and never write to
it. The older pgbench tests don't override stdin, so I think that makes pgbench
inherit the original stdin. Given your pgbench stack trace, this seems awfully
unlikely to be the relevant difference. If we run out of ideas, you could try
some runs with that difference removed:

--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2110,7 +2110,7 @@ sub background_pgbench
 	# IPC::Run would otherwise append to existing contents:
 	$$stdout = "" if ref($stdout);

-	my $harness = IPC::Run::start \@cmd, '<', \$stdin, '>', $stdout, '2>&1',
+	my $harness = IPC::Run::start \@cmd, '>', $stdout, '2>&1',
 	  $timer;

 	return $harness;

> But AFAICS it would take some pretty
> spooky action-at-a-distance for the Perl harness to have caused this.

Agreed. We'll have to consider the harness innocent for the moment.
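
For anyone who wants to see the stdin difference in isolation, here is a minimal sketch in Python rather than Perl. It is only an analogy for the two IPC::Run invocations in the patch, not a model of IPC::Run itself. The helper name child_stdin_is_pipe and the child one-liner are made up for illustration, and the fstat check assumes a POSIX system. It starts a child process and asks it whether its fd 0 is a pipe.

```python
import subprocess
import sys

# Hypothetical child program: reports whether its own stdin (fd 0) is a FIFO.
CHILD = "import os, stat; print(stat.S_ISFIFO(os.fstat(0).st_mode))"


def child_stdin_is_pipe(stdin):
    """Start a child with the given stdin setting and return True if the
    child sees a pipe on fd 0.

    stdin=subprocess.PIPE    roughly mirrors handing the child a fresh pipe
                             that nothing is ever written to.
    stdin=None               lets the child inherit the parent's stdin.
    stdin=subprocess.DEVNULL gives the child /dev/null, which is not a pipe.
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdin=stdin,
        stdout=subprocess.PIPE,
        text=True,
    )
    # With no input supplied, communicate() closes our end of the stdin pipe
    # (if there is one) and then collects the child's output.
    out, _ = proc.communicate()
    return out.strip() == "True"


if __name__ == "__main__":
    print("fresh pipe:", child_stdin_is_pipe(subprocess.PIPE))
    # The inherited case depends on how this script itself was launched.
    print("inherited :", child_stdin_is_pipe(None))
```

Run from an interactive terminal, the inherited case reports whatever the terminal provides, so the second result depends on how the script was started.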