
Re: conchuela timeouts since 2021-10-09 system upgrade - Mailing list pgsql-bugs

From	Noah Misch
Subject	Re: conchuela timeouts since 2021-10-09 system upgrade
Date	October 26, 2021 13:45:00
Msg-id	20211026134500.GA128912@rfd.leadboat.com
In response to	Re: conchuela timeouts since 2021-10-09 system upgrade (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: conchuela timeouts since 2021-10-09 system upgrade Re: conchuela timeouts since 2021-10-09 system upgrade
List	pgsql-bugs


On Tue, Oct 26, 2021 at 02:03:54AM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > On Mon, Oct 25, 2021 at 04:59:42PM -0400, Tom Lane wrote:
> >> What I think we should do in these two tests is nuke the use of
> >> background_pgbench entirely; that looks like a solution in search
> >> of a problem, and it seems unnecessary here.  Why not run
> >> the DROP/CREATE/bt_index_check transaction as one of three script
> >> options in the main pgbench run?
> 
> > The author tried that and got deadlocks:
> > https://postgr.es/m/5E041A70-4946-489C-9B6D-764DF627A92D@yandex-team.ru
> 
> Hmm, I guess that's because two concurrent CICs can deadlock against each
> other.  I wonder if we could fix that ... or maybe we could teach pgbench
> that it mustn't launch more than one instance of that script?

Both sound doable, but I don't expect either to fix prairiedog's trouble.

> Or more
> practically, use advisory locks in that script to enforce that only one
> runs at once.

The author did try that.

> So what we have is that libpq thinks it's sent the next DROP INDEX,
> but the backend hasn't seen it.

Thanks for isolating that.

> It's fairly hard to blame that state of affairs on the IPC::Run harness.
> I'm wondering if we might be looking at some timing-dependent corner-case
> bug in the new libpq pipelining code.  Pipelining isn't enabled:
> 
>   pipelineStatus = PQ_PIPELINE_OFF, 
> 
> but that doesn't mean that the pipelining code hasn't been anywhere
> near this command.  I can see
> 
>   cmd_queue_head = 0x300d40, 
>   cmd_queue_tail = 0x300d40, 
>   cmd_queue_recycle = 0x0, 
> 
> (gdb) p *state->con->cmd_queue_head
> $4 = {
>   queryclass = PGQUERY_SIMPLE, 
>   query = 0x3004e0 "DROP INDEX CONCURRENTLY idx;", 
>   next = 0x0
> }
> 
> The trouble with this theory, of course, is "if libpq is busted, why is
> only this test case showing it?".

Agreed, it's not clear how the new tests would reveal a libpq bug that
src/bin/pgbench/t/001_pgbench_with_server.pl has been unable to reveal.  Does
the problem reproduce on v13?

Grasping at straws, background_pgbench does differ by specifying stdin as a
ref to an empty scalar.  I think that makes IPC::Run open a pipe and never
write to it.  The older pgbench tests don't override stdin, so I think that
makes pgbench inherit the original stdin.  Given your pgbench stack trace,
this seems awfully unlikely to be the relevant difference.  If we run out of
ideas, you could try some runs with that difference removed:

--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2110,7 +2110,7 @@ sub background_pgbench
     # IPC::Run would otherwise append to existing contents:
     $$stdout = "" if ref($stdout);
 
-    my $harness = IPC::Run::start \@cmd, '<', \$stdin, '>', $stdout, '2>&1',
+    my $harness = IPC::Run::start \@cmd, '>', $stdout, '2>&1',
       $timer;
 
     return $harness;
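
To make the stdin difference concrete outside of IPC::Run, here is a minimal sketch in Python (an analogy only, not the Perl harness itself). It assumes a POSIX system; the helper name stdin_kind and the child one-liner are hypothetical, chosen for illustration. The child reports whether its stdin is a pipe. With stdin overridden, the parent opens a pipe and never writes to it, which is what I believe the '<', \$stdin form causes; without the override, the child inherits whatever stdin the parent has.

```python
import subprocess
import sys

# Child program: report whether file descriptor 0 (stdin) is a FIFO/pipe.
CHILD = (
    "import os, stat; "
    "print('fifo' if stat.S_ISFIFO(os.fstat(0).st_mode) else 'other')"
)


def stdin_kind(override_stdin: bool) -> str:
    """Run the child and return how it sees its own stdin.

    override_stdin=True  -> parent creates a pipe for the child's stdin and
                            never writes to it (analogous to '<', \\$stdin).
    override_stdin=False -> child inherits the parent's stdin unchanged.
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdin=subprocess.PIPE if override_stdin else None,
        stdout=subprocess.PIPE,
        text=True,
    )
    out = proc.stdout.read()  # child never reads stdin, so this cannot block on it
    if proc.stdin is not None:
        proc.stdin.close()    # close the unused write end only after the child is done
    proc.wait()
    return out.strip()


if __name__ == "__main__":
    print("overridden stdin:", stdin_kind(True))
    print("inherited stdin: ", stdin_kind(False))
```

The second line of output depends on how the script itself was launched (terminal, file, or pipe), which is the point: only the overridden case is under the harness's control.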

> But AFAICS it would take some pretty
> spooky action-at-a-distance for the Perl harness to have caused this.

Agreed.  We'll have to consider the harness innocent for the moment.
