Thread: Recent 027_streaming_regress.pl hangs
Hi,

Several animals are timing out while waiting for catchup, sporadically. I don't know why. The oldest example I have found so far by clicking around is:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-02-23%2015%3A44%3A35

So perhaps something committed ~3 weeks ago triggered this. There are many examples since, showing as recoveryCheck failures. Apparently they are all on animals wrangled by Andres. Hmm. I think some/all share a physical host, they seem to have quite high run time variance, and they're using meson.
On Wed, Mar 13, 2024 at 10:53 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-02-23%2015%3A44%3A35

Assuming it is due to a commit in master, and given the failure frequency, I think it is very likely to be a change from this 3 day window of commits, and more likely in the top half dozen or so:

d360e3cc60e Fix compiler warning on typedef redeclaration
8af25652489 Introduce a new smgr bulk loading facility.
e612384fc78 Fix mistake in SQL features list
d13ff82319c Fix BF failure in commit 93db6cbda0.
efa70c15c74 Make GetSlotInvalidationCause() return RS_INVAL_NONE on unexpected input
93db6cbda03 Add a new slot sync worker to synchronize logical slots.
3d47b75546d pgindent fix
b6df0798a5e Fix the intermittent buildfarm failures in 031_column_list.
fbc93b8b5f5 Remove custom Constraint node read/write implementations
801792e528d Improve ERROR/LOG messages added by commits ddd5f4f54a and 7a424ece48.
011d60c4352 Speed up uuid_out() by not relying on a StringInfo
943f7ae1c86 Add lookup table for replication slot conflict reasons
28f3915b73f Remove superfluous 'pgprocno' field from PGPROC
4989ce72644 MERGE ... DO NOTHING: require SELECT privileges
ed345c2728b Fix typo
690805ca754 doc: Fix link to pg_ident_file_mappings view
ff9e1e764fc Add option force_initdb to PostgreSQL::Test::Cluster:init()
75bcba6cbd2 Remove extra check_stack_depth() from dropconstraint_internal()
fcd210d496d Doc: improve explanation of type interval, especially extract().
489072ab7a9 Replace relids in lateral subquery parse tree during SJE
74563f6b902 Revert "Improve compression and storage support with inheritance"
d2ca9a50b5b Minor corrections for partition pruning
818fefd8fd4 Fix race leading to incorrect conflict cause in InvalidatePossiblyObsoleteSlot()
01ec4d89b91 doc: Use system-username instead of system-user

I just haven't got a specific theory yet, as the logs are empty. I wonder if some kind of failures could start firing signals around to get us a stack.
On Thu, Mar 14, 2024 at 03:00:28PM +1300, Thomas Munro wrote:
> Assuming it is due to a commit in master, and given the failure
> frequency, I think it is very likely to be a change from this 3 day
> window of commits, and more likely in the top half dozen or so:
>
> d360e3cc60e Fix compiler warning on typedef redeclaration
> 8af25652489 Introduce a new smgr bulk loading facility.
> e612384fc78 Fix mistake in SQL features list
> d13ff82319c Fix BF failure in commit 93db6cbda0.
> efa70c15c74 Make GetSlotInvalidationCause() return RS_INVAL_NONE on unexpected input
> 93db6cbda03 Add a new slot sync worker to synchronize logical slots.
> 3d47b75546d pgindent fix
> b6df0798a5e Fix the intermittent buildfarm failures in 031_column_list.
> fbc93b8b5f5 Remove custom Constraint node read/write implementations
> 801792e528d Improve ERROR/LOG messages added by commits ddd5f4f54a and 7a424ece48.
> 011d60c4352 Speed up uuid_out() by not relying on a StringInfo
> 943f7ae1c86 Add lookup table for replication slot conflict reasons
> 28f3915b73f Remove superfluous 'pgprocno' field from PGPROC
> 4989ce72644 MERGE ... DO NOTHING: require SELECT privileges
> ed345c2728b Fix typo
> 690805ca754 doc: Fix link to pg_ident_file_mappings view
> ff9e1e764fc Add option force_initdb to PostgreSQL::Test::Cluster:init()
> 75bcba6cbd2 Remove extra check_stack_depth() from dropconstraint_internal()
> fcd210d496d Doc: improve explanation of type interval, especially extract().
> 489072ab7a9 Replace relids in lateral subquery parse tree during SJE
> 74563f6b902 Revert "Improve compression and storage support with inheritance"
> d2ca9a50b5b Minor corrections for partition pruning
> 818fefd8fd4 Fix race leading to incorrect conflict cause in InvalidatePossiblyObsoleteSlot()
> 01ec4d89b91 doc: Use system-username instead of system-user
>
> I just haven't got a specific theory yet, as the logs are empty. I
> wonder if some kind of failures could start firing signals around to
> get us a stack.

Thanks for providing this list and an analysis. Hmm. Perhaps 8af25652489? That looks like the closest thing in the list that could have played with the way WAL is generated, hence potentially impacting the records that are replayed. 93db6cbda03, efa70c15c74 and 818fefd8fd4 came to mind, but they touch unrelated territory.
--
Michael
On Thu, Mar 14, 2024 at 3:27 PM Michael Paquier <michael@paquier.xyz> wrote: > Hmm. Perhaps 8af25652489? That looks like the closest thing in the > list that could have played with the way WAL is generated, hence > potentially impacting the records that are replayed. Yeah, I was wondering if its checkpoint delaying logic might have got the checkpointer jammed or something like that, but I don't currently see how. Yeah, the replay of bulk newpages could be relevant, but it's not exactly new technology. One thing I wondered about is whether the Perl "wait for catchup" thing, which generates large volumes of useless log, could be somehow changed to actually show the progress after some time. Something like "I'm still waiting for this replica to reach LSN X, but it has so far only reported LSN Y, and here's a dump of the WAL around there"? > 93db6cbda03, efa70c15c74 and 818fefd8fd4 came to mind, but they touch > unrelated territory. Hmm.
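To make the kind of progress report Thomas describes a bit more concrete, a minimal sketch might look like the following. It assumes the TAP framework's PostgreSQL::Test::Cluster and Test::More are available; the helper name report_catchup_progress and the pg_waldump invocation are purely illustrative, not existing test infrastructure:

use strict;
use warnings;
use Test::More;

# Hypothetical helper: say how far the standby has actually got relative
# to the LSN we are waiting for, and dump a few WAL records around the
# apparent stall point for later analysis.
sub report_catchup_progress
{
	my ($standby, $target_lsn) = @_;

	my $replay_lsn = $standby->safe_psql('postgres',
		'SELECT pg_last_wal_replay_lsn()');
	diag "still waiting: target LSN $target_lsn, standby has only replayed $replay_lsn";

	# Dump a handful of records around the point where replay seems stuck.
	my $waldir = $standby->data_dir . '/pg_wal';
	system('pg_waldump', '-p', $waldir, '-s', $replay_lsn, '-n', '20') == 0
	  or diag "pg_waldump exited with status $?";
	return;
}

Something along these lines, called every 30 seconds or so from inside the wait loop, would at least tell us whether the standby is crawling forward or completely stuck.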
Hello Thomas and Michael,

14.03.2024 06:16, Thomas Munro wrote:
> Yeah, I was wondering if its checkpoint delaying logic might have
> got the checkpointer jammed or something like that, but I don't
> currently see how. Yeah, the replay of bulk newpages could be
> relevant, but it's not exactly new technology. One thing I wondered
> about is whether the Perl "wait for catchup" thing, which generates
> large volumes of useless log, could be somehow changed to actually
> show the progress after some time. Something like "I'm still waiting
> for this replica to reach LSN X, but it has so far only reported LSN
> Y, and here's a dump of the WAL around there"?

I have perhaps reproduced the issue here (at least I'm seeing something similar) and am going to investigate it in the coming days, but what confuses me now is the duration of poll_query_until.

For the failure you referenced:
[15:55:54.740](418.725s) # poll_query_until timed out executing this query:

And a couple of others:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-03-08%2000%3A34%3A06
[00:45:57.747](376.159s) # poll_query_until timed out executing this query:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-03-04%2016%3A32%3A17
[16:45:24.870](407.970s) # poll_query_until timed out executing this query:

Could it be that the timeout (360 sec?) is just not enough for the test under the current (changed due to the switch to meson) conditions?

Best regards,
Alexander
On Fri, Mar 15, 2024 at 7:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> Could it be that the timeout (360 sec?) is just not enough for the test
> under the current (changed due to switch to meson) conditions?

Hmm, well it looks like he switched over to meson around 42 days ago, 2024-02-01, looking at "calliphoridae" (skink has the extra complication of valgrind, so let's look at a more 'normal' animal instead). The first failure that looks like that on calliphoridae is 19 days ago, 2024-02-23, and after that it's happening every 3 days, sometimes in clusters.

https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=calliphoridae&br=HEAD

But you're right that under meson the test takes a lot longer, I guess due to increased concurrency:

287/287 postgresql:recovery / recovery/027_stream_regress OK 684.50s 6 subtests passed

With make we don't have an individual time per script, but for all of the recovery tests we had, for example:

t/027_stream_regress.pl ............... ok
All tests successful.
Files=39, Tests=542, 65 wallclock secs ( 0.26 usr 0.06 sys + 20.16 cusr 31.65 csys = 52.13 CPU)
Thomas Munro <thomas.munro@gmail.com> writes: > On Fri, Mar 15, 2024 at 7:00 AM Alexander Lakhin <exclusion@gmail.com> wrote: >> Could it be that the timeout (360 sec?) is just not enough for the test >> under the current (changed due to switch to meson) conditions? > But you're right that under meson the test takes a lot longer, I guess > due to increased concurrency: What it seems to be is highly variable. Looking at calliphoridae's last half dozen successful runs, I see reported times for 027_stream_regress anywhere from 183 to 704 seconds. I wonder what else is happening on that machine. Also, this is probably not helping anything: 'extra_config' => { ... 'fsync = on' I would suggest turning that off and raising wait_timeout a good deal, and then we'll see if calliphoridae gets any more stable. regards, tom lane
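For what it's worth, the sort of change Tom suggests would presumably look something like this in the animal's build-farm.conf. The key names follow the usual buildfarm client conventions, and mapping "wait_timeout" onto PG_TEST_TIMEOUT_DEFAULT is my assumption about which knob is meant, so treat this only as a sketch:

# Hypothetical build-farm.conf fragment, not copied from any real animal.
extra_config => {
	DEFAULT => [
		'fsync = off',    # instead of the fsync = on override quoted above
	],
},
build_env => {
	# Give the TAP tests more headroom on a heavily loaded host;
	# PostgreSQL::Test::Utils derives its timeouts from this (default 180).
	PG_TEST_TIMEOUT_DEFAULT => '360',
},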
14.03.2024 23:56, Tom Lane wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
>> On Fri, Mar 15, 2024 at 7:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
>>> Could it be that the timeout (360 sec?) is just not enough for the test
>>> under the current (changed due to switch to meson) conditions?
>> But you're right that under meson the test takes a lot longer, I guess
>> due to increased concurrency:
> What it seems to be is highly variable. Looking at calliphoridae's
> last half dozen successful runs, I see reported times for
> 027_stream_regress anywhere from 183 to 704 seconds. I wonder what
> else is happening on that machine. Also, this is probably not
> helping anything:
>
> 'extra_config' => {
> ...
> 'fsync = on'
>
> I would suggest turning that off and raising wait_timeout a good
> deal, and then we'll see if calliphoridae gets any more stable.

I could reproduce similar failures with PG_TEST_EXTRA=wal_consistency_checking only, running 5 tests in parallel on a slowed-down VM, so that the test duration increased to ~1900 seconds, but perhaps that buildfarm machine has a different bottleneck (I/O?) or its concurrent workload is not as uniform as in my experiments.

Meanwhile, I've analyzed failed test logs from the buildfarm and calculated the percentage of WAL replayed before the timeout. For instance, one of the failures:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-18%2022%3A36%3A40

standby_1.log:
2024-03-18 22:38:22.743 UTC [2010896][walreceiver][:0] LOG: started streaming WAL from primary at 0/3000000 on timeline 1
...
2024-03-18 22:50:02.439 UTC [2004203][checkpointer][:0] LOG: recovery restart point at 0/E00E030
2024-03-18 22:50:02.439 UTC [2004203][checkpointer][:0] DETAIL: Last completed transaction was at log time 2024-03-18 22:41:26.647756+00.
2024-03-18 22:50:12.841 UTC [2010896][walreceiver][:0] FATAL: could not receive data from WAL stream: server closed the connection unexpectedly

primary.log:
2024-03-18 22:38:23.754 UTC [2012240][client backend][3/3:0] LOG: statement: GRANT ALL ON SCHEMA public TO public;
# ^^^ One of the first records produced by `make check`
...
2024-03-18 22:41:26.647 UTC [2174047][client backend][10/752:0] LOG: statement: ALTER VIEW my_property_secure SET (security_barrier=false);
# ^^^ A record near the last completed transaction on standby
...
2024-03-18 22:44:13.226 UTC [2305844][client backend][22/3784:0] LOG: statement: DROP TABLESPACE regress_tblspace_renamed;
# ^^^ One of the last records produced by `make check`

\set t0 '22:38:23.754'
\set t1 '22:44:13.226'
\set tf '22:41:26.647756'
select extract(epoch from (:'tf'::time - :'t0'::time)) / extract(epoch from (:'t1'::time - :'t0'::time));
~52%

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-18%2018%3A58%3A58
~48%
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-18%2016%3A41%3A13
~43%
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-18%2015%3A47%3A09
~36%
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-15%2011%3A24%3A38
~87%
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-03-17%2021%3A55%3A41
~36%

So it still looks like a performance-related issue to me. And yes, fsync = off -> on greatly increases (~3x) the overall test duration in my environment.

Best regards,
Alexander
Hi,

On 2024-03-14 16:56:39 -0400, Tom Lane wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > On Fri, Mar 15, 2024 at 7:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> >> Could it be that the timeout (360 sec?) is just not enough for the test
> >> under the current (changed due to switch to meson) conditions?
>
> > But you're right that under meson the test takes a lot longer, I guess
> > due to increased concurrency:
>
> What it seems to be is highly variable. Looking at calliphoridae's
> last half dozen successful runs, I see reported times for
> 027_stream_regress anywhere from 183 to 704 seconds. I wonder what
> else is happening on that machine.

There's a lot of other animals on the same machine, however it's rarely fully loaded (with either CPU or IO).

I don't think the test just being slow is the issue here; e.g. in the last failing iteration
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-20%2022%3A03%3A15
the tests completed

2024-03-20 22:07:50.239 UTC [3937667][client backend][22/3255:0] LOG: statement: DROP ROLE regress_tablespace_user2;
2024-03-20 22:07:50.251 UTC [3937667][client backend][:0] LOG: disconnection: session time: 0:00:12.431 user=bf database=regression host=[local]

and we waited to replicate for quite a while:

2024-03-20 22:14:01.904 UTC [56343][client backend][6/1925:0] LOG: connection authorized: user=bf database=postgres application_name=027_stream_regress.pl
2024-03-20 22:14:01.930 UTC [56343][client backend][6/1926:0] LOG: statement: SELECT '0/15BA21B0' <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name IN ('standby_1', 'walreceiver')
2024-03-20 22:14:01.958 UTC [56343][client backend][:0] LOG: disconnection: session time: 0:00:00.063 user=bf database=postgres host=[local]
2024-03-20 22:14:02.083 UTC [3729516][postmaster][:0] LOG: received immediate shutdown request
2024-03-20 22:14:04.970 UTC [3729516][postmaster][:0] LOG: database system is shut down

There was no activity for 7 minutes.

System statistics show relatively low CPU and IO load for the period from 22:00 - 22:10.

I suspect we have some more fundamental instability on our hands; there have been failures like this going back a while, and on various machines.

I think at the very least we should make Cluster.pm's wait_for_catchup() print some information when it times out - right now it's nigh on undebuggable, because we don't even log what we were waiting for and what the actual replication position was.

> Also, this is probably not
> helping anything:
>
> 'extra_config' => {
> ...
> 'fsync = on'

At some point we had practically no test coverage of fsync, so I made my animals use fsync. I think we still have little coverage. I probably could reduce the number of animals using it though.

Greetings,

Andres Freund
Hi,

On 2024-03-20 17:41:47 -0700, Andres Freund wrote:
> There's a lot of other animals on the same machine, however it's rarely fully
> loaded (with either CPU or IO).
>
> I don't think the test just being slow is the issue here, e.g. in the last
> failing iteration
>
> [...]
>
> I suspect we have some more fundamental instability on our hands, there have
> been failures like this going back a while, and on various machines.

I'm somewhat confused by the timestamps in the log:

[22:07:50.263](223.929s) ok 2 - regression tests pass
...
[22:14:02.051](371.788s) # poll_query_until timed out executing this query:

I read this as 371.788s having passed between the messages, which of course is much higher than PostgreSQL::Test::Utils::timeout_default=180.

Ah. The way that poll_query_until() implements timeouts seems decidedly suboptimal. If a psql invocation, including query processing, takes any appreciable amount of time, poll_query_until() waits much longer than it should, because it very naively determines a number of waits ahead of time:

my $max_attempts = 10 * $PostgreSQL::Test::Utils::timeout_default;
my $attempts = 0;
while ($attempts < $max_attempts)
{
	...
	# Wait 0.1 second before retrying.
	usleep(100_000);

	$attempts++;
}

Ick. What's worse is that if the query takes too long, the timeout AFAICT never takes effect.

Greetings,

Andres Freund
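A deadline-based variant of that loop, which is roughly what Andres is arguing for, could be sketched like this (illustrative only, not the code that was later committed):

use strict;
use warnings;
use Time::HiRes qw(time usleep);

# Sketch: retry $check until it succeeds, but bound the loop by elapsed
# wall-clock time rather than by a precomputed number of attempts, so a
# slow psql invocation cannot stretch the effective timeout.
sub poll_until_deadline
{
	my ($check, $timeout_secs) = @_;
	my $deadline = time() + $timeout_secs;

	while (time() < $deadline)
	{
		return 1 if $check->();
		usleep(100_000);    # wait 0.1 second before retrying
	}
	return 0;               # a genuine wall-clock timeout
}

A caller would pass something like sub { $node->safe_psql('postgres', $query) eq $expected } and $PostgreSQL::Test::Utils::timeout_default. The remaining hole Andres points out - a single psql invocation that itself hangs - would still need a statement_timeout or an external watchdog.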
Hi,

On 2024-03-20 17:41:45 -0700, Andres Freund wrote:
> 2024-03-20 22:14:01.904 UTC [56343][client backend][6/1925:0] LOG: connection authorized: user=bf database=postgres application_name=027_stream_regress.pl
> 2024-03-20 22:14:01.930 UTC [56343][client backend][6/1926:0] LOG: statement: SELECT '0/15BA21B0' <= replay_lsn AND state = 'streaming'
> FROM pg_catalog.pg_stat_replication
> WHERE application_name IN ('standby_1', 'walreceiver')
> 2024-03-20 22:14:01.958 UTC [56343][client backend][:0] LOG: disconnection: session time: 0:00:00.063 user=bf database=postgres host=[local]
> 2024-03-20 22:14:02.083 UTC [3729516][postmaster][:0] LOG: received immediate shutdown request
> 2024-03-20 22:14:04.970 UTC [3729516][postmaster][:0] LOG: database system is shut down
>
> There was no activity for 7 minutes.
>
> System statistics show relatively low CPU and IO load for the period from
> 22:00 - 22:10.
>
> I suspect we have some more fundamental instability on our hands, there have
> been failures like this going back a while, and on various machines.

I've reproduced something like this scenario locally, although I am not sure it is precisely what is happening on the buildfarm. At least here it looks like the problem is that apply is lagging substantially:

2024-03-20 22:43:11.024 PDT [1023505][walreceiver][:0][] DEBUG: sendtime 2024-03-20 22:43:11.024348-07 receipttime 2024-03-20 22:43:11.02437-07 replication apply delay 285322 ms transfer latency 1 ms

Which then means that we'll wait for a long time for apply to finish:

Waiting for replication conn standby_1's replay_lsn to pass 0/14385E20 on primary
[22:41:34.521](0.221s) # state before polling:
# pid | 1023508
# application_name | standby_1
# sent_lsn | 0/14385E20
# primary_wal_lsn | 0/14385E20
# standby_write_lsn | 0/14385E20
# primary_flush_lsn | 0/14385E20
# standby_flush_lsn | 0/14385E20
# standby_replay_lsn | 0/126D5C58
...
[22:43:16.376](0.161s) # running query, attempt 679/1800
[22:43:16.627](0.251s) # state now:
# pid | 1023508
# application_name | standby_1
# sent_lsn | 0/14778468
# primary_wal_lsn | 0/14778468
# standby_write_lsn | 0/14778468
# primary_flush_lsn | 0/14778468
# standby_flush_lsn | 0/14778468
# standby_replay_lsn | 0/14778468

I am not sure I have debugged why exactly, but it sure looks like one part is the startup process being busy unlinking files synchronously. This appears to be exacerbated by mdunlinkfork() first truncating and then separately unlinking the file - that looks to trigger a lot of filesystem journal flushes (on xfs).

We also spend a fair bit of time updating the control file, because we flush the WAL when replaying a transaction commit with a relation unlink. That also badly interacts with doing metadata operations...
Thirdly, we flush received WAL extremely granularly at times, which requires another fsync:

2024-03-20 23:30:21.469 PDT [1525084][walreceiver][:0][] LOG: flushed received WAL up to 0/13BB0000
2024-03-20 23:30:21.473 PDT [1525084][walreceiver][:0][] LOG: flushed received WAL up to 0/13BB0170
2024-03-20 23:30:21.479 PDT [1525084][walreceiver][:0][] LOG: flushed received WAL up to 0/13BB2528
2024-03-20 23:30:21.480 PDT [1525084][walreceiver][:0][] LOG: flushed received WAL up to 0/13BB58C8
2024-03-20 23:30:21.487 PDT [1525084][walreceiver][:0][] LOG: flushed received WAL up to 0/13BB7DA0
2024-03-20 23:30:21.490 PDT [1525084][walreceiver][:0][] LOG: flushed received WAL up to 0/13BB92B0
2024-03-20 23:30:21.494 PDT [1525084][walreceiver][:0][] LOG: flushed received WAL up to 0/13BBBAC0
2024-03-20 23:30:21.496 PDT [1525084][walreceiver][:0][] LOG: flushed received WAL up to 0/13BBCCC0
2024-03-20 23:30:21.499 PDT [1525084][walreceiver][:0][] LOG: flushed received WAL up to 0/13BBCE18

This all when we're quite far behind with apply...

Greetings,

Andres Freund
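To illustrate the unlink observation above, the pattern being described boils down to two separate filesystem metadata operations per relation segment, roughly as in this Perl rendition (the real code is C in src/backend/storage/smgr/md.c; this is only an analogy, and whether each step costs a journal flush depends on the filesystem and mount options):

use strict;
use warnings;

# Analogy for the two-step removal discussed above: first truncate the
# segment to zero length, then remove its directory entry.  On a
# journaling filesystem such as XFS each step is a separate metadata
# transaction, so doing both can mean two journal flushes per file.
sub remove_in_two_steps
{
	my ($path) = @_;
	truncate($path, 0) or die "truncate $path failed: $!";    # metadata op 1
	unlink($path)      or die "unlink $path failed: $!";      # metadata op 2
	return;
}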
Hi, On 2024-03-20 17:41:45 -0700, Andres Freund wrote: > On 2024-03-14 16:56:39 -0400, Tom Lane wrote: > > Also, this is probably not > > helping anything: > > > > 'extra_config' => { > > ... > > 'fsync = on' > > At some point we had practically no test coverage of fsync, so I made my > animals use fsync. I think we still have little coverage. I probably could > reduce the number of animals using it though. I think there must be some actual regression involved. The frequency of failures on HEAD vs failures on 16 - both of which run the tests concurrently via meson - is just vastly different. I'd expect the absolute number of failures in 027_stream_regress.pl to differ between branches due to fewer runs on 16, but there's no explanation for the difference in percentage of failures. My menagerie had only a single recoveryCheck failure on !HEAD in the last 30 days, but in the vicinity of 100 on HEAD https://buildfarm.postgresql.org/cgi-bin/show_failures.pl?max_days=30&stage=recoveryCheck&filter=Submit If anything the load when testing back branch changes is higher, because commonly back-branch builds are happening on all branches, so I don't think that can be the explanation either. From what I can tell the pattern changed on 2024-02-16 19:39:02 - there was a rash of recoveryCheck failures in the days before that too, but not 027_stream_regress.pl in that way. It certainly seems suspicious that one commit before the first observed failure is 2024-02-16 11:09:11 -0800 [73f0a132660] Pass correct count to WALRead(). Of course the failure rate is low enough that it could have been a day or two before that, too. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > I think there must be some actual regression involved. The frequency of > failures on HEAD vs failures on 16 - both of which run the tests concurrently > via meson - is just vastly different. Are you sure it's not just that the total time to run the core regression tests has grown to a bit more than what the test timeout allows for? regards, tom lane
Hi, On 2024-03-26 00:00:38 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > I think there must be some actual regression involved. The frequency of > > failures on HEAD vs failures on 16 - both of which run the tests concurrently > > via meson - is just vastly different. > > Are you sure it's not just that the total time to run the core > regression tests has grown to a bit more than what the test timeout > allows for? You're right, that could be it - in a way at least, the issue is replay not catching up within 180s, so it'd have to be the data volume growing, I think. But it doesn't look like the regression volume meaningfully grew around that time? I guess I'll try to write a buildfarm database query to extract how long that phase of the test took from all runs on my menagerie, not just the failing one, and see if there's a visible trend. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2024-03-26 00:00:38 -0400, Tom Lane wrote: >> Are you sure it's not just that the total time to run the core >> regression tests has grown to a bit more than what the test timeout >> allows for? > You're right, that could be it - in a way at least, the issue is replay not > catching up within 180s, so it'd have to be the data volume growing, I think. > But it doesn't look like the regression volume meaningfully grew around that > time? No, but my impression is that the failure rate has been getting slowly worse for awhile now. > I guess I'll try to write a buildfarm database query to extract how long that > phase of the test took from all runs on my menagerie, not just the failing > one, and see if there's a visible trend. +1 regards, tom lane
Hi,

On 2024-03-26 00:54:54 -0400, Tom Lane wrote:
> > I guess I'll try to write a buildfarm database query to extract how long that
> > phase of the test took from all runs on my menagerie, not just the failing
> > one, and see if there's a visible trend.
>
> +1

Only the query for successful runs has finished, and it looks like the error case is still going to take a while longer, so here are excerpts from a query showing how long 027_stream_regress.pl took to succeed:

    sysname    |    date    | count | avg_duration
---------------+------------+-------+--------------
 calliphoridae | 2024-01-29 | 10 | 435
 calliphoridae | 2024-02-05 | 25 | 496
 calliphoridae | 2024-02-12 | 36 | 522
 calliphoridae | 2024-02-19 | 25 | 445
 calliphoridae | 2024-02-26 | 35 | 516
 calliphoridae | 2024-03-04 | 53 | 507
 calliphoridae | 2024-03-11 | 51 | 532
 calliphoridae | 2024-03-18 | 53 | 548
 calliphoridae | 2024-03-25 | 13 | 518
 culicidae | 2024-01-29 | 11 | 420
 culicidae | 2024-02-05 | 31 | 485
 culicidae | 2024-02-12 | 35 | 513
 culicidae | 2024-02-19 | 29 | 489
 culicidae | 2024-02-26 | 36 | 512
 culicidae | 2024-03-04 | 63 | 541
 culicidae | 2024-03-11 | 62 | 597
 culicidae | 2024-03-18 | 56 | 603
 culicidae | 2024-03-25 | 16 | 550
 tamandua | 2024-01-29 | 13 | 420
 tamandua | 2024-02-05 | 29 | 433
 tamandua | 2024-02-12 | 34 | 431
 tamandua | 2024-02-19 | 27 | 382
 tamandua | 2024-02-26 | 36 | 492
 tamandua | 2024-03-04 | 60 | 475
 tamandua | 2024-03-11 | 56 | 533
 tamandua | 2024-03-18 | 54 | 527
 tamandua | 2024-03-25 | 21 | 507

Particularly on tamandua it does look like there has been an upwards trend.

Late, will try to look more in the next few days.

Greetings,

Andres Freund
Hello Andres,

26.03.2024 10:59, Andres Freund wrote:
> Late, will try to look more in the next few days.

AFAICS, the last 027_streaming_regress.pl failures on calliphoridae, culicidae, tamandua occurred before 2024-03-27:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-26%2004%3A07%3A30
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2024-03-22%2013%3A26%3A21
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-03-24%2007%3A44%3A27

So it looks like the issue is resolved, but there is another apparently performance-related issue: deadlock-parallel test failures. A recent one:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=canebrake&dt=2024-04-02%2022%3A20%3A22

test deadlock-parallel ... FAILED 345099 ms
+isolationtester: canceling step d2a1 after 300 seconds
step d2a1: <... completed>
- sum
------
-10000
-(1 row)
-
...

The server log shows:
2024-04-02 23:56:45.353 UTC [3583878][client backend][5/530:0] LOG: statement: SET force_parallel_mode = on; ... SELECT lock_share(3,x) FROM bigt LIMIT 1;
2024-04-02 23:56:45.364 UTC [3583876][client backend][3/2732:0] LOG: execute isolationtester_waiting: SELECT pg_catalog.pg_isolation_test_session_is_blocked($1, '{3583877,3583878,3583879,3583880}')
2024-04-02 23:56:45.364 UTC [3583876][client backend][3/2732:0] DETAIL: parameters: $1 = '3583878'
...
2024-04-02 23:57:28.967 UTC [3583876][client backend][3/5097:0] LOG: execute isolationtester_waiting: SELECT pg_catalog.pg_isolation_test_session_is_blocked($1, '{3583877,3583878,3583879,3583880}')
2024-04-02 23:57:28.967 UTC [3583876][client backend][3/5097:0] DETAIL: parameters: $1 = '3583877'
2024-04-02 23:57:29.016 UTC [3583877][client backend][4/530:0] LOG: statement: COMMIT;
2024-04-02 23:57:29.039 UTC [3583876][client backend][3/5098:0] LOG: execute isolationtester_waiting: SELECT pg_catalog.pg_isolation_test_session_is_blocked($1, '{3583877,3583878,3583879,3583880}')
2024-04-02 23:57:29.039 UTC [3583876][client backend][3/5098:0] DETAIL: parameters: $1 = '3583879'
...
2024-04-03 00:02:29.096 UTC [3583876][client backend][3/9472:0] LOG: execute isolationtester_waiting: SELECT pg_catalog.pg_isolation_test_session_is_blocked($1, '{3583877,3583878,3583879,3583880}')
2024-04-03 00:02:29.096 UTC [3583876][client backend][3/9472:0] DETAIL: parameters: $1 = '3583878'
2024-04-03 00:02:29.172 UTC [3905345][not initialized][:0] LOG: connection received: host=[local]
2024-04-03 00:02:29.240 UTC [3583878][client backend][5/530:0] ERROR: canceling statement due to user request

The last step duration is 00:02:29.096 - 23:57:29.039 ~ 300 seconds (the default max_step_wait for REL_15_STABLE and earlier; for REL_16_STABLE+ the default value was increased to 360 by c99c67fc4).

The average deadlock-parallel duration for REL_15_STABLE on canebrake is around 128 seconds (for the 140 runs I analyzed), but we can also find:
test deadlock-parallel ... ok 377895 ms
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=canebrake&dt=2024-03-27%2001%3A06%3A24&stg=isolation-check
test deadlock-parallel ... ok 302549 ms
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=canebrake&dt=2023-11-06%2012%3A47%3A01&stg=isolation-check
test deadlock-parallel ... ok 255045 ms
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=canebrake&dt=2023-11-09%2010%3A02%3A59&stg=isolation-check

A similar situation on phycodurus:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2024-02-11%2021:05:41
test deadlock-parallel ... FAILED 389381 ms

The average deadlock-parallel duration for REL_13_STABLE on phycodurus is around 78 seconds (for 138 recent runs), but there were also:
test deadlock-parallel ... ok 441736 ms
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=phycodurus&dt=2024-03-04%2015%3A19%3A04&stg=isolation-check
test deadlock-parallel ... ok 187844 ms
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=phycodurus&dt=2023-11-03%2016%3A13%3A46&stg=isolation-check

And also pogona, REL_14_STABLE:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=pogona&dt=2024-02-20%2003%3A50%3A49
test deadlock-parallel ... FAILED 425482 ms

(I could reach a similar duration on a slowed-down VM, with JIT enabled as on these animals.)

So, maybe these machines require a larger PGISOLATIONTIMEOUT, or there is still some OS/environment issue there?

Best regards,
Alexander
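If it really is just the per-step isolation timeout, one stopgap for these slow, JIT-forcing animals would be to raise PGISOLATIONTIMEOUT, e.g. through the buildfarm config's build_env; the value here is arbitrary and only meant as an illustration:

# Hypothetical build-farm.conf fragment for a very slow animal.
build_env => {
	# isolationtester's per-step wait limit, in seconds
	# (default 300 up to REL_15_STABLE, 360 on later branches).
	PGISOLATIONTIMEOUT => '900',
},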
Hi, On 2024-04-04 19:00:00 +0300, Alexander Lakhin wrote: > 26.03.2024 10:59, Andres Freund wrote: > > Late, will try to look more in the next few days. > > > > AFAICS, last 027_streaming_regress.pl failures on calliphoridae, > culicidae, tamandua occurred before 2024-03-27: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-26%2004%3A07%3A30 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2024-03-22%2013%3A26%3A21 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-03-24%2007%3A44%3A27 > > So it looks like the issue resolved, but there is another apparently > performance-related issue: deadlock-parallel test failures. I reduced test concurrency a bit. I hadn't quite realized how the buildfarm config and meson test concurrency interact. But there's still something off with the frequency of fsyncs during replay, but perhaps that doesn't qualify as a bug. > A recent one: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=canebrake&dt=2024-04-02%2022%3A20%3A22 > test deadlock-parallel ... FAILED 345099 ms > (I could reach similar duration on a slowed-down VM, with JIT enabled as > on these animals.) > > So, maybe these machines require larger PGISOLATIONTIMEOUT or there is > still some OS/environment issue there? Hm, possible. Forcing every query to be JITed, in a debug build of LLVM is absurdly expensive. Greetings, Andres Freund
Hello Andres,

>> So it looks like the issue resolved, but there is another apparently
>> performance-related issue: deadlock-parallel test failures.

> I reduced test concurrency a bit. I hadn't quite realized how the buildfarm
> config and meson test concurrency interact. But there's still something off
> with the frequency of fsyncs during replay, but perhaps that doesn't qualify
> as a bug.

It looks like that set of animals is still suffering from extreme load. Please take a look at today's failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-06-04%2002%3A44%3A19

1/1 postgresql:regress-running / regress-running/regress TIMEOUT 3000.06s killed by signal 15 SIGTERM

inst/logfile ends with:
2024-06-04 03:39:24.664 UTC [3905755][client backend][5/1787:16793] ERROR: column "c2" of relation "test_add_column" already exists
2024-06-04 03:39:24.664 UTC [3905755][client backend][5/1787:16793] STATEMENT: ALTER TABLE test_add_column ADD COLUMN c2 integer, -- fail because c2 already exists
ADD COLUMN c3 integer primary key;
2024-06-04 03:39:30.815 UTC [3905755][client backend][5/0:0] LOG: could not send data to client: Broken pipe
2024-06-04 03:39:30.816 UTC [3905755][client backend][5/0:0] FATAL: connection to client lost

"ALTER TABLE test_add_column" is from the alter_table test, which is executed in group 21 out of 25.

Another similar failure:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=skink&dt=2024-05-24%2002%3A22%3A26&stg=install-check-C

1/1 postgresql:regress-running / regress-running/regress TIMEOUT 3000.06s killed by signal 15 SIGTERM

inst/logfile ends with:
2024-05-24 03:18:51.469 UTC [998579][client backend][7/1792:16786] ERROR: could not change table "logged1" to unlogged because it references logged table "logged2"
2024-05-24 03:18:51.469 UTC [998579][client backend][7/1792:16786] STATEMENT: ALTER TABLE logged1 SET UNLOGGED;

(This is the alter_table test again.)

I've analyzed the duration of the regress-running/regress test for the recent 167 runs on skink and found that the average duration is 1595 seconds, but there were much longer test runs:
2979.39: https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=skink&dt=2024-05-01%2004%3A15%3A29&stg=install-check-C
2932.86: https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=skink&dt=2024-04-28%2018%3A57%3A37&stg=install-check-C
2881.78: https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=skink&dt=2024-05-15%2020%3A53%3A30&stg=install-check-C

So it seems that the default timeout is not large enough for these conditions. (I've counted 10 such timeout failures out of 167 test runs.)

Also, 027_stream_regress still fails due to the same reason:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-05-22%2021%3A55%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-05-22%2021%3A54%3A50
(It's remarkable that these two animals failed at the same time.)

Best regards,
Alexander
Hello Andrew,

04.06.2024 13:00, Alexander Lakhin wrote:
> Also, 027_stream_regress still fails due to the same reason:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-05-22%2021%3A55%3A03
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-05-22%2021%3A54%3A50
> (It's remarkable that these two animals failed at the same time.)

It looks like crake is failing now for other reasons (not just concurrency): it has produced 10+ failures of the 027_stream_regress test starting from July 9.

The first such failure on REL_16_STABLE was [1], and that was the first run with 'PG_TEST_EXTRA' => '... wal_consistency_checking'.

There is one known issue related to wal_consistency_checking [2], but I see no "incorrect resource manager data checksum" in the failure log...

Moreover, the first such failure on REL_17_STABLE was [3], but that run was performed without wal_consistency_checking, as far as I can see.

Could that failure also be related to the OS upgrade (I see that back in June crake was running on Fedora 39, but now it's running on Fedora 40)? So maybe we have two factors combined, or there is another one?

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2024-07-17%2014%3A56%3A09
[2] https://www.postgresql.org/message-id/055bb33c-bccc-bc1d-c2f8-8598534448ac@gmail.com
[3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2024-07-09%2021%3A37%3A04

Best regards,
Alexander
On 2024-07-25 Th 12:00 AM, Alexander Lakhin wrote: > Hello Andrew, > > 04.06.2024 13:00, Alexander Lakhin wrote: >> Also, 027_stream_regress still fails due to the same reason: >> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-05-22%2021%3A55%3A03 >> >> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-05-22%2021%3A54%3A50 >> >> (It's remarkable that these two animals failed at the same time.) >> > > It looks like crake is failing now due to other reasons (not just > concurrency) — it produced 10+ failures of the > 027_stream_regress test starting from July, 9. > > The first such failure on REL_16_STABLE was [1], and that was the > first run > with 'PG_TEST_EXTRA' => '... wal_consistency_checking'. > > There is one known issue related to wal_consistency_checking [2], but I > see no "incorrect resource manager data checksum" in the failure log... > > Moreover, the first such failure on REL_17_STABLE was [3], but that run > was performed without wal_consistency_checking, as far as I can see. > > Can that failure be also related to the OS upgrade (I see that back in > June crake was running on Fedora 39, but now it's running on Fedora 40)? > > So maybe we have two factors combined or there is another one? > > [1] > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2024-07-17%2014%3A56%3A09 > [2] > https://www.postgresql.org/message-id/055bb33c-bccc-bc1d-c2f8-8598534448ac@gmail.com > [3] > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2024-07-09%2021%3A37%3A04 Unlikely. The change in OS version was on June 17, more than a month ago. But yes we do seem to have seen a lot of recovery_check failures on crake in the last 8 days, which is roughly when I changed PG_TEST_EXTRA to get more coverage. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > But yes we do seem to have seen a lot of recovery_check failures on > crake in the last 8 days, which is roughly when I changed PG_TEST_EXTRA > to get more coverage. I'm confused by crake's buildfarm logs. AFAICS it is not running recovery-check at all in most of the runs; at least there is no mention of that step, for example here: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2024-07-25%2013%3A27%3A02 It seems implausible that it would only run the test occasionally, so what I suspect is a bug in the buildfarm client causing it to omit that step's log if successful. regards, tom lane
I wrote:
> I'm confused by crake's buildfarm logs. AFAICS it is not running
> recovery-check at all in most of the runs; at least there is no
> mention of that step, for example here:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2024-07-25%2013%3A27%3A02

Oh, I see it: the log file that is called recovery-check in a failing run is called misc-check if successful. That seems mighty bizarre, and it's not how my own animals behave. Something weird about the meson code path, perhaps?

Anyway, in this successful run:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2024-07-25%2018%3A57%3A02&stg=misc-check
here are some salient test timings:

1/297 postgresql:pg_upgrade / pg_upgrade/001_basic OK 0.18s 9 subtests passed
2/297 postgresql:pg_upgrade / pg_upgrade/003_logical_slots OK 15.95s 12 subtests passed
3/297 postgresql:pg_upgrade / pg_upgrade/004_subscription OK 16.29s 14 subtests passed
17/297 postgresql:isolation / isolation/isolation OK 71.60s 119 subtests passed
41/297 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade OK 169.13s 18 subtests passed
140/297 postgresql:initdb / initdb/001_initdb OK 41.34s 52 subtests passed
170/297 postgresql:recovery / recovery/027_stream_regress OK 469.49s 9 subtests passed

while in the next, failing run
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2024-07-25%2020%3A18%3A05&stg=recovery-check
the same tests took:

1/297 postgresql:pg_upgrade / pg_upgrade/001_basic OK 0.22s 9 subtests passed
2/297 postgresql:pg_upgrade / pg_upgrade/003_logical_slots OK 56.62s 12 subtests passed
3/297 postgresql:pg_upgrade / pg_upgrade/004_subscription OK 71.92s 14 subtests passed
21/297 postgresql:isolation / isolation/isolation OK 299.12s 119 subtests passed
31/297 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade OK 344.42s 18 subtests passed
159/297 postgresql:initdb / initdb/001_initdb OK 344.46s 52 subtests passed
162/297 postgresql:recovery / recovery/027_stream_regress ERROR 840.84s exit status 29

Based on this, it seems fairly likely that crake is simply timing out as a consequence of intermittent heavy background activity.

regards, tom lane
On Fri, Jul 26, 2024 at 9:14 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Based on this, it seems fairly likely that crake is simply timing out > as a consequence of intermittent heavy background activity. Would it be better to keep going as long as progress is being made? I.e. time out only when the relevant LSN stops advancing for N seconds? Or something like that...
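A rough sketch of that "only time out when progress stops" idea, assuming a PostgreSQL::Test::Cluster handle for the primary; the 60-second stall window and the helper name are made up for illustration:

use strict;
use warnings;
use Time::HiRes qw(time usleep);

# Sketch: wait for the standby's replay_lsn to reach $target_lsn, but
# only give up once the reported LSN has stopped moving for $stall_secs.
sub wait_for_replay_with_stall_detection
{
	my ($primary, $standby_name, $target_lsn, $stall_secs) = @_;
	$stall_secs //= 60;

	my $last_lsn      = '';
	my $last_progress = time();

	while (1)
	{
		my $lsn = $primary->safe_psql('postgres',
			qq[SELECT replay_lsn FROM pg_catalog.pg_stat_replication
			   WHERE application_name = '$standby_name']);

		# Done once the standby has replayed past the target.
		return 1
		  if $lsn ne ''
		  && $primary->safe_psql('postgres',
			  qq[SELECT '$target_lsn'::pg_lsn <= '$lsn'::pg_lsn]) eq 't';

		# Reset the stall clock whenever the reported LSN advances at all.
		if ($lsn ne $last_lsn)
		{
			$last_lsn      = $lsn;
			$last_progress = time();
		}
		return 0 if time() - $last_progress > $stall_secs;
		usleep(100_000);
	}
}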
On 2024-07-25 Th 5:14 PM, Tom Lane wrote:

I wrote: I'm confused by crake's buildfarm logs. AFAICS it is not running recovery-check at all in most of the runs; at least there is no mention of that step, for example here: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2024-07-25%2013%3A27%3A02

Oh, I see it: the log file that is called recovery-check in a failing run is called misc-check if successful. That seems mighty bizarre, and it's not how my own animals behave. Something weird about the meson code path, perhaps?
Yes, it was discussed some time ago. As suggested by Andres, we run the meson test suite more or less all together (in effect like "make checkworld" but without the main regression suite, which is run on its own first), rather than in the separate (and serialized) way we do with the configure/make tests. That results in significant speedup. If the tests fail we report the failure as happening at the first failure we encounter, which is possibly less than ideal, but I haven't got a better idea.
Anyway, in this successful run: https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2024-07-25%2018%3A57%3A02&stg=misc-check here are some salient test timings: 1/297 postgresql:pg_upgrade / pg_upgrade/001_basic OK 0.18s 9 subtests passed 2/297 postgresql:pg_upgrade / pg_upgrade/003_logical_slots OK 15.95s 12 subtests passed 3/297 postgresql:pg_upgrade / pg_upgrade/004_subscription OK 16.29s 14 subtests passed 17/297 postgresql:isolation / isolation/isolation OK 71.60s 119 subtests passed 41/297 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade OK 169.13s 18 subtests passed 140/297 postgresql:initdb / initdb/001_initdb OK 41.34s 52 subtests passed 170/297 postgresql:recovery / recovery/027_stream_regress OK 469.49s 9 subtests passed while in the next, failing run https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2024-07-25%2020%3A18%3A05&stg=recovery-check the same tests took: 1/297 postgresql:pg_upgrade / pg_upgrade/001_basic OK 0.22s 9 subtests passed 2/297 postgresql:pg_upgrade / pg_upgrade/003_logical_slots OK 56.62s 12 subtests passed 3/297 postgresql:pg_upgrade / pg_upgrade/004_subscription OK 71.92s 14 subtests passed 21/297 postgresql:isolation / isolation/isolation OK 299.12s 119 subtests passed 31/297 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade OK 344.42s 18 subtests passed 159/297 postgresql:initdb / initdb/001_initdb OK 344.46s 52 subtests passed 162/297 postgresql:recovery / recovery/027_stream_regress ERROR 840.84s exit status 29 Based on this, it seems fairly likely that crake is simply timing out as a consequence of intermittent heavy background activity.
The latest failure is this:
Waiting for replication conn standby_1's replay_lsn to pass 2/88E4E260 on primary
[16:40:29.481](208.545s) # poll_query_until timed out executing this query:
# SELECT '2/88E4E260' <= replay_lsn AND state = 'streaming'
# FROM pg_catalog.pg_stat_replication
# WHERE application_name IN ('standby_1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
timed out waiting for catchup at /home/andrew/bf/root/HEAD/pgsql/src/test/recovery/t/027_stream_regress.pl line 103.

Maybe it's a case where the system is overloaded, I dunno. I wouldn't bet my house on it. Pretty much nothing else runs on this machine.

I have added a mild throttle to the buildfarm config so it doesn't try to run every branch at once. Maybe I also need to bring down the number of meson jobs too? But I suspect there's something deeper. Prior to the failure of this test 10 days ago it hadn't failed in a very long time. The OS was upgraded a month ago. Eight or so days ago I changed PG_TEST_EXTRA. I can't think of anything else.
There seem to be a bunch of recent failures, and not just on crake, and not just on HEAD: <https://buildfarm.postgresql.org/cgi-bin/show_failures.pl?max_days=14&member=&stage=recoveryCheck&filter=Submit>
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > There seem to be a bunch of recent failures, and not just on crake, and > not just on HEAD: > <https://buildfarm.postgresql.org/cgi-bin/show_failures.pl?max_days=14&member=&stage=recoveryCheck&filter=Submit> There were a batch of recovery-stage failures ending about 9 days ago caused by instability of the new 043_vacuum_horizon_floor.pl test. Once you take those out it doesn't look quite so bad. regards, tom lane
On 2024-07-31 We 12:05 PM, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> There seem to be a bunch of recent failures, and not just on crake, and
>> not just on HEAD:
>> <https://buildfarm.postgresql.org/cgi-bin/show_failures.pl?max_days=14&member=&stage=recoveryCheck&filter=Submit>
> There were a batch of recovery-stage failures ending about 9 days ago
> caused by instability of the new 043_vacuum_horizon_floor.pl test.
> Once you take those out it doesn't look quite so bad.

We'll see. I have switched crake from --run-parallel mode to --run-all mode, i.e. the runs are serialized. Maybe that will be enough to stop the errors. I'm still annoyed that this test is susceptible to load, if that is indeed the issue.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes:
> We'll see. I have switched crake from --run-parallel mode to --run-all
> mode i.e. the runs are serialized. Maybe that will be enough to stop the
> errors. I'm still annoyed that this test is susceptible to load, if that
> is indeed what is the issue.

crake is still timing out intermittently on 027_streaming_regress.pl, so that wasn't it.

I think we need more data. We know that the wait_for_catchup query is never getting to true:

SELECT '$target_lsn' <= ${mode}_lsn AND state = 'streaming'

but we don't know if the LSN condition or the state condition is what is failing. And if it is the LSN condition, it'd be good to see the actual last LSN, so we can look for patterns like whether there is a page boundary crossing involved. So I suggest adding something like the attached.

If we do this, I'd be inclined to instrument wait_for_slot_catchup and wait_for_subscription_sync similarly, but I thought I'd check for contrary opinions first.

regards, tom lane

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 32ee98aebc..3403626f92 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2948,6 +2948,13 @@ sub wait_for_catchup
 	}
 	else
 	{
+		# Fetch additional detail for debugging purposes
+		$query = qq[SELECT application_name, ${mode}_lsn, state
+			FROM pg_catalog.pg_stat_replication
+			WHERE application_name IN ('$standby_name', 'walreceiver')];
+		my $details = $self->safe_psql('postgres', $query);
+		diag qq(Last application_name|${mode}_lsn|state:
+${details});
 		croak "timed out waiting for catchup";
 	}
 }
Andrew Dunstan <andrew@dunslane.net> writes:

We'll see. I have switched crake from --run-parallel mode to --run-all mode i.e. the runs are serialized. Maybe that will be enough to stop the errors. I'm still annoyed that this test is susceptible to load, if that is indeed what is the issue.

crake is still timing out intermittently on 027_streaming_regress.pl, so that wasn't it. I think we need more data. We know that the wait_for_catchup query is never getting to true:

SELECT '$target_lsn' <= ${mode}_lsn AND state = 'streaming'

but we don't know if the LSN condition or the state condition is what is failing. And if it is the LSN condition, it'd be good to see the actual last LSN, so we can look for patterns like whether there is a page boundary crossing involved. So I suggest adding something like the attached. If we do this, I'd be inclined to instrument wait_for_slot_catchup and wait_for_subscription_sync similarly, but I thought I'd check for contrary opinions first.
Seems reasonable.
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > On 2024-08-11 Su 8:32 PM, Tom Lane wrote: >> I think we need more data. We know that the >> wait_for_catchup query is never getting to true: >> >> SELECT '$target_lsn' <= ${mode}_lsn AND state = 'streaming' >> >> but we don't know if the LSN condition or the state condition is >> what is failing. And if it is the LSN condition, it'd be good >> to see the actual last LSN, so we can look for patterns like >> whether there is a page boundary crossing involved. So I suggest >> adding something like the attached. > Seems reasonable. Pushed. In the event I made it just "SELECT * FROM" the relevant view: there will be few if any rows that aren't potentially interesting, and filtering the columns doesn't seem like a forward-looking idea either. regards, tom lane
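Presumably the committed version of that diagnostic boils down to something like the following inside wait_for_catchup()'s timeout branch (a paraphrase based on Tom's description, not the exact committed hunk; Carp and Test::More are already in scope in Cluster.pm):

# On timeout, dump the whole view rather than a hand-picked column list.
my $details = $self->safe_psql('postgres',
	'SELECT * FROM pg_catalog.pg_stat_replication');
diag "pg_stat_replication contents:\n$details";
croak "timed out waiting for catchup";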