Thread: Recent 027_streaming_regress.pl hangs
Hi,

Several animals are timing out while waiting for catchup, sporadically. I don't know why. The oldest example I have found so far by clicking around is:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-02-23%2015%3A44%3A35

So perhaps something committed ~3 weeks ago triggered this. There are many examples since, showing as recoveryCheck failures. Apparently they are all on animals wrangled by Andres. Hmm. I think some/all share a physical host, they seem to have quite high run time variance, and they're using meson.
On Wed, Mar 13, 2024 at 10:53 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-02-23%2015%3A44%3A35

Assuming it is due to a commit in master, and given the failure frequency, I think it is very likely to be a change from this 3 day window of commits, and more likely in the top half dozen or so:

d360e3cc60e Fix compiler warning on typedef redeclaration
8af25652489 Introduce a new smgr bulk loading facility.
e612384fc78 Fix mistake in SQL features list
d13ff82319c Fix BF failure in commit 93db6cbda0.
efa70c15c74 Make GetSlotInvalidationCause() return RS_INVAL_NONE on unexpected input
93db6cbda03 Add a new slot sync worker to synchronize logical slots.
3d47b75546d pgindent fix
b6df0798a5e Fix the intermittent buildfarm failures in 031_column_list.
fbc93b8b5f5 Remove custom Constraint node read/write implementations
801792e528d Improve ERROR/LOG messages added by commits ddd5f4f54a and 7a424ece48.
011d60c4352 Speed up uuid_out() by not relying on a StringInfo
943f7ae1c86 Add lookup table for replication slot conflict reasons
28f3915b73f Remove superfluous 'pgprocno' field from PGPROC
4989ce72644 MERGE ... DO NOTHING: require SELECT privileges
ed345c2728b Fix typo
690805ca754 doc: Fix link to pg_ident_file_mappings view
ff9e1e764fc Add option force_initdb to PostgreSQL::Test::Cluster:init()
75bcba6cbd2 Remove extra check_stack_depth() from dropconstraint_internal()
fcd210d496d Doc: improve explanation of type interval, especially extract().
489072ab7a9 Replace relids in lateral subquery parse tree during SJE
74563f6b902 Revert "Improve compression and storage support with inheritance"
d2ca9a50b5b Minor corrections for partition pruning
818fefd8fd4 Fix race leading to incorrect conflict cause in InvalidatePossiblyObsoleteSlot()
01ec4d89b91 doc: Use system-username instead of system-user

I just haven't got a specific theory yet, as the logs are empty. I wonder if some kind of failures could start firing signals around to get us a stack.
On Thu, Mar 14, 2024 at 03:00:28PM +1300, Thomas Munro wrote:
> Assuming it is due to a commit in master, and given the failure
> frequency, I think it is very likely to be a change from this 3 day
> window of commits, and more likely in the top half dozen or so:
>
> d360e3cc60e Fix compiler warning on typedef redeclaration
> 8af25652489 Introduce a new smgr bulk loading facility.
> e612384fc78 Fix mistake in SQL features list
> d13ff82319c Fix BF failure in commit 93db6cbda0.
> efa70c15c74 Make GetSlotInvalidationCause() return RS_INVAL_NONE on unexpected input
> 93db6cbda03 Add a new slot sync worker to synchronize logical slots.
> 3d47b75546d pgindent fix
> b6df0798a5e Fix the intermittent buildfarm failures in 031_column_list.
> fbc93b8b5f5 Remove custom Constraint node read/write implementations
> 801792e528d Improve ERROR/LOG messages added by commits ddd5f4f54a and 7a424ece48.
> 011d60c4352 Speed up uuid_out() by not relying on a StringInfo
> 943f7ae1c86 Add lookup table for replication slot conflict reasons
> 28f3915b73f Remove superfluous 'pgprocno' field from PGPROC
> 4989ce72644 MERGE ... DO NOTHING: require SELECT privileges
> ed345c2728b Fix typo
> 690805ca754 doc: Fix link to pg_ident_file_mappings view
> ff9e1e764fc Add option force_initdb to PostgreSQL::Test::Cluster:init()
> 75bcba6cbd2 Remove extra check_stack_depth() from dropconstraint_internal()
> fcd210d496d Doc: improve explanation of type interval, especially extract().
> 489072ab7a9 Replace relids in lateral subquery parse tree during SJE
> 74563f6b902 Revert "Improve compression and storage support with inheritance"
> d2ca9a50b5b Minor corrections for partition pruning
> 818fefd8fd4 Fix race leading to incorrect conflict cause in InvalidatePossiblyObsoleteSlot()
> 01ec4d89b91 doc: Use system-username instead of system-user
>
> I just haven't got a specific theory yet, as the logs are empty. I
> wonder if some kind of failures could start firing signals around to
> get us a stack.

Thanks for providing this list and an analysis. Hmm. Perhaps 8af25652489? That looks like the closest thing in the list that could have played with the way WAL is generated, hence potentially impacting the records that are replayed. 93db6cbda03, efa70c15c74 and 818fefd8fd4 came to mind, but they touch unrelated territory.
--
Michael
On Thu, Mar 14, 2024 at 3:27 PM Michael Paquier <michael@paquier.xyz> wrote: > Hmm. Perhaps 8af25652489? That looks like the closest thing in the > list that could have played with the way WAL is generated, hence > potentially impacting the records that are replayed. Yeah, I was wondering if its checkpoint delaying logic might have got the checkpointer jammed or something like that, but I don't currently see how. Yeah, the replay of bulk newpages could be relevant, but it's not exactly new technology. One thing I wondered about is whether the Perl "wait for catchup" thing, which generates large volumes of useless log, could be somehow changed to actually show the progress after some time. Something like "I'm still waiting for this replica to reach LSN X, but it has so far only reported LSN Y, and here's a dump of the WAL around there"? > 93db6cbda03, efa70c15c74 and 818fefd8fd4 came to mind, but they touch > unrelated territory. Hmm.
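To make the kind of progress report Thomas describes a bit more concrete, a minimal sketch might look like the following. It assumes the TAP framework's PostgreSQL::Test::Cluster and Test::More are available; the helper name report_catchup_progress and the pg_waldump invocation are purely illustrative, not existing test infrastructure:

use strict;
use warnings;
use Test::More;

# Hypothetical helper: say how far the standby has actually got relative
# to the LSN we are waiting for, and dump a few WAL records around the
# apparent stall point for later analysis.
sub report_catchup_progress
{
	my ($standby, $target_lsn) = @_;

	my $replay_lsn = $standby->safe_psql('postgres',
		'SELECT pg_last_wal_replay_lsn()');
	diag "still waiting: target LSN $target_lsn, standby has only replayed $replay_lsn";

	# Dump a handful of records around the point where replay seems stuck.
	my $waldir = $standby->data_dir . '/pg_wal';
	system('pg_waldump', '-p', $waldir, '-s', $replay_lsn, '-n', '20') == 0
	  or diag "pg_waldump exited with status $?";
	return;
}

Something along these lines, called every 30 seconds or so from inside the wait loop, would at least tell us whether the standby is crawling forward or completely stuck.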
Hello Thomas and Michael,

14.03.2024 06:16, Thomas Munro wrote:
> Yeah, I was wondering if its checkpoint delaying logic might have
> got the checkpointer jammed or something like that, but I don't
> currently see how. Yeah, the replay of bulk newpages could be
> relevant, but it's not exactly new technology. One thing I wondered
> about is whether the Perl "wait for catchup" thing, which generates
> large volumes of useless log, could be somehow changed to actually
> show the progress after some time. Something like "I'm still waiting
> for this replica to reach LSN X, but it has so far only reported LSN
> Y, and here's a dump of the WAL around there"?

I have perhaps reproduced the issue here (at least I'm seeing something similar) and am going to investigate it in the coming days, but what confuses me now is the duration of poll_query_until.

For the failure you referenced:
[15:55:54.740](418.725s) # poll_query_until timed out executing this query:

And a couple of others:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-03-08%2000%3A34%3A06
[00:45:57.747](376.159s) # poll_query_until timed out executing this query:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-03-04%2016%3A32%3A17
[16:45:24.870](407.970s) # poll_query_until timed out executing this query:

Could it be that the timeout (360 sec?) is just not enough for the test under the current (changed due to the switch to meson) conditions?

Best regards,
Alexander
On Fri, Mar 15, 2024 at 7:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> Could it be that the timeout (360 sec?) is just not enough for the test
> under the current (changed due to switch to meson) conditions?

Hmm, well it looks like he switched over to meson around 42 days ago, 2024-02-01, looking at "calliphoridae" (skink has the extra complication of valgrind, so let's look at a more 'normal' animal instead). The first failure that looks like that on calliphoridae is 19 days ago, 2024-02-23, and after that it's happening every 3 days, sometimes in clusters.

https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=calliphoridae&br=HEAD

But you're right that under meson the test takes a lot longer, I guess due to increased concurrency:

287/287 postgresql:recovery / recovery/027_stream_regress OK 684.50s 6 subtests passed

With make we don't have an individual time per script, but for all of the recovery tests we had, for example:

t/027_stream_regress.pl ............... ok
All tests successful.
Files=39, Tests=542, 65 wallclock secs ( 0.26 usr 0.06 sys + 20.16 cusr 31.65 csys = 52.13 CPU)
Thomas Munro <thomas.munro@gmail.com> writes: > On Fri, Mar 15, 2024 at 7:00 AM Alexander Lakhin <exclusion@gmail.com> wrote: >> Could it be that the timeout (360 sec?) is just not enough for the test >> under the current (changed due to switch to meson) conditions? > But you're right that under meson the test takes a lot longer, I guess > due to increased concurrency: What it seems to be is highly variable. Looking at calliphoridae's last half dozen successful runs, I see reported times for 027_stream_regress anywhere from 183 to 704 seconds. I wonder what else is happening on that machine. Also, this is probably not helping anything: 'extra_config' => { ... 'fsync = on' I would suggest turning that off and raising wait_timeout a good deal, and then we'll see if calliphoridae gets any more stable. regards, tom lane
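For what it's worth, the sort of change Tom suggests would presumably look something like this in the animal's build-farm.conf. The key names follow the usual buildfarm client conventions, and mapping "wait_timeout" onto PG_TEST_TIMEOUT_DEFAULT is my assumption about which knob is meant, so treat this only as a sketch:

# Hypothetical build-farm.conf fragment, not copied from any real animal.
extra_config => {
	DEFAULT => [
		'fsync = off',    # instead of the fsync = on override quoted above
	],
},
build_env => {
	# Give the TAP tests more headroom on a heavily loaded host;
	# PostgreSQL::Test::Utils derives its timeouts from this (default 180).
	PG_TEST_TIMEOUT_DEFAULT => '360',
},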
14.03.2024 23:56, Tom Lane wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
>> On Fri, Mar 15, 2024 at 7:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
>>> Could it be that the timeout (360 sec?) is just not enough for the test
>>> under the current (changed due to switch to meson) conditions?
>> But you're right that under meson the test takes a lot longer, I guess
>> due to increased concurrency:
> What it seems to be is highly variable. Looking at calliphoridae's
> last half dozen successful runs, I see reported times for
> 027_stream_regress anywhere from 183 to 704 seconds. I wonder what
> else is happening on that machine. Also, this is probably not
> helping anything:
>
> 'extra_config' => {
> ...
> 'fsync = on'
>
> I would suggest turning that off and raising wait_timeout a good
> deal, and then we'll see if calliphoridae gets any more stable.

I could reproduce similar failures with PG_TEST_EXTRA=wal_consistency_checking only, running 5 tests in parallel on a slowed-down VM, so that the test duration increased to ~1900 seconds, but perhaps that buildfarm machine has a different bottleneck (I/O?) or its concurrent workload is not as uniform as in my experiments.

Meanwhile, I've analyzed failed test logs from the buildfarm and calculated the percentage of WAL replayed before the timeout. For instance, one of the failures:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-18%2022%3A36%3A40

standby_1.log:
2024-03-18 22:38:22.743 UTC [2010896][walreceiver][:0] LOG: started streaming WAL from primary at 0/3000000 on timeline 1
...
2024-03-18 22:50:02.439 UTC [2004203][checkpointer][:0] LOG: recovery restart point at 0/E00E030
2024-03-18 22:50:02.439 UTC [2004203][checkpointer][:0] DETAIL: Last completed transaction was at log time 2024-03-18 22:41:26.647756+00.
2024-03-18 22:50:12.841 UTC [2010896][walreceiver][:0] FATAL: could not receive data from WAL stream: server closed the connection unexpectedly

primary.log:
2024-03-18 22:38:23.754 UTC [2012240][client backend][3/3:0] LOG: statement: GRANT ALL ON SCHEMA public TO public;
# ^^^ One of the first records produced by `make check`
...
2024-03-18 22:41:26.647 UTC [2174047][client backend][10/752:0] LOG: statement: ALTER VIEW my_property_secure SET (security_barrier=false);
# ^^^ A record near the last completed transaction on standby
...
2024-03-18 22:44:13.226 UTC [2305844][client backend][22/3784:0] LOG: statement: DROP TABLESPACE regress_tblspace_renamed;
# ^^^ One of the last records produced by `make check`

\set t0 '22:38:23.754'
\set t1 '22:44:13.226'
\set tf '22:41:26.647756'
select extract(epoch from (:'tf'::time - :'t0'::time)) / extract(epoch from (:'t1'::time - :'t0'::time));
~52%

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-18%2018%3A58%3A58
~48%
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-18%2016%3A41%3A13
~43%
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-18%2015%3A47%3A09
~36%
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-15%2011%3A24%3A38
~87%
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-03-17%2021%3A55%3A41
~36%

So it still looks like a performance-related issue to me. And yes, fsync = off -> on greatly increases (~3x) the overall test duration in my environment.

Best regards,
Alexander
Hi,

On 2024-03-14 16:56:39 -0400, Tom Lane wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > On Fri, Mar 15, 2024 at 7:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> >> Could it be that the timeout (360 sec?) is just not enough for the test
> >> under the current (changed due to switch to meson) conditions?
>
> > But you're right that under meson the test takes a lot longer, I guess
> > due to increased concurrency:
>
> What it seems to be is highly variable. Looking at calliphoridae's
> last half dozen successful runs, I see reported times for
> 027_stream_regress anywhere from 183 to 704 seconds. I wonder what
> else is happening on that machine.

There's a lot of other animals on the same machine, however it's rarely fully loaded (with either CPU or IO).

I don't think the test just being slow is the issue here; e.g. in the last failing iteration
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-20%2022%3A03%3A15
the tests completed

2024-03-20 22:07:50.239 UTC [3937667][client backend][22/3255:0] LOG: statement: DROP ROLE regress_tablespace_user2;
2024-03-20 22:07:50.251 UTC [3937667][client backend][:0] LOG: disconnection: session time: 0:00:12.431 user=bf database=regression host=[local]

and we waited to replicate for quite a while:

2024-03-20 22:14:01.904 UTC [56343][client backend][6/1925:0] LOG: connection authorized: user=bf database=postgres application_name=027_stream_regress.pl
2024-03-20 22:14:01.930 UTC [56343][client backend][6/1926:0] LOG: statement: SELECT '0/15BA21B0' <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name IN ('standby_1', 'walreceiver')
2024-03-20 22:14:01.958 UTC [56343][client backend][:0] LOG: disconnection: session time: 0:00:00.063 user=bf database=postgres host=[local]
2024-03-20 22:14:02.083 UTC [3729516][postmaster][:0] LOG: received immediate shutdown request
2024-03-20 22:14:04.970 UTC [3729516][postmaster][:0] LOG: database system is shut down

There was no activity for 7 minutes.

System statistics show relatively low CPU and IO load for the period from 22:00 - 22:10.

I suspect we have some more fundamental instability on our hands; there have been failures like this going back a while, and on various machines.

I think at the very least we should make Cluster.pm's wait_for_catchup() print some information when it times out - right now it's nigh on undebuggable, because we don't even log what we were waiting for and what the actual replication position was.

> Also, this is probably not
> helping anything:
>
> 'extra_config' => {
> ...
> 'fsync = on'

At some point we had practically no test coverage of fsync, so I made my animals use fsync. I think we still have little coverage. I probably could reduce the number of animals using it though.

Greetings,

Andres Freund
Hi,

On 2024-03-20 17:41:47 -0700, Andres Freund wrote:
> There's a lot of other animals on the same machine, however it's rarely fully
> loaded (with either CPU or IO).
>
> I don't think the test just being slow is the issue here, e.g. in the last
> failing iteration
>
> [...]
>
> I suspect we have some more fundamental instability on our hands, there have
> been failures like this going back a while, and on various machines.

I'm somewhat confused by the timestamps in the log:

[22:07:50.263](223.929s) ok 2 - regression tests pass
...
[22:14:02.051](371.788s) # poll_query_until timed out executing this query:

I read this as 371.788s having passed between the messages, which of course is much higher than PostgreSQL::Test::Utils::timeout_default=180.

Ah. The way that poll_query_until() implements timeouts seems decidedly suboptimal. If a psql invocation, including query processing, takes any appreciable amount of time, poll_query_until() waits much longer than it should, because it very naively determines a number of waits ahead of time:

my $max_attempts = 10 * $PostgreSQL::Test::Utils::timeout_default;
my $attempts = 0;
while ($attempts < $max_attempts)
{
	...
	# Wait 0.1 second before retrying.
	usleep(100_000);

	$attempts++;
}

Ick. What's worse is that if the query takes too long, the timeout AFAICT never takes effect.

Greetings,

Andres Freund
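A deadline-based variant of that loop, which is roughly what Andres is arguing for, could be sketched like this (illustrative only, not the code that was later committed):

use strict;
use warnings;
use Time::HiRes qw(time usleep);

# Sketch: retry $check until it succeeds, but bound the loop by elapsed
# wall-clock time rather than by a precomputed number of attempts, so a
# slow psql invocation cannot stretch the effective timeout.
sub poll_until_deadline
{
	my ($check, $timeout_secs) = @_;
	my $deadline = time() + $timeout_secs;

	while (time() < $deadline)
	{
		return 1 if $check->();
		usleep(100_000);    # wait 0.1 second before retrying
	}
	return 0;               # a genuine wall-clock timeout
}

A caller would pass something like sub { $node->safe_psql('postgres', $query) eq $expected } and $PostgreSQL::Test::Utils::timeout_default. The remaining hole Andres points out - a single psql invocation that itself hangs - would still need a statement_timeout or an external watchdog.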
Hi,

On 2024-03-20 17:41:45 -0700, Andres Freund wrote:
> 2024-03-20 22:14:01.904 UTC [56343][client backend][6/1925:0] LOG: connection authorized: user=bf database=postgres application_name=027_stream_regress.pl
> 2024-03-20 22:14:01.930 UTC [56343][client backend][6/1926:0] LOG: statement: SELECT '0/15BA21B0' <= replay_lsn AND state = 'streaming'
> FROM pg_catalog.pg_stat_replication
> WHERE application_name IN ('standby_1', 'walreceiver')
> 2024-03-20 22:14:01.958 UTC [56343][client backend][:0] LOG: disconnection: session time: 0:00:00.063 user=bf database=postgres host=[local]
> 2024-03-20 22:14:02.083 UTC [3729516][postmaster][:0] LOG: received immediate shutdown request
> 2024-03-20 22:14:04.970 UTC [3729516][postmaster][:0] LOG: database system is shut down
>
> There was no activity for 7 minutes.
>
> System statistics show relatively low CPU and IO load for the period from
> 22:00 - 22:10.
>
> I suspect we have some more fundamental instability on our hands, there have
> been failures like this going back a while, and on various machines.

I've reproduced something like this scenario locally, although I am not sure it is precisely what is happening on the buildfarm. At least here it looks like the problem is that apply is lagging substantially:

2024-03-20 22:43:11.024 PDT [1023505][walreceiver][:0][] DEBUG: sendtime 2024-03-20 22:43:11.024348-07 receipttime 2024-03-20 22:43:11.02437-07 replication apply delay 285322 ms transfer latency 1 ms

Which then means that we'll wait for a long time for apply to finish:

Waiting for replication conn standby_1's replay_lsn to pass 0/14385E20 on primary
[22:41:34.521](0.221s) # state before polling:
# pid | 1023508
# application_name | standby_1
# sent_lsn | 0/14385E20
# primary_wal_lsn | 0/14385E20
# standby_write_lsn | 0/14385E20
# primary_flush_lsn | 0/14385E20
# standby_flush_lsn | 0/14385E20
# standby_replay_lsn | 0/126D5C58
...
[22:43:16.376](0.161s) # running query, attempt 679/1800
[22:43:16.627](0.251s) # state now:
# pid | 1023508
# application_name | standby_1
# sent_lsn | 0/14778468
# primary_wal_lsn | 0/14778468
# standby_write_lsn | 0/14778468
# primary_flush_lsn | 0/14778468
# standby_flush_lsn | 0/14778468
# standby_replay_lsn | 0/14778468

I am not sure I have debugged why exactly, but it sure looks like one part is the startup process being busy unlinking files synchronously. This appears to be exacerbated by mdunlinkfork() first truncating and then separately unlinking the file - that looks to trigger a lot of filesystem journal flushes (on xfs).

We also spend a fair bit of time updating the control file, because we flush the WAL when replaying a transaction commit with a relation unlink. That also badly interacts with doing metadata operations...
Thirdly, we flush received WAL extremely granularly at times, which requires another fsync:

2024-03-20 23:30:21.469 PDT [1525084][walreceiver][:0][] LOG: flushed received WAL up to 0/13BB0000
2024-03-20 23:30:21.473 PDT [1525084][walreceiver][:0][] LOG: flushed received WAL up to 0/13BB0170
2024-03-20 23:30:21.479 PDT [1525084][walreceiver][:0][] LOG: flushed received WAL up to 0/13BB2528
2024-03-20 23:30:21.480 PDT [1525084][walreceiver][:0][] LOG: flushed received WAL up to 0/13BB58C8
2024-03-20 23:30:21.487 PDT [1525084][walreceiver][:0][] LOG: flushed received WAL up to 0/13BB7DA0
2024-03-20 23:30:21.490 PDT [1525084][walreceiver][:0][] LOG: flushed received WAL up to 0/13BB92B0
2024-03-20 23:30:21.494 PDT [1525084][walreceiver][:0][] LOG: flushed received WAL up to 0/13BBBAC0
2024-03-20 23:30:21.496 PDT [1525084][walreceiver][:0][] LOG: flushed received WAL up to 0/13BBCCC0
2024-03-20 23:30:21.499 PDT [1525084][walreceiver][:0][] LOG: flushed received WAL up to 0/13BBCE18

This all when we're quite far behind with apply...

Greetings,

Andres Freund
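To illustrate the unlink observation above, the pattern being described boils down to two separate filesystem metadata operations per relation segment, roughly as in this Perl rendition (the real code is C in src/backend/storage/smgr/md.c; this is only an analogy, and whether each step costs a journal flush depends on the filesystem and mount options):

use strict;
use warnings;

# Analogy for the two-step removal discussed above: first truncate the
# segment to zero length, then remove its directory entry.  On a
# journaling filesystem such as XFS each step is a separate metadata
# transaction, so doing both can mean two journal flushes per file.
sub remove_in_two_steps
{
	my ($path) = @_;
	truncate($path, 0) or die "truncate $path failed: $!";    # metadata op 1
	unlink($path)      or die "unlink $path failed: $!";      # metadata op 2
	return;
}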
Hi, On 2024-03-20 17:41:45 -0700, Andres Freund wrote: > On 2024-03-14 16:56:39 -0400, Tom Lane wrote: > > Also, this is probably not > > helping anything: > > > > 'extra_config' => { > > ... > > 'fsync = on' > > At some point we had practically no test coverage of fsync, so I made my > animals use fsync. I think we still have little coverage. I probably could > reduce the number of animals using it though. I think there must be some actual regression involved. The frequency of failures on HEAD vs failures on 16 - both of which run the tests concurrently via meson - is just vastly different. I'd expect the absolute number of failures in 027_stream_regress.pl to differ between branches due to fewer runs on 16, but there's no explanation for the difference in percentage of failures. My menagerie had only a single recoveryCheck failure on !HEAD in the last 30 days, but in the vicinity of 100 on HEAD https://buildfarm.postgresql.org/cgi-bin/show_failures.pl?max_days=30&stage=recoveryCheck&filter=Submit If anything the load when testing back branch changes is higher, because commonly back-branch builds are happening on all branches, so I don't think that can be the explanation either. From what I can tell the pattern changed on 2024-02-16 19:39:02 - there was a rash of recoveryCheck failures in the days before that too, but not 027_stream_regress.pl in that way. It certainly seems suspicious that one commit before the first observed failure is 2024-02-16 11:09:11 -0800 [73f0a132660] Pass correct count to WALRead(). Of course the failure rate is low enough that it could have been a day or two before that, too. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > I think there must be some actual regression involved. The frequency of > failures on HEAD vs failures on 16 - both of which run the tests concurrently > via meson - is just vastly different. Are you sure it's not just that the total time to run the core regression tests has grown to a bit more than what the test timeout allows for? regards, tom lane
Hi, On 2024-03-26 00:00:38 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > I think there must be some actual regression involved. The frequency of > > failures on HEAD vs failures on 16 - both of which run the tests concurrently > > via meson - is just vastly different. > > Are you sure it's not just that the total time to run the core > regression tests has grown to a bit more than what the test timeout > allows for? You're right, that could be it - in a way at least, the issue is replay not catching up within 180s, so it'd have to be the data volume growing, I think. But it doesn't look like the regression volume meaningfully grew around that time? I guess I'll try to write a buildfarm database query to extract how long that phase of the test took from all runs on my menagerie, not just the failing one, and see if there's a visible trend. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2024-03-26 00:00:38 -0400, Tom Lane wrote: >> Are you sure it's not just that the total time to run the core >> regression tests has grown to a bit more than what the test timeout >> allows for? > You're right, that could be it - in a way at least, the issue is replay not > catching up within 180s, so it'd have to be the data volume growing, I think. > But it doesn't look like the regression volume meaningfully grew around that > time? No, but my impression is that the failure rate has been getting slowly worse for awhile now. > I guess I'll try to write a buildfarm database query to extract how long that > phase of the test took from all runs on my menagerie, not just the failing > one, and see if there's a visible trend. +1 regards, tom lane
Hi,

On 2024-03-26 00:54:54 -0400, Tom Lane wrote:
> > I guess I'll try to write a buildfarm database query to extract how long that
> > phase of the test took from all runs on my menagerie, not just the failing
> > one, and see if there's a visible trend.
>
> +1

Only the query for successful runs has finished, and it looks like the error case is still going to take a while longer, so here are excerpts from a query showing how long 027_stream_regress.pl took to succeed:

    sysname    |    date    | count | avg_duration
---------------+------------+-------+--------------
 calliphoridae | 2024-01-29 | 10 | 435
 calliphoridae | 2024-02-05 | 25 | 496
 calliphoridae | 2024-02-12 | 36 | 522
 calliphoridae | 2024-02-19 | 25 | 445
 calliphoridae | 2024-02-26 | 35 | 516
 calliphoridae | 2024-03-04 | 53 | 507
 calliphoridae | 2024-03-11 | 51 | 532
 calliphoridae | 2024-03-18 | 53 | 548
 calliphoridae | 2024-03-25 | 13 | 518
 culicidae | 2024-01-29 | 11 | 420
 culicidae | 2024-02-05 | 31 | 485
 culicidae | 2024-02-12 | 35 | 513
 culicidae | 2024-02-19 | 29 | 489
 culicidae | 2024-02-26 | 36 | 512
 culicidae | 2024-03-04 | 63 | 541
 culicidae | 2024-03-11 | 62 | 597
 culicidae | 2024-03-18 | 56 | 603
 culicidae | 2024-03-25 | 16 | 550
 tamandua | 2024-01-29 | 13 | 420
 tamandua | 2024-02-05 | 29 | 433
 tamandua | 2024-02-12 | 34 | 431
 tamandua | 2024-02-19 | 27 | 382
 tamandua | 2024-02-26 | 36 | 492
 tamandua | 2024-03-04 | 60 | 475
 tamandua | 2024-03-11 | 56 | 533
 tamandua | 2024-03-18 | 54 | 527
 tamandua | 2024-03-25 | 21 | 507

Particularly on tamandua it does look like there has been an upwards trend.

Late, will try to look more in the next few days.

Greetings,

Andres Freund
Hello Andres,

26.03.2024 10:59, Andres Freund wrote:
> Late, will try to look more in the next few days.

AFAICS, the last 027_streaming_regress.pl failures on calliphoridae, culicidae, tamandua occurred before 2024-03-27:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-26%2004%3A07%3A30
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2024-03-22%2013%3A26%3A21
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-03-24%2007%3A44%3A27

So it looks like the issue is resolved, but there is another apparently performance-related issue: deadlock-parallel test failures. A recent one:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=canebrake&dt=2024-04-02%2022%3A20%3A22

test deadlock-parallel ... FAILED 345099 ms
+isolationtester: canceling step d2a1 after 300 seconds
step d2a1: <... completed>
- sum
------
-10000
-(1 row)
-
...

The server log shows:
2024-04-02 23:56:45.353 UTC [3583878][client backend][5/530:0] LOG: statement: SET force_parallel_mode = on; ... SELECT lock_share(3,x) FROM bigt LIMIT 1;
2024-04-02 23:56:45.364 UTC [3583876][client backend][3/2732:0] LOG: execute isolationtester_waiting: SELECT pg_catalog.pg_isolation_test_session_is_blocked($1, '{3583877,3583878,3583879,3583880}')
2024-04-02 23:56:45.364 UTC [3583876][client backend][3/2732:0] DETAIL: parameters: $1 = '3583878'
...
2024-04-02 23:57:28.967 UTC [3583876][client backend][3/5097:0] LOG: execute isolationtester_waiting: SELECT pg_catalog.pg_isolation_test_session_is_blocked($1, '{3583877,3583878,3583879,3583880}')
2024-04-02 23:57:28.967 UTC [3583876][client backend][3/5097:0] DETAIL: parameters: $1 = '3583877'
2024-04-02 23:57:29.016 UTC [3583877][client backend][4/530:0] LOG: statement: COMMIT;
2024-04-02 23:57:29.039 UTC [3583876][client backend][3/5098:0] LOG: execute isolationtester_waiting: SELECT pg_catalog.pg_isolation_test_session_is_blocked($1, '{3583877,3583878,3583879,3583880}')
2024-04-02 23:57:29.039 UTC [3583876][client backend][3/5098:0] DETAIL: parameters: $1 = '3583879'
...
2024-04-03 00:02:29.096 UTC [3583876][client backend][3/9472:0] LOG: execute isolationtester_waiting: SELECT pg_catalog.pg_isolation_test_session_is_blocked($1, '{3583877,3583878,3583879,3583880}')
2024-04-03 00:02:29.096 UTC [3583876][client backend][3/9472:0] DETAIL: parameters: $1 = '3583878'
2024-04-03 00:02:29.172 UTC [3905345][not initialized][:0] LOG: connection received: host=[local]
2024-04-03 00:02:29.240 UTC [3583878][client backend][5/530:0] ERROR: canceling statement due to user request

The last step duration is 00:02:29.096 - 23:57:29.039 ~ 300 seconds (the default max_step_wait for REL_15_STABLE and earlier; for REL_16_STABLE+ the default value was increased to 360 by c99c67fc4).

The average deadlock-parallel duration for REL_15_STABLE on canebrake is around 128 seconds (for the 140 runs I analyzed), but we can also find:
test deadlock-parallel ... ok 377895 ms
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=canebrake&dt=2024-03-27%2001%3A06%3A24&stg=isolation-check
test deadlock-parallel ... ok 302549 ms
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=canebrake&dt=2023-11-06%2012%3A47%3A01&stg=isolation-check
test deadlock-parallel ... ok 255045 ms
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=canebrake&dt=2023-11-09%2010%3A02%3A59&stg=isolation-check

A similar situation on phycodurus:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2024-02-11%2021:05:41
test deadlock-parallel ... FAILED 389381 ms

The average deadlock-parallel duration for REL_13_STABLE on phycodurus is around 78 seconds (for 138 recent runs), but there were also:
test deadlock-parallel ... ok 441736 ms
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=phycodurus&dt=2024-03-04%2015%3A19%3A04&stg=isolation-check
test deadlock-parallel ... ok 187844 ms
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=phycodurus&dt=2023-11-03%2016%3A13%3A46&stg=isolation-check

And also pogona, REL_14_STABLE:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=pogona&dt=2024-02-20%2003%3A50%3A49
test deadlock-parallel ... FAILED 425482 ms

(I could reach a similar duration on a slowed-down VM, with JIT enabled as on these animals.)

So, maybe these machines require a larger PGISOLATIONTIMEOUT, or there is still some OS/environment issue there?

Best regards,
Alexander
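If it really is just the per-step isolation timeout, one stopgap for these slow, JIT-forcing animals would be to raise PGISOLATIONTIMEOUT, e.g. through the buildfarm config's build_env; the value here is arbitrary and only meant as an illustration:

# Hypothetical build-farm.conf fragment for a very slow animal.
build_env => {
	# isolationtester's per-step wait limit, in seconds
	# (default 300 up to REL_15_STABLE, 360 on later branches).
	PGISOLATIONTIMEOUT => '900',
},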
Hi, On 2024-04-04 19:00:00 +0300, Alexander Lakhin wrote: > 26.03.2024 10:59, Andres Freund wrote: > > Late, will try to look more in the next few days. > > > > AFAICS, last 027_streaming_regress.pl failures on calliphoridae, > culicidae, tamandua occurred before 2024-03-27: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-26%2004%3A07%3A30 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2024-03-22%2013%3A26%3A21 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-03-24%2007%3A44%3A27 > > So it looks like the issue resolved, but there is another apparently > performance-related issue: deadlock-parallel test failures. I reduced test concurrency a bit. I hadn't quite realized how the buildfarm config and meson test concurrency interact. But there's still something off with the frequency of fsyncs during replay, but perhaps that doesn't qualify as a bug. > A recent one: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=canebrake&dt=2024-04-02%2022%3A20%3A22 > test deadlock-parallel ... FAILED 345099 ms > (I could reach similar duration on a slowed-down VM, with JIT enabled as > on these animals.) > > So, maybe these machines require larger PGISOLATIONTIMEOUT or there is > still some OS/environment issue there? Hm, possible. Forcing every query to be JITed, in a debug build of LLVM is absurdly expensive. Greetings, Andres Freund
Hello Andres,

>> So it looks like the issue resolved, but there is another apparently
>> performance-related issue: deadlock-parallel test failures.

> I reduced test concurrency a bit. I hadn't quite realized how the buildfarm
> config and meson test concurrency interact. But there's still something off
> with the frequency of fsyncs during replay, but perhaps that doesn't qualify
> as a bug.

It looks like that set of animals is still suffering from extreme load. Please take a look at today's failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-06-04%2002%3A44%3A19

1/1 postgresql:regress-running / regress-running/regress TIMEOUT 3000.06s killed by signal 15 SIGTERM

inst/logfile ends with:
2024-06-04 03:39:24.664 UTC [3905755][client backend][5/1787:16793] ERROR: column "c2" of relation "test_add_column" already exists
2024-06-04 03:39:24.664 UTC [3905755][client backend][5/1787:16793] STATEMENT: ALTER TABLE test_add_column ADD COLUMN c2 integer, -- fail because c2 already exists
ADD COLUMN c3 integer primary key;
2024-06-04 03:39:30.815 UTC [3905755][client backend][5/0:0] LOG: could not send data to client: Broken pipe
2024-06-04 03:39:30.816 UTC [3905755][client backend][5/0:0] FATAL: connection to client lost

"ALTER TABLE test_add_column" is from the alter_table test, which is executed in group 21 out of 25.

Another similar failure:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=skink&dt=2024-05-24%2002%3A22%3A26&stg=install-check-C

1/1 postgresql:regress-running / regress-running/regress TIMEOUT 3000.06s killed by signal 15 SIGTERM

inst/logfile ends with:
2024-05-24 03:18:51.469 UTC [998579][client backend][7/1792:16786] ERROR: could not change table "logged1" to unlogged because it references logged table "logged2"
2024-05-24 03:18:51.469 UTC [998579][client backend][7/1792:16786] STATEMENT: ALTER TABLE logged1 SET UNLOGGED;

(This is the alter_table test again.)

I've analyzed the duration of the regress-running/regress test for the recent 167 runs on skink and found that the average duration is 1595 seconds, but there were much longer test runs:
2979.39: https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=skink&dt=2024-05-01%2004%3A15%3A29&stg=install-check-C
2932.86: https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=skink&dt=2024-04-28%2018%3A57%3A37&stg=install-check-C
2881.78: https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=skink&dt=2024-05-15%2020%3A53%3A30&stg=install-check-C

So it seems that the default timeout is not large enough for these conditions. (I've counted 10 such timeout failures out of 167 test runs.)

Also, 027_stream_regress still fails due to the same reason:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-05-22%2021%3A55%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-05-22%2021%3A54%3A50
(It's remarkable that these two animals failed at the same time.)

Best regards,
Alexander
Hello Andrew,

04.06.2024 13:00, Alexander Lakhin wrote:
> Also, 027_stream_regress still fails due to the same reason:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-05-22%2021%3A55%3A03
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-05-22%2021%3A54%3A50
> (It's remarkable that these two animals failed at the same time.)

It looks like crake is failing now for other reasons (not just concurrency): it has produced 10+ failures of the 027_stream_regress test starting from July 9.

The first such failure on REL_16_STABLE was [1], and that was the first run with 'PG_TEST_EXTRA' => '... wal_consistency_checking'.

There is one known issue related to wal_consistency_checking [2], but I see no "incorrect resource manager data checksum" in the failure log...

Moreover, the first such failure on REL_17_STABLE was [3], but that run was performed without wal_consistency_checking, as far as I can see.

Could that failure also be related to the OS upgrade (I see that back in June crake was running on Fedora 39, but now it's running on Fedora 40)? So maybe we have two factors combined, or there is another one?

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2024-07-17%2014%3A56%3A09
[2] https://www.postgresql.org/message-id/055bb33c-bccc-bc1d-c2f8-8598534448ac@gmail.com
[3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2024-07-09%2021%3A37%3A04

Best regards,
Alexander
On 2024-07-25 Th 12:00 AM, Alexander Lakhin wrote: > Hello Andrew, > > 04.06.2024 13:00, Alexander Lakhin wrote: >> Also, 027_stream_regress still fails due to the same reason: >> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-05-22%2021%3A55%3A03 >> >> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-05-22%2021%3A54%3A50 >> >> (It's remarkable that these two animals failed at the same time.) >> > > It looks like crake is failing now due to other reasons (not just > concurrency) — it produced 10+ failures of the > 027_stream_regress test starting from July, 9. > > The first such failure on REL_16_STABLE was [1], and that was the > first run > with 'PG_TEST_EXTRA' => '... wal_consistency_checking'. > > There is one known issue related to wal_consistency_checking [2], but I > see no "incorrect resource manager data checksum" in the failure log... > > Moreover, the first such failure on REL_17_STABLE was [3], but that run > was performed without wal_consistency_checking, as far as I can see. > > Can that failure be also related to the OS upgrade (I see that back in > June crake was running on Fedora 39, but now it's running on Fedora 40)? > > So maybe we have two factors combined or there is another one? > > [1] > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2024-07-17%2014%3A56%3A09 > [2] > https://www.postgresql.org/message-id/055bb33c-bccc-bc1d-c2f8-8598534448ac@gmail.com > [3] > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2024-07-09%2021%3A37%3A04 Unlikely. The change in OS version was on June 17, more than a month ago. But yes we do seem to have seen a lot of recovery_check failures on crake in the last 8 days, which is roughly when I changed PG_TEST_EXTRA to get more coverage. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > But yes we do seem to have seen a lot of recovery_check failures on > crake in the last 8 days, which is roughly when I changed PG_TEST_EXTRA > to get more coverage. I'm confused by crake's buildfarm logs. AFAICS it is not running recovery-check at all in most of the runs; at least there is no mention of that step, for example here: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2024-07-25%2013%3A27%3A02 It seems implausible that it would only run the test occasionally, so what I suspect is a bug in the buildfarm client causing it to omit that step's log if successful. regards, tom lane
I wrote:
> I'm confused by crake's buildfarm logs. AFAICS it is not running
> recovery-check at all in most of the runs; at least there is no
> mention of that step, for example here:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2024-07-25%2013%3A27%3A02

Oh, I see it: the log file that is called recovery-check in a failing run is called misc-check if successful. That seems mighty bizarre, and it's not how my own animals behave. Something weird about the meson code path, perhaps?

Anyway, in this successful run:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2024-07-25%2018%3A57%3A02&stg=misc-check
here are some salient test timings:

1/297 postgresql:pg_upgrade / pg_upgrade/001_basic OK 0.18s 9 subtests passed
2/297 postgresql:pg_upgrade / pg_upgrade/003_logical_slots OK 15.95s 12 subtests passed
3/297 postgresql:pg_upgrade / pg_upgrade/004_subscription OK 16.29s 14 subtests passed
17/297 postgresql:isolation / isolation/isolation OK 71.60s 119 subtests passed
41/297 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade OK 169.13s 18 subtests passed
140/297 postgresql:initdb / initdb/001_initdb OK 41.34s 52 subtests passed
170/297 postgresql:recovery / recovery/027_stream_regress OK 469.49s 9 subtests passed

while in the next, failing run
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2024-07-25%2020%3A18%3A05&stg=recovery-check
the same tests took:

1/297 postgresql:pg_upgrade / pg_upgrade/001_basic OK 0.22s 9 subtests passed
2/297 postgresql:pg_upgrade / pg_upgrade/003_logical_slots OK 56.62s 12 subtests passed
3/297 postgresql:pg_upgrade / pg_upgrade/004_subscription OK 71.92s 14 subtests passed
21/297 postgresql:isolation / isolation/isolation OK 299.12s 119 subtests passed
31/297 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade OK 344.42s 18 subtests passed
159/297 postgresql:initdb / initdb/001_initdb OK 344.46s 52 subtests passed
162/297 postgresql:recovery / recovery/027_stream_regress ERROR 840.84s exit status 29

Based on this, it seems fairly likely that crake is simply timing out as a consequence of intermittent heavy background activity.

regards, tom lane
On Fri, Jul 26, 2024 at 9:14 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Based on this, it seems fairly likely that crake is simply timing out > as a consequence of intermittent heavy background activity. Would it be better to keep going as long as progress is being made? I.e. time out only when the relevant LSN stops advancing for N seconds? Or something like that...
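A rough sketch of that "only time out when progress stops" idea, assuming a PostgreSQL::Test::Cluster handle for the primary; the 60-second stall window and the helper name are made up for illustration:

use strict;
use warnings;
use Time::HiRes qw(time usleep);

# Sketch: wait for the standby's replay_lsn to reach $target_lsn, but
# only give up once the reported LSN has stopped moving for $stall_secs.
sub wait_for_replay_with_stall_detection
{
	my ($primary, $standby_name, $target_lsn, $stall_secs) = @_;
	$stall_secs //= 60;

	my $last_lsn      = '';
	my $last_progress = time();

	while (1)
	{
		my $lsn = $primary->safe_psql('postgres',
			qq[SELECT replay_lsn FROM pg_catalog.pg_stat_replication
			   WHERE application_name = '$standby_name']);

		# Done once the standby has replayed past the target.
		return 1
		  if $lsn ne ''
		  && $primary->safe_psql('postgres',
			  qq[SELECT '$target_lsn'::pg_lsn <= '$lsn'::pg_lsn]) eq 't';

		# Reset the stall clock whenever the reported LSN advances at all.
		if ($lsn ne $last_lsn)
		{
			$last_lsn      = $lsn;
			$last_progress = time();
		}
		return 0 if time() - $last_progress > $stall_secs;
		usleep(100_000);
	}
}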
On 2024-07-25 Th 5:14 PM, Tom Lane wrote:

I wrote: I'm confused by crake's buildfarm logs. AFAICS it is not running recovery-check at all in most of the runs; at least there is no mention of that step, for example here: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2024-07-25%2013%3A27%3A02

Oh, I see it: the log file that is called recovery-check in a failing run is called misc-check if successful. That seems mighty bizarre, and it's not how my own animals behave. Something weird about the meson code path, perhaps?
Yes, it was discussed some time ago. As suggested by Andres, we run the meson test suite more or less all together (in effect like "make checkworld" but without the main regression suite, which is run on its own first), rather than in the separate (and serialized) way we do with the configure/make tests. That results in significant speedup. If the tests fail we report the failure as happening at the first failure we encounter, which is possibly less than ideal, but I haven't got a better idea.
Anyway, in this successful run: https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2024-07-25%2018%3A57%3A02&stg=misc-check here are some salient test timings: 1/297 postgresql:pg_upgrade / pg_upgrade/001_basic OK 0.18s 9 subtests passed 2/297 postgresql:pg_upgrade / pg_upgrade/003_logical_slots OK 15.95s 12 subtests passed 3/297 postgresql:pg_upgrade / pg_upgrade/004_subscription OK 16.29s 14 subtests passed 17/297 postgresql:isolation / isolation/isolation OK 71.60s 119 subtests passed 41/297 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade OK 169.13s 18 subtests passed 140/297 postgresql:initdb / initdb/001_initdb OK 41.34s 52 subtests passed 170/297 postgresql:recovery / recovery/027_stream_regress OK 469.49s 9 subtests passed while in the next, failing run https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2024-07-25%2020%3A18%3A05&stg=recovery-check the same tests took: 1/297 postgresql:pg_upgrade / pg_upgrade/001_basic OK 0.22s 9 subtests passed 2/297 postgresql:pg_upgrade / pg_upgrade/003_logical_slots OK 56.62s 12 subtests passed 3/297 postgresql:pg_upgrade / pg_upgrade/004_subscription OK 71.92s 14 subtests passed 21/297 postgresql:isolation / isolation/isolation OK 299.12s 119 subtests passed 31/297 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade OK 344.42s 18 subtests passed 159/297 postgresql:initdb / initdb/001_initdb OK 344.46s 52 subtests passed 162/297 postgresql:recovery / recovery/027_stream_regress ERROR 840.84s exit status 29 Based on this, it seems fairly likely that crake is simply timing out as a consequence of intermittent heavy background activity.
The latest failure is this:
Waiting for replication conn standby_1's replay_lsn to pass 2/88E4E260 on primary
[16:40:29.481](208.545s) # poll_query_until timed out executing this query:
# SELECT '2/88E4E260' <= replay_lsn AND state = 'streaming'
# FROM pg_catalog.pg_stat_replication
# WHERE application_name IN ('standby_1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
timed out waiting for catchup at /home/andrew/bf/root/HEAD/pgsql/src/test/recovery/t/027_stream_regress.pl line 103.

Maybe it's a case where the system is overloaded, I dunno. I wouldn't bet my house on it. Pretty much nothing else runs on this machine.

I have added a mild throttle to the buildfarm config so it doesn't try to run every branch at once. Maybe I also need to bring down the number of meson jobs too? But I suspect there's something deeper. Prior to the failure of this test 10 days ago it hadn't failed in a very long time. The OS was upgraded a month ago. Eight or so days ago I changed PG_TEST_EXTRA. I can't think of anything else.
There seem to be a bunch of recent failures, and not just on crake, and not just on HEAD: <https://buildfarm.postgresql.org/cgi-bin/show_failures.pl?max_days=14&member=&stage=recoveryCheck&filter=Submit>
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > There seem to be a bunch of recent failures, and not just on crake, and > not just on HEAD: > <https://buildfarm.postgresql.org/cgi-bin/show_failures.pl?max_days=14&member=&stage=recoveryCheck&filter=Submit> There were a batch of recovery-stage failures ending about 9 days ago caused by instability of the new 043_vacuum_horizon_floor.pl test. Once you take those out it doesn't look quite so bad. regards, tom lane
On 2024-07-31 We 12:05 PM, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> There seem to be a bunch of recent failures, and not just on crake, and
>> not just on HEAD:
>> <https://buildfarm.postgresql.org/cgi-bin/show_failures.pl?max_days=14&member=&stage=recoveryCheck&filter=Submit>
> There were a batch of recovery-stage failures ending about 9 days ago
> caused by instability of the new 043_vacuum_horizon_floor.pl test.
> Once you take those out it doesn't look quite so bad.

We'll see. I have switched crake from --run-parallel mode to --run-all mode, i.e. the runs are serialized. Maybe that will be enough to stop the errors. I'm still annoyed that this test is susceptible to load, if that is indeed the issue.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes:
> We'll see. I have switched crake from --run-parallel mode to --run-all
> mode i.e. the runs are serialized. Maybe that will be enough to stop the
> errors. I'm still annoyed that this test is susceptible to load, if that
> is indeed what is the issue.

crake is still timing out intermittently on 027_streaming_regress.pl, so that wasn't it.

I think we need more data. We know that the wait_for_catchup query is never getting to true:

SELECT '$target_lsn' <= ${mode}_lsn AND state = 'streaming'

but we don't know if the LSN condition or the state condition is what is failing. And if it is the LSN condition, it'd be good to see the actual last LSN, so we can look for patterns like whether there is a page boundary crossing involved. So I suggest adding something like the attached.

If we do this, I'd be inclined to instrument wait_for_slot_catchup and wait_for_subscription_sync similarly, but I thought I'd check for contrary opinions first.

regards, tom lane

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 32ee98aebc..3403626f92 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2948,6 +2948,13 @@ sub wait_for_catchup
 	}
 	else
 	{
+		# Fetch additional detail for debugging purposes
+		$query = qq[SELECT application_name, ${mode}_lsn, state
+			FROM pg_catalog.pg_stat_replication
+			WHERE application_name IN ('$standby_name', 'walreceiver')];
+		my $details = $self->safe_psql('postgres', $query);
+		diag qq(Last application_name|${mode}_lsn|state:
+${details});
 		croak "timed out waiting for catchup";
 	}
 }
Andrew Dunstan <andrew@dunslane.net> writes:

We'll see. I have switched crake from --run-parallel mode to --run-all mode i.e. the runs are serialized. Maybe that will be enough to stop the errors. I'm still annoyed that this test is susceptible to load, if that is indeed what is the issue.

crake is still timing out intermittently on 027_streaming_regress.pl, so that wasn't it. I think we need more data. We know that the wait_for_catchup query is never getting to true:

SELECT '$target_lsn' <= ${mode}_lsn AND state = 'streaming'

but we don't know if the LSN condition or the state condition is what is failing. And if it is the LSN condition, it'd be good to see the actual last LSN, so we can look for patterns like whether there is a page boundary crossing involved. So I suggest adding something like the attached. If we do this, I'd be inclined to instrument wait_for_slot_catchup and wait_for_subscription_sync similarly, but I thought I'd check for contrary opinions first.
Seems reasonable.
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > On 2024-08-11 Su 8:32 PM, Tom Lane wrote: >> I think we need more data. We know that the >> wait_for_catchup query is never getting to true: >> >> SELECT '$target_lsn' <= ${mode}_lsn AND state = 'streaming' >> >> but we don't know if the LSN condition or the state condition is >> what is failing. And if it is the LSN condition, it'd be good >> to see the actual last LSN, so we can look for patterns like >> whether there is a page boundary crossing involved. So I suggest >> adding something like the attached. > Seems reasonable. Pushed. In the event I made it just "SELECT * FROM" the relevant view: there will be few if any rows that aren't potentially interesting, and filtering the columns doesn't seem like a forward-looking idea either. regards, tom lane
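Presumably the committed version of that diagnostic boils down to something like the following inside wait_for_catchup()'s timeout branch (a paraphrase based on Tom's description, not the exact committed hunk; Carp and Test::More are already in scope in Cluster.pm):

# On timeout, dump the whole view rather than a hand-picked column list.
my $details = $self->safe_psql('postgres',
	'SELECT * FROM pg_catalog.pg_stat_replication');
diag "pg_stat_replication contents:\n$details";
croak "timed out waiting for catchup";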