Thread: recoveryCheck/008_fsm_truncation is failing on dodo in v14- (due to slow fsync?)
recoveryCheck/008_fsm_truncation is failing on dodo in v14- (due to slow fsync?)
From
Alexander Lakhin
Date:
Hello hackers,

This week dodo started failing on the 008_fsm_truncation test sporadically due to pg_ctl timeout. For example, [1] (REL_14_STABLE):

### Starting node "standby"
# Running: pg_ctl -D /media/pi/250gb/proj/bf2/v17/buildroot/REL_14_STABLE/pgsql.build/src/test/recovery/tmp_check/t_008_fsm_truncation_standby_data/pgdata -l /media/pi/250gb/proj/bf2/v17/buildroot/REL_14_STABLE/pgsql.build/src/test/recovery/tmp_check/log/008_fsm_truncation_standby.log -o --cluster-name=standby start
waiting for server to start........................................................................................................................... stopped waiting
pg_ctl: server did not start in time
# pg_ctl start failed; logfile:
2024-06-19 21:27:30.843 ACST [13244:1] LOG:  starting PostgreSQL 14.12 on armv7l-unknown-linux-gnueabihf, compiled by gcc (GCC) 15.0.0 20240617 (experimental), 32-bit
2024-06-19 21:27:31.768 ACST [13244:2] LOG:  listening on Unix socket "/media/pi/250gb/proj/bf2/v17/buildroot/tmp/vLgcHgvc7O/.s.PGSQL.50013"
2024-06-19 21:27:35.055 ACST [13246:1] LOG:  database system was interrupted; last known up at 2024-06-19 21:26:55 ACST

A successful run between two failures [2]:

2024-06-21 05:17:43.102 ACST [18033:1] LOG:  database system was interrupted; last known up at 2024-06-21 05:17:31 ACST
2024-06-21 05:18:06.718 ACST [18033:2] LOG:  entering standby mode

(That is, that start took around 20 seconds.)

We can also find a longer "successful" start at [3], in 008_fsm_truncation_standby.log:

2024-06-19 23:11:13.854 ACST [26202:1] LOG:  database system was interrupted; last known up at 2024-06-19 23:11:02 ACST
2024-06-19 23:12:07.115 ACST [26202:2] LOG:  entering standby mode

(That start took almost a minute.) So it doesn't seem impossible for this operation to last for more than two minutes.
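The ~20-second and near-one-minute gaps above can be read straight off the standby log. A rough sketch of computing that window, assuming GNU date and same-day timestamps (the log path is illustrative):

```shell
#!/bin/sh
# Compute the startup window between the two standard log lines
# ("database system was interrupted" and "entering standby mode").
# Assumes GNU date and that both lines fall on the same day.
LOG="${1:-008_fsm_truncation_standby.log}"
t1=$(grep -m1 'database system was interrupted' "$LOG" | awk '{print $2}' | cut -d. -f1)
t2=$(grep -m1 'entering standby mode' "$LOG" | awk '{print $2}' | cut -d. -f1)
echo "fsync window: $t1 -> $t2 ($(( $(date -d "$t2" +%s) - $(date -d "$t1" +%s) )) s)"
```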
The facts that SyncDataDirectory() is executed between these two logged messages, that 008_fsm_truncation is the only test which turns fsync on, and that we see no such failures in newer branches (because of a7f417107) make me suspect that dodo is slow on fsync.

I managed to reproduce similar fsync degradation (and reached a 40-second duration for this start operation) on a slow armv7 device with an SD card, which gets significantly slower after many test runs unless fstrim is executed periodically. So maybe fstrim could help dodo too.

On the other hand, backporting a7f417107 could fix the issue as well, but I'm afraid we'll still see other tests (027_stream_regress) failing like [4]. When similar failures were observed on Andres Freund's animals, Andres came to the conclusion that they were caused by fsync too ([5]).

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dodo&dt=2024-06-19%2010%3A15%3A08
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dodo&dt=2024-06-20%2018%3A30%3A53
[3] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=dodo&dt=2024-06-19%2012%3A30%3A51&stg=recovery-check
[4] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dodo&dt=2024-06-21%2018%3A31%3A11
[5] https://www.postgresql.org/message-id/20240321063953.oyfojyq3wbc77xxb%40awork3.anarazel.de

Best regards,
Alexander
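Since the degradation pattern matches flash storage that is never trimmed, a periodic fstrim pass may be the simplest mitigation. A minimal sketch, assuming a GNU/Linux host and a writable test directory (paths are illustrative); the dd probe mirrors the one used elsewhere in this thread:

```shell
#!/bin/sh
# Probe per-write-durable throughput with O_DSYNC; single-digit MB/s
# on flash storage hints that the device could benefit from a trim pass.
DIR="${1:-.}"
dd if=/dev/zero of="$DIR/fsync_probe" bs=128k count=64 oflag=dsync 2>&1 | tail -n1
rm -f "$DIR/fsync_probe"
# Discard unused blocks (needs root); worth scheduling, e.g. weekly:
#   fstrim -v /media/pi/250gb
```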
Re: recoveryCheck/008_fsm_truncation is failing on dodo in v14- (due to slow fsync?)
From
Robins Tharakan
Date:
On Sat, 22 Jun 2024 at 18:30, Alexander Lakhin <exclusion@gmail.com> wrote:
> So it doesn't seem impossible for this operation to last for more than two
> minutes.
>
> The facts that SyncDataDirectory() is executed between these two messages
> logged, 008_fsm_truncation is the only test which turns fsync on, and we
> see no such failures in newer branches (because of a7f417107), make me
> suspect that dodo is slow on fsync.
Not sure if it helps, but I can confirm that dodo is used for multiple tasks and that
it is using a (slow) external USB3 disk. Also, while using dodo last week (for
something unrelated), I saw iotop report ~30 MB/s of I/O with the 1-minute load
average around 7. Right now (while dodo is idle), dd shows ~30 MB/s is pretty much the maximum:
pi@pi4:/media/pi/250gb $ dd if=/dev/zero of=./test count=1024 oflag=direct bs=128k
1024+0 records in
1024+0 records out
134217728 bytes (134 MB, 128 MiB) copied, 4.51225 s, 29.7 MB/s
pi@pi4:/media/pi/250gb $ dd if=/dev/zero of=./test count=1024 oflag=dsync bs=128k
1024+0 records in
1024+0 records out
134217728 bytes (134 MB, 128 MiB) copied, 24.4916 s, 5.5 MB/s
-
robins
Re: recoveryCheck/008_fsm_truncation is failing on dodo in v14- (due to slow fsync?)
From
Alexander Lakhin
Date:
22.06.2024 12:00, Alexander Lakhin wrote:
> On the other hand, backporting a7f417107 could fix the issue too, but I'm
> afraid we'll still see other tests (027_stream_regress) failing like [4].
> When similar failures were observed on Andres Freund's animals, Andres
> came to conclusion that they were caused by fsync too ([5]).

It seems to me that another dodo failure [1] has the same cause:

t/001_emergency_vacuum.pl .. ok
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 2.
t/002_limits.pl ............ Dubious, test returned 29 (wstat 7424, 0x1d00)
All 2 subtests passed
t/003_wraparounds.pl ....... ok

Test Summary Report
-------------------
t/002_limits.pl (Wstat: 7424 Tests: 2 Failed: 0)
  Non-zero exit status: 29
  Parse errors: No plan found in TAP output
Files=3, Tests=10, 4235 wallclock secs ( 0.10 usr  0.13 sys + 18.05 cusr 12.76 csys = 31.04 CPU)
Result: FAIL

Unfortunately, the buildfarm log doesn't contain regress_log_002_limits, but I managed to reproduce the failure on my device, when its storage is as slow as:

$ dd if=/dev/zero of=./test count=1024 oflag=dsync bs=128k
1024+0 records in
1024+0 records out
134217728 bytes (134 MB, 128 MiB) copied, 33.9446 s, 4.0 MB/s

The test fails as below:

[15:36:04.253](729.754s) ok 1 - warn-limit reached
#### Begin standard error
psql:<stdin>:1: WARNING:  database "postgres" must be vacuumed within 37483631 transactions
HINT:  To avoid XID assignment failures, execute a database-wide VACUUM in that database.
You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
#### End standard error
[15:36:45.220](40.968s) ok 2 - stop-limit
[15:36:45.222](0.002s) # issuing query via background psql: COMMIT
IPC::Run: timeout on timer #1 at /usr/share/perl5/IPC/Run.pm line 2944.
It looks like this bump (coined at [2]) is not enough for machines that are that slow:

# Bump the query timeout to avoid false negatives on slow test systems.
my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dodo&dt=2024-06-20%2007%3A18%3A46
[2] https://www.postgresql.org/message-id/CAD21AoBKBVkXyEwkApSUqN98CuOWw%3DYQdbkeE6gGJ0zH7z-TBw%40mail.gmail.com

Best regards,
Alexander
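For machines this slow, there is also a knob that needs no source change: PostgreSQL's TAP infrastructure derives timeout_default from the PG_TEST_TIMEOUT_DEFAULT environment variable (180 s if unset), so raising it scales the derived psql timeout too. A hedged sketch (the value and the make target in the comment are illustrative):

```shell
#!/bin/sh
# Raise the base TAP timeout; the Perl quoted above then yields 4x this value.
export PG_TEST_TIMEOUT_DEFAULT=900
psql_timeout_secs=$((4 * PG_TEST_TIMEOUT_DEFAULT))
echo "derived psql timeout: $psql_timeout_secs s"
# e.g.: make -C src/test/modules/xid_wraparound check
```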
Re: recoveryCheck/008_fsm_truncation is failing on dodo in v14- (due to slow fsync?)
From
Robins Tharakan
Date:
On Sun, 23 Jun 2024 at 22:30, Alexander Lakhin <exclusion@gmail.com> wrote:
> Unfortunately, the buildfarm log doesn't contain regress_log_002_limits,
> but I managed to reproduce the failure on my device, when its storage is
> as slow as:
>
> $ dd if=/dev/zero of=./test count=1024 oflag=dsync bs=128k
> 1024+0 records in
> 1024+0 records out
> 134217728 bytes (134 MB, 128 MiB) copied, 33.9446 s, 4.0 MB/s
Over the past ~1 week, I tried to space out all other tasks on the machine, so as to ensure
that the 1-minute load average stays mostly below 2 (and thus not many things are hammering
the disk), and with that I have seen 0 failures these past few days. This isn't conclusive
by any means, but it does seem that reducing I/O contention has helped remove the errors,
matching what Alexander suspects and reproduces here.
Just a note that I've now reverted some of those recent changes, and so if the theory
holds true, I wouldn't be surprised if some of these errors reappeared on dodo.
-
robins
Re: recoveryCheck/008_fsm_truncation is failing on dodo in v14- (due to slow fsync?)
From
Alexander Lakhin
Date:
Hello Robins,
28.06.2024 13:20, Robins Tharakan wrote:
> The past ~1 week, I tried to space out all other tasks on the machine, so as
> to ensure that 1-min CPU is mostly <2 (and thus not many things hammering
> the disk) and with that I see 0 failures these past few days. This isn't
> conclusive by any means, but it does seem that reducing IO contention has
> helped remove the errors, like what Alexander suspects / repros here.
> Just a note, that I've reverted some of those recent changes now, and so if
> the theory holds true, I wouldn't be surprised if some of these errors
> restarted on dodo.
Looking back at the test failures ([1]-[6]), I can see that the errors really reappeared
just after your revert (at 2024-06-28), so the theory proved true,
but I have seen none of them since 2024-07-02. Does it mean that you changed
something on dodo / fixed that performance issue?
Could you please describe how you resolved this issue, just for the record?
[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dodo&dt=2024-06-28%2017%3A00%3A28
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dodo&dt=2024-06-28%2017%3A10%3A12
[3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dodo&dt=2024-07-01%2012%3A10%3A12
[4] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dodo&dt=2024-07-01%2013%3A01%3A00
[5] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dodo&dt=2024-07-02%2005%3A00%3A36
[6] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dodo&dt=2024-07-02%2018%3A00%3A15
Best regards,
Alexander