Re: BUG #16161: pg_ctl stop fails sometimes (on Windows) - Mailing list pgsql-bugs
From | Kyotaro Horiguchi |
---|---|
Subject | Re: BUG #16161: pg_ctl stop fails sometimes (on Windows) |
Date | |
Msg-id | 20191218.122526.233778911708036076.horikyota.ntt@gmail.com Whole thread Raw |
In response to | Re: BUG #16161: pg_ctl stop fails sometimes (on Windows) (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: BUG #16161: pg_ctl stop fails sometimes (on Windows)
|
List | pgsql-bugs |
At Tue, 17 Dec 2019 18:50:26 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in > Alexander Lakhin <exclusion@gmail.com> writes: > > 16.12.2019 23:17, Tom Lane wrote: > >> I've pushed this with adjustment of the other loop and some fooling > >> with the comment --- I still don't see a need to cite stackoverflow > >> as an authority. > > > Thank you! I hope now Windows machines will concede the first place of > > the race for failures. > > Seems like we're not there yet :-(. Buildfarm members jacana and > fairywren have been failing the recovery tests, in v10 and up, > since this went in. It looks like we're getting a timeout on > this step in 010_logical_decoding_timelines.pl: > > # Should be able to read from slot created before base backup > ($ret, $stdout, $stderr) = $node_replica->psql( > 'postgres', > "SELECT data FROM pg_logical_slot_peek_changes('before_basebackup', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts','1');", > timeout => 30); > > Now, it's mighty suspicious that this has a timeout of 30 sec > which is exactly how long we made pgwin32_open retry for --- > but how is it that this test is causing an ERROR_ACCESS_DENIED > failure? And if it was, how did we get past that before? > > It does look suspiciously like this wait is triggering on these > machines and somehow getting masked in most other test cases. > If you compare the per-step runtimes in jacana's last successful > run, > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2019-12-16%2013%3A01%3A26 > > with those after this patch went in, > > https://buildfarm.postgresql.org/cgi-bin/shocw_log.pl?nm=jacana&dt=2019-12-17%2004%3A16%3A40 > > there's clearly something very wrong. fairywren likewise. The immediate cause of the failure seems to be a slowdown of startup or recovery. It took only 8 minutes in the successful case but 25 minutes even though stopped on the way in the failure case. In the failure case, it took 30 seconds to move only 26.1MB from 15E8C48 to 30013A0. just around 0.3 seconds for a move of the same size (15E9C48->3000DE8 (26.1MB)) in the successful case (*1). In the sucessful case, the time took from archiving 00000002.history to removing old segments is 0.15 s, but about 3.5 s in the failure case (*2). Seemingly the immediate cause of the timeout looks like just an extreme slowdown of file read, but it took 2 seconds to move within a page from 3000C18 to 3000C50. So I suspect that 6d7547c219 might slow down readTimeLineHistory() significantly in the case. regards. *1: jacana - WAL reading Success 2019-12-16 09:01:01.805 EST [5df78e1d.b74:4] 010_logical_decoding_timelines.pl LOG: starting logical decoding for slot "before_basebackup" 2019-12-16 09:01:01.805 EST [5df78e1d.b74:5] 010_logical_decoding_timelines.pl DETAIL: Streaming transactions committingafter 0/15CC960, reading WAL from 0/15CC928. ... 2019-12-16 09:01:01.836 EST [5df78e1d.b74:34] 010_logical_decoding_timelines.pl DEBUG: failed to increase restart lsn: proposed0/3000DE8, after 0/3000DE8, current candidate 0/15E9C48, current after 0/15E9C48, flushed up to 0/15CC960 Fail 2019-12-17 01:35:02.235 EST [5df87716.2ddc:4] 010_logical_decoding_timelines.pl LOG: starting logical decoding for slot"before_basebackup" 2019-12-17 01:35:02.235 EST [5df87716.2ddc:5] 010_logical_decoding_timelines.pl DETAIL: Streaming transactions committingafter 0/15CC960, reading WAL from 0/15CC928. ... 2019-12-17 01:35:24.195 EST [5df87716.2ddc:33] 010_logical_decoding_timelines.pl DEBUG: failed to increase restart lsn:proposed 0/3000C18, after 0/3000C18, current candidate 0/15E9C48, current after 0/15E9C48, flushed up to 0/15CC960 2019-12-17 01:35:24.195 EST [5df87716.2ddc:34] 010_logical_decoding_timelines.pl DEBUG: serializing snapshot to pg_logical/snapshots/0-3000C50.snap 2019-12-17 01:35:26.379 EST [5df87716.2ddc:35] 010_logical_decoding_timelines.pl DEBUG: failed to increase restart lsn:proposed 0/3000C50, after 0/3000C50, current candidate 0/15E9C48, current after 0/15E9C48, flushed up to 0/15CC960 2019-12-17 01:35:26.379 EST [5df87716.2ddc:36] 010_logical_decoding_timelines.pl DEBUG: serializing snapshot to pg_logical/snapshots/0-3000D68.snap 2019-12-17 01:35:28.575 EST [5df87716.2ddc:37] 010_logical_decoding_timelines.pl DEBUG: failed to increase restart lsn:proposed 0/3000D68, after 0/3000D68, current candidate 0/15E9C48, current after 0/15E9C48, flushed up to 0/15CC960 2019-12-17 01:35:28.575 EST [5df87716.2ddc:38] 010_logical_decoding_timelines.pl DEBUG: serializing snapshot to pg_logical/snapshots/0-30012F0.snap 2019-12-17 01:35:30.775 EST [5df87716.2ddc:39] 010_logical_decoding_timelines.pl DEBUG: failed to increase restart lsn:proposed 0/30012F0, after 0/30012F0, current candidate 0/15E9C48, current after 0/15E9C48, flushed up to 0/15CC960 2019-12-17 01:35:30.775 EST [5df87716.2ddc:40] 010_logical_decoding_timelines.pl DEBUG: serializing snapshot to pg_logical/snapshots/0-30013A0.snap ==== *2: jacana - checkpoint Success 2019-12-16 09:01:01.399 EST [5df78e1d.170c:1] DEBUG: archived write-ahead log file "00000002.history" 2019-12-16 09:01:01.430 EST [5df78e1d.170c:2] DEBUG: archived write-ahead log file "000000010000000000000003.partial" 2019-12-16 09:01:01.445 EST [5df78e1b.2f60:4] DEBUG: attempting to remove WAL segments older than log file 000000000000000000000000 2019 Fail 2019-12-17 01:35:01.950 EST [5df87715.2e0c:1] DEBUG: archived write-ahead log file "00000002.history" 2019-12-17 01:35:03.080 EST [5df87715.2e0c:2] DEBUG: archived write-ahead log file "000000010000000000000003.partial" 2019-12-17 01:35:05.389 EST [5df876ed.3370:4] DEBUG: attempting to remove WAL segments older than log file 000000000000000000000000 -- Kyotaro Horiguchi NTT Open Source Software Center
pgsql-bugs by date: