Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman - Mailing list pgsql-hackers
From | Thomas Munro |
---|---|
Subject | Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman |
Date | |
Msg-id | CA+hUKGJ-gvLg9vJgMvu62XW-Nsb2Hf40DvqmKdusVeJ7KV-crA@mail.gmail.com Whole thread Raw |
In response to | Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman
Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman |
List | pgsql-hackers |
On Sat, Apr 9, 2022 at 12:59 PM Andres Freund <andres@anarazel.de> wrote: > On 2022-04-08 17:55:51 -0400, Tom Lane wrote: > > I tried adjusting the patch so it does guarantee that (as attached), > > and in two out of two tries it got past the archive_cleanup_command > > failure but then hung up waiting for standby2 to promote. > > Adding > > $node_standby->safe_psql('postgres', "SELECT pg_switch_wal()"); > just after > $node_standby2->start; > > makes the tests pass here. Sorry for the delay... I got a bit confused about the different things going on in this thread but I hope I've got it now: 1. This test had some pre-existing bugs/races, which hadn't failed before due to scheduling, even under Valgrind. The above changes appear to fix those problems. To Michael for comment. > What is that second test really testing? > > # Check the presence of temporary files specifically generated during > # archive recovery. To ensure the presence of the temporary history > # file, switch to a timeline large enough to allow a standby to recover > # a history file from an archive. As this requires at least two timeline > # switches, promote the existing standby first. Then create a second > # standby based on the promoted one. Finally, the second standby is > # promoted. > > Note "Then create a second standby based on the promoted one." - but that's > not actually what's happening: 2. There may also be other problems with the test but those aren't relevant to skink's failure, which starts on the 5th test. To Michael for comment. > But I think there's also something funky in the prefetching logic. I think it > may attempt restoring during prefetching somehow, even though there's code > that appears to try to prevent that? 3. Urghl. Yeah. There is indeed code to report XLREAD_WOULDBLOCK for the lastSourceFailed case, but we don't really want to do that more often than every 5 seconds. So I think I need to make that "sticky", so that we don't attempt to prefetch again (ie to read from the next segment, which invokes restore_command) until we've replayed all the records we already have, then hit the end and go to sleep. The attached patch does that, and makes the offending test pass under Valgrind for me, even without the other changes already mentioned. If I understand correctly, this is due to a timing race in the tests (though I didn't check where exactly), because all those extra fork/exec calls are extremely slow under Valgrind. > And interestingly I'm not seeing the > "switched WAL source from stream to archive after failure" > lines I'd expect. I see them now. It's because it gives up when it's reading ahead (nonblocking), which may not be strictly necessary but I found it simpler to think about. Then when it tries again in 5 seconds it's in blocking mode so it doesn't give up so easily. 2022-04-11 18:15:08.220 NZST [524796] DEBUG: switched WAL source from stream to archive after failure cp: cannot stat '/tmp/archive/000000010000000000000017': No such file or directory 2022-04-11 18:15:08.226 NZST [524796] DEBUG: could not restore file "000000010000000000000017" from archive: child process exited with exit code 1 2022-04-11 18:15:08.226 NZST [524796] DEBUG: could not open file "pg_wal/000000010000000000000017": No such file or directory 2022-04-11 18:15:08.226 NZST [524796] DEBUG: switched WAL source from archive to stream after failure
Attachment
pgsql-hackers by date: