Re: [BUG] non archived WAL removed during production crash recovery - Mailing list pgsql-bugs
From | Michael Paquier |
---|---|
Subject | Re: [BUG] non archived WAL removed during production crash recovery |
Date | |
Msg-id | 20200420073444.GA77439@paquier.xyz Whole thread Raw |
In response to | Re: [BUG] non archived WAL removed during production crash recovery (Kyotaro Horiguchi <horikyota.ntt@gmail.com>) |
Responses |
Re: [BUG] non archived WAL removed during production crash recovery
|
List | pgsql-bugs |
On Mon, Apr 20, 2020 at 04:02:31PM +0900, Kyotaro Horiguchi wrote: > At Sat, 18 Apr 2020 18:26:11 +0900, Michael Paquier <michael@paquier.xyz> wrote in > As the result, +1 to what v7 is doing and discussing on earlier > removal of such WAL segments separately if needed. Thanks for the extra review. >> Yeah. We could try to do with "false" as command anyway, and see what >> the buildfarm thinks. As the test is skipped on Windows, I would >> assume that it does not matter much anyway. Let's see what others >> think about this piece. I don't have plans to touch again this patch >> until likely the middle of next week. > > Couldn't we use "/" as a globally-results-in-failure command? But > that doesn't increment failed_count. The reason is pgarch_archiveXLog > exits with FATAL for "is a directory" error. The comment asserts that > we exit with FATAL for SIGINT or SIGQUIT and if so it is enough to > check only exit-by-signal case. The following fix worked. Yeah, I was working on this stuff today and I noticed this problem. I was just going to send an email on the matter with a more portable patch and you also just beat me to it with this one :) So yes, using "false" may be a bad idea because we cannot rely on the case where the command does not exist in an environment in this test. After more testing, I have been hit hard about the fact that the archiver exits immediately if an archive command cannot be found (errcode = 127), and it does not report this failure back to pg_stat_archiver, which would cause the test to wait until the timeout of poll_query_until() kills the test. There is however an extra method not mentioned yet on this thread: we know that cp/copy is portable enough per the buildfarm, so let's use a copy command that we know *will* fail. A simple way of doing this is a command where the origin file does not exist. > --- a/src/backend/postmaster/pgarch.c > +++ b/src/backend/postmaster/pgarch.c > @@ -595,7 +595,7 @@ pgarch_archiveXlog(char *xlog) > * "command not found" type of error. If we overreact it's no big > * deal, the postmaster will just start the archiver again. > */ > - int lev = wait_result_is_any_signal(rc, true) ? FATAL : LOG; > + int lev = wait_result_is_any_signal(rc, false) ? FATAL : LOG; > > if (WIFEXITED(rc)) > { > > I didn't tested it on Windows (I somehow broke my repo and it's too > slow to clone.) but system("/") returned 1 and I think that result > increments the counter. No, this would be a behavior change, which is not acceptable in my view. (By the way, just nuke your full repo if it does not work anymore on Windows, this method works). >> Indeed. The extra initialization was part of v4, and got removed as >> of v5. Still, it seems to me that this part was not complete without >> updating the shared memory field correctly at the beginning of the >> REDO processing as the last version of the patch does. > > I may not be following the discussion, but I think it is reasonable > that SharedRecoveryState is initialized as CRASH then moves to ARCHIVE > as needed and finished by NONE. That transition also stables > RecoveryInProgress(). Thought as well about that over the weekend, and that's still the best option to me. > I think it would be better be RECOVERY_STATE_DONE. I like this suggestion better than the original in v7. > By the way I noticed that RecoveryState is exactly a subset of > DBState. And changes of SharedRecoveryState happens side-by-side with > ControlFileData->state in most places. Coundn't we just usee > ControlFile->state instead of SharedRecoveryState? I actually found confusing to use the same thing, because then the reader would thing that SharedRecoveryState could be set to more values but we don't want that. > By the way I found a typo. > > +# Recovery tests for the achiving with a standby partially check > s/achiving/archiving/ Thanks, fixed. Attached is an updated patch, where I tweaked more comments. Jehan-Guillaume, who is your colleague who found originally about this problem? We should credit him in the commit message. -- Michael
Attachment
pgsql-bugs by date: