Re: WAL consistency check facility - Mailing list pgsql-hackers
From | Michael Paquier |
---|---|
Subject | Re: WAL consistency check facility |
Date | |
Msg-id | CAB7nPqT5GH2SmsnrL9nK5UTBOPOBOiUQEZqXAj4y+ACCK17dKQ@mail.gmail.com Whole thread Raw |
In response to | Re: WAL consistency check facility (Kuntal Ghosh <kuntalghosh.2007@gmail.com>) |
Responses |
Re: WAL consistency check facility
Re: WAL consistency check facility |
List | pgsql-hackers |
On Tue, Sep 13, 2016 at 6:07 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote: >> For now, I've kept this as a WARNING message to detect all inconsistencies >> at once. Once, the patch is finalized, I'll modify it as an ERROR message. > > Or say FATAL. This way the server is taken down. What I'd really like to see here is a way to quickly identify in the buildfarm the moment an inconsistent WAL has been introduced by a new feature, some bug fix, or perhaps a deficiency in the masking routines. We could definitely tune that later on, by controlling with a GUC if this generates a WARNING instead of a FATAL, the former being more useful for production environments, and the latter for tests. It would be good to think as well about a set of tests, one rough thing would be to modify an on-disk page for a table, and work on that to force an inconsistency to be triggered.. >>A couple of extra thoughts: >>1) The routines of each rmgr are located in a dedicated file, for >>example GIN stuff is in ginxlog.c, etc. It seems to me that it would >>be better to move each masking function where it should be instead >>being centralized. A couple of routines need to be centralized, so I'd >>suggest putting them in a new file, like xlogmask.c, xlog because now >>this is part of WAL replay completely, including the lsn, the hint >>bint and the other common routines. > Sounds good. But, I think that the file name for common masking routines > should be as bufmask.c since we are masking the buffers only. That makes sense as well. No objections to that. >>2) Regarding page comparison: >>We could just use memcpy() here. compareImages was useful to get a >>clear image of what the inconsistencies were, but you don't do that >>anymore. > memcmp(), right? Yep :) >>6) >>+ /* >>+ * Remember that, if WAL consistency check is enabled for >>the current rmid, >>+ * we always include backup image with the WAL record. >>But, during redo we >>+ * restore the backup block only if needs_backup is set. >>+ */ >>+ if (needs_backup) >>+ bimg.bimg_info |= BKPIMAGE_IS_REQUIRED_FOR_REDO; >>This should use wal_consistency[rmid]? > > needs_backup is set when XLogRecordAssemble decides that backup image > should be included in the record for redo purpose. This image will be > restored during > redo. BKPIMAGE_IS_REQUIRED_FOR_REDO indicates whether the included > image should be restored during redo(or has_image should be set or not). When decoding a record, I think that you had better not use has_image to assume that a FPW has to be double-checked. This has better be a different boolean flag, say check_page or similar. This way after decoding a record it is possible to know if there is a PFW, and if a check on it is needed or not. >>7) This patch has zero documentation. Please add some. Any human being >>on this list other than those who worked on the first versions >>(Heikki, Simon and I?) is going to have a hard time to review this >>patch in details moving on if there is no reference to tell what this >>feature does for the user... >> >>This patch is going to the good direction, but I don't think it's far >>from being ready for commit yet. So I am going to mark it as returned >>with feedback if there are no objections > > I think only major change that this patch needs a proper and detailed > documentation. Other than that there are very minor changes which can > be done quickly. Right? It seems to me that you need to think about the way to document things properly first, with for example: - Have a first documentation patch that explains what is a resource manager for WAL, and what are the types available with a nice table. - Add in your patch documentation to explain what are the benefits of using this facility, the main purpose is testing, but there are also mention upthread about users that would like to get that into production, assuming that the overhead is minimal. - Add more comments in your code to finish. One example is checkConsistency() that is here, but explains nothing. Well, if you'd simply use an on/off switch to control the feature, the documentation load for rmgrs would be zero, but as I am visibly outnumbered in this fight... We could also have an off/on switch implemented first, and extend that later on depending on the feedback from other users. We discussed rmgr-level or relation-level tuning of FPW compression at some point, but we've finished with the most simple approach, and we still stick with it. -- Michael
pgsql-hackers by date: