Home > mailing lists
Re: [HACKERS] WAL consistency check facility - Mailing list pgsql-hackers

From	Kuntal Ghosh
Subject	Re: [HACKERS] WAL consistency check facility
Date	February 8, 2017 09:25:10
Msg-id	CAGz5QCKa_n_qV-p_zf7b9HS0gjei=e18NiKurfrg-v9NRBinpg@mail.gmail.com Whole thread Raw
In response to	Re: [HACKERS] WAL consistency check facility (Robert Haas <robertmhaas@gmail.com>)
Responses	Re: [HACKERS] WAL consistency check facility
List	pgsql-hackers
Tree view
On Tue, Jan 31, 2017 at 9:36 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jan 5, 2017 at 12:54 AM, Kuntal Ghosh
> <kuntalghosh.2007@gmail.com> wrote:
>> I've attached the patch with the modified changes. PFA.
>
> Can this patch check contrib/bloom?
>
Added support for generic resource manager type.

> +        /*
> +         * Mask some line pointer bits, particularly those marked as
> +         * used on a master and unused on a standby.
> +         */
>
> Comment doesn't explain why we need to do this.
>
Added comments.

> +        /*
> +         * For GIN_DELETED page, the page is initialized to empty.
> +         * Hence mask everything.
> +         */
> +        if (opaque->flags & GIN_DELETED)
> +            memset(page_norm, MASK_MARKER, BLCKSZ);
> +        else
> +            mask_unused_space(page_norm);
>
> If the page is initialized to empty, why do we need to mask
> anything/everything?  Anyway, it doesn't seem right to mask the
> GinPageOpaque structure itself. Or at least it doesn't seem right to
> mask the flags word.
>
Modified accordingly.

>
> +            if (!HeapTupleHeaderXminFrozen(page_htup))
> +                page_htup->t_infomask |= HEAP_XACT_MASK;
> +            else
> +                page_htup->t_infomask |= HEAP_XMAX_COMMITTED |
> HEAP_XMAX_INVALID;
>
> Comment doesn't address this logic.  Also, the "else" case shouldn't
> exist at all, I think.
Added comments. I think that "Else" part is needed for xmax.

>
> +            /*
> +             * For a speculative tuple, the content of t_ctid is conflicting
> +             * between the backup page and current page. Hence, we set it
> +             * to the current block number and current offset.
> +             */
>
> Why does it differ?  Is that a bug?
>
Added comments.

>
> +    /*
> +     * During replay of a btree page split, we don't set the BTP_SPLIT_END
> +     * flag of the right sibling and initialize th cycle_id to 0 for the same
> +     * page.
> +     */
>
> A reference to the name of the redo function might be helpful here
> (and in some other places), unless it's just ${AMNAME}_redo.  "th" is
> a typo for "the".
Corrected.

> +                        appendStringInfoString(buf, " (full page
> image, apply)");
> +                    else
> +                        appendStringInfoString(buf, " (full page image)");
>
> How about "(full page image)" and "(full page image, for WAL verification)"?
>
> Similarly in XLogDumpDisplayRecord, I think we should assume that
> "FPW" means what it has always meant, and leave that output alone.
> Instead, distinguish the WAL-consistency-checking case when it
> happens, e.g. "(FPW for consistency check)".
>
Corrected.

> +checkConsistency(XLogReaderState *record)
>
> How about CheckXLogConsistency()?
>
> +         * If needs_backup is true or wal consistency check is enabled for
>
> ...or WAL checking is enabled...
>
> +             * If WAL consistency is enabled for the resource manager of
>
> If WAL consistency checking is enabled...
>
> + * Note: when a backup block is available in XLOG with BKPIMAGE_APPLY flag
>
> with the BKPIMAGE_APPLY flag
Modified accordingly.

>
> - * In RBM_ZERO_* modes, if the page doesn't exist, the relation is extended
> - * with all-zeroes pages up to the referenced block number.  In
> - * RBM_ZERO_AND_LOCK and RBM_ZERO_AND_CLEANUP_LOCK modes, the return value
> + * In RBM_ZERO_* modes, if the page doesn't exist or BKPIMAGE_APPLY flag
> + * is not set for the backup block, the relation is extended with all-zeroes
> + * pages up to the referenced block number.
>
> OK, I'm puzzled by this.  Surely we don't want the WAL consistency
> checking facility to cause the relation to be extended like this.  And
> I don't see why it would, because the WAL consistency checking happens
> after the page changes have already been applied.  I wonder if this
> hunk is in error and should be dropped.
>
Dropped comments.

> +    PageXLogRecPtrSet(phdr->pd_lsn, PG_UINT64_MAX);
> +    phdr->pd_prune_xid = PG_UINT32_MAX;
> +    phdr->pd_flags |= PD_PAGE_FULL | PD_HAS_FREE_LINES;
> +    phdr->pd_flags |= PD_ALL_VISIBLE;
> +#define MASK_MARKER        0xFF
> (and many others)
>
> Why do we mask by setting bits rather than clearing bits?  My
> intuition would have been to zero things we want to mask, rather than
> setting them to one.
>
Modified accordingly so that we can zero things during masking instead
of setting them to one.

> +                {
> +                    newwalconsistency[i] = true;
> +                }
>
> Superfluous braces.
>
Corrected.

> +    /*
> +     * Leave if no masking functions defined, this is possible in the case
> +     * resource managers generating just full page writes, comparing an
> +     * image to itself has no meaning in those cases.
> +     */
> +    if (RmgrTable[rmid].rm_mask == NULL)
> +        return;
>
> ...and also...
>
> +            /*
> +             * This feature is enabled only for the resource managers where
> +             * a masking function is defined.
> +             */
> +            for (i = 0; i <= RM_MAX_ID; i++)
> +            {
> +                if (RmgrTable[i].rm_mask != NULL)
>
> Why do we assume that the feature is only enabled for RMs that have a
> mask function?  Why not instead assume that if there's no masking
> function, no masking is required?
I've introduced a function in rmgr.c, named
consistencyCheck_is_enabled, which returns true if
wal_consistency_checking is enabled for a resource manager. It does
not have any dependency on the masking function. If a masking function
is defined, then we mask the page before consistency check. However,
I'm not sure whether rmgr.c is the right place to define the function
consistencyCheck_is_enabled.

>
> +        /* Definitely not an individual resource manager. Check for 'all'. */
> +        if (pg_strcasecmp(tok, "all") == 0)
>
> It seems like it might be cleaner to check for "all" first, and then
> check for individual RMs afterward.
>
Done.

> +    /*
> +     * Parameter should contain either 'all' or a combination of resource
> +     * managers.
> +     */
> +    if (isAll && isRmgrId)
> +    {
> +        GUC_check_errdetail("Invalid value combination");
> +        return false;
> +    }
>
> That error message is not very clear, and I don't see why we even need
> to check this.  If someone sets wal_consistency_checking = hash, all,
> let's just set it for all and the fact that hash is also set won't
> matter to anything.
>
Modified accordingly.

> +    void        (*rm_mask) (char *page, BlockNumber blkno);
>
> Could the page be passed as type "Page" rather than a "char *" to make
> things more convenient for the masking functions?  If not, could those
> functions at least do something like "Page page = (Page) pagebytes;"
> rather than "Page page_norm = (Page) page;"?
>
Modified it as "Page tempPage = (Page) page;"

> +        /*
> +         * Read the contents from the current buffer and store it in a
> +         * temporary page.
> +         */
> +        buf = XLogReadBufferExtended(rnode, forknum, blkno,
> +                                          RBM_NORMAL);
> +        if (!BufferIsValid(buf))
> +            continue;
> +
> +        new_page = BufferGetPage(buf);
> +
> +        /*
> +         * Read the contents from the backup copy, stored in WAL record
> +         * and store it in a temporary page. There is not need to allocate
> +         * a new page here, a local buffer is fine to hold its contents and
> +         * a mask can be directly applied on it.
> +         */
> +        if (!RestoreBlockImage(record, block_id, old_page_masked))
> +            elog(ERROR, "failed to restore block image");
> +
> +        /*
> +         * Take a copy of the new page where WAL has been applied to have
> +         * a comparison base before masking it...
> +         */
> +        memcpy(new_page_masked, new_page, BLCKSZ);
> +
> +        /* No need for this page anymore now that a copy is in */
> +        ReleaseBuffer(buf);
>
> The order of operations is strange here.  We read the "new" page,
> holding the pin (but no lock?).  Then we restore the block image into
> old_page_masked.  Now we copy the new page and release the pin.  It
> would be better, ISTM, to rearrange that so that we finish with the
> new page and release the pin before dealing with the old page.  Also,
> I think we need to actually lock the buffer before copying it.  Maybe
> that's not strictly necessary since this is all happening on the
> standby, but it seems like a bad idea to set the precedent that you
> can read a page without taking the content lock.
>
Modified accordingly.

> I think the "new" and "old" page terminology is kinda weird too.
> Maybe we should call them the "replay image" and the "master image" or
> something like that.   A few more comments wouldn't hurt either.
>
Done.

> +     * Also mask the all-visible flag.
> +     *
> +     * XXX: It is unfortunate that we have to do this. If the flag is set
> +     * incorrectly, that's serious, and we would like to catch it. If the flag
> +     * is cleared incorrectly, that's serious too. But redo of HEAP_CLEAN
> +     * records don't currently set the flag, even though it is set in the
> +     * master, so we must silence failures that that causes.
> +     */
> +    phdr->pd_flags |= PD_ALL_VISIBLE;
>
> I'm puzzled by the reference to HEAP_CLEAN.  The thing that might set
> the all-visible bit is XLOG_HEAP2_VISIBLE, not XLOG_HEAP2_CLEAN.
> Unless I'm missing something, there's no situation in which
> XLOG_HEAP2_CLEAN might be associated with setting PD_ALL_VISIBLE.
> Also, XLOG_HEAP2_VISIBLE records do SOMETIMES set the bit, just not
> always.  And there's a good reason for that, which is explained in
> this comment:
>
>          * We don't bump the LSN of the heap page when setting the visibility
>          * map bit (unless checksums or wal_hint_bits is enabled, in which
>          * case we must), because that would generate an unworkable volume of
>          * full-page writes.  This exposes us to torn page hazards, but since
>          * we're not inspecting the existing page contents in any way, we
>          * don't care.
>          *
>          * However, all operations that clear the visibility map bit *do* bump
>          * the LSN, and those operations will only be replayed if the XLOG LSN
>          * follows the page LSN.  Thus, if the page LSN has advanced past our
>          * XLOG record's LSN, we mustn't mark the page all-visible, because
>          * the subsequent update won't be replayed to clear the flag.
>
> So I think this comment needs to be rewritten with a bit more nuance.
>
Corrected.

> +extern void mask_unused_space(Page page);
> +#endif
>
> Missing newline.
>
Done.

Thank you Robert for the review. Please find the updated patch in the
attachment.

Thanks to Amit Kapila and Dilip Kumar for their suggestions in offline
discussions.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Attachment

walconsistency_v17.patch
pgsql-hackers by date:
From: Tom Lane
Date: 08 February 2017, 09:05:08
Subject: Re: [HACKERS] [PATCH] configure-time knob to set default ssl ciphers
From: Masahiko Sawada
Date: 08 February 2017, 09:40:55
Subject: Re: [HACKERS] DROP SUBSCRIPTION and ROLLBACK
Re: [HACKERS] WAL consistency check facility - Mailing list pgsql-hackers

Attachment

Previous

Next