Re: Online checksums verification in the backend - Mailing list pgsql-hackers
From | Masahiko Sawada |
---|---|
Subject | Re: Online checksums verification in the backend |
Date | |
Msg-id | CA+fd4k4B8QVxhQa8=5pf7SuDcB0vBhqzOn0c59FxVa99CF0ptQ@mail.gmail.com Whole thread Raw |
In response to | Re: Online checksums verification in the backend (Julien Rouhaud <rjuju123@gmail.com>) |
Responses |
Re: Online checksums verification in the backend
|
List | pgsql-hackers |
On Wed, 18 Mar 2020 at 19:11, Julien Rouhaud <rjuju123@gmail.com> wrote: > > On Wed, Mar 18, 2020 at 07:06:19AM +0100, Julien Rouhaud wrote: > > On Wed, Mar 18, 2020 at 01:20:47PM +0900, Michael Paquier wrote: > > > On Mon, Mar 16, 2020 at 09:21:22AM +0100, Julien Rouhaud wrote: > > > > On Mon, Mar 16, 2020 at 12:29:28PM +0900, Michael Paquier wrote: > > > >> With a large amount of > > > >> shared buffer eviction you actually increase the risk of torn page > > > >> reads. Instead of a logic relying on partition mapping locks, which > > > >> could be unwise on performance grounds, did you consider different > > > >> approaches? For example a kind of pre-emptive lock on the page in > > > >> storage to prevent any shared buffer operation to happen while the > > > >> block is read from storage, that would act like a barrier. > > > > > > > > Even with a workload having a large shared_buffers eviction pattern, I don't > > > > think that there's a high probability of hitting a torn page. Unless I'm > > > > mistaken it can only happen if all those steps happen concurrently to doing the > > > > block read just after releasing the LWLock: > > > > > > > > - postgres read the same block in shared_buffers (including all the locking) > > > > - dirties it > > > > - writes part of the page > > > > > > > > It's certainly possible, but it seems so unlikely that the optimistic lock-less > > > > approach seems like a very good tradeoff. > > > > > > Having false reports in this area could be very confusing for the > > > user. That's for example possible now with checksum verification and > > > base backups. > > > > > > I agree, however this shouldn't be the case here, as the block will be > > rechecked while holding proper lock the 2nd time in case of possible false > > positive before being reported as corrupted. So the only downside is to check > > twice a corrupted block that's not found in shared buffers (or concurrently > > loaded/modified/half flushed). As the number of corrupted or concurrently > > loaded/modified/half flushed blocks should usually be close to zero, it seems > > worthwhile to have a lockless check first for performance reason. > > > I just noticed some dumb mistakes while adding the new GUCs. v5 attached to > fix that, no other changes. Thank you for updating the patch. I have some comments: 1. + <entry> + <literal><function>pg_check_relation(<parameter>relation</parameter> <type>oid</type>, <parameter>fork</parameter> <type>text</type>)</function></literal> + </entry> Looking at the declaration of pg_check_relation, 'relation' and 'fork' are optional arguments. So I think the above is not correct. But as I commented below, 'relation' should not be optional, so maybe the above line could be: + <literal><function>pg_check_relation(<parameter>relation</parameter> <type>oid</type>[, <parameter>fork</parameter> <type>text</type>])</function></literal> 2. + <indexterm> + <primary>pg_check_relation</primary> + </indexterm> + <para> + <function>pg_check_relation</function> iterates over all the blocks of all + or the specified fork of a given relation and verify their checksum. It + returns the list of blocks for which the found checksum doesn't match the + expected one. You must be a member of the + <literal>pg_read_all_stats</literal> role to use this function. It can + only be used if data checksums are enabled. See <xref + linkend="app-initdb-data-checksums"/> for more information. + </para> * I think we need a description about possible values for 'fork' (i.g., 'main', 'vm', 'fsm' and 'init'), and the behavior when 'fork' is omitted. * Do we need to explain about checksum cost-based delay here? 3. +CREATE OR REPLACE FUNCTION pg_check_relation( + IN relation regclass DEFAULT NULL::regclass, IN fork text DEFAULT NULL::text, + OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint, + OUT expected_checksum integer, OUT found_checksum integer) + RETURNS SETOF record VOLATILE LANGUAGE internal AS 'pg_check_relation' + PARALLEL RESTRICTED; Now that pg_check_relation doesn't accept NULL as 'relation', I think we need to make 'relation' a mandatory argument. 4. + /* Check if the relation (still) exists */ + if (SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid))) + { + /* + * We use a standard relation_open() to acquire the initial lock. It + * means that this will block until the lock is acquired, or will + * raise an ERROR if lock_timeout has been set. If caller wants to + * check multiple tables while relying on a maximum wait time, it + * should process tables one by one instead of relying on a global + * processing with the main SRF. + */ + relation = relation_open(relid, AccessShareLock); + } IIUC the above was necessary because we used to have check_all_relations() which iterates all relations on the database to do checksum checks. But now that we don't have that function and pg_check_relation processes one relation. Can we just do relation_open() here? 5. I think we need to check if the relation is a temp relation. I'm not sure it's worth to check checksums of temp relations but at least we need not to check other's temp relations. 6. +/* + * Safely read the wanted buffer from disk, dealing with possible concurrency + * issue. Note that if a buffer is found dirty in shared_buffers, no read will + * be performed and the caller will be informed that no check should be done. + * We can safely ignore such buffers as they'll be written before next + * checkpoint's completion.. + * + * The following locks can be used in this function: + * + * - shared LWLock on the target buffer pool partition mapping. + * - IOLock on the buffer + * + * The IOLock is taken when reading the buffer from disk if it exists in + * shared_buffers, to avoid torn pages. + * + * If the buffer isn't in shared_buffers, it'll be read from disk without any + * lock unless caller asked otherwise, setting needlock. In this case, the + * read will be done while the buffer mapping partition LWLock is still being + * held. Reading with this lock is to avoid the unlikely but possible case + * that a buffer wasn't present in shared buffers when we checked but it then + * alloc'ed in shared_buffers, modified and flushed concurrently when we + * later try to read it, leading to false positive due to torn page. Caller + * can read first the buffer without holding the target buffer mapping + * partition LWLock to have an optimistic approach, and reread the buffer + * from disk in case of error. + * + * Caller should hold an AccessShareLock on the Relation + */ I think the above comment also needs some "/*-------" at the beginning and end. 7. +static void +check_get_buffer(Relation relation, ForkNumber forknum, + BlockNumber blkno, char *buffer, bool needlock, bool *checkit, + bool *found_in_sb) +{ Maybe we can make check_get_buffer() return a bool indicating we found a buffer to check, instead of having '*checkit'. That way, we can simplify the following code: + check_get_buffer(relation, forknum, blkno, buffer, force_lock, + &checkit, &found_in_sb); + + if (!checkit) + continue; to something like: + if (!check_get_buffer(relation, forknum, blkno, buffer, force_lock, + &found_in_sb)) + continue; 8. + if (PageIsVerified(buffer, blkno)) + { + /* + * If the page is really new, there won't by any checksum to be + * computed or expected. + */ + *chk_expected = *chk_found = NoComputedChecksum; + return true; + } + else + { + /* + * There's a corruption, but since this affect PageIsNew, we + * can't compute a checksum, so set NoComputedChecksum for the + * expected checksum. + */ + *chk_expected = NoComputedChecksum; + *chk_found = hdr->pd_checksum; + } + return false; * I think the 'else' is not necessary here. * Setting *chk_expected and *chk_found seems useless when we return true. The caller doesn't use them. * Should we forcibly overwrite ignore_checksum_failure to off in pg_check_relation()? Otherwise, this logic seems not work fine. * I don't understand why we cannot compute a checksum in case where a page looks like a new page but is actually corrupted. Could you please elaborate on that? 8. + { + {"checksum_cost_page_hit", PGC_USERSET, RESOURCES_CHECKSUM_DELAY, + gettext_noop("Checksum cost for a page found in the buffer cache."), + NULL + }, + &ChecksumCostPageHit, + 1, 0, 10000, + NULL, NULL, NULL + }, * There is no description about the newly added four GUC parameters in the doc. * We need to put new GUC parameters into postgresql.conf.sample as well. * The patch doesn't use checksum_cost_page_hit at all. 9. diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c index eb19644419..37f63e747c 100644 --- a/src/backend/utils/init/globals.c +++ b/src/backend/utils/init/globals.c @@ -134,6 +134,14 @@ int max_worker_processes = 8; int max_parallel_workers = 8; int MaxBackends = 0; +int ChecksumCostPageHit = 1; /* GUC parameters for checksum check */ +int ChecksumCostPageMiss = 10; +int ChecksumCostLimit = 200; +double ChecksumCostDelay = 0; + +int ChecksumCostBalance = 0; /* working state for checksums check */ +bool ChecksumCostActive = false; Can we declare them in checksum.c since these parameters are used only in checksum.c and it does I/O my itself. 10. + /* Report the failure to the stat collector and the logs. */ + pgstat_report_checksum_failure(); + ereport(WARNING, + (errcode(ERRCODE_DATA_CORRUPTED), + errmsg("invalid page in block %u of relation %s", + blkno, + relpath(relation->rd_smgr->smgr_rnode, forknum)))); I think we could do pgstat_report_checksum_failure() and emit WARNING twice for the same page since PageIsVerified() also does the same. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
pgsql-hackers by date: