Re: Changing the state of data checksums in a running cluster - Mailing list pgsql-hackers
| From | Tomas Vondra |
|---|---|
| Subject | Re: Changing the state of data checksums in a running cluster |
| Date | |
| Msg-id | c33d8d86-47ad-4cee-9860-26b324d7306b@vondra.me |
| In response to | Re: Changing the state of data checksums in a running cluster (Daniel Gustafsson <daniel@yesql.se>) |
| List | pgsql-hackers |
Hi Daniel,

I took a look at the patch again, focusing mostly on the docs and comments.
I think the code is OK, I haven't noticed anything serious.


testing
-------

I'm running the TAP tests - so far it looks fine. I've done 2000 iterations
of the "single" test, and I'm now running ~2000 iterations of the "standby"
test. No issues/failures so far.

The question is whether we can/should make it even more "random", by doing
restarts in more places etc. I might give it a try, if I happen to have some
free time. But no promises, I'm not sure how feasible it is. Making it
"truly random" means it's hard to deduce what the actual state of checksums
should be, etc.


review
------

Attached is a patch adding FIXME/XXX comments to a bunch of places, which I
think makes it clearer which places I'm talking about. I'll briefly go over
the items, and maybe explain them a bit more.

1) func-admin.sgml

- This is missing documentation for the "fast" parameter of both functions
  (enable/disable) - see the sketch after this list for roughly how I'd
  expect it to be called.

- The paragraph starts with "Initiates data checksums for ..." but that
  doesn't sound right to me. I think you can initiate enabling/disabling,
  not "data checksums".

- I think the docs should at least briefly describe the impact on the
  cluster, and also on a standby, due to having to write everything into
  WAL, waiting for checkpoints etc. And maybe discuss how to mitigate that
  in some way. More about the standby stuff later.

2) glossary.sgml

- This describes "checksum worker" as a process that enables or disables
  checksums in a specific database, but we don't need any per-database
  processes when disabling, no?

3) monitoring.sgml

- I'm not sure what "regardles" implies here. Does that mean we simply
  don't hide/reset the counters?

- I added a brief explanation about using the "launcher" row for overall
  progress, and the per-database workers for "current progress".

- Do we want to refer to "datachecksumsworker"? Isn't it a bit too
  low-level a detail?

- The table of phases is missing the "waiting" and "done" phases. IMHO if
  the progress view can return it, it should be in the docs.

4) wal.sgml

- I added a sentence explaining that both enabling and disabling happen in
  phases, with checkpoints in between. I think that's helpful for users
  and DBAs.

- The section only described "enabling checksums", but it probably should
  explain the process for disabling too. Added a para.

- Again, I think we should explain the checkpoints, restartpoints, impact
  on a standby ... somewhere. Not sure if this is the right place.

5) xlog.c

- Some minor corrections (typos, ...).

- Isn't the claim that PG_DATA_CHECKSUM_ON_VERSION is the only place where
  we check InitialDataChecksumTransition stale? AFAIK we now check this in
  AbsorbDataChecksumsBarrier for all transitions, no?

6) datachecksumsworker.c

- I understand what the comments at the beginning of the file say, but I
  suspect that's partially due to already knowing the code. There's a
  couple of places that might be more explicit, I think. For example:

  - One of the items in the synchronization/correctness section states
    that "Backends SHALL NOT violate local data_checksums state", but what
    does "violating local data_checksums state" mean? What even is "local
    state" in this context? Should we explain/define that, or would that
    be unnecessarily detailed?

  - The section only talks about "enabling checksums", but it also
    discusses all four possible states. Maybe it should talk about
    disabling too, as that requires the same synchronization/correctness.

- Maybe it'd be good to make it more explicit at which point the process
  waits on a barrier, which backend initiates that (and which backends are
  required to absorb the signal). It kinda is there, but only indirectly.

- Another idea I had is that maybe it'd help to have some visualization of
  the process (with data_checksum states, barriers, ...), e.g. in the form
  of an ASCII image.
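To make items 1 and 3 a bit more concrete, here is roughly the user-facing
flow I had in mind while reviewing those docs. This is only a sketch based
on my reading of the patch - the exact function signatures, the "fast"
argument and the progress view name are assumptions on my part, not
necessarily what the patch ends up with:

    -- Sketch only: the "fast" argument and the progress view name are
    -- assumptions based on my reading of the patch, not its final API.

    -- Start enabling checksums; "fast" is the (currently undocumented)
    -- option item 1 complains about.
    SELECT pg_enable_data_checksums(fast => true);

    -- The launcher row would give the overall progress, the per-database
    -- worker rows the database currently being processed (view name assumed).
    SELECT * FROM pg_stat_progress_data_checksums;

    -- The cluster-wide state, which should eventually flip from the
    -- in-progress value to "on" after the final checkpoint.
    SHOW data_checksums;

    -- Disabling would be the counterpart, again with a "fast" variant.
    SELECT pg_disable_data_checksums();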
open questions
--------------

For me the main remaining question is the impact people should expect on
production systems, and maybe ways to mitigate that.

In single-node systems this is entirely fine, I think. There will be
checkpoints, but those will be "spread", and it's just the checksum worker
waiting for them to complete. It'll write everything into WAL, but that's
fairly well understood / should be expected. We should probably mention
that in the sgml docs, so that people are not surprised their WAL archive
gets huge.

I'm much more concerned about streaming replicas, because there it forces
a restartpoint - and it *blocks redo* until it completes. Which means
there'll be replication lag, and for synchronous standbys this would even
block progress on the primary. We should very clearly document this.

But I'm also wondering if we might mitigate this in some way / reduce the
impact. I see some similarities to shutdown checkpoints, which can take a
lot of time if there happens to be a lot of dirty data, increasing the
disruption during planned restarts (when no one can connect). A common
mitigation is to run CHECKPOINT shortly before the restart, to flush most
of the dirty data while still allowing new connections.

Maybe we could do something like that for checksum changes? I don't know
exactly how we could do that, but let's say we can predict roughly when to
expect the next state change. Then we'd ensure the standby starts flushing
stuff before that, so that creating the restartpoint is cheap. Or maybe
we'd (gradually?) lower max_wal_size on the standby, to reduce the amount
of WAL as we're getting closer to the end? A rough sketch of what I mean
is below.
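Just to illustrate the idea, this is what an operator could do manually on
the standby shortly before the expected state change. It's nothing the
patch provides or automates - whether the timing can be predicted at all
is exactly the open question:

    -- Rough illustration of the manual mitigation idea, run on the standby
    -- shortly before the expected state change.

    -- Flush most of the dirty buffers now, so the forced restartpoint has
    -- little left to write and blocks redo only briefly.
    CHECKPOINT;

    -- Optionally reduce max_wal_size for a while, so restartpoints are
    -- triggered more often as we get closer to the end.
    ALTER SYSTEM SET max_wal_size = '1GB';
    SELECT pg_reload_conf();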
regards

-- 
Tomas Vondra