Re: Changing the state of data checksums in a running cluster - Mailing list pgsql-hackers
| From | Tomas Vondra |
|---|---|
| Subject | Re: Changing the state of data checksums in a running cluster |
| Date | |
| Msg-id | c33d8d86-47ad-4cee-9860-26b324d7306b@vondra.me |
| In response to | Re: Changing the state of data checksums in a running cluster (Daniel Gustafsson <daniel@yesql.se>) |
| List | pgsql-hackers |
Hi Daniel,

I took a look at the patch again, focusing mostly on the docs and comments.
I think the code is OK, I haven't noticed anything serious.


testing
-------

I'm running the TAP tests - so far it looks fine. I've done 2000 iterations
of the "single" test, and I'm now running ~2000 iterations of the "standby"
test. No issues/failures so far.

The question is whether we can/should make it even more "random", by doing
restarts in more places etc. I might give it a try, if I happen to have some
free time. But no promises, I'm not sure how feasible it is. Making it
"truly random" means it's hard to deduce what the actual state of checksums
should be, etc.


review
------

Attached is a patch adding FIXME/XXX comments to a bunch of places, which I
think makes it clearer which places I'm talking about. I'll briefly go over
the items, and maybe explain them a bit more.

1) func-admin.sgml

- This is missing documentation for the "fast" parameter of both functions
  (enable/disable) - see the sketch after this list for roughly how I'd
  expect it to be called.

- The paragraph starts with "Initiates data checksums for ..." but that
  doesn't sound right to me. I think you can initiate enabling/disabling,
  not "data checksums".

- I think the docs should at least briefly describe the impact on the
  cluster, and also on a standby, due to having to write everything into
  WAL, waiting for checkpoints etc. And maybe discuss how to mitigate that
  in some way. More about the standby stuff later.

2) glossary.sgml

- This describes "checksum worker" as a process that enables or disables
  checksums in a specific database, but we don't need any per-database
  processes when disabling, no?

3) monitoring.sgml

- I'm not sure what "regardles" implies here. Does that mean we simply
  don't hide/reset the counters?

- I added a brief explanation about using the "launcher" row for overall
  progress, and the per-database workers for "current progress".

- Do we want to refer to "datachecksumsworker"? Isn't it a bit too
  low-level a detail?

- The table of phases is missing the "waiting" and "done" phases. IMHO if
  the progress view can return it, it should be in the docs.

4) wal.sgml

- I added a sentence explaining that both enabling and disabling happen in
  phases, with checkpoints in between. I think that's helpful for users
  and DBAs.

- The section only described "enabling checksums", but it probably should
  explain the process for disabling too. Added a para.

- Again, I think we should explain the checkpoints, restartpoints, impact
  on a standby ... somewhere. Not sure if this is the right place.

5) xlog.c

- Some minor corrections (typos, ...).

- Isn't the claim that PG_DATA_CHECKSUM_ON_VERSION is the only place where
  we check InitialDataChecksumTransition stale? AFAIK we now check this in
  AbsorbDataChecksumsBarrier for all transitions, no?

6) datachecksumsworker.c

- I understand what the comments at the beginning of the file say, but I
  suspect that's partially due to already knowing the code. There's a
  couple of places that might be more explicit, I think. For example:

  - One of the items in the synchronization/correctness section states
    that "Backends SHALL NOT violate local data_checksums state", but what
    does "violating local data_checksums state" mean? What even is "local
    state" in this context? Should we explain/define that, or would that
    be unnecessarily detailed?

  - The section only talks about "enabling checksums", but it also
    discusses all four possible states. Maybe it should talk about
    disabling too, as that requires the same synchronization/correctness.

- Maybe it'd be good to make it more explicit at which point the process
  waits on a barrier, which backend initiates that (and which backends are
  required to absorb the signal). It kinda is there, but only indirectly.

- Another idea I had is that maybe it'd help to have some visualization of
  the process (with data_checksum states, barriers, ...), e.g. in the form
  of an ASCII image.
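To make items 1 and 3 a bit more concrete, here is roughly the user-facing
flow I had in mind while reviewing those docs. This is only a sketch based
on my reading of the patch - the exact function signatures, the "fast"
argument and the progress view name are assumptions on my part, not
necessarily what the patch ends up with:

    -- Sketch only: the "fast" argument and the progress view name are
    -- assumptions based on my reading of the patch, not its final API.

    -- Start enabling checksums; "fast" is the (currently undocumented)
    -- option item 1 complains about.
    SELECT pg_enable_data_checksums(fast => true);

    -- The launcher row would give the overall progress, the per-database
    -- worker rows the database currently being processed (view name assumed).
    SELECT * FROM pg_stat_progress_data_checksums;

    -- The cluster-wide state, which should eventually flip from the
    -- in-progress value to "on" after the final checkpoint.
    SHOW data_checksums;

    -- Disabling would be the counterpart, again with a "fast" variant.
    SELECT pg_disable_data_checksums();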
open questions
--------------

For me the main remaining question is the impact people should expect on
production systems, and maybe ways to mitigate that.

In single-node systems this is entirely fine, I think. There will be
checkpoints, but those will be "spread", and it's just the checksum worker
waiting for them to complete. It'll write everything into WAL, but that's
fairly well understood / should be expected. We should probably mention
that in the sgml docs, so that people are not surprised their WAL archive
gets huge.

I'm much more concerned about streaming replicas, because there it forces
a restartpoint - and it *blocks redo* until it completes. Which means
there'll be replication lag, and for synchronous standbys this would even
block progress on the primary. We should very clearly document this.

But I'm also wondering if we might mitigate this in some way / reduce the
impact. I see some similarities to shutdown checkpoints, which can take a
lot of time if there happens to be a lot of dirty data, increasing the
disruption during planned restarts (when no one can connect). A common
mitigation is to run CHECKPOINT shortly before the restart, to flush most
of the dirty data while still allowing new connections.

Maybe we could do something like that for checksum changes? I don't know
exactly how we could do that, but let's say we can predict roughly when to
expect the next state change. Then we'd ensure the standby starts flushing
stuff before that, so that creating the restartpoint is cheap. Or maybe
we'd (gradually?) lower max_wal_size on the standby, to reduce the amount
of WAL as we're getting closer to the end? A rough sketch of what I mean
is below.
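Just to illustrate the idea, this is what an operator could do manually on
the standby shortly before the expected state change. It's nothing the
patch provides or automates - whether the timing can be predicted at all
is exactly the open question:

    -- Rough illustration of the manual mitigation idea, run on the standby
    -- shortly before the expected state change.

    -- Flush most of the dirty buffers now, so the forced restartpoint has
    -- little left to write and blocks redo only briefly.
    CHECKPOINT;

    -- Optionally reduce max_wal_size for a while, so restartpoints are
    -- triggered more often as we get closer to the end.
    ALTER SYSTEM SET max_wal_size = '1GB';
    SELECT pg_reload_conf();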
regards

-- 
Tomas Vondra