Re: Enable data checksums by default - Mailing list pgsql-hackers

From Jim Nasby
Subject Re: Enable data checksums by default
Msg-id CAMFBP2pUCZz6YZH7k8bJ2pyh1XzPUR8ntCTjMM1O0-b7x85viw@mail.gmail.com
In response to Re: Enable data checksums by default  (Ants Aasma <ants.aasma@cybertec.at>)
List pgsql-hackers


On Fri, Aug 1, 2025 at 6:37 AM Ants Aasma <ants.aasma@cybertec.at> wrote:
> Even if we made the checksum algorithm itself faster, the main issue
> is actually memory bandwidth. Intel server CPUs have about half the
> bandwidth of AMD ones. A checksum has to pull in the whole page in a
> few hundred cycles. Without checksums only a part of the page might be
> accessed and the accesses are spread over a longer time, making them
> easier to hide by out-of-order execution.
>
> But all the above still ends up at being a few hundred nanoseconds per
> buffer read. Basically this ends up only mattering measurably for
> in-RAM but out of shared buffers workloads. And the easy workaround is
> to increase shared buffers. As you said, the main issue is the other
> overheads that checksums pull in.

I want to point out that at some point there may well be demand for checksumming pages that live in shared_buffers. Modern storage systems assume that durable media will have errors and already have robust ways to detect them. But they also assume that ECC memory is bulletproof (it's not), and that's the biggest benefit of Postgres checksums: they protect data in the filesystem cache[1]. You obviously lose that protection if you size shared_buffers to consume most of available memory.

Obviously trying to address that is way beyond the scope of what's being discussed here. I'm honestly unsure of how relevant it is, but I wanted to make sure folks were aware of it.

[1]: I can't go into details, but I have seen a case where Postgres checksums led to an investigation that ultimately revealed a memory-related issue. In other words, data was actually getting corrupted while in the filesystem cache. Obviously data could also have been (and likely was) corrupted in shared buffers, but the corruption in the FS cache was what prompted the investigation that ultimately found the hardware issue. Fortunately shared_buffers was small enough to make it more likely that corruption would happen outside of Postgres, where it could be detected.
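
For anyone who wants to see the failure mode concretely, here is a minimal sketch of a bit flip in a cached page being caught when the page is next verified. It makes several assumptions: it uses CRC-32 from Python's zlib as a stand-in and is not the algorithm Postgres uses, it keeps the checksum outside the page rather than in the page header, and the names page_checksum, flip_bit, and verify are hypothetical.

```python
import zlib

PAGE_SIZE = 8192  # default Postgres block size in bytes


def page_checksum(page: bytes, block_no: int) -> int:
    """Stand-in checksum (NOT the Postgres algorithm): CRC-32 of the page,
    XORed with the block number so a page returned for the wrong block
    also fails verification."""
    return zlib.crc32(page) ^ (block_no & 0xFFFFFFFF)


def flip_bit(page: bytes, bit_index: int) -> bytes:
    """Return a copy of the page with one bit inverted, simulating a
    memory error while the page sits in the filesystem cache."""
    corrupted = bytearray(page)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def verify(page: bytes, block_no: int, stored: int) -> bool:
    """Recompute the checksum on read and compare it with the stored value."""
    return page_checksum(page, block_no) == stored


# Checksum computed when the page was written out.
page = bytes(i % 251 for i in range(PAGE_SIZE))
stored = page_checksum(page, 42)

# The same page after a single bit flipped in the (simulated) cache.
cached = flip_bit(page, 12345)

print(verify(page, 42, stored))    # intact copy
print(verify(cached, 42, stored))  # corrupted copy
```

The check only runs when a page is read into shared buffers, which is why corruption that happens after that point goes unnoticed.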
