Even if we made the checksum algorithm itself faster, the main issue is really memory bandwidth (Intel server CPUs have about half the bandwidth of AMD ones). Computing a checksum has to pull the whole page into the CPU within a few hundred cycles. Without checksums, only part of the page might be accessed, and those accesses are spread over a longer window, making them much easier for out-of-order execution to hide.
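To make the bandwidth point concrete, here's a minimal sketch of what a per-page checksum loop looks like. This is not PostgreSQL's actual pg_checksum_page() (which uses a wider, vectorizable FNV-1a variant); the names and constants below are made up for illustration. However cheap the per-word arithmetic is, the loop still has to stream the entire 8 kB page through the CPU:

#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 8192			/* default PostgreSQL block size */

/*
 * Toy FNV-1a-style checksum over one page, for illustration only.  The
 * access pattern is the point: every word of the page is read, so the
 * cost is dominated by memory bandwidth rather than by the arithmetic.
 */
static uint32_t
toy_page_checksum(const char *page)
{
	const uint32_t *words = (const uint32_t *) page;
	uint32_t	hash = 2166136261u;	/* FNV-1a offset basis */

	for (size_t i = 0; i < PAGE_SIZE / sizeof(uint32_t); i++)
	{
		hash ^= words[i];
		hash *= 16777619u;	/* FNV-1a prime */
	}
	return hash;
}

Contrast that with a normal (checksum-less) buffer read, where maybe a handful of tuples on the page get touched and the rest never leaves DRAM.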
But all of the above still only amounts to a few hundred nanoseconds per buffer read. In practice that only becomes measurable for workloads that fit in RAM but not in shared buffers, and the easy workaround there is to increase shared_buffers. As you said, the main issue is the other overheads that checksums pull in.
I want to point out that at some point there may well be demand for checksumming pages living in shared_buffers. Modern storage systems assume that the durable media will have errors and already have robust ways to detect them. But they also assume that ECC memory is bulletproof (it's not), and that's the biggest benefit of Postgres checksums: they protect data in the filesystem cache[1]. You obviously lose that if you size shared_buffers to consume most of available memory.
Obviously trying to address that is well beyond the scope of what's being discussed here. I'm honestly not sure how relevant it is, but I wanted to make sure folks were aware of it.
1: I can't go into details, but I have seen a case where Postgres checksums led to an investigation that ultimately revealed a memory-related issue. In other words, data was actually getting corrupted while in the filesystem cache. Data could also have been (and likely was) corrupted in shared buffers, but the corruption in the FS cache was what prompted the investigation that ultimately found the hardware issue. Fortunately shared_buffers was small enough that corruption was more likely to happen outside of Postgres, where it could be detected.