Re: storing an explicit nonce - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: storing an explicit nonce |
Date | |
Msg-id | CA+Tgmobg+1Gypkyb8FbEhzt9Ve-4QF=HqrWWUN-eP2=Rqq_hdQ@mail.gmail.com |
In response to | Re: storing an explicit nonce (Bruce Momjian <bruce@momjian.us>) |
Responses | Re: storing an explicit nonce |
List | pgsql-hackers |
On Thu, May 27, 2021 at 11:19 AM Bruce Momjian <bruce@momjian.us> wrote:
> On Thu, May 27, 2021 at 10:47:13AM -0400, Robert Haas wrote:
> > On Wed, May 26, 2021 at 4:40 PM Bruce Momjian <bruce@momjian.us> wrote:
> > > You are saying that by using a non-LSN nonce, you can write out the page
> > > with a new nonce, but the same LSN, and also discard the page during
> > > crash recovery and use the WAL copy?
> >
> > I don't know what "discard the page during crash recovery and use the
> > WAL copy" means.
>
> I was asking how decoupling the nonce from the LSN allows for us to
> avoid full page writes for hint bit changes. I am guessing you are
> saying that on recovery, if we see a hint-bit-only change in the WAL
> (with a new nonce), we just throw away the page because it could be torn
> and use the WAL full page write version.

Well, in the design where the nonce is stored in the page, there is no need for every hint-type change to appear in the WAL at all. Once per checkpoint cycle, you need to write a full page image, as we do for checksums or wal_log_hints. The rest of the time, you can just bump the nonce and rewrite the page, same as we do today.

> Yes, it might be 1e100+++ more expensive too, but we don't know, and I
> am not ready to add a lot of complexity for such an unknown.

No, it can't be 1e100+++ more expensive, because it's not realistically possible for a page to be written to disk 1e100+++ times per checkpoint cycle. It is however entirely possible for it to be written 100 times per checkpoint cycle. That is not something unknown about which we need to speculate; it is easy to see that this can happen, even on a simple test like pgbench with a data set larger than shared buffers. It is not right to confuse "we have no idea whether this will be expensive" with "how expensive this will be is workload-dependent," which is what you seem to be doing here.
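As a rough illustration of the cost argument (a toy model only, not taken from the patches posted upthread), the sketch below counts WAL records for one page that is written out 100 times with hint-bit-only changes during a single checkpoint cycle. The `Page` class, its fields, and both write functions are hypothetical names. The LSN-derived variant assumes that every rewrite must emit some WAL record to obtain a fresh LSN; that is an assumption about the alternative design, not a statement of how any patch behaves.

```python
from dataclasses import dataclass


@dataclass
class Page:
    # Hypothetical stand-ins for per-page state; not PostgreSQL's real page header.
    nonce: int = 0
    fpi_logged_this_cycle: bool = False


def write_with_explicit_nonce(page: Page) -> int:
    """Write out a hint-bit-only change; return the WAL records emitted."""
    wal_records = 0
    if not page.fpi_logged_this_cycle:
        # First write since the checkpoint: log one full page image,
        # as is done for checksums or wal_log_hints.
        page.fpi_logged_this_cycle = True
        wal_records = 1
    # Every write gets a fresh nonce, whether or not WAL was emitted.
    page.nonce += 1
    return wal_records


def write_with_lsn_nonce(page: Page) -> int:
    """Assumed alternative: the nonce comes from the LSN, so each write
    needs a new LSN and therefore a WAL record."""
    page.nonce += 1  # stands in for the page LSN advancing
    return 1


def wal_records_per_cycle(writes: int, write_fn) -> int:
    """Total WAL records for one page written `writes` times in one cycle."""
    page = Page()
    return sum(write_fn(page) for _ in range(writes))


if __name__ == "__main__":
    print(wal_records_per_cycle(100, write_with_explicit_nonce))  # 1
    print(wal_records_per_cycle(100, write_with_lsn_nonce))       # 100
```

In this model the gap between the two designs grows linearly with the number of times a page is evicted and rewritten per checkpoint cycle, which is why the cost is workload-dependent rather than unknown.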
If we had no idea whether something would be expensive, then I agree that it might not be worth adding complexity for it, or maybe some testing should be done first to find out. But if we know for certain that in some workloads something can be very expensive, then we had better at least talk about whether it is worth adding complexity in order to resolve the problem. And that is the situation here.

I am not even convinced that storing the nonce in the block is going to be more complex, because it seems to me that the patches I posted upthread worked out pretty cleanly. There are some things to discuss and think about there, for sure, but it is not like we are talking about inventing warp drive.

-- 
Robert Haas
EDB: http://www.enterprisedb.com