Re: BufferAlloc: don't take two simultaneous locks - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: BufferAlloc: don't take two simultaneous locks |
Date | |
Msg-id | CA+TgmoZRUPMHZfT4+53VCNsUMk-bq3cMomjpa-LP-84uBO6aQg@mail.gmail.com |
In response to | Re: BufferAlloc: don't take two simultaneous locks (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses | Re: BufferAlloc: don't take two simultaneous locks |
List | pgsql-hackers |
On Thu, Apr 14, 2022 at 10:04 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I agree that "just hope it doesn't overflow" is unacceptable.
> But couldn't you bound the number of extra entries as MaxBackends?

Yeah, possibly ... as long as it can't happen that an operation still counts against the limit after it's failed due to an error or something like that.

> FWIW, I have extremely strong doubts about whether this patch
> is safe at all.  This particular problem seems resolvable though.

Can you be any more specific?

This existing comment is surely in the running for terrible comment of the year:

    * To change the association of a valid buffer, we'll need to have
    * exclusive lock on both the old and new mapping partitions.

Anybody with a little bit of C knowledge will have no difficulty gleaning from the code which follows that we are in fact acquiring both buffer locks, but whoever wrote this (and I think it was a very long time ago) did not feel it necessary to explain WHY we will need to have an exclusive lock on both the old and new mapping partitions, or more specifically, why we must hold both of those locks simultaneously. That's unfortunate. It is clear that we need to hold both locks at some point, just because the hash table is partitioned, but it is not clear why we need to hold them both simultaneously.

It seems to me that whatever hazards exist must come from the fact that the operation is no longer fully atomic. The existing code acquires every relevant lock, then does the work, then releases locks. Ergo, we don't have to worry about concurrency because there basically can't be any. Stuff could be happening at the same time in other partitions that are entirely unrelated to what we're doing, but at the time we touch the two partitions we care about, we're the only one touching them. Now, if we do as proposed here, we will acquire one lock, release it, and then take the other lock, and that means that some operations could overlap that can't overlap today. Whatever gets broken must get broken because of that possible overlapping, because in the absence of concurrency, the end state is the same either way.

So ... how could things get broken by having these operations overlap each other? The possibility that we might run out of buffer mapping entries is one concern. I guess there's also the question of whether the collision handling is adequate: if we fail due to a collision and handle that by putting the buffer on the free list, is that OK? And what if we fail midway through and the buffer doesn't end up either on the free list or in the buffer mapping table? I think maybe that's impossible, but I'm not 100% sure that it's impossible, and I'm not sure how bad it would be if it did happen. A permanent "leak" of a buffer that resulted in it becoming permanently unusable would be bad, for sure.

But all of these issues seem relatively possible to avoid with sufficiently good coding. My intuition is that the buffer mapping table size limit is the nastiest of the problems, and if that's resolvable then I'm not sure what else could be a hard blocker. I'm not saying there isn't anything, just that I don't know what it might be.
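To make the locking question concrete, here is a heavily simplified sketch, not the actual bufmgr.c code and not the proposed patch, contrasting the existing protocol (both mapping partition locks held simultaneously) with a one-lock-at-a-time shape in which an insert collision sends the buffer back to the freelist, as discussed above. It uses the real buf_table.c, freelist.c, and lwlock.c primitives (BufTableHashCode, BufMappingPartitionLock, BufTableInsert, BufTableDelete, StrategyFreeBuffer, LWLockAcquire/LWLockRelease) but glosses over pins, usage counts, I/O, and error paths.

```c
#include "postgres.h"

#include "storage/buf_internals.h"
#include "storage/lwlock.h"

/*
 * Existing approach (sketch): lock both partitions simultaneously, in a
 * deadlock-free order, so no other backend can touch either partition
 * while the buffer's tag is being swapped.
 */
static bool
retag_buffer_two_locks(BufferDesc *buf, BufferTag *oldTag, BufferTag *newTag)
{
	uint32		oldHash = BufTableHashCode(oldTag);
	uint32		newHash = BufTableHashCode(newTag);
	LWLock	   *oldPartitionLock = BufMappingPartitionLock(oldHash);
	LWLock	   *newPartitionLock = BufMappingPartitionLock(newHash);
	bool		ok;

	/* always lock the lower-numbered partition first, to avoid deadlock */
	if (oldPartitionLock < newPartitionLock)
	{
		LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
		LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
	}
	else if (oldPartitionLock > newPartitionLock)
	{
		LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
		LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
	}
	else
		LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);	/* same partition */

	/* insert the new mapping first; transiently two entries exist */
	ok = (BufTableInsert(newTag, newHash, buf->buf_id) < 0);
	if (ok)
		BufTableDelete(oldTag, oldHash);	/* else: collision, caller backs off */

	if (oldPartitionLock != newPartitionLock)
		LWLockRelease(oldPartitionLock);
	LWLockRelease(newPartitionLock);
	return ok;
}

/*
 * Proposed shape (sketch): never hold both partition locks at once.  The
 * old mapping is removed under the old partition's lock, that lock is
 * released, and only then is the new partition locked for the insert.  A
 * collision at insert time puts the buffer back on the freelist.
 */
static bool
retag_buffer_one_lock_at_a_time(BufferDesc *buf, BufferTag *oldTag, BufferTag *newTag)
{
	uint32		oldHash = BufTableHashCode(oldTag);
	uint32		newHash = BufTableHashCode(newTag);
	LWLock	   *oldPartitionLock = BufMappingPartitionLock(oldHash);
	LWLock	   *newPartitionLock = BufMappingPartitionLock(newHash);

	LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
	BufTableDelete(oldTag, oldHash);
	LWLockRelease(oldPartitionLock);

	/* window here: other backends can act on both partitions concurrently */

	LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
	if (BufTableInsert(newTag, newHash, buf->buf_id) >= 0)
	{
		/* somebody else mapped newTag first: give our buffer back */
		LWLockRelease(newPartitionLock);
		StrategyFreeBuffer(buf);
		return false;
	}
	LWLockRelease(newPartitionLock);
	return true;
}
```

The difference that matters is the window marked in the second function: during it, neither the old nor the new mapping for this buffer exists, and other backends may be inserting or deleting entries in either partition at the same time, which is exactly the overlap that cannot happen today.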
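On the table-size question specifically, the current slop in the shared buffer lookup table is roughly one spare entry per mapping partition, which suffices only because both partition locks are held while a remap has "new entry inserted, old not yet deleted" pending. The following is an illustrative sketch, not a reviewed change, of the current sizing and of the MaxBackends-style bound Tom suggests; as noted above, that bound only works if a failed operation reliably stops counting against the limit.

```c
#include "postgres.h"

#include "miscadmin.h"				/* MaxBackends */
#include "storage/buf_internals.h"	/* InitBufTable() */
#include "storage/bufmgr.h"			/* NBuffers */
#include "storage/lwlock.h"			/* NUM_BUFFER_PARTITIONS */

static void
size_buffer_lookup_table(bool one_lock_at_a_time)
{
	if (!one_lock_at_a_time)
	{
		/*
		 * Current sizing (roughly what freelist.c's StrategyInitialize()
		 * does): one spare entry per partition is enough, because with both
		 * partition locks held at most one remap per partition can have an
		 * extra entry pending.
		 */
		InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);
	}
	else
	{
		/*
		 * With one-lock-at-a-time retagging, the transient extra entries are
		 * instead bounded by how many backends can be mid-remap at once --
		 * the MaxBackends bound suggested in the quoted message.
		 */
		InitBufTable(NBuffers + MaxBackends);
	}
}
```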
To put all this another way, suppose that we threw out the way we do buffer allocation today and always allocated from the freelist. If the freelist is found to be empty, the backend wanting a buffer has to do some kind of clock sweep to populate the freelist with >=1 buffers, and then try again. I don't think that would be performant or fair, because it would probably happen frequently that a buffer some backend had just added to the free list got stolen by some other backend, but I think it would be safe, because we already put buffers on the freelist when relations or databases are dropped, and we allocate from there just fine in that case. So then why isn't this safe? It's functionally the same thing, except we (usually) skip over the intermediate step of putting the buffer on the freelist and taking it off again.
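A minimal sketch of that thought experiment follows; try_pop_freelist and run_clock_sweep_and_refill are hypothetical helper names for illustration, not existing freelist.c functions.

```c
#include "postgres.h"

#include "storage/buf_internals.h"

/* Hypothetical helpers, assumed for illustration only. */
extern BufferDesc *try_pop_freelist(void);
extern void run_clock_sweep_and_refill(int want);

/*
 * Thought-experiment allocation loop: every buffer comes off the freelist,
 * and a backend that finds the freelist empty runs the clock sweep just to
 * refill it, then retries.
 */
static BufferDesc *
get_buffer_freelist_only(void)
{
	for (;;)
	{
		BufferDesc *buf = try_pop_freelist();

		if (buf != NULL)
			return buf;			/* caller still evicts/retags it as usual */

		/*
		 * Freelist empty: advance the clock sweep until at least one
		 * low-usage-count buffer has been pushed back onto the freelist.
		 * Another backend may steal it before we loop around -- the
		 * fairness/performance concern above -- but that is not a
		 * correctness problem; we just sweep again.
		 */
		run_clock_sweep_and_refill(1);
	}
}
```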
--
Robert Haas
EDB: http://www.enterprisedb.com