Re: Changing shared_buffers without restart

From Tomas Vondra
Subject Re: Changing shared_buffers without restart
Msg-id 2781dd2b-8ff1-41a0-a68c-3fe2409b8644@vondra.me
In response to Re: Changing shared_buffers without restart  (Dmitry Dolgov <9erthalion6@gmail.com>)
List pgsql-hackers
On 7/4/25 16:41, Dmitry Dolgov wrote:
>> On Fri, Jul 04, 2025 at 02:06:16AM +0200, Tomas Vondra wrote:
>> I took a look at this patch, because it's somewhat related to the NUMA
>> patch series I posted a couple days ago, and I've been wondering if
>> it makes some of the NUMA stuff harder or simpler.
> 
> Thanks a lot for the review! That's plenty of feedback, and it will
> probably take me some time to answer all of it, but I want to address
> a couple of the most important topics right away.
> 
>> But I'm getting a bit lost in how exactly this interacts with things
>> like overcommit, system memory accounting / OOM killer and this sort of
>> stuff. I went through the thread and it seems to me the reserve+map
>> approach works OK in this regard (and the messages on linux-mm seem to
>> confirm this). But this information is scattered over many messages and
>> it's hard to say for sure, because some of this might be relevant for
>> an earlier approach, or a subtly different variant of it.
>>
>> A similar question is portability. The comments and commit messages
>> seem to suggest most of this is linux-specific, and other platforms just
>> don't have these capabilities. But there's a bunch of messages (mostly
>> by Thomas Munro) that hint FreeBSD might be capable of this too, even if
>> to some limited extent. And possibly even Windows/EXEC_BACKEND, although
>> that seems much trickier.
>>
>> [...]
>>
>> So I think it'd be very helpful to write a README, explaining the
>> current design/approach, and summarizing all these aspects in a single
>> place. Including things like portability, interaction with the OS
>> accounting, OOM killer, this kind of stuff. Some of this stuff may be
>> already mentioned in code comments, but it's hard to find those.
>>
>> Especially worth documenting are the states the processes need to go
>> through (using the barriers), and the transitions between them (i.e.
>> what is allowed in each phase, what blocks can be visible, etc.).
> 
> Agree, I'll add a comprehensive README in the next version. Note that
> on the topic of portability, the latest version implements a new
> approach suggested by Thomas Munro, which reduces the problematic parts
> to memfd_create only. That call is documented as Linux-specific, but
> AFAICT it has FreeBSD counterparts.
> 

OK. It's not entirely clear to me if this README should be temporary, or
if it should eventually get committed. I'd probably vote to have a
proper README explaining the basic design / resizing processes etc. It
probably should not discuss portability in too much detail, though, as
that can get stale pretty quickly.

>> 1) no user docs
>>
>> There are no user .sgml docs, and maybe it's time to write some,
>> explaining how to use this thing - how to configure it, how to trigger
>> the resizing, etc. It took me a while to realize I need to do ALTER
>> SYSTEM + pg_reload_conf() to kick this off.
>>
>> It should also document the user-visible limitations, e.g. what activity
>> is blocked during the resizing, etc.
> 
> While the user interface is still under discussion, I agree, it makes
> sense to capture this information in sgml docs.
> 

Yeah. Spelling out the "official" way to use something is helpful.

>> 2) pending GUC changes
>>
>> [...]
>>
>> It also seems a bit strange that the "switch" gets to be driven by a
>> randomly selected backend (unless I'm misunderstanding this bit). It
>> seems to be true for the buffer eviction during shrinking, at least.
> 
> The resize itself is coordinated by the postmaster alone, not by a
> randomly selected backend. But it looks like buffer eviction can indeed
> happen anywhere, which is what we were discussing in the previous
> messages.
> 
>> Perhaps this should be a separate utility command, or maybe even just
>> a new ALTER SYSTEM variant? Or even just a function, similar to what
>> the "online checksums" patch did, possibly combined with a bgworder
>> (but probably not needed, there are no db-specific tasks to do).
> 
> This is one topic we're still actively discussing, but haven't had much
> feedback otherwise. The pros and cons seem to be clear:
> 
> * Utilizing the existing GUC mechanism would allow treating
>   shared_buffers as any other configuration, meaning that potential
>   users of this feature don't have to do anything new to use it -- they
>   still can use whatever method they prefer to apply new configuration
>   (pg_reload_conf, pg_ctl reload, maybe even sending SIGHUP directly).
> 
>   I'm also wondering if it's only shared_buffers, or whether some other
>   options could use a similar approach.
> 

I don't know. What are the "potential users" of this feature? I don't
recall any, but there may be some. How do we know the new pending flag
will work for them too?

> * Having a separate utility command is a mighty simplification, which
>   helps avoid the problems you've described above.
> 
> So far we've got two against one in favour of a simple utility command, so
> we can as well go with that.
> 

Not sure voting is a good way to make design decisions ...

>> 3) max_available_memory
>>
>> Speaking of GUCs, I dislike how max_available_memory works. It seems a
>> bit backwards to me. I mean, we're specifying shared_buffers (and some
>> other parameters), and the system calculates the amount of shared memory
>> needed. But here the GUC determines the total limit instead?
> 
> The reason it's so backwards is that it's coming from the need to
> specify how much memory we would like to reserve, and what would be the
> upper boundary for increasing shared_buffers. My intention is eventually
> to get rid of this GUC and figure out its value at runtime as a function of
> the total available memory.
> 

I understand why it's like this. It's simple, and people do want to
limit the memory the instance will allocate. That's understandable. The
trouble is it makes it very unclear what the implied limit on shared
buffers size is. Maybe if there was a sensible way to expose that, we
could keep the max_available_memory.

But I don't think you can get rid of the GUC, at least not entirely. You
need to leave some memory aside for queries, people may start multiple
instances at once, ...

>> I think the GUC should specify the maximum shared_buffers we want to
>> allow, and then we'd work out the total to pre-allocate? Considering
>> we're only allowing to resize shared_buffers, that should be pretty
>> trivial. Yes, it might happen that the "total limit" happens to exceed
>> the available memory or something, but we already have the problem
>> with shared_buffers. Seems fine if we explain this in the docs, and
>> perhaps print the calculated memory limit on start.
> 
> Somehow I'm not following what you suggest here. You mean having the
> maximum shared_buffers specified, but not as a separate GUC?
> 

My suggestion was to have a GUC max_shared_buffers. Based on that you
can easily calculate the size of all other segments dependent on
NBuffers, and reserve memory for that.
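
Something like this, roughly (a back-of-the-envelope sketch, not from
the patch -- the helper name is made up, and I'm only counting the
NBuffers-scaled pieces bufmgr allocates today):

#include "postgres.h"
#include "storage/buf_internals.h"
#include "storage/shmem.h"

static Size
ReservationForMaxSharedBuffers(int max_nbuffers)
{
    Size        total = 0;

    /* buffer blocks themselves */
    total = add_size(total, mul_size(max_nbuffers, BLCKSZ));
    /* buffer descriptors */
    total = add_size(total, mul_size(max_nbuffers, sizeof(BufferDescPadded)));
    /* buffer I/O condition variables */
    total = add_size(total, mul_size(max_nbuffers,
                                     sizeof(ConditionVariableMinimallyPadded)));
    /* checkpoint sort array */
    total = add_size(total, mul_size(max_nbuffers, sizeof(CkptSortItem)));

    return total;
}

That ignores alignment padding and the strategy/freelist bits, but the
point is that all of it is a simple function of the maximum NBuffers.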

>> 4) SHMEM_RESIZE_RATIO
>>
>> The SHMEM_RESIZE_RATIO thing seems a bit strange too. There's no way
>> these ratios can make sense. For example, BLCKSZ is 8192 but the buffer
>> descriptor is 64B. That's a 128x difference, but the ratios say 0.6 and
>> 0.1, so 6x. Sure, we'll actually allocate only the memory we need, and
>> the rest is only "reserved".
> 
> SHMEM_RESIZE_RATIO is a temporary hack, waiting for a more decent
> solution, nothing more. I probably have to mention that in the
> comments.
> 

OK

>> Moreover, all of the above is for mappings sized based on NBuffers. But
>> if we allocate 10% for MAIN_SHMEM_SEGMENT, won't that be a problem the
>> moment someone increases max_connections, max_locks_per_transaction,
>> and possibly some other stuff?
> 
> Can you elaborate on what you mean by that? Increasing max_connections
> etc. leads to increased memory consumption in the MAIN_SHMEM_SEGMENT,
> but the ratio is for memory reservation only.
> 

Stuff like PGPROC, fast-path locks etc. are allocated as part of
MAIN_SHMEM_SEGMENT, right? Yet the ratio assigns 10% of the maximum
space for that. If I significantly increase GUCs like max_connections or
max_locks_per_transaction, how do you know it won't exceed the 10%?

>> 5) no tests
>>
>> I mentioned no "user docs", but the patch has 0 tests too. Which seems
>> a bit strange for a patch of this age.
>>
>> A really serious part of the patch series seems to be the coordination
>> of processes when going through the phases, enforced by the barriers.
>> This seems like a perfect match for testing using injection points, and
>> I know we did something like this in the online checksums patch, which
>> needs to coordinate processes in a similar way.
> 
> That's exactly what we've been discussing recently: figuring out how to
> use injection points for testing. Keep in mind that the scope of this
> work turned out to be huge, and with just two people on board we're
> addressing one thing at a time.
> 

Sure.

>> But even just a simple TAP test that does a bunch of (random?) resizes
>> while running pgbench seems better than no tests. (That's what I did
>> manually, and it crashed right away.)
> 
> This is the type of testing I was doing before posting the series. I
> assume you've crashed it on buffer shrinking, since you've got SIGBUS,
> which would indicate that the memory is not available anymore. Before
> we go into debugging, just to be on the safe side, could you confirm
> you were testing the latest patch version (there are some signs that
> it's not the case, more about that later)?
> 

Maybe, I don't remember. But I also see crashes while expanding the
buffers, with an assert failure here:

#4  0x0000556f159c43d1 in ExceptionalCondition
(conditionName=0x556f15c00e00 "node->prev != INVALID_PROC_NUMBER ||
list->head == procno", fileName=0x556f15c00ce0
"../../../../src/include/storage/proclist.h", lineNumber=163) at assert.c:66
#5  0x0000556f157a9831 in proclist_contains_offset (list=0x7f296333ce24,
procno=140, node_offset=100) at
../../../../src/include/storage/proclist.h:163
#6  0x0000556f157a9add in ConditionVariableTimedSleep
(cv=0x7f296333ce20, timeout=-1, wait_event_info=134217782) at
condition_variable.c:184
#7  0x0000556f157a99c9 in ConditionVariableSleep (cv=0x7f296333ce20,
wait_event_info=134217782) at condition_variable.c:98
#8  0x0000556f157902df in BarrierArriveAndWait (barrier=0x7f296333ce08,
wait_event_info=134217782) at barrier.c:191
#9  0x0000556f156d1226 in ProcessBarrierShmemResize
(barrier=0x7f296333ce08) at pg_shmem.c:1201


>> 10) what to do about stuck resize?
>>
>> AFAICS the resize can get stuck for various reasons, e.g. because it
>> can't evict pinned buffers, possibly indefinitely. Not great, it's not
>> clear to me if there's a way out (canceling the resize) after a timeout,
>> or something like that? Not great to start an "online resize" only to
>> get stuck with all activity blocked for indefinite amount of time, and
>> get to restart anyway.
>>
>> Seems related to Thomas' message [2], but AFAICS the patch does not do
>> anything about this yet, right? What's the plan here?
> 
> It's another open discussion right now, with an idea to eventually allow
> canceling after a timeout. I think canceling when stuck on buffer
> eviction should be pretty straightforward (the eviction must take place
> before the actual shared memory resize, so we know nothing has changed
> yet), but in some other failure scenarios it would be harder (e.g. if
> one backend is stuck resizing while others have succeeded -- this would
> require another round of synchronization, and some way to figure out
> what the current status is).
> 

I think it'll be crucial to structure it so that it can't get stuck
while resizing.

>> 11) preparatory actions?
>>
>> Even if it doesn't get stuck, some of the actions can take a while, like
>> evicting dirty buffers before shrinking, etc. This is similar to what
>> happens on restart, when the shutdown checkpoint can take a while, while
>> the system is (partly) unavailable.
>>
>> The common mitigation is to do an explicit checkpoint right before the
>> restart, to make the shutdown checkpoint cheap. Could we do something
>> similar for the shrinking, e.g. flush buffers from the part to be
>> removed before actually starting the resize?
> 
> Yeah, that's a good idea, we will try to explore it.
> 
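
FWIW, a sketch of the kind of pre-flush I had in mind, reusing
EvictUnpinnedBuffer() with the signature it has in v17 (function name
made up; a real version would probably want a flush-only variant that
keeps the buffers usable until the actual resize):

#include "postgres.h"
#include "storage/bufmgr.h"

static void
PreflushShrinkRange(int new_nbuffers)
{
    /* buffer IDs are 1-based, so buffers new_nbuffers+1 .. NBuffers go */
    for (Buffer buf = new_nbuffers + 1; buf <= NBuffers; buf++)
        (void) EvictUnpinnedBuffer(buf);
}

Pinned buffers would still be left behind, but the blocking phase then
only has to deal with those stragglers, much like the explicit
checkpoint before a restart.
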
>> 12) does this affect e.g. fork() costs?
>>
>> I wonder if this affects the cost of fork() in some undesirable way?
>> Could it make fork() measurably more expensive?
> 
> The number of new mappings is quite limited, so I would not expect that.
> But I can measure the impact.
> 
>> 14) interesting messages from the thread
>>
>> While reading through the thread, I noticed a couple messages that I
>> think are still relevant:
> 
> Right, I'm aware there is a lot of not yet addressed feedback, even more
> than you've mentioned below. None of this feedback was ignored, we're
> just solving large problems step by step. So far the focus was on how to
> do memory reservation and coordinate the resize, and everybody is more
> than welcome to join. But thanks for collecting the list, I probably
> need to start tracking what was addressed and what was not.
> 
>> - Robert asked [5] if Linux might abruptly break this, but I find that
>>   unlikely. We'd point out we rely on this, and they'd likely rethink.
>>   This would be made safer if this was specified by POSIX - taking that
>>   away once implemented seems way harder than for custom extensions.
>>   It's likely they'd not take away the feature without an alternative
>>   way to achieve the same effect, I think (yes, harder to maintain).
>>   Tom suggests [7] this is not in POSIX.
> 
> This conversation was related to the original implementation, which was
> based on mremap and slicing of mappings. As I've mentioned, the new
> approach doesn't have most of those controversial points; it uses
> memfd_create and a regular mmap -- I don't see any of those
> changing their behavior any time soon.
> 
>> - Andres had an interesting comment about how overcommit interacts with
>>   MAP_NORESERVE. AFAIK it means we need the flag to not break overcommit
>>   accounting. There are also some comments from the linux-mm people [9].
> 
> The new implementation uses MAP_NORESERVE for the mapping.
> 
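
Good. Just to make sure we're on the same page, this is how I
understand the reserve+map approach, as a stand-alone sketch (Linux
and glibc assumed, sizes made up, error handling trimmed):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define RESERVED_SIZE  ((size_t) 64 << 30)  /* address-space upper bound */
#define INITIAL_SIZE   ((size_t) 8 << 30)   /* actually backed at start */

int
main(void)
{
    /* anonymous in-memory file; its size drives the memory accounting */
    int     fd = memfd_create("shmem_sketch", MFD_CLOEXEC);

    ftruncate(fd, INITIAL_SIZE);

    /*
     * Map the whole maximum range once. MAP_NORESERVE keeps the unused
     * tail out of overcommit accounting; touching pages past the file
     * size yields SIGBUS, hence the need for coordinated resizing.
     */
    char   *base = mmap(NULL, RESERVED_SIZE, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_NORESERVE, fd, 0);

    /* growing (or shrinking) is then just resizing the backing file */
    ftruncate(fd, 2 * INITIAL_SIZE);

    return base == MAP_FAILED ? 1 : 0;
}

If I got that right, this is exactly the kind of thing the README
should spell out.
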
>> - There seem to be some issues with releasing memory backing a mapping
>>   with hugetlb [10]. With the fd (and truncating the file), this seems
>>   to release the memory, but it's linux-specific? But most of this stuff
>>   is specific to linux, it seems. So is this a problem? With this it
>>   should be working even for hugetlb ...
> 
> Again, the new implementation got rid of the problematic bits here, and I
> haven't found any weak points related to hugetlb in testing so far.
> 
>> - It seems FreeBSD has MFD_HUGETLB [11], so maybe we could use this and
>>   make the hugetlb stuff work just like on Linux? Unclear. Also, I
>>   thought the mfd stuff is linux-specific ... or am I confused?
> 
> Yep, probably.
> 
>> - Thomas asked [13] why we need to stop all the backends, instead of
>>   just waiting for them to acknowledge the new (smaller) NBuffers value
>>   and then let them continue. I also don't quite see why this should
>>   not work, and it'd limit the disruption when we have to wait for
>>   eviction of buffers pinned by paused cursors, etc.
> 
> I think I've replied to that one, the idea so far was to eliminate any
> chance of accessing to-be-truncated buffers and make it easier to reason
> about correctness of the implementation this way. I don't see any other
> way to prevent backends from accessing buffers that may disappear
> without adding overhead on the read path, but if you folks have some
> ideas -- please share!
> 
>> v5-0001-Process-config-reload-in-AIO-workers.patch
>>
>> 1) Hmmm, so which other workers may need such explicit handling? Do all
>>    other processes participate in procsignal stuff, or does anything
>>    need an explicit handling?
> 
> So far I've noticed the issue only with io_workers and the checkpointer.
> 
>> v5-0003-Introduce-pss_barrierReceivedGeneration.patch
>>
>> 1) Do we actually need this? Isn't it enough to just have two barriers?
>>    Or a barrier + condition variable, or something like that.
> 
> The issue with two barriers is that they do not prevent disjoint groups,
> i.e. one backend joins the barrier, finishes the work and detaches from
> the barrier, then another backend joins. I'm not familiar with how this
> was solved for the online checksums patch though, will take a look.
> Having a barrier and a condition variable would be possible, but it's
> hard to figure out how many backends to wait for. All in all, a small
> extension to the ProcSignalBarrier feels much more elegant to me.
> 
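
That'd be nice to capture in the README too. For my own understanding,
this is how I read the intended use (names from my reading of v5-0003,
they may not match the latest version):

    uint64  gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SHMEM_RESIZE);

    /* every backend has observed the signal; the participant set is fixed */
    WaitForProcSignalBarrierReceived(gen);

    /* ... run the coordinated resize phases ... */

    /* the usual "everyone finished processing" wait */
    WaitForProcSignalBarrier(gen);

So the "received" generation is what draws the cut-off line that two
plain barriers can't.
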
>> 2) The comment talks about "coordinated way" when processing messages,
>>    but it's not very clear to me. It should explain what is needed and
>>    not possible with the current barrier code.
> 
> Yeah, I need to work on the comments across the patch. Here in
> particular it means any coordinated way, whatever that could be. I can
> add an example to clarify that part.
> 
>> v5-0004-Allow-to-use-multiple-shared-memory-mappings.patch
> 
> Most of the comments here and in the following patches are obviously
> reasonable and I'll incorporate them into the next version.
> 
>> 5) I'm a bit confused about the segment/mapping difference. The patch
>>    seems to randomly mix those, or maybe I'm just confused. I mean,
>>    we are creating just shmem segment, and the pieces are mappings,
>>    right? So why do we index them by "shmem_segment"?
> 
> Indeed, the patch uses "segment" and "mapping" interchangeably, I need
> to tighten it up. The relation is still one-to-one, thus there are
> multiple segments as well as mappings.
> 
>> 7) We should remember which segments got to use huge pages and which
>>    did not. And we should make it optional for each segment. Although,
>>    maybe I'm just confused about the "segment" definition - if we only
>>    have one, that's where huge pages are applied.
>>
>>    If we could have multiple segments for different segments (whatever
>>    that means), not sure what we'll report for cases when some segments
>>    get to use huge pages and others don't.
> 
> Exactly to avoid solving this, I've consciously decided to postpone
> implementing the possibility to mix huge and regular pages. Any
> opinions on whether the single reported value should be removed, and
> this information instead represented as part of an informational view
> about shared memory (the one you were suggesting in this thread)?
> 
>> 11) Actually, what's the difference between the contents of Mappings
>>     and Segments? Isn't that the same thing, indexed in the same way?
>>     Or could it be unified? Or are they conceptually different thing?
> 
> Unless I'm mixing something up badly, the content is the same. The
> relation is that a segment, as a structure, "contains" a mapping.
> 

Then why do we need to track it in two places? Doesn't it just increase
the likelihood that someone misses updating one of them?

>> v5-0005-Address-space-reservation-for-shared-memory.patch
>>
>> 1) Shouldn't reserved_offset and huge_pages_on really be in the segment
>>    info? Or maybe even in mapping info? (again, maybe I'm confused
>>    about what these structs store)
> 
> I don't think there is a reserved_offset variable in the latest version
> anymore -- can you please confirm you're using it, rather than the one
> I've posted in April?
> 
>> 3) So ReserveAnonymousMemory is what makes decisions about huge pages,
>>    for the whole reserved space / all segments in it. That's a bit
>>    unfortunate with respect to the desirability of some segments
>>    benefiting from huge pages and others not. Maybe we should have two
>>    "reserved" areas, one with huge pages, one without?
> 
> Again, there is no ReserveAnonymousMemory anymore, the new approach is
> to reserve the memory via separate mappings.
> 

Will check. These may indeed be stale comments, from looking at the
earlier version of the patch (the last one from Ashutosh).

>>    I guess we don't want too many segments, because that might make
>>    fork() more expensive, etc. Just guessing, though. Also, how would
>>    this work with threading?
> 
> I assume multithreading will render shared memory unnecessary, favoring
> some other type of memory usage, but the mechanism around it could
> still be the same.
> 
>> 5) The general approach seems sound to me, but I'm not expert on this.
>>    I wonder how portable this behavior is. I mean, will it work on other
>>    Unix systems / Windows? Is it POSIX, or a Linux extension?
> 
> Don't know yet, it's a topic for investigation.
> 
>> v5-0006-Introduce-multiple-shmem-segments-for-shared-buff.patch
>>
>> 2) In fact, what happens if the user tries to resize to a value that is
>>    too large for one of the segments? How would the system know before
>>    starting the resize (and failing)?
> 
> This type of situation is handled (with a hard stop) in the latest
> version, because all the necessary information is present in the mapping
> structure.
> 

I don't know, but crashing the instance (I assume that's what you mean
by hard stop) does not seem like something we want to do. AFAIK the GUC
hook should be able to determine if the value is too large, and reject
it at that point. Not proceed and crash everything.
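
Something along these lines, I mean (a sketch -- the hook body is mine,
and max_nbuffers_reserved is a hypothetical value derived from the
sizes of the reserved mappings):

#include "postgres.h"
#include "utils/guc.h"

/* hypothetical: derived at startup from the reserved mapping sizes */
extern int max_nbuffers_reserved;

static bool
check_shared_buffers(int *newval, void **extra, GucSource source)
{
    if (*newval > max_nbuffers_reserved)
    {
        GUC_check_errdetail("\"shared_buffers\" value %d exceeds the maximum %d "
                            "supported by the reserved address space.",
                            *newval, max_nbuffers_reserved);
        return false;
    }
    return true;
}

The reload then simply fails with a sensible error message, instead of
taking the instance down.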

>> v5-0007-Allow-to-resize-shared-memory-without-restart.patch
>>
>> 1) Why would AdjustShmemSize be needed? Isn't that a sign of a bug
>>    somewhere in the resizing?
> 
> When coordination with barriers kicks in, there is a cut-off line after
> which any newly spawned backend will not be able to take part in it
> (e.g. because it was too slow to init the ProcSignal infrastructure).
> AdjustShmemSize is used to handle these cases.
> 
>> 2) Isn't the pg_memory_barrier() in CoordinateShmemResize a bit weird?
>>    Why is it needed, exactly? If it's to flush stuff for processes
>>    consuming EmitProcSignalBarrier, it's that too late? What if a
>>    process consumes the barrier between the emit and memory barrier?
> 
> I think it's not needed, just a leftover from earlier code modifications.
> 
>> v5-0008-Support-shrinking-shared-buffers.patch
>> v5-0009-Reinitialize-StrategyControl-after-resizing-buffe.patch
>> v5-0010-Additional-validation-for-buffer-in-the-ring.patch
> 
> This reminds me I still need to review those, so Ashutosh probably can
> answer those questions better than I.



-- 
Tomas Vondra



