Re: Proposal for fixing intra-query memory leaks - Mailing list pgsql-hackers
From | Bruce Momjian |
---|---|
Subject | Re: Proposal for fixing intra-query memory leaks |
Date | |
Msg-id | 200006130015.UAA12057@candle.pha.pa.us |
Responses | Re: Proposal for fixing intra-query memory leaks |
List | pgsql-hackers |
FYI, Tom, is this still relevant?

> This issue seems to have been on the back burner for a while,
> but I think we need to put it on the front burner again for 7.1.
> Here is a think-piece I just did.  I'd appreciate comments,
> particularly about possible interactions with TOAST --- Jan,
> did you have any particular plan in mind for freeing datums created
> by de-TOASTing?
>
>                       regards, tom lane
>
>
> Proposal for memory allocation fixes                  29-Apr-2000
> ------------------------------------
>
> We know that Postgres has serious problems with memory leakage during
> large queries that process a lot of pass-by-reference data.  There is
> no provision for recycling memory until end of query.  This needs to be
> fixed, even more so with the advent of TOAST which will allow very
> large chunks of data to be passed around in the system.  Furthermore,
> 7.1 is an ideal time for fixing it since TOAST and the function-manager
> interface changes will require visiting a lot of the same code that needs
> to be cleaned up.  So, here is a proposal.
>
>
> Background
> ----------
>
> We already do most of our memory allocation in "memory contexts", which
> are usually AllocSets as implemented by backend/utils/mmgr/aset.c.
> (Is there any value in allowing for other memory context types?  We could
> save some cycles by getting rid of a level of indirection here.)  What
> we need to do is create more contexts and define proper rules about when
> they can be freed.
>
> The basic operations on a memory context are:
>
> * create a context
>
> * delete a context (including freeing all the memory allocated therein)
>
> * reset a context (free all memory allocated in the context, but not the
>   context object itself)
>
> Given a context, one can allocate a chunk of memory within it, free a
> previously allocated chunk, or realloc a previously allocated chunk larger
> or smaller.  (These operations correspond directly to standard C's
> malloc(), free(), and realloc() routines.)
> At all times there is a
> "current" context denoted by the CurrentMemoryContext global variable.
> The backend macros palloc(), pfree(), repalloc() implicitly allocate space
> in that context.  The MemoryContextSwitchTo() operation selects a new
> current context (and returns the previous context, so that the caller can
> restore the previous context before exiting).
>
> Note: there is no really good reason for pfree() to be tied to the current
> memory context; it ought to be possible to pfree() a chunk of memory no
> matter which context it was allocated from.  Currently we cannot do that
> because of the possibility that there is more than one kind of memory
> context.  If they were all AllocSets then the problem goes away, which is
> one reason I'd like to eliminate the provision for other kinds of
> contexts.
>
> The main advantage of memory contexts over plain use of malloc/free is
> that the entire contents of a memory context can be freed easily, without
> having to request freeing of each individual chunk within it.  This is
> both faster and more reliable than per-chunk bookkeeping.  We already use
> this fact to clean up at transaction end: by resetting all the active
> contexts, we reclaim all memory.  What we need are additional contexts
> that can be reset or deleted at strategic times within a query, such as
> after each tuple.
>
>
> Additions to the memory-context mechanism
> -----------------------------------------
>
> If we are going to have more contexts, we need more mechanism for keeping
> track of them; else we risk leaking whole contexts under error conditions.
> We can do this as follows:
>
> 1. There will be two kinds of contexts, "permanent" and "temporary".
> Permanent contexts are never reset or deleted except by explicit caller
> command (in practice, they probably won't ever be, period).  There will
> not be very many of these --- perhaps only the existing TopMemoryContext
> and CacheMemoryContext.
> We should avoid having very much code run with
> CurrentMemoryContext pointing at a permanent context, since any forgotten
> palloc() represents a permanent memory leak.
>
> 2. Temporary contexts are remembered by the context manager and are
> guaranteed to be deleted at transaction end.  (If we ever have nested
> transactions, we'd probably want to tie each temporary context to a
> particular transaction, but for now that's not necessary.)  Most activity
> will happen in temporary contexts.
>
> 3. When a context is created, an existing context can be specified as its
> parent; thus a tree of contexts is created.  Resetting or deleting any
> particular context resets or deletes all its direct and indirect children
> as well.  This feature allows us to manage a lot of contexts without fear
> that some will be leaked; we just have to make sure everything descends
> from one context that we remember to zap at transaction end.
>
> In practice, point #2 doesn't require any special support in the context
> manager as long as it supports point #3.  We simply start a new context
> for each transaction and delete it at transaction end.  All temporary
> contexts created within the transaction must be direct or indirect
> children of this "transaction top context".
>
> Note: it would probably be possible to adapt the existing "portal" memory
> management mechanism to do what we need.  I am instead proposing setting
> up a totally new mechanism, because the portal code strikes me as
> extremely crufty and unwieldy.  It may be that we can eventually remove
> portals entirely, or perhaps reimplement them with this mechanism
> underneath.
>
>
> Top-level (permanent) memory contexts
> -------------------------------------
>
> We currently have TopMemoryContext and CacheMemoryContext as permanent
> memory contexts.  The existing usages of these are probably OK, although
> it might be a good idea to examine usages of TopMemoryContext to see if
> they should go somewhere else.
>
> It might also be a good idea to set up a permanent ErrorMemoryContext that
> elog() can switch into for processing an error; this would ensure that
> there is at least ~8K of memory available for error processing, even if
> we've run out otherwise.  (ErrorMemoryContext could be reset, but not
> deleted, after each successful error recovery.)
>
> We will also create a global variable TransactionTopMemoryContext, which
> is valid at all times.  Memory recovery at end of transaction is done by
> deleting and immediately recreating this context.  All transaction-local
> contexts are created as children of TransactionTopMemoryContext, so that
> they go away at transaction end too.  (If we implement nested
> transactions, it could be that TransactionTopMemoryContext will itself be
> a child of some outer transaction's top context, but that's beyond the
> scope of this proposal.)
>
>
> Transaction-local memory contexts
> ---------------------------------
>
> Relatively little stuff should get allocated directly in
> TransactionTopMemoryContext; the bulk of the action should happen in
> sub-contexts.  I propose the following:
>
> QueryTopMemoryContext: this child of TransactionTopMemoryContext is
> created at the start of each query cycle and deleted upon successful
> completion.  (On error, of course, it goes away because it is a child of
> TransactionTopMemoryContext.)  The query input buffer is allocated in this
> context, as well as anything else that should live just till end of query.
>
> ParsePlanMemoryContext: this child of QueryTopMemoryContext is working
> space for the parse/rewrite/plan/optimize pipeline.  After completion
> of planning, the final query plan is copied via copyObject() back into
> QueryTopMemoryContext, and then the ParsePlanMemoryContext can be deleted.
> This allows us to recycle the (perhaps large) amount of memory used by
> planning before actual query execution starts.
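
The parent/child rules quoted above (create with a parent, reset or delete cascading to all descendants) can be sketched as a toy model in C. This is only an illustration under stated assumptions: ToyContext, toy_create, toy_alloc, toy_reset, toy_delete and toy_live_chunks are hypothetical names, each chunk is a plain malloc() block on a per-context list, and none of this reflects how aset.c is actually implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical chunk header; the union keeps user data maximally aligned. */
typedef union Chunk {
    union Chunk *next;
    max_align_t  align;
} Chunk;

/* Hypothetical context node: a list of chunks plus links forming a tree. */
typedef struct ToyContext {
    struct ToyContext *parent;
    struct ToyContext *first_child;
    struct ToyContext *next_sibling;
    Chunk             *chunks;   /* every live allocation in this context */
    size_t             nchunks;
} ToyContext;

/* Create a context; a non-NULL parent makes it part of that parent's tree. */
ToyContext *toy_create(ToyContext *parent)
{
    ToyContext *cx = calloc(1, sizeof *cx);
    if (cx == NULL)
        abort();
    cx->parent = parent;
    if (parent != NULL) {
        cx->next_sibling = parent->first_child;
        parent->first_child = cx;
    }
    return cx;
}

/* Allocate inside a context; the chunk is remembered so it can be bulk-freed. */
void *toy_alloc(ToyContext *cx, size_t size)
{
    Chunk *c = malloc(sizeof(Chunk) + size);
    if (c == NULL)
        abort();
    c->next = cx->chunks;
    cx->chunks = c;
    cx->nchunks++;
    return c + 1;
}

static void free_chunks(ToyContext *cx)
{
    while (cx->chunks != NULL) {
        Chunk *next = cx->chunks->next;
        free(cx->chunks);
        cx->chunks = next;
    }
    cx->nchunks = 0;
}

/* Reset: free all memory in cx and its descendants, keep the context objects. */
void toy_reset(ToyContext *cx)
{
    for (ToyContext *ch = cx->first_child; ch != NULL; ch = ch->next_sibling)
        toy_reset(ch);
    free_chunks(cx);
}

/* Delete: free cx, all its memory, and all its direct and indirect children. */
void toy_delete(ToyContext *cx)
{
    while (cx->first_child != NULL)
        toy_delete(cx->first_child);      /* each call unlinks that child */
    free_chunks(cx);
    if (cx->parent != NULL) {
        ToyContext **link = &cx->parent->first_child;
        while (*link != cx)
            link = &(*link)->next_sibling;
        *link = cx->next_sibling;
    }
    free(cx);
}

/* Count live chunks in cx and everything below it (for inspection only). */
size_t toy_live_chunks(const ToyContext *cx)
{
    size_t n = cx->nchunks;
    for (const ToyContext *ch = cx->first_child; ch != NULL; ch = ch->next_sibling)
        n += toy_live_chunks(ch);
    return n;
}
```

With this model, a transaction-top context with a query child and a parse/plan grandchild behaves as the proposal describes: deleting the parse/plan context releases only the planner's scratch memory, and deleting or resetting the transaction-top context sweeps up everything beneath it without any per-chunk bookkeeping by the callers.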
>
> Execution per-run memory contexts: at startup, the executor will create a
> child of QueryTopMemoryContext to hold data that should live until
> ExecEndPlan; an example is the plan-node-local execution state.  Some plan
> node types may want to create shorter-lived contexts that are children of
> their parent's per-run context.  For example, a subplan node would create
> its own "per run" context so that memory could be freed at completion of
> each invocation of the subplan.
>
> Execution per-tuple memory contexts: each per-run context will have a
> child context that the executor will reset (not delete) each time through
> the node's per-tuple loop.  This per-tuple context will be the active
> CurrentMemoryContext most of the time during execution.
>
> By resetting the per-tuple context, we will be able to free memory after
> each tuple is processed, rather than only after the whole plan is
> processed.  This should solve our memory leakage problems pretty well;
> yet we do not need to add very much new bookkeeping logic to do it.
> In particular, we do *not* need to try to keep track of individual values
> palloc'd during expression evaluation.
>
> Note we assume that resetting a context is a cheap operation.  This is
> true already, and we can make it even more true with a little bit of
> tuning in aset.c.
>
>
> Coding rules required
> ---------------------
>
> Functions that return pass-by-reference values will be required always
> to palloc the returned space in the caller's memory context (ie, the
> context that was CurrentMemoryContext at the time of call).  It is not
> OK to pass back an input pointer, even if we are returning an input value
> verbatim, because we do not know the lifespan of the context the input
> pointer points to.  An example showing why this is necessary is provided
> by aggregate-function execution.  The aggregate function executor must
> retain state values returned by state-transition functions from one tuple
> to the next.
> Yet it does not want to keep them till end of run; that
> would be a memory leak.  The solution nodeAgg.c will use is to have two
> per-tuple memory contexts that are used alternately.  At each tuple,
> an old state value existing in one context is passed to the state
> transition function, which will return its result in the other context
> (since that'll be where CurrentMemoryContext points).  Then the first
> context is reset and used as the target for the next cycle.  This solution
> works as long as the transition function always returns a newly palloc'd
> datum, and never simply returns a pointer to its input data.
>
> Thus, a function must use the passed-in CurrentMemoryContext for
> allocating its result data, and can use it for any temporary storage it
> needs as well.  pfree'ing such temporary data before return is possible
> but not essential.
>
> Executor routines that switch the active CurrentMemoryContext may need
> to copy data into their caller's current memory context before returning.
> I think there will be relatively little need for that, if we use a
> convention of resetting the per-tuple context at the *start* of an
> execution cycle rather than at its end.  With that rule, an execution
> node can return a tuple that is palloc'd in its per-tuple context, and
> the tuple will remain good until the node is called for another tuple
> or told to end execution.  This is pretty much the same state of affairs
> that exists now, since a scan node can return a direct pointer to a tuple
> in a disk buffer that is only guaranteed to remain good that long.
>
> A more common reason for copying data will be to transfer a result from
> per-tuple context to per-run context; for example, a Unique node will
> save the last distinct tuple value in its per-run context, requiring a
> copy step.
> (Actually, Unique could use the same trick with two per-tuple
> contexts as described above for Agg, but there will probably be other
> cases where doing an extra copy step is the right thing.)
>
>
> Other notes
> -----------
>
> It might be that the executor per-run contexts described above should
> be tied directly to executor "EState" nodes, that is, one context per
> EState.  I'm not real clear on the lifespan of EStates or the situations
> where we have just one or more than one, so I'm not sure.  Comments?
>
> With so many contexts running around, I think it will be almost essential
> to allow pfree() to work on chunks belonging to contexts other than the
> current one.  If we don't get rid of the notion of multiple allocation
> context types then some other work will have to be expended to make this
> possible.  Also, should we allow repalloc() to work on a chunk not
> belonging to the current context?  I'm less excited about allowing that,
> but it may prove useful.
>

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026