Re: Memory leak from ExecutorState context? - Mailing list pgsql-hackers

From:           Tomas Vondra
Subject:        Re: Memory leak from ExecutorState context?
Msg-id:         f532d33d-e4c1-ed00-e047-94f3aafa49db@enterprisedb.com
In response to: Re: Memory leak from ExecutorState context? (Jehan-Guillaume de Rorthais <jgdr@dalibo.com>)
Responses:      Re: Memory leak from ExecutorState context?
List:           pgsql-hackers
On 3/28/23 15:17, Jehan-Guillaume de Rorthais wrote:
> On Tue, 28 Mar 2023 00:43:34 +0200
> Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
>> On 3/27/23 23:13, Jehan-Guillaume de Rorthais wrote:
>>> Please, find in attachment a patch to allocate bufFiles in a dedicated
>>> context. I picked up your patch, backpatched it, went through it and
>>> did some minor changes to it. I have some comments/questions though.
>>>
>>> 1. I'm not sure why we must allocate the new "HashBatchFiles" context
>>> under ExecutorState and not under hashtable->hashCxt?
>>>
>>> The only references I could find were in hashjoin.h:30:
>>>
>>>    /* [...]
>>>     * [...] (Exception: data associated with the temp files lives in
>>>     * the per-query context too, since we always call buffile.c in
>>>     * that context.)
>>>
>>> And in nodeHashjoin.c:1243:ExecHashJoinSaveTuple() (I reworded this
>>> original comment in the patch):
>>>
>>>    /* [...]
>>>     * Note: it is important always to call this in the regular executor
>>>     * context, not in a shorter-lived context; else the temp file
>>>     * buffers will get messed up.
>>>
>>> But these do not explain why BufFile-related allocations must be under
>>> a per-query context.
>>>
>>
>> Doesn't that simply describe the current (unpatched) behavior where
>> BufFile is allocated in the per-query context?
>
> I wasn't sure. The first quote from hashjoin.h seems to describe a
> stronger rule about «**always** call buffile.c in per-query context».
> But maybe it ought to be «always call buffile.c from one of the
> sub-query contexts»? I assume the aim is to enforce the tmp files
> removal on query end/error?
>

I don't think we need this info for tempfile cleanup - CleanupTempFiles
relies on the VfdCache, which does malloc/realloc (so it's not using
memory contexts at all). I'm not very familiar with this part of the
code, so I might be missing something.
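To make that distinction concrete - palloc'd memory owned by a context versus malloc-based bookkeeping like the VfdCache - here is a toy, self-contained sketch. All names are illustrative stand-ins, not PostgreSQL's real API: resetting the toy context frees everything it owns, while the separately malloc'd registry is untouched, which is why temp-file cleanup need not care which context the BufFile struct lived in.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative only: a "context" that frees its chunks on reset, next to
 * a plain-malloc registry (in the spirit of fd.c's VfdCache) that the
 * context machinery never touches. */
typedef struct { void **chunks; size_t n; } ToyContext;
typedef struct { int *fds; size_t n; } ToyVfdCache;

static void *
toy_palloc(ToyContext *ctx, size_t size)
{
    /* track the chunk so the context can free it later */
    ctx->chunks = realloc(ctx->chunks, (ctx->n + 1) * sizeof(void *));
    return ctx->chunks[ctx->n++] = malloc(size);
}

static void
toy_context_reset(ToyContext *ctx)
{
    /* frees every chunk allocated in this context */
    for (size_t i = 0; i < ctx->n; i++)
        free(ctx->chunks[i]);
    free(ctx->chunks);
    ctx->chunks = NULL;
    ctx->n = 0;
}

static void
toy_register_temp_fd(ToyVfdCache *cache, int fd)
{
    /* out-of-band bookkeeping, plain realloc - no context involved */
    cache->fds = realloc(cache->fds, (cache->n + 1) * sizeof(int));
    cache->fds[cache->n++] = fd;
}
```

A caller can reset the context mid-query and the fd registry still knows which temp files to close, mirroring the claim that cleanup is driven by the cache, not by context lifetime.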
But you can try that - just stick an elog(ERROR) somewhere into the
hashjoin code, and see if that breaks the cleanup.

Not an explicit proof, but if there was some hard-wired requirement about
which memory context to allocate BufFile stuff in, I'd expect it to be
documented in buffile.c. But that actually says this:

 * Note that BufFile structs are allocated with palloc(), and therefore
 * will go away automatically at query/transaction end. Since the underlying
 * virtual Files are made with OpenTemporaryFile, all resources for
 * the file are certain to be cleaned up even if processing is aborted
 * by ereport(ERROR). The data structures required are made in the
 * palloc context that was current when the BufFile was created, and
 * any external resources such as temp files are owned by the ResourceOwner
 * that was current at that time.

which I take as confirmation that it's legal to allocate BufFile in any
memory context, and that cleanup is handled by the cache in fd.c.

>> I mean, the current code calls BufFileCreateTemp() without switching
>> the context, so it's in the ExecutorState. But with the patch it very
>> clearly is not.
>>
>> And I'm pretty sure the patch should do
>>
>>     hashtable->fileCxt = AllocSetContextCreate(hashtable->hashCxt,
>>                                                "HashBatchFiles",
>>                                                ALLOCSET_DEFAULT_SIZES);
>>
>> and it'd still work. Or why do you think we *must* allocate it under
>> ExecutorState?
>
> That was actually my very first patch and it indeed worked. But I was
> confused about the previously quoted code comments. That's why I kept
> your original code and decided to raise the discussion here.
>

IIRC I was just lazy when writing the experimental patch, there was not
much thought about stuff like this.

> Fixed in new patch in attachment.
>
>> FWIW the comment in hashjoin.h needs updating to reflect the change.
>
> Done in the last patch. Is my rewording accurate?
>
>>> 2. Wrapping each call of ExecHashJoinSaveTuple() with a memory context
>>> switch seems fragile, as it could be forgotten in future code
>>> paths/changes. So I added an Assert() in the function to make sure the
>>> current memory context is "HashBatchFiles" as expected.
>>> Another way to tie this up might be to pass the memory context as an
>>> argument to the function.
>>> ... Or maybe I'm over precautionary.
>>>
>>
>> I'm not sure I'd call that fragile, we have plenty of other code that
>> expects the memory context to be set correctly. Not sure about the
>> assert, but we don't have similar asserts anywhere else.
>
> I mostly stuck it there to stimulate the discussion around this, as I
> needed to scratch that itch.
>
>> But I think it's just ugly and overly verbose
>
> +1
>
> Your patch was just a demo/debug patch at the time. It needed some
> cleanup now :)
>
>> it'd be much nicer to e.g. pass the memory context as a parameter, and
>> do the switch inside.
>
> That was a proposition in my previous mail, so I did it in the new
> patch. Let's see what other reviewers think.
>

+1

>>> 3. You wrote:
>>>
>>>>> A separate BufFile memory context helps, although people won't see
>>>>> it unless they attach a debugger, I think. Better than nothing, but
>>>>> I was wondering if we could maybe print some warnings when the
>>>>> number of batch files gets too high ...
>>>
>>> So I added a WARNING when the batches' memory is exhausting the
>>> allowed memory size.
>>>
>>>    + if (hashtable->fileCxt->mem_allocated > hashtable->spaceAllowed)
>>>    +     elog(WARNING, "Growing number of hash batch is exhausting
>>>            memory");
>>>
>>> This is repeated on each call of ExecHashIncreaseNumBatches when
>>> BufFile overflows the memory budget. I realize now I should probably
>>> add the memory limit, the number of current batches and their memory
>>> consumption. The message is probably too cryptic for a user.
>>> It could probably be reworded, but some doc or an additional hint
>>> around this message might help.
>>>
>>
>> Hmmm, not sure if WARNING is a good approach, but I don't have a
>> better idea at the moment.
>
> I stepped it down to NOTICE and added some more info.
>
> Here is the output of the last patch with a 1MB work_mem:
>
>   =# explain analyze select * from small join large using (id);
>   WARNING:  increasing number of batches from 1 to 2
>   WARNING:  increasing number of batches from 2 to 4
>   WARNING:  increasing number of batches from 4 to 8
>   WARNING:  increasing number of batches from 8 to 16
>   WARNING:  increasing number of batches from 16 to 32
>   WARNING:  increasing number of batches from 32 to 64
>   WARNING:  increasing number of batches from 64 to 128
>   WARNING:  increasing number of batches from 128 to 256
>   WARNING:  increasing number of batches from 256 to 512
>   NOTICE:  Growing number of hash batch to 512 is exhausting allowed
>            memory (2164736 > 2097152)
>   WARNING:  increasing number of batches from 512 to 1024
>   NOTICE:  Growing number of hash batch to 1024 is exhausting allowed
>            memory (4329472 > 2097152)
>   WARNING:  increasing number of batches from 1024 to 2048
>   NOTICE:  Growing number of hash batch to 2048 is exhausting allowed
>            memory (8626304 > 2097152)
>   WARNING:  increasing number of batches from 2048 to 4096
>   NOTICE:  Growing number of hash batch to 4096 is exhausting allowed
>            memory (17252480 > 2097152)
>   WARNING:  increasing number of batches from 4096 to 8192
>   NOTICE:  Growing number of hash batch to 8192 is exhausting allowed
>            memory (34504832 > 2097152)
>   WARNING:  increasing number of batches from 8192 to 16384
>   NOTICE:  Growing number of hash batch to 16384 is exhausting allowed
>            memory (68747392 > 2097152)
>   WARNING:  increasing number of batches from 16384 to 32768
>   NOTICE:  Growing number of hash batch to 32768 is exhausting allowed
>            memory (137494656 > 2097152)
>
>                               QUERY PLAN
>   ------------------------------------------------------------------------
>    Hash Join  (cost=6542057.16..7834651.23 rows=7 width=74)
>               (actual time=558502.127..724007.708 rows=7040 loops=1)
>      Hash Cond: (small.id = large.id)
>      ->  Seq Scan on small  (cost=0.00..940094.00 rows=94000000 width=41)
>                             (actual time=0.035..3.666 rows=10000 loops=1)
>      ->  Hash  (cost=6542057.07..6542057.07 rows=7 width=41)
>                (actual time=558184.152..558184.153 rows=700000000 loops=1)
>            Buckets: 32768 (originally 1024)
>            Batches: 32768 (originally 1)
>            Memory Usage: 1921kB
>            ->  Seq Scan on large  (cost=0.00..6542057.07 rows=7 width=41)
>                  (actual time=0.324..193750.567 rows=700000000 loops=1)
>    Planning Time: 1.588 ms
>    Execution Time: 724011.074 ms
>   (8 rows)
>
> Regards,

OK, although NOTICE may actually make it less useful - the default level
is WARNING, and regular users are unable to change the level. So very
few people will actually see these messages.


thanks

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
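As a footnote for readers unfamiliar with memory contexts, the pattern settled on above - the callee receiving the target context and doing the switch itself, with contexts forming a hierarchy so that deleting a parent (like hashCxt) also deletes its children - can be sketched with a toy, self-contained analogue. All names here are illustrative; PostgreSQL's real counterparts are MemoryContextSwitchTo, AllocSetContextCreate, and MemoryContextDelete in mcxt.c/aset.c.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy analogue of a memory context: owns a list of chunks and a list of
 * child contexts; deleting a context frees both, recursively. */
typedef struct Ctx
{
    struct Ctx  *parent;
    struct Ctx  *first_child;
    struct Ctx  *next_sibling;
    void       **chunks;
    size_t       nchunks;
    size_t       mem_allocated;   /* mirrors MemoryContext->mem_allocated */
} Ctx;

static Ctx *current_ctx;          /* analogue of CurrentMemoryContext */

static Ctx *
ctx_create(Ctx *parent)           /* analogue of AllocSetContextCreate */
{
    Ctx *ctx = calloc(1, sizeof(Ctx));
    ctx->parent = parent;
    if (parent)
    {
        ctx->next_sibling = parent->first_child;
        parent->first_child = ctx;
    }
    return ctx;
}

static void *
ctx_alloc(Ctx *ctx, size_t size)  /* analogue of MemoryContextAlloc */
{
    ctx->chunks = realloc(ctx->chunks, (ctx->nchunks + 1) * sizeof(void *));
    ctx->chunks[ctx->nchunks] = malloc(size);
    ctx->mem_allocated += size;
    return ctx->chunks[ctx->nchunks++];
}

static void
ctx_delete(Ctx *ctx)              /* analogue of MemoryContextDelete */
{
    while (ctx->first_child)      /* children go with the parent */
    {
        Ctx *child = ctx->first_child;
        ctx->first_child = child->next_sibling;
        ctx_delete(child);
    }
    for (size_t i = 0; i < ctx->nchunks; i++)
        free(ctx->chunks[i]);
    free(ctx->chunks);
    free(ctx);
}

/* The pattern under discussion: the callee receives the context and
 * switches itself, so callers cannot forget to do it. */
static void
save_tuple(const char *tuple, Ctx *filecxt, char **buf_out)
{
    Ctx *old = current_ctx;       /* MemoryContextSwitchTo(filecxt) */
    current_ctx = filecxt;

    *buf_out = ctx_alloc(current_ctx, strlen(tuple) + 1);
    strcpy(*buf_out, tuple);

    current_ctx = old;            /* MemoryContextSwitchTo(old) */
}
```

With a hierarchy executor -> hashCxt -> fileCxt, a single delete of the top context releases everything below it, which is why allocating the file context under hashCxt (rather than ExecutorState) still works; and mem_allocated is exactly the counter the patch compares against spaceAllowed.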