Re: BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker - Mailing list pgsql-bugs
From: Tomas Vondra
Subject: Re: BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
Date:
Msg-id: 556ea57a-9cf3-4288-936e-bbe63f2bd29e@enterprisedb.com
In response to: Re: BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker (Thomas Munro <thomas.munro@gmail.com>)
Responses: Re: BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
List: pgsql-bugs
On 3/3/24 23:12, Thomas Munro wrote:
> On Thu, Feb 29, 2024 at 4:37 PM Andrei Lepikhov
> <a.lepikhov@postgrespro.ru> wrote:
>> On 21/2/2024 19:52, Tomas Vondra wrote:
>>> It's a bit weird it needs 1.8GB of memory, but perhaps that's also
>>> linked to the number of batches, somehow?
>
>> I found one possible weak point in the code of PHJ:
>> ExecParallelHashJoinSetUpBatches:
>>
>> pstate->batches = dsa_allocate0(hashtable->area,
>>     EstimateParallelHashJoinBatch(hashtable) * nbatch);
>>
>> It could explain why we have such a huge memory allocation with a size
>> not bonded to a power of 2.
>
> Maybe, but how many batches would that require for 1.8GB of memory?

From what I see in my experiments, this uses ~432B per batch, so that'd
be about 4M batches (in fact, 432 B * 2^22 batches = 1,811,939,328 B,
which is exactly the allocation size in the error). That's ... a lot.
There may be some dependency on the number of workers, but it's going
to be in this ballpark.

OTOH I don't see any other dsa_allocate calls for the reproducer, so
maybe it really is this. I really wonder how many batches there are and
what the memory per batch is.

> Hmm, a couple of short term ideas:
>
> 1. Maybe the planner should charge a high cost for exceeding some
> soft limit on the expected number of batches; perhaps it could be
> linked to the number of file descriptors we can open (something like
> 1000), because during the build phase we'll be opening and closing
> random file descriptors like crazy due to vfd pressure, which is not
> free; that should hopefully discourage the planner from reaching cases
> like this, but of course only in cases the planner can predict.

Will we really open the file descriptors like crazy? I think the 8kB
buffers we keep for temporary files are a pretty good protection
against that (but also one of the reasons for the memory explosion in
the serial hash join). Or does the parallel version use files
differently?

> 2. It sounds like we should clamp nbatches. We don't want the
> per-partition state to exceed MaxAllocSize, which I guess is what
> happened here if the above-quoted line produced the error? (The flag
> that would allow "huge" allocations exceeding MaxAllocSize seems
> likely to make the world worse while we have so many footguns -- I
> think once a query has reached this stage of terrible execution, it's
> better to limit the damage.)
>
> The performance and memory usage will still be terrible. We just
> don't deal well with huge numbers of tiny partitions. Tiny relative
> to input size, with input size being effectively unbounded.

Yeah.

> Thoughts for later: A lower limit would of course be possible and
> likely desirable. Once the partition-bookkeeping memory exceeds
> work_mem * hash_mem_multiplier, it becomes stupid to double it just
> because a hash table has hit that size, because that actually
> increases total memory usage (and fast, because it's quadratic).

Right. This is pretty much exactly the reason for the memory explosion
in the serial hash join, with a report maybe once or twice a year (who
knows how many people don't report it, though).

> We'd be better off in practice giving up on the hash table size limit
> and hoping for the best.

Perhaps it'd be better not to give up entirely, but to relax it a bit.
One of the proposals in [1] (from ~5 years ago) was to "balance" the
hash table size and the memory needed for the temporary files. That is,
when things go wrong, double the hash table size every time the temp
files need the same amount of memory. That minimizes the total amount
of memory needed by the join, I think.
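To make that rule a bit more concrete, here is a minimal sketch of how
the check could look, assuming the nodeHash.c context (HashJoinTable
with its nbatch and spaceAllowed fields, and two BLCKSZ-sized BufFile
buffers per batch); the helper name is made up and this is not the
actual patch from [1]:

    /*
     * Balancing sketch (illustrative only): once the temp-file buffers
     * pinned by the batches need about as much memory as the in-memory
     * hash table is allowed to use, doubling nbatch only increases total
     * memory use (it doubles the buffer memory while the hash table stays
     * capped).  Doubling the hash table limit instead keeps the two
     * components balanced and roughly minimizes their sum.
     */
    static bool
    ExecHashPreferBiggerHashTable(HashJoinTable hashtable)
    {
        /* one inner and one outer BufFile buffer (BLCKSZ each) per batch */
        Size        buffile_mem = (Size) hashtable->nbatch * 2 * BLCKSZ;

        if (buffile_mem >= hashtable->spaceAllowed)
        {
            hashtable->spaceAllowed *= 2;
            return true;        /* grow the hash table, keep nbatch */
        }

        return false;           /* fine to double nbatch as usual */
    }

The place that currently decides to double nbatch would consult this
first and only increase nbatch when it returns false.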
Back then it seemed like an ad hoc band-aid, but maybe it's better than
nothing. It was far simpler than various other ideas, and these
failures should be quite rare. Perhaps a similar emergency approach
would work here too?

> But the real long term question is what strategy
> we're going to use to actually deal with this situation properly
> *without* giving up our memory usage policies and hoping for the best,
> and that remains an open question. To summarise the two main ideas
> put forward so far: (1) allow very high number of batches, but process
> at most N of M batches at a time, using a temporary "all-the-rest"
> batch to be re-partitioned to feed the next N batches + the rest in a
> later cycle, (2) fall back to looping over batches multiple times in
> order to keep nbatches <= a small limit while also not exceeding a
> hash table size limit.

I'm not sure about (2), but (1) sounds very much like a recursive hash
join, where we'd allow only a limited fan out at each stage. It's also
a bit like the "spill" files in my other proposal - that worked fine,
and it enforced the memory limit better than the memory balancing
(which is more of a best-effort heuristic).

> Both have some tricky edge cases, especially
> with parallelism in the picture but probably even without it. I'm
> willing to work more on exploring this some time after the 17 cycle.

I haven't thought very much about parallel hash joins, but what would
the corner cases be for the non-parallel case? Sure, it may increase
the amount of I/O, but only for cases with unexpectedly many batches
(and then it's just a natural trade-off of I/O vs. enforcing the memory
limit).

regards

[1] https://www.postgresql.org/message-id/20190504003414.bulcbnge3rhwhcsh%40development

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company