Re: Fix overflow of nbatch - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: Fix overflow of nbatch
Date:
Msg-id: 5b26e4ca-0352-42bd-a98f-9a84c0436af5@vondra.me
In response to: Re: Fix overflow of nbatch (Melanie Plageman <melanieplageman@gmail.com>)
Responses: Re: Fix overflow of nbatch
List: pgsql-hackers
On 10/8/25 19:37, Melanie Plageman wrote:
> On Tue, Oct 7, 2025 at 7:51 PM Tomas Vondra <tomas@vondra.me> wrote:
>>
>> On 10/7/25 23:10, Melanie Plageman wrote:
>>
>> Hmm, so if I understand correctly you suggest stop when nbuckets gets
>> too high, while my code allowed reducing nbatches further (and just
>> capped nbatches). I'm fine with this change, if it makes the code
>> simpler, that means we allow ~130M buckets, which seems rather unlikely
>> to be a problem in practice. That means 1GB for buckets alone, and
>> tuples tend to be a multiple of that. With 4GB total, that's ~256k
>> batches. And multiplied by the 130M that would be 3.5e13 tuples ...
>>
>> However, I don't understand why the condition bothers about INT_MAX?
>>
>>     /* Ensure that nbuckets * 2 doesn't overflow an int */
>>     if (nbuckets > INT_MAX / 2)
>>         break;
>
> You're right. This can just be
>
>     if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
>         break;
>
>> It's however true that if we double nbuckets and nbatch at the same
>> time, we don't need to bother doing this:
>>
>>     max_pointers = hash_table_bytes / sizeof(HashJoinTuple);
>>
>> because we can never break. Although, is that really true? When picking
>> the initial nbuckets value we do this:
>>
>>     nbuckets = Max(nbuckets, 1024);
>>
>> What if this increased the nbuckets (it'd require extremely low work_mem
>> of course)? Could it lead to unexpectedly high nbuckets in the loop?
>
> If we are worried about nbuckets exceeding what can be allocated, then
> I think the proposed condition above takes care of that
>
>     if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
>         break;
>
> As for whether or not bumping nbuckets to 1024 before the loop means
> we can have very high values with low work_mem: it seems like even
> with the minimum work_mem, the number of buckets is larger than that.
> When could we hit this? Maybe if the number of skew buckets is very
> large?
>
>> Similarly, why does this check care about number of buckets?
>>
>>     if (hash_table_bytes > MaxAllocSize / sizeof(HashJoinTuple) / 2)
>>         break;
>>
>> I mean, (MaxAllocSize / sizeof(HashJoinTuple)) is the number of buckets
>> (hj tuple pointers) we can fit into 1GB. But hash_table_bytes is the
>> size of the hash table in bytes, not in number of items. Also, see how
>> get_hash_memory_limit caps the limit by SIZE_MAX. So I think we should
>> continue doing that.
>
> Yes, this was wrong. I forgot an extra / sizeof(HashJoinTuple). I meant:
>
>     hash_table_bytes / sizeof(HashJoinTuple) > MaxAllocSize /
>         sizeof(HashJoinTuple) / 2

But the hash table is not allocated as a single chunk of memory, so I
think MaxAllocSize would be the wrong thing to use here, no? Hash table
is a collection of tuples (allocated one by one), that's why
get_hash_memory_limit() uses SIZE_MAX. MaxAllocSize matters for
nbuckets, because that indeed is allocated as a contiguous array, ofc.

Also, why bother with /sizeof(HashJoinTuple) on both sides? Without it
we get

    hash_table_bytes > MaxAllocSize / 2

but again, that doesn't make much sense - the hash table can be larger,
it's not a single palloc.
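To make that distinction concrete, here is a minimal standalone sketch of
the two caps (the constants, the typedef and the helper name are stand-ins,
this is not the actual nodeHash.c code):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* stand-ins for the PostgreSQL definitions discussed above */
    #define MaxAllocSize ((size_t) 0x3fffffff)  /* 1GB - 1, single palloc limit */
    typedef struct HashJoinTupleData *HashJoinTuple;    /* bucket array element */

    /* can nbuckets and hash_table_bytes both be doubled once more? */
    static bool
    doubling_is_safe(size_t nbuckets, size_t hash_table_bytes)
    {
        /* nbuckets becomes one contiguous palloc'd array, so cap it so that
         * 2 * nbuckets pointers still fit into MaxAllocSize */
        if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
            return false;

        /* hash_table_bytes is allocated tuple by tuple, not as one chunk, so
         * the only thing to protect is the size_t arithmetic itself */
        if (hash_table_bytes > SIZE_MAX / 2)
            return false;

        return true;
    }

    int
    main(void)
    {
        printf("%d\n", (int) doubling_is_safe(1024, (size_t) 4 * 1024 * 1024));
        return 0;
    }

That is, only the nbuckets check has anything to do with MaxAllocSize; the
hash_table_bytes check is purely about not overflowing size_t.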
> but I don't think we need this because nbuckets should always be
> bigger than hash_table_bytes / sizeof(HashJoinTuple) since to start
> out with, it was clamped to Max(1024, hash_table_bytes /
> sizeof(HashJoinTuple)).
>
> The only exception would be if MaxAllocSize / sizeof(HashJoinTuple)
> was smaller than hash_table_bytes / sizeof(HashJoinTuple) in which
> case, checking nbuckets > MaxAllocSize / sizeof(HashJoinTuple) should
> cover that anyway.

I'm confused about what this check was meant to do. The check said this

    /*
     * Ensure that hash_table_bytes * 2 doesn't exceed MaxAllocSize /
     * sizeof(HashJoinTuple)
     */
    if (hash_table_bytes > MaxAllocSize / sizeof(HashJoinTuple) / 2)
        break;

And I think it should be just

    if (hash_table_bytes > SIZE_MAX / 2)
        break;

as a protection against hash_table_bytes overflowing SIZE_MAX (which on
64-bits seems implausible, but could happen on 32-bit builds). Also, how
is this related to nbuckets?

>> I like the idea of simplifying the stop condition, instead of
>> calculating the old/new size, and comparing those. But is this actually
>> correct?
>>
>>     if (nbatch / 4 < hash_table_bytes / BLCKSZ)
>>         break;
>>
>> If we start with the "full" condition
>>
>>     hash_bytes + (2 * nbatch * BLCKSZ) < hash_bytes * 2 + (nbatch * BLCKSZ)
>>
>> subtract hash_bytes from both sides
>>
>>     (2 * nbatch * BLCKSZ) < hash_bytes + (nbatch * BLCKSZ)
>>
>> subtract (nbatch * BLCKSZ)
>>
>>     nbatch * BLCKSZ < hash_bytes
>>
>> that gives us
>>
>>     nbatch < (hash_bytes / BLCKSZ)
>>
>> So where did the /4 come from?
>
> Yes, I made the mistake of doubling the batches and hash_table_bytes
> again, forgetting that the original formula was already comparing the
> hypothetical space to the current space.
>
> I think it should be
>
>     if (nbatch < hash_table_bytes / BLCKSZ)
>
> as you say
>
>> Also, maybe it'd be easier to just do
>>
>>     (nbatch * BLCKSZ) < hash_bytes
>>
>> i.e. without the division.
>
> I prefer the division to avoid as many potentially overflow causing
> operations as possible. otherwise we would have to check that nbatch *
> BLCKSZ doesn't overflow first.

Ah, I forgot about overflow again! Which is ironic, because this whole
thing is about overflows. I really wish we had an infinite-precision-int ;-)
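Just to convince myself the simplified condition behaves, here is a minimal
standalone sketch of the balancing loop with that break condition (the
constants and starting values are made up, and this is not the actual
ExecChooseHashTableSize() code, which also doubles space_allowed and deals
with the skew hashtable):

    #include <stddef.h>
    #include <stdio.h>

    #define BLCKSZ 8192             /* stand-in for the block size */

    int
    main(void)
    {
        /* made-up starting point: 4MB hash table, 1M batches, 4k buckets */
        size_t  hash_table_bytes = (size_t) 4 * 1024 * 1024;
        size_t  nbatch = 1024 * 1024;
        size_t  nbuckets = 4096;

        /*
         * Keep halving nbatch (and doubling the hash table) roughly while
         * the per-batch buffers (nbatch * BLCKSZ) outweigh the hash table
         * itself, i.e. while halving nbatch still pays off.
         */
        while (nbatch > 1)
        {
            /* the simplified break condition derived above, written with a
             * division so that nbatch * BLCKSZ is never computed */
            if (nbatch < hash_table_bytes / BLCKSZ)
                break;

            hash_table_bytes *= 2;
            nbuckets *= 2;
            nbatch /= 2;
        }

        printf("nbatch = %zu, nbuckets = %zu, hash_table_bytes = %zu\n",
               nbatch, nbuckets, hash_table_bytes);
        return 0;
    }

For these starting values the loop stops at nbatch = 16384 with a 256MB
hash table, at which point another halving would no longer reduce the
total memory use.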
>>> I'm still waiting for the 4billion row table to be created on my
>>> machine, so I haven't verified that I get the same results as you yet.
>
> I have updated my patch to fix the mistakes above. I also noticed then
> that I wasn't doubling space_allowed in the loop but instead setting
> it to hash_table_bytes at the end. This doesn't produce a power of 2
> because we subtract skew_mcvs from the hash_table_bytes. So, we have
> to keep using space_allowed if we want a power of 2 in the end.
>
> I've changed my patch to do this, but this made me wonder if we want
> to be doing this or instead take hash_table_bytes at the end and round
> it up to a power of 2 and set space_allowed to that. If the skew
> hashtable is large, we may be allocating way more space_allowed than
> we need for new hash_table_bytes + skew hashtable buckets.
>
> Oh, and I get the same logging output results as your patch with attached v2.

Cool, in the worst case we're both wrong in the same way ;-)

regards

-- 
Tomas Vondra
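As an aside on the rounding idea quoted above: taking the final
hash_table_bytes and rounding it up to the next power of two could look
roughly like the sketch below (a stand-alone illustration only; the real
code would presumably use pg_nextpower2_size_t() from pg_bitutils.h
rather than a loop):

    #include <stddef.h>
    #include <stdio.h>

    /* round v up to the next power of two (v > 0, no overflow guard) */
    static size_t
    next_power_of_2(size_t v)
    {
        size_t  result = 1;

        while (result < v)
            result *= 2;
        return result;
    }

    int
    main(void)
    {
        /* e.g. hash_table_bytes after subtracting the skew hashtable space */
        size_t  hash_table_bytes = (size_t) 3 * 1024 * 1024 + 12345;

        printf("%zu\n", next_power_of_2(hash_table_bytes));    /* 4194304 */
        return 0;
    }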