Re: Fix overflow of nbatch - Mailing list pgsql-hackers
From | Melanie Plageman
---|---
Subject | Re: Fix overflow of nbatch
Msg-id | CAAKRu_bm1FnOSsXTy8jpG6=+syEqxrgMSjAVEaSsBUSUdGaEkA@mail.gmail.com
In response to | Re: Fix overflow of nbatch (Tomas Vondra <tomas@vondra.me>)
Responses | Re: Fix overflow of nbatch; Re: Fix overflow of nbatch
List | pgsql-hackers

On Tue, Oct 7, 2025 at 7:51 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> On 10/7/25 23:10, Melanie Plageman wrote:
>
> Hmm, so if I understand correctly you suggest stop when nbuckets gets
> too high, while my code allowed reducing nbatches further (and just
> capped nbatches). I'm fine with this change, if it makes the code
> simpler, that means we allow ~130M buckets, which seems rather unlikely
> to be a problem in practice. That means 1GB for buckets alone, and
> tuples tend to be a multiple of that. With 4GB total, that's ~256k
> batches. And multiplied by the 130M that would be 3.5e13 tuples ...
>
> However, I don't understand why the condition bothers about INT_MAX?
>
>     /* Ensure that nbuckets * 2 doesn't overflow an int */
>     if (nbuckets > INT_MAX / 2)
>         break;

You're right. This can just be

    if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
        break;

> It's however true that if we double nbuckets and nbatch at the same
> time, we don't need to bother doing this:
>
>     max_pointers = hash_table_bytes / sizeof(HashJoinTuple);
>
> because we can never break. Although, is that really true? When picking
> the initial nbuckets value we do this:
>
>     nbuckets = Max(nbuckets, 1024);
>
> What if this increased the nbuckets (it'd require extremely low work_mem
> of course)? Could it lead to unexpectedly high nbuckets in the loop?

If we are worried about nbuckets exceeding what can be allocated, then I
think the proposed condition above takes care of that

    if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
        break;

As for whether or not bumping nbuckets to 1024 before the loop means we
can have very high values with low work_mem: it seems like even with the
minimum work_mem, the number of buckets is larger than that. When could
we hit this? Maybe if the number of skew buckets is very large?

> Similarly, why does this check care about number of buckets?
>
>     if (hash_table_bytes > MaxAllocSize / sizeof(HashJoinTuple) / 2)
>         break;
>
> I mean, (MaxAllocSize / sizeof(HashJoinTuple)) is the number of buckets
> (hj tuple pointers) we can fit into 1GB. But hash_table_bytes is the
> size of the hash table in bytes, not in number of items. Also, see how
> get_hash_memory_limit caps the limit by SIZE_MAX. So I think we should
> continue doing that.

Yes, this was wrong. I forgot an extra / sizeof(HashJoinTuple). I meant:

    hash_table_bytes / sizeof(HashJoinTuple) > MaxAllocSize / sizeof(HashJoinTuple) / 2

but I don't think we need this because nbuckets should always be bigger
than hash_table_bytes / sizeof(HashJoinTuple), since to start out with it
was clamped to Max(1024, hash_table_bytes / sizeof(HashJoinTuple)). The
only exception would be if MaxAllocSize / sizeof(HashJoinTuple) was
smaller than hash_table_bytes / sizeof(HashJoinTuple), in which case
checking nbuckets > MaxAllocSize / sizeof(HashJoinTuple) should cover
that anyway.
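
To put concrete numbers on the guard above, here is a standalone sketch
of the arithmetic. MaxAllocSize and the HashJoinTuple pointer typedef are
reproduced from PostgreSQL's headers so the snippet compiles on its own;
the figures assume a 64-bit build, and this is an illustration, not the
planner code.

    /*
     * Standalone check of the arithmetic behind the guard discussed above.
     * MaxAllocSize and the HashJoinTuple pointer typedef are copied here so
     * the snippet compiles by itself; this is not the planner code.
     */
    #include <stdio.h>
    #include <stddef.h>

    typedef size_t Size;
    #define MaxAllocSize ((Size) 0x3fffffff)          /* ~1GB, as in memutils.h */

    typedef struct HashJoinTupleData *HashJoinTuple;  /* bucket entries are pointers */

    int
    main(void)
    {
        /* Largest bucket array a single allocation can hold (~134M on 64-bit). */
        Size        max_buckets = MaxAllocSize / sizeof(HashJoinTuple);

        /* The loop must stop doubling once nbuckets exceeds half of that. */
        printf("max buckets per allocation: %zu\n", max_buckets);
        printf("doubling guard:             %zu\n", max_buckets / 2);

        return 0;
    }
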
> I like the idea of simplifying the stop condition, instead of
> calculating the old/new size, and comparing those. But is this actually
> correct?
>
>     if (nbatch / 4 < hash_table_bytes / BLCKSZ)
>         break;
>
> If we start with the "full" condition
>
>     hash_bytes + (2 * nbatch * BLCKSZ) < hash_bytes * 2 + (nbatch * BLCKSZ)
>
> subtract hash_bytes from both sides
>
>     (2 * nbatch * BLCKSZ) < hash_bytes + (nbatch * BLCKSZ)
>
> subtract (nbatch * BLCKSZ)
>
>     nbatch * BLCKSZ < hash_bytes
>
> that gives us
>
>     nbatch < (hash_bytes / BLCKSZ)
>
> So where did the /4 come from?

Yes, I made the mistake of doubling the batches and hash_table_bytes
again, forgetting that the original formula was already comparing the
hypothetical space to the current space. I think it should be

    if (nbatch < hash_table_bytes / BLCKSZ)

as you say.

> Also, maybe it'd be easier to just do
>
>     (nbatch * BLCKSZ) < hash_bytes
>
> i.e. without the division.

I prefer the division, to avoid as many potentially overflow-causing
operations as possible; otherwise we would have to check that
nbatch * BLCKSZ doesn't overflow first.

> > I'm still waiting for the 4 billion row table to be created on my
> > machine, so I haven't verified that I get the same results as you yet.

I have updated my patch to fix the mistakes above.

I also noticed then that I wasn't doubling space_allowed in the loop but
instead setting it to hash_table_bytes at the end. This doesn't produce a
power of 2, because we subtract skew_mcvs from hash_table_bytes. So we
have to keep using space_allowed if we want a power of 2 in the end. I've
changed my patch to do this, but it made me wonder whether we want to be
doing this, or instead take hash_table_bytes at the end, round it up to a
power of 2, and set space_allowed to that. If the skew hashtable is
large, we may be allocating way more space_allowed than we need for the
new hash_table_bytes + skew hashtable buckets.

Oh, and I get the same logging output results as your patch with the
attached v2.

- Melanie
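
For readers following along, here is a simplified, self-contained sketch
of the balancing loop with the two stop conditions as they stand after
this exchange. Names follow nodeHash.c, but the stubbed constants, the
helper function, and the example values in main() are illustrative
assumptions, not the attached v2 patch or ExecChooseHashTableSize()
itself.

    /*
     * Simplified sketch of the nbatch/nbuckets balancing loop as discussed
     * in this thread -- not the attached v2 patch or the real planner code.
     * BLCKSZ, MaxAllocSize and HashJoinTuple are stubbed so the example
     * compiles standalone.
     */
    #include <stdio.h>
    #include <stddef.h>

    typedef size_t Size;
    #define BLCKSZ 8192
    #define MaxAllocSize ((Size) 0x3fffffff)
    typedef struct HashJoinTupleData *HashJoinTuple;

    static void
    balance_nbatch(int *nbuckets, int *nbatch,
                   Size *hash_table_bytes, Size *space_allowed)
    {
        while (*nbatch > 1)
        {
            /* Doubling nbuckets must stay within one palloc-able array. */
            if (*nbuckets > (int) (MaxAllocSize / sizeof(HashJoinTuple) / 2))
                break;

            /*
             * Stop once halving nbatch while doubling the hash table would
             * no longer reduce total memory, i.e. the simplified condition
             * nbatch < hash_table_bytes / BLCKSZ derived upthread.
             */
            if ((Size) *nbatch < *hash_table_bytes / BLCKSZ)
                break;

            /* Grow the in-memory hash table, halve the number of batches. */
            *nbuckets *= 2;
            *space_allowed *= 2;    /* doubled in the loop, per the mail above */
            *hash_table_bytes *= 2;
            *nbatch /= 2;
        }
    }

    int
    main(void)
    {
        int         nbuckets = 65536;
        int         nbatch = 65536;                     /* far too many batches */
        Size        hash_table_bytes = 4 * 1024 * 1024; /* 4MB budget */
        Size        space_allowed = hash_table_bytes;

        balance_nbatch(&nbuckets, &nbatch, &hash_table_bytes, &space_allowed);

        printf("nbuckets=%d nbatch=%d hash_table_bytes=%zu space_allowed=%zu\n",
               nbuckets, nbatch, hash_table_bytes, space_allowed);
        return 0;
    }

With these example values the loop trades 65536 batches down to 4096
while growing the hash table from 4MB to 64MB, stopping once the
per-batch buffers (2 * nbatch * BLCKSZ, per the formula quoted upthread)
and the hash table are of comparable size.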