Re: Fix overflow of nbatch - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: Fix overflow of nbatch
Date:
Msg-id: 5b26e4ca-0352-42bd-a98f-9a84c0436af5@vondra.me
In response to: Re: Fix overflow of nbatch (Melanie Plageman <melanieplageman@gmail.com>)
Responses: Re: Fix overflow of nbatch
List: pgsql-hackers
On 10/8/25 19:37, Melanie Plageman wrote:
> On Tue, Oct 7, 2025 at 7:51 PM Tomas Vondra <tomas@vondra.me> wrote:
>>
>> On 10/7/25 23:10, Melanie Plageman wrote:
>>
>> Hmm, so if I understand correctly you suggest stop when nbuckets gets
>> too high, while my code allowed reducing nbatches further (and just
>> capped nbatches). I'm fine with this change, if it makes the code
>> simpler, that means we allow ~130M buckets, which seems rather unlikely
>> to be a problem in practice. That means 1GB for buckets alone, and
>> tuples tend to be a multiple of that. With 4GB total, that's ~256k
>> batches. And multiplied by the 130M that would be 3.5e13 tuples ...
>>
>> However, I don't understand why the condition bothers about INT_MAX?
>>
>>     /* Ensure that nbuckets * 2 doesn't overflow an int */
>>     if (nbuckets > INT_MAX / 2)
>>         break;
>
> You're right. This can just be
>
>     if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
>         break;
>
>> It's however true that if we double nbuckets and nbatch at the same
>> time, we don't need to bother doing this:
>>
>>     max_pointers = hash_table_bytes / sizeof(HashJoinTuple);
>>
>> because we can never break. Although, is that really true? When picking
>> the initial nbuckets value we do this:
>>
>>     nbuckets = Max(nbuckets, 1024);
>>
>> What if this increased the nbuckets (it'd require extremely low work_mem
>> of course)? Could it lead to unexpectedly high nbuckets in the loop?
>
> If we are worried about nbuckets exceeding what can be allocated, then
> I think the proposed condition above takes care of that
>
>     if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
>         break;
>
> As for whether or not bumping nbuckets to 1024 before the loop means
> we can have very high values with low work_mem: it seems like even
> with the minimum work_mem, the number of buckets is larger than that.
> When could we hit this? Maybe if the number of skew buckets is very
> large?
>
>> Similarly, why does this check care about number of buckets?
>>
>>     if (hash_table_bytes > MaxAllocSize / sizeof(HashJoinTuple) / 2)
>>         break;
>>
>> I mean, (MaxAllocSize / sizeof(HashJoinTuple)) is the number of buckets
>> (hj tuple pointers) we can fit into 1GB. But hash_table_bytes is the
>> size of the hash table in bytes, not in number of items. Also, see how
>> get_hash_memory_limit caps the limit by SIZE_MAX. So I think we should
>> continue doing that.
>
> Yes, this was wrong. I forgot an extra / sizeof(HashJoinTuple). I meant:
>
>     hash_table_bytes / sizeof(HashJoinTuple) > MaxAllocSize /
>         sizeof(HashJoinTuple) / 2

But the hash table is not allocated as a single chunk of memory, so I
think MaxAllocSize would be the wrong thing to use here, no? Hash table
is a collection of tuples (allocated one by one), that's why
get_hash_memory_limit() uses SIZE_MAX. MaxAllocSize matters for
nbuckets, because that indeed is allocated as a contiguous array, ofc.

Also, why bother with /sizeof(HashJoinTuple) on both sides? Without it
we get

    hash_table_bytes > MaxAllocSize / 2

but again, that doesn't make much sense - the hash table can be larger,
it's not a single palloc.
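To make that distinction concrete, here is a minimal standalone sketch of
the two caps (the constants, the typedef and the helper name are stand-ins,
this is not the actual nodeHash.c code):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* stand-ins for the PostgreSQL definitions discussed above */
    #define MaxAllocSize ((size_t) 0x3fffffff)  /* 1GB - 1, single palloc limit */
    typedef struct HashJoinTupleData *HashJoinTuple;    /* bucket array element */

    /* can nbuckets and hash_table_bytes both be doubled once more? */
    static bool
    doubling_is_safe(size_t nbuckets, size_t hash_table_bytes)
    {
        /* nbuckets becomes one contiguous palloc'd array, so cap it so that
         * 2 * nbuckets pointers still fit into MaxAllocSize */
        if (nbuckets > MaxAllocSize / sizeof(HashJoinTuple) / 2)
            return false;

        /* hash_table_bytes is allocated tuple by tuple, not as one chunk, so
         * the only thing to protect is the size_t arithmetic itself */
        if (hash_table_bytes > SIZE_MAX / 2)
            return false;

        return true;
    }

    int
    main(void)
    {
        printf("%d\n", (int) doubling_is_safe(1024, (size_t) 4 * 1024 * 1024));
        return 0;
    }

That is, only the nbuckets check has anything to do with MaxAllocSize; the
hash_table_bytes check is purely about not overflowing size_t.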
> but I don't think we need this because nbuckets should always be
> bigger than hash_table_bytes / sizeof(HashJoinTuple) since to start
> out with, it was clamped to Max(1024, hash_table_bytes /
> sizeof(HashJoinTuple)).
>
> The only exception would be if MaxAllocSize / sizeof(HashJoinTuple)
> was smaller than hash_table_bytes / sizeof(HashJoinTuple) in which
> case, checking nbuckets > MaxAllocSize / sizeof(HashJoinTuple) should
> cover that anyway.

I'm confused about what this check was meant to do. The check said this

    /*
     * Ensure that hash_table_bytes * 2 doesn't exceed MaxAllocSize /
     * sizeof(HashJoinTuple)
     */
    if (hash_table_bytes > MaxAllocSize / sizeof(HashJoinTuple) / 2)
        break;

And I think it should be just

    if (hash_table_bytes > SIZE_MAX / 2)
        break;

as a protection against hash_table_bytes overflowing SIZE_MAX (which on
64-bits seems implausible, but could happen on 32-bit builds). Also, how
is this related to nbuckets?

>> I like the idea of simplifying the stop condition, instead of
>> calculating the old/new size, and comparing those. But is this actually
>> correct?
>>
>>     if (nbatch / 4 < hash_table_bytes / BLCKSZ)
>>         break;
>>
>> If we start with the "full" condition
>>
>>     hash_bytes + (2 * nbatch * BLCKSZ) < hash_bytes * 2 + (nbatch * BLCKSZ)
>>
>> subtract hash_bytes from both sides
>>
>>     (2 * nbatch * BLCKSZ) < hash_bytes + (nbatch * BLCKSZ)
>>
>> subtract (nbatch * BLCKSZ)
>>
>>     nbatch * BLCKSZ < hash_bytes
>>
>> that gives us
>>
>>     nbatch < (hash_bytes / BLCKSZ)
>>
>> So where did the /4 come from?
>
> Yes, I made the mistake of doubling the batches and hash_table_bytes
> again, forgetting that the original formula was already comparing the
> hypothetical space to the current space.
>
> I think it should be
>
>     if (nbatch < hash_table_bytes / BLCKSZ)
>
> as you say
>
>> Also, maybe it'd be easier to just do
>>
>>     (nbatch * BLCKSZ) < hash_bytes
>>
>> i.e. without the division.
>
> I prefer the division to avoid as many potentially overflow causing
> operations as possible. otherwise we would have to check that nbatch *
> BLCKSZ doesn't overflow first.

Ah, I forgot about overflow again! Which is ironic, because this whole
thing is about overflows. I really wish we had an infinite-precision-int ;-)
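Just to convince myself the simplified condition behaves, here is a minimal
standalone sketch of the balancing loop with that break condition (the
constants and starting values are made up, and this is not the actual
ExecChooseHashTableSize() code, which also doubles space_allowed and deals
with the skew hashtable):

    #include <stddef.h>
    #include <stdio.h>

    #define BLCKSZ 8192             /* stand-in for the block size */

    int
    main(void)
    {
        /* made-up starting point: 4MB hash table, 1M batches, 4k buckets */
        size_t  hash_table_bytes = (size_t) 4 * 1024 * 1024;
        size_t  nbatch = 1024 * 1024;
        size_t  nbuckets = 4096;

        /*
         * Keep halving nbatch (and doubling the hash table) roughly while
         * the per-batch buffers (nbatch * BLCKSZ) outweigh the hash table
         * itself, i.e. while halving nbatch still pays off.
         */
        while (nbatch > 1)
        {
            /* the simplified break condition derived above, written with a
             * division so that nbatch * BLCKSZ is never computed */
            if (nbatch < hash_table_bytes / BLCKSZ)
                break;

            hash_table_bytes *= 2;
            nbuckets *= 2;
            nbatch /= 2;
        }

        printf("nbatch = %zu, nbuckets = %zu, hash_table_bytes = %zu\n",
               nbatch, nbuckets, hash_table_bytes);
        return 0;
    }

For these starting values the loop stops at nbatch = 16384 with a 256MB
hash table, at which point another halving would no longer reduce the
total memory use.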
>>> I'm still waiting for the 4billion row table to be created on my
>>> machine, so I haven't verified that I get the same results as you yet.
>
> I have updated my patch to fix the mistakes above. I also noticed then
> that I wasn't doubling space_allowed in the loop but instead setting
> it to hash_table_bytes at the end. This doesn't produce a power of 2
> because we subtract skew_mcvs from the hash_table_bytes. So, we have
> to keep using space_allowed if we want a power of 2 in the end.
>
> I've changed my patch to do this, but this made me wonder if we want
> to be doing this or instead take hash_table_bytes at the end and round
> it up to a power of 2 and set space_allowed to that. If the skew
> hashtable is large, we may be allocating way more space_allowed than
> we need for new hash_table_bytes + skew hashtable buckets.
>
> Oh, and I get the same logging output results as your patch with attached v2.

Cool, in the worst case we're both wrong in the same way ;-)

regards

-- 
Tomas Vondra
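As an aside on the rounding idea quoted above: taking the final
hash_table_bytes and rounding it up to the next power of two could look
roughly like the sketch below (a stand-alone illustration only; the real
code would presumably use pg_nextpower2_size_t() from pg_bitutils.h
rather than a loop):

    #include <stddef.h>
    #include <stdio.h>

    /* round v up to the next power of two (v > 0, no overflow guard) */
    static size_t
    next_power_of_2(size_t v)
    {
        size_t  result = 1;

        while (result < v)
            result *= 2;
        return result;
    }

    int
    main(void)
    {
        /* e.g. hash_table_bytes after subtracting the skew hashtable space */
        size_t  hash_table_bytes = (size_t) 3 * 1024 * 1024 + 12345;

        printf("%zu\n", next_power_of_2(hash_table_bytes));    /* 4194304 */
        return 0;
    }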