Thread: Re: Adjusting hash join memory limit to handle batch explosion
On Thu, Feb 6, 2025 at 1:48 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> Hi,
>
> Here's a slightly simplified version of the "balancing" patch. I decided
> to stop increasing the nbucket value at runtime, even if the hashtable
> grows larger than the memory limit (which is what we used to calculate
> the initial nbucket value in ExecChooseHashTableSize).

I started looking at these. First question: is the GUC
enable_hashjoin_adjust meant for development, or do you mean it for users
(and for it to be off by default)?

In 0001, in ExecChooseHashTableSize(), I would start the large block
comment with something that indicates this is a continuation of the
calculation above it for getting the required number of batches. You say:

 * Optimize the total amount of memory consumed by the hash node.
 * The nbatch calculation above focuses on the size of the in-memory hash
 * table, ignoring the memory used by batch files. But that can be a lot
 * of memory - each batch file has a BLCKSZ buffer, and we may need two
 * files per batch (inner and outer side). So with enough batches this can
 * be significantly more memory than the hashtable itself, and it grows
 * quickly as we're adding more batches.

I might make it more explicit: nbatch is calculated above purely on the
size of the inner relation and the bytes available for the hashtable,
assuming no per-batch overhead. Now, recalibrate the number of batches and
the size of the hashtable to optimize the total amount of memory consumed
by the hash node. Then go into the rest of your details.

This paragraph, while essential, could probably use a bit of massaging:

 * This means we're only ever reducing nbatch values, we'll never increase
 * it (as we're not considering nbatch*2). We could consider that too,
 * depending on which part of the [nbatch,work_mem] table we're in. And
 * for cases with high work_mem values, we would find that adding batches
 * reduces memory usage. But the hashtable size is what we consider when
 * calculating the initial nbatch value, and if it's dominating the memory
 * usage, it means we're not exceeding the expected memory limit (at least
 * not significantly). There is little risk of OOM or memory overruns. Our
 * goal is not to minimize the memory usage, but to enforce the limit set
 * by the user. Minimizing the memory usage would result in spilling many
 * more batch files, which does not seem great for performance. So we only
 * ever reduce nbatch, never increase it.

The point is that if we aren't exceeding the expected memory limit, then we
won't increase the number of batches to try and save memory, because it
will probably hurt performance in other ways. All the other details are
useful, but I found myself a bit lost in them (the way they are phrased
now).

 * While growing the hashtable, we also adjust the number of buckets, to
 * not have more than one tuple per bucket. We can only do this during

What does "to not have more than one tuple per bucket" mean?

 * So after the initial sizing (here in ExecChooseHashTableSize), the
 * number of buckets is effectively fixed. ExecHashGetBucketAndBatch
 * could calculate batchno/bucketno in a different way, but that's
 * left as a separate improvement. To some extent this is a preexisting
 * issue - if we set growEnabled=false, this allows the hashtable to
 * exceed the memory limit too, and we don't adjust the bucket count.
 * However, that likely happens due to duplicate values and/or hash
 * collisions, so it's not clear if increasing the bucket count would
 * actually spread the tuples through the buckets. It would help with
 * skewed data sets, when we may disable the growth early, and then
 * add more tuples with distinct hash values.

After "is effectively fixed", I'm not sure how much more of this detail I
would include in this comment. There is already quite a lot of information.
Especially the sentence "ExecHashGetBucketAndBatch could calculate
batchno/bucketno in a different way, but that's left as a separate
improvement." seems like it would need more information to be clear enough
to the reader -- so maybe just omit it. If nothing else, I would move the
discussion about why we don't increase the number of buckets to a place
where we are actually _not_ increasing the number of buckets
(ExecHashIncreaseBatchSize()). In this location, we are increasing
nbuckets.

As for ExecHashIncreaseBatchSize():

 * XXX We're comparing the current spaceAllowed/batchSpace values, because
 * if we double either of those this is the new memory we'll use.

I don't get this. Firstly, why is it XXX? Secondly, why are we using the
current spaceAllowed value? In fact, I don't quite understand how this is
actually increasing the size of the hashtable at all. All it does is cause
us to dump out of ExecHashIncreaseNumBatches() without increasing the
number of batches. Is the reason you don't increase the actual spaceAllowed
value that you don't want it to be larger for other batches that might
benefit from a batch doubling?

- Melanie
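For readers following along without the patch applied, the logic under
review is roughly the balancing loop sketched below: keep halving nbatch
(and doubling the in-memory hash table allowance) while that reduces the
combined footprint of the hash table plus the per-batch file buffers. This
is a simplified, self-contained illustration, not the committed
ExecChooseHashTableSize() code; the balance() function name is made up for
the sketch, BLCKSZ is hard-coded to its default, and nbuckets handling and
power-of-two rounding are ignored.

#include <stdio.h>
#include <stddef.h>

#define BLCKSZ 8192             /* PostgreSQL's default block size */

/*
 * Sketch of the "balancing" idea: each batch needs two BufFiles (inner
 * and outer), each with one BLCKSZ buffer, so per-batch overhead is
 * 2 * BLCKSZ bytes.  Halve nbatch while that lowers the total footprint.
 */
static void
balance(size_t hash_table_bytes, int nbatch)
{
	while (nbatch > 1)
	{
		/* memory used now: hash table + two BLCKSZ buffers per batch */
		size_t		current_space = hash_table_bytes + (size_t) 2 * nbatch * BLCKSZ;

		/* memory used if we halve nbatch and double the hash table */
		size_t		new_space = 2 * hash_table_bytes + (size_t) nbatch * BLCKSZ;

		if (current_space <= new_space)
			break;				/* halving the batches no longer helps */

		nbatch /= 2;
		hash_table_bytes *= 2;	/* i.e. double the space allowance */
	}

	printf("nbatch = %d, hash table allowance = %zu MB, total = %zu MB\n",
		   nbatch, hash_table_bytes >> 20,
		   (hash_table_bytes + (size_t) 2 * nbatch * BLCKSZ) >> 20);
}

int
main(void)
{
	/* hashing roughly 64 GB with the default 8 MB limit: 8192 batches */
	balance((size_t) 8 << 20, 8192);
	return 0;
}

With the inputs above it prints nbatch = 2048, a 32 MB allowance, and a
64 MB total, which matches the numbers derived later in the thread.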
I've pushed the first (and main) part of the patch series, after some more
cleanup and comment polishing.

As explained in my previous message, I'm not sure about 0002. I don't know
if we need to worry about it (no reports AFAICS). And while the patch
works, I'm not sure it's the best fix, or whether we need to do something
about exhausting hash bits. In any case, it's not PG18 material, and it's a
separate issue, so I'm marking this as committed.

Thanks to everyone who helped with any of the many old patch versions!

--
Tomas Vondra
On Wed, Feb 19, 2025 at 12:22 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> I've pushed the first (and main) part of the patch series, after some
> more cleanup and comment polishing.

Two comments on your merged patch --

First, it's easier to see what's going on if we overlook the logic to round
to the nearest power of two, and solve the optimization problem
algebraically. Let T = the total memory needed to hash all input rows, and
B = the size of the per-batch metadata (= 2 * BLCKSZ, which is typically 16
KB). Then, solving the optimization problem, the minimum memory usage
occurs at n = nbatches = SQRT(T / B) and w = workmem = SQRT(B * T).

(Here I am using "workmem" for the hash table's "space_allowed.")

The total working memory used, at the minimum, is always 2 * w: twice the
optimal "workmem" ("space_allowed").

This says that the maximum input size that can be (optimally) hashed with
the default 8 MB workmem (= work_mem * hash_mem_multiplier) is 4 GB, and
the total working memory used would actually be 16 MB.

Also, to hash 64 GB, or 16x as much, requires a 32 MB workmem, with 64 MB
of total working memory used. So "workmem" grows with the SQRT of T, the
total hash memory needed; and total working memory is 2x "workmem."

Second -- the algebraic solution illustrates the difficulty in tracking and
restricting working memory usage for Hash Joins! Your patch improves the
"hash 64 GB" situation, because it eliminates 96 MB of per-batch metadata,
by reducing n = nbatches from 8192 to 2048, at a cost of only 24 MB of
workmem. Using the default 8 MB workmem, *actual* total working memory used
would be 8 MB + 16 KB * (64 GB / 8 MB) = 136 MB. By increasing workmem to
32 MB, total working memory is only 64 MB; so we save 72 MB overall. This
is a good thing, but--

The "but" is that the customer really should have set their workmem to 64
MB in the first place; and we should have taken half of that for the hash
table, and left the other half for per-batch metadata.

-- OK, but historically we have pretended that the per-batch metadata used
no memory. So the customer should have set their workmem to 32 MB, with the
understanding that PostgreSQL would have actually used 64 MB...

-- OK, but the customer *didn't* set their workmem to 32 MB. (If they had,
we wouldn't need this patch -- but we *do* need this patch, which means the
customer hasn't set their workmem high enough.) Why not? Well, because if
they set it to 32 MB, they'd run OOM!

-- So we are (secretly!) increasing the customer's workmem to 32 MB, but
only for this particular Hash Join. The customer can't increase it to 32 MB
for all Hash Joins, or they'd run OOM. So we increase it just for this Hash
Join, in the hopes that by doing so we'll avoid running OOM... which is
good; but we don't *tell* the customer we've done this, and we just hope
that the customer actually has 64 MB (= 2x workmem) free (because, if they
don't, they'll run OOM anyway).

All of this is to say that this patch illustrates the need for something
like proposal [1], which allows PostgreSQL to set workmem limits on
individual execution nodes, based on the optimizer's memory estimates. In
the above patch, we're blindly making things better, without knowing
whether we've made them good enough. (The customer is less likely to run
OOM using 64 MB instead of 136 MB, but OOM is still possible since their
workmem limit is 8 MB!)

In v.next of my patchset at [1] (should be done by end of day today) I will
deal with the case discussed above by:

1. Doubling Plan.workmem_limit whenever we halve nbatches (so we track the
"workmem" needed by the hash table);
2. Displaying Plan.workmem_limit + Hash.nbatches * (2 * BLCKSZ), inside
EXPLAIN (work_mem on) (so we display to the customer our best estimate of
the effective workmem limit).

Thanks,
James

[1] https://www.postgresql.org/message-id/flat/CAJVSvF6s1LgXF6KB2Cz68sHzk%2Bv%2BO_vmwEkaon%3DH8O9VcOr-tQ%40mail.gmail.com
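Spelling out the minimization behind James's SQRT formulas above, as a
worked sketch in his notation (power-of-two rounding ignored; w = T/n is
each batch's share of the inner input held in the hash table):

\begin{aligned}
\mathrm{mem}(n) &= \tfrac{T}{n} + B\,n \quad \text{(hash table + per-batch buffers)} \\
\mathrm{mem}'(n) &= -\tfrac{T}{n^{2}} + B = 0 \;\Rightarrow\; n^{*} = \sqrt{T/B} \\
w^{*} &= \tfrac{T}{n^{*}} = \sqrt{B\,T}, \qquad \mathrm{mem}(n^{*}) = 2\sqrt{B\,T} = 2\,w^{*} \\
T &= 64\,\mathrm{GB},\; B = 16\,\mathrm{KB} \;\Rightarrow\; n^{*} = 2048,\; w^{*} = 32\,\mathrm{MB},\; \mathrm{mem}(n^{*}) = 64\,\mathrm{MB}
\end{aligned}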
On 2/25/25 17:30, James Hunter wrote:
> On Wed, Feb 19, 2025 at 12:22 PM Tomas Vondra <tomas@vondra.me> wrote:
>>
>> I've pushed the first (and main) part of the patch series, after some
>> more cleanup and comment polishing.
>
> Two comments on your merged patch --
>
> First, it's easier to see what's going on if we overlook the logic to
> round to the nearest power of two, and solve the optimization problem
> algebraically. Let T = the total memory needed to hash all input rows,
> and B = the size of the per-batch metadata (= 2 * BLCKSZ, which is
> typically 16 KB). Then, solving the optimization problem, the minimum
> memory usage occurs at n = nbatches = SQRT(T / B) and w = workmem =
> SQRT(B * T).
>
> (Here I am using "workmem" for the hash table's "space_allowed.")
>
> The total working memory used, at the minimum, is always 2 * w: twice
> the optimal "workmem" ("space_allowed").
>
> This says that the maximum input size that can be (optimally) hashed
> with the default 8 MB workmem (= work_mem * hash_mem_multiplier) is 4
> GB, and the total working memory used would actually be 16 MB.
>
> Also, to hash 64 GB, or 16x as much, requires a 32 MB workmem, with 64
> MB of total working memory used. So "workmem" grows with the SQRT of
> T, the total hash memory needed; and total working memory is 2x
> "workmem."
>

Yes, this is a nice way to explain the issue, and how we solve it. It's
probably better than the comment in my commit, I guess.

> Second -- the algebraic solution illustrates the difficulty in
> tracking and restricting working memory usage for Hash Joins! Your
> patch improves the "hash 64 GB" situation, because it eliminates 96 MB
> of per-batch metadata, by reducing n = nbatches from 8192 to 2048, at
> a cost of only 24 MB of workmem. Using the default 8 MB workmem,
> *actual* total working memory used would be 8 MB + 16 KB * (64 GB / 8
> MB) = 136 MB. By increasing workmem to 32 MB, total working memory is
> only 64 MB; so we save 72 MB overall. This is a good thing, but--
>

Agreed.

> The "but" is that the customer really should have set their workmem to
> 64 MB in the first place; and we should have taken half of that for
> the hash table, and left the other half for per-batch metadata.
>
> -- OK, but historically we have pretended that the per-batch metadata
> used no memory. So the customer should have set their workmem to 32
> MB, with the understanding that PostgreSQL would have actually used 64
> MB...
>

Sure, we could have considered the per-batch metadata during planning. And
if we find we can't run the hash join, we'd "disable" (penalize) it in some
way. No argument there.

But that assumes we correctly estimate the number of batches during
planning, and it's easy to get that wrong. E.g. the nbatch explosion cases
are a good example. And that's what my patch was aiming to improve. It does
not matter how the user sets the work_mem GUC, really.

> -- OK, but the customer *didn't* set their workmem to 32 MB. (If they
> had, we wouldn't need this patch -- but we *do* need this patch, which
> means the customer hasn't set their workmem high enough.) Why not?
> Well, because if they set it to 32 MB, they'd run OOM!
>

Not sure I follow the reasoning here :-( If the query completed with a
lower work_mem value, it should complete with work_mem = 32MB, because that
reduces the amount of memory needed. But yes, it's possible they hit OOM in
both cases; it's an attempt to reduce the impact.

> -- So we are (secretly!) increasing the customer's workmem to 32 MB,
> but only for this particular Hash Join. The customer can't increase it
> to 32 MB for all Hash Joins, or they'd run OOM. So we increase it just
> for this Hash Join, in the hopes that by doing so we'll avoid running
> OOM... which is good; but we don't *tell* the customer we've done
> this, and we just hope that the customer actually has 64 MB (= 2x
> workmem) free (because, if they don't, they'll run OOM anyway).
>

Right. This is meant to be a best-effort mitigation for rare cases.

Maybe we should track/report it somehow, though. I mean, if 1% of hash
joins need this, you're probably fine. If 99% of hash joins hit it, you
probably really need a higher work_mem value because the hashed relation is
just too large.

But you have a point - maybe we should track/report this somewhere. A first
step would be to make the total memory usage better visible in explain
(it's not obvious it does not include the per-batch metadata).

> All of this is to say that this patch illustrates the need for
> something like proposal [1], which allows PostgreSQL to set workmem
> limits on individual execution nodes, based on the optimizer's memory
> estimates. In the above patch, we're blindly making things better,
> without knowing whether we've made them good enough. (The customer is
> less likely to run OOM using 64 MB instead of 136 MB, but OOM is still
> possible since their workmem limit is 8 MB!)
>
> In v.next of my patchset at [1] (should be done by end of day today) I
> will deal with the case discussed above by:
>
> 1. Doubling Plan.workmem_limit whenever we halve nbatches (so we track
> the "workmem" needed by the hash table);
> 2. Displaying Plan.workmem_limit + Hash.nbatches * (2 * BLCKSZ),
> inside EXPLAIN (work_mem on) (so we display to the customer our best
> estimate of the effective workmem limit).
>
> Thanks,
> James
>
> [1] https://www.postgresql.org/message-id/flat/CAJVSvF6s1LgXF6KB2Cz68sHzk%2Bv%2BO_vmwEkaon%3DH8O9VcOr-tQ%40mail.gmail.com

I'm not opposed to doing something like this, but I'm not quite sure how it
could help the cases I meant to address with my patch, where we plan with a
low nbatch value, and then it explodes at execution time.

regards

--
Tomas Vondra
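To make the accounting gap concrete (Tomas's point that the memory figure
in explain omits the per-batch metadata), here is a small illustrative
helper. hashjoin_total_space() is hypothetical, not an existing PostgreSQL
function, and BLCKSZ is hard-coded to its default for the sketch:

#include <stddef.h>

#define BLCKSZ 8192             /* PostgreSQL's default block size */

/*
 * Hypothetical helper, for illustration only: total working memory of a
 * hash join, counting both the in-memory hash table and the per-batch
 * BufFile buffers (two files per batch, one BLCKSZ buffer each) that the
 * current EXPLAIN output leaves out.
 */
size_t
hashjoin_total_space(size_t hash_table_bytes, int nbatch)
{
	size_t		batch_bytes = 0;

	if (nbatch > 1)
		batch_bytes = (size_t) nbatch * 2 * BLCKSZ;

	return hash_table_bytes + batch_bytes;
}

For the 64 GB example quoted above, this gives 8 MB + 8192 * 16 KB = 136 MB
before the balancing and 32 MB + 2048 * 16 KB = 64 MB after it, which is
exactly the gap being discussed.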
On Tue, Feb 25, 2025 at 9:39 AM Tomas Vondra <tomas@vondra.me> wrote:
>
> On 2/25/25 17:30, James Hunter wrote:
> > On Wed, Feb 19, 2025 at 12:22 PM Tomas Vondra <tomas@vondra.me> wrote:
> > -- OK, but the customer *didn't* set their workmem to 32 MB. (If they
> > had, we wouldn't need this patch -- but we *do* need this patch, which
> > means the customer hasn't set their workmem high enough.) Why not?
> > Well, because if they set it to 32 MB, they'd run OOM!
> >
>
> Not sure I follow the reasoning here :-( If the query completed with a
> lower work_mem value, it should complete with work_mem = 32MB, because
> that reduces the amount of memory needed. But yes, it's possible they
> hit OOM in both cases; it's an attempt to reduce the impact.

Yes, your patch is a Pareto improvement, because it means we use less
working memory than we would otherwise.

> > -- So we are (secretly!) increasing the customer's workmem to 32 MB,
> > but only for this particular Hash Join. The customer can't increase it
> > to 32 MB for all Hash Joins, or they'd run OOM. So we increase it just
> > for this Hash Join, in the hopes that by doing so we'll avoid running
> > OOM... which is good; but we don't *tell* the customer we've done
> > this, and we just hope that the customer actually has 64 MB (= 2x
> > workmem) free (because, if they don't, they'll run OOM anyway).
> >
>
> Right. This is meant to be a best-effort mitigation for rare cases.
>
> Maybe we should track/report it somehow, though. I mean, if 1% of hash
> joins need this, you're probably fine. If 99% of hash joins hit it, you
> probably really need a higher work_mem value because the hashed relation
> is just too large.
>
> But you have a point - maybe we should track/report this somewhere.
> A first step would be to make the total memory usage better visible in
> explain (it's not obvious it does not include the per-batch metadata).

Right -- my point is that mitigation is good, but tracking / visibility is
also necessary.

> > All of this is to say that this patch illustrates the need for
> > something like proposal [1], which allows PostgreSQL to set workmem
> > limits on individual execution nodes, based on the optimizer's memory
> > estimates. In the above patch, we're blindly making things better,
> > without knowing whether we've made them good enough. (The customer is
> > less likely to run OOM using 64 MB instead of 136 MB, but OOM is still
> > possible since their workmem limit is 8 MB!)
> >
> > In v.next of my patchset at [1] (should be done by end of day today) I
> > will deal with the case discussed above by:
> ...
>
> I'm not opposed to doing something like this, but I'm not quite sure how
> it could help the cases I meant to address with my patch, where we plan
> with a low nbatch value, and then it explodes at execution time.

Your patch addresses two cases: (1) where we plan with a high nbatch value;
and (2) where we plan with a low nbatch value, and then it explodes at
execution time.

Case (2) can be solved only by taking some action at runtime, and that
action is always best effort (because if we really don't have enough extra
memory free, we have to run OOM). Once a query has started running, we have
fewer options.

Case (1) can be solved in various ways, because it occurs before we start
running the query. For example, we can:

1. Delay starting execution of the query until enough memory becomes
available; or
2. Take memory away from other execution nodes to give to this Hash Join.

But these case (1) solutions require access to the actual working-memory
calculation. That's all I'm saying -- by tracking our "best-effort"
decision, we make it possible to address case (1).

(Your patch solves case (2), as well as it can be solved, by giving memory
to this Hash Join at runtime, and hoping that no one else was using it.
It's hard to improve on that, because PG execution nodes don't, in general,
have the ability to give up memory after they've started running. But we
can do better for case (1), if other components can basically see the
results of your formula.)

Thanks,
James