Re: Memory-Bounded Hash Aggregation - Mailing list pgsql-hackers
| From | Jeff Davis |
| --- | --- |
| Subject | Re: Memory-Bounded Hash Aggregation |
| Date | |
| Msg-id | 2c8dc5c9178c391493dfbf6c2f595dc298dd477e.camel@j-davis.com |
| In response to | Re: Memory-Bounded Hash Aggregation (Tomas Vondra <tomas.vondra@2ndquadrant.com>) |
| List | pgsql-hackers |
Thanks very much for a great review! I've attached a new patch.

There are some significant changes in the new version as well:

In the non-spilling path, I removed the extra nullcheck branch in the compiled evaltrans expression. When the first tuple is spilled, the branch becomes necessary, so I recompile the expression using a new opcode that includes that branch. I also changed the read-from-spill path to use a slot with TTSOpsMinimalTuple (avoiding the need to make it into a virtual slot right away), which means I need to recompile the evaltrans expression for that case as well.

I also improved the way we initialize the hash tables to use a better estimate for the number of groups, and I made it only initialize one hash table in the read-from-spill path.

With all of the changes I made (thanks to some suggestions from Andres), the performance is looking pretty good. It's pretty easy to beat Sort+Group when the group size is 10+. Even for an average group size of ~1, HashAgg is getting really close to Sort in some cases.

There are still a few things to do, most notably costing. I also need to project before spilling to avoid wasting disk. And I'm sure my changes have created some more problems, so I have some significant work to do on quality.

My answers to your questions are inline:

On Thu, 2019-11-28 at 18:46 +0100, Tomas Vondra wrote:
> Could we instead track how many tuples we actually consumed for the
> in-memory groups, and then use this information to improve the
> estimate of the number of groups? I mean, if we know we've consumed
> 1000 tuples which created 100 groups, then we know there's ~1:10
> ratio.

That would be a good estimate for an even distribution, but not necessarily for a skewed distribution. I'm not opposed to it, but it's generally my philosophy to overpartition, as it seems there's not a big downside.
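To make the ratio idea concrete, here is a minimal sketch of how an observed tuples-per-group ratio could be applied to the spilled input. This is hypothetical, not code from the patch; the function name and signature are made up for illustration:

```c
#include <math.h>

/*
 * Hypothetical sketch: estimate how many distinct groups remain in the
 * spilled tuples, based on the ratio observed while filling the
 * in-memory hash table.  E.g. 1000 tuples consumed for 100 in-memory
 * groups is a ~1:10 ratio, suggesting roughly tuples_spilled / 10 more
 * groups.  Per the caveat above, this is only sound for an even
 * distribution: under skew, the common groups tend to be absorbed in
 * memory first, so the spilled tail can have a very different ratio.
 */
static double
estimate_spilled_groups(double tuples_consumed, double groups_created,
                        double tuples_spilled)
{
    double      tuples_per_group;

    if (groups_created <= 0.0)
        return 1.0;             /* no observations yet; assume one group */

    tuples_per_group = tuples_consumed / groups_created;
    return fmax(1.0, tuples_spilled / tuples_per_group);
}
```

The clamp to at least 1.0 only avoids a nonsense estimate when the spill file is tiny; overpartitioning is the coarser but safer answer to the skew problem.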
> A couple of comments based on eye-balling the patch:
>
> 1) Shouldn't the hashagg_mem_overflow use the other GUC naming, i.e.
> maybe it should be enable_hashagg_mem_overflow or something similar?

The enable_* naming is for planner GUCs. hashagg_mem_overflow is an execution-time GUC that disables spilling and overflows work_mem (that is, it reverts to the old behavior).

> 2) [...] assume the pointer really is NULL. Surely we'll get a
> segfault on the preceding line which does dereference it
>
>     pergroup = &pergroup_allaggs[op->d.agg_init_trans.transno];
>
> Or am I missing anything?

That's not actually dereferencing anything; it's just doing a pointer calculation. You are probably right that it's not a good thing to rely on, or at least not quite as readable, so I changed the order to put the NULL check first.

> 3) execGrouping.c
>
> A couple of functions would deserve a comment, explaining what it
> does.
>
> - LookupTupleHashEntryHash
> - prepare_hash_slot
> - calculate_hash

Done, thank you.

> And it's not clear to me why we should remove part of the comment
> before TupleHashTableHash.

Trying to remember back to when I first did that, but IIRC the comment was not updated from a previous change, and I was cleaning it up. I will check over that again to be sure it's an improvement.

> 4) I'm not sure I agree with this reasoning that
> HASH_PARTITION_FACTOR making the hash tables smaller is desirable -
> it may be, but if that was generally the case we'd just use small
> hash tables all the time. It's a bit annoying to give the user the
> capability to set work_mem and then kinda override that.

I think adding some kind of headroom is reasonable to avoid recursively spilling, but perhaps it's not critical. I see this as a tuning question more than anything else. I don't see it as "overriding" work_mem, but I can see where you're coming from.

> 5) Not sure what "directly" means in this context?
>
> * partitions at the time we need to spill, and because this algorithm
> * shouldn't depend too directly on the internal memory needs of a
> * BufFile.
>
> #define HASH_PARTITION_MEM (HASH_MIN_PARTITIONS * BLCKSZ)
>
> Does that mean we don't want to link to PGAlignedBlock, or what?

That's what I meant, yes, but I reworded the comment to not say that.

> 6) I think we should have some protection against underflows in this
> piece of code:
>
> - this would probably deserve some protection against underflow if
> HASH_PARTITION_MEM gets too big
>
> if (hashagg_mem_overflow)
>     aggstate->hash_mem_limit = SIZE_MAX;
> else
>     aggstate->hash_mem_limit = (work_mem * 1024L) -
>         HASH_PARTITION_MEM;
>
> At the moment it's safe because work_mem is 64kB at least, and
> HASH_PARTITION_MEM is 32kB (4 partitions, 8kB each). But if we happen
> to bump HASH_MIN_PARTITIONS up, this can underflow.

Thank you, done.

> 7) Shouldn't lookup_hash_entry briefly explain why/how it handles the
> memory limit?

Improved.

> 8) The comment before lookup_hash_entries says:
>
> ...
> * Return false if hash table has exceeded its memory limit.
> ..
>
> But that's clearly bogus, because that's a void function.

Thank you, improved comment.

> 9) Shouldn't the hash_finish_initial_spills calls in
> agg_retrieve_direct have a comment, similar to the surrounding code?
> Might be an overkill, not sure.

Sure, done.

> 10) The comment for agg_refill_hash_table says
>
> * Should only be called after all in memory hash table entries have
> * been consumed.
>
> Can we enforce that with an assert, somehow?

It's a bit awkward. Simplehash doesn't expose the number of groups, and we would also have to check each hash table. Not a bad idea to add an interface to simplehash to make that work, though.

> 11) The hash_spill_npartitions naming seems a bit confusing, because
> it seems to imply it's about the "spill" while in practice it just
> chooses the number of spill partitions. Maybe
> hash_choose_num_spill_partitions would be better?

Done.

> 12) It's not clear to me why we need HASH_MAX_PARTITIONS? What's the
> reasoning behind the current value (256)? Not wanting to pick too
> many partitions? Comment?
>
> if (npartitions > HASH_MAX_PARTITIONS)
>     npartitions = HASH_MAX_PARTITIONS;

Added a comment. There's no deep reasoning there -- I just don't want it to choose to create 5000 files and surprise a user.

> 13) As for this:
>
> /* make sure that we don't exhaust the hash bits */
> if (partition_bits + input_bits >= 32)
>     partition_bits = 32 - input_bits;
>
> We already ran into this issue (exhausting bits in a hash value) in
> hashjoin batching, we should be careful to use the same approach in
> both places (not the same code, just general approach).

Didn't investigate this yet, but will do.

Regards,
	Jeff Davis
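P.S. To illustrate the bit-budget issue in point 13: with a 32-bit hash value, each level of recursive spilling consumes some of the bits to route tuples, so the partition count must be clamped once the bits run low. The sketch below is hypothetical (the helper names and the exact routing scheme are made up; the patch may carve up the bits differently), but it shows the shape of the clamp:

```c
#include <stdint.h>

/*
 * Hypothetical sketch of the hash-bit budget: `input_bits` have already
 * been consumed by enclosing levels of repartitioning, so only
 * (32 - input_bits) bits remain for choosing a partition at this level.
 */
static int
choose_partition_bits(int requested_bits, int input_bits)
{
    int         partition_bits = requested_bits;

    /* never more than 256 partitions (8 bits), mirroring HASH_MAX_PARTITIONS */
    if (partition_bits > 8)
        partition_bits = 8;

    /* make sure that we don't exhaust the hash bits */
    if (partition_bits + input_bits >= 32)
        partition_bits = 32 - input_bits;
    if (partition_bits < 0)
        partition_bits = 0;

    return partition_bits;
}

/* Route a hash value to a partition using the next unused bits. */
static uint32_t
partition_for_hash(uint32_t hash, int input_bits, int partition_bits)
{
    if (partition_bits <= 0)
        return 0;               /* bits exhausted: everything in one partition */

    /* skip the bits already consumed by the enclosing levels */
    return (hash >> input_bits) & ((1u << partition_bits) - 1);
}
```

Hashjoin batching faces the same exhaustion problem, which is why the review asks for the same general approach in both places.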