Re: Memory-Bounded Hash Aggregation - Mailing list pgsql-hackers
| From | Jeff Davis |
| --- | --- |
| Subject | Re: Memory-Bounded Hash Aggregation |
| Date | |
| Msg-id | 2c8dc5c9178c391493dfbf6c2f595dc298dd477e.camel@j-davis.com |
| In response to | Re: Memory-Bounded Hash Aggregation (Tomas Vondra <tomas.vondra@2ndquadrant.com>) |
| List | pgsql-hackers |
Thanks very much for a great review! I've attached a new patch.

There are some significant changes in the new version as well:

In the non-spilling path, I removed the extra nullcheck branch in the compiled evaltrans expression. When the first tuple is spilled, the branch becomes necessary, so I recompile the expression using a new opcode that includes that branch. I also changed the read-from-spill path to use a slot with TTSOpsMinimalTuple (avoiding the need to make it into a virtual slot right away), which means I need to recompile the evaltrans expression for that case as well.

I also improved the way we initialize the hash tables to use a better estimate for the number of groups, and I made it only initialize one hash table in the read-from-spill path.

With all of the changes I made (thanks to some suggestions from Andres), the performance is looking pretty good. It's pretty easy to beat Sort+Group when the group size is 10+. Even for an average group size of ~1, HashAgg is getting really close to Sort in some cases.

There are still a few things to do, most notably costing. I also need to project before spilling to avoid wasting disk. And I'm sure my changes have created some more problems, so I have some significant work to do on quality.

My answers to your questions are inline:

On Thu, 2019-11-28 at 18:46 +0100, Tomas Vondra wrote:
> Could we instead track how many tuples we actually consumed for the
> in-memory groups, and then use this information to improve the
> estimate of the number of groups? I mean, if we know we've consumed
> 1000 tuples which created 100 groups, then we know there's ~1:10
> ratio.

That would be a good estimate for an even distribution, but not necessarily for a skewed distribution. I'm not opposed to it, but it's generally my philosophy to overpartition, as it seems there's not a big downside.
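To make the ratio idea concrete, here is a minimal sketch of how an observed tuples-per-group ratio could be applied to the spilled input. This is hypothetical, not code from the patch; the function name and signature are made up for illustration:

```c
#include <math.h>

/*
 * Hypothetical sketch: estimate how many distinct groups remain in the
 * spilled tuples, based on the ratio observed while filling the
 * in-memory hash table.  E.g. 1000 tuples consumed for 100 in-memory
 * groups is a ~1:10 ratio, suggesting roughly tuples_spilled / 10 more
 * groups.  Per the caveat above, this is only sound for an even
 * distribution: under skew, the common groups tend to be absorbed in
 * memory first, so the spilled tail can have a very different ratio.
 */
static double
estimate_spilled_groups(double tuples_consumed, double groups_created,
                        double tuples_spilled)
{
    double      tuples_per_group;

    if (groups_created <= 0.0)
        return 1.0;             /* no observations yet; assume one group */

    tuples_per_group = tuples_consumed / groups_created;
    return fmax(1.0, tuples_spilled / tuples_per_group);
}
```

The clamp to at least 1.0 only avoids a nonsense estimate when the spill file is tiny; overpartitioning is the coarser but safer answer to the skew problem.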
> A couple of comments based on eye-balling the patch:
>
> 1) Shouldn't the hashagg_mem_overflow use the other GUC naming, i.e.
> maybe it should be enable_hashagg_mem_overflow or something similar?

The enable_* naming is for planner GUCs. hashagg_mem_overflow is an execution-time GUC that disables spilling and overflows work_mem (that is, it reverts to the old behavior).

> 2) [...] assume the pointer really is NULL. Surely we'll get a
> segfault on the preceding line which does dereference it
>
>     pergroup = &pergroup_allaggs[op->d.agg_init_trans.transno];
>
> Or am I missing anything?

That's not actually dereferencing anything; it's just doing a pointer calculation. You are probably right that it's not a good thing to rely on, or at least not quite as readable, so I changed the order to put the NULL check first.

> 3) execGrouping.c
>
> A couple of functions would deserve a comment, explaining what it
> does.
>
> - LookupTupleHashEntryHash
> - prepare_hash_slot
> - calculate_hash

Done, thank you.

> And it's not clear to me why we should remove part of the comment
> before TupleHashTableHash.

Trying to remember back to when I first did that, but IIRC the comment was not updated from a previous change, and I was cleaning it up. I will check over that again to be sure it's an improvement.

> 4) I'm not sure I agree with this reasoning that
> HASH_PARTITION_FACTOR making the hash tables smaller is desirable -
> it may be, but if that was generally the case we'd just use small
> hash tables all the time. It's a bit annoying to give the user the
> capability to set work_mem and then kinda override that.

I think adding some kind of headroom is reasonable to avoid recursively spilling, but perhaps it's not critical. I see this as a tuning question more than anything else. I don't see it as "overriding" work_mem, but I can see where you're coming from.

> 5) Not sure what "directly" means in this context?
>
> * partitions at the time we need to spill, and because this algorithm
> * shouldn't depend too directly on the internal memory needs of a
> * BufFile.
>
> #define HASH_PARTITION_MEM (HASH_MIN_PARTITIONS * BLCKSZ)
>
> Does that mean we don't want to link to PGAlignedBlock, or what?

That's what I meant, yes, but I reworded the comment to not say that.

> 6) I think we should have some protection against underflows in this
> piece of code:
>
> - this would probably deserve some protection against underflow if
> HASH_PARTITION_MEM gets too big
>
> if (hashagg_mem_overflow)
>     aggstate->hash_mem_limit = SIZE_MAX;
> else
>     aggstate->hash_mem_limit = (work_mem * 1024L) -
>         HASH_PARTITION_MEM;
>
> At the moment it's safe because work_mem is 64kB at least, and
> HASH_PARTITION_MEM is 32kB (4 partitions, 8kB each). But if we happen
> to bump HASH_MIN_PARTITIONS up, this can underflow.

Thank you, done.

> 7) Shouldn't lookup_hash_entry briefly explain why/how it handles the
> memory limit?

Improved.

> 8) The comment before lookup_hash_entries says:
>
> ...
> * Return false if hash table has exceeded its memory limit.
> ..
>
> But that's clearly bogus, because that's a void function.

Thank you, improved comment.

> 9) Shouldn't the hash_finish_initial_spills calls in
> agg_retrieve_direct have a comment, similar to the surrounding code?
> Might be an overkill, not sure.

Sure, done.

> 10) The comment for agg_refill_hash_table says
>
> * Should only be called after all in memory hash table entries have
> * been consumed.
>
> Can we enforce that with an assert, somehow?

It's a bit awkward. Simplehash doesn't expose the number of groups, and we would also have to check each hash table. Not a bad idea to add an interface to simplehash to make that work, though.

> 11) The hash_spill_npartitions naming seems a bit confusing, because
> it seems to imply it's about the "spill" while in practice it just
> chooses the number of spill partitions. Maybe
> hash_choose_num_spill_partitions would be better?

Done.

> 12) It's not clear to me why we need HASH_MAX_PARTITIONS? What's the
> reasoning behind the current value (256)? Not wanting to pick too
> many partitions? Comment?
>
> if (npartitions > HASH_MAX_PARTITIONS)
>     npartitions = HASH_MAX_PARTITIONS;

Added a comment. There's no deep reasoning there -- I just don't want it to choose to create 5000 files and surprise a user.

> 13) As for this:
>
> /* make sure that we don't exhaust the hash bits */
> if (partition_bits + input_bits >= 32)
>     partition_bits = 32 - input_bits;
>
> We already ran into this issue (exhausting bits in a hash value) in
> hashjoin batching, we should be careful to use the same approach in
> both places (not the same code, just general approach).

Didn't investigate this yet, but will do.

Regards,
	Jeff Davis
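P.S. To illustrate the bit-budget issue in point 13: with a 32-bit hash value, each level of recursive spilling consumes some of the bits to route tuples, so the partition count must be clamped once the bits run low. The sketch below is hypothetical (the helper names and the exact routing scheme are made up; the patch may carve up the bits differently), but it shows the shape of the clamp:

```c
#include <stdint.h>

/*
 * Hypothetical sketch of the hash-bit budget: `input_bits` have already
 * been consumed by enclosing levels of repartitioning, so only
 * (32 - input_bits) bits remain for choosing a partition at this level.
 */
static int
choose_partition_bits(int requested_bits, int input_bits)
{
    int         partition_bits = requested_bits;

    /* never more than 256 partitions (8 bits), mirroring HASH_MAX_PARTITIONS */
    if (partition_bits > 8)
        partition_bits = 8;

    /* make sure that we don't exhaust the hash bits */
    if (partition_bits + input_bits >= 32)
        partition_bits = 32 - input_bits;
    if (partition_bits < 0)
        partition_bits = 0;

    return partition_bits;
}

/* Route a hash value to a partition using the next unused bits. */
static uint32_t
partition_for_hash(uint32_t hash, int input_bits, int partition_bits)
{
    if (partition_bits <= 0)
        return 0;               /* bits exhausted: everything in one partition */

    /* skip the bits already consumed by the enclosing levels */
    return (hash >> input_bits) & ((1u << partition_bits) - 1);
}
```

Hashjoin batching faces the same exhaustion problem, which is why the review asks for the same general approach in both places.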