Re: Macro customizable hashtable / bitmapscan & aggregation perf - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: Macro customizable hashtable / bitmapscan & aggregation perf
Date:
Msg-id: 0b71d15b-89ef-bd4f-d653-0a118b550601@2ndquadrant.com
In response to: Re: Macro customizable hashtable / bitmapscan & aggregation perf (Andres Freund <andres@anarazel.de>)
Responses: Re: Macro customizable hashtable / bitmapscan & aggregation perf
List: pgsql-hackers

On 10/01/2016 09:59 PM, Andres Freund wrote:
> Hi,
>
> On 2016-10-01 20:19:21 +0200, Tomas Vondra wrote:
>> On 10/01/2016 02:44 AM, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2016-07-26 17:43:33 -0700, Andres Freund wrote:
>>>> In the attached patch I've attached simplehash.h, which can be
>>>> customized by a bunch of macros, before being inlined. There's also
>>>> a patch using this for tidbitmap.c and nodeAgg/nodeSubplan/... via
>>>> execGrouping.c.
>>>
>>> Attached is a significantly improved version of this series. The
>>> main changes are:
>>>
>>> - The hash table now uses robin-hood style hash collision handling.
>>>   See the commit message of 0002 (or simplehash.h) for an
>>>   introduction. That significantly reduces the impact of
>>>   "clustering" due to linear addressing.
>>
>> Interesting. Have you looked at cuckoo hashing?
>
> Yes. I don't think it's a good fit for modern CPUs. Because it
> doesn't probe linearly you constantly have random uncached memory
> accesses.
>
> I've played with a few schemes, and they all seemed to be slower
> than linear probing variants, unless you intentionally used a wrong
> hash-distribution, or intentionally "unbalanced" the hash-space by
> iterating over either end of the keyspace and moved the entries
> around.

OK, understood.
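For the archives, my understanding of the robin-hood scheme: it keeps
the cache-friendly linear probing, but bounds the variance in probe
distances by letting an inserted key take over a bucket from any
resident entry that sits closer to its own optimal position. A rough
sketch of the insert loop (my reading of the idea, not the actual code
from 0002):

    #include <stdint.h>

    typedef struct Entry
    {
        uint32_t    hash;
        uint32_t    key;
        int         used;
    } Entry;

    /* distance from the key's optimal bucket; table size is a power of 2 */
    static inline uint32_t
    probe_distance(uint32_t hash, uint32_t bucket, uint32_t mask)
    {
        return (bucket - hash) & mask;
    }

    /*
     * Robin-hood insert: steal the bucket from resident entries that
     * are "richer" (closer to their optimal bucket) than the key being
     * inserted. Assumes the table is never completely full.
     */
    static void
    rh_insert(Entry *table, uint32_t mask, uint32_t hash, uint32_t key)
    {
        Entry       cur = {hash, key, 1};
        uint32_t    bucket = hash & mask;
        uint32_t    dist = 0;

        for (;;)
        {
            uint32_t    resident_dist;

            if (!table[bucket].used)
            {
                table[bucket] = cur;    /* empty bucket, done */
                return;
            }

            resident_dist = probe_distance(table[bucket].hash, bucket, mask);

            /*
             * The resident entry is richer than us - swap, and keep
             * probing with the evicted entry instead.
             */
            if (resident_dist < dist)
            {
                Entry       tmp = table[bucket];

                table[bucket] = cur;
                cur = tmp;
                dist = resident_dist;
            }

            bucket = (bucket + 1) & mask;
            dist++;
        }
    }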
>>> - Significant comment and correctness fixes, both in simplehash.h
>>>   itself, and 0003/0004.
>>> - A lot of other random performance improvements for the hashing
>>>   code.
>>>
>>> Unfortunately I'm running out of battery right now, so I don't want
>>> to re-run the benchmarks posted upthread (loading takes a while).
>>> But the last time I ran them all the results after the patches were
>>> better than before.
>>
>> Well, I have rather bad experience with running benchmarks on
>> laptops anyway - a lot of noise due to power management, etc.
>
> Well, with factor ~2x improvements that's not really a big detractor.
> Using a new CPU makes some forms of analysis easier (better PMUs).

True.

> For long-running tests I agree.
>
>> What about running a bigger benchmark - say, TPC-DS - and evaluating
>> the impact?
>
> Worthwhile, although obviously the impact will be a lot smaller,
> since they're not entirely bottlenecked on hash-aggs and bitmap
> scans.

Sure, the improvement won't be as significant as on the simple
queries, but it's interesting IMHO.

>> I think a crucial part of the benchmarking will be identifying and
>> measuring corner cases, e.g. a lot of rows with the same key, etc.
>> Although that probably is not a major issue for the two places
>> switched to the new implementation (e.g. execGrouping merges the
>> duplicates to a single group, by definition).
>
> Hm. I don't really see a case where that's going to cause issues -
> all the execGrouping.c users only store one key, and either ignore
> later ones, or add the data to the initial tuple in the group. I
> don't really see any case where repeated keys should cause an issue
> for a hash table?

Yep, that's pretty much what I suggested. I was wondering more about
the other places where this hash table might be used - I've been
thinking about hashjoins, but now that I think of it, that's fairly
specialized and tuned code. In any case, let's not complicate the
discussion for now.

>>> This patch series currently consists of four patches:
>>> - 0001 - add unlikely/likely() macros. The use here is not entirely
>>>   mandatory, but they do provide a quite consistent improvement.
>>>   There are other threads where they proved to be beneficial, so I
>>>   see little reason not to introduce them.
>>> - 0002 - the customizable hashtable itself. It now contains
>>>   documentation.
>>> - 0003 - convert tidbitmap.c to use the new hashmap. This includes
>>>   a fix for the issue mentioned in [1], to improve performance in
>>>   heavily lossy scenarios (otherwise we could regress in some
>>>   extreme cases).
>>> - 0004 - convert execGrouping.c, e.g. used by hash aggregates.
>>>
>>> While not quite perfect yet, I do think this is at a state where
>>> input is needed. I don't want to go a lot deeper into this rabbit
>>> hole, before we have some agreement that this is a sensible course
>>> of action.
>>
>> So, is it the right time to do some benchmarking?
>
> That's one thing that's required, yes. The other is whether we can
> live with the ugliness of implementing templates via macros. I do
> think we can, but...

Hmmm ... not sure. If one of the points is to get rid of function
calls determined at runtime (which make it impossible for the compiler
to optimize the code), then I can think of three options:

(a) copy-paste the code for each place
(b) use some templating
(c) use JIT

I think (b) is way better than (a), and I don't think we're using JIT
anywhere at this point. So while the macro-based templates look a bit
awkward, I'm not aware of a better way to do this in C.
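For readers who haven't looked at the patch yet, the technique boils
down to something like the following (a generic sketch of macro
templating with made-up names, not the actual simplehash.h interface):

    /*
     * ---- hash_template.h (sketch) ----
     *
     * The includer defines HT_NAME, HT_KEY_TYPE, HT_HASH and HT_EQUAL
     * before including this file. Everything is emitted as static
     * inline functions whose names are derived from HT_NAME, so each
     * instantiation is a separate, fully specialized copy - no
     * function pointers resolved at runtime.
     */
    #include <stdbool.h>
    #include <stdint.h>

    #define HT_CAT2(a, b)   a##_##b
    #define HT_CAT(a, b)    HT_CAT2(a, b)
    #define HT_FN(name)     HT_CAT(HT_NAME, name)

    typedef struct HT_FN(entry)
    {
        HT_KEY_TYPE key;
        bool        used;
    } HT_FN(entry);

    static inline bool
    HT_FN(contains)(HT_FN(entry) *tab, uint32_t nbuckets, HT_KEY_TYPE key)
    {
        /* nbuckets is assumed to be a power of 2 */
        uint32_t    i = HT_HASH(key) & (nbuckets - 1);

        while (tab[i].used)
        {
            if (HT_EQUAL(tab[i].key, key))
                return true;
            i = (i + 1) & (nbuckets - 1);
        }
        return false;
    }

    /* (a real template header would #undef its parameters here, so it
     * can be included again with different settings) */

and an instantiation then looks like:

    #define HT_NAME         inthash
    #define HT_KEY_TYPE     int
    #define HT_HASH(k)      ((uint32_t) (k) * 0x9e3779b1u)
    #define HT_EQUAL(a, b)  ((a) == (b))
    #include "hash_template.h"

    /* this emits inthash_entry and inthash_contains(), which the
     * compiler can inline and constant-fold at every call site */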
A few comments, after quickly skimming through the first two patches:

1) SH_CREATE does this:

    /* round up size to the next power of 2, eases lookups */
    if (size < 2)
        size = 2;
    else
        size = sh_pow2_int(size);

I very much doubt a hash table with 2 buckets is very useful. What
about using some reasonable lower boundary - IIRC hashjoins use 1024
buckets, which might be a bit too much, but perhaps 256 would work?

2) tb->resize_above

I'd say 'resize_threshold' is a better name. Also, 0.8 should be
defined as a constant somewhere (not sure if it makes sense to make it
part of SH_TYPE, so that hash tables may use different load factors).

3) 'size' is a rather poor name for a parameter (e.g. in SH_CREATE or
SH_RESIZE), as it's unclear whether it means 'element size in bytes',
'total size in bytes' or 'number of entries'.

4) SH_CONTAINS sounds more like a function checking whether a hash
table contains a key, not like a 'type of hash table entry'. What
about SH_ENTRY_TYPE instead? Also, SH_KEYTYPE should be SH_KEY_TYPE, I
guess.

5) SH_RESIZE

Do we expect to shrink hash tables? If not, SH_RESIZE should probably
check that newsize > oldsize. If yes, it should check that there's
enough space for all entries (with the load factor).

It's not entirely clear why it is guaranteed that there's always an
element with an optimal position (when load factor < 1)? Adding an
explanation or a link to a paper would be helpful, I guess.

6) SH_STAT

This bit is a bit suspicious:

    uint32      max_collisions = 0;

    ...

    /* single contained element is not a collision */
    curcoll--;
    total_collisions += curcoll;
    if (curcoll > max_collisions)
        max_collisions = curcoll - 1;

Shouldn't the last line be just "max_collisions = curcoll"?

7) Should the hash agg size estimation for the planner consider the
fillfactor? I think we should account for the fillfactor - we should
account for the allocated memory, even if there's free space in the
hash table. After all, that's what hashjoins do. See the sketch below.
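To be concrete about what I mean in 7), something along these lines (a
sketch only - the helper name is made up, and 0.8 stands in for the
patch's resize threshold):

    #include <stddef.h>

    #define HASH_FILLFACTOR 0.8     /* the load factor from 0002 */

    /*
     * Hypothetical planner helper: estimate the memory actually
     * allocated by the hash table, not just the memory occupied by
     * live entries - buckets kept free by the load factor still take
     * up space.
     */
    static size_t
    hash_table_bytes(double nentries, size_t entrysize)
    {
        size_t      nbuckets = 2;

        /*
         * Enough buckets to keep nentries below the resize threshold,
         * rounded up to a power of 2 like SH_CREATE does.
         */
        while ((double) nbuckets * HASH_FILLFACTOR < nentries)
            nbuckets *= 2;

        return nbuckets * entrysize;
    }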
regards

-- 
Tomas Vondra                    http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services