Thread: tweaking perfect hash multipliers
Hi all,

While playing around with Peter E.'s unicode normalization patch [1], I found that HEAD failed to build a perfect hash function for any of the four sets of 4-byte keys, which range from 1k to 17k in number. It probably doesn't help that codepoints have nul bytes and often cluster into consecutive ranges.

In addition, I found that a couple of the candidate hash multipliers don't compile to shift-and-add instructions, although they were chosen with that intent in mind. It seems compilers will only do that if the number is exactly 2^n +/- 1. Using the latest gcc and clang, I tested all prime numbers up to 5000 (plus 8191 for good measure), and found a handful that are compiled into non-imul instructions. Dialing back the version, gcc 4.8 and clang 7.0 are the earliest I found that have the same behavior as newer ones. For reference:

https://gcc.godbolt.org/z/bxcXHu

In addition to shift-and-add, there are also a few using lea, lea-and-add, or 2 leas.

Then I used the attached program to measure various combinations of compiled instructions using two constant multipliers iterating over bytes, similar to a generated hash function:

<cc> -O2 -Wall test-const-mult.c test-const-mult-2.c
./a.out

Median of 3 with clang 10:

lea, lea                 0.181s

lea, lea+add             0.248s
lea, shift+add           0.251s

lea+add, shift+add       0.273s
shift+add, shift+add     0.276s

2 leas, 2 leas           0.290s
shift+add, imul          0.329s

Taking this with a grain of salt, it nonetheless seems plausible that a single lea could be faster than any two instructions here. The only primes that compile to a single lea are 3 and 5, but as a demonstration, I've found those multipliers can build hash functions for all our keyword lists. None of the other candidates we don't already use are particularly interesting from a performance point of view.

With the unicode quick check, I found that the larger sets need (257, 8191) as multipliers to build the hash table, and none of the smaller special primes I tested will work.

Keeping these two properties in mind, I came up with the scheme in the attached patch, which tries adjacent pairs in this array:

(3, 5, 17, 31, 127, 257, 8191)

so that we try (3,5) first, then (5,17), and so on through the pure shift-and-add pairs, with (257,8191) last. The main motivation is to be able to build the unicode quick check tables, but if we ever use this functionality in a hot code path, we may as well try to shave a few more cycles while we're at it.

[1] https://www.postgresql.org/message-id/flat/c1909f27-c269-2ed9-12f8-3ab72c8caf7a@2ndquadrant.com

--
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
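[To make the adjacent-pair scheme described above concrete, here is a minimal C sketch. It is not taken from the patch (the real generator tries seeds as well, and this multiplier list matches the first patch version); try_build_hash() is a hypothetical stand-in for the generator's collision check.]

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: returns true if (mult1, mult2) yields a collision-free
 * mapping for the given keys; stands in for the generator's real search. */
extern bool try_build_hash(const uint32_t *keys, int nkeys,
                           int mult1, int mult2);

/* Candidate multipliers, cheapest-to-compute first. */
static const int hash_mults[] = {3, 5, 17, 31, 127, 257, 8191};

/* Try adjacent pairs: (3,5), (5,17), ..., ending with (257,8191). */
static bool
choose_multipliers(const uint32_t *keys, int nkeys, int *mult1, int *mult2)
{
    size_t nmults = sizeof(hash_mults) / sizeof(hash_mults[0]);

    for (size_t i = 0; i + 1 < nmults; i++)
    {
        if (try_build_hash(keys, nkeys, hash_mults[i], hash_mults[i + 1]))
        {
            *mult1 = hash_mults[i];
            *mult2 = hash_mults[i + 1];
            return true;
        }
    }
    return false;
}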
Hi,

On 2020-03-30 21:33:14 +0800, John Naylor wrote:
> Then I used the attached program to measure various combinations of
> compiled instructions using two constant multipliers iterating over
> bytes, similar to a generated hash function.

It looks like you didn't attach the program?

> <cc> -O2 -Wall test-const-mult.c test-const-mult-2.c
> ./a.out
>
> Median of 3 with clang 10:
>
> lea, lea                 0.181s
>
> lea, lea+add             0.248s
> lea, shift+add           0.251s
>
> lea+add, shift+add       0.273s
> shift+add, shift+add     0.276s
>
> 2 leas, 2 leas           0.290s
> shift+add, imul          0.329s
>
> Taking this with a grain of salt, it nonetheless seems plausible that
> a single lea could be faster than any two instructions here.

It's a bit complicated by the fact that there are more execution ports that can execute a shift/add than there are ports that can compute some forms of lea. And some of that won't easily be measurable in a micro-benchmark, because there'll be dependencies between the instructions preventing any instruction-level parallelism.

I think the form of lea generated here is among the ones that can only be executed on port 1, whereas e.g. a register+register/immediate add can be executed on four different ports.

There's also a significant difference in latency that you might not see in your benchmark. E.g. on Coffee Lake the relevant form of lea has a latency of three cycles, but one independent lea can be "started" per cycle (Agner calls this "reciprocal throughput"). Whereas a shift has a latency of 1 cycle and a reciprocal throughput of 0.5 (lower is better), and add has a latency of 1 and a reciprocal throughput of 0.25. See the tables in
https://www.agner.org/optimize/instruction_tables.pdf

I'm not really sure my musings above matter terribly much, but I just wanted to point out why I'd not put too much stock in the above timings in isolation. Even a very high latency wouldn't necessarily be penalized in a benchmark where each loop iteration is independent of the others, but it would matter in the real world.

Cool work!

Greetings,

Andres Freund
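[The following standalone example is not from the thread; it only illustrates the latency-vs-throughput point Andres makes above. In the first loop each multiply depends on the previous result, so its full latency is paid every iteration; in the second, four independent accumulators let the CPU overlap the work, so reciprocal throughput dominates instead. Both loops do the same total number of multiply-adds.]

#include <stdint.h>
#include <stdio.h>

#define N 100000000

int
main(void)
{
    uint32_t dep = 1;
    uint32_t a = 1, b = 1, c = 1, d = 1;

    /* Dependent chain: iteration i needs the result of iteration i-1,
     * so the multiply's latency shows up on every iteration. */
    for (int i = 0; i < N; i++)
        dep = dep * 31 + 1;

    /* Independent chains: four accumulators with no cross-dependencies,
     * so the CPU can execute them in parallel across execution ports. */
    for (int i = 0; i < N; i += 4)
    {
        a = a * 31 + 1;
        b = b * 31 + 1;
        c = c * 31 + 1;
        d = d * 31 + 1;
    }

    /* Use the results so the compiler can't discard the loops. */
    printf("%u %u\n", (unsigned) dep, (unsigned) (a + b + c + d));
    return 0;
}

[The absolute timings vary a lot by CPU; the point is only that a dependent chain, like the byte loop in a hash function, is sensitive to latency in a way an unrolled independent loop is not.]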
On Tue, Mar 31, 2020 at 2:31 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-03-30 21:33:14 +0800, John Naylor wrote:
> > Then I used the attached program to measure various combinations of
> > compiled instructions using two constant multipliers iterating over
> > bytes, similar to a generated hash function.
>
> It looks like you didn't attach the program?

Funny, I did, but then decided to rename the files. Here they are. I tried to make the loop similar to how it'd be in the actual hash function, but left out the post-loop modulus and array access. Each loop iteration depends on the previous one's result.

> It's a bit complicated by the fact that there are more execution ports
> that can execute a shift/add than there are ports that can compute some
> forms of lea. And some of that won't easily be measurable in a
> micro-benchmark, because there'll be dependencies between the
> instructions preventing any instruction-level parallelism.
>
> I think the form of lea generated here is among the ones that can only
> be executed on port 1, whereas e.g. a register+register/immediate add
> can be executed on four different ports.

That's interesting, I'll have to look into that. Thanks for the info!

--
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
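[The attachments themselves aren't reproduced here. From the description above (two constant multipliers, a byte-at-a-time loop, each iteration depending on the last, and no post-loop modulus or table lookup), the measured loop presumably has roughly the following shape. This is an approximation for illustration, not the actual test program.]

#include <stddef.h>
#include <stdint.h>

/* Approximate shape of the loop being timed: combine each byte with two
 * running values using constant multipliers, the way a generated perfect
 * hash function would, minus the final modulus and lookup-table access.
 * Each iteration depends on the previous values of a and b. */
static uint32_t
hash_bytes(const unsigned char *key, size_t len)
{
    uint32_t a = 0;
    uint32_t b = 0;

    while (len--)
    {
        unsigned char c = *key++;

        a = a * 257 + c;        /* 257 = 2^8 + 1: shift and add */
        b = b * 8191 + c;       /* 8191 = 2^13 - 1: shift and subtract */
    }
    return a + b;               /* the real function would reduce these
                                 * modulo the hash table size instead */
}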
On Tue, Mar 31, 2020 at 2:31 AM Andres Freund <andres@anarazel.de> wrote:
> I think the form of lea generated here is among the ones that can only
> be executed on port 1, whereas e.g. a register+register/immediate add
> can be executed on four different ports.

I looked into slow vs. fast leas, and I think the ones above are actually fast because they have 2 operands:

leal (%rdi,%rdi,2), %eax

A 3-operand lea would look like this:

leal 42(%rdi,%rdi,8), %ecx

In other words, the scale doesn't count as an operand, although I've seen it said in a couple of places that a non-1 scale adds a cycle of latency for some AMD chips.

Some interesting discussion in these LLVM reviews from 2017 about avoiding slow leas:

https://reviews.llvm.org/D32277
https://reviews.llvm.org/D32352

--
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
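[A quick way to see the forms discussed above is to compile something like the following at -O2, as in the godbolt link upthread. The comments sketch typical x86-64 output for these constants; the exact instructions of course vary by compiler version and target, so treat them as examples rather than guarantees.]

#include <stdint.h>

/* Small constants of the 2^n +/- 1 family; these typically compile to
 * lea or shift+add/sub sequences rather than imul. */
uint32_t mul3(uint32_t x)    { return x * 3; }    /* e.g. leal (%rdi,%rdi,2), %eax */
uint32_t mul5(uint32_t x)    { return x * 5; }    /* e.g. leal (%rdi,%rdi,4), %eax */
uint32_t mul257(uint32_t x)  { return x * 257; }  /* e.g. shift left by 8, then add */
uint32_t mul8191(uint32_t x) { return x * 8191; } /* e.g. shift left by 13, then subtract */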
On Tue, Mar 31, 2020 at 4:05 PM John Naylor <john.naylor@2ndquadrant.com> wrote:
>
> On Tue, Mar 31, 2020 at 2:31 AM Andres Freund <andres@anarazel.de> wrote:
> > I think the form of lea generated here is among the ones that can only
> > be executed on port 1, whereas e.g. a register+register/immediate add
> > can be executed on four different ports.
>
> I looked into slow vs. fast leas, and I think the ones above are actually
> fast because they have 2 operands.

No, scratch that. It seems the two forms of lea generated are:

leal (,%rdx,8), %ecx
leal (%rdx,%rdx,8), %ecx

The first operand (the displacement) in both is an implicit zero, so with 3 and 5 we do get the slow lea on some architectures. So I've only kept the shift-and-add multipliers in v2.

I also changed the order of iteration of the parameters, for build speed. Before, it took over 30 seconds to build the unicode quick check tables; now it takes under 2 seconds.

--
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services