Thread: tweaking perfect hash multipliers
Hi all,

While playing around with Peter E.'s unicode normalization patch [1], I found that HEAD failed to build a perfect hash function for any of the four sets of 4-byte keys, which range from 1k to 17k in number. It probably doesn't help that codepoints have nul bytes and often cluster into consecutive ranges.

In addition, I found that a couple of the candidate hash multipliers don't compile to shift-and-add instructions, although they were chosen with that intent in mind. It seems compilers will only do that if the number is exactly 2^n +/- 1. Using the latest gcc and clang, I tested all prime numbers up to 5000 (plus 8191 for good measure), and found a handful that are compiled into non-imul instructions. Dialing back the version, gcc 4.8 and clang 7.0 are the earliest I found that have the same behavior as newer ones. For reference:

https://gcc.godbolt.org/z/bxcXHu

In addition to shift-and-add, there are also a few using lea, lea-and-add, or 2 leas.

Then I used the attached program to measure various combinations of compiled instructions using two constant multipliers iterating over bytes, similar to a generated hash function:

<cc> -O2 -Wall test-const-mult.c test-const-mult-2.c
./a.out

Median of 3 with clang 10:

lea, lea                 0.181s

lea, lea+add             0.248s
lea, shift+add           0.251s

lea+add, shift+add       0.273s
shift+add, shift+add     0.276s

2 leas, 2 leas           0.290s
shift+add, imul          0.329s

Taking this with a grain of salt, it nonetheless seems plausible that a single lea could be faster than any two instructions here. The only primes that compile to a single lea are 3 and 5, but as a demonstration, I've found those multipliers can build hash functions for all our keyword lists. None of the other candidates we don't already use are particularly interesting from a performance point of view.

With the unicode quick check, I found that the larger sets need (257, 8191) as multipliers to build the hash table, and none of the smaller special primes I tested will work.

Keeping these two properties in mind, I came up with the scheme in the attached patch, which tries adjacent pairs in this array:

(3, 5, 17, 31, 127, 257, 8191)

so that we try (3,5) first, then (5,17), and so on through the pure shift-and-add pairs, with (257,8191) last. The main motivation is to be able to build the unicode quick check tables, but if we ever use this functionality in a hot code path, we may as well try to shave a few more cycles while we're at it.

[1] https://www.postgresql.org/message-id/flat/c1909f27-c269-2ed9-12f8-3ab72c8caf7a@2ndquadrant.com

--
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
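[To make the adjacent-pair scheme described above concrete, here is a minimal C sketch. It is not taken from the patch (the real generator tries seeds as well, and this multiplier list matches the first patch version); try_build_hash() is a hypothetical stand-in for the generator's collision check.]

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: returns true if (mult1, mult2) yields a collision-free
 * mapping for the given keys; stands in for the generator's real search. */
extern bool try_build_hash(const uint32_t *keys, int nkeys,
                           int mult1, int mult2);

/* Candidate multipliers, cheapest-to-compute first. */
static const int hash_mults[] = {3, 5, 17, 31, 127, 257, 8191};

/* Try adjacent pairs: (3,5), (5,17), ..., ending with (257,8191). */
static bool
choose_multipliers(const uint32_t *keys, int nkeys, int *mult1, int *mult2)
{
    size_t nmults = sizeof(hash_mults) / sizeof(hash_mults[0]);

    for (size_t i = 0; i + 1 < nmults; i++)
    {
        if (try_build_hash(keys, nkeys, hash_mults[i], hash_mults[i + 1]))
        {
            *mult1 = hash_mults[i];
            *mult2 = hash_mults[i + 1];
            return true;
        }
    }
    return false;
}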
Hi,

On 2020-03-30 21:33:14 +0800, John Naylor wrote:
> Then I used the attached program to measure various combinations of
> compiled instructions using two constant multipliers iterating over
> bytes, similar to a generated hash function.

It looks like you didn't attach the program?

> <cc> -O2 -Wall test-const-mult.c test-const-mult-2.c
> ./a.out
>
> Median of 3 with clang 10:
>
> lea, lea                 0.181s
>
> lea, lea+add             0.248s
> lea, shift+add           0.251s
>
> lea+add, shift+add       0.273s
> shift+add, shift+add     0.276s
>
> 2 leas, 2 leas           0.290s
> shift+add, imul          0.329s
>
> Taking this with a grain of salt, it nonetheless seems plausible that
> a single lea could be faster than any two instructions here.

It's a bit complicated by the fact that there are more execution ports that can execute a shift/add than there are ports that can compute some forms of lea. And some of that won't easily be measurable in a micro-benchmark, because there'll be dependencies between the instructions preventing any instruction-level parallelism.

I think the form of lea generated here is among the ones that can only be executed on port 1, whereas e.g. a register+register/immediate add can be executed on four different ports.

There's also a significant difference in latency that you might not see in your benchmark. E.g. on Coffee Lake the relevant form of lea has a latency of three cycles, but one independent lea can be "started" per cycle (Agner calls this "reciprocal throughput"). Whereas a shift has a latency of 1 cycle and a reciprocal throughput of 0.5 (lower is better), and add has a latency of 1 and a reciprocal throughput of 0.25. See the tables in
https://www.agner.org/optimize/instruction_tables.pdf

I'm not really sure my musings above matter terribly much, but I just wanted to point out why I'd not put too much stock in the above timings in isolation. Even a very high latency wouldn't necessarily be penalized in a benchmark where each loop iteration is independent of the others, but it would matter in the real world.

Cool work!

Greetings,

Andres Freund
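[The following standalone example is not from the thread; it only illustrates the latency-vs-throughput point Andres makes above. In the first loop each multiply depends on the previous result, so its full latency is paid every iteration; in the second, four independent accumulators let the CPU overlap the work, so reciprocal throughput dominates instead. Both loops do the same total number of multiply-adds.]

#include <stdint.h>
#include <stdio.h>

#define N 100000000

int
main(void)
{
    uint32_t dep = 1;
    uint32_t a = 1, b = 1, c = 1, d = 1;

    /* Dependent chain: iteration i needs the result of iteration i-1,
     * so the multiply's latency shows up on every iteration. */
    for (int i = 0; i < N; i++)
        dep = dep * 31 + 1;

    /* Independent chains: four accumulators with no cross-dependencies,
     * so the CPU can execute them in parallel across execution ports. */
    for (int i = 0; i < N; i += 4)
    {
        a = a * 31 + 1;
        b = b * 31 + 1;
        c = c * 31 + 1;
        d = d * 31 + 1;
    }

    /* Use the results so the compiler can't discard the loops. */
    printf("%u %u\n", (unsigned) dep, (unsigned) (a + b + c + d));
    return 0;
}

[The absolute timings vary a lot by CPU; the point is only that a dependent chain, like the byte loop in a hash function, is sensitive to latency in a way an unrolled independent loop is not.]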
On Tue, Mar 31, 2020 at 2:31 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-03-30 21:33:14 +0800, John Naylor wrote:
> > Then I used the attached program to measure various combinations of
> > compiled instructions using two constant multipliers iterating over
> > bytes, similar to a generated hash function.
>
> It looks like you didn't attach the program?

Funny, I did, but then decided to rename the files. Here they are. I tried to make the loop similar to how it'd be in the actual hash function, but left out the post-loop modulus and array access. Each loop iteration depends on the previous one's result.

> It's a bit complicated by the fact that there are more execution ports
> that can execute a shift/add than there are ports that can compute some
> forms of lea. And some of that won't easily be measurable in a
> micro-benchmark, because there'll be dependencies between the
> instructions preventing any instruction-level parallelism.
>
> I think the form of lea generated here is among the ones that can only
> be executed on port 1, whereas e.g. a register+register/immediate add
> can be executed on four different ports.

That's interesting, I'll have to look into that. Thanks for the info!

--
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
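[The attachments themselves aren't reproduced here. From the description above (two constant multipliers, a byte-at-a-time loop, each iteration depending on the last, and no post-loop modulus or table lookup), the measured loop presumably has roughly the following shape. This is an approximation for illustration, not the actual test program.]

#include <stddef.h>
#include <stdint.h>

/* Approximate shape of the loop being timed: combine each byte with two
 * running values using constant multipliers, the way a generated perfect
 * hash function would, minus the final modulus and lookup-table access.
 * Each iteration depends on the previous values of a and b. */
static uint32_t
hash_bytes(const unsigned char *key, size_t len)
{
    uint32_t a = 0;
    uint32_t b = 0;

    while (len--)
    {
        unsigned char c = *key++;

        a = a * 257 + c;        /* 257 = 2^8 + 1: shift and add */
        b = b * 8191 + c;       /* 8191 = 2^13 - 1: shift and subtract */
    }
    return a + b;               /* the real function would reduce these
                                 * modulo the hash table size instead */
}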
On Tue, Mar 31, 2020 at 2:31 AM Andres Freund <andres@anarazel.de> wrote:
> I think the form of lea generated here is among the ones that can only
> be executed on port 1, whereas e.g. a register+register/immediate add
> can be executed on four different ports.

I looked into slow vs. fast leas, and I think the ones above are actually fast because they have 2 operands:

leal (%rdi,%rdi,2), %eax

A 3-operand lea would look like this:

leal 42(%rdi,%rdi,8), %ecx

In other words, the scale doesn't count as an operand, although I've seen it said in a couple of places that a non-1 scale adds a cycle of latency for some AMD chips.

Some interesting discussion in these LLVM reviews from 2017 about avoiding slow leas:

https://reviews.llvm.org/D32277
https://reviews.llvm.org/D32352

--
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
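[A quick way to see the forms discussed above is to compile something like the following at -O2, as in the godbolt link upthread. The comments sketch typical x86-64 output for these constants; the exact instructions of course vary by compiler version and target, so treat them as examples rather than guarantees.]

#include <stdint.h>

/* Small constants of the 2^n +/- 1 family; these typically compile to
 * lea or shift+add/sub sequences rather than imul. */
uint32_t mul3(uint32_t x)    { return x * 3; }    /* e.g. leal (%rdi,%rdi,2), %eax */
uint32_t mul5(uint32_t x)    { return x * 5; }    /* e.g. leal (%rdi,%rdi,4), %eax */
uint32_t mul257(uint32_t x)  { return x * 257; }  /* e.g. shift left by 8, then add */
uint32_t mul8191(uint32_t x) { return x * 8191; } /* e.g. shift left by 13, then subtract */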
On Tue, Mar 31, 2020 at 4:05 PM John Naylor <john.naylor@2ndquadrant.com> wrote:
>
> On Tue, Mar 31, 2020 at 2:31 AM Andres Freund <andres@anarazel.de> wrote:
> > I think the form of lea generated here is among the ones that can only
> > be executed on port 1, whereas e.g. a register+register/immediate add
> > can be executed on four different ports.
>
> I looked into slow vs. fast leas, and I think the ones above are actually
> fast because they have 2 operands.

No, scratch that. It seems the two forms of lea generated are:

leal (,%rdx,8), %ecx
leal (%rdx,%rdx,8), %ecx

The first operand (the displacement) in both is an implicit zero, so with 3 and 5 we do get the slow lea on some architectures. So I've only kept the shift-and-add multipliers in v2.

I also changed the order of iteration of the parameters, for build speed. Before, it took over 30 seconds to build the unicode quick check tables; now it takes under 2 seconds.

--
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services