Re: [PATCHES] updated hash functions for postgresql v1 - Mailing list pgsql-hackers
From | Oleg Bartunov |
---|---|
Subject | Re: [PATCHES] updated hash functions for postgresql v1 |
Date | |
Msg-id | Pine.LNX.4.64.0811042330110.15810@sn.sai.msu.ru Whole thread Raw |
In response to | Re: [PATCHES] updated hash functions for postgresql v1 (Kenneth Marshall <ktm@rice.edu>) |
Responses |
Re: [PATCHES] updated hash functions for postgresql v1
|
List | pgsql-hackers |
Just interested if you repeat your tests not with cracklib-dict, but using 8-bit words. From our experience we found many hash functions are optimized for 7-bit words and produce too many collisions for 8-bit words. That's why we use crc32. Oleg On Tue, 4 Nov 2008, Kenneth Marshall wrote: > Sorry about the delay for this update to the new hash > index implementation. I was trying to get the WAL logging > in place and forgot to post the actual patch. The WAL > for hash indexes will need to wait for 8.5, but I did > want to add back in the piece of the Bob Jenkins 2006 > hash function that was stripped out of the initial > patch on application due to concerns about the randomness > of the resulting hash values. Here is a re-post of my > initial findings comparing the old/new Jenkins hash > from lookup2 and lookup3. I have added a third column > containing the results for the hash_any() resulting > from the attached patch as well as simple timing test > for a DB reindex both before and after patching. > > Also attached is a simple documentation patch updating > the note attached to the hash index description. > > Regards, > Ken > ---------------------------------------------------- > Hi, > > I have finally had a chance to do some investigation on > the performance of the old hash mix() function versus > the updated mix()/final() in the new hash function. Here > is a table of my current results for both the old and the > new hash function. In this case cracklib refers to the > cracklib-dict containing 1648379 unique words massaged > in various ways to generate input strings for the hash > functions. The result is the number of collisions in the > hash values generated. > > hash input old new newv2 > ---------- --- --- ----- > cracklib 338 316 338 > cracklib x 2 (i.e. clibclib) 305 319 300 > cracklib x 3 (clibclibclib) 323 329 315 > cracklib x 10 302 310 329 > cracklib x 100 350 335 298 > cracklib x 1000 314 309 315 > cracklib x 100 truncated to char(100) 311 327 320 > > uint32 from 1-1648379 309 319 347 > (uint32 1-1948379)*256 309 314 304 > (uint32 1-1948379)*16 310 314 324 > "a"uint32 (i.e. a00001,a0002...) 320 321 312 > > uint32uint32 (i.e. uint64) 321 287 309 > > The different result columns are old = Jenkins 1996 hash > function(lookup2.c), new = Jenkins 2006 hash function > (lookup3.c), and newv2 = adaptation of current hash_any() > to incorporate the separate mix()/final() functions. As > you can see from the results, spliting the mix() and final() > apart does not result in any perceptible loss of randomness > in the hash assignment. I also ran a crude timing for a > reindex of the following database: > > CREATE TABLE dict (word text); > CREATE INDEX wordhash ON dict USING hash (word); > INSERT INTO dict (word) VALUES('s;lhkjdpyoijxfg;lktjgh;sdlhkjo'); > INSERT INTO dict (SELECT MAX(word)||MAX(word) FROM dict); > ... (21 times) > > REINDEX TABLE > ... > > The average time to reindex the table using our current > hash_any() without the separate mix()/final() was 1696ms > and 1482ms with the separate mix()/final() stages giving > almost 13% better performance for this stupid metric. > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
pgsql-hackers by date: