Re: 9.1 support for hashing arrays - Mailing list pgsql-hackers
From | Dean Rasheed |
---|---|
Subject | Re: 9.1 support for hashing arrays |
Date | |
Msg-id | BANLkTinGShCT-+bBDDuGPP=N3PQKYZ-6eQ@mail.gmail.com Whole thread Raw |
In response to | Re: 9.1 support for hashing arrays (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: 9.1 support for hashing arrays
|
List | pgsql-hackers |
On 19 May 2011 15:33, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, May 17, 2011 at 2:44 PM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote: >> The algorithm for this was discussed in the original thread >> (http://archives.postgresql.org/pgsql-hackers/2010-10/msg02050.php) >> but I don't that think a satisfactory conclusion was really reached. >> In particular, it is way too easy to come up with pathological cases >> that defeat the hashing algorithm, for example: >> >> CREATE TABLE foo(a int[][]); >> INSERT INTO foo SELECT array_fill(i, ARRAY[8,8]) >> FROM generate_series(1,10000) g(i); >> >> All 10000 arrays are different, but they all have the same hash value >> (0), so if the query optimiser chooses to hash the arrays, the >> performance will be very poor. >> >> A few people on that thread (myself included - >> http://archives.postgresql.org/pgsql-hackers/2010-11/msg00123.php) >> suggested using the multiply-by-31 algorithm but I think I failed to >> properly make the case for it. Having given it some further thought, I >> think there are some very sound mathematical reasons why that >> algorithm performs well: >> >> The algorithm is to take the current hash total, multiply it by 31 and >> then add on the hash of the next element. The final result is a >> polynomial sum, where each element's hash value is multiplied by a >> different power of 31. >> >> Since this is all modulo 2^32 arithmetic, the powers of 31 will >> eventually start repeating, and at that point the hashing algorithm >> could be defeated by transpositions. However, the number 31 has the >> property that its powers don't repeat for a long time - the powers of >> 31 modulo 2^32 form a cyclic group with a multiplicative order of 2^27 >> (134217728). In other words 31^134217728 = 1 mod 2^32, and there are >> no smaller (strictly positive) powers of 31 for which this is the >> case. >> >> So the multiply-by-31 algorithm is only vulnerable to transpositions >> once the arrays reach 134217728 elements. >> >> For all smaller arrays, each array element's hash value is multiplied >> by a number different number from all the other elements, and since >> all the multipliers are odd numbers, *all* the individual bits from >> each element's hash value are distributed (differently) in the final >> value. >> >> Of course there are still going to be pathological cases, but they are >> very difficult to construct deliberately, and extremely unlikely to >> occur randomly. ISTM that this has all the properties of a good >> hashing algorithm (possibly the Java folks did a similar analysis and >> came to the same conclusion). > > Yes, I never was very happy with the way that we were doing this, and > I think you make a coherent argument for why we should do it > differently. > Doh! I forgot one important piece of this algorithm - it is necessary to initialise the result to something non-zero at the start so that adding leading nulls to an array changes the final result. Regards, Dean
Attachment
pgsql-hackers by date: