Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop - Mailing list pgsql-bugs
From | Tomas Vondra
---|---
Subject | Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop
Date |
Msg-id | 29a04fd9-abe6-7c5f-dadc-a80459890f53@2ndquadrant.com
In response to | Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses | Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop
List | pgsql-bugs
On 12/07/2017 12:45 AM, Tomas Vondra wrote:
>
> On 12/06/2017 11:55 PM, Tom Lane wrote:
>> Andres Freund <andres@anarazel.de> writes:
>>> On 2017-12-06 16:38:18 -0500, Todd A. Cook wrote:
>>>> I found this problem when I dropped 10.1 into a test environment to see
>>>> what would happen. There was no deliberate attempt to break anything.
>>
>>> Read Thomas' message at: http://archives.postgresql.org/message-id/263b03b1-3e1c-49ca-165a-8ac6751419c4%402ndquadrant.com
>>
>> I'm confused by Tomas' claim that
>>
>>>> (essentially hashint8 only ever produces 60% of
>>>> values from [0,1000000], which likely increases collision rate).
>>
>> This is directly contradicted by the simple experiments I've done, eg
>>
>> regression=# select count(distinct hashint8(v)) from generate_series(0,1000000::int8) v;
>>  count
>> --------
>>  999879
>> (1 row)
>>
>> regression=# select count(distinct hashint8(v) & (1024*1024-1)) from generate_series(0,1000000::int8) v;
>>  count
>> --------
>>  644157
>> (1 row)
>>
>> regression=# select count(distinct hashint8(v) & (1024*1024-1)) from generate_series(0,10000000::int8) v;
>>   count
>> ---------
>>  1048514
>> (1 row)
>>
>> It's certainly not perfect, but I'm not observing any major failure to
>> span the output space.
>>
>
> That is not the claim I was making, though. When generating data sets
> I've been considering only values where hashint8(v) is between 0 and
> 1000000 *without* the masking. Otherwise growing the hash table would
> break the "chain" of values in the hash table, which would resolve the
> issue with SH_GROW_MAX_MOVE.
>
> So you'd have to do something like this:
>
>     select count(distinct hashint8(v))
>       from generate_series(0,1000000::int8) v
>      where hashint8(v) between 0 and 1024*1024;
>
> but to get some numbers you'd also have to increase the value passed to
> generate_series (because the WHERE clause filters out most of the values).
>
> Which is why I generated the data with an ugly C program, similar to the
> attached one. It keeps a small bitmap for generated hashes in the [0,1M]
> range, and prints the number of bits set after every 1e9 values processed.
>
> What I do see is this:
>
>     i=1000000000 nset=217666
>     i=2000000000 nset=389526
>     i=3000000000 nset=525135
>     i=4000000000 nset=632305
>     i=5000000000 nset=659574
>     i=6000000000 nset=659574
>     i=7000000000 nset=659574
>     i=8000000000 nset=659574
>     i=9000000000 nset=659574
>     ...
>
> So somewhere between 4e9 and 5e9 we generate 659574 hashes in the range,
> and then never increase it further.
>
> I don't know if this is an issue in practice, but it seems that for large
> hash tables (on bigint), growing the hash table may not be that great an
> improvement, because we're not really doubling the number of slots we can
> hit directly. We'll probably find an empty slot nearby, though.
>
> It was just an observation - I only noticed it because I tried to
> construct the adversarial data set on bigint and couldn't make it work
> no matter what.

I've done a simple experiment today - computing a hash for every uint32
value and checking how many distinct hash values that produces.
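A minimal sketch of that kind of counting harness is below. It is not the
attached program, and hash_func() is only a stand-in (64-bit FNV-1a truncated
to 32 bits); to reproduce the numbers you'd plug in the actual hashint8 code
(see src/backend/access/hash/hashfunc.c) instead.

/*
 * Sketch of a distinct-hash counting harness: hash every uint32 key, remember
 * which 32-bit hash values were produced in a 512MB bitmap, and print a
 * running count of distinct hashes.
 *
 * hash_func() is a placeholder (FNV-1a truncated to 32 bits), NOT
 * PostgreSQL's hashint8 -- swap in the real hash to reproduce the numbers
 * quoted in this message.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NHASHES (1ULL << 32)    /* number of possible 32-bit hash values */

static uint32_t
hash_func(uint64_t key)
{
    /* 64-bit FNV-1a over the 8 key bytes, truncated to 32 bits */
    uint64_t h = 0xcbf29ce484222325ULL;

    for (int i = 0; i < 8; i++)
    {
        h ^= (key >> (i * 8)) & 0xff;
        h *= 0x100000001b3ULL;
    }
    return (uint32_t) h;
}

int
main(void)
{
    /* one bit per possible hash value: 2^32 bits = 512MB */
    uint8_t *seen = calloc(NHASHES / 8, 1);
    uint64_t nset = 0;

    if (seen == NULL)
    {
        fprintf(stderr, "out of memory\n");
        return 1;
    }

    for (uint64_t i = 0; i <= UINT32_MAX; i++)
    {
        uint32_t h = hash_func(i);

        /* count the hash value the first time we see it */
        if ((seen[h >> 3] & (1 << (h & 7))) == 0)
        {
            seen[h >> 3] |= (1 << (h & 7));
            nset++;
        }

        /* progress report every 1e8 keys */
        if ((i + 1) % 100000000 == 0)
            printf("i=%llu nset=%llu\n",
                   (unsigned long long) (i + 1),
                   (unsigned long long) nset);
    }

    printf("loops %llu set %llu (%.2f)\n",
           (unsigned long long) NHASHES,
           (unsigned long long) nset,
           (double) nset / (double) NHASHES);

    free(seen);
    return 0;
}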
For the 4.2 billion values the results look like this:

 second    key            ndistinct          ndistinct/nkeys
  3.380    i=100000000    nset=98829159      0.99
  6.240    i=200000000    nset=195351181     0.98
  9.106    i=300000000    nset=289623103     0.97
 11.932    i=400000000    nset=381695003     0.95
 14.814    i=500000000    nset=471621355     0.94
 17.706    i=600000000    nset=559452278     0.93
 20.577    i=700000000    nset=645242762     0.92
 23.496    i=800000000    nset=729036217     0.91
 26.460    i=900000000    nset=810879821     0.90
 29.380    i=1000000000   nset=890818778     0.89
 32.331    i=1100000000   nset=968898478     0.88
 35.282    i=1200000000   nset=1045164189    0.87
 38.262    i=1300000000   nset=1119654810    0.86
 41.251    i=1400000000   nset=1192424766    0.85
 44.258    i=1500000000   nset=1263482593    0.84
 47.268    i=1600000000   nset=1332897449    0.83
 50.305    i=1700000000   nset=1400717923    0.82
 53.356    i=1800000000   nset=1466962823    0.81
 56.425    i=1900000000   nset=1531660191    0.81
 59.489    i=2000000000   nset=1594856205    0.80
 62.593    i=2100000000   nset=1656588855    0.79
 65.706    i=2200000000   nset=1716895530    0.78
 68.829    i=2300000000   nset=1775809288    0.77
 71.966    i=2400000000   nset=1833353377    0.76
 75.123    i=2500000000   nset=1889558599    0.76
 78.271    i=2600000000   nset=1944463039    0.75
 81.418    i=2700000000   nset=1998096496    0.74
 84.574    i=2800000000   nset=2050490745    0.73
 87.711    i=2900000000   nset=2101666393    0.72
 90.852    i=3000000000   nset=2151669155    0.72
 93.986    i=3100000000   nset=2200517580    0.71
 97.084    i=3200000000   nset=2248232772    0.70
100.172    i=3300000000   nset=2294849331    0.70
103.232    i=3400000000   nset=2340389131    0.69
106.273    i=3500000000   nset=2384875319    0.68
109.272    i=3600000000   nset=2428339639    0.67
112.260    i=3700000000   nset=2470798655    0.67
115.199    i=3800000000   nset=2512271730    0.66
118.125    i=3900000000   nset=2552788321    0.65
121.029    i=4000000000   nset=2592379529    0.65
123.895    i=4100000000   nset=2631056297    0.64
126.726    i=4200000000   nset=2668843329    0.64
129.397    loops 4294967295 set 2703915179 (0.63)

That means we only ever generate about 64% of the possible hash keys. That
doesn't seem particularly healthy, I guess, but for small hash tables (with
fewer than 2^32 buckets) it may not be an issue, as some of the values will
wrap because of the modulo.

For comparison, murmurhash3 and xxhash (both fairly popular hash functions)
end with a 1:1 ratio of values and hashes, that is, no collisions at all.
Of course, both are somewhat slower than the simple hash functions we use
for int/bigint values.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services