Re: WIP: BRIN multi-range indexes - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: WIP: BRIN multi-range indexes
Msg-id: 1fdcbd2a-ff0f-fb81-d227-22279d5cbcb7@enterprisedb.com
In response to: Re: WIP: BRIN multi-range indexes (John Naylor <john.naylor@enterprisedb.com>)
Responses: Re: WIP: BRIN multi-range indexes
List: pgsql-hackers
On 11/9/20 3:29 PM, John Naylor wrote:
> On Sat, Nov 7, 2020 at 4:38 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>
>> Overall, I think there's very little difference, particularly in the
>> "match" cases when we're searching for a value that we know is in the
>> table. The one-hash variant seems to perform a bit better, but the
>> difference is fairly small.
>>
>> In the "mismatch" cases (searching for a value that is not in the
>> table) the differences are more significant, but it might be noise.
>> It does seem much more "green" than "red", i.e. the one-hash variant
>> seems to be faster (although this does depend on the values used for
>> formatting).
>>
>> To sum this up, I think the one-hash approach seems interesting. It's
>> not going to give us huge speedups because we're only hashing int32
>> values anyway (not the source data), but it's worth exploring.
>
> Thanks for testing! It seems you tested against the version with two
> moduli, and not the alternative discussed in
>
> https://www.postgresql.org/message-id/20200918222702.omsieaphfj3ctqg3%40development
>
> which would in fact be rehashing the 32 bit values. I think that would
> be the way to go if we don't use the one-hashing approach.
>

Yeah. I forgot about this detail, and I may try again with the two-hash
variant, but I wonder how much difference it would make, considering the
results match the expected behavior (that is, the "scan fraction"
results for fill_factor=100 match the target fpr almost perfectly).

I think there's a possibly more important omission in the testing - I
forgot about the "sort mode" used initially, when the filter keeps the
actual hash values and only switches to hashing later. I wonder if that
plays a role in some of the cases. I'll investigate this a bit in the
next round of tests.

>> a) set_bloom_partitions does this:
>>
>>     while (primes[pidx + nhashes - 1] <= target && primes[pidx] > 0)
>>         pidx++;
>>
>> which is broken, because the second part of the condition only checks
>> the current index - we may end up using nhashes primes after that,
>> and some of them may be 0. So this needs to be:
>>
>>     while (primes[pidx + nhashes - 1] <= target &&
>>            primes[pidx + nhashes] > 0)
>>         pidx++;
>
> Good catch.
>
>> b) set_bloom_partitions does this to generate primes:
>>
>>     /*
>>      * Increase the limit to ensure we have some primes higher than
>>      * the target partition length. The 100 value is arbitrary, but
>>      * should be well over what we need.
>>      */
>>     primes = generate_primes(target_partlen + 100);
>>
>> It's not clear to me why 100 is sufficient, particularly for large
>> page sizes. AFAIK the primes get more and more sparse, so how does
>> this guarantee we'll get enough "sufficiently large" primes?
>
> This value is not rigorous and should be improved, but I started with
> that by looking at the table in section 3 in
>
> https://primes.utm.edu/notes/gaps.html
>
> I see two ways to make a stronger guarantee:
>
> 1. Take the average gap between primes near n, which is log(n), and
> multiply that by BLOOM_MAX_NUM_PARTITIONS. Adding that to the target
> seems a better heuristic than a constant, and is still simple to
> calculate.
>
> With the pathological example you gave of n=575104, k=3
> (target_partlen = 191701), the number to add is log(191701) * 10 = 122.
> By the table referenced above, the largest prime gap under 360653 is
> 95, so we're guaranteed to find at least one prime in the space of 122
> above the target. That will likely be enough to find the
> closest-to-target filter size for k=3. Even if it weren't, nbits is so
> large that the relative difference is tiny. I'd say a heuristic like
> this is most likely to be off precisely when it matters the least. At
> this size, even if we find zero primes above our target, the relative
> filter size is close to
>
>     (575104 - 3 * 95) / 575104 = 0.9995
>
> For a more realistic bad-case target partition length, log(1327) * 10
> = 72. There are 33 composites after 1327, the largest such gap below
> 9551. That would give five primes larger than the target
>
>     1361 1367 1373 1381 1399
>
> which is more than enough for k<=10:
>
>     1297 + 1301 + 1303 + 1307 + 1319 + 1321 + 1327 + 1361 + 1367 +
>     1373 = 13276
>
> 2. Use a "segmented range" algorithm for the sieve and iterate until
> we get k*2 primes, half below and half above the target. This would be
> an absolute guarantee, but also more code, so I'm inclined against
> that.
>

Thanks, that makes sense. While investigating the failures, I've tried
increasing the values a lot, without observing any measurable increase
in runtime. IIRC I've even used (10 * target_partlen) or something like
that. That tells me it's not a very sensitive part of the code, so I'd
suggest simply using something that we know is large enough to be safe.

For example, the largest bloom filter we can have is 32kB, i.e. 262144
bits, at which point the largest prime gap is at most 95 (per the gap
table). And we may use up to BLOOM_MAX_NUM_PARTITIONS partitions, so
let's just use BLOOM_MAX_NUM_PARTITIONS * 100 on the basis that we may
need BLOOM_MAX_NUM_PARTITIONS primes before/after the target. We could
consider the actual target being lower (essentially 1/npartitions of
the nbits), which decreases the maximum gap, but I don't think that's
worth the extra complexity here.
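For concreteness, here is a rough sketch of how that fixed margin and
the corrected selection loop from (a) could fit together. The function
names, signatures and the BLOOM_MAX_NUM_PARTITIONS value are
assumptions made for illustration, not the patch's actual code:

    #include <stdint.h>

    #define BLOOM_MAX_NUM_PARTITIONS    10      /* assumed value */

    /* Sieve limit: room for BLOOM_MAX_NUM_PARTITIONS primes past the target. */
    static uint32_t
    bloom_sieve_limit(uint32_t target_partlen)
    {
        return target_partlen + BLOOM_MAX_NUM_PARTITIONS * 100;
    }

    /*
     * Pick the first of nhashes consecutive primes to use as partition
     * lengths.  primes[] is sorted ascending and terminated by a 0 entry;
     * the window slides up until its largest member exceeds the target,
     * stopping early if it would run into the zero terminator.
     */
    static int
    bloom_select_partitions(const uint32_t *primes, int nhashes,
                            uint32_t target_partlen)
    {
        int     pidx = 0;

        while (primes[pidx + nhashes - 1] <= target_partlen &&
               primes[pidx + nhashes] > 0)
            pidx++;

        return pidx;
    }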
FWIW I wonder if we should do something about bloom filters that we
know can get larger than the page size. In the example I used, we know
that nbits=575104 is larger than the page, so as the filter gets more
full (and thus more random and less compressible) it can't possibly
fit. Maybe we should reject that right away, instead of "delaying it"
until later, on the basis that it's easier to fix at CREATE INDEX time
(compared to when inserts/updates start failing at a random time).

The problem with this is of course that if the index is multi-column,
this may not be strict enough (i.e. each filter would fit
independently, but the whole index row is too large). But it's probably
better to do at least something, and maybe improve that later with
some whole-row check.

>> c) generate_primes uses uint16 to store the primes, so it can only
>> generate primes up to 32768. That's (probably) enough for 8kB pages,
>> but for 32kB pages it's clearly insufficient.
>
> Okay.
>
>> As for the original question of how expensive this naive sieve is, I
>> haven't been able to measure any significant timings. The logging
>> around generate_primes usually looks like this:
>>
>>     2020-11-07 20:36:10.614 CET [232789] LOG:  generating primes nbits
>>     575104 nhashes 3 target_partlen 191701
>>     2020-11-07 20:36:10.614 CET [232789] LOG:  primes generated
>>
>> So it takes 0.000 seconds for this extreme page size. I don't think
>> we need to invent anything more elaborate.
>
> Okay, good to know. If we were concerned about memory, we could have
> it check only odd numbers. That's a common feature of sieves, but also
> makes the code a bit harder to understand if you haven't seen it
> before.
>

IMO if we were concerned about memory we'd use a Bitmapset instead of
an array of bools. That's 1:8 compression, not just 1:2.
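For illustration, a minimal sketch combining the two memory-saving
ideas above - sieving only the odd numbers and packing the flags one
per bit, which is roughly what backing the sieve with a Bitmapset would
achieve. The function name and interface are assumptions, not the
patch's generate_primes(); allocation checks are omitted:

    #include <stdint.h>
    #include <stdlib.h>

    /* Bit i in the result represents the odd number 2*i + 3. */
    static uint32_t *
    sieve_odd_composites(uint32_t limit)
    {
        /* number of odd candidates in [3, limit] */
        uint32_t    nodds = (limit < 3) ? 0 : (limit - 1) / 2;
        uint32_t   *composite = calloc((nodds + 31) / 32, sizeof(uint32_t));

        for (uint32_t i = 0; i < nodds; i++)
        {
            uint64_t    p = 2 * (uint64_t) i + 3;

            if (composite[i / 32] & (UINT32_C(1) << (i % 32)))
                continue;       /* already known composite */

            if (p * p > limit)
                break;          /* remaining unmarked odds are all prime */

            /* mark the odd multiples of p, starting at p*p */
            for (uint64_t m = p * p; m <= limit; m += 2 * p)
            {
                uint32_t    j = (uint32_t) ((m - 3) / 2);

                composite[j / 32] |= UINT32_C(1) << (j % 32);
            }
        }

        return composite;       /* bit i set => 2*i + 3 is composite */
    }

With one bit per odd candidate this uses 1/16 of the memory of a
bool-per-number array, at the cost of slightly less obvious indexing.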
> Also to fill in something I left for later, the reference for this
>
>     /* upper bound of number of primes below limit */
>     /* WIP: reference for this number */
>     int numprimes = 1.26 * limit / log(limit);
>
> is
>
> Rosser, J. Barkley; Schoenfeld, Lowell (1962). "Approximate formulas
> for some functions of prime numbers". Illinois J. Math. 6: 64–94.
> doi:10.1215/ijm/1255631807
>
> More precisely, it's 30*log(113)/113 rounded up.
>

Thanks, I was wondering where that came from.

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company