Re: [HACKERS] A design for amcheck heapam verification - Mailing list pgsql-hackers
From: Peter Geoghegan
Subject: Re: [HACKERS] A design for amcheck heapam verification
Msg-id: CAH2-Wz=d-n0yJksB_EyjB76vpSwWJf+qEhGUk-CME9rGi=0mXg@mail.gmail.com
In response to: Re: [HACKERS] A design for amcheck heapam verification (Michael Paquier <michael.paquier@gmail.com>)
Responses: Re: [HACKERS] A design for amcheck heapam verification
List: pgsql-hackers
On Mon, Dec 11, 2017 at 9:35 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> + /*
> +  * Generate random seed, or use caller's. Seed should always be a positive
> +  * value less than or equal to PG_INT32_MAX, to ensure that any random seed
> +  * can be recreated through callerseed if the need arises. (Don't assume
> +  * that RAND_MAX cannot exceed PG_INT32_MAX.)
> +  */
> + seed = callerseed < 0 ? random() % PG_INT32_MAX : callerseed;
>
> This could use pg_backend_random() instead.

I don't see the need for a cryptographically secure source of randomness for any current Bloom filter caller, including the test harness. There are many uses of random() just like this throughout the backend, and far more random() calls than pg_backend_random() calls. srandom() seeds random per backend, ensuring non-deterministic behavior across backends.

> +--
> +-- These tests don't produce any interesting output, unless they fail. For an
> +-- explanation of the arguments, and the values used here, see README.
> +--
> +SELECT test_bloomfilter(power => 23,
> +    nelements => 838861,
> +    seed => -1,
> +    tests => 1);
>
> This could also test the reproducibility of the tests with a fixed
> seed number and at least two rounds, a low number of elements could be
> more appropriate to limit the run time.

The runtime is already dominated by pg_regress overhead. As it says in the README, using a fixed seed in the test harness is pointless, because it won't behave in a fixed way across platforms. As long as we cannot ensure deterministic behavior, we may as well fully embrace non-determinism.

> I would vote for simplifying the initialization routine and just
> directly let the caller decide it. Are there implementation
> complications if this is not a power of two? I cannot guess one by
> looking at the code.

Yes, there are. It's easier to reason about the implementation with these restrictions.

> So I would expect people using this API
> are smart enough to do proper sizing.
> Note that some callers may
> accept even a larger false positive rate. In short, this basically
> brings us back to Thomas' point upthread with a size estimation
> routine, and an extra routine doing the initialization:
> https://www.postgresql.org/message-id/CAEepm=0Dy53X1hG5DmYzmpv_KN99CrXzQBTo8gmiosXNyrx7+Q@mail.gmail.com
> Did you consider it? Splitting the size estimation and the actual
> initialization has a lot of benefits in my opinion.

Callers don't actually need to worry about these details. I think it's the wrong call to complicate the interface to simplify the implementation. Thomas was not arguing for the caller being able to specify a false positive rate to work backwards from -- that's not an unreasonable thing, but it seems unlikely to be of use to amcheck, or to parallel hash join. Thomas was arguing for making the Bloom filter suitable for use with DSM. I ended up incorporating most of his feedback. The only things that were not added were a DSM-oriented routine for getting the amount of memory required, as well as a separate DSM-oriented routine for initializing the caller's shared memory buffer. That's likely inconvenient for most callers, and could be restrictive if and when Bloom filters use resources other than memory, such as temp files.

> +/*
> + * What proportion of bits are currently set?
> + *
> + * Returns proportion, expressed as a multiplier of filter size.
> + *
> [...]
> + */
> +double
> +bloom_prop_bits_set(bloom_filter *filter)
>
> Instead of that, having a function that prints direct information
> about the filter's state would be enough with the real number of
> elements and the number of bits set in the filter. I am not sure that
> using rates is a good idea, sometimes rounding can cause confusion.

The Bloom filter doesn't track the number of elements itself. Callers that care can do that themselves. bloom_prop_bits_set() isn't very important, even compared to other types of instrumentation.
It's moderately useful as a generic indicator of whether or not the Bloom filter was totally overwhelmed. That's about it.

--
Peter Geoghegan
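[Editor's sketch] The two implementation points discussed above -- the power-of-two sizing restriction, and a bloom_prop_bits_set()-style fill indicator -- can be illustrated with a toy filter. This is not the patch's code: all names here (toy_bloom, toy_hash, and so on) are hypothetical, the number of hash functions is fixed at k = 2 for brevity, and a splitmix64-style mixer stands in for whatever hashing scheme the real patch uses. With nbits a power of two, reducing a hash value to a bit position is a bitwise AND with (nbits - 1) rather than a general modulo, which is one concrete way the restriction keeps the implementation easy to reason about.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy Bloom filter; nbits is always a power of two. */
typedef struct
{
    uint64_t    nbits;          /* power of two */
    uint64_t    seed;
    unsigned char *bits;
} toy_bloom;

/* splitmix64-style mixer, perturbed by a seed (hypothetical choice) */
static uint64_t
toy_hash(uint64_t x, uint64_t seed)
{
    x += seed + UINT64_C(0x9e3779b97f4a7c15);
    x = (x ^ (x >> 30)) * UINT64_C(0xbf58476d1ce4e5b9);
    x = (x ^ (x >> 27)) * UINT64_C(0x94d049bb133111eb);
    return x ^ (x >> 31);
}

static toy_bloom *
toy_create(int power, uint64_t seed)
{
    toy_bloom  *filter = malloc(sizeof(toy_bloom));

    filter->nbits = UINT64_C(1) << power;
    filter->seed = seed;
    filter->bits = calloc(filter->nbits / 8, 1);
    return filter;
}

static void
toy_add(toy_bloom *filter, uint64_t elem)
{
    for (int i = 0; i < 2; i++)
    {
        /* power-of-two nbits: reduction is a mask, not a modulo */
        uint64_t    h = toy_hash(elem, filter->seed + i) & (filter->nbits - 1);

        filter->bits[h / 8] |= (unsigned char) (1 << (h % 8));
    }
}

/* returns 1 if elem is definitely absent, 0 if it is probably present */
static int
toy_lacks(const toy_bloom *filter, uint64_t elem)
{
    for (int i = 0; i < 2; i++)
    {
        uint64_t    h = toy_hash(elem, filter->seed + i) & (filter->nbits - 1);

        if (!(filter->bits[h / 8] & (1 << (h % 8))))
            return 1;
    }
    return 0;
}

/* analogue of bloom_prop_bits_set(): proportion of bits currently set */
static double
toy_prop_bits_set(const toy_bloom *filter)
{
    uint64_t    nset = 0;

    for (uint64_t i = 0; i < filter->nbits / 8; i++)
        for (int b = 0; b < 8; b++)
            if (filter->bits[i] & (1 << b))
                nset++;
    return (double) nset / (double) filter->nbits;
}
```

An optimally sized filter converges on a fill proportion of about 0.5; a proportion approaching 1.0 is the "totally overwhelmed" case, where nearly every membership probe returns a false positive.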