Re: B-Tree support function number 3 (strxfrm() optimization) - Mailing list pgsql-hackers
From: Peter Geoghegan
Subject: Re: B-Tree support function number 3 (strxfrm() optimization)
Msg-id: CAM3SWZSxhpL56AEXSzWFVVyxpxT8cBQZVtF7nUgS0tmH-g8Yxw@mail.gmail.com
In response to: Re: B-Tree support function number 3 (strxfrm() optimization) (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On Tue, Sep 2, 2014 at 12:22 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> <Various points on style>

Okay, fair enough.

> n_distinct is a cardinality estimate, but AFAIK not using hyperloglog.
> What is different about that problem vs. this one that makes each
> method correct for its respective case? Or should analyze be using
> hyperloglog too?

HyperLogLog isn't sample-based - it's useful for streaming a set and
accurately tracking its cardinality with fixed overhead. I think that
the first place that it found use was for things like network switches
(aside: it's a pretty recent, trendy algorithm, but FWIW I think that
actually makes sense - it's really quite impressive). Whereas,
../commands/analyze.c uses an "Estimate [of] the number of distinct
values using the estimator proposed by Haas and Stokes in IBM Research
Report RJ 10025", which is based on sampling (I think that there are
reasons to be skeptical of sample-based cardinality estimators in
general).

Right now, given the way ANALYZE does random sampling, we probably save
little to no actual I/O by doing random sampling as opposed to reading
everything. I certainly hope that that doesn't stay true forever. Simon
has expressed interest in working on block-based sampling. n_distinct
is only one output that we need to produce as part of ANALYZE, of
course.

So, the way we estimate n_distinct is currently not very good, but
maybe that's inevitable with random sampling. I was surprised that you
didn't seem to consider it that much of a problem in your pgCon talk on
the planner; I saw it a couple of times back when I was a consultant,
though not all that often. The main practical benefit of HLL is that it
isn't going to give you an answer that's wildly wrong, and its main
disadvantage is that it expects to observe every element in the set,
which in this instance is no disadvantage at all. There are guarantees
around the accuracy of estimates, typically very good guarantees.

> Is it the right decision to suppress the abbreviated-key optimization
> unconditionally on 32-bit systems and on Darwin? There's certainly
> more danger, on those platforms, that the optimization could fail to
> pay off. But it could also win big, if in fact the first character or
> two of the string is enough to distinguish most rows, or if Darwin
> improves their implementation in the future. If the other defenses
> against pathological cases in the patch are adequate, I would think
> it'd be OK to remove the hard-coded checks here and let those cases
> use the optimization or not according to its merits in particular
> cases. We'd want to look at what the impact of that is, of course,
> but if it's bad, maybe those other defenses aren't adequate anyway.

I'm not sure. Perhaps the Darwin thing is a bad idea because no one is
using Macs to run real database servers. Apple hasn't had a server
product in years, and typically people only use Postgres on their Macs
for development. We might as well have coverage of the new code for the
benefit of Postgres hackers that favor Apple machines. Or, to look at
it another way, the optimization is so beneficial that it's probably
worth the risk, even for more marginal cases. 8 primary weights (the
leading 8 bytes, frequently isomorphic to the first 8 Latin characters,
regardless of whether or not they have accents/diacritics, or
punctuation/whitespace) is twice as many as 4.
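To make that concrete, here is a heavily simplified, standalone sketch
of the kind of conversion and comparison involved. The function names
are made up for illustration; the real patch works with Datums, the
sortsupport infrastructure, and per-collation state, and error handling
is omitted here:

    #include <stdlib.h>
    #include <string.h>

    #define ABBREV_WIDTH 8      /* one Datum's worth on 64-bit platforms */

    /*
     * Build an abbreviated key from the leading strxfrm() primary weights
     * of "str".  Assumes LC_COLLATE has been set with setlocale().  The
     * buffer is zeroed up front, so when the transformed blob is shorter
     * than 8 bytes (possible even with glibc and en_US.UTF-8), the
     * trailing bytes are NUL rather than uninitialized garbage.
     */
    static void
    make_abbrev_key(unsigned char abbrev[ABBREV_WIDTH], const char *str)
    {
        size_t  len = strxfrm(NULL, str, 0);  /* blob size, excluding NUL */
        char   *blob = malloc(len + 1);       /* error handling omitted */

        strxfrm(blob, str, len + 1);
        memset(abbrev, 0, ABBREV_WIDTH);
        memcpy(abbrev, blob, len < ABBREV_WIDTH ? len : ABBREV_WIDTH);
        free(blob);
    }

    /*
     * Every byte is initialized, so comparing the full width with
     * memcmp() is always safe.  Equality here only means "unresolved" -
     * the caller must fall back on a full tie-breaker comparison of the
     * authoritative values.
     */
    static int
    abbrev_cmp(const unsigned char *a, const unsigned char *b)
    {
        return memcmp(a, b, ABBREV_WIDTH);
    }
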
But every time you add a byte of space to the abbreviated
representation that can resolve a comparison, the number of
unresolvable-without-tiebreak comparisons is (in general, I imagine)
reduced considerably. Basically, 8 bytes is way better than twice as
good as 4 bytes in terms of its effect on the proportion of comparisons
that are resolved with abbreviated keys alone. Even so, I suspect it
would still be worth applying the optimization with only 4. You've seen
plenty of suggestions from me on assessing the applicability of the
optimization. Perhaps you have a few of your own.

> Does the lengthy comment in btsortsupport_worker() survive pgindent?

I'll look into it.

> Reading from an uninitialized byte could provide a valgrind warning
> even if it's harmless, I think.

That wouldn't be harmless - it would probably result in incorrect
answers in practice, and would certainly be unspecified. However, I'm
not reading uninitialized bytes: I call memset() so that, in the event
of the final strxfrm() blob being less than 8 bytes (which can happen
even on glibc with en_US.UTF-8), the remaining bytes are initialized to
NUL. It cannot be harmful to memcmp() every Datum byte if the remaining
bytes are always initialized to NUL.

--
Peter Geoghegan