Re: Abbreviated keys for Numeric - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: Abbreviated keys for Numeric |
Date | |
Msg-id | CAM3SWZS2gfxcct7oyEWcNsjXzEeCkKH6rATtUPv6fb_9tFmyRQ@mail.gmail.com Whole thread Raw |
In response to | Re: Abbreviated keys for Numeric (Andrew Gierth <andrew@tao11.riddles.org.uk>) |
Responses |
Re: Abbreviated keys for Numeric
Re: Abbreviated keys for Numeric |
List | pgsql-hackers |
On Sun, Mar 22, 2015 at 1:01 PM, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote: > The substance of the code is unchanged from my original patch. I didn't > add diagnostic output to numeric_abbrev_abort, see my separate post > about the suggestion of a GUC for that. I don't think that V2 really changed the substance, which seems to be the implication of your remarks here. You disagreed with my decision on NULL values - causing me to reconsider my position (so that's now irrelevant) - and you disagreed with not including support for 32-bit platforms. Those were the only non-stylistic changes, though. I certainly didn't change any details of the algorithm that you proposed, which, FWIW, I think is rather clever. I added a few defensive assertions to the encoding/conversion routine (which I see you've removed in V3, a long with a couple of other helpful assertions), and restructured and expanded upon the comments, but that's all. You haven't really taken into my account my V2 feedback with this V3 revision. Even after you yourself specifically called out your non-explanation of excess-44 as a possible point of confusion for readers of your patch, you didn't change or expand upon your remarks on that one iota. I see that code like this (from your V1) appears in your V3, as if V2 never happened: +/* + * non-fmgr interface to the comparison routine to allow sortsupport to elide + * the fmgr call. The saving here is small given how slow numeric + * comparisons are, but there's no reason not to implement it. + */ This comment is totally redundant at best, and misleading at worst. Of course we have a non-fmgr interface - it's needed to make abbreviation work. And of course *any* opclass that performs abbreviation won't get any great saving from cases where only the fmgr interface is elided (e.g. numeric is not the leading sorttuple attribute). Text is unusual in that there is a small additional saving over fmgr-elision for obscure reasons. Ditto for comments like this, which re-appear in V3: +/* + * Ordinary (non-sortsupport) comparisons follow. + */ Your V3 has obsolete comments here: + nss = palloc(sizeof(NumericSortSupport)); + + /* + * palloc a buffer for handling unaligned packed values in addition to + * the support struct + */ + nss->buf = palloc(VARATT_SHORT_MAX + VARHDRSZ + 1); I still don't think you should be referencing the text opclass behavior here: + * number. We make no attempt to estimate the cardinality of the real values, + * since it plays no part in the cost model here (if the abbreviation is equal, + * the cost of comparing equal and unequal underlying values is comparable). + * We discontinue even checking for abort (saving us the hashing overhead) if + * the estimated cardinality gets to 100k; that would be enough to support many + * billions of rows while doing no worse than breaking even. This is dubious: +#if DEC_DIGITS != 4 +#error "Numeric bases other than 10000 are no longer supported" +#endif Because there is a bunch of code within numeric.c that deals with the DEC_DIGITS != 4 case. For example, this code has been within numeric.c forever: #if DEC_DIGITS == 4 || DEC_DIGITS == 2 static NumericDigit const_ten_data[1] = {10}; static NumericVar const_ten = {1, 0, NUMERIC_POS, 0, NULL, const_ten_data}; #elif DEC_DIGITS == 1 static NumericDigit const_ten_data[1] = {1}; static NumericVar const_ten = {1, 1, NUMERIC_POS, 0, NULL, const_ten_data}; #endif As has this: while (ndigits-- > 0) { #if DEC_DIGITS == 4 *digits++ = ((decdigits[i] * 10 + decdigits[i + 1]) * 10 + decdigits[i + 2]) * 10 + decdigits[i + 3]; #elif DEC_DIGITS == 2 *digits++ = decdigits[i] * 10 + decdigits[i + 1]; #elif DEC_DIGITS == 1 *digits++ = decdigits[i]; #else #error unsupported NBASE #endif i += DEC_DIGITS; } I tend to think that when Tom wrote this code back in 2003, he thought it might be useful to change DEC_DIGITS on certain builds. And so, we ought to continue to support it to the extent that we already do, allowing these cases to opt out of abbreviation in an informed manner (since it seems hard to make abbreviation work with DEC_DIGITS != 4 builds). In any case, you should have deleted all this code (which there is rather a lot of) in proposing to not support DEC_DIGITS != 4 builds generally. Attached revision, V4, incorporates some of your V3 changes. As I mentioned, I changed my mind on the counting of non-NULL values, which V4 reflects...but there are also comments that now make it clear why that might be useful. I've retained your new allocate once approach to buffer sizing, which seems sound if a little obscure. This abort function code (from your V1 + V3) seems misleading: + if (memtupcount < 10000 || nss->input_count < 10000 || !nss->estimating) + return false; It's impossible for "memtupcount < 10000" while "nss->input_count < 10000" as well, since the latter counts a subset of what the former counts. So I only check the latter now. I have not added back 32-bit support, which, IMV isn't worth it at this time - it is, after all, almost a fully independent abbreviation scheme to the 64-bit scheme. Right now we ought to be cutting scope, or 9.5 will never reach feature freeze. I have already spent considerably more time on this patch than I had intended to. I need to catch a flight to New York at the crack of dawn tomorrow, to get to pgConf US, and I have much work to do on my UPSERT patch, so I've simply run out of time. If the committer that picks this up wants to look at 32-bit support, that may make sense. I think that given the current lack of cycles from everyone, we'll be doing well to even get 64-bit numeric abbreviation support in 9.5. I ask that you have a some perspective on cutting 32-bit support, Andrew. In any case, I've personally run out of time for this for 9.5. We may add a patch to change things so that a GUC controls abbreviation debug output, and we may add a patch that standardizes that INT64_MIN and INT64_MAX are available everywhere. But unless and until those other patches are committed, I see no reason to assume that they will be, and so those aspects remain unchanged from my V2. Unless there is strong opposition, or bugs come to light, or the GUC/macros are added by independent commits to the master branch, I expect to mark this revision "Ready for Committer" shortly. Thanks -- Peter Geoghegan
Attachment
pgsql-hackers by date: