Re: B-Tree support function number 3 (strxfrm() optimization) - Mailing list pgsql-hackers
From: Peter Geoghegan
Subject: Re: B-Tree support function number 3 (strxfrm() optimization)
Date:
Msg-id: CAM3SWZQ=Trmg-_5rmZ73qWVQnOH-CrHVV1iX7X5Qir--krMqsA@mail.gmail.com
In response to: Re: B-Tree support function number 3 (strxfrm() optimization) (Peter Geoghegan <pg@heroku.com>)
Responses: Re: B-Tree support function number 3 (strxfrm() optimization)
List: pgsql-hackers
On Thu, Jun 12, 2014 at 2:09 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Thanks for looking into this.

Is anyone going to look at this?

I attach a new revision. The only real change to the code is that I fixed an open item concerning what to do on WIN32 with UTF-8, where the UTF-16 hacks that we do there cannot be worked around to make the optimization still work, and yet we're still obligated to set up a sort support state since there is a cataloged sort support function (unfortunately I wasn't able to test it on Windows, since I no longer own a Windows machine, but it should be fine). I set up a comparison shim within varlena.c for that case only. This is a kludge, but I decided that it was better than preparing the sort support infrastructure to cope with not getting a sane sort support state from every opclass just because of some platform-specific issue. I think that's justified by the fact that it's very unlikely to happen again; text is a datatype that is unusually tied to the operating system. This does necessitate having the sort support struct record which sort operator was originally used to look up the sort support function, but it seems reasonable to store that in all instances.

What may be of more interest to reviewers is the revised AC_TRY_RUN test program that "configure" consults. While the code is unchanged, there is now a detailed rationale for why it's reasonable to expect that a significant amount of entropy will be concentrated in the first 8 bytes of a strxfrm() blob, with reference to how those algorithms are more or less required to behave by the Unicode Consortium.

The basic reason is that the algorithm for building binary sort keys (described by a Unicode standard [1] defining the behavior of Unicode collation algorithms, and implemented by strxfrm()) specifically works by storing primary-level, secondary-level and tertiary-level weights, in that order, in the returned blob. There may be additional levels, too. As the standard says:

"""
The primary level (L1) is for the basic sorting of the text, and the non-primary levels (L2..Ln) are for adjusting string weights for other linguistic elements in the writing system that are important to users in ordering, but less important than the order of the basic sorting.
"""

Robert pointed out a case where varying the character case of an English word did not alter the primary-level bytes (and thus the poor man's normalized key was the same). He acknowledged that most of the entropy of the first 8 bytes of the string still went into the first 8 bytes of the blob/key. This can actually be very useful to the optimization in some cases. In particular, with most Latin alphabets you'll see the same pattern when diacritics are used. This means that even though the original string has (say) an accented character that would take 2 bytes to store in UTF-8, the weight in the primary level is the same as that of the equivalent unaccented character (and so takes only one byte to store at that level, with differences appearing only in subsequent levels). Whole strxfrm() blobs are the same length regardless of how many accents appear in an otherwise equivalent original Latin string, and you get exactly the same high concentration of entropy in the first 8 bytes in pretty much all Latin alphabets (the *absence* of diacritics is stored in later weight levels too, even with the "en_US.UTF-8" collation).
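To make that concrete, here is a small standalone test program (not taken from the patch) that prints the leading bytes of a few strxfrm() blobs. It assumes a glibc-style strxfrm(), an installed "en_US.UTF-8" locale, and a source file saved as UTF-8; under those assumptions, case and accent variations of the same word are expected to leave the leading bytes untouched, because they only differ at later weight levels:

/*
 * blobprefix.c -- print the leading bytes of strxfrm() blobs for a few
 * variations of the same word, to show where the primary weights go.
 */
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void
print_blob_prefix(const char *s)
{
	size_t		len = strxfrm(NULL, s, 0);
	char	   *blob = malloc(len + 1);
	size_t		i;

	strxfrm(blob, s, len + 1);
	printf("%-10s blob len %3zu, first bytes: ", s, len);
	for (i = 0; i < 8 && i < len; i++)
		printf("%02x ", (unsigned char) blob[i]);
	printf("\n");
	free(blob);
}

int
main(void)
{
	if (!setlocale(LC_COLLATE, "en_US.UTF-8"))
	{
		fprintf(stderr, "en_US.UTF-8 collation not available\n");
		return 1;
	}

	/*
	 * Case and accent differences are expected to affect only the later
	 * (secondary/tertiary) weight levels, so the leading bytes -- the part
	 * a poor man's normalized key captures -- should be identical here.
	 */
	print_blob_prefix("resume");
	print_blob_prefix("Resume");
	print_blob_prefix("résumé");
	return 0;
}

The printed blob lengths should also come out the same for all three strings, matching the point above about blobs being the same length regardless of how many accents appear.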
As the standard says:

"""
By default, the algorithm makes use of three fully-customizable levels. For the Latin script, these levels correspond roughly to:

    alphabetic ordering
    diacritic ordering
    case ordering

A final level may be used for tie-breaking between strings not otherwise distinguished.
"""

In practice, a huge majority of comparisons are expected to be resolved at the primary weight level (in the case of a straight-up sort of complete, all-unique strxfrm() blobs), which, at least in the case of glibc, appears to require (at most) as many bytes to store as the original UTF-8 string did.

Performance of a sort using a strcoll()-based collation could be *faster* than the "C" locale with this patch if there were plenty of Latin characters with diacritics. The diacritics would effectively be removed from the poor man's normalized key representations, and would only be considered when a tie-breaker is required.
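For anyone who wants to see the shape of that scheme in isolation, here is a toy sketch in plain C; it is not the sortsupport code from the patch. Each value gets an 8-byte prefix of its strxfrm() blob, comparisons start with a memcmp() of those prefixes, and strcoll() against the original strings is only used as the authoritative tie-breaker when the prefixes are equal:

/*
 * poorkey.c -- a toy sketch of the "poor man's normalized key" comparison
 * scheme, in plain C rather than the patch's actual sortsupport code.
 */
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define KEY_PREFIX_LEN 8		/* sizeof(Datum) on 64-bit platforms */

typedef struct SortItem
{
	const char *original;		/* the original text value */
	unsigned char prefix[KEY_PREFIX_LEN];	/* leading strxfrm() bytes */
} SortItem;

/* Build the poor man's normalized key for one string. */
static void
make_poor_mans_key(SortItem *item, const char *s)
{
	size_t		len = strxfrm(NULL, s, 0);
	char	   *blob = malloc(len + 1);

	strxfrm(blob, s, len + 1);
	item->original = s;
	memset(item->prefix, 0, KEY_PREFIX_LEN);	/* zero-pad short blobs */
	memcpy(item->prefix, blob, len < KEY_PREFIX_LEN ? len : KEY_PREFIX_LEN);
	free(blob);
}

/*
 * qsort() comparator: resolve at the prefix (mostly primary weights) when
 * possible, and fall back to an authoritative strcoll() on ties.
 */
static int
compare_items(const void *a, const void *b)
{
	const SortItem *ia = a;
	const SortItem *ib = b;
	int			cmp = memcmp(ia->prefix, ib->prefix, KEY_PREFIX_LEN);

	if (cmp != 0)
		return cmp;
	return strcoll(ia->original, ib->original);
}

int
main(void)
{
	const char *words[] = {"résumé", "Resume", "resume", "rose"};
	SortItem	items[4];
	int			i;

	setlocale(LC_COLLATE, "");	/* collation comes from the environment */
	for (i = 0; i < 4; i++)
		make_poor_mans_key(&items[i], words[i]);
	qsort(items, 4, sizeof(SortItem), compare_items);
	for (i = 0; i < 4; i++)
		printf("%s\n", items[i].original);
	return 0;
}

The patch proper does the equivalent inside tuplesort through the sort support machinery, with the prefix packed into a pass-by-value Datum, but the two-tier comparison above is the essential idea.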
[1] http://www.unicode.org/reports/tr10/

--
Peter Geoghegan