Re: [HACKERS] ICU integration - Mailing list pgsql-hackers
From | Thomas Munro |
---|---|
Subject | Re: [HACKERS] ICU integration |
Date | |
Msg-id | CAEepm=3W4kUoAMYG3XOKPqv-=k1TL5A8zJvF6M93DNaHRoQeyw@mail.gmail.com Whole thread Raw |
In response to | Re: [HACKERS] ICU integration (Peter Geoghegan <pg@heroku.com>) |
Responses |
Re: [HACKERS] ICU integration
|
List | pgsql-hackers |
On Tue, Jan 10, 2017 at 9:45 AM, Peter Geoghegan <pg@heroku.com> wrote: > * I think it's worth looking into ucol_nextSortKeyPart(), and using > that as an alternative to ucol_getSortKey(). It doesn't seem any > harder, and when I tested it it was clearly faster. (I think that > ucol_nextSortKeyPart() is more or less intended to be used for > abbreviated keys.) +1 I assume (but haven't checked) that ucol_nextSortKeyPart accesses only the start of the string via the UCharIterator passed in, unless you have the rare reverse-accent-sort feature enabled (maybe used only in fr_CA, it looks like it is required to scan the whole string looking for the last accent). So I assume that uiter_setUTF8 and ucol_nextSortKeyPart would usually do a small fixed amount of work, whereas this patch's icu_to_uchar allocates space and converts the whole variable length string every time. On the other hand, I see that this patch's icu_to_uchar can deal with encodings other than UTF8. At first glance, the UCharIterator interface provides a way to make a UCharIterator that iterates over a string of UChar or a string of UTF8, but I'm not sure how to make it do character-by-character transcoding from arbitrary encodings via the C API. It seems like that should be possible using ucnv_getNextUChar as the source of transcoded characters, but I'm not sure how to wire those two things together (in C++ I think you'd subclass icu::CharacterIterator). That's about abbreviation, but I note that you can also compare strings using iterators with ucol_strcollIter, avoiding the need to allocate and transcode up front. I have no idea whether that'd pay off. + A change in collation definitions can lead to corrupt indexes and other + problems where the database system relies on stored objects having a + certain sort order. Generally, this should be avoided, but it can happen + in legitimate circumstances, such as when + using <command>pg_upgrade</command> to upgrade to server binaries linked + with a newer version of ICU. When this happens, all objects depending on + the collation should be rebuilt, for example, + using <command>REINDEX</command>. When that is done, the collation version + can be refreshed using the command <literal>ALTER COLLATION ... REFRESH + VERSION</literal>. This will update the system catalog to record the + current collator version and will make the warning go away. Note that this + does not actually check whether all affected objects have been rebuilt I think this is a pretty reasonable first approach to this problem. It's a simple way to flag up a problem to the DBA, but leaves all the responsibility for figuring out how to fix it to the DBA. I think we should considering going further in later patches (tracking the version used at last rebuild per index etc as discussed, so that the condition is cleared only by rebuilding the affected things). (REPARTITION anyone?) As far as I know there are two things moving: ICU code and CLDR data. Here we can see the CLDR versions being pulled into ICU: http://bugs.icu-project.org/trac/log/icu/trunk/source/data/locales?rev=39273 Clearly when you upgrade your system from (say) Debian 8 to Debian 9 and the ICU major version changes we expect to have to REINDEX, but does anyone know if such data changes are ever pulled into the minor version package upgrades you get from regular apt-get update of (say) a Debian 8 or CentOS 7 or FreeBSD 11 system? In other words, do we expect collversion changes to happen basically any time in response to regular system updates, or only when you're doing a major upgrade of some kind, as the above-quoted documentation patch implies? + collversion = ntohl(*((uint32 *) versioninfo)); UVersionInfo is an array of four uint8_t. That doesn't sound like something that needs to be bit-swizzled... is it really? Even if it is arranged differently on different architectures, I'm not sure why you care since we only ever use it to compare for equality on the same system. But aside from that, I don't love this cast to uint32. I wonder if we should use u_versionToString to respect boundaries a bit better? I have another motivation for wanting to model collation versions as strings: I have been contemplating a version check for system-provided collations too, piggy-backing on your work here. Obviously PostgreSQL can't directly know anything about system collation versions, but I'm thinking of a GUC system_collation_version_command which you could optionally set to a script that knows how to inspect the local operating system. For example, a package maintainer might be interested in writing such a script that knows how to ask the package system for a locale data version. A brute force approach that works on many systems could be 'tar c /usr/share/local/*/LC_COLLATE | md5'. A string would provide more leeway than a measly int32. That's independent of this patch and you might hate the whole idea, but seems to be the kind of thing you anticipated when you described collversion as "[p]rovider-specific version of the collation". -- Thomas Munro http://www.enterprisedb.com
pgsql-hackers by date: