ICU integration - Mailing list pgsql-hackers
From | Peter Eisentraut |
---|---|
Subject | ICU integration |
Date | |
Msg-id | 85364fde-091f-bbc0-fec2-e3ede39840a6@2ndquadrant.com |
Responses | Re: ICU integration, Re: [HACKERS] ICU integration |
List | pgsql-hackers |
Here is a patch I've been working on to allow the use of ICU for sorting and other locale things. This is mostly complementary to the existing FreeBSD ICU patch, most recently discussed in [0]. While that patch removes the POSIX locale use and replaces it with ICU, my interest was in allowing the use of both. I think that is necessary for upgrading, compatibility, and maybe because someone likes it.

What I have done is extend collation objects with a collprovider column that tells whether the collation is using POSIX (appropriate name?) or ICU facilities. The pg_locale_t type is changed to a struct that contains the provider-specific locale handles. Users of locale information are changed to look into that struct for the appropriate handle to use.

In initdb, I initialize the default collation set as before from the `locale -a` output, but also add all available ICU locales with a "%icu" appended (so "fr_FR%icu"). I suppose one could create a configuration option perhaps in initdb to change the default so that, say, "fr_FR" uses ICU and "fr_FR%posix" uses the old stuff.

That all works well enough for named collations and for sorting. The thread about the FreeBSD ICU patch discusses some details of how to best use the ICU APIs to do various aspects of the sorting, so I didn't focus on that too much. I took the existing collate.linux.utf8.sql test and ported it to the ICU setup, and it passes except for the case noted below.

I'm not sure how well it will work to replace all the bits of LIKE and regular expressions with ICU API calls. One problem is that ICU likes to do case folding as a whole string, not by character. I need to do more research about that. Another problem, which was also previously discussed, is that ICU does case folding in a locale-agnostic manner, so it does not consider things such as the Turkish special cases. This is per Unicode standard modulo weasel wording, but it breaks existing tests at least.
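To make the two case-folding problems concrete, here is a small sketch in plain C. It does not call ICU and is not taken from the patch: fold_simple and fold_full are hypothetical helpers that cover only a handful of code points, with the mappings taken from Unicode's CaseFolding.txt. It shows why folding a whole string can change its length (so a per-character loop cannot reproduce it) and how the locale-agnostic result for "I" differs from the Turkic one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: simple (one-to-one) case folding of one code point.
 * Covers ASCII A-Z plus the Turkic special cases; everything else is
 * returned unchanged. "turkic" selects the T mappings of CaseFolding.txt.
 */
static uint32_t
fold_simple(uint32_t cp, int turkic)
{
    if (turkic)
    {
        if (cp == 0x0049)
            return 0x0131;      /* I -> dotless i */
        if (cp == 0x0130)
            return 0x0069;      /* I with dot above -> i */
    }
    if (cp >= 0x0041 && cp <= 0x005A)
        return cp + 0x20;       /* ASCII A-Z -> a-z */
    return cp;                  /* U+0130 and U+00DF have no simple folding */
}

/*
 * Hypothetical helper: full case folding of a whole string of code points.
 * One input code point may expand to two, so the output length can differ
 * from the input length. Returns the number of code points written; the
 * caller must supply a large enough buffer (checked with assert).
 */
static size_t
fold_full(const uint32_t *in, size_t inlen,
          uint32_t *out, size_t outcap, int turkic)
{
    size_t      n = 0;

    for (size_t i = 0; i < inlen; i++)
    {
        uint32_t    cp = in[i];

        assert(n + 2 <= outcap);
        if (cp == 0x00DF)
        {
            /* sharp s -> "ss" */
            out[n++] = 0x0073;
            out[n++] = 0x0073;
        }
        else if (cp == 0x0130 && !turkic)
        {
            /* I with dot above -> i + combining dot above */
            out[n++] = 0x0069;
            out[n++] = 0x0307;
        }
        else
            out[n++] = fold_simple(cp, turkic);
    }
    return n;
}
```

With the three-code-point input U+0049 U+0130 U+00DF, the locale-agnostic full folding produces five code points, while the Turkic variant produces four and maps "I" to U+0131 instead of U+0069. Code that folds one character at a time into a buffer of the same length cannot represent either result.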
So right now the entries in collcollate and collctype need to be valid for ICU *and* POSIX for everything to work. Also note that ICU locales are encoding-independent and don't support a separate collcollate and collctype, so the existing catalog structure is not optimal.

Where it gets really interesting is what to do with the database locales. They just set the global process locale. So in order to port that to ICU we'd need to check every implicit use of the process locale and tweak it. We could add a datcollprovider column or something. But we also rely on the datctype setting to validate the encoding of the database. Maybe we wouldn't need that anymore, but it sounds risky. We could have a datcollation column that by OID references a collation defined inside the database. With a background worker, we can log into the database as it is being created and make adjustments, including defining or adjusting collation definitions. This would open up interesting new possibilities.

What is a way to go forward here? What's a minimal useful feature that is future-proof? Just allow named collations referencing ICU for now? Is throwing out POSIX locales even for the process locale reasonable?

Oh, that case folding code in formatting.c needs some refactoring. There are so many ifdefs there and it's repeated almost identically three times, it's crazy to work in that.

[0]: https://www.postgresql.org/message-id/flat/789A2F56-0E42-409D-A840-6AF5110D6085%40pingpong.net

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services