Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it? - Mailing list pgsql-hackers
From | Peter Eisentraut |
---|---|
Subject | Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it? |
Date | |
Msg-id | 4e66ffa4-97db-d656-5b2a-12f9e8f50d91@2ndquadrant.com Whole thread Raw |
In response to | Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it? (Peter Geoghegan <pg@bowt.ie>) |
Responses |
Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?
|
List | pgsql-hackers |
On 9/21/17 16:55, Peter Geoghegan wrote: >> I strongly prefer if there, as much as possible, is only one format for >> inputting ICU locales. > I agree, but the bigger issue is that we're *half way* between > supporting only one format, and supporting two formats. AFAICT, there > is no reason that we can't simply support one format on all ICU > versions, and keep what ends up within pg_collation at initdb time > essentially the same across ICU versions (except for those that are > due to cultural/political developments). After reviewing this thread and testing around a bit, I think the code is mostly fine as it is, but the documentation is lacking. Hence, attached is a patch to expand the documentation quite a bit, especially to document in more detail what ICU locale strings are accepted. The documentation has always stated, albeit perhaps in obscure ways, that we accept for locales whatever ICU accepts. I don't think we should restrict or override that in any way. That would cause existing documentation and lore on ICU to be invalid for PostgreSQL. I specifically disagree that we should, as appears to be suggested here, restrict the input locale strings to BCP 47 and reject or transform the traditional ICU-specific locale syntax. Again, that would cause existing ICU documentation material to become less applicable to PostgreSQL. And basically, there is no reason for it, because I am not aware that ICU plans to get rid of that syntax. Moreover, we need to support that syntax for older ICU versions anyway. A patch has been posted that, as I understand it, would allow BCP 47 syntax to be used with older versions as well. That's a nice idea, but it's a new feature and not appropriate for PG10 at this time. (Note that we also don't canonicalize libc locale names.) The attached documentation patch documents both locale naming forms in parallel. The other attached patch contains a test suite that verifies that the examples in the documentation actually work. It's not meant for committing into the mainline, but it was useful for me. During testing with various versions I have also discovered the following things: - The current code appears to be of the opinion that BCP 47 locale names are accepted as of ICU 54. That appears to be incorrect; I find that they work fine in ICU 52, but they don't work in ICU 4.2. I don't know where the cutoff is. That might be worth changing in the code if we can obtain more information. - What might have led to the above mistake is the following in the ucol_open() documentation: q{Starting with ICU 54, collation attributes can be specified via locale keywords as well, in the old locale extension syntax ("el@colCaseFirst=upper") or in language tag syntax ("el-u-kf-upper").} That is correct. If you use that syntax in earlier versions, the case-first specification in this example is ignored. You need to use ucol_setAttribute() then. (Note that the phonebook stuff still works, because that is not a "collation attribute".) (One of my plans for PG11 had been to allow specifying such collation attributes via additional CREATE COLLATION clauses and pg_collation columns, but that might be obsolete now.) - Moreover, it is not the case that ICU accepts just any sort of nonsense as a locale name. For example, "el@colCaseFirst=whatever" is rejected with U_ILLEGAL_ARGUMENT_ERROR. Now, it might in other cases be more liberal than we might be hoping for. But I think they have reasons for what they do. One conclusion here is that there are multiple dimensions to what sort of thing is accepted as a locale in different versions of ICU and what the canonical format is. If we insist that everything has to be written in the form that is preferred today, then we'll have awkward problems if a future version of ICU establishes even more variants that will then be preferred. I think there is room for incremental refinement here. Features like optionally checking or printing the canonical or preferred locale format or making the locale description available via a function or system view might all be valuable additions to a future PostgreSQL release. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
pgsql-hackers by date: