[HACKERS] Patch to address concerns about ICU collcollate stability in v10(Was: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?) - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | [HACKERS] Patch to address concerns about ICU collcollate stability in v10(Was: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?) |
Date | |
Msg-id | CAH2-WzksxL=rbdVgvCQSruav5ssn+=WBbRGzbchAH2w8WxNyQg@mail.gmail.com Whole thread Raw |
Responses |
Re: [HACKERS] Patch to address concerns about ICU collcollatestability in v10 (Was: CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?)
[HACKERS] Re: Patch to address concerns about ICU collcollate stability in v10(Was: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?) |
List | pgsql-hackers |
On Fri, Sep 22, 2017 at 11:34 AM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > After reviewing this thread and testing around a bit, I think the code > is mostly fine as it is, but the documentation is lacking. Hence, > attached is a patch to expand the documentation quite a bit, especially > to document in more detail what ICU locale strings are accepted. Attached is my counterproposal. This patch has us always make sure that collcollate is in BCP 47 format, even on ICU versions prior to ICU 54. Other improvements/bug fixes: * Sanitizes locale names within CREATE COLLATION. This has been tested on ICU 42 (earliest supported version) and ICU 55. * Enforces a NAMEDATALEN restriction for collcollate during CREATE COLLATION, forbidding collcollate truncation. This is useful because truncating can allow subtly wrong answers later on. * Adds DEBUG1 message with ICU display name, so we can satisfy ourselves that we're getting the expected behavior, to some degree. I used this to confirm that we get consistent behavior between ICU 42 and 55 for CREATE COLLATION. On ICU 42, keyword collation attributes (e.g., the emoji keyword, numeric ordering for natural sort order) still don't work, just as before, but the locale string is still considered valid. (This is because ucol_setAttribute() is supposed to be used there). * Documents the aforementioned keyword collation attribute restriction on ICU versions before ICU 54. This was needed anyway. We only claim for Postgres collations what the ICU docs claim for ICU collators, even though there is reason to believe that some ICU versions before ICU 54 actually can do better. * When using ICU 4.2, the examples in the docs (variant collations like German Phonebook order) now actually work. The examples are completely broken right now IMV, because the user has to know that they are on ICU 4.2, which only accepts the old style locale strings as things stand. And, they'd have no obvious indication that things were broken without this patch, because there would have been no sanitization or other feedback. * Creates root collation as root-x-icu (collcollate "root"), not und-x-icu. "und" means undefined language. * Moves all knowledge of ICU version issues to within a few pg_locale.c routines, leaving the code reasonably well encapsulated. * Does an encoding conversion when getting a display name for the initdb collation comment. This needs to be ascii-safe due to the initdb/template0 DB encoding restriction, but I suspect that the way we do it now is subtly wrong. This does imply that SQL_ASCII databases will never get ICU pg_collation entries that come with comments added by initdb, but such databases were already unable to use the collations anyway, so this is no loss at all. (SQL_ASCII is mentioned here as an example of a database collation that ICU doesn't support, all of which are affected in this way.) I decided to implement CREATE COLLATION sanitization based on whether or not a display string can be generated (if not, or if it's empty, we reject). This seems like about the right level of strictness to me, because we're still very forgiving. I admit that that's a little bit arbitrary, but it does seem to be a good match for Postgres; it is forgiving of things that could plausibly make sense on another ICU version to some user at some time, but rejects most things that are inherently wrong, IMHO. You can still ask for Japanese as spoken in Madagascar, or even specify a language that ICU has never heard of, and there is no error. It catches syntax errors only. See the slightly expanded tests for details. I'm very open to negotiating the exact details of how we sanitize, but any level of sanitization will be at least a little bit arbitrary (including what we have right now, which is no sanitization). Aside from the specific problems for Postgres that I've noted that the patch prevents or fixes, there is another reason to do this. The old style locale name format is officially deprecated by ICU, which makes it seem like we should never expose it to users in the first place. Per ICU docs: "Starting with ICU 54, the following naming scheme and its API functions are deprecated. Use ucol_open() with language tag collation keywords instead (see Collation API Details)" [1] [1] http://userguide.icu-project.org/collation/concepts#TOC-Collator-naming-scheme -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
pgsql-hackers by date: