Re: encoding affects ICU regex character classification - Mailing list pgsql-hackers
From | Jeff Davis |
---|---|
Subject | Re: encoding affects ICU regex character classification |
Date | |
Msg-id | 40bb6ee11ab2679024d625224f026f9e198cc7f7.camel@j-davis.com |
In response to | Re: encoding affects ICU regex character classification (Thomas Munro <thomas.munro@gmail.com>) |
Responses | Re: encoding affects ICU regex character classification |
List | pgsql-hackers |
On Thu, 2023-11-30 at 15:10 +1300, Thomas Munro wrote:
> On Thu, Nov 30, 2023 at 1:23 PM Jeff Davis <pgsql@j-davis.com> wrote:
> > Character classification is not localized at all in libc or ICU as
> > far as I can tell.
>
> Really? POSIX isalpha()/isalpha_l() and friends clearly depend on a
> locale. See eg d522b05c for a case where that broke something.

I believe we're using different definitions of "localized". What I mean
is "varies from region to region or language to language". I think you
mean "varies for any reason at all [perhaps for no reason?]".

For instance, that commit indirectly links to:

https://github.com/evanj/isspace_locale

which says "Mac OS X in a UTF-8 locale...". I don't see any fundamental
locale-based concern there.

I wrote a test program (attached) which compares any two given libc
locales using both the ordinary isalpha() family of functions and the
iswalpha() family of functions. For the former, I only test up to 0x7f.
For the latter, I went to some effort to properly translate the code
point to a wchar_t (encode as UTF8, then mbstowcs using a UTF-8
locale), and I test all Unicode code points except the surrogate range.
(A simplified sketch of the comparison appears below.)

Using the test program, I compared the C.UTF-8 locale to every other
installed locale on my system (attached list for reference), and the
only ones that show any differences are "C" and "POSIX". That, combined
with the fact that ICU doesn't even accept a locale argument to its
character classification functions, gives me a high degree of
confidence that character classification is not localized on my system
according to my definition of "localized". If someone else wants to run
the test program on their system, I'd be interested to see the results
(some platform-specific modification may be required, e.g. handling
16-bit wchar_t, etc.).

Your definition is too wide in my opinion, because it mixes together
different sources of variation that are best left separate:

  a. region/language
  b. technical requirements
  c. versioning
  d. implementation variance

(a) is not a true source of variation (please correct me if I'm wrong).

(b) is perhaps interesting. The "C" locale is one example, and perhaps
there are others, but I doubt there are very many others that we want
to support.

(c) is not a major concern in my opinion. The impact of Unicode changes
is usually not dramatic, and it only affects regexes, so it's much more
contained than collation, for example. And if you really care, just use
the "C" locale.

(d) is mostly a bug. Most users would prefer standardization,
platform-independence, documentability, and testability. There are
users who might care a lot about compatibility, and I don't want to
disrupt such users, but long term I don't see a lot of value in
bubbling up semantics from libc into expressions when there's not a
clear reason to do so.

(Note: getting semantics from libc is a bit dubious in the case of
collation as well, but at least for collation there are regional and
linguistic differences that we can't handle internally.)

I think we only need two main character classification schemes: "C" and
Unicode (TR #18 Compatibility Properties[1], either the "Standard"
variant or the "POSIX Compatible" variant or both). The libc and ICU
ones should be there only for compatibility, discouraged, and hopefully
eventually removed.
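To make that concrete, here is a simplified sketch of the kind of
comparison the test program does. This is not the attached program: it
assumes a glibc-style 32-bit UCS-4 wchar_t and passes code points to
iswalpha_l() directly, rather than going through the UTF-8/mbstowcs()
conversion described above.

#define _GNU_SOURCE				/* newlocale() and *_l() on glibc */
#include <ctype.h>
#include <locale.h>
#include <stdio.h>
#include <wctype.h>

int
main(int argc, char **argv)
{
	locale_t	loc1;
	locale_t	loc2;
	int			ndiff = 0;

	if (argc != 3)
	{
		fprintf(stderr, "usage: %s locale1 locale2\n", argv[0]);
		return 1;
	}

	loc1 = newlocale(LC_ALL_MASK, argv[1], (locale_t) 0);
	loc2 = newlocale(LC_ALL_MASK, argv[2], (locale_t) 0);
	if (loc1 == (locale_t) 0 || loc2 == (locale_t) 0)
	{
		fprintf(stderr, "could not load locales\n");
		return 1;
	}

	/* single-byte range: isalpha_l(), code points up to 0x7f */
	for (int c = 0; c <= 0x7f; c++)
	{
		if ((isalpha_l(c, loc1) != 0) != (isalpha_l(c, loc2) != 0))
			ndiff++;
	}

	/* wide range: iswalpha_l(), all code points except surrogates */
	for (unsigned int cp = 0; cp <= 0x10FFFF; cp++)
	{
		if (cp >= 0xD800 && cp <= 0xDFFF)
			continue;
		if ((iswalpha_l((wint_t) cp, loc1) != 0) !=
			(iswalpha_l((wint_t) cp, loc2) != 0))
			ndiff++;
	}

	printf("differing classifications: %d\n", ndiff);
	freelocale(loc1);
	freelocale(loc2);
	return 0;
}

Extending it to the rest of the classification functions (isdigit_l(),
isspace_l(), and so on) is mechanical.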
> Not knowing anything about how glibc generates its charmaps, Unicode
> or pre-Unicode, I could take a wild guess that maybe in LATIN9 they
> have an old hand-crafted table, but for UTF-8 encoding it's fully
> outsourced to Unicode, and that's why you see a difference.

No, the problem is that we're passing a pg_wchar to an ICU function
that expects a 32-bit code point. Those two things are equivalent in
the UTF8 encoding, but not in the LATIN9 encoding. (A tiny illustration
of the mismatch appears at the end of this message.) See the comment at
the top of regc_pg_locale.c, which should probably be updated to
describe what happens with ICU collations.

> Another problem seen in a few parts of our tree is that we sometimes
> feed individual UTF-8 bytes to the isXXX() functions which is about
> as well defined as trying to pay for a pint with the left half of a
> $10 bill.

If we have built-in character classification systems as I propose ("C"
and Unicode), then the callers can simply choose which well-defined one
to use.

> also might disagree with tables built by PostgreSQL.

The patch I provided (new version attached) exhaustively tests all the
new Unicode property tables, and also the class assignments based on
[1] and [2]. Everything is identical for all assigned code points. The
test will run whenever you "ninja update-unicode", so any
inconsistencies will be highly visible before release. Additionally,
because the tables are checked in, you'll be able to see (in the diff)
the impact of a Unicode version update and consider that impact when
writing the release notes.

You may be wondering about differences in the version of Unicode
between Postgres and ICU while the test is running. The test only
compares code points that are assigned in both Unicode versions, and
reports the number of code points skipped because of this check. The
person running "update-unicode" may see a failing test or a large
number of skipped code points if the Unicode versions don't match, in
which case they should try running against a more closely matching
version of ICU.

Regards,
	Jeff Davis

[1] http://www.unicode.org/reports/tr18/#Compatibility_Properties
[2] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#details
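P.S. Here's a minimal, self-contained illustration of the pg_wchar vs.
code point mismatch mentioned above (just a sketch against ICU's
uchar.h API, not part of the patch). In LATIN9 (ISO-8859-15), the byte
0xBD is the letter "œ" (U+0153), so handing the raw pg_wchar value 0xBD
to ICU classifies U+00BD (VULGAR FRACTION ONE HALF) instead of the
intended letter:

#include <stdio.h>
#include <unicode/uchar.h>		/* u_isalpha(); link with -licuuc */

int
main(void)
{
	UChar32		raw_pg_wchar = 0xBD;	/* LATIN9 byte for the letter oe */
	UChar32		code_point = 0x0153;	/* U+0153 LATIN SMALL LIGATURE OE */

	/* ICU sees U+00BD, a fraction, not a letter: prints 0 */
	printf("u_isalpha(0xBD)   = %d\n", (int) u_isalpha(raw_pg_wchar));

	/* classifying the actual code point gives the expected answer: prints 1 */
	printf("u_isalpha(U+0153) = %d\n", (int) u_isalpha(code_point));

	return 0;
}

Mapping the pg_wchar to a real code point first (or using built-in
Unicode tables, as proposed) removes the dependence on the database
encoding.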