Thread: Built-in CTYPE provider
CTYPE, which handles character classification and upper/lowercasing behavior, may be simpler than it first appears. We may be able to get a net decrease in complexity by just building in most (or perhaps all) of the functionality. Unicode offers relatively simple rules for CTYPE-like functionality based on data files. There are a few exceptions and a few options, which I'll address below. (In contrast, collation varies a lot from locale to locale, and has many more options and nuances than ctype.)

=== Proposal ===

Parse some Unicode data files into static lookup tables in .h files (similar to what we already do for normalization) and provide functions to perform the right lookups according to Unicode recommendations[1][2]. Then expose the functionality as either a specially-named locale for the libc provider, or as part of the built-in collation provider which I previously proposed[3]. (The provided patches don't expose the functionality yet; I'm looking for feedback first.)

Using libc or ICU for a CTYPE provider would still be supported, but as I explain below, there's not nearly as much reason to do so as you might expect. As far as I can tell, using an external provider for CTYPE functionality is mostly unnecessary complexity and magic. There's still plenty of reason to use the plain "C" semantics, if desired, but those semantics are already built-in.
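To make the proposal concrete, here is a minimal Python sketch of the kind of transformation an "update-unicode" step performs: turning UnicodeData.txt-style records into static lookup tables. The real scripts are Perl and emit C arrays into .h files; the sample records and the function name here are illustrative only.

```python
# Sketch: parse UnicodeData.txt-format records into lookup tables.
# Field 2 is the General Category; field 12 is the simple uppercase
# mapping. Embedded sample lines stand in for the real data file.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041;
00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;;;;
0031;DIGIT ONE;Nd;0;EN;;1;1;1;N;;;;;
"""

def build_tables(data: str):
    category = {}
    simple_upper = {}
    for line in data.splitlines():
        fields = line.split(';')
        cp = int(fields[0], 16)
        category[cp] = fields[2]
        if fields[12]:                    # simple uppercase mapping, if any
            simple_upper[cp] = int(fields[12], 16)
    return category, simple_upper

category, simple_upper = build_tables(SAMPLE)
print(category[0x61])              # Ll
print(hex(simple_upper[0x61]))     # 0x41
print(0xDF in simple_upper)        # False: "ß" has no simple uppercase
```

Note how "ß" falls out of the simple mapping table entirely; its "SS" expansion lives in SpecialCasing.txt, which is the separate (small) data file discussed below.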
=== Benefits ===

* platform-independent ctype semantics based on Unicode, not tied to any dependency's implementation
* ability to combine fast memcmp() collation with rich ctype semantics
* user-visible semantics that can be documented and tested
* stability within a PG major version
* transparency of changes: tables would be checked in to .h files, so whoever runs the "update-unicode" build target would see if there are unexpected or impactful changes that should be addressed in the release notes
* the built-in tables themselves can be tested exhaustively by comparing with ICU, so we can detect trivial parsing errors and the like

=== Character Classification ===

Character classification is used for regexes, e.g. whether a character is a member of the "[[:digit:]]" ("\d") or "[[:punct:]]" class. Unicode defines which character properties map into these classes in TR #18 [1], specifying both a "Standard" variant and a "POSIX Compatible" variant. The main difference in the POSIX variant is that symbols count as punctuation.

Character classification in Unicode does not vary from locale to locale: the same character is considered a member of the same classes regardless of locale (in other words, there's no "tailoring"). There is no strong compatibility guarantee around the classification of characters, but it doesn't seem to change much in practice (I could collect more data here if it matters).

In glibc, character classification is not affected by the locale as far as I can tell -- all non-"C" locales behave like "C.UTF-8" (perhaps other libc implementations, versions, or custom locales behave differently -- corrections welcome). There are some differences between "C.UTF-8" and what Unicode seems to recommend, and I'm not entirely sure why those differences exist or whether they are important for anything other than compatibility.
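The Standard-vs-POSIX difference can be seen with the stdlib in Python (this uses Python's unicodedata, not the proposed Postgres code, and is a rough sketch of the TR #18 definitions, ignoring the variants' finer points):

```python
import unicodedata

def is_punct(ch: str, posix_compatible: bool = False) -> bool:
    # TR #18 "Standard" [[:punct:]] is roughly General Category P*;
    # the "POSIX Compatible" variant also counts symbols (S*).
    cat = unicodedata.category(ch)
    if cat.startswith('P'):
        return True
    return posix_compatible and cat.startswith('S')

print(is_punct(','))                          # True either way (Po)
print(is_punct('+'))                          # False: '+' is a symbol (Sm)
print(is_punct('+', posix_compatible=True))   # True under the POSIX variant
```

The '+' character is the interesting case: it is General Category Sm (math symbol), so it is punctuation only under the POSIX-compatible variant.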
Note: ICU offers character classification based on Unicode standards, too, but the fact that it's an external dependency makes it a difficult-to-test black box that is not tied to a PG major version. Also, we currently don't use the APIs that Unicode recommends; so in Postgres today, ICU-based character classification is further from Unicode than glibc character classification.

=== LOWER()/INITCAP()/UPPER() ===

The LOWER() and UPPER() functions are defined in the SQL spec with surprising detail, relying on specific Unicode General Category assignments. How to map characters seems to be left (implicitly) up to Unicode. If the input string is normalized, the output string must be normalized, too. Weirdly, there's no room in the SQL spec to localize LOWER()/UPPER() at all to handle issues like [1]. Also, the standard specifies one example, which is that "ß" becomes "SS" when folded to upper case. INITCAP() is not in the SQL spec.

In Unicode, lowercasing and uppercasing behavior is a mapping[2], backed by a strong compatibility guarantee that "case pairs" will always remain case pairs[4]. The mapping may be "simple" (context-insensitive, locale-insensitive, not adding any code points), or "full" (may be context-sensitive, locale-sensitive, and one code point may turn into 1-3 code points).

Titlecasing (INITCAP() in Postgres) in Unicode is similar to upper/lowercasing, except that it has the additional complexity of finding word boundaries, which have a non-trivial definition. To simplify, we'd either use the Postgres definition (alphanumeric) or the "word" character class specified in [1]. If someone wants more sophisticated word segmentation, they could use ICU.

While "full" case mapping sounds more complex, there are actually very few cases to consider, and they are covered in another (small) data file. That data file covers ~100 code points that convert to multiple code points when the case changes (e.g. "ß" -> "SS"), 7 code points that have context-sensitive mappings, and three locales which have special conversions ("lt", "tr", and "az") for a few code points.

ICU can do the simple case mapping (u_tolower(), etc.) or full mapping (u_strToLower(), etc.). I see one difference in ICU that I can't yet explain for the full titlecase mapping of a singular U+0345. glibc in UTF8 (at least in my tests) just does the simple upper/lower case mapping, extended with simple mappings for the locales with special conversions (which I think are exactly the same 3 locales mentioned above). libc doesn't do titlecase. If the resulting character isn't representable in the server encoding, I think libc just maps the character to itself, though I should test this assumption.

=== Encodings ===

It's easiest to implement these rules in UTF8, but possible for any encoding where we can decode to a Unicode code point.

=== Patches ===

0001 & 0002 are just cleanup. I intend to commit them unless someone has a comment. 0003 implements character classification ("Standard" and "POSIX Compatible" variants) but doesn't actually use it for anything. 0004 implements "simple" case mapping, and a partial implementation of "full" case mapping. Again, these are not used yet.

=== Questions ===

* Is a built-in ctype provider a reasonable direction for Postgres as a project?
* Does it feel like it would be simpler or more complex than what we're doing now?
* Do we want to just try to improve our ICU support instead?
* Do we want the built-in provider to be one thing, or have a few options (e.g. "standard" or "posix" character classification; "simple" or "full" case mapping)?
Regards,
	Jeff Davis

[1] http://www.unicode.org/reports/tr18/#Compatibility_Properties
[2] https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf#G33992
[3] https://www.postgresql.org/message-id/flat/9d63548c4d86b0f820e1ff15a83f93ed9ded4543.camel@j-davis.com
[4] https://www.unicode.org/policies/stability_policy.html#Case_Pair

--
Jeff Davis
PostgreSQL Contributor Team - AWS
On 12/5/23 3:46 PM, Jeff Davis wrote:
> === Character Classification ===
>
> Character classification is used for regexes, e.g. whether a character
> is a member of the "[[:digit:]]" ("\d") or "[[:punct:]]"
> class. Unicode defines what character properties map into these
> classes in TR #18 [1], specifying both a "Standard" variant and a
> "POSIX Compatible" variant. The main difference with the POSIX variant
> is that symbols count as punctuation.
>
> === LOWER()/INITCAP()/UPPER() ===
>
> The LOWER() and UPPER() functions are defined in the SQL spec with
> surprising detail, relying on specific Unicode General Category
> assignments. How to map characters seems to be left (implicitly) up to
> Unicode. If the input string is normalized, the output string must be
> normalized, too. Weirdly, there's no room in the SQL spec to localize
> LOWER()/UPPER() at all to handle issues like [1]. Also, the standard
> specifies one example, which is that "ß" becomes "SS" when folded to
> upper case. INITCAP() is not in the SQL spec.
>
> === Questions ===
>
> * Is a built-in ctype provider a reasonable direction for Postgres as
>   a project?
> * Does it feel like it would be simpler or more complex than what
>   we're doing now?
> * Do we want to just try to improve our ICU support instead?
> * Do we want the built-in provider to be one thing, or have a few
>   options (e.g. "standard" or "posix" character classification;
>   "simple" or "full" case mapping)?

Generally, I am in favor of this - I think we need to move in the direction of having an in-database option around Unicode for PG users, given how easy it is for administrators to mis-manage dependencies. Especially when OS admins can be different from DB admins, and when nobody really understands the risks of changing libs with in-place moves to new operating systems - except for like 4 of us on the mailing lists.

My biggest concern is around maintenance.
Every year Unicode is assigning new characters to existing code points, and those existing code points can of course already be stored in old databases before libs are updated. When users start to notice that regex [[:digit:]] or upper/lower functions aren't working correctly with characters in their DB, they'll probably come asking for fixes. And we may end up with something like the timezone database, where we need to periodically add a more current ruleset - albeit alongside, as a new version, in this case.

Here are direct links to charts of newly assigned characters from the last few Unicode updates:

2022: https://www.unicode.org/charts/PDF/Unicode-15.0/
2021: https://www.unicode.org/charts/PDF/Unicode-14.0/
2020: https://www.unicode.org/charts/PDF/Unicode-13.0/
2019: https://www.unicode.org/charts/PDF/Unicode-12.0/

If I'm reading the Unicode 15 update correctly, PostgreSQL regex expressions with [[:digit:]] will not correctly identify Kaktovik or Nag Mundari or Kawi digits without that update to character type specs. If I'm reading the Unicode 12 update correctly, then upper/lower functions aren't going to work correctly on Latin Glottal A and I and U characters without that update to character type specs.

Overall I see a lot fewer Unicode updates involving upper/lower than I do with digits - especially since new scripts often involve their own numbering characters, which makes new digits more common.

But let's remember that people like to build indexes on character classification functions like upper/lower, for case-insensitive searching. It's another case where the index will be corrupted if someone happened to store Latin Glottal vowels in their database and then we update libs to the latest character type rules.

So even with something as basic as character type, if we're going to do it right, we still need to either version it or definitively decide that we're never going to support newly added Unicode characters like Latin Glottals.
-Jeremy

--
http://about.me/jeremy_schneider
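Jeremy's point about newly assigned digits is easy to see with any non-ASCII digit: engines that follow TR #18 define \d as General Category Nd, so whether a code point matches depends entirely on which Unicode version the engine's tables were built from. A Python illustration (Python's re module uses the Nd definition for \d on str patterns; the Kaktovik result below depends on the runtime's Unicode version, so no output is claimed for it):

```python
import re
import unicodedata

# Long-assigned non-ASCII digits already match \d ...
print(bool(re.fullmatch(r'\d', '५')))   # True: DEVANAGARI DIGIT FIVE (Nd)
print(unicodedata.digit('५'))           # 5

# ... but a digit from a script added in a newer Unicode version only
# matches once the runtime's tables include it. Kaktovik digits
# (U+1D2C0..) were assigned in Unicode 15.0, so this is True only on
# runtimes built with Unicode >= 15.0:
print(unicodedata.unidata_version)
print(bool(re.fullmatch(r'\d', '\U0001D2C0')))
```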
On Tue, 2023-12-12 at 13:14 -0800, Jeremy Schneider wrote:
> My biggest concern is around maintenance. Every year Unicode is
> assigning new characters to existing code points, and those existing
> code points can of course already be stored in old databases before
> libs are updated.

Is the concern only about unassigned code points?

I already committed a function "unicode_assigned()" to test whether a string contains only assigned code points, which can be used in a CHECK() constraint. I also posted[5] an idea about a per-database option that could reject the storage of any unassigned code point, which would make it easier for users highly concerned about compatibility.

> And we may end up with something like the timezone database where we
> need to periodically add a more current ruleset - albeit alongside
> as a new version in this case.

There's a build target "update-unicode" which is run to pull in new Unicode data files and parse them into static C arrays (we already do this for the Unicode normalization tables). So I agree that the tables should be updated, but I don't understand why that's a problem.

> If I'm reading the Unicode 15 update correctly, PostgreSQL regex
> expressions with [[:digit:]] will not correctly identify Kaktovik or
> Nag Mundari or Kawi digits without that update to character type
> specs.

Yeah, if we are behind in the Unicode version, then results won't be the most up-to-date. But ICU or libc could also be behind in the Unicode version.

> But let's remember that people like to build indexes on character
> classification functions like upper/lower, for case insensitive
> searching.

UPPER()/LOWER() are based on case mapping, not character classification. I intend to introduce a SQL-level CASEFOLD() function that would obey Unicode casefolding rules, which have very strong compatibility guarantees[6] (essentially, if you are only using assigned code points, you are fine).
> It's another case where the index will be corrupted if someone
> happened to store Latin Glottal vowels in their database and then we
> update libs to the latest character type rules.

I don't agree with this characterization at all.

(a) It's not "another case". Corruption of an index on LOWER() can happen today. My proposal makes the situation better, not worse.
(b) These aren't libraries; I am proposing built-in Unicode tables that only get updated in a new major PG version.
(c) It likely only affects a small number of indexes and it's easier for an administrator to guess which ones might be affected, making it easier to just rebuild those indexes.
(d) It's not a problem if you stick to assigned code points.

> So even with something as basic as character type, if we're going to
> do it right, we still need to either version it or definitively
> decide that we're never going to support newly added Unicode
> characters like Latin Glottals.

If, by "version it", you mean "update the data tables in new Postgres versions", then I agree. If you mean that one PG version would need to support many versions of Unicode, I don't agree.

Regards,
	Jeff Davis

[5] https://postgr.es/m/c5e9dac884332824e0797937518da0b8766c1238.camel@j-davis.com
[6] https://www.unicode.org/policies/stability_policy.html#Case_Folding
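The assigned-code-point restriction Jeff describes is simple to state: an unassigned code point is exactly one with General Category Cn ("Other, not assigned"). A hedged Python analogue of unicode_assigned() (the real function is C inside Postgres; this just shows the rule, and results depend on the runtime's Unicode version):

```python
import unicodedata

def unicode_assigned(s: str) -> bool:
    # A code point is unassigned iff its General Category is Cn.
    # Mirrors the spirit of Postgres's unicode_assigned(), modulo
    # Unicode version differences between runtimes.
    return all(unicodedata.category(ch) != 'Cn' for ch in s)

print(unicode_assigned(chr(3212)))   # True:  U+0C8C, assigned (Kannada)
print(unicode_assigned(chr(3213)))   # False: U+0C8D, a gap in the block
```

A value passing this check is safe against future reclassification in the sense discussed above: Unicode only assigns new characters to currently unassigned code points.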
Jeff Davis wrote:
> While "full" case mapping sounds more complex, there are actually
> very few cases to consider and they are covered in another (small)
> data file. That data file covers ~100 code points that convert to
> multiple code points when the case changes (e.g. "ß" -> "SS"), 7
> code points that have context-sensitive mappings, and three locales
> which have special conversions ("lt", "tr", and "az") for a few code
> points.

But there are CLDR mappings on top of that. According to the Unicode FAQ, https://unicode.org/faq/casemap_charprop.html#5

  Q: Does the default case mapping work for every language? What about
  the default case folding?
  [...]
  To make case mapping language sensitive, the Unicode Standard
  specifically allows implementations to tailor the mappings for each
  language, but does not provide the necessary data. The file
  SpecialCasing.txt is included in the Standard as a guide to a few of
  the more important individual character mappings needed for specific
  languages, notably the Greek script and the Turkic languages.
  However, for most language-specific mappings and tailoring, users
  should refer to CLDR and other resources.

In particular, "el" (modern Greek) has case mapping rules that ICU seems to implement, but "el" is missing from the list ("lt", "tr", and "az") you identified.

The CLDR case mappings seem to be found in
https://github.com/unicode-org/cldr/tree/main/common/transforms
in *-Lower.xml and *-Upper.xml

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
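Daniel's "el" example can be demonstrated from the untailored side: the default full uppercase mapping keeps Greek accents, whereas the CLDR "el" tailoring (as implemented by ICU's el locale) strips them. Python's str.upper() applies only the default mapping, so it shows what you get without the tailoring; a sketch under that assumption:

```python
# Default Unicode full uppercasing keeps the tonos:
print('ά'.upper())       # Ά (U+0386), accent preserved

# Greek typographic convention (the CLDR 'el' tailoring) would instead
# produce unaccented Α. That tailoring lives in CLDR transforms, not in
# SpecialCasing.txt, which is why "el" is absent from the lt/tr/az list.
print('όπου'.upper())    # ΌΠΟΥ untailored; Greek convention: ΟΠΟΥ
```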
On Wed, 2023-12-13 at 16:34 +0100, Daniel Verite wrote:
> But there are CLDR mappings on top of that.

I see, thank you.

Would it still be called "full" case mapping to only use the mappings in SpecialCasing.txt? And would that be useful?

Regards,
	Jeff Davis
On Wed, 2023-12-13 at 16:34 +0100, Daniel Verite wrote:
> In particular "el" (modern greek) has case mapping rules that
> ICU seems to implement, but "el" is missing from the list
> ("lt", "tr", and "az") you identified.

I compared with glibc el_GR.UTF-8 and el_CY.UTF-8 locales, and the ctype semantics match C.UTF-8 for all code points. glibc is not doing this additional tailoring for "el".

Therefore I believe the builtin CTYPE would be very useful for case mapping (both "simple" and "full") even without this additional tailoring. You are correct that ICU will still have some features that won't be supported by the builtin provider. Better word boundary semantics in INITCAP() are another advantage.

Regards,
	Jeff Davis
On 12/13/23 5:28 AM, Jeff Davis wrote:
> On Tue, 2023-12-12 at 13:14 -0800, Jeremy Schneider wrote:
>> My biggest concern is around maintenance. Every year Unicode is
>> assigning new characters to existing code points, and those existing
>> code points can of course already be stored in old databases before
>> libs are updated.
>
> Is the concern only about unassigned code points?
>
> I already committed a function "unicode_assigned()" to test whether a
> string contains only assigned code points, which can be used in a
> CHECK() constraint. I also posted[5] an idea about a per-database
> option that could reject the storage of any unassigned code point,
> which would make it easier for users highly concerned about
> compatibility.

I didn't know about this. Did a few smoke tests against today's head on git and it's nice to see the function working as expected. :)

test=# select unicode_version();
 unicode_version
-----------------
 15.1

test=# select chr(3212),unicode_assigned(chr(3212));
 chr | unicode_assigned
-----+------------------
 ಌ   | t

-- unassigned code point inside assigned block
test=# select chr(3213),unicode_assigned(chr(3213));
 chr | unicode_assigned
-----+------------------
     | f

test=# select chr(3214),unicode_assigned(chr(3214));
 chr | unicode_assigned
-----+------------------
 ಎ   | t

-- unassigned block
test=# select chr(67024),unicode_assigned(chr(67024));
 chr | unicode_assigned
-----+------------------
     | f

test=# select chr(67072),unicode_assigned(chr(67072));
 chr | unicode_assigned
-----+------------------
 𐘀   | t

Looking closer, patches 3 and 4 look like an incremental extension of this earlier idea; the perl scripts download data from unicode.org, we've specifically defined Unicode version 15.1, and the scripts turn the data tables inside-out into C data structures optimized for lookup. That C code is then checked in to the PostgreSQL source code files unicode_category.h and unicode_case_table.h - right?
Am I reading correctly that these two patches add C functions pg_u_prop_* and pg_u_is* (patch 3) and unicode_*case (patch 4), but we don't yet reference these functions anywhere? So this is just getting some plumbing in place?

>> And we may end up with something like the timezone database where we
>> need to periodically add a more current ruleset - albeit alongside
>> as a new version in this case.
>
> There's a build target "update-unicode" which is run to pull in new
> Unicode data files and parse them into static C arrays (we already do
> this for the Unicode normalization tables). So I agree that the
> tables should be updated but I don't understand why that's a problem.

I don't want to get stuck on this. I agree with the general approach of beginning to add a provider for locale functions inside the database. We have awhile before Unicode 16 comes out. Plenty of time for bikeshedding.

My prediction is that updating this built-in provider eventually won't be any different from ICU or glibc. It depends a bit on how we specifically build on this plumbing - but when Unicode 16 comes out, I'll try to come up with a simple repro on a default DB config where changing the Unicode version causes corruption (it was pretty easy to demonstrate for ICU collation, if you knew where to look)... but I don't think that discussion should derail this commit, because for now we're just starting the process of getting Unicode 15.1 into the PostgreSQL code base. We can cross the "update" bridge when we come to it.

Later on down the road, from a user perspective, I think we should be careful about confusion where providers are used inconsistently. It's not great if one function follows built-in Unicode 15.1 rules but another function uses Unicode 13 rules because it happened to call an ICU function or a glibc function. We could easily end up with multiple providers processing different parts of a single SQL statement, which could lead to strange results in some cases.
Ideally a user just specifies a default provider for their database, and the rules for that version of Unicode are used as consistently as possible - unless a user explicitly overrides their choice in a table/column definition, query, etc. But it might take a little time and work to get to this point.

-Jeremy

--
http://about.me/jeremy_schneider
On Fri, 2023-12-15 at 16:30 -0800, Jeremy Schneider wrote:
> Looking closer, patches 3 and 4 look like an incremental extension of
> this earlier idea;

Yes, it's essentially the same thing extended to a few more files. I don't know if "incremental" is the right word though; this is a substantial extension of the idea.

> the perl scripts download data from unicode.org and we've
> specifically defined Unicode version 15.1 and the scripts turn the
> data tables inside-out into C data structures optimized for lookup.
> That C code is then checked in to the PostgreSQL source code files
> unicode_category.h and unicode_case_table.h - right?

Yes. The standard build process shouldn't be downloading files, so the static tables are checked in. Also, seeing the diffs of the static tables improves the visibility of changes in case there's some mistake or big surprise.

> Am I reading correctly that these two patches add C functions
> pg_u_prop_* and pg_u_is* (patch 3) and unicode_*case (patch 4) but we
> don't yet reference these functions anywhere? So this is just getting
> some plumbing in place?

Correct. Perhaps I should combine these into the builtin provider thread, but these are independently testable and reviewable.

> My prediction is that updating this built-in provider eventually
> won't be any different from ICU or glibc.

The built-in provider will have several advantages because it's tied to a PG major version:

* A physical replica can't have different semantics than the primary.
* Easier to document and test.
* Changes are more transparent and can be documented in the release notes, so that administrators can understand the risks and blast radius at pg_upgrade time.

> Later on down the road, from a user perspective, I think we should be
> careful about confusion where providers are used inconsistently. It's
> not great if one function follows built-in Unicode 15.1 rules but
> another function uses Unicode 13 rules because it happened to call an
> ICU function or a glibc function. We could easily end up with
> multiple providers processing different parts of a single SQL
> statement, which could lead to strange results in some cases.

The whole concept of "providers" is that they aren't consistent with each other. ICU, libc, and the builtin provider will all be based on different versions of Unicode. That's by design.

The built-in provider will be a bit better in the sense that it's consistent with the normalization functions, and the other providers aren't.

Regards,
	Jeff Davis
On Mon, Dec 18, 2023 at 2:46 PM Jeff Davis <pgsql@j-davis.com> wrote:
> The whole concept of "providers" is that they aren't consistent with
> each other. ICU, libc, and the builtin provider will all be based on
> different versions of Unicode. That's by design.
>
> The built-in provider will be a bit better in the sense that it's
> consistent with the normalization functions, and the other providers
> aren't.

FWIW, the idea that we're going to develop a built-in provider seems to be solid, for the reasons Jeff mentions: it can be stable, and under our control. But it seems like we might need built-in providers for everything rather than just CTYPE to get those advantages, and I fear we'll get sucked into needing a lot of tailoring rather than just being able to get by with one "vanilla" implementation.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, 2023-12-19 at 15:59 -0500, Robert Haas wrote:
> FWIW, the idea that we're going to develop a built-in provider seems
> to be solid, for the reasons Jeff mentions: it can be stable, and
> under our control. But it seems like we might need built-in providers
> for everything rather than just CTYPE to get those advantages, and I
> fear we'll get sucked into needing a lot of tailoring rather than
> just being able to get by with one "vanilla" implementation.

For the database default collation, I suspect a lot of users would jump at the chance to have "vanilla" semantics.

Tailoring is more important for individual collation objects than for the database-level collation. There are reasons you might select a tailored database collation, like if the set of users accessing it are mostly from a single locale, or if the application connected to the database is expecting it in a certain form. But there are a lot of users for whom neither of those things is true, and it makes zero sense to order all of the text indexes in the database according to any one particular locale. I think these users would prioritize stability and performance for the database collation, and then use COLLATE clauses with ICU collations where necessary.

The question for me is how good the "vanilla" semantics need to be to be useful as a database-level collation. Most of the performance and stability problems come from collation, so it makes sense to me to provide a fast and stable memcmp collation paired with richer ctype semantics (as proposed here). Users who want something more probably want the Unicode "root" collation, which can be provided by ICU today.

I am also still concerned that we have the wrong defaults. Almost nobody thinks libc is a great provider, but that's the default, and there were problems trying to change that default to ICU in 16. If we had a builtin provider, that might be a better basis for a default (safe, fast, always available, and documentable).
Then, at least if someone picks a different locale at initdb time, they would be doing so intentionally, rather than implicitly accepting index corruption risks based on an environment variable. Regards, Jeff Davis
Jeff Davis wrote:
> But there are a lot of users for whom neither of those things are
> true, and it makes zero sense to order all of the text indexes in the
> database according to any one particular locale. I think these users
> would prioritize stability and performance for the database
> collation, and then use COLLATE clauses with ICU collations where
> necessary.

+1

> I am also still concerned that we have the wrong defaults. Almost
> nobody thinks libc is a great provider, but that's the default, and
> there were problems trying to change that default to ICU in 16. If we
> had a builtin provider, that might be a better basis for a default
> (safe, fast, always available, and documentable). Then, at least if
> someone picks a different locale at initdb time, they would be doing
> so intentionally, rather than implicitly accepting index corruption
> risks based on an environment variable.

Yes.

The introduction of the bytewise-sorting, locale-agnostic C.UTF-8 in glibc is also a step in the direction of providing better defaults for apps like Postgres, which need both long-term stability in sorts and Unicode coverage for ctype-dependent functions. But C.UTF-8 is not available everywhere, and there's still the problem that Unicode updates through libc are not aligned with Postgres releases.

ICU has the advantage of cross-OS compatibility, but it does not provide any collation with bytewise sorting like C or C.UTF-8, and we don't allow a combination like "C" for sorting and ICU for ctype operations. When opting for a locale provider, it has to be for both sorting and ctype, so an installation that needs cross-OS compatibility, good Unicode support, and long-term stability of indexes cannot get that with ICU as we expose it today.
If the Postgres default was bytewise sorting+locale-agnostic ctype functions directly derived from Unicode data files, as opposed to libc/$LANG at initdb time, the main annoyance would be that "ORDER BY textcol" would no longer be the human-favored sort. For the presentation layer, we would have to write, for instance, ORDER BY textcol COLLATE "unicode" for the root collation, or a specific region-country if needed.

But all the rest seems better, especially cross-OS compatibility, truly immutable and faster indexes for fields that don't require linguistic ordering, and alignment between Unicode updates and Postgres updates.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
On Wed, 2023-12-20 at 13:49 +0100, Daniel Verite wrote:
> If the Postgres default was bytewise sorting+locale-agnostic
> ctype functions directly derived from Unicode data files,
> as opposed to libc/$LANG at initdb time, the main
> annoyance would be that "ORDER BY textcol" would no
> longer be the human-favored sort.
> For the presentation layer, we would have to write for instance
> ORDER BY textcol COLLATE "unicode" for the root collation
> or a specific region-country if needed.
> But all the rest seems better, especially cross-OS compatibility,
> truly immutable and faster indexes for fields that
> don't require linguistic ordering, alignment between Unicode
> updates and Postgres updates.

Thank you, that summarizes exactly the compromise that I'm trying to reach.

Regards,
	Jeff Davis
On Wed, Dec 20, 2023 at 2:13 PM Jeff Davis <pgsql@j-davis.com> wrote:
> On Wed, 2023-12-20 at 13:49 +0100, Daniel Verite wrote:
> > If the Postgres default was bytewise sorting+locale-agnostic
> > ctype functions directly derived from Unicode data files,
> > as opposed to libc/$LANG at initdb time, the main
> > annoyance would be that "ORDER BY textcol" would no
> > longer be the human-favored sort.
> > For the presentation layer, we would have to write for instance
> > ORDER BY textcol COLLATE "unicode" for the root collation
> > or a specific region-country if needed.
> > But all the rest seems better, especially cross-OS compatibility,
> > truly immutable and faster indexes for fields that
> > don't require linguistic ordering, alignment between Unicode
> > updates and Postgres updates.
>
> Thank you, that summarizes exactly the compromise that I'm trying to
> reach.

This makes sense to me, too, but it feels like it might work out better for speakers of English than for speakers of other languages. Right now, I tend to get databases that default to en_US.utf8, and if the default changed to C.utf8, then the case-comparison behavior might be different but the letters would still sort in the right order. For someone who is currently defaulting to es_ES.utf8 or fr_FR.utf8, a change to C.utf8 would be a much bigger problem, I would think. Their alphabet isn't in code point order, and so things would be alphabetized wrongly. That might be OK if they don't care about ordering for any purpose other than equality lookups, but otherwise it's going to force them to change the default, where today they don't have to do that.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, 2023-12-20 at 14:24 -0500, Robert Haas wrote:
> This makes sense to me, too, but it feels like it might work out
> better for speakers of English than for speakers of other languages.

There's very little in the way of locale-specific tailoring for ctype behaviors in ICU or glibc -- only for the 'az', 'el', 'lt', and 'tr' locales. While English speakers like us may benefit from being aligned with the default ctype behaviors, those behaviors are not at all specific to 'en' locales in ICU or glibc.

Collation varies a lot more between locales. I wouldn't call memcmp ideal for English ('Zebra' comes before 'apple', which seems wrong to me). If memcmp sorting does favor any particular group, I would say it favors programmers more than English speakers. But that could just be my perspective, and I certainly understand the point that memcmp ordering is more tolerable for some languages than others.

> Right now, I tend to get databases that default to en_US.utf8, and if
> the default changed to C.utf8, then the case-comparison behavior
> might be different

en_US.UTF-8 and C.UTF-8 have the same ctype behavior.

> For someone who is currently defaulting to es_ES.utf8 or fr_FR.utf8,
> a change to C.utf8 would be a much bigger problem, I would think.

Those locales all have the same ctype behavior. It turns out that en_US.UTF-8 and fr_FR.UTF-8 also have the same collation order -- no tailoring beyond the root collation according to the CLDR files for 'en' and 'fr' (though note that 'fr_CA' does have tailoring). That doesn't mean the experience of switching to memcmp order is exactly the same for a French speaker and an English speaker, but I think it's interesting.

> That might be OK if they don't care about ordering for any purpose
> other than equality lookups, but otherwise it's going to force them
> to change the default, where today they don't have to do that.

To be clear, I haven't proposed changing the initdb default.
This thread is about adding a builtin provider with builtin ctype, which I believe a lot of users would like. It also might be the best chance we have to get to a reasonable default behavior at some point in the future. It would always be available, fast, and stable, with better semantics than "C" for many locales, and we can document it. In any case, we don't need to decide that now. If the builtin provider is useful, we should do it. Regards, Jeff Davis
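Jeff's "Zebra"/"apple" point is easy to see concretely. Here is a small Python illustration (not from the thread's patches) of code point order versus a crude caseless sort:

```python
# 'Z' is U+005A and 'a' is U+0061, so code point (memcmp-style)
# ordering puts all ASCII uppercase before all lowercase letters.
words = ["apple", "Zebra", "banana"]

# Code point order, as a memcmp-based collation would sort:
print(sorted(words))                    # ['Zebra', 'apple', 'banana']

# A caseless sort keyed on casefold() is closer to what English
# speakers expect (though still not a true linguistic collation):
print(sorted(words, key=str.casefold))  # ['apple', 'banana', 'Zebra']
```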
On 12/5/23 3:46 PM, Jeff Davis wrote: > CTYPE, which handles character classification and upper/lowercasing > behavior, may be simpler than it first appears. We may be able to get > a net decrease in complexity by just building in most (or perhaps all) > of the functionality. > > === Character Classification === > > Character classification is used for regexes, e.g. whether a character > is a member of the "[[:digit:]]" ("\d") or "[[:punct:]]" > class. Unicode defines what character properties map into these > classes in TR #18 [1], specifying both a "Standard" variant and a > "POSIX Compatible" variant. The main difference with the POSIX variant > is that symbols count as punctuation. > > === LOWER()/INITCAP()/UPPER() === > > The LOWER() and UPPER() functions are defined in the SQL spec with > surprising detail, relying on specific Unicode General Category > assignments. How to map characters seems to be left (implicitly) up to > Unicode. If the input string is normalized, the output string must be > normalized, too. Weirdly, there's no room in the SQL spec to localize > LOWER()/UPPER() at all to handle issues like [1]. Also, the standard > specifies one example, which is that "ß" becomes "SS" when folded to > upper case. INITCAP() is not in the SQL spec. I'll be honest, even though this is primarily about CTYPE and not collation, I still need to keep re-reading the initial email slowly to let it sink in and better understand it... at least for me, it's complex to reason through. 🙂 I'm trying to make sure I understand clearly what the user impact/change is that we're talking about: after a little bit of brainstorming and looking through the PG docs, I'm actually not seeing much more than these two things you've mentioned here: the set of regexp_* functions PG provides, and these three generic functions. That alone doesn't seem highly concerning. 
I haven't checked the source code for the regexp_* functions yet, but are these just passing through to an external library? Are we actually able to easily change the CTYPE provider for them? If nobody knows/replies then I'll find some time to look. One other thing that comes to mind: how does the parser do case folding for relation names? Is that using OS-provided libc as of today? Or did we code it to use ICU if that's the DB default? I'm guessing libc, and global catalogs probably need to be handled in a consistent manner, even across different encodings. (Kind of related... did you ever see the demo where I create a user named '🏃' and then I try to connect to a database with non-unicode encoding? 💥😜 ...at least it seems to be able to walk the index without decoding strings to find other users - but the way these global catalogs work scares me a little bit) -Jeremy -- http://about.me/jeremy_schneider
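The TR #18 classification variants quoted earlier in the thread come down to Unicode general categories, which makes them easy to poke at outside Postgres. An illustrative Python check (using Python's own Unicode tables, whose version may differ from the one Postgres ships):

```python
import unicodedata

# TR #18 builds character classes from Unicode general categories.
# "Standard" [[:punct:]] is the P* categories; the "POSIX Compatible"
# variant also counts symbols (S* categories) as punctuation.
for ch in ["!", "$", "+", "é", "7"]:
    print(ch, unicodedata.category(ch))

# ! Po  -- punctuation under both variants
# $ Sc  -- currency symbol: punct only in the POSIX-compatible variant
# + Sm  -- math symbol, likewise
# é Ll  -- lowercase letter, so [[:alpha:]]
# 7 Nd  -- decimal digit, so [[:digit:]]
```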
On 12/20/23 3:47 PM, Jeremy Schneider wrote: > On 12/5/23 3:46 PM, Jeff Davis wrote: >> CTYPE, which handles character classification and upper/lowercasing >> behavior, may be simpler than it first appears. We may be able to get >> a net decrease in complexity by just building in most (or perhaps all) >> of the functionality. > > I'll be honest, even though this is primarily about CTYPE and not > collation, I still need to keep re-reading the initial email slowly to > let it sink in and better understand it... at least for me, it's complex > to reason through. 🙂 > > I'm trying to make sure I understand clearly what the user impact/change > is that we're talking about: after a little bit of brainstorming and > looking through the PG docs, I'm actually not seeing much more than > these two things you've mentioned here: the set of regexp_* functions PG > provides, and these three generic functions. That alone doesn't seem > highly concerning. I missed citext, which extends impact to replace(), split_part(), strpos() and translate(). There are also the five *_REGEX() functions from the SQL standard which I assume are just calling the PG functions. I just saw the krb_caseins_users GUC, which reminds me that PLs also have their own case functions. And of course extensions. I'm not saying any of this is in scope for the change here, but I'm just trying to wrap my brain around all the places we've got CTYPE processing happening, to better understand the big picture. It might help tease out unexpected small glitches from changing one thing but not another one. -Jeremy -- http://about.me/jeremy_schneider
On 12/20/23 4:04 PM, Jeremy Schneider wrote: > On 12/20/23 3:47 PM, Jeremy Schneider wrote: >> On 12/5/23 3:46 PM, Jeff Davis wrote: >>> CTYPE, which handles character classification and upper/lowercasing >>> behavior, may be simpler than it first appears. We may be able to get >>> a net decrease in complexity by just building in most (or perhaps all) >>> of the functionality. >> >> I'll be honest, even though this is primarily about CTYPE and not >> collation, I still need to keep re-reading the initial email slowly to >> let it sink in and better understand it... at least for me, it's complex >> to reason through. 🙂 >> >> I'm trying to make sure I understand clearly what the user impact/change >> is that we're talking about: after a little bit of brainstorming and >> looking through the PG docs, I'm actually not seeing much more than >> these two things you've mentioned here: the set of regexp_* functions PG >> provides, and these three generic functions. That alone doesn't seem >> highly concerning. > > I missed citext, which extends impact to replace(), split_part(), > strpos() and translate(). There are also the five *_REGEX() functions > from the SQL standard which I assume are just calling the PG functions.

found some more. here's my running list of everything user-facing I see in core PG code so far that might involve case:

* upper/lower/initcap
* regexp_*() and *_REGEXP()
* ILIKE, operators ~* !~* ~~ !~~ ~~* !~~*
* citext + replace(), split_part(), strpos() and translate()
* full text search - everything is case folded
* unaccent? not clear to me whether CTYPE includes accent folding
* ltree
* pg_trgm
* core PG parser, case folding of relation names

-- http://about.me/jeremy_schneider
On Wed, Dec 20, 2023 at 5:57 PM Jeff Davis <pgsql@j-davis.com> wrote: > Those locales all have the same ctype behavior. Sigh. I keep getting confused about how that works... -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, 2023-12-20 at 16:29 -0800, Jeremy Schneider wrote: > found some more. here's my running list of everything user-facing I > see > in core PG code so far that might involve case: > > * upper/lower/initcap > * regexp_*() and *_REGEXP() > * ILIKE, operators ~* !~* ~~ !~~ ~~* !~~* > * citext + replace(), split_part(), strpos() and translate() > * full text search - everything is case folded > * unaccent? not clear to me whether CTYPE includes accent folding

No, ctype has nothing to do with accents as far as I can tell. I don't know if I'm using the right terminology, but I think "case" is a variant of a character whereas "accent" is a modifier/mark, and the mark is a separate concept from the character itself.

> * ltree > * pg_trgm > * core PG parser, case folding of relation names

Let's separate it into groups.

(1) Callers that use a collation OID or pg_locale_t:

* collation & hashing
* upper/lower/initcap
* regex, LIKE, formatting
* pg_trgm (which uses regexes)
* maybe postgres_fdw, but might just be a passthrough
* catalog cache (always uses DEFAULT_COLLATION_OID)
* citext (always uses DEFAULT_COLLATION_OID, but probably shouldn't)

(2) A long tail of callers that depend on what LC_CTYPE/LC_COLLATE are set to, or use ad-hoc ASCII-only semantics:

* core SQL parser downcase_identifier()
* callers of pg_strcasecmp() (DDL, etc.)
* GUC name case folding
* full text search ("mylocale = 0 /* TODO */")
* a ton of stuff uses isspace(), isdigit(), etc.
* various callers of tolower()/toupper()
* some selfuncs.c stuff
* ...

Might have missed some places. The user impact of a new builtin provider would affect (1), but only for those actually using the provider. So there's no compatibility risk there, but it's good to understand what it will affect. We can, on a case-by-case basis, also consider using the new APIs I'm proposing for instances of (2). There would be some compatibility risk there for existing callers, and we'd have to consider whether it's worth it or not.
Ideally, new callers would either use the new APIs or use the pg_ascii_* APIs. Regards, Jeff Davis
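The case-vs-accent distinction Jeff describes above (an accent is a separate combining mark, not a property of the letter) is visible directly in Unicode decomposition. An illustrative Python sketch, not tied to any Postgres code:

```python
import unicodedata

# After NFD normalization, the accent shows up as its own code point,
# a combining mark (general category Mn).
decomposed = unicodedata.normalize("NFD", "é")
print([unicodedata.name(c) for c in decomposed])
# ['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']

# unaccent-style folding drops the marks...
stripped = "".join(c for c in decomposed
                   if unicodedata.category(c) != "Mn")
print(stripped)  # e

# ...whereas case mapping (a ctype concern) keeps them:
print("é".upper())  # É
```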
On Wed, 2023-12-20 at 15:47 -0800, Jeremy Schneider wrote: > One other thing that comes to mind: how does the parser do case > folding > for relation names? Is that using OS-provided libc as of today? Or > did > we code it to use ICU if that's the DB default? I'm guessing libc, > and > global catalogs probably need to be handled in a consistent manner, > even > across different encodings.

The code is in downcase_identifier():

    /*
     * SQL99 specifies Unicode-aware case normalization, which we don't
     * yet have the infrastructure for...
     */
    if (ch >= 'A' && ch <= 'Z')
        ch += 'a' - 'A';
    else if (enc_is_single_byte && IS_HIGHBIT_SET(ch) && isupper(ch))
        ch = tolower(ch);
    result[i] = (char) ch;

My proposal would add the infrastructure that the comment above says is missing. It seems like we should be using the database collation at this point because you don't want inconsistency between the catalogs and the parser here. Then again, the SQL spec doesn't seem to support tailoring of case conversions, so maybe we are avoiding it for that reason? Or maybe we're avoiding catalog access? Or perhaps the work for ICU just wasn't done here yet?

> (Kindof related... did you ever see the demo where I create a user > named > '🏃' and then I try to connect to a database with non-unicode > encoding? > 💥😜 ...at least it seems to be able to walk the index without > decoding > strings to find other users - but the way these global catalogs work > scares me a little bit)

I didn't see that specific demo, but in general we seem to change between pg_wchar and unicode code points too freely, so I'm not surprised that something went wrong. Regards, Jeff Davis
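For illustration, here is a hypothetical Python rendering of the downcase_identifier() logic quoted above (the function name and the enc_is_single_byte flag mirror the C code; str.lower() stands in for libc's tolower()):

```python
def downcase_identifier(ident: str, enc_is_single_byte: bool = False) -> str:
    """Sketch of PostgreSQL's ASCII-only identifier folding."""
    out = []
    for ch in ident:
        if "A" <= ch <= "Z":
            # plain ASCII fold, the only folding done in multibyte encodings
            out.append(chr(ord(ch) + (ord("a") - ord("A"))))
        elif enc_is_single_byte and ord(ch) >= 0x80:
            # high-bit characters are folded only in single-byte encodings
            out.append(ch.lower())  # stand-in for libc tolower()
        else:
            out.append(ch)
    return "".join(out)

print(downcase_identifier("MyTable"))  # mytable
print(downcase_identifier("ÜBER"))     # Über  (Ü untouched in UTF-8 databases)
print(downcase_identifier("ÜBER", enc_is_single_byte=True))  # über
```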
Robert Haas wrote: > For someone who is currently defaulting to es_ES.utf8 or fr_FR.utf8, > a change to C.utf8 would be a much bigger problem, I would > think. Their alphabet isn't in code point order, and so things would > be alphabetized wrongly. > That might be OK if they don't care about ordering for any purpose > other than equality lookups, but otherwise it's going to force them > to change the default, where today they don't have to do that. Sure, in whatever collation setup we expose, we need to keep it possible and even easy to sort properly with linguistic rules. But some reasons to use $LANG as the default locale/collation are no longer as good as they used to be, I think. Starting with v10/ICU we have many pre-created ICU locales with fixed names, and starting with v16, we can simply write "ORDER BY textfield COLLATE unicode" which is good enough in most cases. So the configuration "bytewise sort by default" / "linguistic sort on-demand" has become more realistic. By contrast in the pre-v10 days with only libc collations, an application could have no idea which collations were going to be available on the server, and how they were named precisely, as this varies across OSes and across installs even with the same OS. On Windows, I think that before v16 initdb did not create any libc collation beyond C/POSIX and the default language/region of the OS. In that libc context, if a db wants the C locale by default for performance and truly immutable indexes, but the client app needs to occasionally do in-db linguistic sorts, the app needs to figure out which collation name will work for that. This is hard if you don't target a specific installation that guarantees that such or such collation is going to be installed. Whereas if the linguistic locale is the default, the app never needs to know its name or anything about it. So it's done that way, linguistic by default. 
But that leaves databases with many indexes sorted linguistically instead of bytewise for fields that semantically never need any linguistic sort. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
On Wed, 2023-12-20 at 13:49 +0100, Daniel Verite wrote: > > But C.UTF-8 is not available everywhere, and there's still the > problem that Unicode updates through libc are not aligned > with Postgres releases.

Attached is an implementation of a built-in provider for the "C.UTF-8" locale. That way applications (and tests!) can count on C.UTF-8 always being available on any platform; and it also aligns with the Postgres Unicode updates. Documentation is sparse and the patch is a bit rough, but feedback is welcome -- it does have some basic tests which can be used as a guide.

The C.UTF-8 locale, briefly, is a UTF-8 locale that provides simple collation semantics (code point order) but rich ctype semantics (lower/upper/initcap and regexes). This locale is for users who want proper Unicode semantics for character operations (upper/lower, regexes), but don't need a specific natural-language string sort order to apply to all queries and indexes in their system. One might use it as the database default collation, and use COLLATE clauses (i.e. COLLATE UNICODE) where more specific behavior is needed.

The builtin C.UTF-8 locale has the following advantages over using the libc C.UTF-8 locale:

* Collation performance: the builtin provider uses memcmp and abbreviated keys. In libc, these advantages are only available for the C locale.
* Unicode version is aligned with other parts of Postgres, like normalization.
* Available on all platforms with exactly the same semantics.
* Testable and documentable.
* Avoids index corruption risks. In theory libc C.UTF-8 should also have stable collation, but that is not 100% true. In the builtin provider it is 100% stable.

Regards, Jeff Davis
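One property worth noting behind the memcmp-based collation described above: UTF-8 was designed so that bytewise comparison of encoded strings yields exactly code point order, which is what makes a memcmp collation semantically clean rather than arbitrary. A quick Python check (illustrative only, not from the patch):

```python
# Bytewise (memcmp-style) order of UTF-8 encodings equals code point
# order, so a memcmp collation is exactly "sort by code point".
words = ["δ", "Δ", "apple", "Zebra", "é", "e"]

by_bytes = sorted(words, key=lambda s: s.encode("utf-8"))
by_codepoints = sorted(words)  # Python compares strings code point by code point

assert by_bytes == by_codepoints
print(by_codepoints)  # ['Zebra', 'apple', 'e', 'é', 'Δ', 'δ']
```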
On Wed, 2023-12-27 at 17:26 -0800, Jeff Davis wrote: > Attached is an implementation of a built-in provider for the "C.UTF-8"

Attached a more complete version that fixes a few bugs, stabilizes the tests, and improves the documentation. I optimized the performance, too -- now it's beating both libc's "C.utf8" and ICU "en-US-x-icu" for both collation and case mapping (numbers below).

It's really nice to finally be able to have platform-independent tests that work on any UTF-8 database. Simple character classification:

    SELECT 'Á' ~ '[[:alpha:]]' COLLATE C_UTF8;

Case mapping is more interesting (note that accented characters are being properly mapped, and it's using the titlecase variant "Dž"):

    SELECT initcap('axxE áxxÉ DŽxxDŽ Džxxx džxxx' COLLATE C_UTF8);
            initcap
    --------------------------
     Axxe Áxxé Džxxdž Džxxx Džxxx

Even more interesting -- test that non-latin characters can still be a member of a case-insensitive range:

    -- capital delta is member of lowercase range gamma to lambda
    SELECT 'Δ' ~* '[γ-λ]' COLLATE C_UTF8;

    -- small delta is member of uppercase range gamma to lambda
    SELECT 'δ' ~* '[Γ-Λ]' COLLATE C_UTF8;

Moreover, a lot of this behavior is locked in by strong Unicode guarantees like [1] and [2]. Behavior that can change probably won't change very often, and in any case will be tied to a PG major version. All of these behaviors are very close to what glibc "C.utf8" does on my machine. The case transformations are identical (except titlecasing because libc doesn't support it). The character classifications have some differences, which might be worth discussing, but I didn't see anything terribly concerning (I am following the unicode recommendations[3] on this topic).
Performance:

Sorting 10M strings:

    libc "C"            14s
    builtin C_UTF8      14s
    libc "C.utf8"       20s
    ICU "en-US-x-icu"   31s

Running UPPER() on 10M strings:

    libc "C"            03s
    builtin C_UTF8      07s
    libc "C.utf8"       08s
    ICU "en-US-x-icu"   15s

I didn't investigate or optimize regexes / pattern matching yet, but I can do similar optimizations if there's any gap. Note that I implemented the "simple" case mapping (which is what glibc does) and the "posix compatible"[3] flavor of character classification (which is closer to what glibc does than the "standard" flavor). I opted to use title case mapping for initcap(), which is a difference from libc and I may go back to just upper/lower. These seem like reasonable choices if we're going to name the locale after C.UTF-8.

Regards, Jeff Davis

[1] https://www.unicode.org/policies/stability_policy.html#Case_Pair
[2] https://www.unicode.org/policies/stability_policy.html#Identity
[3] http://www.unicode.org/reports/tr18/#Compatibility_Properties
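The titlecase and case-insensitive range behavior shown in the examples above can be cross-checked against Python's Unicode tables (illustrative; Python's Unicode version may differ from the one the patch uses):

```python
import re

# DŽ (U+01C4), Dž (U+01C5) and dž (U+01C6) form an upper/title/lower
# triple: upper() applies the uppercase mapping, title() the titlecase
# mapping -- the distinction visible in the initcap() output.
dz = "\u01c6"                  # dž
print(hex(ord(dz.upper())))    # 0x1c4  (DŽ)
print(hex(ord(dz.title())))    # 0x1c5  (Dž)

# Case-insensitive range membership crosses case, like the '~*' regex
# examples: Δ case-folds to δ, which lies inside γ..λ.
print(bool(re.fullmatch("[γ-λ]", "Δ", re.IGNORECASE)))  # True
print(bool(re.fullmatch("[Γ-Λ]", "δ", re.IGNORECASE)))  # True
```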
On 12/28/23 6:57 PM, Jeff Davis wrote: > > Attached a more complete version that fixes a few bugs, stabilizes the > tests, and improves the documentation. I optimized the performance, too > -- now it's beating both libc's "C.utf8" and ICU "en-US-x-icu" for both > collation and case mapping (numbers below). > > It's really nice to finally be able to have platform-independent tests > that work on any UTF-8 database. Thanks for all your work on this, Jeff I didn't know about the Unicode stability policy. Since it's formal policy, I agree this provides some assumptions we can safely build on. I'm working my way through these patches but it's taking a little time for me. I hadn't tracked with the "builtin" thread last summer so I'm coming up to speed on that now too. I'm hopeful that something along these lines gets into pg17. The pg17 cycle is going to start heating up pretty soon. I agree with merging the threads, even though it makes for a larger patch set. It would be great to get a unified "builtin" provider in place for the next major. I also still want to parse my way through your email reply about the two groups of callers, and what this means for user experience. https://www.postgresql.org/message-id/7774b3a64f51b3375060c29871cf2b02b3e85dab.camel%40j-davis.com > Let's separate it into groups. > (1) Callers that use a collation OID or pg_locale_t: > (2) A long tail of callers that depend on what LC_CTYPE/LC_COLLATE are > set to, or use ad-hoc ASCII-only semantics: In the first list it seems that some callers might be influenced by a COLLATE clause or table definition while others always take the database default? It still seems a bit odd to me if different providers can be used for different parts of a single SQL. But it might not be so bad - I haven't fully thought through it yet and I'm still kicking the tires on my test build over here. Is there any reason we couldn't commit the minor cleanup (patch 0001) now? 
It's less than 200 lines and pretty straightforward. I wonder whether, after a year of running the builtin provider in production, we might consider adding to the builtin provider a few locales with simple but more reasonable ordering for european and asian languages? Maybe just grabbing a current version of DUCET with no intention of ever updating it? I don't know how bad sorting is with plain DUCET for things like french or spanish or german, but surely it's not as unusable as code point order? Anyone who needs truly accurate or updated or customized linguistic sorting can always go to ICU, and take responsibility for their ICU upgrades, but something basic and static might meet the needs of 99% of postgres users indefinitely. By the way - my coworker Josh (who I don't think posts much on the hackers list here, but shares an unhealthy inability to look away from database unicode disasters) passed along this link today which I think is a fantastic list of surprises about programming and strings (generally unicode). https://jeremyhussell.blogspot.com/2017/11/falsehoods-programmers-believe-about.html#main Make sure to click the link to show the counterexamples and discussion, that's the best part. -Jeremy PS. I was joking around today that the second best part is that it's proof that people named Jeremy are always brilliant within their field. 😂 Josh said it's just a subset of "always trust people whose names start with J" which seems fair. Unfortunately I can't yet think of a way to shoehorn the rest of the amazing PG hackers on this thread into the joke. -- http://about.me/jeremy_schneider
On 12/28/23 6:57 PM, Jeff Davis wrote: > On Wed, 2023-12-27 at 17:26 -0800, Jeff Davis wrote: > Attached a more complete version that fixes a few bugs, stabilizes the > tests, and improves the documentation. I optimized the performance, too > -- now it's beating both libc's "C.utf8" and ICU "en-US-x-icu" for both > collation and case mapping (numbers below). > > It's really nice to finally be able to have platform-independent tests > that work on any UTF-8 database.

I think we missed something in psql, pretty sure I applied all the patches but I see this error:

    =# \l
    ERROR: 42703: column d.datlocale does not exist
    LINE 8: d.datlocale as "Locale",
            ^
    HINT: Perhaps you meant to reference the column "d.daticulocale".
    LOCATION: errorMissingColumn, parse_relation.c:3720

=====

This is interesting. Jeff your original email didn't explicitly show any other initcap() results, but on Ubuntu 22.04 (glibc 2.35) I see different results:

    =# SELECT initcap('axxE áxxÉ DŽxxDŽ Džxxx džxxx');
            initcap
    --------------------------
     Axxe Áxxé DŽxxdž DŽxxx DŽxxx

    =# SELECT initcap('axxE áxxÉ DŽxxDŽ Džxxx džxxx' COLLATE C_UTF8);
            initcap
    --------------------------
     Axxe Áxxé Džxxdž Džxxx Džxxx

The COLLATE sql syntax feels awkward to me. In this example, we're just using it to attach locale info to the string, and there's not actually any collation involved here. Not sure if COLLATE comes from the standard, and even if it does I'm not sure whether the standard had upper/lowercase in mind. That said, I think the thing that mainly matters will be the CREATE DATABASE syntax and the database default. I want to try a few things with table-level defaults that differ from database-level defaults, especially table-level ICU defaults because I think a number of PostgreSQL users set that up in the years before we supported DB-level ICU. Some people will probably keep using their old/existing schema-creation scripts even after they begin provisioning new systems with new database-level defaults.
-Jeremy -- http://about.me/jeremy_schneider
On Tue, 2024-01-09 at 14:17 -0800, Jeremy Schneider wrote: > I think we missed something in psql, pretty sure I applied all the > patches but I see this error: > > =# \l > ERROR: 42703: column d.datlocale does not exist > LINE 8: d.datlocale as "Locale", > Thank you. I'll fix this in the next patch set. > This is interesting. Jeff your original email didn't explicitly show > any > other initcap() results, but on Ubuntu 22.04 (glibc 2.35) I see > different results: > > =# SELECT initcap('axxE áxxÉ DŽxxDŽ Džxxx džxxx'); > initcap > -------------------------- > Axxe Áxxé DŽxxdž DŽxxx DŽxxx > > =# SELECT initcap('axxE áxxÉ DŽxxDŽ Džxxx džxxx' COLLATE C_UTF8); > initcap > -------------------------- > Axxe Áxxé Džxxdž Džxxx Džxxx The reason for this difference is that libc doesn't support titlecase. In the next patch set, I'll not use titlecase (only uppercase/lowercase even for initcap()), and then it will match libc 100%. > The COLLATE sql syntax feels awkward to me. In this example, we're > just > using it to attach locale info to the string, and there's not > actually > any collation involved here. Not sure if COLLATE comes from the > standard, and even if it does I'm not sure whether the standard had > upper/lowercase in mind. The standard doesn't use the COLLATE clause for case mapping, but also doesn't offer any other mechanism to, e.g., get case mapping according to the "tr_TR" locale. I think what Postgres does here, re-purposing the COLLATE clause to allow tailoring of case mapping, is imperfect but reasonable. My proposal doesn't change that. Regards, Jeff Davis
On Mon, 2024-01-08 at 17:17 -0800, Jeremy Schneider wrote: > I agree with merging the threads, even though it makes for a larger > patch set. It would be great to get a unified "builtin" provider in > place for the next major. I believe that's possible and that this proposal is quite close (hoping to get something in this 'fest). The tables I'm introducing have exhaustive test coverage, so there's not a lot of risk there. And the builtin provider itself is an optional feature, so it won't be disruptive. > > In the first list it seems that some callers might be influenced by a > COLLATE clause or table definition while others always take the > database > default? It still seems a bit odd to me if different providers can be > used for different parts of a single SQL. Right, that can happen today, and my proposal doesn't change that. Basically those are cases where the caller was never properly onboarded to our collation system, like the ts_locale.c routines. > Is there any reason we couldn't commit the minor cleanup (patch 0001) > now? It's less than 200 lines and pretty straightforward. Sure, I'll commit that fairly soon then. > I wonder if, after a year of running the builtin provider in > production, > whether we might consider adding to the builtin provider a few > locales > with simple but more reasonable ordering for european and asian > languages? I won't rule that out completely, but there's a lot we would need to do to get there. Even assuming we implement that perfectly, we'd need to make sure it's a reasonable scope for Postgres as a project and that we have more than one person willing to maintain it. Similar things have been rejected before for similar reasons. What I'm proposing for v17 is much simpler: basically some lookup tables, which is just an extension of what we're already doing for normalization. 
> https://jeremyhussell.blogspot.com/2017/11/falsehoods-programmers-believe-about.html#main > > Make sure to click the link to show the counterexamples and > discussion, > that's the best part. Yes, it can be hard to reason about this stuff but I believe Unicode provides a lot of good data and guidance to work from. If you think my proposal relies on one of those assumptions let me know. To the extent that I do rely on any of those assumptions, it's mostly to match libc's "C.UTF-8" behavior. Regards, Jeff Davis
On 1/9/24 2:31 PM, Jeff Davis wrote: > On Tue, 2024-01-09 at 14:17 -0800, Jeremy Schneider wrote: >> I think we missed something in psql, pretty sure I applied all the >> patches but I see this error: >> >> =# \l >> ERROR: 42703: column d.datlocale does not exist >> LINE 8: d.datlocale as "Locale", >> > > Thank you. I'll fix this in the next patch set.

Very minor nitpick/request. Not directly with this patch set but with the category_test which is related and also recently committed. I'm testing out "make update-unicode" scripts, and due to my system ICU version being different from the PostgreSQL unicode version I get the expected warnings from category_test:

    Postgres Unicode Version: 15.1
    ICU Unicode Version: 14.0
    Skipped 5116 codepoints unassigned in ICU due to Unicode version mismatch.
    category_test: All tests successful!

I know it's minor, but I saw the 5116 skipped codepoints and saw "all tests succeeded" but I thought the output might be a little nicer if we showed the count of tests that succeeded. For example:

    Postgres Unicode Version: 15.1
    ICU Unicode Version: 14.0
    Skipped 5116 codepoints unassigned in ICU due to Unicode version mismatch.
    category_test: All 1108996 tests successful!

It's a minor tweak to a script that I don't think even runs in the standard build; any objections? Patch attached.

-Jeremy -- http://about.me/jeremy_schneider
Jeff Davis wrote: > Attached a more complete version that fixes a few bugs [v15 patch]

When selecting the builtin provider with initdb, I'm getting the following setup:

    $ bin/initdb --locale=C.UTF-8 --locale-provider=builtin -D/tmp/pgdata

    The database cluster will be initialized with this locale configuration:
      default collation provider:  builtin
      default collation locale:    C.UTF-8
      LC_COLLATE:  C.UTF-8
      LC_CTYPE:    C.UTF-8
      LC_MESSAGES: C.UTF-8
      LC_MONETARY: C.UTF-8
      LC_NUMERIC:  C.UTF-8
      LC_TIME:     C.UTF-8
    The default database encoding has accordingly been set to "UTF8".
    The default text search configuration will be set to "english".

This is from an environment where LANG=fr_FR.UTF-8

I would expect all LC_* variables to be fr_FR.UTF-8, and the default text search configuration to be "french". It is what happens when selecting ICU as the provider in the same environment:

    $ bin/initdb --icu-locale=en --locale-provider=icu -D/tmp/pgdata

    Using language tag "en" for ICU locale "en".
    The database cluster will be initialized with this locale configuration:
      default collation provider:  icu
      default collation locale:    en
      LC_COLLATE:  fr_FR.UTF-8
      LC_CTYPE:    fr_FR.UTF-8
      LC_MESSAGES: fr_FR.UTF-8
      LC_MONETARY: fr_FR.UTF-8
      LC_NUMERIC:  fr_FR.UTF-8
      LC_TIME:     fr_FR.UTF-8
    The default database encoding has accordingly been set to "UTF8".
    The default text search configuration will be set to "french".

The collation setup does not influence the rest of the localization. The problem AFAIU is that --locale has two distinct meanings in the v15 patch: --locale-provider=X --locale=Y means use "X" as the provider with "Y" as datlocale, and it means use "Y" as the locale for all localized libc functionalities. I wonder what would happen if invoking bin/initdb --locale=C.UTF-8 --locale-provider=builtin -D/tmp/pgdata on a system where C.UTF-8 does not exist as a libc locale. Would it fail? (I don't have an OS like this to test ATM, will try later).
A related comment is about naming the builtin locale C.UTF-8, the same name as in libc. On one hand this is semantically sound, but on the other hand, it's likely to confuse people. What about using completely different names, like "pg_unicode" or something else prefixed by "pg_" both for the locale name and the collation name (currently C.UTF-8/c_utf8)? Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
On Wed, 2024-01-10 at 23:56 +0100, Daniel Verite wrote: > $ bin/initdb --locale=C.UTF-8 --locale-provider=builtin -D/tmp/pgdata > > The database cluster will be initialized with this locale > configuration: > default collation provider: builtin > default collation locale: C.UTF-8 > LC_COLLATE: C.UTF-8 > LC_CTYPE: C.UTF-8 > LC_MESSAGES: C.UTF-8 > LC_MONETARY: C.UTF-8 > LC_NUMERIC: C.UTF-8 > LC_TIME: C.UTF-8 > The default database encoding has accordingly been set to "UTF8". > The default text search configuration will be set to "english". > > This is from an environment where LANG=fr_FR.UTF-8 > > I would expect all LC_* variables to be fr_FR.UTF-8, and the default > text search configuration to be "french". You can get the behavior you want by doing: initdb --builtin-locale=C.UTF-8 --locale-provider=builtin \ -D/tmp/pgdata where "--builtin-locale" is analogous to "--icu-locale". It looks like I forgot to document the new initdb option, which seems to be the source of the confusion. Sorry, I'll fix that in the next patch set. (See examples in the initdb tests.) I think this answers some of your follow-up questions as well. > A related comment is about naming the builtin locale C.UTF-8, the > same > name as in libc. On one hand this is semantically sound, but on the > other hand, it's likely to confuse people. What about using > completely > different names, like "pg_unicode" or something else prefixed by > "pg_" > both for the locale name and the collation name (currently > C.UTF-8/c_utf8)? I'm flexible on naming, but here are my thoughts: * A "pg_" prefix makes sense. * If we named it something like "pg_unicode" someone might expect it to sort using the root collation. * The locale name "C.UTF-8" is nice because it implies things about both the collation and the character behavior. It's also nice because on at least some platforms, the behavior is almost identical to the libc locale of the same name. 
* UCS_BASIC might be a good name, because it also seems to carry the right meanings, but that name is already taken. * We might also want to support variations, such as full case mapping (which uppercases "ß" to "SS", as the SQL standard requires), or perhaps the "standard" flavor of regexes (which don't count all symbols as punctuation). Leaving some room to name those variations would be a good idea. Regards, Jeff Davis
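The "full case mapping" variation mentioned here is the kind of mapping where the result can be longer than the input. As an illustration only (Python's str.upper() happens to implement full case mapping per SpecialCasing.txt, so it's a convenient way to see the difference; this is not the proposed Postgres code):

```python
# Full case mapping (SpecialCasing.txt) can change the length of a string:
# "ß" (U+00DF) uppercases to "SS", as the SQL standard requires.
print("straße".upper())  # STRASSE

# A simple, one-to-one mapping (UnicodeData.txt only) would leave "ß"
# unchanged, because its simple uppercase field there is empty -- which
# is why simple mappings keep the string length (in codepoints) stable.
assert len("straße") == 6 and len("STRASSE") == 7
```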
On Tue, 2024-01-09 at 14:17 -0800, Jeremy Schneider wrote: > I think we missed something in psql, pretty sure I applied all the > patches but I see this error: > > =# \l > ERROR: 42703: column d.datlocale does not exist > LINE 8: d.datlocale as "Locale", > ^ > HINT: Perhaps you meant to reference the column "d.daticulocale". > LOCATION: errorMissingColumn, parse_relation.c:3720 I think you're connecting to a patched server with an older version of psql, so it doesn't know the catalog column was renamed. pg_dump and pg_upgrade don't have that problem because they throw an error when connecting to a newer server. But for psql, it's perfectly reasonable to connect to a newer server. Have we dropped or renamed catalog columns used by psql backslash commands before, and if so, how do we handle that? I can just not rename that column, but that's a bit frustrating because it means I'd need to add a new column to pg_database, which seems redundant. Regards, Jeff Davis
On Wed, 2024-01-10 at 23:56 +0100, Daniel Verite wrote: > A related comment is about naming the builtin locale C.UTF-8, the > same > name as in libc. On one hand this is semantically sound, but on the > other hand, it's likely to confuse people. What about using > completely > different names, like "pg_unicode" or something else prefixed by > "pg_" > both for the locale name and the collation name (currently > C.UTF-8/c_utf8)? New version attached. Changes: * Named collation object PG_C_UTF8, which seems like a good idea to prevent name conflicts with existing collations. The locale name is still C.UTF-8, which still makes sense to me because it matches the behavior of the libc locale of the same name so closely. * Added missing documentation for initdb --builtin-locale * Refactored the upper/lower/initcap implementations * Improved tests for case conversions where the byte length of the UTF8-encoded string changes (the string length doesn't change because we don't do full case mapping). * No longer uses titlecase mappings -- libc doesn't do that, so it was an unnecessary difference in case mapping behavior. * Improved test report per Jeremy's suggestion: now it reports the number of codepoints tested. Jeremy also raised a problem with old versions of psql connecting to a new server: the \l and \dO won't work. Not sure exactly what to do there, but I could work around it by adding a new field rather than renaming (though that's not ideal). Regards, Jeff Davis
Attachment
On Thu, 2024-01-11 at 18:02 -0800, Jeff Davis wrote: > Jeremy also raised a problem with old versions of psql connecting to > a > new server: the \l and \dO won't work. Not sure exactly what to do > there, but I could work around it by adding a new field rather than > renaming (though that's not ideal). I did a bit of research for a precedent, and the closest I could find is that \dp was broken between 14 and 15 by commit 07eee5a0dc. That is some precedent, but it's more narrow. I think that might justify breaking \dO in older clients, but \l is used frequently. It seems safer to just introduce new columns "datbuiltinlocale" and "collbuiltinlocale". They'll be nullable anyway. If we want to clean this up we can do so as a separate commit. There are other parts of the catalog representation related to collation that we might want to clean up as well. Regards, Jeff Davis
Jeff Davis wrote: > > Jeremy also raised a problem with old versions of psql connecting to > > a > > new server: the \l and \dO won't work. Not sure exactly what to do > > there, but I could work around it by adding a new field rather than > > renaming (though that's not ideal). > > I did a bit of research for a precedent, and the closest I could find > is that \dp was broken between 14 and 15 by commit 07eee5a0dc. Another one is that version 12 broke \d in older psql by removing pg_class.relhasoids. ISTM that in general the behavior of old psql vs new server does not weigh much against choosing optimal catalog changes. There's also that warning at startup informing users about it: WARNING: psql major version X, server major version Y. Some psql features might not work. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
On Fri, Jan 12, 2024 at 1:00 PM Daniel Verite <daniel@manitou-mail.org> wrote: > ISTM that in general the behavior of old psql vs new server does > not weight much against choosing optimal catalog changes. +1. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, 2024-01-12 at 19:00 +0100, Daniel Verite wrote: > Another one is that version 12 broke \d in older psql by > removing pg_class.relhasoids. > > ISTM that in general the behavior of old psql vs new server does > not weight much against choosing optimal catalog changes. > > There's also that warning at start informing users about it: > WARNING: psql major version X, server major version Y. > Some psql features might not work. Good point, I'll leave it as-is then. If someone complains I can rework it. Also, the output of \l changes from version to version, so if there are automated tools processing the output then they'd have to change anyway. Regards, Jeff Davis
On Fri, Jan 12, 2024 at 01:13:04PM -0500, Robert Haas wrote: > On Fri, Jan 12, 2024 at 1:00 PM Daniel Verite <daniel@manitou-mail.org> wrote: >> ISTM that in general the behavior of old psql vs new server does >> not weight much against choosing optimal catalog changes. > > +1. +1. There is a good amount of effort put in maintaining downward compatibility in psql. Upward compatibility would require more manipulations of the stable branches to make older versions of psql compatible with newer server versions. Brr. -- Michael
Attachment
Jeff Davis wrote: > New version attached. [v16] Concerning the target category_test, it produces failures with versions of ICU with Unicode < 15. The first one I see with Ubuntu 22.04 (ICU 70.1) is: category_test: Postgres Unicode version: 15.1 category_test: ICU Unicode version: 14.0 category_test: FAILURE for codepoint 0x000c04 category_test: Postgres property alphabetic/lowercase/uppercase/white_space/hex_digit/join_control: 1/0/0/0/0/0 category_test: ICU property alphabetic/lowercase/uppercase/white_space/hex_digit/join_control: 0/0/0/0/0/0 U+0C04 is a codepoint added in Unicode 11. https://en.wikipedia.org/wiki/Telugu_(Unicode_block) In UnicodeData.txt: 0C04;TELUGU SIGN COMBINING ANUSVARA ABOVE;Mn;0;NSM;;;;;N;;;;; In Unicode 15, it has been assigned Other_Alphabetic in PropList.txt $ grep 0C04 PropList.txt 0C04 ; Other_Alphabetic # Mn TELUGU SIGN COMBINING ANUSVARA ABOVE But in Unicode 14 it was not there. As a result its binary property UCHAR_ALPHABETIC has changed from false to true in ICU 72 vs previous versions. As I understand it, the stability policy says that such things happen. From https://www.unicode.org/policies/stability_policy.html Once a character is encoded, its properties may still be changed, but not in such a way as to change the fundamental identity of the character. The Consortium will endeavor to keep the values of the other properties as stable as possible, but some circumstances may arise that require changing them. Particularly in the situation where the Unicode Standard first encodes less well-documented characters and scripts, the exact character properties and behavior initially may not be well known. As more experience is gathered in implementing the characters, adjustments in the properties may become necessary. Examples of such properties include, but are not limited to, the following: - General_Category - Case mappings - Bidirectional properties [...] 
I've commented out the exit(1) in category_test to collect all errors, and built it with versions of ICU from 74 down to 60 (that is Unicode 10.0). Results are attached. As expected, the older the ICU version, the more differences are found against Unicode 15.1. I find these results interesting because they tell us what contents can break regexp-based check constraints on upgrades. But about category_test as a pass-or-fail kind of test, it can only be used when the Unicode version in ICU is the same as in Postgres. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
Attachment
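The U+0C04 property drift discussed above can be probed against whatever Unicode version a local runtime ships; for example, with Python's unicodedata module (only the General_Category and Unicode version are shown here, since the module doesn't expose the derived Alphabetic binary property directly):

```python
import unicodedata

cp = "\u0c04"  # TELUGU SIGN COMBINING ANUSVARA ABOVE, added in Unicode 11

# Its General_Category has stayed Mn throughout; what changed in Unicode 15
# was the derived Alphabetic property (via Other_Alphabetic in PropList.txt),
# which is what made UCHAR_ALPHABETIC flip in ICU 72.
print(unicodedata.name(cp))      # TELUGU SIGN COMBINING ANUSVARA ABOVE
print(unicodedata.category(cp))  # Mn

# Which Unicode version the check ran against matters, per the failure above:
print(unicodedata.unidata_version)
```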
On Mon, 2024-01-15 at 15:30 +0100, Daniel Verite wrote: > Concerning the target category_test, it produces failures with > versions of ICU with Unicode < 15. The first one I see with Ubuntu > 22.04 (ICU 70.1) is: ... > I find these results interesting because they tell us what contents > can break regexp-based check constraints on upgrades. Thank you for collecting and consolidating this information. > But about category_test as a pass-or-fail kind of test, it can only > be > used when the Unicode version in ICU is the same as in Postgres. The test has a few potential purposes: 1. To see if there is some error in parsing the Unicode files and building the arrays in the .h file. For instance, let's say the perl parser I wrote works fine on the Unicode 15.1 data file, but does something wrong on the 16.0 data file: the test would fail and we'd investigate. This is the most important reason for the test. 2. To notice any quirks between how we interpret Unicode vs how ICU does. 3. To help see "interesting" differences between different Unicode versions. For #1 and #2, the best way to test is by using a version of ICU that uses the same Unicode version as Postgres. The one running update-unicode can try to recompile with the right one for the purposes of the test. NB: There might be no version of ICU where the Unicode version exactly matches what we'd like to update to. In that case, we'd need to use the closest version and do some manual validation that the generated tables are sane. For #3, that is also interesting information to know about, but it's not directly actionable. As you point out, Unicode does not guarantee that these properties are static forever, so regexes can change behavior when we update Unicode for the next PG version. That is a much lower risk than a collation change, but as you point out, is a risk for regexes inside of a CHECK constraint. If a user needs zero risk of semantic changes for regexes, the only option is "C". 
Perhaps there should be a separate test target for this mode so that it doesn't exit early? (Note: case mapping has much stronger guarantees than the character classification.) I will update the README to document how someone running update-unicode should interpret the test results. Regards, Jeff Davis
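The version-matching caveat behind purposes #1-#3 can be sketched as a simple gate. In this sketch, PG_UNICODE_VERSION is a hypothetical stand-in for the version the static tables were generated from, and Python's unicodedata.unidata_version plays the role of ICU's u_getUnicodeVersion():

```python
import unicodedata

# Hypothetical: the Unicode version Postgres's static tables were built from.
PG_UNICODE_VERSION = "15.1"

def comparison_mode() -> str:
    # unicodedata reports the Unicode version it was built against,
    # analogous to querying u_getUnicodeVersion() from ICU.
    runtime_version = unicodedata.unidata_version
    if runtime_version == PG_UNICODE_VERSION:
        # Purposes #1 and #2: any difference is a real bug -- fail hard.
        return "pass-or-fail"
    # Purpose #3: differences are expected and merely informational.
    return "report-only"

print(comparison_mode())
```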
On 12.01.24 03:02, Jeff Davis wrote: > New version attached. Changes: > > * Named collation object PG_C_UTF8, which seems like a good idea to > prevent name conflicts with existing collations. The locale name is > still C.UTF-8, which still makes sense to me because it matches the > behavior of the libc locale of the same name so closely. I am catching up on this thread. The discussions have been very complicated, so maybe I didn't get it all. The patches look pretty sound, but I'm questioning how useful this feature is and where you plan to take it. Earlier in the thread, the aim was summarized as > If the Postgres default was bytewise sorting+locale-agnostic > ctype functions directly derived from Unicode data files, > as opposed to libc/$LANG at initdb time, the main > annoyance would be that "ORDER BY textcol" would no > longer be the human-favored sort. I think that would be a terrible direction to take, because it would regress the default sort order from "correct" to "useless". Aside from the overall message this sends about how PostgreSQL cares about locales and Unicode and such. Maybe you don't intend for this to be the default provider? But then who would really use it? I mean, sure, some people would, but how would you even explain, in practice, the particular niche of users or use cases? Maybe if this new provider would be called "minimal", it might describe the purpose better. I could see a use for this builtin provider if it also included the default UCA collation (what COLLATE UNICODE does now). Then it would provide a "common" default behavior out of the box, and if you want more fine-tuning, you can go to ICU. There would still be some questions about making sure the builtin behavior and the ICU behavior are consistent (different Unicode versions, stock UCA vs CLDR, etc.). But for practical purposes, it might work. 
There would still be a risk with that approach, since it would permanently marginalize ICU functionality, in the sense that only some locales would need ICU, and so we might not pay the same amount of attention to the ICU functionality. I would be curious what your overall vision is here? Is switching the default to ICU still your goal? Or do you want the builtin provider to be the default? Or something else?
Peter Eisentraut wrote: > > If the Postgres default was bytewise sorting+locale-agnostic > > ctype functions directly derived from Unicode data files, > > as opposed to libc/$LANG at initdb time, the main > > annoyance would be that "ORDER BY textcol" would no > > longer be the human-favored sort. > > I think that would be a terrible direction to take, because it would > regress the default sort order from "correct" to "useless". Aside from > the overall message this sends about how PostgreSQL cares about > locales and Unicode and such. Well, offering a viable solution to avoid as much as possible the dreaded: "WARNING: collation "xyz" has version mismatch ... HINT: Rebuild all objects affected by this collation..." that doesn't sound like a bad message to send. Currently, to have in codepoint order the indexes that don't need a linguistic order, you're supposed to use collate "C", which then means that upper(), lower() etc.. don't work beyond ASCII. Here our Unicode support is not good enough, and the proposal addresses that. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
On Thu, 2024-01-18 at 13:53 +0100, Peter Eisentraut wrote: > I think that would be a terrible direction to take, because it would > regress the default sort order from "correct" to "useless". I don't agree that the current default is "correct". There are a lot of ways it can be wrong: * the environment variables at initdb time don't reflect what the users of the database actually want * there are so many different users using so many different applications connected to the database that no one "correct" sort order exists * libc has some implementation quirks * the version of Unicode that libc is based on is not what you expect * the version of libc is not what you expect > Aside from > the overall message this sends about how PostgreSQL cares about > locales > and Unicode and such. Unicode is primarily about the semantics of characters and their relationships. The patches I propose here do a great job of that. Collation (relationships between *strings*) is a part of Unicode, but not the whole thing or even the main thing. > Maybe you don't intend for this to be the default provider? I am not proposing that this provider be the initdb-time default. > But then > who would really use it? I mean, sure, some people would, but how > would > you even explain, in practice, the particular niche of users or use > cases? It's for users who want to respect Unicode and support text from international sources in their database, but are not experts on the subject and don't know precisely what they want or understand the consequences. 
I want to communicate all of those things. > I could see a use for this builtin provider if it also included the > default UCA collation (what COLLATE UNICODE does now). I won't rule that out, but I'm not proposing that right now and my proposal is already offering useful functionality. > There would still be a risk with that approach, since it would > permanently marginalize ICU functionality Yeah, ICU already does a good job offering the root collation. I don't think the builtin provider needs to do so. > I would be curious what your overall vision is here? Vision: * The builtin provider will offer Unicode character semantics, basic collation, platform-independence, and high performance. It can be used on its own or in combination with ICU via the COLLATE clause. * ICU offers COLLATE UNICODE, locale tailoring, case-insensitive matching, and customization with rules. It's the solution for everything from "slightly more advanced" to "very advanced". * libc would be for databases serving applications on the same machine where a matching sort order is helpful, risks to indexes are acceptable, and performance is not important. > Is switching the > default to ICU still your goal? Or do you want the builtin provider > to > be the default? It's hard to answer this question while initdb chooses the database default collation based on the environment. Neither ICU nor the builtin provider can reasonably handle whatever those environment variables might be set to. Stepping back from the focus on what initdb does, we should be providing the right encouragement in documentation and packaging to guide users toward the right provider based on their needs and the vision outlined above. Regards, Jeff Davis
On 18.01.24 23:03, Jeff Davis wrote: > On Thu, 2024-01-18 at 13:53 +0100, Peter Eisentraut wrote: >> I think that would be a terrible direction to take, because it would >> regress the default sort order from "correct" to "useless". > > I don't agree that the current default is "correct". There are a lot of > ways it can be wrong: > > * the environment variables at initdb time don't reflect what the > users of the database actually want > * there are so many different users using so many different > applications connected to the database that no one "correct" sort order > exists > * libc has some implementation quirks > * the version of Unicode that libc is based on is not what you expect > * the version of libc is not what you expect These are arguments why the current defaults are not universally perfect, but I'd argue that they are still most often the right thing as the default. >> Aside from >> the overall message this sends about how PostgreSQL cares about >> locales >> and Unicode and such. > > Unicode is primarily about the semantics of characters and their > relationships. The patches I propose here do a great job of that. > > Collation (relationships between *strings*) is a part of Unicode, but > not the whole thing or even the main thing. I don't get this argument. Of course, people care about sorting and sort order. Whether you consider this part of Unicode or adjacent to it, people still want it. >> Maybe you don't intend for this to be the default provider? > > I am not proposing that this provider be the initdb-time default. ok >> But then >> who would really use it? I mean, sure, some people would, but how >> would >> you even explain, in practice, the particular niche of users or use >> cases? > > It's for users who want to respect Unicode support text from > international sources in their database; but are not experts on the > subject and don't know precisely what they want or understand the > consequences. 
If and when such users do notice a problem with the sort > order, they'd handle it at that time (perhaps with a COLLATE clause, or > sorting in the application). > Vision: > * ICU offers COLLATE UNICODE, locale tailoring, case-insensitive > matching, and customization with rules. It's the solution for > everything from "slightly more advanced" to "very advanced". I am astonished by this. In your world, do users not want their text data sorted? Do they not care what the sort order is? You consider UCA sort order an "advanced" feature?
On Mon, 2024-01-22 at 19:49 +0100, Peter Eisentraut wrote: > > > I don't get this argument. Of course, people care about sorting and > sort order. Whether you consider this part of Unicode or adjacent to > it, people still want it. You said that my proposal sends a message that we somehow don't care about Unicode, and I strongly disagree. The built-in provider I'm proposing does implement Unicode semantics. Surely a database that offers UCS_BASIC (a SQL spec feature) isn't sending a message that it doesn't care about Unicode, and neither is my proposal. > > > > * ICU offers COLLATE UNICODE, locale tailoring, case-insensitive > > matching, and customization with rules. It's the solution for > > everything from "slightly more advanced" to "very advanced". > > I am astonished by this. In your world, do users not want their text > data sorted? Do they not care what the sort order is? I obviously care about Unicode and collation. I've put a lot of effort recently into contributions in this area, and I wouldn't have done that if I thought users didn't care. You've made much greater contributions and I thank you for that. The logical conclusion of your line of argument would be that libc's "C.UTF-8" locale and UCS_BASIC simply should not exist. But they do exist, and for good reason. One of those good reasons is that only *human* users care about the human-friendliness of sort order. If Postgres is just feeding the results to another system -- or an application layer that re-sorts the data anyway -- then stability, performance, and interoperability matter more than human-friendliness. (Though Unicode character semantics are still useful even when the data is not going directly to a human.) > You consider UCA > sort order an "advanced" feature? I said "slightly more advanced" compared with "basic". "Advanced" can be taken in either a positive way ("more useful") or a negative way ("complex"). 
I'm sorry for the misunderstanding, but my point was this: * The builtin provider is for people who are fine with code point order and no tailoring, but want Unicode character semantics, collation stability, and performance. * ICU is the right solution for anyone who wants human-friendly collation or tailoring, and is willing to put up with some collation stability risk and lower collation performance. Both have their place and the user is free to mix and match as needed, thanks to the COLLATE clause for columns and queries. Regards, Jeff Davis
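The division of labor described above is easy to see with a toy sort: plain string comparison in Python is code point order, which is what the builtin provider's memcmp-style collation gives for UTF-8 (the word list here is illustrative only):

```python
words = ["zebra", "Apple", "Ärger", "banana"]

# Code point order: uppercase ASCII letters (U+0041..) sort before
# lowercase ones (U+0061..), and "Ä" (U+00C4) sorts after "z" (U+007A).
# Stable and fast, but not a human dictionary ordering.
print(sorted(words))  # ['Apple', 'banana', 'zebra', 'Ärger']

# A linguistic order (ICU / COLLATE UNICODE) would interleave case and
# place "Ärger" near "Apple" -- that's the "slightly more advanced" tier.
```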
Review of the v16 patch set: (Btw., I suppose you started this patch series with 0002 because some 0001 was committed earlier. But I have found this rather confusing. I think it's ok to renumber from 0001 for each new version.) * v16-0002-Add-Unicode-property-tables.patch Various comments are updated to include the term "character class". I don't recognize that as an official Unicode term. There are categories and properties. Let's check this. Some files need heavy pgindent and perltidy. You were surely going to do this eventually, but maybe you want to do this sooner to check whether you like the results. - src/common/unicode/Makefile This patch series adds some new post-update-unicode tests. Should we have a separate target for each or just one common "unicode test" target? Not sure. - .../generate-unicode_category_table.pl The trailing commas handling ($firsttime etc.) is not necessary with C99. The code can be simplified. For this kind of code: +print $OT <<"HEADER"; let's use a common marker like EOS instead of a different one for each block. That just introduces unnecessary variations. - src/common/unicode_category.c The mask stuff at the top could use more explanation. It's impossible to figure out exactly what, say, PG_U_PC_MASK does. Line breaks in the different pg_u_prop_* functions are gratuitously different. Is it potentially confusing that only some pg_u_prop_* have a posix variant? Would it be better for a consistent interface to have a "posix" argument for each and just ignore it if not used? Not sure. Let's use size_t instead of Size for new code. * v16-0003-Add-unicode-case-mapping-tables-and-functions.patch Several of the above points apply here analogously. * v16-0004-Catalog-changes-preparing-for-builtin-collation-.patch This is mostly a straightforward renaming patch, but there are some changes in initdb and pg_dump that pre-assume the changes in the next patch, like which locale columns apply for which providers. 
I think it would be better for the historical record to make this a straight renaming patch and move those semantic changes to the next patch (or a separate intermediate patch, if you prefer). - src/bin/psql/describe.c - src/test/regress/expected/psql.out This would be a good opportunity to improve the output columns for collations. The updated view is now: + Schema | Name | Provider | Collate | Ctype | Locale | ICU Rules | Deterministic? +--------+------+----------+---------+-------+--------+-----------+---------------- This is historically grown but suboptimal. Why is Locale after Collate and Ctype, and why does it show both? I think we could have just the Locale column, and if the libc provider is used with different collate/ctype (very rare!), we somehow write that into the single locale column. (A change like this would be a separate patch.) * v16-0005-Introduce-collation-provider-builtin-for-C-and-C.patch About this initdb --builtin-locale option and analogous options elsewhere: Maybe we should flip this around and provide a --libc-locale option, and have all the other providers just use the --locale option. This would be more consistent with the fact that it's libc that is special in this context. Do we even need the "C" locale? We have established that "C.UTF-8" is useful, but if that is easily available, who would need "C"? Some changes in this patch appear to be just straight renamings, like in src/backend/utils/init/postinit.c and src/bin/pg_upgrade/t/002_pg_upgrade.pl. Maybe those should be put into the previous patch instead. On the collation naming: My expectation would have been that the "C.UTF-8" locale would be exposed as the UCS_BASIC collation. And the "C" locale as some other name (or not at all, see above). You have this the other way around.
On Wed, 2024-02-07 at 10:53 +0100, Peter Eisentraut wrote: > Various comments are updated to include the term "character class". > I > don't recognize that as an official Unicode term. There are > categories > and properties. Let's check this. It's based on https://www.unicode.org/reports/tr18/#Compatibility_Properties so I suppose the right name is "properties". > Is it potentially confusing that only some pg_u_prop_* have a posix > variant? Would it be better for a consistent interface to have a > "posix" argument for each and just ignore it if not used? Not sure. I thought about it but didn't see a clear advantage one way or another. > About this initdb --builtin-locale option and analogous options > elsewhere: Maybe we should flip this around and provide a --libc- > locale > option, and have all the other providers just use the --locale > option. > This would be more consistent with the fact that it's libc that is > special in this context. Would --libc-locale affect all the environment variables or just LC_CTYPE/LC_COLLATE? How do we avoid breakage? I like the general direction here but we might need to phase in the option or come up with a new name. Suggestions welcome. > Do we even need the "C" locale? We have established that "C.UTF-8" > is > useful, but if that is easily available, who would need "C"? I don't think we should encourage its use generally but I also don't think it will disappear any time soon. Some people will want it on simplicity grounds. I hope fewer people will use "C" when we have a better builtin option. > Some changes in this patch appear to be just straight renamings, like > in > src/backend/utils/init/postinit.c and > src/bin/pg_upgrade/t/002_pg_upgrade.pl. Maybe those should be put > into > the previous patch instead. > > On the collation naming: My expectation would have been that the > "C.UTF-8" locale would be exposed as the UCS_BASIC collation. I'd like that. We have to sort out a couple things first, though: 1. 
The SQL spec mentions the capitalization of "ß" as "SS" specifically. Should UCS_BASIC use the unconditional mappings in SpecialCasing.txt? I already have some code to do that (not posted yet). 2. Should UCS_BASIC use the "POSIX" or "Standard" variant of regex properties? (The main difference seems to be whether symbols get treated as punctuation or not.) 3. What do we do about potential breakage for existing users of UCS_BASIC who might be expecting C-like behavior? Regards, Jeff Davis
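Point 2 above hinges on General_Category: symbols (S*) versus punctuation (P*). A rough sketch of the two [[:punct:]] variants from TR #18's compatibility properties, using Python's unicodedata as a stand-in (this is not the proposed Postgres implementation, and TR #18 has further nuances this omits):

```python
import unicodedata

def is_punct(ch: str, posix_compatible: bool = False) -> bool:
    cat = unicodedata.category(ch)
    if cat.startswith("P"):  # Pc, Pd, Ps, Pe, Pi, Pf, Po
        return True
    # TR #18's "POSIX Compatible" variant also counts symbols
    # (Sm, Sc, Sk, So) as punctuation; the "Standard" variant does not.
    return posix_compatible and cat.startswith("S")

print(is_punct("!"))                         # True:  Po under either variant
print(is_punct("$"))                         # False: Sc, a symbol
print(is_punct("$", posix_compatible=True))  # True under the POSIX variant
```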
On 13.02.24 03:01, Jeff Davis wrote: > 1. The SQL spec mentions the capitalization of "ß" as "SS" > specifically. Should UCS_BASIC use the unconditional mappings in > SpecialCasing.txt? I already have some code to do that (not posted > yet). It is my understanding that "correct" Unicode case conversion needs to use at least some parts of SpecialCasing.txt. The header of the file says "For compatibility, the UnicodeData.txt file only contains simple case mappings for characters where they are one-to-one and independent of context and language. The data in this file, combined with the simple case mappings in UnicodeData.txt, defines the full case mappings [...]" I read this as, just using UnicodeData.txt by itself is incomplete. I think we need to use the "Unconditional" mappings and the "Conditional Language-Insensitive" mappings (which is just Greek sigma). Obviously, skip the "Language-Sensitive" mappings.
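The single "Conditional Language-Insensitive" rule mentioned above, Final_Sigma, is visible in Python, whose str.lower() implements the full mappings including that context-sensitive case (again as an illustration of the Unicode behavior, not of the patch's code):

```python
# Greek capital sigma (U+03A3) lowercases depending on context:
# U+03C2 (final form, "ς") at the end of a word, U+03C3 ("σ") elsewhere.
print("ΣΊΣΥΦΟΣ".lower())  # σίσυφος -- word-final sigma becomes ς
print("Σ".lower())        # σ -- no preceding cased letter, so non-final form

# UnicodeData.txt's simple mapping alone would turn every Σ into σ;
# the Final_Sigma rule comes from the conditional part of SpecialCasing.txt.
```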
On Wed, 2024-02-07 at 10:53 +0100, Peter Eisentraut wrote: > Review of the v16 patch set: > > (Btw., I suppose you started this patch series with 0002 because some > 0001 was committed earlier. But I have found this rather confusing. > I > think it's ok to renumber from 0001 for each new version.) Fixed. > Various comments are updated to include the term "character class". > I > don't recognize that as an official Unicode term. There are > categories > and properties. Let's check this. Changed to "properties" or "compatibility properties", except for a couple places in the test. The test compares against ICU, which does use the term "character classes" when discussing regexes: https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#details > Some files need heavy pgindent and perltidy. Done. > This patch series adds some new post-update-unicode tests. Should we > have a separate target for each or just one common "unicode test" > target? Not sure. I didn't make a change here. I suspect anyone updating unicode would want to run them all, but I don't have a strong opinion. > - .../generate-unicode_category_table.pl > > The trailing commas handling ($firsttime etc.) is not necessary with > C99. The code can be simplified. Thank you, fixed. > For this kind of code: > > +print $OT <<"HEADER"; Done. I used the <<"EOS" style which is more friendly to emacs, but I'm not sure if that's right for the project style. > Is it potentially confusing that only some pg_u_prop_* have a posix > variant? Would it be better for a consistent interface to have a > "posix" argument for each and just ignore it if not used? Not sure. I don't have a strong opinion here, so I didn't make a change. I can if you think it's cleaner. > Let's use size_t instead of Size for new code. Done. > * v16-0003-Add-unicode-case-mapping-tables-and-functions.patch > > Several of the above points apply here analogously. Fixed, I think. 
> * v16-0004-Catalog-changes-preparing-for-builtin-collation-.patch > > This is mostly a straightforward renaming patch, but there are some > changes in initdb and pg_dump that pre-assume the changes in the next > patch, like which locale columns apply for which providers. I think > it > would be better for the historical record to make this a straight > renaming patch and move those semantic changes to the next patch (or > a > separate intermediate patch, if you prefer). Agreed, put those non-renaming changes in the next patch. > - src/bin/psql/describe.c > - src/test/regress/expected/psql.out > > This would be a good opportunity to improve the output columns for > collations. The updated view is now: > > + Schema | Name | Provider | Collate | Ctype | Locale | ICU Rules | > Deterministic? > +--------+------+----------+---------+-------+--------+-----------+-- > -------------- > > This is historically grown but suboptimal. Why is Locale after > Collate > and Ctype, and why does it show both? I think we could have just the > Locale column, and if the libc provider is used with different > collate/ctype (very rare!), we somehow write that into the single > locale > column. > > (A change like this would be a separate patch.) I didn't do this, yet. > * v16-0005-Introduce-collation-provider-builtin-for-C-and-C.patch > > About this initdb --builtin-locale option and analogous options > elsewhere: Maybe we should flip this around and provide a --libc- > locale > option, and have all the other providers just use the --locale > option. > This would be more consistent with the fact that it's libc that is > special in this context. I agree that libc is the odd one out. I'm not quite sure how we should express that, though, because there are also the other environment variables to worry about (e.g. LC_MESSAGES). Probably best as a separate patch. 
> Some changes in this patch appear to be just straight renamings, like > in > src/backend/utils/init/postinit.c and > src/bin/pg_upgrade/t/002_pg_upgrade.pl. Maybe those should be put > into > the previous patch instead. Moved renamings to the previous patch. Regards, Jeff Davis
Attachment
On Tue, 2024-02-13 at 07:24 +0100, Peter Eisentraut wrote:
> It is my understanding that "correct" Unicode case conversion needs to
> use at least some parts of SpecialCasing.txt.
...
> I think we need to use the "Unconditional" mappings and the
> "Conditional Language-Insensitive" mappings (which is just Greek
> sigma). Obviously, skip the "Language-Sensitive" mappings.

Attached a new series. Overall I'm quite happy with this feature as well as the recent updates. It expands a lot on what behavior we can actually document; the character semantics are nearly as good as ICU; it's fast; and it eliminates what is arguably the last reason to use libc ("C collation combined with some other CTYPE").

Changes:

* Added a doc update for the "standard collations" (tiny patch, mostly separate) which clarifies the collations that are always available, and describes them a bit better.

* Added built-in locale "UCS_BASIC" (is that name confusing?) which uses full case mapping and the standard properties:
  - "ß" uppercases to "SS"
  - "Σ" usually lowercases to "σ", except when the Final_Sigma condition is met, in which case it lowercases to "ς"
  - initcap() uses titlecase variants ("dž" changes to "Dž")
  - in patterns/regexes, symbols (like "=") are not treated as punctuation

* Changed the UCS_BASIC collation to use the builtin "UCS_BASIC" locale with Unicode semantics. At first I was skeptical because it's a behavior change, and I am still not sure we want to do that. But doing so would take us closer to both the SQL spec as well as Unicode; and also this kind of character behavior change is less likely to cause a problem than a collation behavior change.

* The built-in locale "C.UTF-8" still exists, which uses Unicode simple case mapping and the POSIX compatible properties (no change here).

Implementation-wise:

* I introduced the CaseKind enum, which seemed to clean up a few things and reduce code duplication between upper/lower/titlecase. It also leaves room for introducing case folding later.

* Introduced a "case-ignorable" table to properly implement the Final_Sigma rule.

Loose ends:

* Right now you can't mix all of the full case mapping behavior with INITCAP(); it just does simple titlecase mapping. I'm not sure we want to get too fancy here; after all, INITCAP() is not a SQL standard function and it's documented in a narrow fashion that doesn't seem to leave a lot of room to be very smart. ICU does a few extra things beyond what I did:
  - it accepts a word break iterator to the case conversion function
  - it provides some built-in word break iterators
  - it also has some configurable "break adjustment" behavior[1][2] which re-aligns the start of the word, and I'm not entirely sure why that isn't done in the word break iterator or the titlecasing rules

Regards,
	Jeff Davis

[1] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/stringoptions_8h.html#a4975f537b9960f0330b233061ef0608d
[2] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/stringoptions_8h.html#afc65fa226cac9b8eeef0e877b8a7744e
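[Editorial illustration] The Final_Sigma rule mentioned above (the only "Conditional Language-Insensitive" mapping in SpecialCasing.txt) is also implemented by Python's str.lower(), which makes a convenient sketch of the behavior the new case-ignorable table supports; this is an illustration, not the patch's code:

```python
SIGMA = "\u03c3"        # σ, the ordinary lowercase sigma
FINAL_SIGMA = "\u03c2"  # ς, used only at the end of a word

# A word-final capital sigma (preceded by a cased character, not
# followed by one) takes the conditional mapping:
assert "ΑΣ".lower() == "α" + FINAL_SIGMA

# With no preceding cased character, the condition is not met:
assert "Σ".lower() == SIGMA
```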
Attachment
- v18-0001-Documentation-update-for-Standard-Collations.patch
- v18-0002-Add-Unicode-property-tables.patch
- v18-0003-Add-unicode-case-mapping-tables-and-functions.patch
- v18-0004-Catalog-changes-preparing-for-builtin-collation-.patch
- v18-0005-Introduce-collation-provider-builtin.patch
- v18-0006-Change-collation-UCS_BASIC-to-use-Unicode-semant.patch
On Mon, 2024-02-26 at 19:01 -0800, Jeff Davis wrote: > * Right now you can't mix all of the full case mapping behavior with > INITCAP(), it just does simple titlecase mapping. I'm not sure we > want > to get too fancy here; after all, INITCAP() is not a SQL standard > function and it's documented in a narrow fashion that doesn't seem to > leave a lot of room to be very smart. ICU does a few extra things > beyond what I did: > - it accepts a word break iterator to the case conversion function > - it provides some built-in word break iterators > - it also has some configurable "break adjustment" behavior[1][2] > which re-aligns the start of the word, and I'm not entirely sure why > that isn't done in the word break iterator or the titlecasing rules Attached v19 which addresses this issue. It does proper Unicode titlecasing with a word boundary iterator as an argument. For initcap, it just uses a simple word boundary iterator that breaks whenever isalnum() changes. It came out cleaner this way, ultimately, and it feels more complete even though the behavior isn't much different. It's also easier to comment the relationship of the functions to Unicode. I removed CaseKind from the public API but still use it internally to avoid code duplication. I made one other change, which is that (for now) I undid the UCS_BASIC change until we are sure we want to change it. Instead, I have builtin collations PG_C_UTF8 and PG_UNICODE_FAST. I used the name "FAST" to indicate that the collation uses fast memcmp() rather than a real collation, but the Unicode character support is all there (including full case mapping). I'm open to suggestion here on naming. Regards, Jeff Davis
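[Editorial illustration] The simple word boundary iterator described above -- break whenever isalnum() changes -- can be sketched in Python. This shows the idea only; the patch's actual implementation is in C:

```python
def initcap_simple(s: str) -> str:
    """Titlecase the first character of each word, lowercase the rest.
    A word starts wherever a non-alphanumeric character (or the start
    of the string) is followed by an alphanumeric one."""
    out = []
    prev_alnum = False
    for ch in s:
        if ch.isalnum() and not prev_alnum:
            out.append(ch.title())  # titlecase mapping for word start
        elif ch.isalnum():
            out.append(ch.lower())
        else:
            out.append(ch)
        prev_alnum = ch.isalnum()
    return "".join(out)

print(initcap_simple("hello WORLD"))  # Hello World
```

Note that punctuation starts a new word here ("o'neil" becomes "O'Neil"), which matches the narrow alnum/!alnum boundary rule rather than Unicode's richer word segmentation.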
Attachment
- v19-0001-Documentation-update-for-Standard-Collations.patch
- v19-0002-Add-Unicode-property-tables.patch
- v19-0003-Add-unicode-case-mapping-tables-and-functions.patch
- v19-0004-Catalog-changes-preparing-for-builtin-collation-.patch
- v19-0005-Introduce-collation-provider-builtin.patch
- v19-0006-Add-builtin-collation-objects-PG_C_UTF8-and-PG_U.patch
On Thu, 2024-02-29 at 21:05 -0800, Jeff Davis wrote:
> Attached v19 which addresses this issue.

I pushed the doc patch. Attached v20.

I am going to start pushing some other patches. v20-0001 (property tables) and v20-0003 (catalog iculocale -> locale) have been stable for a while so are likely to go in soon. v20-0002 (case mapping) also feels close to me, but it went through significant changes to support full case mapping and titlecasing, so I'll see if there are more comments.

Changes in v20:

* For titlecasing with the builtin "C.UTF-8" locale, do not perform word break adjustment, so it matches libc's "C.UTF-8" titlecasing behavior more closely.

* Add optimized table for ASCII code points when determining categories and properties (this was already done for the case mapping table).

* Add a small patch to make UTF-8 functions inline, which speeds things up substantially.

Performance:

ASCII-only data:

                      lower   initcap    upper
  "C" (libc)           2426      3326     2341
  pg_c_utf8            2890      6570     2825
  pg_unicode_fast      2929      7140     2893
  "C.utf8" (libc)      5410      7810     5397
  "en-US-x-icu"        8320     65732     9367

Including non-ASCII data:

                      lower   initcap    upper
  "C" (libc)           2630      4677     2548
  pg_c_utf8            5471     10682     5431
  pg_unicode_fast      5582     12023     5587
  "C.utf8" (libc)      8126     11834     8106
  "en-US-x-icu"       14473     73655    15112

The new builtin collations nicely finish ahead of everything except "C" (with an exception where pg_unicode_fast is marginally slower at titlecasing non-ASCII data than libc "C.UTF-8", which is likely due to the word break adjustment semantics).

I suspect the inlined UTF-8 functions also speed up a few other areas, but I didn't measure.

Regards,
	Jeff Davis
Attachment
- v20-0001-Add-Unicode-property-tables.patch
- v20-0002-Add-unicode-case-mapping-tables-and-functions.patch
- v20-0003-Catalog-changes-preparing-for-builtin-collation-.patch
- v20-0004-Introduce-collation-provider-builtin.patch
- v20-0005-Add-builtin-collation-objects-PG_C_UTF8-and-PG_U.patch
- v20-0006-Inline-basic-UTF-8-functions.patch
On Sat, 2024-03-02 at 15:02 -0800, Jeff Davis wrote:
> Attached v20.

And here's v22 (I didn't post v21).

I committed Unicode property tables and functions, and the simple case mapping. I separated out the full case mapping changes (based on SpecialCasing.txt) into patch 0006.

Not a lot of technical changes, but I cleaned up the remaining patches and put them into a nicer order with nicer commit messages.

0001: Catalog renaming: colliculocale to colllocale and daticulocale to datlocale.
0002: Basic builtin collation provider that only supports "C".
0003: C.UTF-8 locale for builtin collation provider and collation pg_c_utf8.
0004: Inline some UTF-8 functions to improve performance.
0005: Add a unicode_strtitle() function and move the implementation for the builtin provider out of formatting.c.
0006: Add full case mapping support.
0007: Add PG_UNICODE_FAST locale for builtin collation provider and collation pg_unicode_fast. This behaves like the standard says UCS_BASIC should behave -- sort by code point order but use Unicode character semantics with full case mapping.

0004 and beyond could use some review. 0004 and 0005 are pretty simple and non-controversial. 0006 and 0007 are a bit more interesting and could use some discussion if we want to go ahead with full case mapping in 17.

Regards,
	Jeff Davis
Attachment
- v22-0001-Catalog-changes-preparing-for-builtin-collation-.patch
- v22-0002-Introduce-collation-provider-builtin.patch
- v22-0003-Add-C.UTF-8-locale-to-the-new-builtin-collation-.patch
- v22-0004-Inline-basic-UTF-8-functions.patch
- v22-0005-Add-unicode_strtitle-for-Unicode-Default-Case-Co.patch
- v22-0006-Support-Unicode-full-case-mapping-and-conversion.patch
- v22-0007-Add-PG_UNICODE_FAST-locale-to-the-builtin-collat.patch
On 08.03.24 02:00, Jeff Davis wrote: > And here's v22 (I didn't post v21). > > I committed Unicode property tables and functions, and the simple case > mapping. I separated out the full case mapping changes (based on > SpecialCasing.txt) into patch 0006. > 0002: Basic builtin collation provider that only supports "C". Overall, this patch looks sound. In the documentation, let's make the list of locale providers an actual list instead of a sequence of <sect3>s. We had some discussion on initdb option --builtin-locale and whether it should be something more general. I'm ok with leaving it like this for now and maybe consider as an "open item" for PG17. In errmsg("parameter \"locale\" must be specified") make "locale" a placeholder. (See commit 36a14afc076). It seems the builtin provider accepts both "C" and "POSIX" as locale names, but the documentation says it must be "C". Maybe we don't need to accept "POSIX"? (Seeing that there are no plans for "POSIX.UTF-8", maybe we just ignore the "POSIX" spelling altogether?) Speaking of which, the code in postinit.c is inconsistent in that respect with builtin_validate_locale(). Shouldn't postinit.c use builtin_validate_locale(), to keep it consistent? Or, there could be a general function that accepts a locale provider and a locale string and validates everything together? In initdb.c, this message printf(_("The database cluster will be initialized with no locale.\n")); sounds a bit confusing. I think it's ok to show "C" as a locale. I'm not sure we need to change the logic here. Also in initdb.c, this message pg_fatal("locale must be specified unless provider is libc"); should be flipped around, like locale must be specified if provider is %s In pg_dump.c, dumpDatabase(), there are some new warning messages that are not specifically about the builtin provider. Are those existing deficiencies? It's not clear to me. What are the changes in the pg_upgrade test about? Maybe explain the scenario it is trying to test briefly? 
> 0004: Inline some UTF-8 functions to improve performance Makes sense that inlining can be effective here. But why aren't you just inlining the existing function pg_utf_mblen()? Now we have two functions that do the same thing. And the comment at pg_utf_mblen() is removed completely, so it's not clear anymore why it exists. > 0005: Add a unicode_strtitle() function and move the implementation for > the builtin provider out of formatting.c. In the recent discussion you had expressed some uncertainty about the detailed semantics of this. INITCAP() was copied from Oracle, so we could check there for reference, too. Or we go with full Unicode semantics. I'm not clear on all the differences and tradeoffs, if there are any. In any case, it would be good if there were documentation or a comment that somehow wrote down the resolution of this.
On Tue, 2024-03-12 at 09:24 +0100, Peter Eisentraut wrote: > In the documentation, let's make the list of locale providers an > actual > list instead of a sequence of <sect3>s. Done. > We had some discussion on initdb option --builtin-locale and whether > it > should be something more general. I'm ok with leaving it like this > for > now and maybe consider as an "open item" for PG17. OK. > In > > errmsg("parameter \"locale\" must be specified") > > make "locale" a placeholder. (See commit 36a14afc076). Done. > It seems the builtin provider accepts both "C" and "POSIX" as locale > names, but the documentation says it must be "C". Maybe we don't > need > to accept "POSIX"? (Seeing that there are no plans for "POSIX.UTF- > 8", > maybe we just ignore the "POSIX" spelling altogether?) Agreed, removed "POSIX". > Speaking of which, the code in postinit.c is inconsistent in that > respect with builtin_validate_locale(). Shouldn't postinit.c use > builtin_validate_locale(), to keep it consistent? Agreed, done. > Or, there could be a general function that accepts a locale provider > and > a locale string and validates everything together? That's a good idea -- perhaps a separate cleanup patch? > In initdb.c, this message > > printf(_("The database cluster will be initialized with no > locale.\n")); > > sounds a bit confusing. I think it's ok to show "C" as a locale. > I'm > not sure we need to change the logic here. Agreed, removed. > Also in initdb.c, this message > > pg_fatal("locale must be specified unless provider is libc"); > > should be flipped around, like > > locale must be specified if provider is %s Done. > In pg_dump.c, dumpDatabase(), there are some new warning messages > that > are not specifically about the builtin provider. Are those existing > deficiencies? It's not clear to me. I wouldn't call that a deficiency, but it seemed to be a convenient place to do some extra sanity checking along with the minor reorganization I did in that area. 
> What are the changes in the pg_upgrade test about? Maybe explain the > scenario it is trying to test briefly? It's trying to be a better test for commit 9637badd9f, which eliminates needless locale incompatibilities when performing a pg_upgrade. At the time of that commit, the options for testing were fairly limited, so I'm just expanding on that here a bit. It might be slightly over-engineered? I added some comments and cleaned it up. > > 0004: Inline some UTF-8 functions to improve performance > > Makes sense that inlining can be effective here. But why aren't you > just inlining the existing function pg_utf_mblen()? Now we have two > functions that do the same thing. And the comment at pg_utf_mblen() > is > removed completely, so it's not clear anymore why it exists. I was trying to figure out what to do about USE_PRIVATE_ENCODING_FUNCS. If libpq exports pg_utf_mblen(), it needs to continue to export that, or else it's an ABI break, right? So that means we need at least one extern copy of the function. See b6c7cfac88. Though now that I look at it, I'm not even calling the inlined version from my code -- I must have been using it in an earlier version and now not. So I just left pg_utf_mblen() alone, and inlined unicode_to_utf8() and utf8_to_unicode(). > > 0005: Add a unicode_strtitle() function and move the implementation > > for > > the builtin provider out of formatting.c. > > In the recent discussion you had expression some uncertainty about > the > detailed semantics of this. INITCAP() was copied from Oracle, so we > could check there for reference, too. Or we go with full Unicode > semantics. I'm not clear on all the differences and tradeoffs, if > there > are any. In any case, it would be good if there were documentation > or a > comment that somehow wrote down the resolution of this. There are a few nuances that are different between the Unicode way to titlecase a string and INITCAP(): 1. 
For the initial character in a word, Unicode uses the titlecase mapping, whereas INITCAP (as the name suggests) uses the uppercase mapping. 2. Unicode uses full case mapping, which can change the length of the string (e.g. mapping "ß" to the titlecase "Ss" -- though I've heard that titlecasing "ß" doesn't make a lot of sense in German because words typically don't begin with it). Full case mapping can also handle context-sensitive mappings, such as the "final sigma". 3. Unicode has a lot to say about word boundaries, whereas INITCAP() just uses the boundary between alnum and !alnum. The unicode_strtitle() function is just a way to unify those differences into one implementation. A "full" parameter controls behaviors 1 & 2, and a callback handles 3. If we just want to keep it simple, we can leave it as the character-by-character algorithm in formatting.c. My uncertainty was whether we really want INITCAP to be doing these more sophisticated titlecasing transformations, or whether that should be a separate sql function (title()? titlecase()?), or whether we just don't need that functionality. New series attached. I plan to commit 0001 very soon. Regards, Jeff Davis
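[Editorial illustration] The first two nuances are easy to see with Python's str methods, which implement the Unicode titlecase and full mappings (code points written as escapes because the digraph forms look nearly identical; illustrative only, not the patch's code):

```python
# Nuance 1: the word-initial character gets the *titlecase* mapping,
# which differs from uppercase for digraphs such as dž (U+01C6):
assert "\u01c6".upper() == "\u01c4"   # dž -> DŽ
assert "\u01c6".title() == "\u01c5"   # dž -> ǅ

# Nuance 2: full case mapping can change the string's length,
# e.g. the "fi" ligature (U+FB01) uppercases to two characters:
assert "\ufb01".upper() == "FI"
```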
Attachment
- v23-0001-Introduce-collation-provider-builtin.patch
- v23-0002-Add-C.UTF-8-locale-to-the-new-builtin-collation-.patch
- v23-0003-Inline-basic-UTF-8-functions.patch
- v23-0004-Add-unicode_strtitle-for-Unicode-Default-Case-Co.patch
- v23-0005-Support-Unicode-full-case-mapping-and-conversion.patch
- v23-0006-Add-PG_UNICODE_FAST-locale-to-the-builtin-collat.patch
On Wed, 2024-03-13 at 00:44 -0700, Jeff Davis wrote:
> New series attached. I plan to commit 0001 very soon.

Committed the basic builtin provider, supporting only the "C" locale. There were a few changes since the last version I posted:

* Added simplistic validation of the locale name to initdb.c (missing before).
* Consistently passed the locale name to get_collation_actual_version(). In the previous patch, the caller sometimes just passed NULL knowing that the builtin provider is not versioned, but that's not the caller's responsibility.
* pg_dump previously had some minor refactoring, which you had some questions about. I eliminated that and just kept it to the changes necessary for the builtin provider.
* createdb --help was missing the --builtin-locale option.
* Improved error checking order in createdb() to avoid a confusing error message.

I also attached a rebased series.

0001 (the C.UTF-8 locale) is also close. Considering that most of the infrastructure is already in place, that's not a large patch. You may have some comments about the way I'm canonicalizing and validating in initdb -- that could be cleaner, but it feels like I should refactor the surrounding code separately first.

0002 (inlining utf8 functions) is also ready.

For 0003 and beyond, I'd like some validation that it's what you had in mind.

Regards,
	Jeff Davis
Attachment
On 14.03.24 09:08, Jeff Davis wrote:
> On Wed, 2024-03-13 at 00:44 -0700, Jeff Davis wrote:
>> New series attached. I plan to commit 0001 very soon.
>
> Committed the basic builtin provider, supporting only the "C" locale.

As you were committing this, I had another review of v23-0001-Introduce-collation-provider-builtin.patch in progress. Some of the things I found you have already addressed in what you committed. Please check the remaining comments.

* doc/src/sgml/charset.sgml

I don't understand the purpose of this sentence: "When using this locale, the behavior may depend on the database encoding."

* doc/src/sgml/ref/create_database.sgml

The new parameter builtin_locale is not documented.

* src/backend/commands/collationcmds.c

I think DefineCollation() should set collencoding = -1 for the COLLPROVIDER_BUILTIN case. -1 stands for any encoding. Or at least explain why not?

* src/backend/utils/adt/pg_locale.c

This part is a bit confusing:

+    cache_entry->collate_is_c = true;
+    cache_entry->ctype_is_c = (strcmp(colllocale, "C") == 0);

Is collate always C but ctype only sometimes? Does this anticipate future patches in this series? Maybe in this patch it should always be true?

* src/bin/initdb/initdb.c

+    printf(_("  --builtin-locale=LOCALE   set builtin locale name for new databases\n"));

Put in a line break so that the right "column" lines up.

This output should line up better:

    The database cluster will be initialized with this locale configuration:
      default collation provider:  icu
      default collation locale:    en
      LC_COLLATE:  C
      LC_CTYPE:    C
      ...

Also, why are there two spaces after "provider: "? Also, we call these locale providers on input, so why are they collation providers on output? What is a "collation locale"?

* src/bin/pg_upgrade/t/002_pg_upgrade.pl

+if ($oldnode->pg_version >= '17devel')

This is weird. >= is a numeric comparison, so providing a string with non-digits is misleading at best.
* src/test/icu/t/010_database.pl -# Test that LOCALE works for ICU locales if LC_COLLATE and LC_CTYPE -# are specified Why remove this test? +my ($ret, $stdout, $stderr) = $node1->psql('postgres', + q{CREATE DATABASE dbicu LOCALE_PROVIDER builtin LOCALE 'C' TEMPLATE dbicu} +); Change the name of the new database to be different from the name of the template database.
On 14.03.24 09:08, Jeff Davis wrote:
> 0001 (the C.UTF-8 locale) is also close. Considering that most of the
> infrastructure is already in place, that's not a large patch. You many
> have some comments about the way I'm canonicalizing and validating in
> initdb -- that could be cleaner, but it feels like I should refactor
> the surrounding code separately first.

I have tested this against the libc locale C.utf8 that was available on the OS, and the behavior is consistent.

I wonder if we should version the builtin locales too. We might make a mistake and want to change something sometime?

Tiny comments:

* src/bin/scripts/t/020_createdb.pl

The two added tests should have different names that tell them apart (like the new initdb tests).

* src/include/catalog/pg_collation.dat

Maybe use 'and' instead of '&' in the description.

> 0002 (inlining utf8 functions) is also ready.

Seems ok.

> For 0003 and beyond, I'd like some validation that it's what you had in mind.

I'll look into those later.
On Thu, 2024-03-14 at 09:54 +0100, Peter Eisentraut wrote: > * doc/src/sgml/charset.sgml > > I don't understand the purpose of this sentence: > > "When using this locale, the behavior may depend on the database > encoding." The "C" locale (in either the builtin or libc provider) can sort differently in different encodings, because it's based on memcmp. For instance: select U&'\20AC' > U&'\201A' collate "C"; Returns true in UTF-8 and false in WIN1252. That's why UCS_BASIC is only available in UTF-8, because (at least for some encodings) we'd have to decode before comparison to get the code-point-order semantics right. In other words, the "C" collation is not a well-defined order, but UCS_BASIC and C.UTF-8 are well-defined. Suggestions for better wording are welcome. > * doc/src/sgml/ref/create_database.sgml > > The new parameter builtin_locale is not documented. Thank you, fixed in 0001 (review fixup). > * src/backend/commands/collationcmds.c > > I think DefineCollation() should set collencoding = -1 for the > COLLPROVIDER_BUILTIN case. -1 stands for any encoding. Or at least > explain why not? In the attached v25-0001 (review fixup) I have made it the responsibility of a function, and then extended that for the C.UTF-8 (0002) and PG_UNICODE_FAST locales (0007). > * src/backend/utils/adt/pg_locale.c > > This part is a bit confusing: > > + cache_entry->collate_is_c = true; > + cache_entry->ctype_is_c = (strcmp(colllocale, "C") == 0); > > Is collate always C but ctype only sometimes? Does this anticipate > future patches in this series? Maybe in this patch it should always > be true? Made it a constant in v25-0001, and changed it in 0002 > > * src/bin/initdb/initdb.c > > + printf(_(" --builtin-locale=LOCALE set builtin locale name > for new databases\n")); > > Put in a line break so that the right "column" lines up. 
Fixed in 0001 > This output should line up better: > > The database cluster will be initialized with this locale > configuration: > default collation provider: icu > default collation locale: en > LC_COLLATE: C > LC_CTYPE: C > ... > > Also, why are there two spaces after "provider: "? > > Also we call these locale provider on input, why are they collation > providers on output? What is a "collation locale"? I tried to fix these things in 0001. > * src/bin/pg_upgrade/t/002_pg_upgrade.pl > > +if ($oldnode->pg_version >= '17devel') > > This is weird. >= is a numeric comparison, so providing a string > with > non-digits is misleading at best. It's actually not a numeric comparison, it's an overloaded comparison op for the Version class. See 32dd2c1eff and: https://www.postgresql.org/message-id/1738174.1710274577%40sss.pgh.pa.us > * src/test/icu/t/010_database.pl > > -# Test that LOCALE works for ICU locales if LC_COLLATE and LC_CTYPE > -# are specified > > Why remove this test? It must have been lost during a rebase, fixed in 0001. > Change the name of the new database to be different from the name of > the template database. Fixed in 0001. New series attached. Regards, Jeff Davis
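[Editorial illustration] The encoding-dependence of the memcmp-based "C" behavior described in the reply above can be reproduced by comparing encoded bytes directly (a Python sketch; cp1252 is Python's name for WIN1252):

```python
euro = "\u20ac"       # € (EURO SIGN)
low_quote = "\u201a"  # ‚ (SINGLE LOW-9 QUOTATION MARK)

# In UTF-8, byte order happens to agree with code point order,
# so a memcmp-style comparison says € > ‚:
assert euro.encode("utf-8") > low_quote.encode("utf-8")

# In WIN1252, € is 0x80 and ‚ is 0x82, so the comparison flips:
assert euro.encode("cp1252") < low_quote.encode("cp1252")
```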
Attachment
- v25-0001-Address-more-review-comments-on-commit-2d819a08a.patch
- v25-0002-Support-C.UTF-8-locale-in-the-new-builtin-collat.patch
- v25-0003-Inline-basic-UTF-8-functions.patch
- v25-0004-Use-version-for-builtin-collations.patch
- v25-0005-Add-unicode_strtitle-for-Unicode-Default-Case-Co.patch
- v25-0006-Support-Unicode-full-case-mapping-and-conversion.patch
- v25-0007-Support-PG_UNICODE_FAST-locale-in-the-builtin-co.patch
On Thu, 2024-03-14 at 15:38 +0100, Peter Eisentraut wrote: > On 14.03.24 09:08, Jeff Davis wrote: > > 0001 (the C.UTF-8 locale) is also close... > > If have tested this against the libc locale C.utf8 that was available > on > the OS, and the behavior is consistent. That was the goal, in spirit. But to clarify: it's not guaranteed that the built-in C.UTF-8 is always the same as the libc UTF-8, because different implementations do different things. For instance, I saw significant differences on MacOS. > I wonder if we should version the builtin locales too. We might make > a > mistake and want to change something sometime? I'm fine with that, see v25-0004 in the reply to your other mail. The version only tracks sort order, and all of the builtin locales sort based on memcmp(). But it's possible there are bugs in the optimizations around memcmp() (e.g. abbreviated keys, or some future optimization). > Tiny comments: > > * src/bin/scripts/t/020_createdb.pl > > The two added tests should have different names that tells them apart > (like the new initdb tests). > > * src/include/catalog/pg_collation.dat Done in v25-0002 (in reply to your other mail). Regards, Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes:
> New series attached.

Coverity thinks there's something wrong with builtin_validate_locale, and as far as I can tell it's right: the last ereport is unreachable, because required_encoding is never changed from its initial -1 value. It looks like there's a chunk of logic missing there, or else that the code could be simplified further.

/srv/coverity/git/pgsql-git/postgresql/src/backend/utils/adt/pg_locale.c: 2519 in builtin_validate_locale()
>>> CID 1594398: Control flow issues (DEADCODE)
>>> Execution cannot reach the expression "encoding != required_encoding" inside this statement: "if (required_encoding >= 0 ...".
2519     if (required_encoding >= 0 && encoding != required_encoding)
2520         ereport(ERROR,
2521             (errcode(ERRCODE_WRONG_OBJECT_TYPE),
2522              errmsg("encoding \"%s\" does not match locale \"%s\"",
2523                  pg_encoding_to_char(encoding), locale)));

			regards, tom lane
On Sun, 2024-03-17 at 17:46 -0400, Tom Lane wrote: > Jeff Davis <pgsql@j-davis.com> writes: > > New series attached. > > Coverity thinks there's something wrong with builtin_validate_locale, > and as far as I can tell it's right: the last ereport is unreachable, > because required_encoding is never changed from its initial -1 value. > It looks like there's a chunk of logic missing there, or else that > the code could be simplified further. Thank you, it was a bit of over-generalization in anticipation of future patches. It may be moot soon, but I committed a fix now. Regards, Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes:
> It may be moot soon, but I committed a fix now.

Thanks, but it looks like 846311051 introduced a fresh issue. MSVC is complaining about

[21:37:15.349] c:\cirrus\src\backend\utils\adt\pg_locale.c(2515) : warning C4715: 'builtin_locale_encoding': not all control paths return a value

This is causing all CI jobs to fail the "compiler warnings" check. Probably the best fix is the traditional

    return <something>;    /* keep compiler quiet */

but I'm not sure what the best default result is in this function.

			regards, tom lane
On Mon, 2024-03-18 at 18:04 -0400, Tom Lane wrote:
> This is causing all CI jobs to fail the "compiler warnings" check.

I did run CI before checkin, and it passed:

https://cirrus-ci.com/build/5382423490330624

If I open up the windows build, I see the warning:

https://cirrus-ci.com/task/5199979044667392

but I didn't happen to check this time.

> Probably the best fix is the traditional
>
>     return <something>;    /* keep compiler quiet */
>
> but I'm not sure what the best default result is in this function.

I inverted the check so that I didn't have to choose a default.

Regards,
	Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > On Mon, 2024-03-18 at 18:04 -0400, Tom Lane wrote: >> This is causing all CI jobs to fail the "compiler warnings" check. > I did run CI before checkin, and it passed: > https://cirrus-ci.com/build/5382423490330624 Weird, why did it not report with the same level of urgency? But anyway, thanks for fixing. regards, tom lane
On Tue, Mar 19, 2024 at 11:55 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > On Mon, 2024-03-18 at 18:04 -0400, Tom Lane wrote:
> >> This is causing all CI jobs to fail the "compiler warnings" check.
>
> > I did run CI before checkin, and it passed:
> > https://cirrus-ci.com/build/5382423490330624
>
> Weird, why did it not report with the same level of urgency?
> But anyway, thanks for fixing.

Maybe I misunderstood this exchange but ...

Currently Windows warnings don't make any CI tasks fail ie turn red, which is why Jeff's run is all green in his personal github repo. Unlike gcc and clang, and MinGW cross-build warnings, which cause the special "CompilerWarnings" CI task to fail (red). That task is running on a Linux system so it can't use MSVC. The idea of keeping it separate from the "main" Linux, FreeBSD, macOS tasks (which use gcc, clang, clang respectively) was that it's nicer to try to run the actual tests even if there is a pesky warning, so having it in a separate task gets you that info without blocking other progress, and it also tries with and without assertions (a category of warning hazard, eg unused variables when assertions are off).

But I did teach cfbot to do some extra digging through the logs, looking for various interesting patterns[1], including non-error warnings, and if it finds anything interesting it shows a little clickable ⚠ symbol on the front page.

If there is something like -Werror on MSVC we could turn that on for the main Windows test, but that might also be a bit annoying. Perhaps there is another way: we could have it compile and test everything, allowing warnings, but also then grep the build log afterwards in a new step that fails if any warnings were there? Then Jeff would have got a failure in his personal CI run. Or something like that.

[1] https://github.com/macdice/cfbot/blob/master/cfbot_work_queue.py
Thomas Munro <thomas.munro@gmail.com> writes:
> On Tue, Mar 19, 2024 at 11:55 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> This is causing all CI jobs to fail the "compiler warnings" check.
>>> I did run CI before checkin, and it passed:

> Maybe I misunderstood this exchange but ...
> Currently Windows warnings don't make any CI tasks fail ie turn red,
> which is why Jeff's run is all green in his personal github repo.
> ...
> But I did teach cfbot to do some extra digging through the logs,

Ah. What I should have said was "it's causing cfbot to complain about every patch". Seems like the divergence in the pass criterion is not such a great idea.

regards, tom lane
* v25-0001-Address-more-review-comments-on-commit-2d819a08a.patch

This was committed.

* v25-0002-Support-C.UTF-8-locale-in-the-new-builtin-collat.patch

Looks ok.

* v25-0003-Inline-basic-UTF-8-functions.patch

ok

* v25-0004-Use-version-for-builtin-collations.patch

Not sure about the version format "1.0", which implies some sort of major/minor or component-based system. I would just use "1".

* v25-0005-Add-unicode_strtitle-for-Unicode-Default-Case-Co.patch
* v25-0006-Support-Unicode-full-case-mapping-and-conversion.patch
* v25-0007-Support-PG_UNICODE_FAST-locale-in-the-builtin-co.patch

0005 and 0006 don't contain any test cases. So I guess they are really only usable via 0007. Is that understanding correct?

Btw., tested initcap() on Oracle:

    select initcap('džudo') from dual;

(which uses the precomposed U+01F3) and the result is DŽudo (with the precomposed uppercase character). So that matches the behavior proposed in your 0002 patch.

Are there any test cases that illustrate the word boundary changes in patch 0005? It might be useful to test those against Oracle as well.
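The digraph behavior described above can be cross-checked outside either database. As a hedged aside (not the Postgres implementation): Python's str.title() also applies the Unicode titlecase mapping, under which U+01F3 (dz) maps to the precomposed titlecase character U+01F2 (Dz), distinct from its uppercase mapping U+01F1 (DZ):

```python
import unicodedata

# U+01F3 is LATIN SMALL LETTER DZ, a precomposed digraph
s = "\u01F3udo"

# str.title() uses the titlecase mapping for the first cased
# character of each word, not the uppercase mapping
titled = s.title()
assert titled == "\u01F2udo"  # U+01F2: capital D with small z ("DŽudo")
assert unicodedata.name(titled[0]) == "LATIN CAPITAL LETTER D WITH SMALL LETTER Z"

# str.upper(), by contrast, uses the uppercase mapping U+01F1 (DZ)
assert s.upper() == "\u01F1UDO"
```

This matches the Oracle result quoted above: titlecase of the digraph stays precomposed rather than decomposing or fully uppercasing.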
On Tue, 2024-03-19 at 13:41 +0100, Peter Eisentraut wrote:
> * v25-0002-Support-C.UTF-8-locale-in-the-new-builtin-collat.patch
>
> Looks ok.

Committed.

> * v25-0003-Inline-basic-UTF-8-functions.patch

Committed.

> * v25-0004-Use-version-for-builtin-collations.patch
>
> Not sure about the version format "1.0", which implies some sort of
> major/minor or component-based system. I would just use "1".

The v26 patch was not quite complete, so I didn't commit it yet. Attached v27-0001 and 0002.

0002 is necessary because otherwise lc_collate_is_c() short-circuits the version check in pg_newlocale_from_collation(). With 0002, the code is simpler and all paths go through pg_newlocale_from_collation(), and the version check happens even when lc_collate_is_c().

But perhaps there was a reason the code was the way it was, so submitting for review in case I missed something.

> 0005 and 0006 don't contain any test cases. So I guess they are really
> only usable via 0007. Is that understanding correct?

0005 is not a functional change, it's just a refactoring to use a callback, which is preparation for 0007.

> Are there any test cases that illustrate the word boundary changes in
> patch 0005? It might be useful to test those against Oracle as well.

The tests include initcap('123abc'), which is '123abc' in the PG_C_UTF8 collation vs '123Abc' in PG_UNICODE_FAST.

The reason for the latter behavior is that the Unicode Default Case Conversion algorithm for toTitlecase() advances to the next Cased character before mapping to titlecase, and digits are not Cased. ICU has a configurable adjustment, and defaults in a way that produces '123abc'.

New rebased series attached.

Regards,
Jeff Davis
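The word-boundary behavior described above can be sketched in a few lines. This is a hedched illustration only, not the Postgres implementation: a crude model of Default Case Conversion toTitlecase(), where the search for the first character to titlecase skips past uncased characters such as digits (so '123abc' becomes '123Abc'), while digits inside a word do not start a new word (unlike Python's own str.title(), which uses a cruder letters-only word definition):

```python
def titlecase_default(s: str) -> str:
    """Sketch of toTitlecase(): at each word start, skip uncased
    characters (e.g. digits), titlecase the first cased character,
    and lowercase the rest of the word. Uses isalpha() as a crude
    stand-in for the Unicode Cased property."""
    out = []
    at_word_start = True
    for ch in s:
        if ch.isalpha():
            out.append(ch.title() if at_word_start else ch.lower())
            at_word_start = False
        else:
            out.append(ch)
            # Digits are not Cased, so they neither get titlecased nor
            # end the search for the first cased character; only
            # non-alphanumerics start a new word here.
            if not ch.isdigit():
                at_word_start = True
    return "".join(out)

assert titlecase_default("123abc") == "123Abc"        # PG_UNICODE_FAST behavior
assert titlecase_default("hello world") == "Hello World"
assert titlecase_default("abc123def") == "Abc123def"  # digits don't split words
```

ICU's default break adjustment instead titlecases from the start of the word without skipping to a cased character, which is why it yields '123abc'; the sketch above follows the Unicode default instead.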
On 21.03.24 01:13, Jeff Davis wrote:
>> Are there any test cases that illustrate the word boundary changes in
>> patch 0005? It might be useful to test those against Oracle as well.
> The tests include initcap('123abc') which is '123abc' in the PG_C_UTF8
> collation vs '123Abc' in PG_UNICODE_FAST.
>
> The reason for the latter behavior is that the Unicode Default Case
> Conversion algorithm for toTitlecase() advances to the next Cased
> character before mapping to titlecase, and digits are not Cased. ICU
> has a configurable adjustment, and defaults in a way that produces
> '123abc'.

I think this might be too big of a compatibility break. So far, initcap('123abc') has always returned '123abc'. If the new collation returns '123Abc' now, then that's quite a change. These are not some obscure Unicode special case characters, after all.

What is the ICU configuration incantation for this? Maybe we could have the builtin provider understand some of that, too.

Or we should create a function separate from initcap.
On Fri, 2024-03-22 at 15:51 +0100, Peter Eisentraut wrote:
> I think this might be too big of a compatibility break. So far,
> initcap('123abc') has always returned '123abc'. If the new collation
> returns '123Abc' now, then that's quite a change. These are not some
> obscure Unicode special case characters, after all.

It's a new collation, so I'm not sure it's a compatibility break. But you are right that it is against documentation and expectations for INITCAP().

> What is the ICU configuration incantation for this? Maybe we could have
> the builtin provider understand some of that, too.

https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/stringoptions_8h.html#a4975f537b9960f0330b233061ef0608d
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/stringoptions_8h.html#afc65fa226cac9b8eeef0e877b8a7744e

> Or we should create a function separate from initcap.

If we create a new function, that also gives us the opportunity to accept optional arguments to control the behavior rather than relying on collation for every decision.

Regards,
Jeff Davis
Hello Jeff,

21.03.2024 03:13, Jeff Davis wrote:
> On Tue, 2024-03-19 at 13:41 +0100, Peter Eisentraut wrote:
>> * v25-0002-Support-C.UTF-8-locale-in-the-new-builtin-collat.patch
>>
>> Looks ok.
> Committed.

Please look at a Valgrind-detected error caused by the following query (starting from f69319f2f):

SELECT lower('Π' COLLATE pg_c_utf8);

==00:00:00:03.487 1429669== Invalid read of size 1
==00:00:00:03.487 1429669==    at 0x7C64A5: convert_case (unicode_case.c:107)
==00:00:00:03.487 1429669==    by 0x7C6666: unicode_strlower (unicode_case.c:70)
==00:00:00:03.487 1429669==    by 0x66B218: str_tolower (formatting.c:1698)
==00:00:00:03.488 1429669==    by 0x6D6C55: lower (oracle_compat.c:55)

Best regards,
Alexander
On Sun, 2024-03-24 at 14:00 +0300, Alexander Lakhin wrote:
> Please look at a Valgrind-detected error caused by the following query
> (starting from f69319f2f):
> SELECT lower('Π' COLLATE pg_c_utf8);

Thank you for the report! Fixed in 503c0ad976.

Valgrind did not detect the problem in my setup, so I added a unit test in case_test.c where it's easier to see the valgrind problem.

Regards,
Jeff Davis
There is no technical content in this mail, but I'd like to show appreciation for your work on this. I hope this will eventually remove one of the great embarrassments when using PostgreSQL: the dependency on operation system collations. Yours, Laurenz Albe
On 22.03.24 18:26, Jeff Davis wrote:
> On Fri, 2024-03-22 at 15:51 +0100, Peter Eisentraut wrote:
>> I think this might be too big of a compatibility break. So far,
>> initcap('123abc') has always returned '123abc'. If the new collation
>> returns '123Abc' now, then that's quite a change. These are not some
>> obscure Unicode special case characters, after all.
>
> It's a new collation, so I'm not sure it's a compatibility break. But
> you are right that it is against documentation and expectations for
> INITCAP().
>
>> What is the ICU configuration incantation for this? Maybe we could have
>> the builtin provider understand some of that, too.
>
> https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/stringoptions_8h.html#a4975f537b9960f0330b233061ef0608d
> https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/stringoptions_8h.html#afc65fa226cac9b8eeef0e877b8a7744e
>
>> Or we should create a function separate from initcap.
>
> If we create a new function, that also gives us the opportunity to
> accept optional arguments to control the behavior rather than relying
> on collation for every decision.

Right. I thought when you said there is an ICU configuration for it, that it might be like collation options that you specify in the locale string. But it appears it is only an internal API setting. So that, in my mind, reinforces the opinion that we should leave initcap() as is and make a new function that exposes the new functionality. (This does not have to be part of this patch set.)
On Mon, 2024-03-25 at 08:29 +0100, Peter Eisentraut wrote:
> Right. I thought when you said there is an ICU configuration for it,
> that it might be like collation options that you specify in the locale
> string. But it appears it is only an internal API setting. So that, in
> my mind, reinforces the opinion that we should leave initcap() as is
> and make a new function that exposes the new functionality. (This does
> not have to be part of this patch set.)

OK, I'll propose a "title" or "titlecase" function for 18, along with "casefold" (which I was already planning to propose).

What do you think about UPPER/LOWER and full case mapping? Should there be extra arguments for full vs simple case mapping, or should it come from the collation?

It makes sense that the "dotted vs dotless i" behavior comes from the collation because that depends on locale. But full-vs-simple case mapping is not really a locale question. For instance:

    select lower('0Σ' collate "en-US-x-icu") AS lower_sigma,
           lower('ΑΣ' collate "en-US-x-icu") AS lower_final_sigma,
           upper('ß' collate "en-US-x-icu") AS upper_eszett;
     lower_sigma | lower_final_sigma | upper_eszett
    -------------+-------------------+--------------
     0σ          | ας                | SS

produces the same results for any ICU collation.

There's also another reason to consider it an argument rather than a collation property, which is that it might be dependent on some other field in a row. I could imagine someone wanting to do:

    SELECT
      UPPER(some_field,
            full => true,
            dotless_i => CASE other_field WHEN ...)
    FROM ...

That makes sense for a function in the target list, because different customers might be from different locales and therefore want different treatment of the dotted-vs-dotless-i.

Thoughts? Should we use the collation by default but then allow parameters to override? Or should we just consider this a new set of functions? (All of this is v18 material, of course.)

Regards,
Jeff Davis
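The full-mapping results above can be cross-checked without a database at all. As a hedged aside: Python's str.upper()/str.lower() implement Unicode full case mappings, including the context-sensitive final-sigma rule and the one-to-many ß expansion, and reproduce the same three results:

```python
# Final sigma: Σ lowercases to ς only when preceded by a cased character
# and not followed by one. '0' is not cased, so the sigma stays σ.
assert "0Σ".lower() == "0σ"

# Α is cased and nothing cased follows, so this sigma becomes final ς.
assert "ΑΣ".lower() == "ας"

# One-to-many mapping: uppercasing ß expands to two characters.
# Simple (1:1) mapping would have to leave ß unchanged, since U+00DF
# has no single-character uppercase counterpart.
assert "ß".upper() == "SS"
assert len("ß".upper()) == 2
```

The final-sigma case shows why full mapping is context-sensitive (the result depends on surrounding characters), and the ß case shows why it is not length-preserving, which is exactly what distinguishes it from simple mapping.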
On 21.03.24 01:13, Jeff Davis wrote:
> The v26 patch was not quite complete, so I didn't commit it yet.
> Attached v27-0001 and 0002.
>
> 0002 is necessary because otherwise lc_collate_is_c() short-circuits
> the version check in pg_newlocale_from_collation(). With 0002, the code
> is simpler and all paths go through pg_newlocale_from_collation(), and
> the version check happens even when lc_collate_is_c().
>
> But perhaps there was a reason the code was the way it was, so
> submitting for review in case I missed something.
>
>> 0005 and 0006 don't contain any test cases. So I guess they are really
>> only usable via 0007. Is that understanding correct?
>
> 0005 is not a functional change, it's just a refactoring to use a
> callback, which is preparation for 0007.
>
>> Are there any test cases that illustrate the word boundary changes in
>> patch 0005? It might be useful to test those against Oracle as well.
>
> The tests include initcap('123abc') which is '123abc' in the PG_C_UTF8
> collation vs '123Abc' in PG_UNICODE_FAST.
>
> The reason for the latter behavior is that the Unicode Default Case
> Conversion algorithm for toTitlecase() advances to the next Cased
> character before mapping to titlecase, and digits are not Cased. ICU
> has a configurable adjustment, and defaults in a way that produces
> '123abc'.
>
> New rebased series attached.

The patch set v27 is ok with me, modulo (a) discussion about initcap semantics, and (b) what collation to assign to ucs_basic, which can be revisited later.
On 25.03.24 18:52, Jeff Davis wrote:
> OK, I'll propose a "title" or "titlecase" function for 18, along with
> "casefold" (which I was already planning to propose).

(Yay, casefold will be useful.)

> What do you think about UPPER/LOWER and full case mapping? Should there
> be extra arguments for full vs simple case mapping, or should it come
> from the collation?
>
> It makes sense that the "dotted vs dotless i" behavior comes from the
> collation because that depends on locale. But full-vs-simple case
> mapping is not really a locale question. For instance:
>
>   select lower('0Σ' collate "en-US-x-icu") AS lower_sigma,
>          lower('ΑΣ' collate "en-US-x-icu") AS lower_final_sigma,
>          upper('ß' collate "en-US-x-icu") AS upper_eszett;
>    lower_sigma | lower_final_sigma | upper_eszett
>   -------------+-------------------+--------------
>    0σ          | ας                | SS
>
> produces the same results for any ICU collation.

I think of a collation describing what language a text is in. So it makes sense that "dotless i" depends on the locale/collation.

Full vs. simple case mapping is more of a legacy compatibility question, in my mind. There is some expectation/precedent that C.UTF-8 uses simple case mapping, but beyond that, I don't see a reason why someone would want to explicitly opt for simple case mapping, other than if they need length preservation or something, but if they need that, then they are going to be in a world of pain in Unicode anyway.

> There's also another reason to consider it an argument rather than a
> collation property, which is that it might be dependent on some other
> field in a row. I could imagine someone wanting to do:
>
> SELECT
>   UPPER(some_field,
>         full => true,
>         dotless_i => CASE other_field WHEN ...)
> FROM ...

Can you index this usefully? It would only work if the user query matches exactly this pattern?
> That makes sense for a function in the target list, because different
> customers might be from different locales and therefore want different
> treatment of the dotted-vs-dotless-i.

There is also the concept of a session collation, which we haven't implemented, but it would address this kind of use. But there again the problem is indexing. But maybe indexing isn't as important for case conversion as it is for sorting.
Jeff Davis wrote:
> The tests include initcap('123abc') which is '123abc' in the PG_C_UTF8
> collation vs '123Abc' in PG_UNICODE_FAST.
>
> The reason for the latter behavior is that the Unicode Default Case
> Conversion algorithm for toTitlecase() advances to the next Cased
> character before mapping to titlecase, and digits are not Cased. ICU
> has a configurable adjustment, and defaults in a way that produces
> '123abc'.

Even aside from ICU, there's a different behavior between glibc and pg_c_utf8 for codepoints in the decimal digit category outside of the US-ASCII range '0'..'9':

    select initcap(concat(chr(0xff11), 'a') collate "C.utf8");  -- glibc 2.35
     initcap
    ---------
     1a

    select initcap(concat(chr(0xff11), 'a') collate "pg_c_utf8");
     initcap
    ---------
     1A

Both collations consider that chr(0xff11) is not a digit (isdigit() => false), but C.utf8 says that it's alpha, whereas pg_c_utf8 says it's neither digit nor alpha. AFAIU this is why, in the above initcap() call, pg_c_utf8 considers that 'a' is the first alphanumeric, whereas C.utf8 considers that '1' is the first alphanumeric, leading to different capitalizations.

Comparing the 3 providers:

    WITH v(provider,type,result) AS (values
      ('ICU',     'isalpha', chr(0xff11) ~ '[[:alpha:]]' collate "unicode"),
      ('glibc',   'isalpha', chr(0xff11) ~ '[[:alpha:]]' collate "C.utf8"),
      ('builtin', 'isalpha', chr(0xff11) ~ '[[:alpha:]]' collate "pg_c_utf8"),
      ('ICU',     'isdigit', chr(0xff11) ~ '[[:digit:]]' collate "unicode"),
      ('glibc',   'isdigit', chr(0xff11) ~ '[[:digit:]]' collate "C.utf8"),
      ('builtin', 'isdigit', chr(0xff11) ~ '[[:digit:]]' collate "pg_c_utf8")
    )
    select * from v \crosstabview

     provider | isalpha | isdigit
    ----------+---------+---------
     ICU      | f       | t
     glibc    | t       | f
     builtin  | f       | f

Are we fine with pg_c_utf8 differing from both ICU's point of view (U+ff11 is digit and not alpha) and glibc's point of view (U+ff11 is not digit, but it's alpha)?

Aside from initcap(), this is going to be significant for regular expressions.
Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
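The three-way disagreement over U+FF11 can also be checked against the raw Unicode character data. A hedged aside, independent of any of the three providers: U+FF11 is general category Nd with a decimal value, so a Standard-variant classifier (as in the "ICU" row above) calls it a digit but not alpha, while as I read TR #18 the POSIX Compatible variant restricts [[:digit:]] to ASCII '0'..'9', explaining the builtin provider's f/f row:

```python
import unicodedata

ch = "\uFF11"  # FULLWIDTH DIGIT ONE

# General category Nd (decimal digit number), carrying the value 1
assert unicodedata.category(ch) == "Nd"
assert unicodedata.decimal(ch) == 1

# Python classifies straight from the Unicode properties, matching the
# Standard-variant "ICU" row above: digit, not alpha
assert ch.isdigit()
assert not ch.isalpha()
```

glibc's C.UTF-8 answer (alpha but not digit) matches neither variant of the TR #18 compatibility properties, which is the crux of the question posed above.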
On Wed, 2024-03-27 at 16:53 +0100, Daniel Verite wrote:
>  provider | isalpha | isdigit
> ----------+---------+---------
>  ICU      | f       | t
>  glibc    | t       | f
>  builtin  | f       | f

The "ICU" above is really the behavior of the Postgres ICU provider as we implemented it; it's not something forced on us by ICU. For the ICU provider, pg_wc_isalpha() is defined as u_isalpha()[1] and pg_wc_isdigit() is defined as u_isdigit()[2]. Those, in turn, are defined by ICU to be equivalent to java.lang.Character.isLetter() and java.lang.Character.isDigit().

ICU documents[3] how regex character classes should be implemented using the ICU APIs, and cites Unicode TR #18 [4] as the source. Despite being under the heading "...for C/POSIX character classes...", [3] says it's based on the "Standard" variant of [4], rather than "POSIX Compatible".

(Aside: the Postgres ICU provider doesn't match what [3] suggests for the "alpha" class. For the character U+FF11 it doesn't matter, but I suspect there are differences for other characters. This should be fixed.)

The differences between PG_C_UTF8 and what ICU suggests are just because the former uses the "POSIX Compatible" definitions and the latter uses "Standard". I implemented both the "Standard" and "POSIX Compatible" compatibility properties in ad49994538, so it would be easy to change what PG_C_UTF8 uses.

[1] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#aecff8611dfb1814d1770350378b3b283
[2] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#a42b37828d86daa0fed18b381130ce1e6
[3] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#details
[4] http://www.unicode.org/reports/tr18/#Compatibility_Properties

> Are we fine with pg_c_utf8 differing from both ICU's point of view
> (U+ff11 is digit and not alpha) and glibc point of view (U+ff11 is not
> digit, but it's alpha)?

Yes, some differences are to be expected.
But I'm fine making a change to PG_C_UTF8 if it makes sense, as long as we can point to something other than "glibc version 2.35 does it this way". Regards, Jeff Davis
On Tue, 2024-03-26 at 08:04 +0100, Peter Eisentraut wrote:
> The patch set v27 is ok with me, modulo (a) discussion about initcap
> semantics, and (b) what collation to assign to ucs_basic, which can be
> revisited later.

I held off on the refactoring patch for lc_{ctype|collate}_is_c(). There's an explicit "NB: pg_newlocale_from_collation is only supposed to be called on non-C-equivalent locales" comment in DefineCollation().

What I'd like to do is make it possible to create valid pg_locale_t objects out of C locales, which can be used anywhere a real locale can be used. Callers can still check lc_{collate|ctype}_is_c() for various reasons; but if they did call pg_newlocale_from_collation on a C locale it would at least work for the pg_locale.h APIs. That would be a slightly simpler and safer API, and make it easier to do the collation version check consistently.

That's not very complicated, but it's a bit invasive and probably out of scope for v17. It might be part of another change I had intended for a while, which is to make NULL an invalid pg_locale_t, and use a different representation to mean "use the server environment". That would clean up a lot of checks for NULL.

For now, we'd still like to add the version number to the builtin collations, so that leaves us with two options:

(a) Perform the version check in lc_{collate|ctype}_is_c(), which duplicates some code and creates some inconsistency in how the version is checked for different providers.

(b) Don't worry about it and just commit the version change in v27-0001. The version check is already performed correctly on the database without changes, even if the locale is "C". And there are already three built-in "C" collations: "C", "POSIX", and UCS_BASIC; so it's not clear why someone would create even more of them. And even if they did, there would be no reason to give them a warning because we haven't incremented the version, so there's no chance of a mismatch.

I'm inclined toward (b). Thoughts?
Regards, Jeff Davis
On Tue, 2024-03-26 at 08:04 +0100, Peter Eisentraut wrote:
> The patch set v27 is ok with me, modulo (a) discussion about initcap
> semantics, and (b) what collation to assign to ucs_basic, which can be
> revisited later.

Attached v28. The remaining patches are for full case mapping and PG_UNICODE_FAST.

I am fine waiting until July to get these remaining patches committed. That would give us time to sort out details like:

* Get consensus that it's OK to change UCS_BASIC.
* Figure out if we need a pg-specific locale and whether PG_UNICODE_FAST is the right name.
* Make sure that full case mapping interacts with regexes in a sane way (probably it needs to just fall back to simple case mapping, but perhaps that's worth a discussion).
* Implement case folding.
* Implement a more unicode-friendly TITLECASE() function, which could offer a number of options that don't fit well with INITCAP().
* Figure out if UPPER()/LOWER() should also have some of those options.

Thoughts?

Regards,
Jeff Davis
On Tue, 2024-03-26 at 08:14 +0100, Peter Eisentraut wrote:
> Full vs. simple case mapping is more of a legacy compatibility question,
> in my mind. There is some expectation/precedent that C.UTF-8 uses
> simple case mapping, but beyond that, I don't see a reason why someone
> would want to explicitly opt for simple case mapping, other than if they
> need length preservation or something, but if they need that, then they
> are going to be in a world of pain in Unicode anyway.

I mostly agree, though there are some other purposes for the simple mapping:

* a substitute for case folding: lower() with simple case mapping will work better for that purpose than lower() with full case mapping (after we have casefold(), this won't be a problem)

* simple case mapping is conceptually simpler, and that's a benefit by itself in some situations -- maybe the 1:1 assumption exists other places in their application

> > There's also another reason to consider it an argument rather than a
> > collation property, which is that it might be dependent on some other
> > field in a row. I could imagine someone wanting to do:
> >
> > SELECT
> >   UPPER(some_field,
> >         full => true,
> >         dotless_i => CASE other_field WHEN ...)
> > FROM ...
>
> Can you index this usefully? It would only work if the user query
> matches exactly this pattern?

In that example, UPPER is used in the target list -- the WHERE clause might be indexable. The UPPER is just used for display purposes, and may depend on some locale settings stored in another table associated with a particular user.

Regards,
Jeff Davis
On 01.04.24 21:52, Jeff Davis wrote:
> On Tue, 2024-03-26 at 08:04 +0100, Peter Eisentraut wrote:
>> The patch set v27 is ok with me, modulo (a) discussion about initcap
>> semantics, and (b) what collation to assign to ucs_basic, which can be
>> revisited later.
>
> Attached v28.
>
> The remaining patches are for full case mapping and PG_UNICODE_FAST.
>
> I am fine waiting until July to get these remaining patches committed.
> That would give us time to sort out details like:
>
> * Get consensus that it's OK to change UCS_BASIC.
> * Figure out if we need a pg-specific locale and whether
>   PG_UNICODE_FAST is the right name.
> * Make sure that full case mapping interacts with regexes in a sane way
>   (probably it needs to just fall back to simple case mapping, but
>   perhaps that's worth a discussion).
> * Implement case folding.
> * Implement a more unicode-friendly TITLECASE() function, which could
>   offer a number of options that don't fit well with INITCAP().
> * Figure out if UPPER()/LOWER() should also have some of those options.
>
> Thoughts?

Yeah, I think it's good to give some more time to work out these things. The features committed for PG17 so far are solid, so it's a good point to pause.
Hi,

+command_ok(
+	[
+		'initdb', '--no-sync',
+		'--locale-provider=builtin', '-E UTF-8',
+		'--builtin-locale=C.UTF-8', "$tempdir/data8"
+	],
+	'locale provider builtin with -E UTF-8 --builtin-locale=C.UTF-8');

This Sun animal recently turned on --enable-tap-tests, and that ↑ failed[1]:

# Running: initdb --no-sync --locale-provider=builtin -E UTF-8 --builtin-locale=C.UTF-8 /home/marcel/build-farm-15/buildroot/HEAD/pgsql.build/src/bin/initdb/tmp_check/tmp_test_XvK1/data8
The files belonging to this database system will be owned by user "marcel".
This user must also own the server process.

The database cluster will be initialized with this locale configuration:
  locale provider:   builtin
  default collation: C.UTF-8
  LC_COLLATE:  en_US
  LC_CTYPE:    en_US
  LC_MESSAGES: C
  LC_MONETARY: en_US
  LC_NUMERIC:  en_US
  LC_TIME:     en_US
initdb: error: encoding mismatch
initdb: detail: The encoding you selected (UTF8) and the encoding that the selected locale uses (LATIN1) do not match. This would lead to misbehavior in various character string processing functions.
initdb: hint: Rerun initdb and either do not specify an encoding explicitly, or choose a matching combination.
[14:04:12.462](0.036s) not ok 28 - locale provider builtin with -E UTF-8 --builtin-locale=C.UTF-8

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=margay&dt=2024-04-04%2011%3A42%3A40
On Fri, 2024-04-05 at 11:22 +1300, Thomas Munro wrote:
> Hi,
>
> +command_ok(
> +	[
> +		'initdb', '--no-sync',
> +		'--locale-provider=builtin', '-E UTF-8',
> +		'--builtin-locale=C.UTF-8', "$tempdir/data8"
> +	],
> +	'locale provider builtin with -E UTF-8 --builtin-locale=C.UTF-8');
...
> LC_COLLATE:  en_US
> LC_CTYPE:    en_US
> LC_MESSAGES: C
> LC_MONETARY: en_US
> LC_NUMERIC:  en_US
> LC_TIME:     en_US
> initdb: error: encoding mismatch
> initdb: detail: The encoding you selected (UTF8) and the encoding that
> the selected locale uses (LATIN1) do not match.

Thank you for the report. I fixed it in e2a2357671 by forcing the environment locale to C, which is compatible with any encoding. The test still forces the encoding to UTF-8 and the collation to the builtin C.UTF-8.

In passing, I noticed some unrelated regression test failures when I set LANG=tr_TR: tsearch, tsdict, json, and jsonb. There's an additional failure in the updatable_views test when LANG=tr_TR.utf8. I haven't looked into the details yet.

Regards,
Jeff Davis
On Wed, Mar 20, 2024 at 05:13:26PM -0700, Jeff Davis wrote:
> On Tue, 2024-03-19 at 13:41 +0100, Peter Eisentraut wrote:
> > * v25-0002-Support-C.UTF-8-locale-in-the-new-builtin-collat.patch
> >
> > Looks ok.
>
> Committed.

>      <varlistentry>
> +     <term><literal>pg_c_utf8</literal></term>
> +     <listitem>
> +      <para>
> +       This collation sorts by Unicode code point values rather than natural
> +       language order.  For the functions <function>lower</function>,
> +       <function>initcap</function>, and <function>upper</function>, it uses
> +       Unicode simple case mapping.  For pattern matching (including regular
> +       expressions), it uses the POSIX Compatible variant of Unicode <ulink
> +       url="https://www.unicode.org/reports/tr18/#Compatibility_Properties">Compatibility
> +       Properties</ulink>.  Behavior is efficient and stable within a
> +       <productname>Postgres</productname> major version.  This collation is
> +       only available for encoding <literal>UTF8</literal>.
> +      </para>
> +     </listitem>
> +    </varlistentry>

lower(), initcap(), upper(), and regexp_matches() are PROVOLATILE_IMMUTABLE. Until now, we've delegated that responsibility to the user. The user is supposed to somehow never update libc or ICU in a way that changes outcomes from these functions. Now that postgresql.org is taking that responsibility for builtin C.UTF-8, how should we govern it? I think the above text and [1] convey that we'll update the Unicode data between major versions, making functions like lower() effectively STABLE. Is that right?

(This thread had some discussion[2] that datcollversion/collversion won't necessarily change when a major version changes lower() behavior.)

[1] https://postgr.es/m/7089acb3ebac0c1682a79c8bc16803cf06896fb9.camel@j-davis.com
[2] https://postgr.es/m/5a1ecc40539f36cac5b27a62739a45a49785ca54.camel@j-davis.com
On Sat, 2024-06-29 at 15:08 -0700, Noah Misch wrote:
> lower(), initcap(), upper(), and regexp_matches() are PROVOLATILE_IMMUTABLE.
> Until now, we've delegated that responsibility to the user. The user is
> supposed to somehow never update libc or ICU in a way that changes outcomes
> from these functions.

To me, "delegated" connotes a clear and organized transfer of responsibility to the right person to solve it. In that sense, I disagree that we've delegated it. What's happened here is evolution of various choices that seemed reasonable at the time. Unfortunately, the consequences are hard for us to manage and even harder for users to manage themselves.

> Now that postgresql.org is taking that responsibility for builtin C.UTF-8,
> how should we govern it? I think the above text and [1] convey that we'll
> update the Unicode data between major versions, making functions like
> lower() effectively STABLE. Is that right?

Marking them STABLE is not a viable option; that would break a lot of valid use cases, e.g. an index on LOWER().

Unicode already has its own governance, including a stability policy that includes case mapping:

https://www.unicode.org/policies/stability_policy.html#Case_Pair

Granted, that policy does not guarantee that the results will never change. In particular, the results can change if using unassigned code points that are later assigned to Cased characters. That's not terribly common though; for instance, there are zero changes in uppercase/lowercase behavior between Unicode 14.0 (2021) and 15.1 (current) -- even for code points that were unassigned in 14.0 and later assigned. I checked this by modifying case_test.c to look at unassigned code points as well.

There's a greater chance that character properties can change (e.g. whether a character is "alphabetic" or not) in new releases of Unicode.
Such properties can affect regex character classifications, and in some cases the results of initcap (because it uses the "alphanumeric" classification to determine word boundaries).

I don't think we need code changes for 17. Some documentation changes might be helpful, though. Should we have a note around LOWER()/UPPER() that users should REINDEX any dependent indexes when the provider is updated?

> (This thread had some discussion[2] that datcollversion/collversion won't
> necessarily change when a major versions changes lower() behavior.)

datcollversion/collversion track the version of the collation specifically (text ordering only), not the ctype (character semantics). When using the libc provider, get_collation_actual_version() completely ignores the ctype. It would be interesting to consider tracking the versions separately, though.

Regards,
	Jeff Davis
On Mon, Jul 01, 2024 at 12:24:15PM -0700, Jeff Davis wrote: > On Sat, 2024-06-29 at 15:08 -0700, Noah Misch wrote: > > lower(), initcap(), upper(), and regexp_matches() are > > PROVOLATILE_IMMUTABLE. > > Until now, we've delegated that responsibility to the user. The user > > is > > supposed to somehow never update libc or ICU in a way that changes > > outcomes > > from these functions. > > To me, "delegated" connotes a clear and organized transfer of > responsibility to the right person to solve it. In that sense, I > disagree that we've delegated it. Good point. > > Now that postgresql.org is taking that responsibility > > for builtin C.UTF-8, how should we govern it? I think the above text > > and [1] > > convey that we'll update the Unicode data between major versions, > > making > > functions like lower() effectively STABLE. Is that right? > > Marking them STABLE is not a viable option, that would break a lot of > valid use cases, e.g. an index on LOWER(). I agree. > I don't think we need code changes for 17. Some documentation changes > might be helpful, though. Should we have a note around LOWER()/UPPER() > that users should REINDEX any dependent indexes when the provider is > updated? I agree the v17 code is fine. Today, a user can (with difficulty) choose dependency libraries so regexp_matches() is IMMUTABLE, as marked. I don't want $SUBJECT to be the ctype that, at some post-v17 version, can't achieve that with unpatched PostgreSQL. Let's change the documentation to say this provider uses a particular snapshot of Unicode data, taken around PostgreSQL 17. We plan never to change that data, so IMMUTABLE functions can rely on the data. If we provide a newer Unicode data set in the future, we'll provide it in such a way that DDL must elect the new data. How well would that suit your vision for this feature? An alternative would be to make pg_upgrade reject operating on a cluster that contains use of $SUBJECT.
On Mon, 2024-07-01 at 16:03 -0700, Noah Misch wrote: > I agree the v17 code is fine. Today, a user can (with difficulty) > choose > dependency libraries so regexp_matches() is IMMUTABLE, as marked. I > don't > want $SUBJECT to be the ctype that, at some post-v17 version, can't > achieve > that with unpatched PostgreSQL. We aren't forcing anyone to use the builtin "C.UTF-8" locale. Anyone can still use the builtin "C" locale (which never changes), or another provider if they can sort out the difficulties (and live with the consequences) of pinning the dependencies to a specific version. > Let's change the documentation to say this > provider uses a particular snapshot of Unicode data, taken around > PostgreSQL > 17. We plan never to change that data, so IMMUTABLE functions can > rely on the > data. We can discuss this in the context of version 18 or the next time we plan to update Unicode. I don't think we should make such a promise in version 17. > If we provide a newer Unicode data set in the future, we'll provide > it > in such a way that DDL must elect the new data. How well would that > suit your > vision for this feature? Thomas tried tracking collation versions along with individual objects, and it had to be reverted (ec48314708). It fits my vision to do something like that as a way of tightening things up. But there are some open design questions we need to settle, along with a lot of work. So I don't think we should pre-emptively block all Unicode updates waiting for it. > An alternative would be to make pg_upgrade reject > operating on a cluster that contains use of $SUBJECT. That wouldn't help anyone. Regards, Jeff Davis
On Mon, Jul 01, 2024 at 06:19:08PM -0700, Jeff Davis wrote: > On Mon, 2024-07-01 at 16:03 -0700, Noah Misch wrote: > > An alternative would be to make pg_upgrade reject > > operating on a cluster that contains use of $SUBJECT. > > That wouldn't help anyone. Can you say more about that? For the last decade at least, I think our standard for new features has been to error rather than allow an operation that creates a known path to wrong query results. I think that's a helpful standard that we should continue to follow.
On 02.07.24 18:51, Noah Misch wrote: > On Mon, Jul 01, 2024 at 06:19:08PM -0700, Jeff Davis wrote: >> On Mon, 2024-07-01 at 16:03 -0700, Noah Misch wrote: >>> An alternative would be to make pg_upgrade reject >>> operating on a cluster that contains use of $SUBJECT. >> >> That wouldn't help anyone. > > Can you say more about that? For the last decade at least, I think our > standard for new features has been to error rather than allow an operation > that creates a known path to wrong query results. I think that's a helpful > standard that we should continue to follow. I don't think the builtin locale provider is any different in this respect from the other providers: The locale data might change and there is a version mechanism to track that. We don't prevent pg_upgrade in scenarios like that for other providers.
On Wed, Jul 03, 2024 at 12:05:09AM +0200, Peter Eisentraut wrote: > On 02.07.24 18:51, Noah Misch wrote: > > On Mon, Jul 01, 2024 at 06:19:08PM -0700, Jeff Davis wrote: > > > On Mon, 2024-07-01 at 16:03 -0700, Noah Misch wrote: > > > > An alternative would be to make pg_upgrade reject > > > > operating on a cluster that contains use of $SUBJECT. > > > > > > That wouldn't help anyone. > > > > Can you say more about that? For the last decade at least, I think our > > standard for new features has been to error rather than allow an operation > > that creates a known path to wrong query results. I think that's a helpful > > standard that we should continue to follow. > > I don't think the builtin locale provider is any different in this respect > from the other providers: The locale data might change and there is a > version mechanism to track that. We don't prevent pg_upgrade in scenarios > like that for other providers. Each packager can choose their dependencies so the v16 providers don't have the problem. With the $SUBJECT provider, a packager won't have that option.
On Tue, 2024-07-02 at 16:03 -0700, Noah Misch wrote: > Each packager can choose their dependencies so the v16 providers > don't have > the problem. With the $SUBJECT provider, a packager won't have that > option. While nothing needs to be changed for 17, I agree that we may need to be careful in future releases not to break things. Broadly speaking, you are right that we may need to freeze Unicode updates or be more precise about versioning. But there's a lot of nuance to the problem, so I don't think we should pre-emptively promise either of those things right now. Consider: * Unless I made a mistake, the last three releases of Unicode (14.0, 15.0, and 15.1) all have the exact same behavior for UPPER() and LOWER() -- even for unassigned code points. It would be silly to promise to stay with 15.1 and then realize that moving to 16.0 doesn't create any actual problem. * Unicode also offers "case folding", which has even stronger stability guarantees, and I plan to propose that soon. When implemented, it would be preferred over LOWER()/UPPER() in index expressions for most use cases. * While someone can pin libc+ICU to particular versions, it's impossible when using the official packages, and additionally requires using something like [1], which just became available last year. I don't think it's reasonable to put it forth as a matter-of-fact solution. * Let's keep some perspective: we've lived for a long time with ALL text indexes at serious risk of breakage. In contrast, the concerns you are raising now are about certain kinds of expression indexes over data containing certain unassigned code points. I am not dismissing that concern, but the builtin provider moves us in the right direction and let's not lose sight of that. Given that no code changes for v17 are proposed, I suggest that we refrain from making any declarations until the next version of Unicode is released. 
If the pattern holds, that will be around September, which still leaves time to make reasonable decisions for v18. Regards, Jeff Davis [1] https://github.com/awslabs/compat-collation-for-glibc
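[Editor's aside: "case folding", mentioned in the second bullet above, is a distinct Unicode operation from lowercasing, designed for caseless matching and carrying stronger stability guarantees. Python's str.casefold() implements full case folding and makes the difference easy to see (illustration only -- the eventual PostgreSQL function's semantics may differ in detail):]

```python
# Case folding vs. lowercasing: German sharp s (U+00DF) is the classic
# example. lower() leaves it alone, while full case folding maps it to
# "ss" so that caseless comparison works.
assert "ß".lower() == "ß"
assert "ß".casefold() == "ss"

# Folding makes "STRASSE" and "straße" compare equal caselessly, where
# lower() alone would not:
assert "STRASSE".lower() != "straße".lower()
assert "STRASSE".casefold() == "straße".casefold()
```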
On Wed, Jul 03, 2024 at 02:19:07PM -0700, Jeff Davis wrote: > * Unless I made a mistake, the last three releases of Unicode (14.0, > 15.0, and 15.1) all have the exact same behavior for UPPER() and > LOWER() -- even for unassigned code points. It would be silly to > promise to stay with 15.1 and then realize that moving to 16.0 doesn't > create any actual problem. I think you're saying that if some Unicode update changes the results of a STABLE function but does not change the result of any IMMUTABLE function, we may as well import that update. Is that about right? If so, I agree. In addition to the options I listed earlier (error in pg_upgrade or document that IMMUTABLE stands) I would be okay with a third option. Decide here that we'll not adopt a Unicode update in a way that changes a v17 IMMUTABLE function result of the new provider. We don't need to write that in the documentation, since it's implicit in IMMUTABLE. Delete the "stable within a <productname>Postgres</productname> major version" documentation text. > * While someone can pin libc+ICU to particular versions, it's > impossible when using the official packages, and additionally requires > using something like [1], which just became available last year. I > don't think it's reasonable to put it forth as a matter-of-fact > solution. > > * Let's keep some perspective: we've lived for a long time with ALL > text indexes at serious risk of breakage. In contrast, the concerns you > are raising now are about certain kinds of expression indexes over data > containing certain unassigned code points. I am not dismissing that > concern, but the builtin provider moves us in the right direction and > let's not lose sight of that. I see you're trying to help users get less breakage, and that's a good goal. I agree $SUBJECT eliminates libc+ICU breakage, and libc+ICU breakage has hurt plenty. However, you proposed to update Unicode data and give REINDEX as the solution to breakage this causes. 
Unlike libc+ICU breakage, the packager has no escape from that. That's a different kind of breakage proposition, and no new PostgreSQL feature should do that. It's on a different axis from helping users avoid libc+ICU breakage, and a feature doesn't get to credit helping on one axis against a regression on the other axis. What am I missing here? > Given that no code changes for v17 are proposed, I suggest that we > refrain from making any declarations until the next version of Unicode > is released. If the pattern holds, that will be around September, which > still leaves time to make reasonable decisions for v18. Soon enough, a Unicode release will add one character to regexp [[:alpha:]]. PostgreSQL will then need to decide what IMMUTABLE is going to mean. How does that get easier in September? Thanks, nm > [1] https://github.com/awslabs/compat-collation-for-glibc
Noah Misch wrote:

> > I don't think the builtin locale provider is any different in this respect
> > from the other providers: The locale data might change and there is a
> > version mechanism to track that. We don't prevent pg_upgrade in scenarios
> > like that for other providers.
>
> Each packager can choose their dependencies so the v16 providers don't have
> the problem. With the $SUBJECT provider, a packager won't have that option.

The Unicode data files downloaded into src/common/unicode/ depend on the versions defined in Makefile.global.in:

# Unicode data information
# Before each major release, update these and run make update-unicode.
# Pick a release from here: <https://www.unicode.org/Public/>. Note
# that the most recent release listed there is often a pre-release;
# don't pick that one, except for testing.
UNICODE_VERSION = 15.1.0
# Pick a release from here: <http://cldr.unicode.org/index/downloads>
CLDR_VERSION = 45

(CLDR_VERSION is apparently not used yet.)

When these versions get bumped, it seems like packagers could stick to previous versions by just overriding these. When doing that, are there any functions that may have an immutability breakage problem with the built-in locale provider? (I would expect none.)

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
On Fri, 2024-07-05 at 13:55 +0200, Daniel Verite wrote: > When these versions get bumped, it seems like packagers could stick > to > previous versions by just overriding these. That's an interesting point. It's actually easier for a packager to pin Unicode to a specific version than to pin libc to a specific version. > When doing that, are there any function that may have an immutability > breakage problem with the built-in locale provider? (I would expect > none). Right, there wouldn't be any breakage without new data files. Regards, Jeff Davis
On Thu, 2024-07-04 at 14:26 -0700, Noah Misch wrote:
> I think you're saying that if some Unicode update changes the results of a
> STABLE function but does not change the result of any IMMUTABLE function, we
> may as well import that update. Is that about right? If so, I agree.

If you are proposing that Unicode updates should not be performed if they affect the results of any IMMUTABLE function, then that's a new policy.

For instance, the results of NORMALIZE() changed from PG15 to PG16 due to commit 1091b48cd7:

SELECT NORMALIZE(U&'\+01E030',nfkc)::bytea;

Version 15: \xf09e80b0
Version 16: \xd0b0

I am neither endorsing nor opposing the new policy you propose at this time, but deep in the sub-thread of one particular feature is not the right place to discuss it. Please start a new thread for the proposed PG18 policy change and CC me.

I happen to think that around the release of the next version of Unicode (in a couple months) would be the most productive time to have that discussion, but you can start the discussion now if you like.

Regards,
	Jeff Davis
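[Editor's aside: the NORMALIZE() change described above is purely a Unicode data difference, so it can be reproduced outside PostgreSQL. Assuming the cause is that U+1E030 (Cyrillic Extended-D block) was unassigned before Unicode 15.0 and gained an NFKC mapping to U+0430 when assigned, a Python interpreter shows one result or the other depending on which Unicode version it ships:]

```python
import unicodedata

# U+1E030 MODIFIER LETTER CYRILLIC SMALL A: under Unicode < 15.0 it was
# unassigned and NFKC leaves it alone (UTF-8 f0 9e 80 b0); from 15.0 on,
# NFKC maps it to U+0430 CYRILLIC SMALL LETTER A (UTF-8 d0 b0) -- the same
# two byte strings shown for PG 15 vs. PG 16 above.
result = unicodedata.normalize("NFKC", "\U0001E030").encode("utf-8")
print(unicodedata.unidata_version, result)
assert result in (b"\xf0\x9e\x80\xb0", b"\xd0\xb0")
```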
On Fri, Jul 05, 2024 at 02:38:45PM -0700, Jeff Davis wrote: > On Thu, 2024-07-04 at 14:26 -0700, Noah Misch wrote: > > I think you're saying that if some Unicode update changes the results > > of a > > STABLE function but does not change the result of any IMMUTABLE > > function, we > > may as well import that update. Is that about right? If so, I > > agree. > > If you are proposing that Unicode updates should not be performed if > they affect the results of any IMMUTABLE function, then that's a new > policy. > > For instance, the results of NORMALIZE() changed from PG15 to PG16 due > to commit 1091b48cd7: > > SELECT NORMALIZE(U&'\+01E030',nfkc)::bytea; > > Version 15: \xf09e80b0 > > Version 16: \xd0b0 As a released feature, NORMALIZE() has a different set of remedies to choose from, and I'm not proposing one. I may have sidetracked this thread by talking about remedies without an agreement that pg_c_utf8 has a problem. My question for the PostgreSQL maintainers is this: textregexeq(... COLLATE pg_c_utf8, '[[:alpha:]]') and lower(), despite being IMMUTABLE, will change behavior in some major releases. pg_upgrade does not have a concept of IMMUTABLE functions changing, so index scans will return wrong query results after upgrade. Is it okay for v17 to release a pg_c_utf8 planned to behave that way when upgrading v17 to v18+? If the answer is yes, the open item closes. If the answer is no, determining the remedy can come next. 
Lest concrete details help anyone reading, here are some affected objects:

CREATE TABLE t (s text COLLATE pg_c_utf8);
INSERT INTO t VALUES (U&'\+00a7dc'), (U&'\+001dd3');
CREATE INDEX iexpr ON t ((lower(s)));
CREATE INDEX ipred ON t (s) WHERE s ~ '[[:alpha:]]';

v17 can simulate the Unicode aspect of a v18 upgrade, like this:

sed -i 's/^UNICODE_VERSION.*/UNICODE_VERSION = 16.0.0/' src/Makefile.global.in
# ignore test failures (your ICU likely doesn't have the Unicode 16.0.0 draft)
make -C src/common/unicode update-unicode
make
make install
pg_ctl restart

Behavior after that:

-- 2 rows w/ seq scan, 0 rows w/ index scan
SELECT 1 FROM t WHERE s ~ '[[:alpha:]]';
SET enable_seqscan = off;
SELECT 1 FROM t WHERE s ~ '[[:alpha:]]';

-- ERROR: heap tuple (0,1) from table "t" lacks matching index tuple within index "iexpr"
SELECT bt_index_parent_check('iexpr', heapallindexed => true);

-- ERROR: heap tuple (0,1) from table "t" lacks matching index tuple within index "ipred"
SELECT bt_index_parent_check('ipred', heapallindexed => true);
Noah Misch <noah@leadboat.com> writes: > As a released feature, NORMALIZE() has a different set of remedies to choose > from, and I'm not proposing one. I may have sidetracked this thread by > talking about remedies without an agreement that pg_c_utf8 has a problem. My > question for the PostgreSQL maintainers is this: > textregexeq(... COLLATE pg_c_utf8, '[[:alpha:]]') and lower(), despite being > IMMUTABLE, will change behavior in some major releases. pg_upgrade does not > have a concept of IMMUTABLE functions changing, so index scans will return > wrong query results after upgrade. Is it okay for v17 to release a > pg_c_utf8 planned to behave that way when upgrading v17 to v18+? I do not think it is realistic to define "IMMUTABLE" as meaning that the function will never change behavior until the heat death of the universe. As a counterexample, we've not worried about applying bug fixes or algorithm improvements that change the behavior of "immutable" numeric computations. It might be unwise to do that in a minor release, but we certainly do it in major releases. I'd say a realistic policy is "immutable means we don't intend to change it within a major release". If we do change the behavior, either as a bug fix or a major-release improvement, that should be release-noted so that people know they have to rebuild dependent indexes and matviews. It gets stickier for behaviors that aren't fully under our control, which is the case for a lot of locale-related things. We cannot then promise "no changes within major releases". But I do not think it is helpful to react to that fact by refusing to label such things immutable. Then we'd just need another mutability classification, and it would effectively act the same as immutable does now, because people will certainly wish to use these functions in indexes etc. regards, tom lane
On Jul 6, 2024, at 12:51 PM, Noah Misch <noah@leadboat.com> wrote:
> Behavior after that:
>
> -- 2 rows w/ seq scan, 0 rows w/ index scan
> SELECT 1 FROM t WHERE s ~ '[[:alpha:]]';
> SET enable_seqscan = off;
> SELECT 1 FROM t WHERE s ~ '[[:alpha:]]';
> -- ERROR: heap tuple (0,1) from table "t" lacks matching index tuple within index "iexpr"
> SELECT bt_index_parent_check('iexpr', heapallindexed => true);
> -- ERROR: heap tuple (0,1) from table "t" lacks matching index tuple within index "ipred"
> SELECT bt_index_parent_check('ipred', heapallindexed => true);
Other databases do still ship built-in ancient versions of Unicode (Db2 ships 4.0+ and Oracle ships 6.1+), and they have added new Unicode versions alongside the old but not removed the old versions. They claim to have “deprecated” old versions… but it seems they haven’t been able to get rid of them yet. Maybe some customer is willing to pay to continue deferring painful rebuilds needed to get rid of the old collation versions in commercial DBs?
For reference, see the table on slide 56 at https://www.pgevents.ca/events/pgconfdev2024/schedule/session/95-collations-from-a-to-z/ and also see https://ardentperf.com/2024/05/22/default-sort-order-in-db2-sql-server-oracle-postgres-17/
Thanks for the illustration with actual Unicode 16 draft data.
Also, not directly related to this email… but reiterating a point I argued for in the recorded talk at pgconf.dev in Vancouver: a very strong argument for having the DB default to a stable unchanging built-in collation is that the dependency tracking makes it easy to identify objects in the database using non-default collations, and it’s easy to know exactly what needs to be rebuilt for a user to safely change some non-default collation provider’s behavior.
-Jeremy
Sent from my TI-83
On Sat, Jul 06, 2024 at 04:19:21PM -0400, Tom Lane wrote: > Noah Misch <noah@leadboat.com> writes: > > As a released feature, NORMALIZE() has a different set of remedies to choose > > from, and I'm not proposing one. I may have sidetracked this thread by > > talking about remedies without an agreement that pg_c_utf8 has a problem. My > > question for the PostgreSQL maintainers is this: > > > textregexeq(... COLLATE pg_c_utf8, '[[:alpha:]]') and lower(), despite being > > IMMUTABLE, will change behavior in some major releases. pg_upgrade does not > > have a concept of IMMUTABLE functions changing, so index scans will return > > wrong query results after upgrade. Is it okay for v17 to release a > > pg_c_utf8 planned to behave that way when upgrading v17 to v18+? > > I do not think it is realistic to define "IMMUTABLE" as meaning that > the function will never change behavior until the heat death of the > universe. As a counterexample, we've not worried about applying > bug fixes or algorithm improvements that change the behavior of > "immutable" numeric computations. True. There's a continuum from "releases can change any IMMUTABLE function" to "index integrity always wins, even if a function is as wrong as 1+1=3". I'm less concerned about the recent "Incorrect results from numeric round" thread, even though it's proposing to back-patch. I'm thinking about these aggravating factors for $SUBJECT: - $SUBJECT is planning an annual cadence of this kind of change. - We already have ICU providing collation support for the same functions. Unlike $SUBJECT, ICU integration gives packagers control over when to accept corruption at pg_upgrade time. - SQL Server, DB2 and Oracle do their Unicode updates in a non-corrupting way. (See Jeremy Schneider's reply concerning DB2 and Oracle.) - lower() and regexp are more popular in index expressions than high-digit-count numeric calculations. 
> I'd say a realistic policy is "immutable means we don't intend to > change it within a major release". If we do change the behavior, > either as a bug fix or a major-release improvement, that should > be release-noted so that people know they have to rebuild dependent > indexes and matviews. It sounds like you're very comfortable with $SUBJECT proceeding in its current form. Is that right?
Noah Misch <noah@leadboat.com> writes: > It sounds like you're very comfortable with $SUBJECT proceeding in its current > form. Is that right? I don't have an opinion on whether the overall feature design is well-chosen. But the mere fact that Unicode updates will from time to time change the behavior (presumably only in edge cases or for previously-unassigned code points) doesn't strike me as a big enough problem to justify saying these functions can't be marked immutable anymore. Especially since we have been faced with that problem all along anyway; we just didn't have a way to track or quantify it before, because locale changes happened outside code we control. regards, tom lane
On Mon, 2024-07-08 at 18:05 -0700, Noah Misch wrote:
> > I do not think it is realistic to define "IMMUTABLE" as meaning that
> > the function will never change behavior until the heat death of the
> > universe. As a counterexample, we've not worried about applying
> > bug fixes or algorithm improvements that change the behavior of
> > "immutable" numeric computations.
>
> True. There's a continuum from "releases can change any IMMUTABLE function"
> to "index integrity always wins, even if a function is as wrong as 1+1=3".
> I'm less concerned about the recent "Incorrect results from numeric round"
> thread, even though it's proposing to back-patch. I'm thinking about these
> aggravating factors for $SUBJECT:
>
> - $SUBJECT is planning an annual cadence of this kind of change.
>
> - We already have ICU providing collation support for the same functions.
>   Unlike $SUBJECT, ICU integration gives packagers control over when to
>   accept corruption at pg_upgrade time.
>
> - SQL Server, DB2 and Oracle do their Unicode updates in a non-corrupting
>   way. (See Jeremy Schneider's reply concerning DB2 and Oracle.)
>
> - lower() and regexp are more popular in index expressions than
>   high-digit-count numeric calculations.

My personal experience is that very few users are aware of or care about the strict accuracy of the collation sort order and other locale aspects. But they care a lot about index corruption.

So I'd argue that we should not have any breaking changes at all, even in cases where the provider is clearly wrong.

Yours,
Laurenz Albe
On Tue, Jul 9, 2024 at 4:00 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
My personal exprience is that very few users are aware of or care about
the strict accuracy of the collation sort order and other locale aspects.
But they care a lot about index corruption.
So I'd argue that we should not have any breaking changes at all, even in
cases where the provider is clearly wrong.
FWIW, using external ICU libraries is a nice solution for users who need strict and up-to-date Unicode support.
Cell phones do often get support for new code points before databases. So databases can end up storing characters before they are aware of the meaning. (Slide 27 in the pgconf.dev talk illustrates a recent timeline of Unicode & phone updates.)
-Jeremy
On Mon, 2024-07-08 at 18:05 -0700, Noah Misch wrote:
> I'm thinking about these
> aggravating factors for $SUBJECT:

This is still marked as an open item for 17, but you've already acknowledged[1] that no code changes are necessary in version 17. Upgrades of Unicode require an active step from a committer, so it's not a pressing problem for 18, either.

The idea that you're arguing against is "stability within a PG major version". There's no new discovery here: it was listed under the heading of "Benefits" near the top of my initial proposal[2], and known to all reviewers.

This is not an Open Item for 17, and new policy discussions should not happen deep in this subthread. Please resolve the Open Item, and feel free to start a thread about policy changes in 18.

Regards,
	Jeff Davis

[1] https://www.postgresql.org/message-id/20240701230352.2c.nmisch@google.com
[2] https://www.postgresql.org/message-id/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.camel@j-davis.com
On Tue, Jul 09, 2024 at 04:20:12PM -0700, Jeff Davis wrote:
> On Mon, 2024-07-08 at 18:05 -0700, Noah Misch wrote:
> > I'm thinking about these
> > aggravating factors for $SUBJECT:
>
> This is still marked as an open item for 17, but you've already
> acknowledged[1] that no code changes are necessary in version 17.

Later posts on the thread made that obsolete. The next step is to settle the question at https://postgr.es/m/20240706195129.fd@rfd.leadboat.com. If that conclusion entails a remedy, v17 code changes may be part of that remedy.

> The idea that you're arguing against is "stability within a PG major
> version". There's no new discovery here: it was listed under the
> heading of "Benefits" near the top of my initial proposal[2]

Thanks for that distillation. More specifically, I'm arguing against the lack of choice about instability across major versions:

                                  | ICU collations    | pg_c_utf8
----------------------------------|-------------------|----------
Corruption within a major version | packager's choice | no
Corruption at pg_upgrade time     | packager's choice | yes

I am a packager who chooses no-corruption (chooses stability). As a packager, the pg_c_utf8 stability within major versions is never a bad thing, but it does not compensate for instability across major versions. I don't want a future in which pg_c_utf8 is the one provider that integrity-demanding pg_upgrade users should not use.

> and known to all reviewers.

If after https://postgr.es/m/20240706195129.fd@rfd.leadboat.com and https://postgr.es/m/20240709010545.8c.nmisch@google.com they think $SUBJECT should continue as-committed, they should vote that way. Currently, we have multiple votes in one direction and multiple votes in the other direction. If all three reviewers were to vote in the same direction (assuming no other new votes), I argue that such a count would render whichever way they vote as the conclusion. Does that match your count?
> [1] > https://www.postgresql.org/message-id/20240701230352.2c.nmisch@google.com > [2] > https://www.postgresql.org/message-id/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.camel@j-davis.com
On Thu, 2024-07-11 at 05:50 -0700, Noah Misch wrote: > > This is still marked as an open item for 17, but you've already > > acknowledged[1] that no code changes are necessary in version 17. > > Later posts on the thread made that obsolete. The next step is to > settle the > question at https://postgr.es/m/20240706195129.fd@rfd.leadboat.com. > If that > conclusion entails a remedy, v17 code changes may be part of that > remedy. This is the first time you've mentioned a code change in version 17. If you have something in mind, please propose it. However, this feature followed the right policies at the time of commit, so there would need to be a strong consensus to accept such a change. Additionally, I started a discussion on version 18 policy that may also resolve your concerns: https://www.postgresql.org/message-id/d75d2d0d1d2bd45b2c332c47e3e0a67f0640b49c.camel@j-davis.com Regards, Jeff Davis
On Wed, Jul 17, 2024 at 08:48:46AM -0700, Jeff Davis wrote: > On Thu, 2024-07-11 at 05:50 -0700, Noah Misch wrote: > > > This is still marked as an open item for 17, but you've already > > > acknowledged[1] that no code changes are necessary in version 17. > > > > Later posts on the thread made that obsolete. The next step is to > > settle the > > question at https://postgr.es/m/20240706195129.fd@rfd.leadboat.com. > > If that > > conclusion entails a remedy, v17 code changes may be part of that > > remedy. > > This is the first time you've mentioned a code change in version 17. If That's right. > you have something in mind, please propose it. However, this feature > followed the right policies at the time of commit, so there would need > to be a strong consensus to accept such a change. If I'm counting the votes right, you and Tom have voted that the feature's current state is okay, and I and Laurenz have voted that it's not okay. I still hope more people will vote, to avoid dealing with the tie. Daniel, Peter, and Jeremy, you're all listed as reviewers on commit f69319f. Are you willing to vote one way or the other on the question in https://postgr.es/m/20240706195129.fd@rfd.leadboat.com? A tie would become a decision against the unreleased behavior. In the event of a decision against the unreleased behavior, reverting the feature is the remedy that could proceed without further decision making.
On Wed, 2024-07-17 at 15:03 -0700, Noah Misch wrote:
> If I'm counting the votes right ...
> , you and Tom have voted that the feature's
> current state is okay, and I and Laurenz have voted that it's not okay.
...
> A tie would become a decision against the unreleased behavior.
...
> In the event of a decision against the unreleased behavior, reverting the
> feature is the remedy that could proceed without further decision making.

You haven't established that any problem actually exists in version 17, and your arguments have been a moving target throughout this subthread. I reject the procedural framework that you are trying to establish. Voting won't change the fact that the "stability within a major version" that you are arguing against[1] was highlighted as a benefit in my initial proposal[2] for all reviewers to see. If you press forward with this approach, I'll use judgement that is sufficiently deferential to the review process before making any hasty decisions.

Alternatively, I suggest that you participate in the thread that I started here:

https://www.postgresql.org/message-id/d75d2d0d1d2bd45b2c332c47e3e0a67f0640b49c.camel%40j-davis.com

which seems like a more direct (and more complete) path to a resolution of your concerns. I speak only for myself, but I assure you that I have an open mind in that discussion, and that I have no intention to force a Unicode update past objections.

Regards,
	Jeff Davis

[1] https://www.postgresql.org/message-id/20240711125040.11.nmisch@google.com
[2] https://www.postgresql.org/message-id/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.camel@j-davis.com
On Wed, 2024-07-17 at 15:03 -0700, Noah Misch wrote: > If I'm counting the votes right, you and Tom have voted that the feature's > current state is okay, and I and Laurenz have voted that it's not okay. Maybe I should expand my position. I am very much for the built-in CTYPE provider. When I said that I am against changes in major versions, I mean changes that are likely to affect real-life usage patterns. If there are modifications affecting a code point that was previously unassigned, it is *theoretically* possible, but very unlikely, that someone has stored it in a database. I would want to deliberate about any change affecting such a code point, and if the change seems highly desirable, we can consider applying it. What I am against is routinely updating the built-in provider to adopt any changes that Unicode makes. To make a comparison with Tom's argument upthread: we have slightly changed how floating point computations work, even though they are IMMUTABLE. But I'd argue that very few people build indexes on the results of floating point arithmetic (and those who do are probably doing something wrong), so the risk is acceptable. But people index strings all the time. Yours, Laurenz Albe
Noah Misch wrote:
> If I'm counting the votes right, you and Tom have voted that the feature's current state is okay, and I and Laurenz have voted that it's not okay. I still hope more people will vote, to avoid dealing with the tie. Daniel, Peter, and Jeremy, you're all listed as reviewers on commit f69319f. Are you willing to vote one way or the other on the question in https://postgr.es/m/20240706195129.fd@rfd.leadboat.com?

For me, the current state is okay.

In the mentioned question, you're doing this:

  v17 can simulate the Unicode aspect of a v18 upgrade, like this:
    sed -i 's/^UNICODE_VERSION.*/UNICODE_VERSION = 16.0.0/' src/Makefile.global.in

to force a Unicode upgrade. But a packager could do the same to force a Unicode downgrade, if they wanted. Therefore I don't agree with this summary in <20240711125040.11.nmisch@google.com>:

>                                   | ICU collations    | pg_c_utf8
> ----------------------------------|-------------------|----------
> Corruption within a major version | packager's choice | no
> Corruption at pg_upgrade time     | packager's choice | yes

Best regards,
-- Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
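For a concrete sense of what the sed invocation above does, here is a self-contained sketch of the substitution run against a throwaway copy of the file rather than a real source tree (the path src/Makefile.global.in and the follow-up "make -C src/common/unicode update-unicode" step are as described in the thread; the starting version number below is illustrative):

```shell
# Work on a disposable copy rather than a real PostgreSQL source tree.
tmp=$(mktemp)
printf 'UNICODE_VERSION = 15.1.0\n' > "$tmp"   # illustrative starting value

# The substitution from the thread: pin UNICODE_VERSION to a chosen release.
# (In a real tree one would then run "make -C src/common/unicode update-unicode"
# to regenerate the lookup tables from that Unicode version's data files.)
sed -i 's/^UNICODE_VERSION.*/UNICODE_VERSION = 16.0.0/' "$tmp"

grep '^UNICODE_VERSION' "$tmp"   # prints: UNICODE_VERSION = 16.0.0
rm -f "$tmp"
```

Note that `sed -i` with no backup suffix is GNU sed syntax; BSD/macOS sed would need `sed -i ''` instead.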
On Thu, Jul 18, 2024 at 10:05:34AM +0200, Laurenz Albe wrote: > On Wed, 2024-07-17 at 15:03 -0700, Noah Misch wrote: > > If I'm counting the votes right, you and Tom have voted that the feature's > > current state is okay, and I and Laurenz have voted that it's not okay. > > Maybe I should expand my position. > > I am very much for the built-in CTYPE provider. When I said that I am against > changes in major versions, I mean changes that are likely to affect real-life > usage patterns. If there are modifications affecting a code point that was > previously unassigned, it is *theoretically* possible, but very unlikely, that > someone has stored it in a database. I would want to deliberate about any change > affecting such a code point, and if the change seems highly desirable, we can > consider applying it. > > What I am against is routinely updating the built-in provider to adopt any changes > that Unicode makes. Given all the messages on this thread, if the feature remains in PostgreSQL, I advise you to be ready to tolerate PostgreSQL "routinely updating the built-in provider to adopt any changes that Unicode makes". Maybe someone will change something in v18 so it's not like that, but don't count on it. Would you like to change your vote to "okay", keep your vote at "not okay", or change it to an abstention? > To make a comparison with Tom's argument upthread: we have slightly changed how > floating point computations work, even though they are IMMUTABLE. But I'd argue > that very few people build indexes on the results of floating point arithmetic > (and those who do are probably doing something wrong), so the risk is acceptable. > But people index strings all the time. Agreed.
On Thu, 2024-07-18 at 07:00 -0700, Noah Misch wrote:
> > What I am against is routinely updating the built-in provider to adopt any changes that Unicode makes.
>
> Given all the messages on this thread, if the feature remains in PostgreSQL, I advise you to be ready to tolerate PostgreSQL "routinely updating the built-in provider to adopt any changes that Unicode makes". Maybe someone will change something in v18 so it's not like that, but don't count on it.
>
> Would you like to change your vote to "okay", keep your vote at "not okay", or change it to an abstention?

In that case I am against it. Against the "routinely" in particular.

Yours,
Laurenz Albe
On Thu, 2024-07-18 at 07:00 -0700, Noah Misch wrote:
> Given all the messages on this thread, if the feature remains in PostgreSQL, I advise you to be ready to tolerate PostgreSQL "routinely updating the built-in provider to adopt any changes that Unicode makes".

You mean messages from me, like:

* "I have no intention to force a Unicode update" [1]
* "While nothing needs to be changed for 17, I agree that we may need to be careful in future releases not to break things." [2]
* "...you are right that we may need to freeze Unicode updates or be more precise about versioning..." [2]
* "If you are proposing that Unicode updates should not be performed if they affect the results of any IMMUTABLE function...I am neither endorsing nor opposing..." [3]

?

The only source I can imagine for your concern -- please correct me if I'm wrong -- is that I declined to make a preemptive version 18 promise deep in this version 17 Open Item subthread. But I have good reasons. First, if we promise not to update Unicode, that would also affect NORMALIZE(), so for the sake of transparency we need a top-level discussion. Second, an Open Item should be tightly scoped to what actually needs to happen in version 17 before release. And third, such a promise would artificially limit the range of possible outcomes, which may include various compromises that are not 17 material. I'm perplexed as to why you don't engage in the version 18 policy discussion.

> Maybe someone will change something in v18 so it's not like that, but don't count on it.

That's backwards. If nothing happens in v18, then there will be no breaking Unicode change. It takes an active step by a committer to update Unicode.
Regards, Jeff Davis [1] https://www.postgresql.org/message-id/5edb38923b0b23eb643f61807ef772a237ab92cf.camel%40j-davis.com [2] https://www.postgresql.org/message-id/db496682c6656ac64433f05f8821e561bbf4d105.camel@j-davis.com [3] https://www.postgresql.org/message-id/1d178eb1bbd61da1bcfe4a11d6545e9cdcede1d1.camel%40j-davis.com
Jeff Davis <pgsql@j-davis.com> writes: > On Thu, 2024-07-18 at 07:00 -0700, Noah Misch wrote: >> Maybe someone will change >> something in v18 so it's not like that, but don't count on it. > That's backwards. If nothing happens in v18, then there will be no > breaking Unicode change. It takes an active step by a committer to > update Unicode. This whole discussion seems quite bizarre to me. In the first place, it is certain that Unicode will continue to evolve, and I can't believe that we'd just freeze pg_c_utf8 on the current definition forever. Whether the first change happens in v18 or years later doesn't seem like a particularly critical point. In the second place, I cannot understand why pg_c_utf8 is being held to a mutability standard that we have never applied to any other locale-related functionality --- and indeed could not do so, since in most cases that functionality has been buried in libraries we don't control. It seems to me to be already a great step forward that with pg_c_utf8, at least we can guarantee that the behavior won't change without us knowing about it. Noah's desire to revert the feature makes the mutability situation strictly worse, because people will have to continue to rely on OS-provided functionality that can change at any time. regards, tom lane
On Thu, 2024-07-18 at 16:45 +0200, Laurenz Albe wrote:
> On Thu, 2024-07-18 at 07:00 -0700, Noah Misch wrote:
> > > What I am against is routinely updating the built-in provider to adopt any changes that Unicode makes.

That is a perfectly reasonable position; please add it to the version 18 discussion[1].

> > Given all the messages on this thread, if the feature remains in PostgreSQL, I advise you to be ready to tolerate PostgreSQL "routinely updating the built-in provider to adopt any changes that Unicode makes". Maybe someone will change something in v18 so it's not like that, but don't count on it.
...
> In that case I am against it. Against the "routinely" in particular.

Also, please see my response[2] to Noah. I don't believe his statement above is an accurate characterization. There's plenty of opportunity for deliberation and compromise in version 18, and my mind is still open to pretty much everything, up to and including freezing Unicode updates if necessary[3].

Regards, Jeff Davis

[1] https://www.postgresql.org/message-id/d75d2d0d1d2bd45b2c332c47e3e0a67f0640b49c.camel%40j-davis.com
[2] https://www.postgresql.org/message-id/f29686018b4432a8ae2e535dbe19f0d88e7a79d5.camel@j-davis.com
[3] https://www.postgresql.org/message-id/db496682c6656ac64433f05f8821e561bbf4d105.camel@j-davis.com
On Thu, Jul 18, 2024 at 01:03:31PM -0400, Tom Lane wrote: > This whole discussion seems quite bizarre to me. In the first > place, it is certain that Unicode will continue to evolve, and > I can't believe that we'd just freeze pg_c_utf8 on the current > definition forever. Whether the first change happens in v18 > or years later doesn't seem like a particularly critical point. > > In the second place, I cannot understand why pg_c_utf8 is being > held to a mutability standard that we have never applied to any > other locale-related functionality --- and indeed could not do > so, since in most cases that functionality has been buried in > libraries we don't control. It seems to me to be already a With libc and ICU providers, packagers have a way to avoid locale-related behavior changes. That's the "mutability standard" I want pg_c_utf8 to join. pg_c_utf8 is the one provider where packagers can't opt out[1] of annual pg_upgrade-time index scan breakage on affected expression indexes. > great step forward that with pg_c_utf8, at least we can guarantee > that the behavior won't change without us knowing about it. > Noah's desire to revert the feature makes the mutability situation > strictly worse, because people will have to continue to rely on > OS-provided functionality that can change at any time. I see: - one step forward: "string1 < string2" won't change, forever, regardless of packager choices - one step backward: "string ~ '[[:alpha:]]'" will change at pg_upgrade time, regardless of packager choices I think one's perspective on the relative importance of the step forward and the step backward depends on the sort of packages one uses today. Consider a user of Debian packages with locale!=C, doing Debian upgrades and pg_upgrade. For that user, pg_c_utf8 gets far less index corruption than an ICU locale. The step forward is a great step forward _for this user_, and the step backward is in the noise next to the step forward. I'm with a different kind of packager. 
I don't tolerate index scans returning wrong answers. To achieve that, my libc and ICU aren't changing collation behavior. I suspect my packages won't offer a newer ICU behavior until PostgreSQL gets support for multiple ICU library versions per database. (SQL Server, DB2 and Oracle already do. I agree we can't freeze forever. The multiple-versions feature gets more valuable each year.) _For this_ kind of package, the step forward is a no-op. The step backward is the sole effect on this kind of package. How much does that pair of perspectives explain the contrast between my "revert" and your "great step forward"? We may continue to disagree on the ultimate decision, but I hope I can make my position cease to appear bizarre to you. Thanks, nm [1] Daniel Verite said packagers could patch src/Makefile.global.in and run "make -C src/common/unicode update-unicode". Editing src/Makefile.global.in is modifying PostgreSQL, not configuring a packager-facing option.
On Thu, 2024-07-18 at 13:03 -0400, Tom Lane wrote:
> In the second place, I cannot understand why pg_c_utf8 is being held to a mutability standard that we have never applied to any other locale-related functionality --- and indeed could not do so, since in most cases that functionality has been buried in libraries we don't control.

I believe that we should hold it to a higher standard *precisely because* the previous way that we handled mutability in locale-related functionality was a problem.

> It seems to me to be already a great step forward that with pg_c_utf8, at least we can guarantee that the behavior won't change without us knowing about it.

+1

But the greatness of the step depends on our readiness to be careful with such changes.

> Noah's desire to revert the feature makes the mutability situation strictly worse, because people will have to continue to rely on OS-provided functionality that can change at any time.

I think everybody agrees that we don't want to expose users to data corruption after an upgrade. I understand Noah to take the position that anything less than strict immutability would be worse than the current state, because currently a packager can choose to keep shipping the same old version of libicu and avoid the problem completely.

I don't buy that. First, the only binary distribution I have heard of that does that is EDB's Windows installer; both the RPM and Debian packages don't. And until PostgreSQL defaults to using ICU, most people will use C library collations, and a packager cannot choose not to upgrade the C library.

I believe the built-in CTYPE provider is a good thing and a step forward. But to make it a big step forward, we should be extremely careful with any changes in major releases that might require rebuilding indexes. This is where I side with Noah.

Yours,
Laurenz Albe
On Thu, 2024-07-18 at 16:39 -0700, Noah Misch wrote:
> I'm with a different kind of packager. I don't tolerate index scans returning wrong answers.

I doubt that. We've all been tolerating the risk of index scans returning wrong results in some cases. Consider:

a. Some user creates an expression index on NORMALIZE(); vs.
b. Some user chooses the builtin "C.UTF-8" locale and creates a partial index with a predicate like "string ~ '[[:alpha:]]'" (or an expression index on LOWER())

Both cases create a risk if we update Unicode in some future version. Why are you unconcerned about case (a), but highly concerned about case (b)? Neither seems to be a pressing problem, because updating Unicode is our choice, so we have time to reach a compromise.

> [1] Daniel Verite said packagers could patch src/Makefile.global.in and run "make -C src/common/unicode update-unicode". Editing src/Makefile.global.in is modifying PostgreSQL, not configuring a packager-facing option.

Then go to the other thread[1] and propose that it be exposed as a packager-facing option along with any proposed Unicode update. There are other potential compromises possible, so I don't think this 17 subthread is the right place to discuss it, but it strikes me as a reasonable proposal.

I sincerely think you are overcomplicating matters with version 17 procedural motions. Let the community process play out in version 18 like normal; because there's no actual problem now, I see no reason your objections would be taken less seriously later.

Regards, Jeff Davis

[1] https://www.postgresql.org/message-id/d75d2d0d1d2bd45b2c332c47e3e0a67f0640b49c.camel%40j-davis.com
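In SQL terms, the two index shapes contrasted as cases (a) and (b) look roughly like this (table, column, and index names are illustrative, and the sketch assumes a UTF-8 database on version 17, where the pg_c_utf8 collation is available):

```sql
-- Case (a): expression index whose stored keys depend on the Unicode
-- normalization tables behind NORMALIZE().
CREATE TABLE docs (body text);
CREATE INDEX docs_norm_idx ON docs (normalize(body, NFC));

-- Case (b): partial index whose predicate depends on character
-- classification under the built-in provider; which rows satisfy the
-- predicate can shift if the classification tables change.
CREATE TABLE names (name text COLLATE pg_c_utf8);
CREATE INDEX names_alpha_idx ON names (name) WHERE name ~ '[[:alpha:]]';
```

In both cases the index contents are fixed at build time, so a later change to the underlying Unicode tables can leave the index disagreeing with a fresh evaluation of the expression or predicate, which is the corruption risk discussed in this subthread.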
On Fri, 2024-07-19 at 09:44 +0200, Laurenz Albe wrote:
> But the greatness of the step depends on our readiness to be careful with such changes.

You and Noah have been clear on that point, which is enough to make *me* careful with any Unicode updates in the future. I'll suggest once more that you say so in the policy thread here:

https://www.postgresql.org/message-id/d75d2d0d1d2bd45b2c332c47e3e0a67f0640b49c.camel%40j-davis.com

which would get broader visibility and I believe provide you with stronger assurances that *everyone* will be careful with Unicode updates.

Regards, Jeff Davis
On Wed, Jul 17, 2024 at 03:03:26PM -0700, Noah Misch wrote: > On Wed, Jul 17, 2024 at 08:48:46AM -0700, Jeff Davis wrote: > > On Thu, 2024-07-11 at 05:50 -0700, Noah Misch wrote: > > > > This is still marked as an open item for 17, but you've already > > > > acknowledged[1] that no code changes are necessary in version 17. > > > > > > Later posts on the thread made that obsolete. The next step is to > > > settle the > > > question at https://postgr.es/m/20240706195129.fd@rfd.leadboat.com. > > > If that > > > conclusion entails a remedy, v17 code changes may be part of that > > > remedy. > > > > This is the first time you've mentioned a code change in version 17. If > > That's right. > > > you have something in mind, please propose it. However, this feature > > followed the right policies at the time of commit, so there would need > > to be a strong consensus to accept such a change. > > If I'm counting the votes right, you and Tom have voted that the feature's > current state is okay, and I and Laurenz have voted that it's not okay. I > still hope more people will vote, to avoid dealing with the tie. Daniel, > Peter, and Jeremy, you're all listed as reviewers on commit f69319f. Are you > willing to vote one way or the other on the question in > https://postgr.es/m/20240706195129.fd@rfd.leadboat.com? The last vote arrived 6 days ago. So far, we have votes from Jeff, Noah, Tom, Daniel, and Laurenz. I'll keep the voting open for another 24 hours from now or 36 hours after the last vote, whichever comes last. If that schedule is too compressed for anyone, do share.
On 24.07.24 17:19, Noah Misch wrote: > On Wed, Jul 17, 2024 at 03:03:26PM -0700, Noah Misch wrote: >> On Wed, Jul 17, 2024 at 08:48:46AM -0700, Jeff Davis wrote: >>> On Thu, 2024-07-11 at 05:50 -0700, Noah Misch wrote: >>>>> This is still marked as an open item for 17, but you've already >>>>> acknowledged[1] that no code changes are necessary in version 17. >>>> >>>> Later posts on the thread made that obsolete. The next step is to >>>> settle the >>>> question at https://postgr.es/m/20240706195129.fd@rfd.leadboat.com. >>>> If that >>>> conclusion entails a remedy, v17 code changes may be part of that >>>> remedy. >>> >>> This is the first time you've mentioned a code change in version 17. If >> >> That's right. >> >>> you have something in mind, please propose it. However, this feature >>> followed the right policies at the time of commit, so there would need >>> to be a strong consensus to accept such a change. >> >> If I'm counting the votes right, you and Tom have voted that the feature's >> current state is okay, and I and Laurenz have voted that it's not okay. I >> still hope more people will vote, to avoid dealing with the tie. Daniel, >> Peter, and Jeremy, you're all listed as reviewers on commit f69319f. Are you >> willing to vote one way or the other on the question in >> https://postgr.es/m/20240706195129.fd@rfd.leadboat.com? > > The last vote arrived 6 days ago. So far, we have votes from Jeff, Noah, Tom, > Daniel, and Laurenz. I'll keep the voting open for another 24 hours from now > or 36 hours after the last vote, whichever comes last. If that schedule is > too compressed for anyone, do share. My opinion is that it is okay to release as is.
On 7/24/24 11:19, Noah Misch wrote:
> On Wed, Jul 17, 2024 at 03:03:26PM -0700, Noah Misch wrote:
>> On Wed, Jul 17, 2024 at 08:48:46AM -0700, Jeff Davis wrote:
>> > you have something in mind, please propose it. However, this feature followed the right policies at the time of commit, so there would need to be a strong consensus to accept such a change.
>>
>> If I'm counting the votes right, you and Tom have voted that the feature's current state is okay, and I and Laurenz have voted that it's not okay. I still hope more people will vote, to avoid dealing with the tie. Daniel, Peter, and Jeremy, you're all listed as reviewers on commit f69319f. Are you willing to vote one way or the other on the question in https://postgr.es/m/20240706195129.fd@rfd.leadboat.com?
>
> The last vote arrived 6 days ago. So far, we have votes from Jeff, Noah, Tom, Daniel, and Laurenz. I'll keep the voting open for another 24 hours from now or 36 hours after the last vote, whichever comes last. If that schedule is too compressed for anyone, do share.

It isn't entirely clear to me exactly what we are voting on.

* If someone votes +1 (current state is ok) -- that is pretty clear.
* But if someone votes -1 (current state is not ok?) what does that mean in practice?
  - a revert?
  - we hold shipping 17 until we get consensus (via some plan or mitigation or whatever)?
  - something else?

In any case, I am a hard -1 against reverting. +0.5 on "current state is ok", and +1 on "current state is ok with agreement on what to do in 18"

-- Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Wed, Jul 24, 2024 at 9:27 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> > The last vote arrived 6 days ago. So far, we have votes from Jeff, Noah, Tom,
> > Daniel, and Laurenz. I'll keep the voting open for another 24 hours from now
> > or 36 hours after the last vote, whichever comes last. If that schedule is
> > too compressed for anyone, do share.
>
> My opinion is that it is okay to release as is.
Like Jeff, I don’t think counting votes or putting names on one side or another is the best way to decide things. Everyone has unique opinions and nuances, it’s not like there’s two groups that all agree together on everything and disagree with the other group. I don’t want my name put on a list this way; there are some places where I agree and some places where I disagree with most people 🙂
I don’t know the code as intimately as some others on the lists, but I’m not aware of any one-way doors that would create major difficulties for future v18+ ideas being discussed
fwiw, I don’t want to pull this feature out of v17, I think it’s okay to release it
-Jeremy
On Wed, Jul 24, 2024 at 08:19:13AM -0700, Noah Misch wrote: > On Wed, Jul 17, 2024 at 03:03:26PM -0700, Noah Misch wrote: > > vote one way or the other on the question in > > https://postgr.es/m/20240706195129.fd@rfd.leadboat.com? > > I'll keep the voting open for another 24 hours from now > or 36 hours after the last vote, whichever comes last. I count 4.5 or 5 votes for "okay" and 2 votes for "not okay". I've moved the open item to "Non-bugs". On Wed, Jul 17, 2024 at 11:06:43PM -0700, Jeff Davis wrote: > You haven't established that any problem actually exists in version 17, > and your arguments have been a moving target throughout this subthread. I can understand that experience of yours. It wasn't my intent to make a moving target. To be candid, I entered the thread with no doubts that you'd agree with the problem. When you and Tom instead shared a different view, I switched to pursuing the votes to recognize the problem. (Voting then held that pg_c_utf8 is okay as-is.) On Thu, Jul 18, 2024 at 09:52:44AM -0700, Jeff Davis wrote: > On Thu, 2024-07-18 at 07:00 -0700, Noah Misch wrote: > > Given all the messages on this thread, if the feature remains in > > PostgreSQL, I > > advise you to be ready to tolerate PostgreSQL "routinely updating the > > built-in > > provider to adopt any changes that Unicode makes". > > You mean messages from me, like: > > * "I have no intention force a Unicode update" [1] > * "While nothing needs to be changed for 17, I agree that we may need > to be careful in future releases not to break things." [2] > * "...you are right that we may need to freeze Unicode updates or be > more precise about versioning..." [2] > * "If you are proposing that Unicode updates should not be performed > if they affect the results of any IMMUTABLE function...I am neither > endorsing nor opposing..." [3] > > ? Those, plus all the other messages. On Fri, Jul 19, 2024 at 08:50:41AM -0700, Jeff Davis wrote: > Consider: > > a. 
Some user creates an expression index on NORMALIZE(); vs. > b. Some user chooses the builtin "C.UTF-8" locale and creates a partial > index with a predicate like "string ~ '[[:alpha:]]'" (or an expression > index on LOWER()) > > Both cases create a risk if we update Unicode in some future version. > Why are you unconcerned about case (a), but highly concerned about case > (b)? I am not unconcerned about (a), but the v17 beta process gave an opportunity to do something about (b) that it didn't give for (a). Also, I have never handled a user report involving NORMALIZE(). I have handled user reports around regexp index inconsistency, e.g. the one at https://www.youtube.com/watch?v=kNH94tmpUus&t=1490s