Thread: Built-in CTYPE provider

Built-in CTYPE provider

From

Jeff Davis

Date:

05 December 2023, 23:46:06

CTYPE, which handles character classification and upper/lowercasing
behavior, may be simpler than it first appears. We may be able to get
a net decrease in complexity by just building in most (or perhaps all)
of the functionality.

Unicode offers relatively simple rules for CTYPE-like functionality
based on data files. There are a few exceptions and a few options,
which I'll address below.

(In contrast, collation varies a lot from locale to locale, and has a
lot more options and nuance than ctype.)

=== Proposal ===

Parse some Unicode data files into static lookup tables in .h files
(similar to what we already do for normalization) and provide
functions to perform the right lookups according to Unicode
recommentations[1][2]. Then expose the functionality as either a
specially-named locale for the libc provider, or as part of the
built-in collation provider which I previously proposed[3]. (Provided
patches don't expose the functionality yet; I'm looking for feedback
first.)

Using libc or ICU for a CTYPE provider would still be supported, but
as I explain below, there's not nearly as much reason to do so as you
might expect. As far as I can tell, using an external provider for
CTYPE functionality is mostly unnecessary complexity and magic.

There's still plenty of reason to use the plain "C" semantics, if
desired, but those semantics are already built-in.

=== Benefits ===

  * platform-independent ctype semantics based on Unicode, not tied to
    any dependency's implementation
  * ability to combine fast memcmp() collation with rich ctype
    semantics
  * user-visible semantics can be documented and tested
  * stability within a PG major version
  * transparency of changes: tables would be checked in to .h files,
    so whoever runs the "update-unicode" build target would see if
    there are unexpected or impactful changes that should be addressed
    in the release notes
  * the built-in tables themselves can be tested exhaustively by
    comparing with ICU so we can detect trivial parsing errors and the
    like

=== Character Classification ===

Character classification is used for regexes, e.g. whether a character
is a member of the "[[:digit:]]" ("\d") or "[[:punct:]]"
class. Unicode defines what character properties map into these
classes in TR #18 [1], specifying both a "Standard" variant and a
"POSIX Compatible" variant. The main difference with the POSIX variant
is that symbols count as punctuation.

Character classification in Unicode does not vary from locale to
locale. The same character is considered to be a member of the same
classes regardless of locale (in other words, there's no
"tailoring"). There is no strong compatibility guarantee around the
classification of characters, but it doesn't seem to change much in
practice (I could collect more data here if it matters).

In glibc, character classification is not affected by the locale as
far as I can tell -- all non-"C" locales behave like "C.UTF-8"
(perhaps other libc implementations or versions or custom locales
behave differently -- corrections welcome). There are some differences
between "C.UTF-8" and what Unicode seems to recommend, and I'm not
entirely sure why those differences exist or whether those differences
are important for anything other than compatibility.

Note: ICU offers character classification based on Unicode standards,
too, but the fact that it's an external dependency makes it a
difficult-to-test black box that is not tied to a PG major
version. Also, we currently don't use the APIs that Unicode
recommends; so in Postgres today, ICU-based character classification
is further from Unicode than glibc character classification.

=== LOWER()/INITCAP()/UPPER() ===

The LOWER() and UPPER() functions are defined in the SQL spec with
surprising detail, relying on specific Unicode General Category
assignments. How to map characters seems to be left (implicitly) up to
Unicode. If the input string is normalized, the output string must be
normalized, too. Weirdly, there's no room in the SQL spec to localize
LOWER()/UPPER() at all to handle issues like [1]. Also, the standard
specifies one example, which is that "ß" becomes "SS" when folded to
upper case. INITCAP() is not in the SQL spec.

In Unicode, lowercasing and uppercasing behavior is a mapping[2], and
also backed by a strong compatibility guarantee that "case pairs" will
always remain case pairs[4]. The mapping may be "simple"
(context-insensitive, locale-insensitive, not adding any code points),
or "full" (may be context-sensitive, locale-sensitive, and one code
point may turn into 1-3 code points).

Titlecasing (INITCAP() in Postgres) in Unicode is similar to
upper/lowercasing, except that it has the additional complexity of
finding word boundaries, which have a non-trivial definition. To
simplify, we'd either use the Postgres definition (alphanumeric) or
the "word" character class specified in [1]. If someone wants more
sophisticated word segmentation they could use ICU.

While "full" case mapping sounds more complex, there are actually very
few cases to consider and they are covered in another (small) data
file. That data file covers ~100 code points that convert to multiple
code points when the case changes (e.g. "ß" -> "SS"), 7 code points
that have context-sensitive mappings, and three locales which have
special conversions ("lt", "tr", and "az") for a few code points.

ICU can do the simple case mapping (u_tolower(), etc.) or full mapping
(u_strToLower(), etc.). I see one difference in ICU that I can't yet
explain for the full titlecase mapping of a singular \+000345.

glibc in UTF8 (at least in my tests) just does the simple upper/lower
case mapping, extended with simple mappings for the locales with
special conversions (which I think are exactly the same 3 locales
mentioned above). libc doesn't do titlecase. If the resuling character
isn't representable in the server encoding, I think libc just maps the
character to itself, though I should test this assumption.

=== Encodings ===

It's easiest to implement these rules in UTF8, but possible for any
encoding where we can decode to a Unicode code point.

=== Patches ===

0001 & 0002 are just cleanup. I intend to commit them unless someone
has a comment.

0003 implements character classification ("Standard" and "POSIX
Compatible" variants) but doesn't actually use them for anything.

0004 implements "simple" case mapping, and a partial implementation of
"full" case mapping. Again, does not use them yet.

=== Questions ===

* Is a built-in ctype provider a reasonable direction for Postgres as
  a project?
* Does it feel like it would be simpler or more complex than what
  we're doing now?
* Do we want to just try to improve our ICU support instead?
* Do we want the built-in provider to be one thing, or have a few
  options (e.g. "standard" or "posix" character classification;
  "simple" or "full" case mapping)?


Regards,
    Jeff Davis


[1] http://www.unicode.org/reports/tr18/#Compatibility_Properties
[2] https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf#G33992
[3]
https://www.postgresql.org/message-id/flat/9d63548c4d86b0f820e1ff15a83f93ed9ded4543.camel@j-davis.com
[4] https://www.unicode.org/policies/stability_policy.html#Case_Pair


--
Jeff Davis
PostgreSQL Contributor Team - AWS

On Wed, 2023-12-20 at 13:49 +0100, Daniel Verite wrote:
>
> But C.UTF-8 is not available everywhere, and there's still the
> problem that Unicode updates through libc are not aligned
> with Postgres releases.

Attached is an implementation of a built-in provider for the "C.UTF-8"
locale. That way applications (and tests!) can count on C.UTF-8 always
being available on any platform; and it also aligns with the Postgres
Unicode updates. Documentation is sparse and the patch is a bit rough,
but feedback is welcome -- it does have some basic tests which can be
used as a guide.

The C.UTF-8 locale, briefly, is a UTF-8 locale that provides simple
collation semantics (code point order) but rich ctype semantics
(lower/upper/initcap and regexes). This locale is for users who want
proper Unicode semantics for character operations (upper/lower,
regexes), but don't need a specific natural-language string sort order
to apply to all queries and indexes in their system. One might use it
as the database default collation, and use COLLATE clauses (i.e.
COLLATE UNICODE) where more specific behavior is needed.

The builtin C.UTF-8 locale has the following advantages over using the
libc C.UTF-8 locale:

  * Collation performance: the builtin provider uses memcmp and
abbreviated keys. In libc, these advantages are only available for the
C locale.

  * Unicode version is aligned with other parts of Postgres, like
normalization.

  * Available on all platforms with exactly the same semantics.

  * Testable and documentable.

  * Avoids index corruption risks. In theory libc C.UTF-8 should also
have stable collation, but that is not 100% true. In the builtin
provider it is 100% stable.

Regards,
    Jeff Davis

Attachment

Re: Built-in CTYPE provider

From

Jeff Davis

Date:

29 December 2023, 02:57:16

On Wed, 2023-12-27 at 17:26 -0800, Jeff Davis wrote:
> Attached is an implementation of a built-in provider for the "C.UTF-
> 8"

Attached a more complete version that fixes a few bugs, stabilizes the
tests, and improves the documentation. I optimized the performance, too
-- now it's beating both libc's "C.utf8" and ICU "en-US-x-icu" for both
collation and case mapping (numbers below).

It's really nice to finally be able to have platform-independent tests
that work on any UTF-8 database.

Simple character classification:

  SELECT 'Á' ~ '[[:alpha:]]' COLLATE C_UTF8;

Case mapping is more interesting (note that accented characters are
being properly mapped, and it's using the titlecase variant "ǅ"):

  SELECT initcap('axxE áxxÉ ǄxxǄ ǅxxx ǆxxx' COLLATE C_UTF8);
           initcap
  --------------------------
   Axxe Áxxé ǅxxǆ ǅxxx ǅxxx

Even more interesting -- test that non-latin characters can still be a
member of a case-insensitive range:

  -- capital delta is member of lowercase range gamma to lambda
  SELECT 'Δ' ~* '[γ-λ]' COLLATE C_UTF8;
  -- small delta is member of uppercase range gamma to lambda
  SELECT 'δ' ~* '[Γ-Λ]' COLLATE C_UTF8;

Moreover, a lot of this behavior is locked in by strong Unicode
guarantees like [1] and [2]. Behavior that can change probably won't
change very often, and in any case will be tied to a PG major version.

All of these behaviors are very close to what glibc "C.utf8" does on my
machine. The case transformations are identical (except titlecasing
because libc doesn't support it). The character classifications have
some differences, which might be worth discussing, but I didn't see
anything terribly concerning (I am following the unicode
recommendations[3] on this topic).

Performance:

  Sotring 10M strings:
    libc    "C"               14s
    builtin  C_UTF8           14s
    libc    "C.utf8"          20s
    ICU     "en-US-x-icu"     31s

  Running UPPER() on 10M strings:
    libc    "C"               03s
    builtin  C_UTF8           07s
    libc    "C.utf8"          08s
    ICU     "en-US-x-icu"     15s

I didn't investigate or optimize regexes / pattern matching yet, but I
can do similar optimizations if there's any gap.

Note that I implemented the "simple" case mapping (which is what glibc
does) and the "posix compatible"[3] flavor of character classification
(which is closer to what glibc does than the "standard" flavor"). I
opted to use title case mapping for initcap(), which is a difference
from libc and I may go back to just upper/lower. These seem like
reasonable choices if we're going to name the locale after C.UTF-8.

Regards,
    Jeff Davis

[1] https://www.unicode.org/policies/stability_policy.html#Case_Pair
[2] https://www.unicode.org/policies/stability_policy.html#Identity
[3] http://www.unicode.org/reports/tr18/#Compatibility_Properties

On 1/9/24 2:31 PM, Jeff Davis wrote:
> On Tue, 2024-01-09 at 14:17 -0800, Jeremy Schneider wrote:
>> I think we missed something in psql, pretty sure I applied all the
>> patches but I see this error:
>>
>> =# \l
>> ERROR:  42703: column d.datlocale does not exist
>> LINE 8:   d.datlocale as "Locale",
>>
> 
> Thank you. I'll fix this in the next patch set.

Very minor nitpick/request. Not directly with this patch set but with
the category_test which is related and also recently committed.

I'm testing out "make update-unicode" scripts, and due to my system ICU
version being different from the PostgreSQL unicode version I get the
expected warnings from category_test:

Postgres Unicode Version:       15.1
ICU Unicode Version:            14.0
Skipped 5116 codepoints unassigned in ICU due to Unicode version mismatch.
category_test: All tests successful!

I know it's minor, but I saw the 5116 skipped codepoints and saw "all
tests succeeded" but I thought the output might be a little nicer if we
showed the count of tests that succeeded. For example:

Postgres Unicode Version:       15.1
ICU Unicode Version:            14.0
Skipped 5116 codepoints unassigned in ICU due to Unicode version mismatch.
category_test: All 1108996 tests successful!

It's a minor tweak to a script that I don't think even runs in the
standard build; any objections?  Patch attached.

-Jeremy

-- 
http://about.me/jeremy_schneider

Attachment

0001-Output-count-of-checked-codepoints.patch

Re: Built-in CTYPE provider

From

"Daniel Verite"

Date:

10 January 2024, 22:56:23

    Jeff Davis wrote:

> Attached a more complete version that fixes a few bugs

[v15 patch]

When selecting the builtin provider with initdb, I'm getting the
following setup:

$ bin/initdb --locale=C.UTF-8 --locale-provider=builtin -D/tmp/pgdata

  The database cluster will be initialized with this locale configuration:
    default collation provider:  builtin
    default collation locale:     C.UTF-8
    LC_COLLATE:  C.UTF-8
    LC_CTYPE:     C.UTF-8
    LC_MESSAGES: C.UTF-8
    LC_MONETARY: C.UTF-8
    LC_NUMERIC:  C.UTF-8
    LC_TIME:     C.UTF-8
  The default database encoding has accordingly been set to "UTF8".
  The default text search configuration will be set to "english".

This is from an environment where LANG=fr_FR.UTF-8

I would expect all LC_* variables to be fr_FR.UTF-8, and the default
text search configuration to be "french".  It is what happens
when selecting ICU as the provider in the same environment:

$ bin/initdb --icu-locale=en --locale-provider=icu -D/tmp/pgdata

  Using language tag "en" for ICU locale "en".
  The database cluster will be initialized with this locale configuration:
    default collation provider:  icu
    default collation locale:     en
    LC_COLLATE:  fr_FR.UTF-8
    LC_CTYPE:     fr_FR.UTF-8
    LC_MESSAGES: fr_FR.UTF-8
    LC_MONETARY: fr_FR.UTF-8
    LC_NUMERIC:  fr_FR.UTF-8
    LC_TIME:     fr_FR.UTF-8
  The default database encoding has accordingly been set to "UTF8".
  The default text search configuration will be set to "french".

The collation setup does not influence the rest of the localization.
The problem AFAIU is that --locale has two distinct
meanings in the v15 patch:
--locale-provider=X --locale=Y means use "X" as the provider
with "Y" as datlocale, and it means use "Y" as the locale for all
localized libc functionalities.

I wonder what would happen if invoking
 bin/initdb --locale=C.UTF-8 --locale-provider=builtin -D/tmp/pgdata
on a system where C.UTF-8 does not exist as a libc locale.
Would it fail? (I don't have an OS like this to test ATM, will try later).

A related comment is about naming the builtin locale C.UTF-8, the same
name as in libc. On one hand this is semantically sound, but on the
other hand, it's likely to confuse people. What about using completely
different names, like "pg_unicode" or something else prefixed by "pg_"
both for the locale name and the collation name (currently
C.UTF-8/c_utf8)?



Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Re: Built-in CTYPE provider

From

Jeff Davis

Date:

11 January 2024, 19:05:36

On Wed, 2024-01-10 at 23:56 +0100, Daniel Verite wrote:
> $ bin/initdb --locale=C.UTF-8 --locale-provider=builtin -D/tmp/pgdata
>  
>   The database cluster will be initialized with this locale
> configuration:
>     default collation provider:  builtin
>     default collation locale:    C.UTF-8
>     LC_COLLATE:  C.UTF-8
>     LC_CTYPE:    C.UTF-8
>     LC_MESSAGES: C.UTF-8
>     LC_MONETARY: C.UTF-8
>     LC_NUMERIC:  C.UTF-8
>     LC_TIME:     C.UTF-8
>   The default database encoding has accordingly been set to "UTF8".
>   The default text search configuration will be set to "english".
>
> This is from an environment where LANG=fr_FR.UTF-8
>
> I would expect all LC_* variables to be fr_FR.UTF-8, and the default
> text search configuration to be "french".

You can get the behavior you want by doing:

  initdb --builtin-locale=C.UTF-8 --locale-provider=builtin \
    -D/tmp/pgdata

where "--builtin-locale" is analogous to "--icu-locale".

It looks like I forgot to document the new initdb option, which seems
to be the source of the confusion. Sorry, I'll fix that in the next
patch set. (See examples in the initdb tests.)

I think this answers some of your follow-up questions as well.

> A related comment is about naming the builtin locale C.UTF-8, the
> same
> name as in libc. On one hand this is semantically sound, but on the
> other hand, it's likely to confuse people. What about using
> completely
> different names, like "pg_unicode" or something else prefixed by
> "pg_"
> both for the locale name and the collation name (currently
> C.UTF-8/c_utf8)?

I'm flexible on naming, but here are my thoughts:

* A "pg_" prefix makes sense.

* If we named it something like "pg_unicode" someone might expect it to
sort using the root collation.

* The locale name "C.UTF-8" is nice because it implies things about
both the collation and the character behavior. It's also nice because
on at least some platforms, the behavior is almost identical to the
libc locale of the same name.

* UCS_BASIC might be a good name, because it also seems to carry the
right meanings, but that name is already taken.

* We also might to support variations, such as full case mapping (which
uppercases "ß" to "SS", as the SQL standard requires), or perhaps the
"standard" flavor of regexes (which don't count all symbols as
punctuation). Leaving some room to name those variations would be a
good idea.

Regards,
    Jeff Davis

Re: Built-in CTYPE provider

From

Jeff Davis

Date:

11 January 2024, 23:36:33

On Tue, 2024-01-09 at 14:17 -0800, Jeremy Schneider wrote:
> I think we missed something in psql, pretty sure I applied all the
> patches but I see this error:
>
> =# \l
> ERROR:  42703: column d.datlocale does not exist
> LINE 8:   d.datlocale as "Locale",
>           ^
> HINT:  Perhaps you meant to reference the column "d.daticulocale".
> LOCATION:  errorMissingColumn, parse_relation.c:3720

I think you're connecting to a patched server with an older version of
psql, so it doesn't know the catalog column was renamed.

pg_dump and pg_upgrade don't have that problem because they throw an
error when connecting to a newer server.

But for psql, that's perfectly reasonable to connect to a newer server.
Have we dropped or renamed catalog columns used by psql backslash
commands before, and if so, how do we handle that?

I can just not rename that column, but that's a bit frustrating because
it means I'd need to add a new column to pg_database, which seems
redundant.

Regards,
    Jeff Davis

Re: Built-in CTYPE provider

From

Jeff Davis

Date:

12 January 2024, 02:02:30

On Wed, 2024-01-10 at 23:56 +0100, Daniel Verite wrote:
> A related comment is about naming the builtin locale C.UTF-8, the
> same
> name as in libc. On one hand this is semantically sound, but on the
> other hand, it's likely to confuse people. What about using
> completely
> different names, like "pg_unicode" or something else prefixed by
> "pg_"
> both for the locale name and the collation name (currently
> C.UTF-8/c_utf8)?

New version attached. Changes:

 * Named collation object PG_C_UTF8, which seems like a good idea to
prevent name conflicts with existing collations. The locale name is
still C.UTF-8, which still makes sense to me because it matches the
behavior of the libc locale of the same name so closely.

 * Added missing documentation for initdb --builtin-locale

 * Refactored the upper/lower/initcap implementations

 * Improved tests for case conversions where the byte length of the
UTF8-encoded string changes (the string length doesn't change because
we don't do full case mapping).

 * No longer uses titlecase mappings -- libc doesn't do that, so it was
an unnecessary difference in case mapping behavior.

 * Improved test report per Jeremy's suggestion: now it reports the
number of codepoints tested.

Jeremy also raised a problem with old versions of psql connecting to a
new server: the \l and \dO won't work. Not sure exactly what to do
there, but I could work around it by adding a new field rather than
renaming (though that's not ideal).

Regards,
    Jeff Davis

    Jeff Davis wrote:

> New version attached.

[v16]

Concerning the target category_test, it produces failures with
versions of ICU with Unicode < 15. The first one I see with Ubuntu
22.04 (ICU 70.1) is:

category_test: Postgres Unicode version:    15.1
category_test: ICU Unicode version:        14.0
category_test: FAILURE for codepoint 0x000c04
category_test: Postgres property
alphabetic/lowercase/uppercase/white_space/hex_digit/join_control:
1/0/0/0/0/0
category_test: ICU    property
alphabetic/lowercase/uppercase/white_space/hex_digit/join_control:
0/0/0/0/0/0

U+0C04 is a codepoint added in Unicode 11.
https://en.wikipedia.org/wiki/Telugu_(Unicode_block)

In Unicode.txt:
0C04;TELUGU SIGN COMBINING ANUSVARA ABOVE;Mn;0;NSM;;;;;N;;;;;

In Unicode 15, it has been assigned Other_Alphabetic in PropList.txt
$ grep 0C04 PropList.txt
0C04          ; Other_Alphabetic # Mn        TELUGU SIGN COMBINING ANUSVARA
ABOVE

But in Unicode 14 it was not there.
As a result its binary property UCHAR_ALPHABETIC has changed from
false to true in ICU 72 vs previous versions.

As I understand, the stability policy says that such things happen.
From https://www.unicode.org/policies/stability_policy.html

   Once a character is encoded, its properties may still be changed,
   but not in such a way as to change the fundamental identity of the
   character.

   The Consortium will endeavor to keep the values of the other
   properties as stable as possible, but some circumstances may arise
   that require changing them. Particularly in the situation where
   the Unicode Standard first encodes less well-documented characters
   and scripts, the exact character properties and behavior initially
   may not be well known.

   As more experience is gathered in implementing the characters,
   adjustments in the properties may become necessary. Examples of
   such properties include, but are not limited to, the following:

    - General_Category
    - Case mappings
    - Bidirectional properties
    [...]

I've commented the exit(1) in category_test to collect all errors, and
built it with versions of ICU from 74 down to 60 (that is Unicode 10.0).
Results are attached. As expected, the older the ICU version, the more
differences are found against Unicode 15.1.

I find these results interesting because they tell us what contents
can break regexp-based check constraints on upgrades.

But about category_test as a pass-or-fail kind of test, it can only be
used when the Unicode version in ICU is the same as in Postgres.


Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Attachment

results-category-tests-multiple-icu-versions.tar.bz2

Re: Built-in CTYPE provider

From

Jeff Davis

Date:

15 January 2024, 19:42:56

On Mon, 2024-01-15 at 15:30 +0100, Daniel Verite wrote:
> Concerning the target category_test, it produces failures with
> versions of ICU with Unicode < 15. The first one I see with Ubuntu
> 22.04 (ICU 70.1) is:

...

> I find these results interesting because they tell us what contents
> can break regexp-based check constraints on upgrades.

Thank you for collecting and consolidating this information.

> But about category_test as a pass-or-fail kind of test, it can only
> be
> used when the Unicode version in ICU is the same as in Postgres.

The test has a few potential purposes:

1. To see if there is some error in parsing the Unicode files and
building the arrays in the .h file. For instance, let's say the perl
parser I wrote works fine on the Unicode 15.1 data file, but does
something wrong on the 16.0 data file: the test would fail and we'd
investigate. This is the most important reason for the test.

2. To notice any quirks between how we interpret Unicode vs how ICU
does.

3. To help see "interesting" differences between different Unicode
versions.

For #1 and #2, the best way to test is by using a version of ICU that
uses the same Unicode version as Postgres. The one running update-
unicode can try to recompile with the right one for the purposes of the
test. NB: There might be no version of ICU where the Unicode version
exactly matches what we'd like to update to. In that case, we'd need to
use the closest version and do some manual validation that the
generated tables are sane.

For #3, that is also interesting information to know about, but it's
not directly actionable. As you point out, Unicode does not guarantee
that these properties are static forever, so regexes can change
behavior when we update Unicode for the next PG version. That is a much
lower risk than a collation change, but as you point out, is a risk for
regexes inside of a CHECK constraint. If a user needs zero risk of
semantic changes for regexes, the only option is "C". Perhaps there
should be a separate test target for this mode so that it doesn't exit
early?

(Note: case mapping has much stronger guarantees than the character
classification.)

I will update the README to document how someone running update-unicode
should interpret the test results.

Regards,
    Jeff Davis

Re: Built-in CTYPE provider

From

Peter Eisentraut

Date:

18 January 2024, 12:53:36

On 12.01.24 03:02, Jeff Davis wrote:
> New version attached. Changes:
> 
>   * Named collation object PG_C_UTF8, which seems like a good idea to
> prevent name conflicts with existing collations. The locale name is
> still C.UTF-8, which still makes sense to me because it matches the
> behavior of the libc locale of the same name so closely.

I am catching up on this thread.  The discussions have been very 
complicated, so maybe I didn't get it all.

The patches look pretty sound, but I'm questioning how useful this 
feature is and where you plan to take it.

Earlier in the thread, the aim was summarized as

 > If the Postgres default was bytewise sorting+locale-agnostic
 > ctype functions directly derived from Unicode data files,
 > as opposed to libc/$LANG at initdb time, the main
 > annoyance would be that "ORDER BY textcol" would no
 > longer be the human-favored sort.

I think that would be a terrible direction to take, because it would 
regress the default sort order from "correct" to "useless".  Aside from 
the overall message this sends about how PostgreSQL cares about locales 
and Unicode and such.

Maybe you don't intend for this to be the default provider?  But then 
who would really use it?  I mean, sure, some people would, but how would 
you even explain, in practice, the particular niche of users or use cases?

Maybe if this new provider would be called "minimal", it might describe 
the purpose better.

I could see a use for this builtin provider if it also included the 
default UCA collation (what COLLATE UNICODE does now).  Then it would 
provide a "common" default behavior out of the box, and if you want more 
fine-tuning, you can go to ICU.  There would still be some questions 
about making sure the builtin behavior and the ICU behavior are 
consistent (different Unicode versions, stock UCA vs CLDR, etc.).  But 
for practical purposes, it might work.

There would still be a risk with that approach, since it would 
permanently marginalize ICU functionality, in the sense that only some 
locales would need ICU, and so we might not pay the same amount of 
attention to the ICU functionality.

I would be curious what your overall vision is here?  Is switching the 
default to ICU still your goal?  Or do you want the builtin provider to 
be the default?  Or something else?

Re: Built-in CTYPE provider

From

"Daniel Verite"

Date:

18 January 2024, 19:42:10

    Peter Eisentraut wrote:

> > If the Postgres default was bytewise sorting+locale-agnostic
> > ctype functions directly derived from Unicode data files,
> > as opposed to libc/$LANG at initdb time, the main
> > annoyance would be that "ORDER BY textcol" would no
> > longer be the human-favored sort.
>
> I think that would be a terrible direction to take, because it would
> regress the default sort order from "correct" to "useless".  Aside from
> the overall message this sends about how PostgreSQL cares about
> locales and Unicode and such.

Well, offering a viable solution to avoid as much as possible
the dreaded:

"WARNING: collation "xyz" has version mismatch
... HINT: Rebuild all objects affected by this collation..."

that doesn't sound like a bad message to send.

Currently, to have in codepoint order the indexes that don't need a
linguistic order, you're supposed to use collate "C", which then means
that upper(), lower() etc.. don't work beyond ASCII.
Here our Unicode support is not good enough, and the proposal
addresses that.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Re: Built-in CTYPE provider

From

Jeff Davis

Date:

18 January 2024, 22:03:30

On Thu, 2024-01-18 at 13:53 +0100, Peter Eisentraut wrote:
> I think that would be a terrible direction to take, because it would
> regress the default sort order from "correct" to "useless".

I don't agree that the current default is "correct". There are a lot of
ways it can be wrong:

  * the environment variables at initdb time don't reflect what the
users of the database actually want
  * there are so many different users using so many different
applications connected to the database that no one "correct" sort order
exists
  * libc has some implementation quirks
  * the version of Unicode that libc is based on is not what you expect
  * the version of libc is not what you expect

>   Aside from
> the overall message this sends about how PostgreSQL cares about
> locales
> and Unicode and such.

Unicode is primarily about the semantics of characters and their
relationships. The patches I propose here do a great job of that.

Collation (relationships between *strings*) is a part of Unicode, but
not the whole thing or even the main thing.

> Maybe you don't intend for this to be the default provider?

I am not proposing that this provider be the initdb-time default.

>   But then
> who would really use it? I mean, sure, some people would, but how
> would
> you even explain, in practice, the particular niche of users or use
> cases?

It's for users who want to respect Unicode support text from
international sources in their database; but are not experts on the
subject and don't know precisely what they want or understand the
consequences. If and when such users do notice a problem with the sort
order, they'd handle it at that time (perhaps with a COLLATE clause, or
sorting in the application).

> Maybe if this new provider would be called "minimal", it might
> describe
> the purpose better.

"Builtin" communicates that it's available everywhere (not a
dependency), that specific behaviors can be documented and tested, and
that behavior doesn't change within a major version. I want to
communicate all of those things.

> I could see a use for this builtin provider if it also included the
> default UCA collation (what COLLATE UNICODE does now).

I won't rule that out, but I'm not proposing that right now and my
proposal is already offering useful functionality.

> There would still be a risk with that approach, since it would
> permanently marginalize ICU functionality

Yeah, ICU already does a good job offering the root collation. I don't
think the builtin provider needs to do so.

> I would be curious what your overall vision is here?

Vision:

* The builtin provider will offer Unicode character semantics, basic
collation, platform-independence, and high performance. It can be used
on its own or in combination with ICU via the COLLATE clause.

* ICU offers COLLATE UNICODE, locale tailoring, case-insensitive
matching, and customization with rules. It's the solution for
everything from "slightly more advanced" to "very advanced".

* libc would be for databases serving applications on the same machine
where a matching sort order is helpful, risks to indexes are
acceptable, and performance is not important.

>   Is switching the
> default to ICU still your goal?  Or do you want the builtin provider
> to
> be the default?

It's hard to answer this question while initdb chooses the database
default collation based on the environment. Neither ICU nor the builtin
provider can reasonably handle whatever those environment variables
might be set to.

Stepping back from the focus on what initdb does, we should be
providing the right encouragement in documentation and packaging to
guide users toward the right provider based their needs and the vision
outlined above.

Regards,
    Jeff Davis

Re: Built-in CTYPE provider

From

Peter Eisentraut

Date:

22 January 2024, 18:49:56

On 18.01.24 23:03, Jeff Davis wrote:
> On Thu, 2024-01-18 at 13:53 +0100, Peter Eisentraut wrote:
>> I think that would be a terrible direction to take, because it would
>> regress the default sort order from "correct" to "useless".
> 
> I don't agree that the current default is "correct". There are a lot of
> ways it can be wrong:
> 
>    * the environment variables at initdb time don't reflect what the
> users of the database actually want
>    * there are so many different users using so many different
> applications connected to the database that no one "correct" sort order
> exists
>    * libc has some implementation quirks
>    * the version of Unicode that libc is based on is not what you expect
>    * the version of libc is not what you expect

These are arguments why the current defaults are not universally 
perfect, but I'd argue that they are still most often the right thing as 
the default.

>>    Aside from
>> the overall message this sends about how PostgreSQL cares about
>> locales
>> and Unicode and such.
> 
> Unicode is primarily about the semantics of characters and their
> relationships. The patches I propose here do a great job of that.
> 
> Collation (relationships between *strings*) is a part of Unicode, but
> not the whole thing or even the main thing.

I don't get this argument.  Of course, people care about sorting and 
sort order.  Whether you consider this part of Unicode or adjacent to 
it, people still want it.

>> Maybe you don't intend for this to be the default provider?
> 
> I am not proposing that this provider be the initdb-time default.

ok

>>    But then
>> who would really use it? I mean, sure, some people would, but how
>> would
>> you even explain, in practice, the particular niche of users or use
>> cases?
> 
> It's for users who want to respect Unicode support text from
> international sources in their database; but are not experts on the
> subject and don't know precisely what they want or understand the
> consequences. If and when such users do notice a problem with the sort
> order, they'd handle it at that time (perhaps with a COLLATE clause, or
> sorting in the application).

> Vision:

> * ICU offers COLLATE UNICODE, locale tailoring, case-insensitive
> matching, and customization with rules. It's the solution for
> everything from "slightly more advanced" to "very advanced".

I am astonished by this.  In your world, do users not want their text 
data sorted?  Do they not care what the sort order is?  You consider UCA 
sort order an "advanced" feature?

Re: Built-in CTYPE provider

From

Jeff Davis

Date:

22 January 2024, 23:33:54

On Mon, 2024-01-22 at 19:49 +0100, Peter Eisentraut wrote:

> >
> I don't get this argument.  Of course, people care about sorting and
> sort order.  Whether you consider this part of Unicode or adjacent to
> it, people still want it.

You said that my proposal sends a message that we somehow don't care
about Unicode, and I strongly disagree. The built-in provider I'm
proposing does implement Unicode semantics.

Surely a database that offers UCS_BASIC (a SQL spec feature) isn't
sending a message that it doesn't care about Unicode, and neither is my
proposal.

> >
> > * ICU offers COLLATE UNICODE, locale tailoring, case-insensitive
> > matching, and customization with rules. It's the solution for
> > everything from "slightly more advanced" to "very advanced".
>
> I am astonished by this.  In your world, do users not want their text
> data sorted?  Do they not care what the sort order is? 

I obviously care about Unicode and collation. I've put a lot of effort
recently into contributions in this area, and I wouldn't have done that
if I thought users didn't care. You've made much greater contributions
and I thank you for that.

The logical conclusion of your line of argument would be that libc's
"C.UTF-8" locale and UCS_BASIC simply should not exist. But they do
exist, and for good reason.

One of those good reasons is that only *human* users care about the
human-friendliness of sort order. If Postgres is just feeding the
results to another system -- or an application layer that re-sorts the
data anyway -- then stability, performance, and interoperability matter
more than human-friendliness. (Though Unicode character semantics are
still useful even when the data is not going directly to a human.)

>  You consider UCA
> sort order an "advanced" feature?

I said "slightly more advanced" compared with "basic". "Advanced" can
be taken in either a positive way ("more useful") or a negative way
("complex"). I'm sorry for the misunderstanding, but my point was this:

* The builtin provider is for people who are fine with code point order
and no tailoring, but want Unicode character semantics, collation
stability, and performance.

* ICU is the right solution for anyone who wants human-friendly
collation or tailoring, and is willing to put up with some collation
stability risk and lower collation performance.

Both have their place and the user is free to mix and match as needed,
thanks to the COLLATE clause for columns and queries.

Regards,
    Jeff Davis

Re: Built-in CTYPE provider

From

Peter Eisentraut

Date:

07 February 2024, 09:53:36

Review of the v16 patch set:

(Btw., I suppose you started this patch series with 0002 because some 
0001 was committed earlier.  But I have found this rather confusing.  I 
think it's ok to renumber from 0001 for each new version.)

* v16-0002-Add-Unicode-property-tables.patch

Various comments are updated to include the term "character class".  I 
don't recognize that as an official Unicode term.  There are categories 
and properties.  Let's check this.

Some files need heavy pgindent and perltidy.  You were surely going to 
do this eventually, but maybe you want to do this sooner to check 
whether you like the results.

- src/common/unicode/Makefile

This patch series adds some new post-update-unicode tests.  Should we 
have a separate target for each or just one common "unicode test" 
target?  Not sure.

- .../generate-unicode_category_table.pl

The trailing commas handling ($firsttime etc.) is not necessary with 
C99.  The code can be simplified.

For this kind of code:

+print $OT <<"HEADER";

let's use a common marker like EOS instead of a different one for each 
block.  That just introduces unnecessary variations.

- src/common/unicode_category.c

The mask stuff at the top could use more explanation.  It's impossible
to figure out exactly what, say, PG_U_PC_MASK does.

Line breaks in the different pg_u_prop_* functions are gratuitously 
different.

Is it potentially confusing that only some pg_u_prop_* have a posix
variant?  Would it be better for a consistent interface to have a
"posix" argument for each and just ignore it if not used?  Not sure.

Let's use size_t instead of Size for new code.


* v16-0003-Add-unicode-case-mapping-tables-and-functions.patch

Several of the above points apply here analogously.


* v16-0004-Catalog-changes-preparing-for-builtin-collation-.patch

This is mostly a straightforward renaming patch, but there are some 
changes in initdb and pg_dump that pre-assume the changes in the next 
patch, like which locale columns apply for which providers.  I think it 
would be better for the historical record to make this a straight 
renaming patch and move those semantic changes to the next patch (or a 
separate intermediate patch, if you prefer).

- src/bin/psql/describe.c
- src/test/regress/expected/psql.out

This would be a good opportunity to improve the output columns for 
collations.  The updated view is now:

+ Schema | Name | Provider | Collate | Ctype | Locale | ICU Rules | 
Deterministic?
+--------+------+----------+---------+-------+--------+-----------+----------------

This is historically grown but suboptimal.  Why is Locale after Collate 
and Ctype, and why does it show both?  I think we could have just the 
Locale column, and if the libc provider is used with different 
collate/ctype (very rare!), we somehow write that into the single locale 
column.

(A change like this would be a separate patch.)


* v16-0005-Introduce-collation-provider-builtin-for-C-and-C.patch

About this initdb --builtin-locale option and analogous options 
elsewhere:  Maybe we should flip this around and provide a --libc-locale 
option, and have all the other providers just use the --locale option. 
This would be more consistent with the fact that it's libc that is 
special in this context.

Do we even need the "C" locale?  We have established that "C.UTF-8" is 
useful, but if that is easily available, who would need "C"?

Some changes in this patch appear to be just straight renamings, like in
src/backend/utils/init/postinit.c and 
src/bin/pg_upgrade/t/002_pg_upgrade.pl.  Maybe those should be put into 
the previous patch instead.

On the collation naming: My expectation would have been that the 
"C.UTF-8" locale would be exposed as the UCS_BASIC collation.  And the 
"C" locale as some other name (or not at all, see above).  You have this 
the other way around.

Re: Built-in CTYPE provider

From

Jeff Davis

Date:

13 February 2024, 02:01:29

On Wed, 2024-02-07 at 10:53 +0100, Peter Eisentraut wrote:
> Various comments are updated to include the term "character class". 
> I
> don't recognize that as an official Unicode term.  There are
> categories
> and properties.  Let's check this.

It's based on
https://www.unicode.org/reports/tr18/#Compatibility_Properties

so I suppose the right name is "properties".

> Is it potentially confusing that only some pg_u_prop_* have a posix
> variant?  Would it be better for a consistent interface to have a
> "posix" argument for each and just ignore it if not used?  Not sure.

I thought about it but didn't see a clear advantage one way or another.

> About this initdb --builtin-locale option and analogous options
> elsewhere:  Maybe we should flip this around and provide a --libc-
> locale
> option, and have all the other providers just use the --locale
> option.
> This would be more consistent with the fact that it's libc that is
> special in this context.

Would --libc-locale affect all the environment variables or just
LC_CTYPE/LC_COLLATE? How do we avoid breakage?

I like the general direction here but we might need to phase in the
option or come up with a new name. Suggestions welcome.

> Do we even need the "C" locale?  We have established that "C.UTF-8"
> is
> useful, but if that is easily available, who would need "C"?

I don't think we should encourage its use generally but I also don't
think it will disappear any time soon. Some people will want it on
simplicity grounds. I hope fewer people will use "C" when we have a
better builtin option.

> Some changes in this patch appear to be just straight renamings, like
> in
> src/backend/utils/init/postinit.c and
> src/bin/pg_upgrade/t/002_pg_upgrade.pl.  Maybe those should be put
> into
> the previous patch instead.
>
> On the collation naming: My expectation would have been that the
> "C.UTF-8" locale would be exposed as the UCS_BASIC collation.

I'd like that. We have to sort out a couple things first, though:

1. The SQL spec mentions the capitalization of "ß" as "SS"
specifically. Should UCS_BASIC use the unconditional mappings in
SpecialCasing.txt? I already have some code to do that (not posted
yet).

2. Should UCS_BASIC use the "POSIX" or "Standard" variant of regex
properties? (The main difference seems to be whether symbols get
treated as punctuation or not.)

3. What do we do about potential breakage for existing users of
UCS_BASIC who might be expecting C-like behavior?

Regards,
    Jeff Davis

Re: Built-in CTYPE provider

From

Peter Eisentraut

Date:

13 February 2024, 06:24:32

On 13.02.24 03:01, Jeff Davis wrote:
> 1. The SQL spec mentions the capitalization of "ß" as "SS"
> specifically. Should UCS_BASIC use the unconditional mappings in
> SpecialCasing.txt? I already have some code to do that (not posted
> yet).

It is my understanding that "correct" Unicode case conversion needs to 
use at least some parts of SpecialCasing.txt.  The header of the file says

"For compatibility, the UnicodeData.txt file only contains simple case 
mappings for characters where they are one-to-one and independent of 
context and language. The data in this file, combined with the simple 
case mappings in UnicodeData.txt, defines the full case mappings [...]"

I read this as, just using UnicodeData.txt by itself is incomplete.

I think we need to use the "Unconditional" mappings and the "Conditional 
Language-Insensitive" mappings (which is just Greek sigma).  Obviously, 
skip the "Language-Sensitive" mappings.

Re: Built-in CTYPE provider

From

Jeff Davis

Date:

16 February 2024, 00:13:19

On Wed, 2024-02-07 at 10:53 +0100, Peter Eisentraut wrote:
> Review of the v16 patch set:
>
> (Btw., I suppose you started this patch series with 0002 because some
> 0001 was committed earlier.  But I have found this rather confusing. 
> I
> think it's ok to renumber from 0001 for each new version.)

Fixed.

> Various comments are updated to include the term "character class". 
> I
> don't recognize that as an official Unicode term.  There are
> categories
> and properties.  Let's check this.

Changed to "properties" or "compatibility properties", except for a
couple places in the test. The test compares against ICU, which does
use the term "character classes" when discussing regexes:

https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#details

> Some files need heavy pgindent and perltidy.

Done.

> This patch series adds some new post-update-unicode tests.  Should we
> have a separate target for each or just one common "unicode test"
> target?  Not sure.

I didn't make a change here. I suspect anyone updating unicode would
want to run them all, but I don't have a strong opinion.

> - .../generate-unicode_category_table.pl
>
> The trailing commas handling ($firsttime etc.) is not necessary with
> C99.  The code can be simplified.

Thank you, fixed.

> For this kind of code:
>
> +print $OT <<"HEADER";

Done. I used the <<"EOS" style which is more friendly to emacs, but I'm
not sure if that's right for the project style.

> Is it potentially confusing that only some pg_u_prop_* have a posix
> variant?  Would it be better for a consistent interface to have a
> "posix" argument for each and just ignore it if not used?  Not sure.

I don't have a strong opinion here, so I didn't make a change. I can if
you think it's cleaner.

> Let's use size_t instead of Size for new code.

Done.

> * v16-0003-Add-unicode-case-mapping-tables-and-functions.patch
>
> Several of the above points apply here analogously.

Fixed, I think.

> * v16-0004-Catalog-changes-preparing-for-builtin-collation-.patch
>
> This is mostly a straightforward renaming patch, but there are some
> changes in initdb and pg_dump that pre-assume the changes in the next
> patch, like which locale columns apply for which providers.  I think
> it
> would be better for the historical record to make this a straight
> renaming patch and move those semantic changes to the next patch (or
> a
> separate intermediate patch, if you prefer).

Agreed, put those non-renaming changes in the next patch.

> - src/bin/psql/describe.c
> - src/test/regress/expected/psql.out
>
> This would be a good opportunity to improve the output columns for
> collations.  The updated view is now:
>
> + Schema | Name | Provider | Collate | Ctype | Locale | ICU Rules |
> Deterministic?
> +--------+------+----------+---------+-------+--------+-----------+--
> --------------
>
> This is historically grown but suboptimal.  Why is Locale after
> Collate
> and Ctype, and why does it show both?  I think we could have just the
> Locale column, and if the libc provider is used with different
> collate/ctype (very rare!), we somehow write that into the single
> locale
> column.
>
> (A change like this would be a separate patch.)

I didn't do this, yet.

> * v16-0005-Introduce-collation-provider-builtin-for-C-and-C.patch
>
> About this initdb --builtin-locale option and analogous options
> elsewhere:  Maybe we should flip this around and provide a --libc-
> locale
> option, and have all the other providers just use the --locale
> option.
> This would be more consistent with the fact that it's libc that is
> special in this context.

I agree that libc is the odd one out. I'm not quite sure how we should
express that, though, because there are also the other environment
variables to worry about (e.g. LC_MESSAGES). Probably best as a
separate patch.

> Some changes in this patch appear to be just straight renamings, like
> in
> src/backend/utils/init/postinit.c and
> src/bin/pg_upgrade/t/002_pg_upgrade.pl.  Maybe those should be put
> into
> the previous patch instead.

Moved renamings to the previous patch.

Regards,
    Jeff Davis

Attachment

Re: Built-in CTYPE provider

From

Jeff Davis

Date:

27 February 2024, 03:01:37

On Tue, 2024-02-13 at 07:24 +0100, Peter Eisentraut wrote:
> It is my understanding that "correct" Unicode case conversion needs
> to
> use at least some parts of SpecialCasing.txt.
...
> I think we need to use the "Unconditional" mappings and the
> "Conditional
> Language-Insensitive" mappings (which is just Greek sigma). 
> Obviously,
> skip the "Language-Sensitive" mappings.

Attached a new series.

Overall I'm quite happy with this feature as well as the recent
updates. It expands a lot on what behavior we can actually document;
the character semantics are nearly as good as ICU; it's fast; and it
eliminates what is arguably the last reason to use libc ("C collation
combined with some other CTYPE").

Changes:

 * Added a doc update for the "standard collations" (tiny patch, mostly
separate) which clarifies the collations that are always available, and
describes them a bit better

 * Added built-in locale "UCS_BASIC" (is that name confusing?) which
uses full case mapping and the standard properties:
   - "ß" uppercases to "SS"
   - "Σ" usually lowercases to "σ", except when the Final_Sigma
condition is met, in which case it lowercases to "ς"
   - initcap() uses titlecase variants ("ǆ" changes to "ǅ")
   - in patterns/regexes, symbols (like "=") are not treated as
punctuation

 * Changed the UCS_BASIC collation to use the builtin "UCS_BASIC"
locale with Unicode semantis. At first I was skeptical because it's a
behavior change, and I am still not sure we want to do that. But doing
so would take us closer to both the SQL spec as well as Unicode; and
also this kind of character behavior change is less likely to cause a
problem than a collation behavior change.

 * The built-in locale "C.UTF-8" still exists, which uses Unicode
simple case mapping and the POSIX compatible properties (no change
here).

Implementation-wise:

 * I introduced the CaseKind enum, which seemed to clean up a few
things and reduce code duplication between upper/lower/titlecase. It
also leaves room for introducing case folding later.

 * Introduced a "case-ignorable" table to properly implement the
Final_Sigma rule.

Loose ends:

 * Right now you can't mix all of the full case mapping behavior with
INITCAP(), it just does simple titlecase mapping. I'm not sure we want
to get too fancy here; after all, INITCAP() is not a SQL standard
function and it's documented in a narrow fashion that doesn't seem to
leave a lot of room to be very smart. ICU does a few extra things
beyond what I did:
  - it accepts a word break iterator to the case conversion function
  - it provides some built-in word break iterators
  - it also has some configurable "break adjustment" behavior[1][2]
which re-aligns the start of the word, and I'm not entirely sure why
that isn't done in the word break iterator or the titlecasing rules

Regards,
    Jeff Davis

[1]
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/stringoptions_8h.html#a4975f537b9960f0330b233061ef0608d
[2]
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/stringoptions_8h.html#afc65fa226cac9b8eeef0e877b8a7744e

Attachment

Re: Built-in CTYPE provider

From

Jeff Davis

Date:

01 March 2024, 05:05:34

On Mon, 2024-02-26 at 19:01 -0800, Jeff Davis wrote:
>  * Right now you can't mix all of the full case mapping behavior with
> INITCAP(), it just does simple titlecase mapping. I'm not sure we
> want
> to get too fancy here; after all, INITCAP() is not a SQL standard
> function and it's documented in a narrow fashion that doesn't seem to
> leave a lot of room to be very smart. ICU does a few extra things
> beyond what I did:
>   - it accepts a word break iterator to the case conversion function
>   - it provides some built-in word break iterators
>   - it also has some configurable "break adjustment" behavior[1][2]
> which re-aligns the start of the word, and I'm not entirely sure why
> that isn't done in the word break iterator or the titlecasing rules

Attached v19 which addresses this issue. It does proper Unicode
titlecasing with a word boundary iterator as an argument. For initcap,
it just uses a simple word boundary iterator that breaks whenever
isalnum() changes.

It came out cleaner this way, ultimately, and it feels more complete
even though the behavior isn't much different. It's also easier to
comment the relationship of the functions to Unicode. I removed
CaseKind from the public API but still use it internally to avoid code
duplication.

I made one other change, which is that (for now) I undid the UCS_BASIC
change until we are sure we want to change it. Instead, I have builtin
collations PG_C_UTF8 and PG_UNICODE_FAST. I used the name "FAST" to
indicate that the collation uses fast memcmp() rather than a real
collation, but the Unicode character support is all there (including
full case mapping). I'm open to suggestion here on naming.

Regards,
    Jeff Davis

Attachment

Re: Built-in CTYPE provider

From

Jeff Davis

Date:

02 March 2024, 23:02:00

On Thu, 2024-02-29 at 21:05 -0800, Jeff Davis wrote:
> Attached v19 which addresses this issue.

I pushed the doc patch.

Attached v20. I am going to start pushing some other patches. v20-0001
(property tables) and v20-0003 (catalog iculocale -> locale) have been
stable for a while so are likely to go in soon. v20-0002 (case mapping)
also feels close to me, but it went through significant changes to
support full case mapping and titlecasing, so I'll see if there are
more comments.

Changes in v20:

 * For titlecasing with the builtin "C.UTF-8" locale, do not perform
word break adjustment, so it matches libc's "C.UTF-8" titlecasing
behavior more closely.

 * Add optimized table for ASCII code points when determining
categories and properties (this was already done for the case mapping
table).

 * Add a small patch to make UTF-8 functions inline, which speeds
things up substantially.

Performance:

ASCII-only data:

                       lower    initcap    upper

  "C" (libc)            2426       3326     2341
  pg_c_utf8             2890       6570     2825
  pg_unicode_fast       2929       7140     2893
  "C.utf8" (libc)       5410       7810     5397
  "en-US-x-icu"         8320      65732     9367

Including non-ASCII data:

                       lower    initcap    upper

  "C" (libc)            2630       4677     2548
  pg_c_utf8             5471      10682     5431
  pg_unicode_fast       5582      12023     5587
  "C.utf8" (libc)       8126      11834     8106
  "en-US-x-icu"        14473      73655    15112


The new builtin collations nicely finish ahead of everything except "C"
(with an exception where pg_unicode_fast is marginally slower at
titlecasing non-ASCII data than libc "C.UTF-8", which is likely due to
the word break adjustment semantics).

I suspect the inlined UTF-8 functions also speed up a few other areas,
but I didn't measure.

Regards,
    Jeff Davis

Attachment

Re: Built-in CTYPE provider

From

Jeff Davis

Date:

08 March 2024, 01:00:21

On Sat, 2024-03-02 at 15:02 -0800, Jeff Davis wrote:
> Attached v20.

And here's v22 (I didn't post v21).

I committed Unicode property tables and functions, and the simple case
mapping. I separated out the full case mapping changes (based on
SpecialCasing.txt) into patch 0006.

Not a lot of technical changes, but I cleaned up the remaining patches
and put them into a nicer order with nicer commit messages.

0001: Catalog renaming: colliculocale to colllocale and daticulocale to
datlocale.

0002: Basic builtin collation provider that only supports "C".

0003: C.UTF-8 locale for builtin collation provider and collation
pg_c_utf8.

0004: Inline some UTF-8 functions to improve performance

0005: Add a unicode_strtitle() function and move the implementation for
the builtin provider out of formatting.c.

0006: Add full case mapping support

0007: Add PG_UNICODE_FAST locale for builtin collation provider and
collation pg_unicode_fast. This behaves like the standard says
UCS_BASIC should behave -- sort by code point order but use Unicode
character semantics with full case mapping.


0004 and beyond could use some review. 0004 and 0005 are pretty simple
and non-controversial. 0006 and 0007 are a bit more interesting and
could use some discussion if we want to go ahead with full case mapping
in 17.

Regards,
    Jeff Davis

Attachment

Re: Built-in CTYPE provider

From

Peter Eisentraut

Date:

12 March 2024, 08:24:14

On 08.03.24 02:00, Jeff Davis wrote:
> And here's v22 (I didn't post v21).
> 
> I committed Unicode property tables and functions, and the simple case
> mapping. I separated out the full case mapping changes (based on
> SpecialCasing.txt) into patch 0006.

> 0002: Basic builtin collation provider that only supports "C".

Overall, this patch looks sound.

In the documentation, let's make the list of locale providers an actual 
list instead of a sequence of <sect3>s.

We had some discussion on initdb option --builtin-locale and whether it 
should be something more general.  I'm ok with leaving it like this for 
now and maybe consider as an "open item" for PG17.

In

     errmsg("parameter \"locale\" must be specified")

make "locale" a placeholder.  (See commit 36a14afc076).

It seems the builtin provider accepts both "C" and "POSIX" as locale
names, but the documentation says it must be "C".  Maybe we don't need
to accept "POSIX"?  (Seeing that there are no plans for "POSIX.UTF-8",
maybe we just ignore the "POSIX" spelling altogether?)

Speaking of which, the code in postinit.c is inconsistent in that 
respect with builtin_validate_locale().  Shouldn't postinit.c use
builtin_validate_locale(), to keep it consistent?

Or, there could be a general function that accepts a locale provider and 
a locale string and validates everything together?

In initdb.c, this message

printf(_("The database cluster will be initialized with no locale.\n"));

sounds a bit confusing.  I think it's ok to show "C" as a locale.  I'm
not sure we need to change the logic here.

Also in initdb.c, this message

pg_fatal("locale must be specified unless provider is libc");

should be flipped around, like

locale must be specified if provider is %s

In pg_dump.c, dumpDatabase(), there are some new warning messages that
are not specifically about the builtin provider.  Are those existing
deficiencies?  It's not clear to me.

What are the changes in the pg_upgrade test about?  Maybe explain the
scenario it is trying to test briefly?

> 0004: Inline some UTF-8 functions to improve performance

Makes sense that inlining can be effective here.  But why aren't you 
just inlining the existing function pg_utf_mblen()?  Now we have two 
functions that do the same thing.  And the comment at pg_utf_mblen() is 
removed completely, so it's not clear anymore why it exists.

> 0005: Add a unicode_strtitle() function and move the implementation for
> the builtin provider out of formatting.c.

In the recent discussion you had expression some uncertainty about the 
detailed semantics of this.  INITCAP() was copied from Oracle, so we 
could check there for reference, too.  Or we go with full Unicode 
semantics.  I'm not clear on all the differences and tradeoffs, if there 
are any.  In any case, it would be good if there were documentation or a 
comment that somehow wrote down the resolution of this.

Re: Built-in CTYPE provider

From

Jeff Davis

Date:

13 March 2024, 07:44:37

On Tue, 2024-03-12 at 09:24 +0100, Peter Eisentraut wrote:
> In the documentation, let's make the list of locale providers an
> actual
> list instead of a sequence of <sect3>s.

Done.

> We had some discussion on initdb option --builtin-locale and whether
> it
> should be something more general.  I'm ok with leaving it like this
> for
> now and maybe consider as an "open item" for PG17.

OK.

> In
>
>      errmsg("parameter \"locale\" must be specified")
>
> make "locale" a placeholder.  (See commit 36a14afc076).

Done.

> It seems the builtin provider accepts both "C" and "POSIX" as locale
> names, but the documentation says it must be "C".  Maybe we don't
> need
> to accept "POSIX"?  (Seeing that there are no plans for "POSIX.UTF-
> 8",
> maybe we just ignore the "POSIX" spelling altogether?)

Agreed, removed "POSIX".

> Speaking of which, the code in postinit.c is inconsistent in that
> respect with builtin_validate_locale().  Shouldn't postinit.c use
> builtin_validate_locale(), to keep it consistent?

Agreed, done.

> Or, there could be a general function that accepts a locale provider
> and
> a locale string and validates everything together?

That's a good idea -- perhaps a separate cleanup patch?

> In initdb.c, this message
>
> printf(_("The database cluster will be initialized with no
> locale.\n"));
>
> sounds a bit confusing.  I think it's ok to show "C" as a locale. 
> I'm
> not sure we need to change the logic here.

Agreed, removed.

> Also in initdb.c, this message
>
> pg_fatal("locale must be specified unless provider is libc");
>
> should be flipped around, like
>
> locale must be specified if provider is %s

Done.

> In pg_dump.c, dumpDatabase(), there are some new warning messages
> that
> are not specifically about the builtin provider.  Are those existing
> deficiencies?  It's not clear to me.

I wouldn't call that a deficiency, but it seemed to be a convenient
place to do some extra sanity checking along with the minor
reorganization I did in that area.

> What are the changes in the pg_upgrade test about?  Maybe explain the
> scenario it is trying to test briefly?

It's trying to be a better test for commit 9637badd9f, which eliminates
needless locale incompatibilities when performing a pg_upgrade.

At the time of that commit, the options for testing were fairly
limited, so I'm just expanding on that here a bit. It might be slightly
over-engineered? I added some comments and cleaned it up.

> > 0004: Inline some UTF-8 functions to improve performance
>
> Makes sense that inlining can be effective here.  But why aren't you
> just inlining the existing function pg_utf_mblen()?  Now we have two
> functions that do the same thing.  And the comment at pg_utf_mblen()
> is
> removed completely, so it's not clear anymore why it exists.

I was trying to figure out what to do about USE_PRIVATE_ENCODING_FUNCS.

If libpq exports pg_utf_mblen(), it needs to continue to export that,
or else it's an ABI break, right? So that means we need at least one
extern copy of the function. See b6c7cfac88.

Though now that I look at it, I'm not even calling the inlined version
from my code -- I must have been using it in an earlier version and now
not. So I just left pg_utf_mblen() alone, and inlined unicode_to_utf8()
and utf8_to_unicode().

> > 0005: Add a unicode_strtitle() function and move the implementation
> > for
> > the builtin provider out of formatting.c.
>
> In the recent discussion you had expression some uncertainty about
> the
> detailed semantics of this.  INITCAP() was copied from Oracle, so we
> could check there for reference, too.  Or we go with full Unicode
> semantics.  I'm not clear on all the differences and tradeoffs, if
> there
> are any.  In any case, it would be good if there were documentation
> or a
> comment that somehow wrote down the resolution of this.

There are a few nuances that are different between the Unicode way to
titlecase a string and INITCAP():

  1. For the initial character in a word, Unicode uses the titlecase
mapping, whereas INITCAP (as the name suggests) uses the uppercase
mapping.
  2. Unicode uses full case mapping, which can change the length of the
string (e.g. mapping "ß" to the titlecase "Ss" -- though I've heard
that titlecasing "ß" doesn't make a lot of sense in German because
words typically don't begin with it). Full case mapping can also handle
context-sensitive mappings, such as the "final sigma".
  3. Unicode has a lot to say about word boundaries, whereas INITCAP()
just uses the boundary between alnum and !alnum.

The unicode_strtitle() function is just a way to unify those
differences into one implementation. A "full" parameter controls
behaviors 1 & 2, and a callback handles 3. If we just want to keep it
simple, we can leave it as the character-by-character algorithm in
formatting.c.

My uncertainty was whether we really want INITCAP to be doing these
more sophisticated titlecasing transformations, or whether that should
be a separate sql function (title()? titlecase()?), or whether we just
don't need that functionality.

New series attached. I plan to commit 0001 very soon.

Regards,
    Jeff Davis

Attachment

Re: Built-in CTYPE provider

From

Jeff Davis

Date:

14 March 2024, 08:08:19

On Wed, 2024-03-13 at 00:44 -0700, Jeff Davis wrote:
> New series attached. I plan to commit 0001 very soon.

Committed the basic builtin provider, supporting only the "C" locale.

There were a few changes since the last version I posted:

  * Added simplistic validation of the locale name to initdb.c (missing
before).
  * Consistently passed the locale name to
get_collation_actual_version(). In the previous patch, the caller
sometimes just passed NULL knowing that the builtin provider is not
versioned, but that's not the caller's responsibility.
  * pg_dump previously had some minor refactoring, which you had some
questions about. I eliminated that and just kept it to the changes
necessary for the builtin provider.
  * createdb --help was missing the --builtin-locale option
  * improved error checking order in createdb() to avoid a confusing
error message.

I also attached a rebased series.

0001 (the C.UTF-8 locale) is also close. Considering that most of the
infrastructure is already in place, that's not a large patch. You many
have some comments about the way I'm canonicalizing and validating in
initdb -- that could be cleaner, but it feels like I should refactor
the surrounding code separately first.

0002 (inlining utf8 functions) is also ready.

For 0003 and beyond, I'd like some validation that it's what you had in
mind.

Regards,
    Jeff Davis

On Thu, 2024-03-14 at 09:54 +0100, Peter Eisentraut wrote:
> * doc/src/sgml/charset.sgml
>
> I don't understand the purpose of this sentence:
>
> "When using this locale, the behavior may depend on the database
> encoding."

The "C" locale (in either the builtin or libc provider) can sort
differently in different encodings, because it's based on memcmp. For
instance:

   select U&'\20AC' > U&'\201A' collate "C";

Returns true in UTF-8 and false in WIN1252. That's why UCS_BASIC is
only available in UTF-8, because (at least for some encodings) we'd
have to decode before comparison to get the code-point-order semantics
right.

In other words, the "C" collation is not a well-defined order, but
UCS_BASIC and C.UTF-8 are well-defined.

Suggestions for better wording are welcome.

> * doc/src/sgml/ref/create_database.sgml
>
> The new parameter builtin_locale is not documented.

Thank you, fixed in 0001 (review fixup).

> * src/backend/commands/collationcmds.c
>
> I think DefineCollation() should set collencoding = -1 for the
> COLLPROVIDER_BUILTIN case.  -1 stands for any encoding.  Or at least
> explain why not?

In the attached v25-0001 (review fixup) I have made it the
responsibility of a function, and then extended that for the C.UTF-8
(0002) and PG_UNICODE_FAST locales (0007).

> * src/backend/utils/adt/pg_locale.c
>
> This part is a bit confusing:
>
> +           cache_entry->collate_is_c = true;
> +           cache_entry->ctype_is_c = (strcmp(colllocale, "C") == 0);
>
> Is collate always C but ctype only sometimes?  Does this anticipate
> future patches in this series?  Maybe in this patch it should always
> be true?

Made it a constant in v25-0001, and changed it in 0002

>
> * src/bin/initdb/initdb.c
>
> +   printf(_("      --builtin-locale=LOCALE   set builtin locale name
> for new databases\n"));
>
> Put in a line break so that the right "column" lines up.

Fixed in 0001

> This output should line up better:
>
> The database cluster will be initialized with this locale
> configuration:
>    default collation provider:  icu
>    default collation locale:    en
>    LC_COLLATE:  C
>    LC_CTYPE:    C
>    ...
>
> Also, why are there two spaces after "provider:  "?
>
> Also we call these locale provider on input, why are they collation
> providers on output?  What is a "collation locale"?

I tried to fix these things in 0001.

> * src/bin/pg_upgrade/t/002_pg_upgrade.pl
>
> +if ($oldnode->pg_version >= '17devel')
>
> This is weird.  >= is a numeric comparison, so providing a string
> with
> non-digits is misleading at best.

It's actually not a numeric comparison, it's an overloaded comparison
op for the Version class.

See 32dd2c1eff and:
https://www.postgresql.org/message-id/1738174.1710274577%40sss.pgh.pa.us

> * src/test/icu/t/010_database.pl
>
> -# Test that LOCALE works for ICU locales if LC_COLLATE and LC_CTYPE
> -# are specified
>
> Why remove this test?

It must have been lost during a rebase, fixed in 0001.

> Change the name of the new database to be different from the name of
> the template database.

Fixed in 0001.

New series attached.

Regards,
    Jeff Davis

On Tue, 2024-03-19 at 13:41 +0100, Peter Eisentraut wrote:
> * v25-0002-Support-C.UTF-8-locale-in-the-new-builtin-collat.patch
>
> Looks ok.

Committed.

> * v25-0003-Inline-basic-UTF-8-functions.patch

Committed.

> * v25-0004-Use-version-for-builtin-collations.patch
>
> Not sure about the version format "1.0", which implies some sort of
> major/minor or component-based system.  I would just use "1".

The v26 patch was not quite complete, so I didn't commit it yet.
Attached v27-0001 and 0002.

0002 is necessary because otherwise lc_collate_is_c() short-circuits
the version check in pg_newlocale_from_collation(). With 0002, the code
is simpler and all paths go through pg_newlocale_from_collation(), and
the version check happens even when lc_collate_is_c().

But perhaps there was a reason the code was the way it was, so
submitting for review in case I missed something.

> 0005 and 0006 don't contain any test cases.  So I guess they are
> really
> only usable via 0007.  Is that understanding correct?

0005 is not a functional change, it's just a refactoring to use a
callback, which is preparation for 0007.

> Are there any test cases that illustrate the word boundary changes in
> patch 0005?  It might be useful to test those against Oracle as well.

The tests include initcap('123abc') which is '123abc' in the PG_C_UTF8
collation vs '123Abc' in PG_UNICODE_FAST.

The reason for the latter behavior is that the Unicode Default Case
Conversion algorithm for toTitlecase() advances to the next Cased
character before mapping to titlecase, and digits are not Cased. ICU
has a configurable adjustment, and defaults in a way that produces
'123abc'.

New rebased series attached.

Regards,
    Jeff Davis

On Tue, 2024-03-26 at 08:04 +0100, Peter Eisentraut wrote:
> The patch set v27 is ok with me, modulo (a) discussion about initcap
> semantics, and (b) what collation to assign to ucs_basic, which can
> be
> revisited later.

Attached v28.

The remaining patches are for full case mapping and PG_UNICODE_FAST. 

I am fine waiting until July to get these remaining patches committed.
That would give us time to sort out details like:

* Get consensus that it's OK to change UCS_BASIC.
* Figure out if we need a pg-specific locale and whether
PG_UNICODE_FAST is the right name.
* Make sure that full case mapping interacts with regexes in a sane way
(probably it needs to just fall back to simple case mapping, but
perhaps that's worth a discussion).
* Implement case folding.
* Implement a more unicode-friendly TITLECASE() function, which could
offer a number of options that don't fit well with INITCAP().
* Figure out if UPPER()/LOWER() should also have some of those options.

Thoughts?

Regards,
    Jeff Davis

On Fri, 2024-04-05 at 11:22 +1300, Thomas Munro wrote:
> Hi,
>
> +command_ok(
> +       [
> +               'initdb', '--no-sync',
> +               '--locale-provider=builtin', '-E UTF-8',
> +               '--builtin-locale=C.UTF-8', "$tempdir/data8"
> +       ],
> +       'locale provider builtin with -E UTF-8 --builtin-
> locale=C.UTF-8');

...

>   LC_COLLATE:  en_US
>   LC_CTYPE:    en_US
>   LC_MESSAGES: C
>   LC_MONETARY: en_US
>   LC_NUMERIC:  en_US
>   LC_TIME:     en_US
> initdb: error: encoding mismatch
> initdb: detail: The encoding you selected (UTF8) and the encoding
> that
> the selected locale uses (LATIN1) do not match.

Thank you for the report.

I fixed it in e2a2357671 by forcing the environment locale to C which
is compatible with any encoding. The test still forces the encoding to
UTF-8 and the collation to the builtin C.UTF-8.

In passing, I noticed some unrelated regression test failures when I
set LANG=tr_TR: tsearch, tsdict, json, and jsonb. There's an additional
failure in the updatable_views test when LANG=tr_TR.utf8. I haven't
looked into the details yet.

Regards,
    Jeff Davis

Attachment

v1-0001-Fix-test-failure-in-001_initdb.pl.patch

Re: Built-in CTYPE provider

From

Noah Misch

Date:

29 June 2024, 22:08:57

On Wed, Mar 20, 2024 at 05:13:26PM -0700, Jeff Davis wrote:
> On Tue, 2024-03-19 at 13:41 +0100, Peter Eisentraut wrote:
> > * v25-0002-Support-C.UTF-8-locale-in-the-new-builtin-collat.patch
> > 
> > Looks ok.
> 
> Committed.

>       <varlistentry>
> +      <term><literal>pg_c_utf8</literal></term>
> +      <listitem>
> +       <para>
> +        This collation sorts by Unicode code point values rather than natural
> +        language order.  For the functions <function>lower</function>,
> +        <function>initcap</function>, and <function>upper</function>, it uses
> +        Unicode simple case mapping.  For pattern matching (including regular
> +        expressions), it uses the POSIX Compatible variant of Unicode <ulink
> +        url="https://www.unicode.org/reports/tr18/#Compatibility_Properties">Compatibility
> +        Properties</ulink>.  Behavior is efficient and stable within a
> +        <productname>Postgres</productname> major version.  This collation is
> +        only available for encoding <literal>UTF8</literal>.
> +       </para>
> +      </listitem>
> +     </varlistentry>

lower(), initcap(), upper(), and regexp_matches() are PROVOLATILE_IMMUTABLE.
Until now, we've delegated that responsibility to the user.  The user is
supposed to somehow never update libc or ICU in a way that changes outcomes
from these functions.  Now that postgresql.org is taking that responsibility
for builtin C.UTF-8, how should we govern it?  I think the above text and [1]
convey that we'll update the Unicode data between major versions, making
functions like lower() effectively STABLE.  Is that right?

(This thread had some discussion[2] that datcollversion/collversion won't
necessarily change when a major versions changes lower() behavior.)

[1] https://postgr.es/m/7089acb3ebac0c1682a79c8bc16803cf06896fb9.camel@j-davis.com
[2] https://postgr.es/m/5a1ecc40539f36cac5b27a62739a45a49785ca54.camel@j-davis.com

Re: Built-in CTYPE provider

From

Jeff Davis

Date:

01 July 2024, 19:24:15

On Sat, 2024-06-29 at 15:08 -0700, Noah Misch wrote:
> lower(), initcap(), upper(), and regexp_matches() are
> PROVOLATILE_IMMUTABLE.
> Until now, we've delegated that responsibility to the user.  The user
> is
> supposed to somehow never update libc or ICU in a way that changes
> outcomes
> from these functions.

To me, "delegated" connotes a clear and organized transfer of
responsibility to the right person to solve it. In that sense, I
disagree that we've delegated it.

What's happened here is evolution of various choices that seemed
reasonable at the time. Unfortunately, the consequences that are hard
for us to manage and even harder for users to manage themselves.

>   Now that postgresql.org is taking that responsibility
> for builtin C.UTF-8, how should we govern it?  I think the above text
> and [1]
> convey that we'll update the Unicode data between major versions,
> making
> functions like lower() effectively STABLE.  Is that right?

Marking them STABLE is not a viable option, that would break a lot of
valid use cases, e.g. an index on LOWER().

Unicode already has its own governance, including a stability policy
that includes case mapping:

https://www.unicode.org/policies/stability_policy.html#Case_Pair

Granted, that policy does not guarantee that the results will never
change. In particular, the results can change if using unassinged code
poitns that are later assigned to Cased characters.

That's not terribly common though; for instance, there are zero changes
in uppercase/lowercase behavior between Unicode 14.0 (2021) and 15.1
(current) -- even for code points that were unassigned in 14.0 and
later assigned. I checked this by modifying case_test.c to look at
unassigned code points as well.

There's a greater chance that character properties can change (e.g.
whether a character is "alphabetic" or not) in new releases of Unicode.
Such properties can affect regex character classifications, and in some
cases the results of initcap (because it uses the "alphanumeric"
classification to determine word boundaries).

I don't think we need code changes for 17. Some documentation changes
might be helpful, though. Should we have a note around LOWER()/UPPER()
that users should REINDEX any dependent indexes when the provider is
updated?

> (This thread had some discussion[2] that datcollversion/collversion
> won't
> necessarily change when a major versions changes lower() behavior.)

datcollversion/collversion track the vertsion of the collation
specifically (text ordering only), not the ctype (character semantics).
When using the libc provider, get_collation_actual_version() completely
ignores the ctype.

It would be interesting to consider tracking the versions separately,
though.

Regards,
    Jeff Davis

Re: Built-in CTYPE provider

From

Noah Misch

Date:

01 July 2024, 23:03:52

On Mon, Jul 01, 2024 at 12:24:15PM -0700, Jeff Davis wrote:
> On Sat, 2024-06-29 at 15:08 -0700, Noah Misch wrote:
> > lower(), initcap(), upper(), and regexp_matches() are
> > PROVOLATILE_IMMUTABLE.
> > Until now, we've delegated that responsibility to the user.  The user
> > is
> > supposed to somehow never update libc or ICU in a way that changes
> > outcomes
> > from these functions.
> 
> To me, "delegated" connotes a clear and organized transfer of
> responsibility to the right person to solve it. In that sense, I
> disagree that we've delegated it.

Good point.

> >   Now that postgresql.org is taking that responsibility
> > for builtin C.UTF-8, how should we govern it?  I think the above text
> > and [1]
> > convey that we'll update the Unicode data between major versions,
> > making
> > functions like lower() effectively STABLE.  Is that right?
> 
> Marking them STABLE is not a viable option, that would break a lot of
> valid use cases, e.g. an index on LOWER().

I agree.

> I don't think we need code changes for 17. Some documentation changes
> might be helpful, though. Should we have a note around LOWER()/UPPER()
> that users should REINDEX any dependent indexes when the provider is
> updated?

I agree the v17 code is fine.  Today, a user can (with difficulty) choose
dependency libraries so regexp_matches() is IMMUTABLE, as marked.  I don't
want $SUBJECT to be the ctype that, at some post-v17 version, can't achieve
that with unpatched PostgreSQL.  Let's change the documentation to say this
provider uses a particular snapshot of Unicode data, taken around PostgreSQL
17.  We plan never to change that data, so IMMUTABLE functions can rely on the
data.  If we provide a newer Unicode data set in the future, we'll provide it
in such a way that DDL must elect the new data.  How well would that suit your
vision for this feature?  An alternative would be to make pg_upgrade reject
operating on a cluster that contains use of $SUBJECT.

Re: Built-in CTYPE provider

From

Jeff Davis

Date:

02 July 2024, 01:19:08

On Mon, 2024-07-01 at 16:03 -0700, Noah Misch wrote:
> I agree the v17 code is fine.  Today, a user can (with difficulty)
> choose
> dependency libraries so regexp_matches() is IMMUTABLE, as marked.  I
> don't
> want $SUBJECT to be the ctype that, at some post-v17 version, can't
> achieve
> that with unpatched PostgreSQL.

We aren't forcing anyone to use the builtin "C.UTF-8" locale. Anyone
can still use the builtin "C" locale (which never changes), or another
provider if they can sort out the difficulties (and live with the
consequences) of pinning the dependencies to a specific version.

>   Let's change the documentation to say this
> provider uses a particular snapshot of Unicode data, taken around
> PostgreSQL
> 17.  We plan never to change that data, so IMMUTABLE functions can
> rely on the
> data.

We can discuss this in the context of version 18 or the next time we
plan to update Unicode. I don't think we should make such a promise in
version 17.

>   If we provide a newer Unicode data set in the future, we'll provide
> it
> in such a way that DDL must elect the new data.  How well would that
> suit your
> vision for this feature?

Thomas tried tracking collation versions along with individual objects,
and it had to be reverted (ec48314708).

It fits my vision to do something like that as a way of tightening
things up.

But there are some open design questions we need to settle, along with
a lot of work. So I don't think we should pre-emptively block all
Unicode updates waiting for it.

>   An alternative would be to make pg_upgrade reject
> operating on a cluster that contains use of $SUBJECT.

That wouldn't help anyone.

Regards,
    Jeff Davis

Re: Built-in CTYPE provider

From

Noah Misch

Date:

02 July 2024, 16:51:45

On Mon, Jul 01, 2024 at 06:19:08PM -0700, Jeff Davis wrote:
> On Mon, 2024-07-01 at 16:03 -0700, Noah Misch wrote:
> >   An alternative would be to make pg_upgrade reject
> > operating on a cluster that contains use of $SUBJECT.
> 
> That wouldn't help anyone.

Can you say more about that?  For the last decade at least, I think our
standard for new features has been to error rather than allow an operation
that creates a known path to wrong query results.  I think that's a helpful
standard that we should continue to follow.

Re: Built-in CTYPE provider

From

Peter Eisentraut

Date:

02 July 2024, 22:05:09

On 02.07.24 18:51, Noah Misch wrote:
> On Mon, Jul 01, 2024 at 06:19:08PM -0700, Jeff Davis wrote:
>> On Mon, 2024-07-01 at 16:03 -0700, Noah Misch wrote:
>>>    An alternative would be to make pg_upgrade reject
>>> operating on a cluster that contains use of $SUBJECT.
>>
>> That wouldn't help anyone.
> 
> Can you say more about that?  For the last decade at least, I think our
> standard for new features has been to error rather than allow an operation
> that creates a known path to wrong query results.  I think that's a helpful
> standard that we should continue to follow.

I don't think the builtin locale provider is any different in this 
respect from the other providers:  The locale data might change and 
there is a version mechanism to track that.  We don't prevent pg_upgrade 
in scenarios like that for other providers.

Re: Built-in CTYPE provider

From

Noah Misch

Date:

02 July 2024, 23:03:33

On Wed, Jul 03, 2024 at 12:05:09AM +0200, Peter Eisentraut wrote:
> On 02.07.24 18:51, Noah Misch wrote:
> > On Mon, Jul 01, 2024 at 06:19:08PM -0700, Jeff Davis wrote:
> > > On Mon, 2024-07-01 at 16:03 -0700, Noah Misch wrote:
> > > >    An alternative would be to make pg_upgrade reject
> > > > operating on a cluster that contains use of $SUBJECT.
> > > 
> > > That wouldn't help anyone.
> > 
> > Can you say more about that?  For the last decade at least, I think our
> > standard for new features has been to error rather than allow an operation
> > that creates a known path to wrong query results.  I think that's a helpful
> > standard that we should continue to follow.
> 
> I don't think the builtin locale provider is any different in this respect
> from the other providers:  The locale data might change and there is a
> version mechanism to track that.  We don't prevent pg_upgrade in scenarios
> like that for other providers.

Each packager can choose their dependencies so the v16 providers don't have
the problem.  With the $SUBJECT provider, a packager won't have that option.

Re: Built-in CTYPE provider

From

Jeff Davis

Date:

03 July 2024, 21:19:07

On Tue, 2024-07-02 at 16:03 -0700, Noah Misch wrote:
> Each packager can choose their dependencies so the v16 providers
> don't have
> the problem.  With the $SUBJECT provider, a packager won't have that
> option.

While nothing needs to be changed for 17, I agree that we may need to
be careful in future releases not to break things.

Broadly speaking, you are right that we may need to freeze Unicode
updates or be more precise about versioning. But there's a lot of
nuance to the problem, so I don't think we should pre-emptively promise
either of those things right now.

Consider:

* Unless I made a mistake, the last three releases of Unicode (14.0,
15.0, and 15.1) all have the exact same behavior for UPPER() and
LOWER() -- even for unassigned code points. It would be silly to
promise to stay with 15.1 and then realize that moving to 16.0 doesn't
create any actual problem.

* Unicode also offers "case folding", which has even stronger stability
guarantees, and I plan to propose that soon. When implemented, it would
be preferred over LOWER()/UPPER() in index expressions for most use
cases.

* While someone can pin libc+ICU to particular versions, it's
impossible when using the official packages, and additionally requires
using something like [1], which just became available last year. I
don't think it's reasonable to put it forth as a matter-of-fact
solution.

* Let's keep some perspective: we've lived for a long time with ALL
text indexes at serious risk of breakage. In contrast, the concerns you
are raising now are about certain kinds of expression indexes over data
containing certain unassigned code points. I am not dismissing that
concern, but the builtin provider moves us in the right direction and
let's not lose sight of that.

Given that no code changes for v17 are proposed, I suggest that we
refrain from making any declarations until the next version of Unicode
is released. If the pattern holds, that will be around September, which
still leaves time to make reasonable decisions for v18.

Regards,
    Jeff Davis

[1] https://github.com/awslabs/compat-collation-for-glibc

Re: Built-in CTYPE provider

From

Noah Misch

Date:

04 July 2024, 21:26:41

On Wed, Jul 03, 2024 at 02:19:07PM -0700, Jeff Davis wrote:
> * Unless I made a mistake, the last three releases of Unicode (14.0,
> 15.0, and 15.1) all have the exact same behavior for UPPER() and
> LOWER() -- even for unassigned code points. It would be silly to
> promise to stay with 15.1 and then realize that moving to 16.0 doesn't
> create any actual problem.

I think you're saying that if some Unicode update changes the results of a
STABLE function but does not change the result of any IMMUTABLE function, we
may as well import that update.  Is that about right?  If so, I agree.

In addition to the options I listed earlier (error in pg_upgrade or document
that IMMUTABLE stands) I would be okay with a third option.  Decide here that
we'll not adopt a Unicode update in a way that changes a v17 IMMUTABLE
function result of the new provider.  We don't need to write that in the
documentation, since it's implicit in IMMUTABLE.  Delete the "stable within a
<productname>Postgres</productname> major version" documentation text.

> * While someone can pin libc+ICU to particular versions, it's
> impossible when using the official packages, and additionally requires
> using something like [1], which just became available last year. I
> don't think it's reasonable to put it forth as a matter-of-fact
> solution.
> 
> * Let's keep some perspective: we've lived for a long time with ALL
> text indexes at serious risk of breakage. In contrast, the concerns you
> are raising now are about certain kinds of expression indexes over data
> containing certain unassigned code points. I am not dismissing that
> concern, but the builtin provider moves us in the right direction and
> let's not lose sight of that.

I see you're trying to help users get less breakage, and that's a good goal.
I agree $SUBJECT eliminates libc+ICU breakage, and libc+ICU breakage has hurt
plenty.  However, you proposed to update Unicode data and give REINDEX as the
solution to breakage this causes.  Unlike libc+ICU breakage, the packager has
no escape from that.  That's a different kind of breakage proposition, and no
new PostgreSQL feature should do that.  It's on a different axis from helping
users avoid libc+ICU breakage, and a feature doesn't get to credit helping on
one axis against a regression on the other axis.  What am I missing here?

> Given that no code changes for v17 are proposed, I suggest that we
> refrain from making any declarations until the next version of Unicode
> is released. If the pattern holds, that will be around September, which
> still leaves time to make reasonable decisions for v18.

Soon enough, a Unicode release will add one character to regexp [[:alpha:]].
PostgreSQL will then need to decide what IMMUTABLE is going to mean.  How does
that get easier in September?

Thanks,
nm

> [1] https://github.com/awslabs/compat-collation-for-glibc

Re: Built-in CTYPE provider

From

"Daniel Verite"

Date:

05 July 2024, 11:55:40

    Noah Misch wrote:

> > I don't think the builtin locale provider is any different in this respect
> > from the other providers:  The locale data might change and there is a
> > version mechanism to track that.  We don't prevent pg_upgrade in scenarios
> > like that for other providers.
>
> Each packager can choose their dependencies so the v16 providers don't have
> the problem.  With the $SUBJECT provider, a packager won't have that option.

The Unicode data files downloaded into src/common/unicode/
depend on the versions defined in Makefile.global.in:

  # Unicode data information

  # Before each major release, update these and run make update-unicode.

  # Pick a release from here: <https://www.unicode.org/Public/>.  Note
  # that the most recent release listed there is often a pre-release;
  # don't pick that one, except for testing.
  UNICODE_VERSION = 15.1.0

  # Pick a release from here: <http://cldr.unicode.org/index/downloads>
  CLDR_VERSION = 45

(CLDR_VERSION is apparently not used yet).

When these versions get bumped, it seems like packagers could stick to
previous versions by just overriding these.
When doing that, are there any function that may have an immutability
breakage problem with the built-in locale provider? (I would expect none).

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Re: Built-in CTYPE provider

From

Jeff Davis

Date:

05 July 2024, 17:13:09

On Fri, 2024-07-05 at 13:55 +0200, Daniel Verite wrote:
> When these versions get bumped, it seems like packagers could stick
> to
> previous versions by just overriding these.

That's an interesting point. It's actually easier for a packager to pin
Unicode to a specific version than to pin libc to a specific version.

> When doing that, are there any function that may have an immutability
> breakage problem with the built-in locale provider? (I would expect
> none).

Right, there wouldn't be any breakage without new data files.

Regards,
    Jeff Davis

Re: Built-in CTYPE provider

From

Jeff Davis

Date:

05 July 2024, 21:38:45

On Thu, 2024-07-04 at 14:26 -0700, Noah Misch wrote:
> I think you're saying that if some Unicode update changes the results
> of a
> STABLE function but does not change the result of any IMMUTABLE
> function, we
> may as well import that update.  Is that about right?  If so, I
> agree.

If you are proposing that Unicode updates should not be performed if
they affect the results of any IMMUTABLE function, then that's a new
policy.

For instance, the results of NORMALIZE() changed from PG15 to PG16 due
to commit 1091b48cd7:

  SELECT NORMALIZE(U&'\+01E030',nfkc)::bytea;

  Version 15: \xf09e80b0

  Version 16: \xd0b0

I am neither endorsing nor opposing the new policy you propose at this
time, but deep in the sub-thread of one particular feature is not the
right place to discuss it.

Please start a new thread for the proposed PG18 policy change and CC
me. I happen to think that around the release of the next version of
Unicode (in a couple months) would be the most productive time to have
that discussion, but you can start the discussion now if you like.

Regards,
    Jeff Davis

Re: Built-in CTYPE provider

From

Noah Misch

Date:

06 July 2024, 19:51:29

On Fri, Jul 05, 2024 at 02:38:45PM -0700, Jeff Davis wrote:
> On Thu, 2024-07-04 at 14:26 -0700, Noah Misch wrote:
> > I think you're saying that if some Unicode update changes the results
> > of a
> > STABLE function but does not change the result of any IMMUTABLE
> > function, we
> > may as well import that update.  Is that about right?  If so, I
> > agree.
> 
> If you are proposing that Unicode updates should not be performed if
> they affect the results of any IMMUTABLE function, then that's a new
> policy.
> 
> For instance, the results of NORMALIZE() changed from PG15 to PG16 due
> to commit 1091b48cd7:
> 
>   SELECT NORMALIZE(U&'\+01E030',nfkc)::bytea;
> 
>   Version 15: \xf09e80b0
> 
>   Version 16: \xd0b0

As a released feature, NORMALIZE() has a different set of remedies to choose
from, and I'm not proposing one.  I may have sidetracked this thread by
talking about remedies without an agreement that pg_c_utf8 has a problem.  My
question for the PostgreSQL maintainers is this:

  textregexeq(... COLLATE pg_c_utf8, '[[:alpha:]]') and lower(), despite being
  IMMUTABLE, will change behavior in some major releases.  pg_upgrade does not
  have a concept of IMMUTABLE functions changing, so index scans will return
  wrong query results after upgrade.  Is it okay for v17 to release a
  pg_c_utf8 planned to behave that way when upgrading v17 to v18+?

If the answer is yes, the open item closes.  If the answer is no, determining
the remedy can come next.

Lest concrete details help anyone reading, here are some affected objects:

  CREATE TABLE t (s text COLLATE pg_c_utf8);
  INSERT INTO t VALUES (U&'\+00a7dc'), (U&'\+001dd3');
  CREATE INDEX iexpr ON t ((lower(s)));
  CREATE INDEX ipred ON t (s) WHERE s ~ '[[:alpha:]]';

v17 can simulate the Unicode aspect of a v18 upgrade, like this:

  sed -i 's/^UNICODE_VERSION.*/UNICODE_VERSION = 16.0.0/' src/Makefile.global.in
  # ignore test failures (your ICU likely doesn't have the Unicode 16.0.0 draft)
  make -C src/common/unicode update-unicode
  make
  make install
  pg_ctl restart

Behavior after that:

-- 2 rows w/ seq scan, 0 rows w/ index scan
SELECT 1 FROM t WHERE s ~ '[[:alpha:]]';
SET enable_seqscan = off;
SELECT 1 FROM t WHERE s ~ '[[:alpha:]]';

-- ERROR:  heap tuple (0,1) from table "t" lacks matching index tuple within index "iexpr"
SELECT bt_index_parent_check('iexpr', heapallindexed => true);
-- ERROR:  heap tuple (0,1) from table "t" lacks matching index tuple within index "ipred"
SELECT bt_index_parent_check('ipred', heapallindexed => true);

Re: Built-in CTYPE provider

From

Tom Lane

Date:

06 July 2024, 20:19:21

Noah Misch <noah@leadboat.com> writes:
> As a released feature, NORMALIZE() has a different set of remedies to choose
> from, and I'm not proposing one.  I may have sidetracked this thread by
> talking about remedies without an agreement that pg_c_utf8 has a problem.  My
> question for the PostgreSQL maintainers is this:

>   textregexeq(... COLLATE pg_c_utf8, '[[:alpha:]]') and lower(), despite being
>   IMMUTABLE, will change behavior in some major releases.  pg_upgrade does not
>   have a concept of IMMUTABLE functions changing, so index scans will return
>   wrong query results after upgrade.  Is it okay for v17 to release a
>   pg_c_utf8 planned to behave that way when upgrading v17 to v18+?

I do not think it is realistic to define "IMMUTABLE" as meaning that
the function will never change behavior until the heat death of the
universe.  As a counterexample, we've not worried about applying
bug fixes or algorithm improvements that change the behavior of
"immutable" numeric computations.  It might be unwise to do that
in a minor release, but we certainly do it in major releases.

I'd say a realistic policy is "immutable means we don't intend to
change it within a major release".  If we do change the behavior,
either as a bug fix or a major-release improvement, that should
be release-noted so that people know they have to rebuild dependent
indexes and matviews.

It gets stickier for behaviors that aren't fully under our control,
which is the case for a lot of locale-related things.  We cannot then
promise "no changes within major releases".  But I do not think it
is helpful to react to that fact by refusing to label such things
immutable.  Then we'd just need another mutability classification,
and it would effectively act the same as immutable does now, because
people will certainly wish to use these functions in indexes etc.

            regards, tom lane

Re: Built-in CTYPE provider

From

Jeremy Schneider

Date:

06 July 2024, 20:37:06

On Jul 6, 2024, at 12:51 PM, Noah Misch <noah@leadboat.com> wrote:

Behavior after that:

-- 2 rows w/ seq scan, 0 rows w/ index scan
SELECT 1 FROM t WHERE s ~ '[[:alpha:]]';
SET enable_seqscan = off;
SELECT 1 FROM t WHERE s ~ '[[:alpha:]]';

-- ERROR: heap tuple (0,1) from table "t" lacks matching index tuple within index "iexpr"
SELECT bt_index_parent_check('iexpr', heapallindexed => true);
-- ERROR: heap tuple (0,1) from table "t" lacks matching index tuple within index "ipred"
SELECT bt_index_parent_check('ipred', heapallindexed => true);

Other databases do still ship built-in ancient versions of unicode (Db2 ships 4.0+ and Oracle ships 6.1+), and they have added new Unicode versions alongside the old but not removed the old versions. They claim to have “deprecated” old versions… but it seems they haven’t been able to get rid of them yet. Maybe some customer is willing to pay to continue deferring painful rebuilds needed to get rid of the old collation versions in commercial DBs?

For reference, see the table on slide 56 at https://www.pgevents.ca/events/pgconfdev2024/schedule/session/95-collations-from-a-to-z/ and also see https://ardentperf.com/2024/05/22/default-sort-order-in-db2-sql-server-oracle-postgres-17/

Thanks for the illustration with actual Unicode 16 draft data.

Also, not directly related to this email… but reiterating a point I argued for in the recorded talk at pgconf.dev in Vancouver: a very strong argument for having the DB default to a stable unchanging built-in collation is that the dependency tracking makes it easy to identify objects in the database using non-default collations, and it’s easy to know exactly what needs to be rebuilt for a user to safely change some non-default collation provider’s behavior.

-Jeremy

Sent from my TI-83

Re: Built-in CTYPE provider

From

Noah Misch

Date:

09 July 2024, 01:05:45

On Sat, Jul 06, 2024 at 04:19:21PM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > As a released feature, NORMALIZE() has a different set of remedies to choose
> > from, and I'm not proposing one.  I may have sidetracked this thread by
> > talking about remedies without an agreement that pg_c_utf8 has a problem.  My
> > question for the PostgreSQL maintainers is this:
> 
> >   textregexeq(... COLLATE pg_c_utf8, '[[:alpha:]]') and lower(), despite being
> >   IMMUTABLE, will change behavior in some major releases.  pg_upgrade does not
> >   have a concept of IMMUTABLE functions changing, so index scans will return
> >   wrong query results after upgrade.  Is it okay for v17 to release a
> >   pg_c_utf8 planned to behave that way when upgrading v17 to v18+?
> 
> I do not think it is realistic to define "IMMUTABLE" as meaning that
> the function will never change behavior until the heat death of the
> universe.  As a counterexample, we've not worried about applying
> bug fixes or algorithm improvements that change the behavior of
> "immutable" numeric computations.

True.  There's a continuum from "releases can change any IMMUTABLE function"
to "index integrity always wins, even if a function is as wrong as 1+1=3".
I'm less concerned about the recent "Incorrect results from numeric round"
thread, even though it's proposing to back-patch.  I'm thinking about these
aggravating factors for $SUBJECT:

- $SUBJECT is planning an annual cadence of this kind of change.

- We already have ICU providing collation support for the same functions.
  Unlike $SUBJECT, ICU integration gives packagers control over when to accept
  corruption at pg_upgrade time.

- SQL Server, DB2 and Oracle do their Unicode updates in a non-corrupting way.
  (See Jeremy Schneider's reply concerning DB2 and Oracle.)

- lower() and regexp are more popular in index expressions than
  high-digit-count numeric calculations.

> I'd say a realistic policy is "immutable means we don't intend to
> change it within a major release".  If we do change the behavior,
> either as a bug fix or a major-release improvement, that should
> be release-noted so that people know they have to rebuild dependent
> indexes and matviews.

It sounds like you're very comfortable with $SUBJECT proceeding in its current
form.  Is that right?

Re: Built-in CTYPE provider

From

Tom Lane

Date:

09 July 2024, 01:17:52

Noah Misch <noah@leadboat.com> writes:
> It sounds like you're very comfortable with $SUBJECT proceeding in its current
> form.  Is that right?

I don't have an opinion on whether the overall feature design
is well-chosen.  But the mere fact that Unicode updates will
from time to time change the behavior (presumably only in edge
cases or for previously-unassigned code points) doesn't strike
me as a big enough problem to justify saying these functions
can't be marked immutable anymore.  Especially since we have been
faced with that problem all along anyway; we just didn't have a way
to track or quantify it before, because locale changes happened
outside code we control.

            regards, tom lane

Re: Built-in CTYPE provider

From

Laurenz Albe

Date:

09 July 2024, 08:00:24

On Mon, 2024-07-08 at 18:05 -0700, Noah Misch wrote:
> > I do not think it is realistic to define "IMMUTABLE" as meaning that
> > the function will never change behavior until the heat death of the
> > universe.  As a counterexample, we've not worried about applying
> > bug fixes or algorithm improvements that change the behavior of
> > "immutable" numeric computations.
>
> True.  There's a continuum from "releases can change any IMMUTABLE function"
> to "index integrity always wins, even if a function is as wrong as 1+1=3".
> I'm less concerned about the recent "Incorrect results from numeric round"
> thread, even though it's proposing to back-patch.  I'm thinking about these
> aggravating factors for $SUBJECT:
>
> - $SUBJECT is planning an annual cadence of this kind of change.
>
> - We already have ICU providing collation support for the same functions.
>   Unlike $SUBJECT, ICU integration gives packagers control over when to accept
>   corruption at pg_upgrade time.
>
> - SQL Server, DB2 and Oracle do their Unicode updates in a non-corrupting way.
>   (See Jeremy Schneider's reply concerning DB2 and Oracle.)
>
> - lower() and regexp are more popular in index expressions than
>   high-digit-count numeric calculations.

My personal exprience is that very few users are aware of or care about
the strict accuracy of the collation sort order and other locale aspects.
But they care a lot about index corruption.

So I'd argue that we should not have any breaking changes at all, even in
cases where the provider is clearly wrong.

Yours,
Laurenz Albe

Re: Built-in CTYPE provider

From

Jeremy Schneider

Date:

09 July 2024, 14:51:39

On Tue, Jul 9, 2024 at 4:00 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:

My personal exprience is that very few users are aware of or care about
the strict accuracy of the collation sort order and other locale aspects.
But they care a lot about index corruption.

So I'd argue that we should not have any breaking changes at all, even in
cases where the provider is clearly wrong.

FWIW, using external ICU libraries is a nice solution for users who need strict and up-to-date Unicode support.

Cell phones do often get support for new code points before databases. So databases can end up storing characters before they are aware of the meaning. (Slide 27 in the pgconf.dev talk illustrates a recent timeline of Unicode & phone updates.)

-Jeremy

Re: Built-in CTYPE provider

From

Jeff Davis

Date:

09 July 2024, 23:20:12

On Mon, 2024-07-08 at 18:05 -0700, Noah Misch wrote:
> I'm thinking about these
> aggravating factors for $SUBJECT:

This is still marked as an open item for 17, but you've already
acknowledged[1] that no code changes are necessary in version 17.
Upgrades of Unicode take an active step from a committer, so it's not a
pressing problem for 18, either.

The idea that you're arguing against is "stability within a PG major
version". There's no new discovery here: it was listed under the
heading of "Benefits" near the top of my initial proposal[2], and known
to all reviewers.

This is not an Open Item for 17, and new policy discussions should not
happen deep in this subthread. Please resolve the Open Item, and feel
free to start a thread about policy changes in 18.

Regards,
    Jeff Davis

[1]
https://www.postgresql.org/message-id/20240701230352.2c.nmisch@google.com
[2]
https://www.postgresql.org/message-id/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.camel@j-davis.com

Re: Built-in CTYPE provider

From

Noah Misch

Date:

11 July 2024, 12:50:40

On Tue, Jul 09, 2024 at 04:20:12PM -0700, Jeff Davis wrote:
> On Mon, 2024-07-08 at 18:05 -0700, Noah Misch wrote:
> > I'm thinking about these
> > aggravating factors for $SUBJECT:
> 
> This is still marked as an open item for 17, but you've already
> acknowledged[1] that no code changes are necessary in version 17.

Later posts on the thread made that obsolete.  The next step is to settle the
question at https://postgr.es/m/20240706195129.fd@rfd.leadboat.com.  If that
conclusion entails a remedy, v17 code changes may be part of that remedy.

> The idea that you're arguing against is "stability within a PG major
> version". There's no new discovery here: it was listed under the
> heading of "Benefits" near the top of my initial proposal[2]

Thanks for that distillation.  More specifically, I'm arguing against the lack
of choice about instability across major versions:

                                  | ICU collations    | pg_c_utf8
----------------------------------|-------------------|----------
Corruption within a major version | packager's choice | no
Corruption at pg_upgrade time     | packager's choice | yes

I am a packager who chooses no-corruption (chooses stability).  As a packager,
the pg_c_utf8 stability within major versions is never a bad thing, but it
does not compensate for instability across major versions.  I don't want a
future in which pg_c_utf8 is the one provider that integrity-demanding
pg_upgrade users should not use.

> and known to all reviewers.

If after https://postgr.es/m/20240706195129.fd@rfd.leadboat.com and
https://postgr.es/m/20240709010545.8c.nmisch@google.com they think $SUBJECT
should continue as-committed, they should vote that way.  Currently, we have
multiple votes in one direction and multiple votes in the other direction.  If
all three reviewers were to vote in the same direction (assuming no other new
votes), I argue that such a count would render whichever way they vote as the
conclusion.  Does that match your count?

> [1]
> https://www.postgresql.org/message-id/20240701230352.2c.nmisch@google.com
> [2]
> https://www.postgresql.org/message-id/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.camel@j-davis.com

Re: Built-in CTYPE provider

From

Jeff Davis

Date:

17 July 2024, 15:48:46

On Thu, 2024-07-11 at 05:50 -0700, Noah Misch wrote:
> > This is still marked as an open item for 17, but you've already
> > acknowledged[1] that no code changes are necessary in version 17.
>
> Later posts on the thread made that obsolete.  The next step is to
> settle the
> question at https://postgr.es/m/20240706195129.fd@rfd.leadboat.com. 
> If that
> conclusion entails a remedy, v17 code changes may be part of that
> remedy.

This is the first time you've mentioned a code change in version 17. If
you have something in mind, please propose it. However, this feature
followed the right policies at the time of commit, so there would need
to be a strong consensus to accept such a change.

Additionally, I started a discussion on version 18 policy that may also
resolve your concerns:

https://www.postgresql.org/message-id/d75d2d0d1d2bd45b2c332c47e3e0a67f0640b49c.camel@j-davis.com

Regards,
    Jeff Davis

Re: Built-in CTYPE provider

From

Noah Misch

Date:

17 July 2024, 22:03:26

On Wed, Jul 17, 2024 at 08:48:46AM -0700, Jeff Davis wrote:
> On Thu, 2024-07-11 at 05:50 -0700, Noah Misch wrote:
> > > This is still marked as an open item for 17, but you've already
> > > acknowledged[1] that no code changes are necessary in version 17.
> > 
> > Later posts on the thread made that obsolete.  The next step is to
> > settle the
> > question at https://postgr.es/m/20240706195129.fd@rfd.leadboat.com. 
> > If that
> > conclusion entails a remedy, v17 code changes may be part of that
> > remedy.
> 
> This is the first time you've mentioned a code change in version 17. If

That's right.

> you have something in mind, please propose it. However, this feature
> followed the right policies at the time of commit, so there would need
> to be a strong consensus to accept such a change.

If I'm counting the votes right, you and Tom have voted that the feature's
current state is okay, and I and Laurenz have voted that it's not okay.  I
still hope more people will vote, to avoid dealing with the tie.  Daniel,
Peter, and Jeremy, you're all listed as reviewers on commit f69319f.  Are you
willing to vote one way or the other on the question in
https://postgr.es/m/20240706195129.fd@rfd.leadboat.com?

A tie would become a decision against the unreleased behavior.

In the event of a decision against the unreleased behavior, reverting the
feature is the remedy that could proceed without further decision making.

Re: Built-in CTYPE provider

From

Jeff Davis

Date:

18 July 2024, 06:06:43

On Wed, 2024-07-17 at 15:03 -0700, Noah Misch wrote:
> If I'm counting the votes right

...

> , you and Tom have voted that the feature's
> current state is okay, and I and Laurenz have voted that it's not
> okay.

...

> A tie would become a decision against the unreleased behavior.

...

> In the event of a decision against the unreleased behavior, reverting
> the
> feature is the remedy that could proceed without further decision
> making.

You haven't established that any problem actually exists in version 17,
and your arguments have been a moving target throughout this subthread.

I reject the procedural framework that you are trying to establish.
Voting won't change the fact that the "stability within a major
version" that you are arguing against[1] was highlighted as a benefit
in my initial proposal[2] for all reviewers to see.

If you press forward with this approach, I'll use judgement that is
sufficiently deferential to the review process before making any hasty
decisions.

Alternatively, I suggest that you participate in the thread that I
started here:

https://www.postgresql.org/message-id/d75d2d0d1d2bd45b2c332c47e3e0a67f0640b49c.camel%40j-davis.com

which seems like a more direct (and more complete) path to a resolution
of your concerns. I speak only for myself, but I assure you that I have
an open mind in that discussion, and that I have no intention force a
Unicode update past objections.

Regards,
    Jeff Davis

[1]
https://www.postgresql.org/message-id/20240711125040.11.nmisch@google.com
[2]
https://www.postgresql.org/message-id/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.camel@j-davis.com

Re: Built-in CTYPE provider

From

Laurenz Albe

Date:

18 July 2024, 08:05:34

On Wed, 2024-07-17 at 15:03 -0700, Noah Misch wrote:
> If I'm counting the votes right, you and Tom have voted that the feature's
> current state is okay, and I and Laurenz have voted that it's not okay.

Maybe I should expand my position.

I am very much for the built-in CTYPE provider.  When I said that I am against
changes in major versions, I mean changes that are likely to affect real-life
usage patterns.  If there are modifications affecting a code point that was
previously unassigned, it is *theoretically* possible, but very unlikely, that
someone has stored it in a database.  I would want to deliberate about any change
affecting such a code point, and if the change seems highly desirable, we can
consider applying it.

What I am against is routinely updating the built-in provider to adopt any changes
that Unicode makes.

To make a comparison with Tom's argument upthread: we have slightly changed how
floating point computations work, even though they are IMMUTABLE.  But I'd argue
that very few people build indexes on the results of floating point arithmetic
(and those who do are probably doing something wrong), so the risk is acceptable.
But people index strings all the time.

Yours,
Laurenz Albe

Re: Built-in CTYPE provider

From

"Daniel Verite"

Date:

18 July 2024, 11:29:27

    Noah Misch wrote:

> If I'm counting the votes right, you and Tom have voted that the feature's
> current state is okay, and I and Laurenz have voted that it's not okay.  I
> still hope more people will vote, to avoid dealing with the tie.  Daniel,
> Peter, and Jeremy, you're all listed as reviewers on commit f69319f.  Are
> you
> willing to vote one way or the other on the question in
> https://postgr.es/m/20240706195129.fd@rfd.leadboat.com?

For me, the current state is okay.

In the mentioned question, you're doing this:

  v17 can simulate the Unicode aspect of a v18 upgrade, like this:
    sed -i 's/^UNICODE_VERSION.*/UNICODE_VERSION = 16.0.0/'
src/Makefile.global.in

to force a Unicode upgrade. But a packager could do the same
to force a Unicode downgrade, if they wanted.

Therefore I don't agree with this summary in
<20240711125040.11.nmisch@google.com>:

>                                  | ICU collations    | pg_c_utf8
> ----------------------------------|-------------------|----------
> Corruption within a major version | packager's choice | no
> Corruption at pg_upgrade time     | packager's choice | yes



Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Re: Built-in CTYPE provider

From

Noah Misch

Date:

18 July 2024, 14:00:15

On Thu, Jul 18, 2024 at 10:05:34AM +0200, Laurenz Albe wrote:
> On Wed, 2024-07-17 at 15:03 -0700, Noah Misch wrote:
> > If I'm counting the votes right, you and Tom have voted that the feature's
> > current state is okay, and I and Laurenz have voted that it's not okay.
> 
> Maybe I should expand my position.
> 
> I am very much for the built-in CTYPE provider.  When I said that I am against
> changes in major versions, I mean changes that are likely to affect real-life
> usage patterns.  If there are modifications affecting a code point that was
> previously unassigned, it is *theoretically* possible, but very unlikely, that
> someone has stored it in a database.  I would want to deliberate about any change
> affecting such a code point, and if the change seems highly desirable, we can
> consider applying it.
> 
> What I am against is routinely updating the built-in provider to adopt any changes
> that Unicode makes.

Given all the messages on this thread, if the feature remains in PostgreSQL, I
advise you to be ready to tolerate PostgreSQL "routinely updating the built-in
provider to adopt any changes that Unicode makes".  Maybe someone will change
something in v18 so it's not like that, but don't count on it.

Would you like to change your vote to "okay", keep your vote at "not okay", or
change it to an abstention?

> To make a comparison with Tom's argument upthread: we have slightly changed how
> floating point computations work, even though they are IMMUTABLE.  But I'd argue
> that very few people build indexes on the results of floating point arithmetic
> (and those who do are probably doing something wrong), so the risk is acceptable.
> But people index strings all the time.

Agreed.

Re: Built-in CTYPE provider

From

Laurenz Albe

Date:

18 July 2024, 14:45:20

On Thu, 2024-07-18 at 07:00 -0700, Noah Misch wrote:
>  What I am against is routinely updating the built-in provider to adopt any changes
> > that Unicode makes.
>
> Given all the messages on this thread, if the feature remains in PostgreSQL, I
> advise you to be ready to tolerate PostgreSQL "routinely updating the built-in
> provider to adopt any changes that Unicode makes".  Maybe someone will change
> something in v18 so it's not like that, but don't count on it.
>
> Would you like to change your vote to "okay", keep your vote at "not okay", or
> change it to an abstention?

In that case I am against it.  Against the "routinely" in particular.

Yours,
Laurenz Albe

Re: Built-in CTYPE provider

From

Jeff Davis

Date:

18 July 2024, 16:52:44

On Thu, 2024-07-18 at 07:00 -0700, Noah Misch wrote:
> Given all the messages on this thread, if the feature remains in
> PostgreSQL, I
> advise you to be ready to tolerate PostgreSQL "routinely updating the
> built-in
> provider to adopt any changes that Unicode makes".

You mean messages from me, like:

  * "I have no intention force a Unicode update" [1]
  * "While nothing needs to be changed for 17, I agree that we may need
to be careful in future releases not to break things." [2]
  * "...you are right that we may need to freeze Unicode updates or be
more precise about versioning..." [2]
  * "If you are proposing that Unicode updates should not be performed
if they affect the results of any IMMUTABLE function...I am neither
endorsing nor opposing..." [3]

?

The only source I can imagine for your concern -- please correct me if
I'm wrong -- is that I declined to make a preemptive version 18 promise
deep in this version 17 Open Item subthread. But I have good reasons.
First, if we promise not to update Unicode, that would also affect
NORMALIZE(), so for the sake of transparency we need a top-level
discussion. Second, an Open Item should be tightly scoped to what
actually needs to happen in version 17 before release. And thirdly,
such a promise would artificially limit the range of possible outcomes,
which may include various compromises that are not 17 material.

I'm perplexed as to why you don't engage in the version 18 policy
discussion.

>   Maybe someone will change
> something in v18 so it's not like that, but don't count on it.

That's backwards. If nothing happens in v18, then there will be no
breaking Unicode change. It takes an active step by a committer to
update Unicode.

Regards,
    Jeff Davis

[1]
https://www.postgresql.org/message-id/5edb38923b0b23eb643f61807ef772a237ab92cf.camel%40j-davis.com
[2]
https://www.postgresql.org/message-id/db496682c6656ac64433f05f8821e561bbf4d105.camel@j-davis.com
[3]
https://www.postgresql.org/message-id/1d178eb1bbd61da1bcfe4a11d6545e9cdcede1d1.camel%40j-davis.com

Re: Built-in CTYPE provider

From

Tom Lane

Date:

18 July 2024, 17:03:31

Jeff Davis <pgsql@j-davis.com> writes:
> On Thu, 2024-07-18 at 07:00 -0700, Noah Misch wrote:
>> Maybe someone will change
>> something in v18 so it's not like that, but don't count on it.

> That's backwards. If nothing happens in v18, then there will be no
> breaking Unicode change. It takes an active step by a committer to
> update Unicode.

This whole discussion seems quite bizarre to me.  In the first
place, it is certain that Unicode will continue to evolve, and
I can't believe that we'd just freeze pg_c_utf8 on the current
definition forever.  Whether the first change happens in v18
or years later doesn't seem like a particularly critical point.

In the second place, I cannot understand why pg_c_utf8 is being
held to a mutability standard that we have never applied to any
other locale-related functionality --- and indeed could not do
so, since in most cases that functionality has been buried in
libraries we don't control.  It seems to me to be already a
great step forward that with pg_c_utf8, at least we can guarantee
that the behavior won't change without us knowing about it.
Noah's desire to revert the feature makes the mutability situation
strictly worse, because people will have to continue to rely on
OS-provided functionality that can change at any time.

            regards, tom lane

Re: Built-in CTYPE provider

From

Jeff Davis

Date:

18 July 2024, 17:13:34

On Thu, 2024-07-18 at 16:45 +0200, Laurenz Albe wrote:
> On Thu, 2024-07-18 at 07:00 -0700, Noah Misch wrote:
> >  What I am against is routinely updating the built-in provider to
> > adopt any changes
> > > that Unicode makes.

That is a perfectly reasonable position; please add it to the version
18 discussion[1].

> > Given all the messages on this thread, if the feature remains in
> > PostgreSQL, I
> > advise you to be ready to tolerate PostgreSQL "routinely updating
> > the built-in
> > provider to adopt any changes that Unicode makes".  Maybe someone
> > will change
> > something in v18 so it's not like that, but don't count on it.

...

> In that case I am against it.  Against the "routinely" in particular.

Also, please see my response[2] to Noah. I don't believe his statement
above is an accurate characterization. There's plenty of opportunity
for deliberation and compromise in version 18, and my mind is still
open to pretty much everything, up to and including freezing Unicode
updates if necessary[3].

Regards,
    Jeff Davis

[1]
https://www.postgresql.org/message-id/d75d2d0d1d2bd45b2c332c47e3e0a67f0640b49c.camel%40j-davis.com
[2]
https://www.postgresql.org/message-id/f29686018b4432a8ae2e535dbe19f0d88e7a79d5.camel@j-davis.com
[3]
https://www.postgresql.org/message-id/db496682c6656ac64433f05f8821e561bbf4d105.camel@j-davis.com

Re: Built-in CTYPE provider

From

Noah Misch

Date:

18 July 2024, 23:39:08

On Thu, Jul 18, 2024 at 01:03:31PM -0400, Tom Lane wrote:
> This whole discussion seems quite bizarre to me.  In the first
> place, it is certain that Unicode will continue to evolve, and
> I can't believe that we'd just freeze pg_c_utf8 on the current
> definition forever.  Whether the first change happens in v18
> or years later doesn't seem like a particularly critical point.
> 
> In the second place, I cannot understand why pg_c_utf8 is being
> held to a mutability standard that we have never applied to any
> other locale-related functionality --- and indeed could not do
> so, since in most cases that functionality has been buried in
> libraries we don't control.  It seems to me to be already a

With libc and ICU providers, packagers have a way to avoid locale-related
behavior changes.  That's the "mutability standard" I want pg_c_utf8 to join.
pg_c_utf8 is the one provider where packagers can't opt out[1] of annual
pg_upgrade-time index scan breakage on affected expression indexes.

> great step forward that with pg_c_utf8, at least we can guarantee
> that the behavior won't change without us knowing about it.
> Noah's desire to revert the feature makes the mutability situation
> strictly worse, because people will have to continue to rely on
> OS-provided functionality that can change at any time.

I see:
- one step forward:
  "string1 < string2" won't change, forever, regardless of packager choices
- one step backward:
  "string ~ '[[:alpha:]]'" will change at pg_upgrade time, regardless of packager choices

I think one's perspective on the relative importance of the step forward and
the step backward depends on the sort of packages one uses today.  Consider a
user of Debian packages with locale!=C, doing Debian upgrades and pg_upgrade.
For that user, pg_c_utf8 gets far less index corruption than an ICU locale.
The step forward is a great step forward _for this user_, and the step
backward is in the noise next to the step forward.

I'm with a different kind of packager.  I don't tolerate index scans returning
wrong answers.  To achieve that, my libc and ICU aren't changing collation
behavior.  I suspect my packages won't offer a newer ICU behavior until
PostgreSQL gets support for multiple ICU library versions per database.  (SQL
Server, DB2 and Oracle already do.  I agree we can't freeze forever.  The
multiple-versions feature gets more valuable each year.)  _For this_ kind of
package, the step forward is a no-op.  The step backward is the sole effect on
this kind of package.

How much does that pair of perspectives explain the contrast between my
"revert" and your "great step forward"?  We may continue to disagree on the
ultimate decision, but I hope I can make my position cease to appear bizarre
to you.

Thanks,
nm

[1] Daniel Verite said packagers could patch src/Makefile.global.in and run
"make -C src/common/unicode update-unicode".  Editing src/Makefile.global.in
is modifying PostgreSQL, not configuring a packager-facing option.

Re: Built-in CTYPE provider

From

Laurenz Albe

Date:

19 July 2024, 07:44:21

On Thu, 2024-07-18 at 13:03 -0400, Tom Lane wrote:
> In the second place, I cannot understand why pg_c_utf8 is being
> held to a mutability standard that we have never applied to any
> other locale-related functionality --- and indeed could not do
> so, since in most cases that functionality has been buried in
> libraries we don't control.

I believe that we should hold it to a higher standard *precisely
because* the previous way that we handled mutability in locale-related
functionality was a problem.

> It seems to me to be already a
> great step forward that with pg_c_utf8, at least we can guarantee
> that the behavior won't change without us knowing about it.

+1

But the greatness of the step depends on our readiness to be careful
with such changes.

> Noah's desire to revert the feature makes the mutability situation
> strictly worse, because people will have to continue to rely on
> OS-provided functionality that can change at any time.

I think everybody agrees that we don't want to expose users to data
corruption after an upgrade.

It understand Noah to take the position that anything less than
strict immutability would be worse than the current state, because
currently a packager can choose to keep shipping the same old
version of libicu and avoid the problem completely.

I don't buy that.  First, the only binary distribution I have heard
of that does that is EDB's Windows installer.  Both the RPM and
Debian packages don't.

And until PostgreSQL defaults to using ICU, most people will use
C library collations, and a packager cannot choose not to upgrade
the C library.

I believe the built-in CTYPE provider is a good thing and a step
forward.  But to make it a big step forward, we should be extremely
careful with any changes in major releases that might require
rebuilding indexes.
This is where I side with Noah.

Yours,
Laurenz Albe

Re: Built-in CTYPE provider

From

Jeff Davis

Date:

19 July 2024, 15:50:41

On Thu, 2024-07-18 at 16:39 -0700, Noah Misch wrote:
> I'm with a different kind of packager.  I don't tolerate index scans
> returning
> wrong answers.

I doubt that. We've all been tolerating the risk of index scans
returning wrong results in some cases.

Consider:

a. Some user creates an expression index on NORMALIZE(); vs.
b. Some user chooses the builtin "C.UTF-8" locale and creates a partial
index with a predicate like "string ~ '[[:alpha:]]'" (or an expression
index on LOWER())

Both cases create a risk if we update Unicode in some future version.
Why are you unconcerned about case (a), but highly concerned about case
(b)?

Neither seem to be a pressing problem because updating Unicode is our
choice, so we have time to reach a compromise.

> [1] Daniel Verite said packagers could patch src/Makefile.global.in
> and run
> "make -C src/common/unicode update-unicode".  Editing
> src/Makefile.global.in
> is modifying PostgreSQL, not configuring a packager-facing option.

Then go to the other thread[1] and propose that it be exposed as a
packager-facing option along with any proposed Unicode update. There
are other potential compromises possible, so I don't think this 17
subthread is the right place to discuss it, but it strikes me as a
reasonable proposal.

I sincerely think you are overcomplicating matters with version 17
procedural motions. Let the community process play out in version 18
like normal, because there's no actual problem now, I see no reason
your objections would be taken less seriously later.

Regards,
    Jeff Davis

[1]
https://www.postgresql.org/message-id/d75d2d0d1d2bd45b2c332c47e3e0a67f0640b49c.camel%40j-davis.com

Re: Built-in CTYPE provider

From

Jeff Davis

Date:

19 July 2024, 16:26:20

On Fri, 2024-07-19 at 09:44 +0200, Laurenz Albe wrote:
> But the greatness of the step depends on our readiness to be careful
> with such changes.

You and Noah have been clear on that point, which is enough to make
*me* careful with any Unicode updates in the future. I'll suggest once
more that you say so in the policy thread here:

https://www.postgresql.org/message-id/d75d2d0d1d2bd45b2c332c47e3e0a67f0640b49c.camel%40j-davis.com

which would get broader visibility and I believe provide you with
stronger assurances that *everyone* will be careful with Unicode
updates.

Regards,
    Jeff Davis

>

Re: Built-in CTYPE provider

From

Noah Misch

Date:

24 July 2024, 15:19:13

On Wed, Jul 17, 2024 at 03:03:26PM -0700, Noah Misch wrote:
> On Wed, Jul 17, 2024 at 08:48:46AM -0700, Jeff Davis wrote:
> > On Thu, 2024-07-11 at 05:50 -0700, Noah Misch wrote:
> > > > This is still marked as an open item for 17, but you've already
> > > > acknowledged[1] that no code changes are necessary in version 17.
> > > 
> > > Later posts on the thread made that obsolete.  The next step is to
> > > settle the
> > > question at https://postgr.es/m/20240706195129.fd@rfd.leadboat.com. 
> > > If that
> > > conclusion entails a remedy, v17 code changes may be part of that
> > > remedy.
> > 
> > This is the first time you've mentioned a code change in version 17. If
> 
> That's right.
> 
> > you have something in mind, please propose it. However, this feature
> > followed the right policies at the time of commit, so there would need
> > to be a strong consensus to accept such a change.
> 
> If I'm counting the votes right, you and Tom have voted that the feature's
> current state is okay, and I and Laurenz have voted that it's not okay.  I
> still hope more people will vote, to avoid dealing with the tie.  Daniel,
> Peter, and Jeremy, you're all listed as reviewers on commit f69319f.  Are you
> willing to vote one way or the other on the question in
> https://postgr.es/m/20240706195129.fd@rfd.leadboat.com?

The last vote arrived 6 days ago.  So far, we have votes from Jeff, Noah, Tom,
Daniel, and Laurenz.  I'll keep the voting open for another 24 hours from now
or 36 hours after the last vote, whichever comes last.  If that schedule is
too compressed for anyone, do share.

Re: Built-in CTYPE provider

From

Peter Eisentraut

Date:

24 July 2024, 15:27:20

On 24.07.24 17:19, Noah Misch wrote:
> On Wed, Jul 17, 2024 at 03:03:26PM -0700, Noah Misch wrote:
>> On Wed, Jul 17, 2024 at 08:48:46AM -0700, Jeff Davis wrote:
>>> On Thu, 2024-07-11 at 05:50 -0700, Noah Misch wrote:
>>>>> This is still marked as an open item for 17, but you've already
>>>>> acknowledged[1] that no code changes are necessary in version 17.
>>>>
>>>> Later posts on the thread made that obsolete.  The next step is to
>>>> settle the
>>>> question at https://postgr.es/m/20240706195129.fd@rfd.leadboat.com.
>>>> If that
>>>> conclusion entails a remedy, v17 code changes may be part of that
>>>> remedy.
>>>
>>> This is the first time you've mentioned a code change in version 17. If
>>
>> That's right.
>>
>>> you have something in mind, please propose it. However, this feature
>>> followed the right policies at the time of commit, so there would need
>>> to be a strong consensus to accept such a change.
>>
>> If I'm counting the votes right, you and Tom have voted that the feature's
>> current state is okay, and I and Laurenz have voted that it's not okay.  I
>> still hope more people will vote, to avoid dealing with the tie.  Daniel,
>> Peter, and Jeremy, you're all listed as reviewers on commit f69319f.  Are you
>> willing to vote one way or the other on the question in
>> https://postgr.es/m/20240706195129.fd@rfd.leadboat.com?
> 
> The last vote arrived 6 days ago.  So far, we have votes from Jeff, Noah, Tom,
> Daniel, and Laurenz.  I'll keep the voting open for another 24 hours from now
> or 36 hours after the last vote, whichever comes last.  If that schedule is
> too compressed for anyone, do share.

My opinion is that it is okay to release as is.

Re: Built-in CTYPE provider

From

Joe Conway

Date:

24 July 2024, 15:36:21

On 7/24/24 11:19, Noah Misch wrote:
> On Wed, Jul 17, 2024 at 03:03:26PM -0700, Noah Misch wrote:
>> On Wed, Jul 17, 2024 at 08:48:46AM -0700, Jeff Davis wrote:
>> > you have something in mind, please propose it. However, this feature
>> > followed the right policies at the time of commit, so there would need
>> > to be a strong consensus to accept such a change.
>> 
>> If I'm counting the votes right, you and Tom have voted that the feature's
>> current state is okay, and I and Laurenz have voted that it's not okay.  I
>> still hope more people will vote, to avoid dealing with the tie.  Daniel,
>> Peter, and Jeremy, you're all listed as reviewers on commit f69319f.  Are you
>> willing to vote one way or the other on the question in
>> https://postgr.es/m/20240706195129.fd@rfd.leadboat.com?
> 
> The last vote arrived 6 days ago.  So far, we have votes from Jeff, Noah, Tom,
> Daniel, and Laurenz.  I'll keep the voting open for another 24 hours from now
> or 36 hours after the last vote, whichever comes last.  If that schedule is
> too compressed for anyone, do share.


It isn't entirely clear to me exactly what we are voting on.

* If someone votes +1 (current state is ok) -- that is pretty clear.
* But if someone votes -1 (current state is not ok?) what does that mean
   in practice?
   - a revert?
   - we hold shipping 17 until we get consensus (via some plan or
     mitigation or whatever)?
   - something else?

In any case, I am a hard -1 against reverting. +0.5 on "current state is 
ok", and +1 on "current state is ok with agreement on what to do in 18"

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Built-in CTYPE provider

From

Jeremy Schneider

Date:

24 July 2024, 15:44:42

On Wed, Jul 24, 2024 at 9:27 AM Peter Eisentraut <peter@eisentraut.org> wrote:

> The last vote arrived 6 days ago. So far, we have votes from Jeff, Noah, Tom,
> Daniel, and Laurenz. I'll keep the voting open for another 24 hours from now
> or 36 hours after the last vote, whichever comes last. If that schedule is
> too compressed for anyone, do share.

My opinion is that it is okay to release as is.

Like Jeff, I don’t think counting votes or putting names on one side or another is the best way to decide things. Everyone has unique opinions and nuances, it’s not like there’s two groups that all agree together on everything and disagree with the other group. I don’t want my name put on a list this way; there are some places where I agree and some places where I disagree with most people 🙂

I don’t know the code as intimately as some others on the lists, but I’m not aware of any one-way doors that would create major difficulties for future v18+ ideas being discussed

fwiw, I don’t want to pull this feature out of v17, I think it’s okay to release it

-Jeremy

Re: Built-in CTYPE provider

From

Noah Misch

Date:

26 July 2024, 11:29:58

On Wed, Jul 24, 2024 at 08:19:13AM -0700, Noah Misch wrote:
> On Wed, Jul 17, 2024 at 03:03:26PM -0700, Noah Misch wrote:
> > vote one way or the other on the question in
> > https://postgr.es/m/20240706195129.fd@rfd.leadboat.com?
> 
> I'll keep the voting open for another 24 hours from now
> or 36 hours after the last vote, whichever comes last.

I count 4.5 or 5 votes for "okay" and 2 votes for "not okay".  I've moved the
open item to "Non-bugs".

On Wed, Jul 17, 2024 at 11:06:43PM -0700, Jeff Davis wrote:
> You haven't established that any problem actually exists in version 17,
> and your arguments have been a moving target throughout this subthread.

I can understand that experience of yours.  It wasn't my intent to make a
moving target.  To be candid, I entered the thread with no doubts that you'd
agree with the problem.  When you and Tom instead shared a different view, I
switched to pursuing the votes to recognize the problem.  (Voting then held
that pg_c_utf8 is okay as-is.)

On Thu, Jul 18, 2024 at 09:52:44AM -0700, Jeff Davis wrote:
> On Thu, 2024-07-18 at 07:00 -0700, Noah Misch wrote:
> > Given all the messages on this thread, if the feature remains in
> > PostgreSQL, I
> > advise you to be ready to tolerate PostgreSQL "routinely updating the
> > built-in
> > provider to adopt any changes that Unicode makes".
> 
> You mean messages from me, like:
> 
>   * "I have no intention force a Unicode update" [1]
>   * "While nothing needs to be changed for 17, I agree that we may need
> to be careful in future releases not to break things." [2]
>   * "...you are right that we may need to freeze Unicode updates or be
> more precise about versioning..." [2]
>   * "If you are proposing that Unicode updates should not be performed
> if they affect the results of any IMMUTABLE function...I am neither
> endorsing nor opposing..." [3]
> 
> ?

Those, plus all the other messages.

On Fri, Jul 19, 2024 at 08:50:41AM -0700, Jeff Davis wrote:
> Consider:
> 
> a. Some user creates an expression index on NORMALIZE(); vs.
> b. Some user chooses the builtin "C.UTF-8" locale and creates a partial
> index with a predicate like "string ~ '[[:alpha:]]'" (or an expression
> index on LOWER())
> 
> Both cases create a risk if we update Unicode in some future version.
> Why are you unconcerned about case (a), but highly concerned about case
> (b)?

I am not unconcerned about (a), but the v17 beta process gave an opportunity
to do something about (b) that it didn't give for (a).  Also, I have never
handled a user report involving NORMALIZE().  I have handled user reports
around regexp index inconsistency, e.g. the one at
https://www.youtube.com/watch?v=kNH94tmpUus&t=1490s