Thread: Re: Update Unicode data to Unicode 16.0.0

Re: Update Unicode data to Unicode 16.0.0

From: Joe Conway
On 11/11/24 01:27, Peter Eisentraut wrote:
> Here is the patch to update the Unicode data to version 16.0.0.
> 
> Normally, this would have been routine, but a few months ago there was
> some debate about how this should be handled. [0]  AFAICT, the consensus
> was to go ahead with it, but I just wanted to notify it here to be clear.
> 
> [0]:
> https://www.postgresql.org/message-id/flat/d75d2d0d1d2bd45b2c332c47e3e0a67f0640b49c.camel%40j-davis.com

I ran a check and found that this patch causes changes in upper casing 
of some characters. Repro:

setup
8<-------------
wget https://joeconway.com/presentations/formated-unicode.txt
initdb
psql
CREATE DATABASE builtincoll
  LOCALE_PROVIDER builtin
  BUILTIN_LOCALE 'C.UTF-8'
  TEMPLATE template0;
\c builtincoll
CREATE TABLE unsorted_table(strings text);
\copy unsorted_table from formated-unicode.txt (format csv)
VACUUM FREEZE ANALYZE unsorted_table;
8<-------------


8<-------------
-- on master
builtincoll=# WITH t AS (SELECT lower(strings) AS s FROM unsorted_table 
ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
                md5
----------------------------------
  7ec7f5c2d8729ec960942942bb82aedd
(1 row)

builtincoll=# WITH t AS (SELECT upper(strings) AS s FROM unsorted_table 
ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
                md5
----------------------------------
  97f83a4d1937aa65bcf8be134bf7b0c4
(1 row)

builtincoll=# WITH t AS (SELECT initcap(strings) AS s FROM 
unsorted_table ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
                md5
----------------------------------
  8cf65a43affc221f3a20645ef402085e
(1 row)
8<-------------


8<-------------
-- master+patch
builtincoll=# WITH t AS (SELECT lower(strings) AS s FROM unsorted_table 
ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
                md5
----------------------------------
  7ec7f5c2d8729ec960942942bb82aedd
(1 row)

Time: 19858.981 ms (00:19.859)
builtincoll=# WITH t AS (SELECT upper(strings) AS s FROM unsorted_table
ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
                md5
----------------------------------
  3055b3d5dff76c8c1250ef500c6ec13f
(1 row)

Time: 19774.467 ms (00:19.774)
builtincoll=# WITH t AS (SELECT initcap(strings) AS s FROM 
unsorted_table ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
                md5
----------------------------------
  9985acddf7902ea603897cdaccd02114
(1 row)
8<-------------

So with the patch applied, both upper() and initcap() produce different
results, while lower() is unchanged, unless I am missing something.
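
The md5 comparison above shows that something changed, but not which characters. One way to narrow it down, sketched here in Python, is to diff the simple uppercase mapping (field 12 of UnicodeData.txt) between the two Unicode versions. This assumes the simple mappings in UnicodeData.txt are what drive the difference. The function names and the sample rows are illustrative: the U+E000 row is made up and is not real Unicode data.

```python
def parse_simple_upper(lines):
    """Map code point -> simple uppercase code point from UnicodeData.txt rows."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(";")
        if len(fields) < 15:
            continue  # not a complete UnicodeData.txt row
        upper = fields[12]  # Simple_Uppercase_Mapping; empty when there is none
        if upper:
            mapping[int(fields[0], 16)] = int(upper, 16)
    return mapping


def diff_upper(old, new):
    """Return (code point, old mapping, new mapping) for every entry that differs."""
    changed = []
    for cp in sorted(old.keys() | new.keys()):
        before, after = old.get(cp), new.get(cp)
        if before != after:
            changed.append((cp, before, after))
    return changed


# Illustrative rows only. The U+E000 entries are hypothetical.
OLD_ROWS = [
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
    "E000;HYPOTHETICAL LETTER;Ll;0;L;;;;;N;;;;;",
]
NEW_ROWS = [
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
    "E000;HYPOTHETICAL LETTER;Ll;0;L;;;;;N;;;E001;;E001",
]

for cp, before, after in diff_upper(parse_simple_upper(OLD_ROWS),
                                    parse_simple_upper(NEW_ROWS)):
    fmt = lambda v: "none" if v is None else f"U+{v:04X}"
    print(f"U+{cp:04X}: {fmt(before)} -> {fmt(after)}")
```

Against real data, the two row lists would be read from the UnicodeData.txt files of the old and new Unicode versions instead of the inline samples.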

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Update Unicode data to Unicode 16.0.0

From: Laurenz Albe
On Mon, 2024-11-11 at 14:52 -0500, Joe Conway wrote:
> On 11/11/24 01:27, Peter Eisentraut wrote:
> > Here is the patch to update the Unicode data to version 16.0.0.
> >
> > Normally, this would have been routine, but a few months ago there was
> > some debate about how this should be handled. [0]  AFAICT, the consensus
> > was to go ahead with it, but I just wanted to notify it here to be clear.
> >
> > [0]:
> > https://www.postgresql.org/message-id/flat/d75d2d0d1d2bd45b2c332c47e3e0a67f0640b49c.camel%40j-davis.com
>
> I ran a check and found that this patch causes changes in upper casing
> of some characters.

I want to reiterate what I said in the above thread:
If that means that indexes on strings using the "builtin" collation
provider need to be reindexed after an upgrade, I am very much against it.

From my experience in the field, I consider this need to rebuild indexes
one of the greatest current problems for the usability of PostgreSQL.
I dare say that most people would prefer to live with an outdated Unicode
version.
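
To make the concern concrete, here is a minimal Python sketch of how a sorted structure keyed on a case-mapped value goes stale when the mapping changes. It assumes an index built on an expression such as upper(col). The two "versions" of the mapping are invented for illustration (version 1 has no uppercase for "x", version 2 maps it to "X") and do not correspond to real Unicode changes.

```python
import bisect

# Two hypothetical versions of an uppercase mapping.
V1 = str.maketrans("abc", "ABC")
V2 = str.maketrans("abcx", "ABCX")


def upper_v1(s):
    return s.translate(V1)


def upper_v2(s):
    return s.translate(V2)


def build_index(rows, key):
    """Return (key, row) pairs sorted by key, like an expression index."""
    return sorted((key(row), row) for row in rows)


def index_lookup(index, probe_key):
    """Binary-search the index and return every row stored under probe_key."""
    i = bisect.bisect_left(index, (probe_key,))
    hits = []
    while i < len(index) and index[i][0] == probe_key:
        hits.append(index[i][1])
        i += 1
    return hits


rows = ["ax", "b", "cx"]
stale = build_index(rows, upper_v1)   # built before the "upgrade"
fresh = build_index(rows, upper_v2)   # rebuilt after it

probe = upper_v2("ax")                # queries now compute keys with version 2
print(index_lookup(stale, probe))     # the stored key no longer matches
print(index_lookup(fresh, probe))     # found again after the rebuild
```

The row is still in the stale structure, but it is stored under the key the old mapping produced, so a probe computed with the new mapping misses it until the structure is rebuilt.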

Yours,
Laurenz Albe



Re: Update Unicode data to Unicode 16.0.0

From: Peter Eisentraut
On 12.11.24 10:40, Laurenz Albe wrote:
> On Mon, 2024-11-11 at 14:52 -0500, Joe Conway wrote:
>> On 11/11/24 01:27, Peter Eisentraut wrote:
>>> Here is the patch to update the Unicode data to version 16.0.0.
>>>
>>> Normally, this would have been routine, but a few months ago there was
>>> some debate about how this should be handled. [0]  AFAICT, the consensus
>>> was to go ahead with it, but I just wanted to notify it here to be clear.
>>>
>>> [0]:
>>> https://www.postgresql.org/message-id/flat/d75d2d0d1d2bd45b2c332c47e3e0a67f0640b49c.camel%40j-davis.com
>>
>> I ran a check and found that this patch causes changes in upper casing
>> of some characters.
> 
> I want to reiterate what I said in the above thread:
> If that means that indexes on strings using the "builtin" collation
> provider need to be reindexed after an upgrade, I am very much against it.

The practice of regularly updating the Unicode files is older than the 
builtin collation provider.  It is similar to updating the time zone 
files, the encoding conversion files, the snowball files, etc.  We need 
to move all of these things forward to keep up with the aspects of the 
real world that this data reflects.  New features are required to live 
in that environment.  If a new feature were proposed that would then 
require us to stop updating any of these files, we would likely not 
accept that, or at least need a very deliberate discussion about that 
before the feature is introduced.  This was not done here at all.  If 
this new feature has this hidden requirement, then that feature is not 
complete yet, and work should probably continue to make that feature 
complete.  But that can't hold progress in other areas hostage.




Re: Update Unicode data to Unicode 16.0.0

From: Jeff Davis
On Tue, 2024-11-12 at 10:40 +0100, Laurenz Albe wrote:
> I want to reiterate what I said in the above thread:
> If that means that indexes on strings using the "builtin" collation
> provider need to be reindexed after an upgrade, I am very much
> against it.

How would you feel if there was a better way to "lock down" the
behavior using an extension?

I have a patchset here:

https://www.postgresql.org/message-id/78a1b434ff40510dc5aaabe986299a09f4da90cf.camel%40j-davis.com

that changes the implementation of collation and ctype to use method
tables rather than branching, and it also introduces some hooks that
can be used to replace the method tables with whatever you want.
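
For readers unfamiliar with the pattern, the idea of method tables plus a replacement hook can be sketched roughly as follows. This is a Python illustration of the general technique only, not the patchset's actual C interface. Every name here (CtypeMethods, install_ctype_hook, get_ctype_methods, pin_builtin) is hypothetical, and Python's str methods stand in for the server's implementations.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class CtypeMethods:
    """Hypothetical method table: one entry per case-mapping operation."""
    upper: Callable[[str], str]
    lower: Callable[[str], str]


# Default tables keyed by provider name, instead of branching on the
# provider at every call site.
_PROVIDERS: Dict[str, CtypeMethods] = {
    "builtin": CtypeMethods(upper=str.upper, lower=str.lower),
}

# Hypothetical hook slot. An extension may install a function that receives
# the provider name and the default table and returns the table to use.
Hook = Callable[[str, CtypeMethods], CtypeMethods]
_hook: Dict[str, Optional[Hook]] = {"ctype": None}


def install_ctype_hook(hook: Optional[Hook]) -> None:
    _hook["ctype"] = hook


def get_ctype_methods(provider: str) -> CtypeMethods:
    methods = _PROVIDERS[provider]
    hook = _hook["ctype"]
    return hook(provider, methods) if hook is not None else methods


# An "extension" that pins upper() to a frozen table of its own.
PINNED_UPPER = {"a": "A", "b": "B"}


def pinned_upper(s: str) -> str:
    return "".join(PINNED_UPPER.get(ch, ch) for ch in s)


def pin_builtin(provider: str, default: CtypeMethods) -> CtypeMethods:
    if provider != "builtin":
        return default
    return CtypeMethods(upper=pinned_upper, lower=default.lower)


print(get_ctype_methods("builtin").upper("abc"))  # default table
install_ctype_hook(pin_builtin)
print(get_ctype_methods("builtin").upper("abc"))  # pinned table from the hook
```

Callers only ever go through the table, so swapping the table changes behavior everywhere without touching the call sites.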

Regards,
    Jeff Davis




Re: Update Unicode data to Unicode 16.0.0

From: Laurenz Albe
On Tue, 2024-11-19 at 13:42 -0800, Jeff Davis wrote:
> On Tue, 2024-11-12 at 10:40 +0100, Laurenz Albe wrote:
> > I want to reiterate what I said in the above thread:
> > If that means that indexes on strings using the "builtin" collation
> > provider need to be reindexed after an upgrade, I am very much
> > against it.
>
> How would you feel if there was a better way to "lock down" the
> behavior using an extension?

Better.

> I have a patchset here:
>
> https://www.postgresql.org/message-id/78a1b434ff40510dc5aaabe986299a09f4da90cf.camel%40j-davis.com
>
> that changes the implementation of collation and ctype to use method
> tables rather than branching, and it also introduces some hooks that
> can be used to replace the method tables with whatever you want.

That looks like a nice idea, since it obviates the need to build
PostgreSQL yourself if you want to use a non-standard copy of - say -
the ICU library.  You still have to build your own ICU library, though.

I had hoped that the builtin provider would remove the need to REINDEX,
but I have given up that hope.  Peter's argument is sound from a
conceptual point of view, even though I doubt that the average user
will be able to appreciate it.

Yours,
Laurenz Albe



Re: Update Unicode data to Unicode 16.0.0

From: Jeff Davis
On Wed, 2024-11-20 at 06:41 +0100, Laurenz Albe wrote:
> That looks like a nice idea, since it obviates the need to build
> PostgreSQL yourself if you want to use a non-standard copy of - say -
> the ICU library.  You still have to build your own ICU library,
> though.

It would work with the builtin provider, too, which would not require
ICU at all.

The idea is that you could build an extension that copies the same
logic for building the Unicode tables that we have in Postgres now,
except that it uses whatever version of the Unicode data files you
want.

If we want it to be targeted more specifically at the builtin provider,
we can make it even simpler by allowing you to just replace the Unicode
tables with an extension (rather than the method tables). I'm not 100%
sure what people actually want here, so I'm open to suggestions.

> I had hoped that the builtin provider would remove the need to
> REINDEX,
> but I have given up that hope.  Peter's argument is sound from a
> conceptual point of view, even though I doubt that the average user
> will be able to appreciate it.

I'd like to provide options for all kinds of users and packagers.

Regards,
    Jeff Davis