Thread: Re: Update Unicode data to Unicode 16.0.0
On 11/11/24 01:27, Peter Eisentraut wrote:
> Here is the patch to update the Unicode data to version 16.0.0.
>
> Normally, this would have been routine, but a few months ago there was
> some debate about how this should be handled. [0] AFAICT, the consensus
> was to go ahead with it, but I just wanted to notify it here to be clear.
>
> [0]:
> https://www.postgresql.org/message-id/flat/d75d2d0d1d2bd45b2c332c47e3e0a67f0640b49c.camel%40j-davis.com

I ran a check and found that this patch causes changes in upper casing
of some characters.

Repro:

setup
8<-------------
wget https://joeconway.com/presentations/formated-unicode.txt
initdb
psql
CREATE DATABASE builtincoll LOCALE_PROVIDER builtin BUILTIN_LOCALE 'C.UTF-8' TEMPLATE template0;
\c builtincoll
CREATE TABLE unsorted_table(strings text);
\copy unsorted_table from formated-unicode.txt (format csv)
VACUUM FREEZE ANALYZE unsorted_table;
8<-------------

8<-------------
-- on master
builtincoll=# WITH t AS (SELECT lower(strings) AS s FROM unsorted_table ORDER BY 1) SELECT md5(string_agg(t.s,NULL)) FROM t;
               md5
----------------------------------
 7ec7f5c2d8729ec960942942bb82aedd
(1 row)

builtincoll=# WITH t AS (SELECT upper(strings) AS s FROM unsorted_table ORDER BY 1) SELECT md5(string_agg(t.s,NULL)) FROM t;
               md5
----------------------------------
 97f83a4d1937aa65bcf8be134bf7b0c4
(1 row)

builtincoll=# WITH t AS (SELECT initcap(strings) AS s FROM unsorted_table ORDER BY 1) SELECT md5(string_agg(t.s,NULL)) FROM t;
               md5
----------------------------------
 8cf65a43affc221f3a20645ef402085e
(1 row)
8<-------------

8<-------------
-- master+patch
builtincoll=# WITH t AS (SELECT lower(strings) AS s FROM unsorted_table ORDER BY 1) SELECT md5(string_agg(t.s,NULL)) FROM t;
               md5
----------------------------------
 7ec7f5c2d8729ec960942942bb82aedd
(1 row)

Time: 19858.981 ms (00:19.859)

builtincoll=# WITH t AS (SELECT upper(strings) AS s FROM unsorted_table ORDER BY 1)SELECT md5(string_agg(t.s,NULL)) FROM t;
               md5
----------------------------------
 3055b3d5dff76c8c1250ef500c6ec13f
(1 row)

Time: 19774.467 ms (00:19.774)

builtincoll=# WITH t AS (SELECT initcap(strings) AS s FROM unsorted_table ORDER BY 1) SELECT md5(string_agg(t.s,NULL)) FROM t;
               md5
----------------------------------
 9985acddf7902ea603897cdaccd02114
(1 row)
8<-------------

So both UPPER and INITCAP produce different results unless I am missing
something.

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Mon, 2024-11-11 at 14:52 -0500, Joe Conway wrote:
> On 11/11/24 01:27, Peter Eisentraut wrote:
> > Here is the patch to update the Unicode data to version 16.0.0.
> >
> > Normally, this would have been routine, but a few months ago there was
> > some debate about how this should be handled. [0] AFAICT, the consensus
> > was to go ahead with it, but I just wanted to notify it here to be clear.
> >
> > [0]:
> > https://www.postgresql.org/message-id/flat/d75d2d0d1d2bd45b2c332c47e3e0a67f0640b49c.camel%40j-davis.com
>
> I ran a check and found that this patch causes changes in upper casing
> of some characters.

I want to reiterate what I said in the above thread:
If that means that indexes on strings using the "builtin" collation
provider need to be reindexed after an upgrade, I am very much against it.

From my experiences in the field, I consider this need to rebuild indexes
one of the greatest current problems for the usability of PostgreSQL.
I dare say that most people would prefer living with an outdated Unicode
version.

Yours,
Laurenz Albe
On 12.11.24 10:40, Laurenz Albe wrote:
> On Mon, 2024-11-11 at 14:52 -0500, Joe Conway wrote:
>> On 11/11/24 01:27, Peter Eisentraut wrote:
>>> Here is the patch to update the Unicode data to version 16.0.0.
>>>
>>> Normally, this would have been routine, but a few months ago there was
>>> some debate about how this should be handled. [0] AFAICT, the consensus
>>> was to go ahead with it, but I just wanted to notify it here to be clear.
>>>
>>> [0]:
>>> https://www.postgresql.org/message-id/flat/d75d2d0d1d2bd45b2c332c47e3e0a67f0640b49c.camel%40j-davis.com
>>
>> I ran a check and found that this patch causes changes in upper casing
>> of some characters.
>
> I want to reiterate what I said in the above thread:
> If that means that indexes on strings using the "builtin" collation
> provider need to be reindexed after an upgrade, I am very much against it.

The practice of regularly updating the Unicode files is older than the
builtin collation provider. It is similar to updating the time zone
files, the encoding conversion files, the snowball files, etc. We need
to move all of these things forward to keep up with the aspects of the
real world that this data reflects.

New features are required to live in that environment. If a new feature
were proposed that would then require us to stop updating any of these
files, we would likely not accept that, or at least need a very
deliberate discussion about that before the feature is introduced. This
was not done here at all. If this new feature has this hidden
requirement, then that feature is not complete yet, and work should
probably continue to make that feature complete. But that can't take
progress in other areas hostage.
On Tue, 2024-11-12 at 10:40 +0100, Laurenz Albe wrote:
> I want to reiterate what I said in the above thread:
> If that means that indexes on strings using the "builtin" collation
> provider need to be reindexed after an upgrade, I am very much
> against it.

How would you feel if there was a better way to "lock down" the
behavior using an extension?

I have a patchset here:

https://www.postgresql.org/message-id/78a1b434ff40510dc5aaabe986299a09f4da90cf.camel%40j-davis.com

that changes the implementation of collation and ctype to use method
tables rather than branching, and it also introduces some hooks that
can be used to replace the method tables with whatever you want.

Regards,
Jeff Davis
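[Illustration] To make the "method tables rather than branching" idea concrete, here is a minimal sketch of the general technique. It is not taken from the patchset: the struct `CtypeMethods`, the hook variable `ctype_methods_hook`, and every function name below are hypothetical, and the default implementation is deliberately ASCII-only to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical method table: one function pointer per ctype operation. */
typedef struct CtypeMethods
{
    uint32_t (*to_upper) (uint32_t cp);
    uint32_t (*to_lower) (uint32_t cp);
} CtypeMethods;

/* Toy default implementation (ASCII only, an assumption for brevity). */
static uint32_t
ascii_upper(uint32_t cp)
{
    return (cp >= 'a' && cp <= 'z') ? cp - 32 : cp;
}

static uint32_t
ascii_lower(uint32_t cp)
{
    return (cp >= 'A' && cp <= 'Z') ? cp + 32 : cp;
}

static const CtypeMethods default_methods = {ascii_upper, ascii_lower};

/* Hypothetical hook: an extension may return its own method table. */
typedef const CtypeMethods *(*ctype_methods_hook_type) (const char *locale);
static ctype_methods_hook_type ctype_methods_hook = NULL;

/* Resolve the table once; callers never branch on the provider. */
static const CtypeMethods *
get_ctype_methods(const char *locale)
{
    if (ctype_methods_hook != NULL)
    {
        const CtypeMethods *m = ctype_methods_hook(locale);

        if (m != NULL)
            return m;
    }
    return &default_methods;
}

/* Toy "extension" that pins its own behavior: also maps U+00E9 to U+00C9. */
static uint32_t
pinned_upper(uint32_t cp)
{
    return (cp == 0x00E9) ? 0x00C9 : ascii_upper(cp);
}

static const CtypeMethods pinned_methods = {pinned_upper, ascii_lower};

static const CtypeMethods *
pinned_hook(const char *locale)
{
    (void) locale;              /* this toy hook ignores the locale name */
    return &pinned_methods;
}

/* What an extension's load function might do (hypothetical). */
void
install_pinned_hook(void)
{
    ctype_methods_hook = pinned_hook;
}

/* Caller side: dispatch through whichever table is active. */
uint32_t
upper_codepoint(const char *locale, uint32_t cp)
{
    const CtypeMethods *m = get_ctype_methods(locale);

    assert(m != NULL && m->to_upper != NULL);
    return m->to_upper(cp);
}
```

Before `install_pinned_hook()` is called, `upper_codepoint("C.UTF-8", 0x00E9)` returns its input unchanged; afterwards it returns 0x00C9. The calling code is identical in both cases, which is the property that lets an extension pin behavior without patching the core.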
On Tue, 2024-11-19 at 13:42 -0800, Jeff Davis wrote:
> On Tue, 2024-11-12 at 10:40 +0100, Laurenz Albe wrote:
> > I want to reiterate what I said in the above thread:
> > If that means that indexes on strings using the "builtin" collation
> > provider need to be reindexed after an upgrade, I am very much
> > against it.
>
> How would you feel if there was a better way to "lock down" the
> behavior using an extension?

Better.

> I have a patchset here:
>
> https://www.postgresql.org/message-id/78a1b434ff40510dc5aaabe986299a09f4da90cf.camel%40j-davis.com
>
> that changes the implementation of collation and ctype to use method
> tables rather than branching, and it also introduces some hooks that
> can be used to replace the method tables with whatever you want.

That looks like a nice idea, since it obviates the need to build
PostgreSQL yourself if you want to use a non-standard copy of - say -
the ICU library. You still have to build your own ICU library, though.

I had hoped that the builtin provider would remove the need to REINDEX,
but I have given up that hope. Peter's argument is sound from a
conceptual point of view, even though I doubt that the average user
will be able to appreciate it.

Yours,
Laurenz Albe
On Wed, 2024-11-20 at 06:41 +0100, Laurenz Albe wrote:
> That looks like a nice idea, since it obviates the need to build
> PostgreSQL yourself if you want to use a non-standard copy of - say -
> the ICU library. You still have to build your own ICU library,
> though.

It would work with the builtin provider, too, which would not require
ICU at all. The idea is that you could build an extension that copies
the same logic for building the Unicode tables that we have in Postgres
now, except that it uses whatever version of the Unicode data files you
want.

If we want it to be targeted more specifically at the builtin provider,
we can make it even simpler by allowing you to just replace the unicode
tables with an extension (rather than the method tables). I'm not 100%
sure what people actually want here, so I'm open to suggestion.

> I had hoped that the builtin provider would remove the need to
> REINDEX,
> but I have given up that hope. Peter's argument is sound from a
> conceptual point of view, even though I doubt that the average user
> will be able to appreciate it.

I'd like to provide options for all kinds of users and packagers.

Regards,
Jeff Davis
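[Illustration] The simpler variant, replacing only the Unicode tables rather than the method tables, could look roughly like the sketch below. Everything here is an assumption for illustration: the types `CaseMapEntry` and `CaseMapTable`, the function `set_case_table()`, and the table contents are hypothetical toy data, not copied from any Unicode release or from the PostgreSQL sources. The lookup logic stays fixed and only the data it searches is swapped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry: a code point and its simple uppercase mapping. */
typedef struct CaseMapEntry
{
    uint32_t    codepoint;
    uint32_t    upper;
} CaseMapEntry;

/* Hypothetical table descriptor; entries must be sorted by codepoint. */
typedef struct CaseMapTable
{
    const char *label;
    const CaseMapEntry *entries;
    size_t      nentries;
} CaseMapTable;

/* Toy data standing in for tables generated from two data-file versions. */
static const CaseMapEntry toy_a_entries[] = {
    {0x0061, 0x0041},
    {0x00E9, 0x00C9},
};

static const CaseMapEntry toy_b_entries[] = {
    {0x0061, 0x0041},
    {0x00E9, 0x00C9},
    {0x019B, 0xA7DC},           /* illustrative extra entry */
};

static const CaseMapTable toy_table_a = {"toy-A", toy_a_entries, 2};
static const CaseMapTable toy_table_b = {"toy-B", toy_b_entries, 3};

/* The table the lookup currently uses; an extension could repoint it. */
static const CaseMapTable *active_table = &toy_table_a;

void
set_case_table(const CaseMapTable *table)
{
    assert(table != NULL && table->entries != NULL);
    active_table = table;
}

/* Binary search over the active table; unmapped code points map to themselves. */
uint32_t
table_upper(uint32_t cp)
{
    size_t      lo = 0;
    size_t      hi = active_table->nentries;

    while (lo < hi)
    {
        size_t      mid = lo + (hi - lo) / 2;
        uint32_t    key = active_table->entries[mid].codepoint;

        if (key == cp)
            return active_table->entries[mid].upper;
        if (key < cp)
            lo = mid + 1;
        else
            hi = mid;
    }
    return cp;
}
```

With `toy_table_a` active, `table_upper(0x019B)` returns its input; after `set_case_table(&toy_table_b)` it returns 0xA7DC. That is the same kind of difference that would change `upper()` output between two builds, and keeping the old table active is what would avoid it.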