Re: GB18030-2022 Support in PostgreSQL - Mailing list pgsql-hackers

From Chao Li
Subject Re: GB18030-2022 Support in PostgreSQL
Msg-id CAEoWx2mvqeC0Qmcf5UqYhG1OWe5Mjie15nD-0owNr+4zQF6eTA@mail.gmail.com
In response to Re: GB18030-2022 Support in PostgreSQL  (John Naylor <johncnaylorls@gmail.com>)
List pgsql-hackers
I did some more research into what changed in GB18030-2022 relative to GB18030-2000. Here is a summary:

* 66 new characters were added in 2022, all of them 4-byte characters. The map files store only the 2-byte GB code mappings, and the 4-byte GB code mappings are calculated, so these characters can already be encoded and decoded correctly without this patch. I tested that.
* 9 characters are no longer required by the 2022 standard, but an application may decide whether or not to retain them. As the ucm file (https://github.com/unicode-org/icu/blob/main/icu4c/source/data/mappings/gb18030-2022.ucm) retains them, we retain them as well.
* The Unicode mappings for 18 characters have changed. Only these changes cause backward-compatibility issues. However, half of them are rarely used punctuation marks, and the rest are glyphs that I, as a native Chinese speaker, cannot recognize. So these changes should not significantly impact most existing databases.
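
To illustrate what "calculated" means for 4-byte codes, here is a minimal Python sketch of the arithmetic for the supplementary-plane range only (U+10000 to U+10FFFF), where the rule is purely linear. It is not PostgreSQL's conversion code, and the function and constant names are made up for this example. It relies only on the byte layout of 4-byte GB18030 sequences (bytes 1 and 3 in 0x81..0xFE, bytes 2 and 4 in 0x30..0x39) and on the fact that U+10000 is encoded as 0x90 0x30 0x81 0x30.

```python
# Sketch only: hypothetical helper names, not taken from the PostgreSQL sources.

FIRST_SUPPLEMENTARY = 0x10000
LAST_SUPPLEMENTARY = 0x10FFFF
# Linear index of the sequence 0x90 0x30 0x81 0x30, which encodes U+10000.
SUPPLEMENTARY_BASE = 189000


def bytes_to_linear(seq: bytes) -> int:
    """Turn a 4-byte GB18030 sequence into its linear index."""
    if len(seq) != 4:
        raise ValueError("expected exactly four bytes")
    b1, b2, b3, b4 = seq
    if not (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39):
        raise ValueError("not a well-formed 4-byte GB18030 sequence")
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


def linear_to_bytes(linear: int) -> bytes:
    """Inverse of bytes_to_linear."""
    linear, d4 = divmod(linear, 10)
    linear, d3 = divmod(linear, 126)
    d1, d2 = divmod(linear, 10)
    return bytes([0x81 + d1, 0x30 + d2, 0x81 + d3, 0x30 + d4])


def encode_supplementary(code_point: int) -> bytes:
    """Encode a supplementary-plane code point without any lookup table."""
    if not FIRST_SUPPLEMENTARY <= code_point <= LAST_SUPPLEMENTARY:
        raise ValueError("only U+10000..U+10FFFF is handled by this sketch")
    return linear_to_bytes(SUPPLEMENTARY_BASE + code_point - FIRST_SUPPLEMENTARY)


def decode_supplementary(seq: bytes) -> int:
    """Decode a 4-byte sequence that falls in the supplementary-plane range."""
    offset = bytes_to_linear(seq) - SUPPLEMENTARY_BASE
    if not 0 <= offset <= LAST_SUPPLEMENTARY - FIRST_SUPPLEMENTARY:
        raise ValueError("sequence is outside the supplementary-plane range")
    return FIRST_SUPPLEMENTARY + offset
```

Because the mapping in this range is pure arithmetic, a character assigned there by a newer Unicode version needs no new table entry. 4-byte codes that map into the BMP are not covered by this sketch.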

I added a test case that uses one of the characters whose mapping changed, and the test passes:

% make check
...
# All 229 tests passed.

For more details on the standard change, see https://ken-lunde.medium.com/the-gb-18030-2022-standard-3d0ebaeb4132

I am attaching the patch file.

Chao Li (Evan)
---------------------
Highgo Software Co., Ltd.
https://www.highgo.com/


John Naylor <johncnaylorls@gmail.com> wrote on Tuesday, August 5, 2025 at 18:25:
On Tue, Aug 5, 2025 at 1:22 PM Chao Li <li.evan.chao@gmail.com> wrote:
>
> On August 4, 2025, at 21:51, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> So on the whole I'd lean a bit towards just redefining GB18030 as
> meaning the new standard.  The fact that we don't support it as a
> server-side encoding perhaps makes that idea more tenable than it
> would be if the encoding governed the interpretation of our own
> stored data.

> I agree with Tom that we may just redefine GB18030 to comply with the 2022 standard.
>
> As John Naylor pointed out, 2022 is not backward compatible, which is true. However, I went through all the incompatible changes, and they all involve rarely used characters.

If that's the case, then redefining is probably okay.

> One use case I am thinking of is, say, a database that uses the default encoding (UTF-8) and the ICU locale provider, as ICU has supported GB18030-2022 since version 73.1.

ICU locales can only be used with server-side encodings.

> When the new version is released, if some third-party migration tools are known to work well, the release notes may recommend them.

I highly doubt such a large hammer will be necessary. Whatever advice
we give for discovery and conversion of affected text is our
responsibility and can be in the form of example queries.
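
As a rough illustration of what "discovery of affected text" could look like outside the database, here is a minimal Python sketch that reports where a string contains any code point from a caller-supplied set. The function name is hypothetical, and the set of affected code points is deliberately left to the caller: it has to be filled in from the standard's list of changed mappings, which is not reproduced in this thread.

```python
# Sketch only: the set of changed code points is an input, not something
# this example knows. The caller must take it from the GB18030-2022 change list.

from typing import Iterable, List, Tuple


def find_changed_characters(
    text: str, changed_code_points: Iterable[int]
) -> List[Tuple[int, str]]:
    """Return (position, character) pairs for characters in the given set."""
    changed = set(changed_code_points)
    return [(pos, ch) for pos, ch in enumerate(text) if ord(ch) in changed]


def needs_review(text: str, changed_code_points: Iterable[int]) -> bool:
    """True if the text contains at least one character from the set."""
    return bool(find_changed_characters(text, changed_code_points))
```

For example, with a placeholder set such as `{0xE000}` (an arbitrary private-use code point chosen only for illustration), `find_changed_characters("a\ue000b", {0xE000})` reports the character at position 1.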

--
John Naylor
Amazon Web Services
