Re: GB18030-2022 Support in PostgreSQL - Mailing list pgsql-hackers

From Chao Li
Subject Re: GB18030-2022 Support in PostgreSQL
Msg-id 2F92C344-A707-44D0-A718-670E30B2C1DF@gmail.com
In response to Re: GB18030-2022 Support in PostgreSQL  (John Naylor <johncnaylorls@gmail.com>)
Responses Re: GB18030-2022 Support in PostgreSQL
List pgsql-hackers
Hi John,

Thanks for your review.

Yes, I diffed 2000.ucm against 2022.ucm while working on the patch. The differences are quite small (comment lines omitted):

```diff
> <U20AC> \x80 |3
> <U3000> \xA3\xA0 |3
> <UE5E5> \xA3\xA0 |4
>
28067a28099,28114
> <U9FB4> \xFE\x59 |0
> <U9FB4> \x82\x35\x90\x37 |3
> <U9FB5> \xFE\x61 |0
> <U9FB5> \x82\x35\x90\x38 |3
> <U9FB6> \xFE\x66 |0
> <U9FB6> \x82\x35\x90\x39 |3
> <U9FB7> \xFE\x67 |0
> <U9FB7> \x82\x35\x91\x30 |3
> <U9FB8> \xFE\x6D |0
> <U9FB8> \x82\x35\x91\x31 |3
> <U9FB9> \xFE\x7E |0
> <U9FB9> \x82\x35\x91\x32 |3
> <U9FBA> \xFE\x90 |0
> <U9FBA> \x82\x35\x91\x33 |3
> <U9FBB> \xFE\xA0 |0
> <U9FBB> \x82\x35\x91\x34 |3
29577c29624
< <UE5E5> \xA3\xA0 |0
---
> # <UE5E5> \xA3\xA0 |0
30001,30010c30048,30057
< <UE78D> \xA6\xD9 |0
< <UE78E> \xA6\xDA |0
< <UE78F> \xA6\xDB |0
< <UE790> \xA6\xDC |0
< <UE791> \xA6\xDD |0
< <UE792> \xA6\xDE |0
< <UE793> \xA6\xDF |0
< <UE794> \xA6\xEC |0
< <UE795> \xA6\xED |0
< <UE796> \xA6\xF3 |0
---
> <UE78D> \xA6\xD9 |1
> <UE78E> \xA6\xDA |1
> <UE78F> \xA6\xDB |1
> <UE790> \xA6\xDC |1
> <UE791> \xA6\xDD |1
> <UE792> \xA6\xDE |1
> <UE793> \xA6\xDF |1
> <UE794> \xA6\xEC |1
> <UE795> \xA6\xED |1
> <UE796> \xA6\xF3 |1
30146c30193
< <UE81E> \xFE\x59 |0
---
> <UE81E> \xFE\x59 |1
30154c30201
< <UE826> \xFE\x61 |0
---
> <UE826> \xFE\x61 |1
30159,30160c30206,30207
< <UE82B> \xFE\x66 |0
< <UE82C> \xFE\x67 |0
---
> <UE82B> \xFE\x66 |1
> <UE82C> \xFE\x67 |1
30166c30213
< <UE832> \xFE\x6D |0
---
> <UE832> \xFE\x6D |1
30183c30230
< <UE843> \xFE\x7E |0
---
> <UE843> \xFE\x7E |1
30200c30247
< <UE854> \xFE\x90 |0
---
> <UE854> \xFE\x90 |1
30216c30263
< <UE864> \xFE\xA0 |0
---
> <UE864> \xFE\xA0 |1
30470a30518,30537
> <UFE10> \xA6\xD9 |0
> <UFE10> \x84\x31\x82\x36 |3
> <UFE11> \xA6\xDB |0
> <UFE11> \x84\x31\x82\x37 |3
> <UFE12> \xA6\xDA |0
> <UFE12> \x84\x31\x82\x38 |3
> <UFE13> \xA6\xDC |0
> <UFE13> \x84\x31\x82\x39 |3
> <UFE14> \xA6\xDD |0
> <UFE14> \x84\x31\x83\x30 |3
> <UFE15> \xA6\xDE |0
> <UFE15> \x84\x31\x83\x31 |3
> <UFE16> \xA6\xDF |0
> <UFE16> \x84\x31\x83\x32 |3
> <UFE17> \xA6\xEC |0
> <UFE17> \x84\x31\x83\x33 |3
> <UFE18> \xA6\xED |0
> <UFE18> \x84\x31\x83\x34 |3
> <UFE19> \xA6\xF3 |0
> <UFE19> \x84\x31\x83\x35 |3
```

As you can see, the changes cover only the 18 changed characters plus 3 other Unicode code points (U+20AC, U+3000, U+E5E5). My code comment in UCS_to_GB18030.pl explains these changes:

```perl
# The |n is a flag, where n has values of 0, 1, 3, 4.
# With reference to https://ken-lunde.medium.com/the-gb-18030-2022-standard-3d0ebaeb4132,
# the flag should mean the following:
# 0 - round-trip mapping
# 1 - there are 18 mappings with flag 1; those are the mapping changes
#     from GB18030-2000 to GB18030-2022. Old mappings are marked
#     with flag 1, new mappings with flag 0. So we can ignore all
#     mappings with flag 1.
# 3 - there are 20 mappings with flag 3:
#     18 of them correspond to the 18 mappings with flag 1, but give
#     the new GB18030-2022 mapping for the old mapping's Unicode
#     code point. These 18 new mappings have no actual glyphs in
#     GB18030-2022. So we can ignore these 18 mappings with flag 3.
#     The other 2 are: "<U20AC> \x80 |3" and "<U3000> \xA3\xA0 |3".
#     They are two reserved fallbacks for compatibility with GBK and
#     other web data as in WHATWG. Both U20AC and U3000 have
#     round-trip mappings in GB18030-2022, so we can ignore these
#     two mappings with flag 3.
#     So, we can ignore all mappings with flag 3.
# 4 - there is only one mapping with flag 4: <UE5E5> \xA3\xA0 |4.
#     This is a "good one-way" mapping from U+E5E5 to \xA3\xA0
#     for maximum compatibility with previous behavior. So we can
#     ignore this mapping as well.
```
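To make the flag handling concrete, here is a minimal sketch of the filtering described in the comment: keep only the flag 0 (round-trip) lines and skip comment lines and flags 1, 3 and 4. It is written in Python for illustration only and is not the code in UCS_to_GB18030.pl; the names `UCM_LINE`, `parse_ucm` and `round_trip_only` are hypothetical, and the sample lines are copied from the diff above.

```python
import re
from collections import Counter

# One .ucm mapping line looks like: <UXXXX> \xNN\xNN... |flag
# (hypothetical pattern, written for the lines shown in this mail)
UCM_LINE = re.compile(
    r"^<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)\s+\|(\d)"
)


def parse_ucm(lines):
    """Yield (code_point, gb_bytes, flag) for every mapping line.

    Comment lines (starting with '#') and anything else that does not
    look like a mapping are skipped.
    """
    for line in lines:
        m = UCM_LINE.match(line.strip())
        if m is None:
            continue
        code_point = int(m.group(1), 16)
        gb_bytes = bytes.fromhex(m.group(2).replace("\\x", ""))
        yield code_point, gb_bytes, int(m.group(3))


def round_trip_only(lines):
    """Build a GB18030-bytes -> code point table from flag 0 lines only."""
    return {gb: cp for cp, gb, flag in parse_ucm(lines) if flag == 0}


# A few lines taken from the diff between 2000.ucm and 2022.ucm.
sample = [
    r"<U20AC> \x80 |3",
    r"<U3000> \xA3\xA0 |3",
    r"<UE5E5> \xA3\xA0 |4",
    r"# <UE5E5> \xA3\xA0 |0",
    r"<U9FB4> \xFE\x59 |0",
    r"<U9FB4> \x82\x35\x90\x37 |3",
    r"<UE81E> \xFE\x59 |1",
]

table = round_trip_only(sample)
flags = Counter(flag for _, _, flag in parse_ucm(sample))

for gb, cp in table.items():
    print(gb.hex().upper(), "-> U+%04X" % cp)
```

With this sample, only the new mapping of \xFE\x59 to U+9FB4 survives; the old U+E81E mapping (flag 1), the fallbacks (flag 3), the one-way mapping (flag 4) and the commented-out line are all dropped.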

For your question:

"9 characters are no longer required by the new standard, but are
retained in this patch for compatibility"

How is that done?

The 9 mappings are unchanged between 2000.ucm and 2022.ucm. For example, GB18030 code 0xFD9C is one of the 9 no-longer-required codes, but the mapping

<UF92C> \xFD\x9C |0

still appears in 2022.ucm, so the character is retained.
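
As a rough illustration of "unchanged between the two files", the sketch below compares two small excerpts as sets of mapping lines. It is Python written for this mail, not part of the patch; `mapping_lines`, `old_2000` and `new_2022` are hypothetical names, and the excerpts contain only lines quoted in this message rather than the full .ucm files.

```python
def mapping_lines(lines):
    """Return the non-comment, non-empty lines with whitespace normalized."""
    return {
        " ".join(line.split())
        for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    }


# Excerpt of 2000.ucm: one retained mapping, one that later changed.
old_2000 = [
    r"<UF92C> \xFD\x9C |0",
    r"<UE81E> \xFE\x59 |0",
]

# Excerpt of 2022.ucm: the retained line is identical, while \xFE\x59
# now round-trips to U+9FB4 and the old U+E81E mapping is flagged 1.
new_2022 = [
    r"<UF92C> \xFD\x9C |0",
    r"<UE81E> \xFE\x59 |1",
    r"<U9FB4> \xFE\x59 |0",
]

retained = mapping_lines(old_2000) & mapping_lines(new_2022)
changed = mapping_lines(old_2000) ^ mapping_lines(new_2022)

print("retained:", sorted(retained))
print("changed: ", sorted(changed))
```

The 0xFD9C line lands in `retained` because it is byte-for-byte the same in both excerpts, which is all that "retained for compatibility" means here: nothing was done to it.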


Chao Li (Evan)
--------------------
HighGo Software Co., Ltd.
https://www.highgo.com/



On Aug 11, 2025, at 13:50, John Naylor <johncnaylorls@gmail.com> wrote:

On Mon, Aug 11, 2025 at 9:01 AM Chao Li <li.evan.chao@gmail.com> wrote:

I have created a patch https://commitfest.postgresql.org/patch/5954/. CommitFests requested a rebase, so I rebased the code and created the v2 patch.

BTW, I have tested all 66 new characters, 9 not-required characters and 18 changed characters in a way as:

"9 characters are no longer required by the new standard, but are
retained in this patch for compatibility"

How is that done?

I added a test case with a character whose mapping changed, and the test passes:

% make check
...
# All 229 tests passed.

For more details on the standard change, see https://ken-lunde.medium.com/the-gb-18030-2022-standard-3d0ebaeb4132

I am attaching the patch file.

Going from the old .xml file to the .ucm file makes it difficult to
see the relevant changes. Also, there are nearly 1000 non-user-visible
changes like this in the output file that are not explained:

-  /*** Three byte table, leaf: efa8xx - offset 0x07aba ***/
+  /*** Three byte table, leaf: efa8xx - offset 0x07b3a ***/

The 2000 version is available in the .ucm format, so maybe converting
to that first would be a good preparatory patch:

https://github.com/unicode-org/icu-data/blob/main/charset/data/ucm/gb-18030-2000.ucm

Looking at the history, it looks like that file has seen small
revisions, so it may take some research to get the exact equivalent to
the XML file we use. That will also tell us if anything will change
for us besides the actual 2022 revision.

--
John Naylor
Amazon Web Services
