Re: GB18030-2022 Support in PostgreSQL - Mailing list pgsql-hackers

From Chao Li
Subject Re: GB18030-2022 Support in PostgreSQL
Date
Msg-id 450E19BB-D894-420B-8B8F-8B69221E3FB2@gmail.com
In response to Re: GB18030-2022 Support in PostgreSQL  (John Naylor <johncnaylorls@gmail.com>)
Responses Re: GB18030-2022 Support in PostgreSQL
List pgsql-hackers


On Sep 11, 2025, at 15:39, John Naylor <johncnaylorls@gmail.com> wrote:


On Wed, Sep 10, 2025 at 6:54 PM Chao Li <li.evan.chao@gmail.com> wrote:

> I downloaded the tests from the referenced mail, but I cannot get them to run. Extracting the 2 patch files added src/test/encodings, but "make check" does not seem to run them. I tried copying the .out and .sql files to src/test/regress, but the tests still did not run. Did I miss anything?

Sorry, I'm not quite sure either how to get it to run like a normal test. I got it to show the result by doing

psql -f src/test/encodings/sql/init.sql
psql -f src/test/encodings/sql/gb18030.sql > patch.out
diff -u src/test/encodings/expected/gb18030.out patch.out > v5-test.diff

I've attached what I got with the v5 patches, renamed to avoid being picked up by CI.

>> The upstream correction to the 2000 version is not present in our
>> mappings, so we should mention that, unless it was reverted in or
>> before 2022.
>
>
> I think the upstream correction to the 2000 version only affects a few non-round-trip characters that we ignore, so I feel we don't need to mention them.

This is the commit, and both of these are in the 2022 file as a round trip mapping. I don't see any mappings with non-zero flag in the 2000 file (in any upstream commit).

https://github.com/unicode-org/icu-data/commit/91850aec0209fee0f91248731c9dd4b2b768d2b5

I managed to get the encoding test to run. I didn’t find init.sql, so I had to manually create 3 functions on my own. But finally the test passed on the master branch.

Then I switched to the patch branch, where the test produced 21 differing lines. After I updated the expected output file for the 18 known changes, only 3 differing lines remained:

```
- \x8135f437   | \xe1b8bf
+ \x8135f437   | \xee9f87

- \xa3a0       | \xee97a5
+ \xa3a0       | character with byte sequence 0xa3 0xa0 in encoding "GB18030" has no equivalent in encoding "UTF8"

- \xa8bc       | \xee9f87
+ \xa8bc       | \xe1b8bf
```

Here, \x8135f437 and \xa8bc reflect the change pointed to by the link above:

\xA8BC used to map to Unicode U+E7C7. In the 2005 version, \x8135F437 was changed to map to U+E7C7, and \xA8BC was changed to map to U+1E3F.
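
The UTF-8 byte strings in the diff can be decoded to confirm which code points are involved. This is only a small sketch using the Python standard library; the helper name `utf8_hex_to_codepoint` is made up for illustration:

```python
def utf8_hex_to_codepoint(hex_bytes: str) -> str:
    """Decode a hex string of UTF-8 bytes holding one character to 'U+XXXX'."""
    ch = bytes.fromhex(hex_bytes).decode("utf-8")
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    return f"U+{ord(ch):04X}"


# The two UTF-8 values that swap places in the diff above.
for hex_bytes in ("ee9f87", "e1b8bf"):
    print(hex_bytes, "->", utf8_hex_to_codepoint(hex_bytes))
```

This prints U+E7C7 for ee9f87 and U+1E3F for e1b8bf, so the diff shows exactly the swap described above.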

For \xa3a0, the mapping in 2022.ucm is not a round-trip mapping:

```
<U3000> \xA3\xA0 |3
<UE5E5> \xA3\xA0 |4
```

So we ignore it, and that accounts for all 3 remaining differences.
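
As a rough illustration of why these two lines are skipped: in ICU's .ucm format the trailing `|N` is a precision flag, and only `|0` marks a round-trip mapping. The sketch below is not the script PostgreSQL uses to generate its map files; `parse_ucm_line`, `is_round_trip`, and the regular expression are assumptions made up for this example, and only single-code-point lines are handled.

```python
import re

# Matches lines such as "<U3000> \xA3\xA0 |3" (single code point only).
UCM_LINE = re.compile(
    r"^<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)\s+\|(\d)$"
)


def parse_ucm_line(line: str):
    """Return (code_point, encoded_bytes, flag), or None if the line does not match."""
    m = UCM_LINE.match(line.strip())
    if m is None:
        return None
    code_point = int(m.group(1), 16)
    encoded = bytes.fromhex(m.group(2).replace("\\x", ""))
    return code_point, encoded, int(m.group(3))


def is_round_trip(line: str) -> bool:
    """True only for mappings whose precision flag is 0."""
    parsed = parse_ucm_line(line)
    return parsed is not None and parsed[2] == 0


lines = [r"<U3000> \xA3\xA0 |3", r"<UE5E5> \xA3\xA0 |4"]
kept = [ln for ln in lines if is_round_trip(ln)]
print(kept)  # neither line has flag 0, so nothing is kept
```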


We should mention this correction for completeness. It seems to just move 'ḿ' out of the private use area. To be sure, likely almost no one will notice.

>> Your draft commit message had "9 characters are no longer required by
>> the new standard, but are retained in this patch for compatibility"
>> ...but those nine were introduced in the 2005 version, right? In which
>> case it doesn't affect us. Please confirm.
>
>
> I don't find any hint about if the 9 characters were introduced in the 2005 version.

Okay, I must have been confused by language "was included" in one of the linked references, which doesn't necessarily mean they were introduced there.

The 66 new mappings required are not in the 2022 UCM file and we already cover them algorithmically in utf8_and_gb18030.c, so they already work without this patch (see below, the glyphs render on my OS but maybe not everyone can see them). The commit message needs to focus on what actually changed for users (I'll work on that). Related information should be an afterthought.

# SELECT convert_from(decode('82358F33', 'hex'), 'GB18030');
 convert_from
--------------
 龦
(1 row)

# SELECT convert_from(decode('82359636', 'hex'), 'GB18030');
 convert_from
--------------
 鿯
(1 row)

While looking at utf8_and_gb18030.c, I see it refers to the XML file as the source of the algorithmic ranges. We'll want to keep some reference to the ranges independent of the XML file. I found

https://htmlpreview.github.io/?https://github.com/unicode-org/icu-data/blob/main/charset/source/gb18030/gb18030.html

...which gives general info and mentions that U+10000 starts at GB+90308130, and also links to

https://github.com/unicode-org/icu-data/blob/main/charset/source/gb18030/ranges.txt

...which has the same ranges we have below U+10000. Links can always disappear, but if the algorithmic ranges ever need to change (unlikely), we'll have new information about that.
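
To make the idea of an algorithmic range concrete, here is a small Python sketch of the usual linear arithmetic for four-byte GB18030 sequences. It is not the code in utf8_and_gb18030.c; the function names are hypothetical. The two range starts it uses come from this thread: GB+82358F33 decodes to 龦 (U+9FA6) in the first query above, and U+10000 starts at GB+90308130.

```python
def gb18030_linear(four_bytes: bytes) -> int:
    """Map a four-byte GB18030 sequence to its linear index."""
    if len(four_bytes) != 4:
        raise ValueError("expected a four-byte sequence")
    b1, b2, b3, b4 = four_bytes
    if not (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39):
        raise ValueError("not a valid four-byte GB18030 sequence")
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


def code_point_in_range(gb: bytes, range_start_gb: bytes, range_start_cp: int) -> int:
    """Code point for gb, given the first (bytes, code point) pair of its range."""
    return range_start_cp + gb18030_linear(gb) - gb18030_linear(range_start_gb)


# Second query above, relative to the range start GB+82358F33 = U+9FA6.
cp = code_point_in_range(bytes.fromhex("82359636"), bytes.fromhex("82358F33"), 0x9FA6)
print(f"U+{cp:04X}", chr(cp))

# Supplementary planes, relative to GB+90308130 = U+10000.
cp = code_point_in_range(bytes.fromhex("90308131"), bytes.fromhex("90308130"), 0x10000)
print(f"U+{cp:04X}")
```

The first call yields U+9FEF, the character returned by the second query above.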



I will post v6 soon with an updated commit message.

By the way, for how I made the test work:

1. I copied gb18030.sql and gb18030.out to the sql and expected subfolders of src/test/regress.
2. In src/test/regress/parallel_schedule, I added the line "test: gb18030".
3. Then "make check" runs the gb18030 test.

Attached are my updated sql and out files. To test on the master branch, use the original out file. To test with the patch, use my updated out file; the test will then fail with the 3 differing lines I mentioned above.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

