Re: GB18030-2022 Support in PostgreSQL - Mailing list pgsql-hackers

From John Naylor
Subject Re: GB18030-2022 Support in PostgreSQL
Date
Msg-id CANWCAZbM9Nex8A6BjpdmyHz44-1cizxvJ+zYsG7ikuxp2zJgYw@mail.gmail.com
In response to Re: GB18030-2022 Support in PostgreSQL  (Chao Li <li.evan.chao@gmail.com>)
Responses Re: GB18030-2022 Support in PostgreSQL
List pgsql-hackers

On Wed, Sep 10, 2025 at 6:54 PM Chao Li <li.evan.chao@gmail.com> wrote:

> I downloaded the tests from the referenced mail, but I cannot make the tests to run. After extracting the 2 patch files, it added src/test/encodings, but "make check" seems to not run them. I tried to copy .out and .sql files to src/test/regress, but the tests still not running. Did I miss anything?

Sorry, I'm not sure either how to get it to run as a normal regression test. I got it to show the result by running these commands manually:

psql -f src/test/encodings/sql/init.sql
psql -f src/test/encodings/sql/gb18030.sql > patch.out
diff -u src/test/encodings/expected/gb18030.out patch.out > v5-test.diff

I've attached what I got with the v5 patches, renamed to avoid being picked up by CI.

>> The upstream correction to the 2000 version is not present in our
>> mappings, so we should mention that, unless it was reverted in or
>> before 2022.
>
>
> I think the upstream correction to the 2000 version is just a few not round-trip chars that are ignored by us. So I feel we don't need to mention them.

The commit linked below is the upstream correction, and both of the mappings it touches are in the 2022 file as round-trip mappings. I don't see any mappings with a non-zero flag in the 2000 file (in any upstream commit).

https://github.com/unicode-org/icu-data/commit/91850aec0209fee0f91248731c9dd4b2b768d2b5

We should mention this correction for completeness. It seems to just move 'ḿ' out of the private use area. Admittedly, almost no one is likely to notice.

>> Your draft commit message had "9 characters are no longer required by
>> the new standard, but are retained in this patch for compatibility"
>> ...but those nine were introduced in the 2005 version, right? In which
>> case it doesn't affect us. Please confirm.
>
>
> I don't find any hint about if the 9 characters were introduced in the 2005 version.

Okay, I must have been confused by the phrase "was included" in one of the linked references, which doesn't necessarily mean the characters were introduced there.

The 66 newly required mappings are not in the 2022 UCM file, and we already cover them algorithmically in utf8_and_gb18030.c, so they work without this patch (see the queries below; the glyphs render on my OS, but maybe not everyone can see them). The commit message needs to focus on what actually changed for users (I'll work on that). Related information should be an afterthought.

# SELECT convert_from(decode('82358F33', 'hex'), 'GB18030');
 convert_from
--------------
 龦
(1 row)

# SELECT convert_from(decode('82359636', 'hex'), 'GB18030');
 convert_from
--------------
 鿯
(1 row)
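
For illustration only (this is not the code in utf8_and_gb18030.c), the algorithmic conversion exercised by the two queries above can be sketched in Python. The sketch assumes the standard linear ordering of four-byte GB18030 sequences and a single range, U+9FA6..U+D7FF starting at GB+82358F33, taken here as an assumption to be checked against the upstream range list. The function names are hypothetical.

```python
# Illustrative sketch only; not PostgreSQL code. All names are hypothetical.

def gb4_linear(b: bytes) -> int:
    """Return the linear index of a four-byte GB18030 sequence."""
    if len(b) != 4:
        raise ValueError("expected exactly four bytes")
    b1, b2, b3, b4 = b
    if not (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39):
        raise ValueError("not a valid four-byte GB18030 sequence")
    # Bytes 1 and 3 take 126 values (0x81..0xFE); bytes 2 and 4 take 10 (0x30..0x39).
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)

# ASSUMPTION: one algorithmic range, U+9FA6..U+D7FF, beginning at GB+82358F33.
RANGE_UCS_FIRST = 0x9FA6
RANGE_UCS_LAST = 0xD7FF
RANGE_GB_FIRST = bytes.fromhex("82358F33")

def gb4_to_codepoint(b: bytes) -> int:
    """Map a four-byte sequence inside the assumed range to its code point."""
    offset = gb4_linear(b) - gb4_linear(RANGE_GB_FIRST)
    cp = RANGE_UCS_FIRST + offset
    if not RANGE_UCS_FIRST <= cp <= RANGE_UCS_LAST:
        raise ValueError("sequence is outside the assumed range")
    return cp

if __name__ == "__main__":
    for hexcode in ("82358F33", "82359636"):
        cp = gb4_to_codepoint(bytes.fromhex(hexcode))
        print(f"GB+{hexcode} -> U+{cp:04X}")
```

Under these assumptions the two inputs from the queries come out as U+9FA6 and U+9FEF, with no table lookup involved.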

While looking at utf8_and_gb18030.c, I see it refers to the XML file as the source of the algorithmic ranges. We'll want to keep some reference to the ranges independent of the XML file. I found

https://htmlpreview.github.io/?https://github.com/unicode-org/icu-data/blob/main/charset/source/gb18030/gb18030.html

...which gives general info and mentions that U+10000 starts at GB+90308130, and also links to

https://github.com/unicode-org/icu-data/blob/main/charset/source/gb18030/ranges.txt

...which has the same ranges we have below U+10000. Links can always disappear, but if the algorithmic ranges ever need to change (unlikely), we'll have new information about that.
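
As a rough sketch of what "U+10000 starts at GB+90308130" implies, the supplementary planes can be encoded by adding the code point's offset from U+10000 to the linear index of GB+90308130 and converting back to four bytes. This is illustrative Python with hypothetical function names, assuming the usual byte ranges for four-byte sequences (0x81..0xFE for bytes 1 and 3, 0x30..0x39 for bytes 2 and 4); it is not taken from PostgreSQL or ICU.

```python
# Illustrative sketch only; function names are hypothetical.

# From the text above: U+10000 corresponds to GB+90308130.
GB_FIRST_SUPPLEMENTARY = bytes.fromhex("90308130")

def gb4_linear(b: bytes) -> int:
    """Return the linear index of a four-byte GB18030 sequence."""
    b1, b2, b3, b4 = b
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)

def linear_to_gb4(n: int) -> bytes:
    """Inverse of gb4_linear: turn a linear index back into four bytes."""
    n, b4 = divmod(n, 10)
    n, b3 = divmod(n, 126)
    b1, b2 = divmod(n, 10)
    return bytes((b1 + 0x81, b2 + 0x30, b3 + 0x81, b4 + 0x30))

def supplementary_to_gb4(cp: int) -> bytes:
    """Encode a code point in U+10000..U+10FFFF as four GB18030 bytes."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in the supplementary planes")
    return linear_to_gb4(gb4_linear(GB_FIRST_SUPPLEMENTARY) + (cp - 0x10000))

if __name__ == "__main__":
    for cp in (0x10000, 0x10FFFF):
        print(f"U+{cp:X} -> GB+{supplementary_to_gb4(cp).hex().upper()}")
```

Because the whole supplementary range is one linear run, a single base constant is enough; the ranges below U+10000 need one base pair per range.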

--
John Naylor
Amazon Web Services
