- The URL at the top currently points to a directory in Github, but v3
changed it to point to the actual file. A directory can be navigated
for inspection, so I used:
2000:
https://github.com/unicode-org/icu-data/tree/main/charset/data/ucm
2022:
https://github.com/unicode-org/icu/blob/main/icu4c/source/data/mappings/
Looks good.
- I also made the regex a multiline regex for readability, even though
the previous one was not.
Thank you very much for polishing the perl script. I am not an expert of perl. I can make the script working, but not perfect.
For 2022 version, I think it would be good to once run a test to
verify that no mappings changed that we didn't expect. Perhaps the
tests here can be used:
https://www.postgresql.org/message-id/b9e3167f-f84b-7aa4-5738-be578a4db924%40iki.fi
I have manually run tested I had done before, everything works as expected.
I downloaded the tests from the referenced mail, but I cannot make the tests to run. After extracting the 2 patch files, it added src/test/encodings, but "make check" seems to not run them. I tried to copy .out and .sql files to src/test/regress, but the tests still not running. Did I miss anything?
The upstream correction to the 2000 version is not present in our
mappings, so we should mention that, unless it was reverted in or
before 2022.
I think the upstream correction to the 2000 version is just a few not round-trip chars that are ignored by us. So I feel we don't need to mention them.
In the documentation (charset.sgml), do we want to mention the version
e.g. the following?
<entry><literal>GB18030</literal></entry>
-<entry>National Standard</entry>
+<entry>National Standard, version 2022</entry>
That's a good idea. I updated the sgml file:
I've whacked around the commit messages, so those should be reviewed
for accuracy.
Your draft commit message had "9 characters are no longer required by
the new standard, but are retained in this patch for compatibility"
...but those nine were introduced in the 2005 version, right? In which
case it doesn't affect us. Please confirm.
I don't find any hint about if the 9 characters were introduced in the 2005 version.
But without this patch, they can be properly converted:
```
evantest=# SELECT encode(convert_from(decode('FD9D', 'hex'), 'GB18030')::bytea, 'hex');
encode
--------
efa5b9
(1 row)
```
So they should be available in the version 2002 already.
"Author: Zheng Tao <taoz@highgo.com>" -- I haven't seen any messages
from this address in this thread, so could you confirm this was
intentional?
Yes, Zheng Tao is my colleague. He worked with me for this patch, so I want to credit him.
I am attaching v5 version. The only change is 0003, I added the SGML change.
Best regards,
Chao Li (Evan)
---------------------