Re: Errors in our encoding conversion tables - Mailing list pgsql-hackers
From | Tatsuo Ishii |
---|---|
Subject | Re: Errors in our encoding conversion tables |
Date | |
Msg-id | 20151127.110027.1989081859519291674.t-ishii@sraoss.co.jp |
In response to | Errors in our encoding conversion tables (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses | Re: Errors in our encoding conversion tables |
List | pgsql-hackers |
> There's a discussion over at
> http://www.postgresql.org/message-id/flat/2sa.Dhu5.1hk1yrpTNFy.1MLOlb@seznam.cz
> of an apparent error in our WIN1250 -> LATIN2 conversion. I looked into this
> and found that indeed, the code will happily translate certain characters
> for which there seems to be no justification. I made up a quick script
> that would recompute the conversion tables in latin2_and_win1250.c from
> the Unicode mapping files in src/backend/utils/mb/Unicode, and what it
> computes is shown in the attached diff. (Zeroes in the tables indicate
> codes with no translation, for which an error should be thrown.)
>
> Having done that, I thought it would be a good idea to see if we had any
> other conversion tables that weren't directly based on the Unicode data.
> The only ones I could find were in cyrillic_and_mic.c, and those seem to
> be absolutely filled with errors, to the point where I wonder if they were
> made from the claimed encodings or some other ones. The attached patch
> recomputes those from the Unicode data, too.
>
> None of this data seems to have been touched since Tatsuo-san's original
> commit 969e0246, so it looks like we simply didn't vet that submission
> closely enough.
>
> I have not attempted to reverify the files in utils/mb/Unicode against the
> original Unicode Consortium data, but maybe we ought to do that before
> taking any further steps here.
>
> Anyway, what are we going to do about this? I'm concerned that simply
> shoving in corrections may cause problems for users. Almost certainly,
> we should not back-patch this kind of change.

I have started looking into it. I wonder how you created this part of
your patch:

*** 154,163 ****
  win12502mic(const unsigned char *l, unsigned char *p, int len)
  {
      static const unsigned char win1250_2_iso88592[]= {
!         0x80, 0x81, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87,
!         0x88, 0x89, 0xA9, 0x8B, 0xA6, 0xAB, 0xAE, 0xAC,
!         0x90, 0x91, 0x92, 0x93, 0x94, 0x95, 0x96, 0x97,
!         0x98, 0x99, 0xB9, 0x9B, 0xB6, 0xBB, 0xBE, 0xBC,
          0xA0, 0xB7, 0xA2, 0xA3, 0xA4, 0xA1, 0x00, 0xA7,
          0xA8, 0x00, 0xAA, 0x00, 0x00, 0xAD, 0x00, 0xAF,
          0xB0, 0x00, 0xB2, 0xB3, 0xB4, 0x00, 0x00, 0x00,
--- 154,163 ----
  win12502mic(const unsigned char *l, unsigned char *p, int len)
  {
      static const unsigned char win1250_2_iso88592[]= {
!         0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
!         0x00, 0x00, 0xA9, 0x00, 0xA6, 0xAB, 0xAE, 0xAC,
!         0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
!         0x00, 0x00, 0xB9, 0x00, 0xB6, 0xBB, 0xBE, 0xBC,
          0xA0, 0xB7, 0xA2, 0xA3, 0xA4, 0xA1, 0x00, 0xA7,
          0xA8, 0x00, 0xAA, 0x00, 0x00, 0xAD, 0x00, 0xAF,
          0xB0, 0x00, 0xB2, 0xB3, 0xB4, 0x00, 0x00, 0x00,

In the above, you seem to have disabled the conversion of WIN1250 0x96
to ISO-8859-2, based on the Unicode mapping files in
src/backend/utils/mb/Unicode. But the corresponding mapping file
(iso8859_2_to_utf8.map) does include the following entry:

    {0x0096, 0xc296},

How do you know that 0x96 should be removed from the conversion?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
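For readers following the question, here is a minimal sketch of how a WIN1250 -> LATIN2 table entry could be recomputed by going through Unicode. It is not the script used for the patch. The names map_entry, lookup_ucs, lookup_code and win1250_to_latin2 are hypothetical, and the two maps are a small hand-picked subset of the Unicode Consortium's CP1250 and ISO 8859-2 assignments rather than the contents of the .map files. Under this reading, a source byte gets a nonzero entry only when its code point also exists in the target encoding. WIN1250 0x96 is U+2013 (EN DASH), while LATIN2 0x96 is the control character U+0096, so the two do not match and the computed entry is 0x00.

```c
#include <stddef.h>

/* One entry of a "byte code -> Unicode code point" map. */
typedef struct
{
    unsigned char code;     /* byte value in the single-byte encoding */
    unsigned int  ucs;      /* Unicode code point that byte stands for */
} map_entry;

/* Illustrative subsets only; real maps cover every assigned byte. */
static const map_entry win1250_map[] = {
    {0x8A, 0x0160},         /* LATIN CAPITAL LETTER S WITH CARON */
    {0x96, 0x2013},         /* EN DASH */
    {0xA0, 0x00A0},         /* NO-BREAK SPACE */
    {0xA1, 0x02C7},         /* CARON */
    {0xA5, 0x0104},         /* LATIN CAPITAL LETTER A WITH OGONEK */
};

static const map_entry latin2_map[] = {
    {0x96, 0x0096},         /* C1 control character */
    {0xA0, 0x00A0},         /* NO-BREAK SPACE */
    {0xA1, 0x0104},         /* LATIN CAPITAL LETTER A WITH OGONEK */
    {0xA9, 0x0160},         /* LATIN CAPITAL LETTER S WITH CARON */
    {0xB7, 0x02C7},         /* CARON */
};

#define lengthof(a) (sizeof(a) / sizeof((a)[0]))

/* Return the code point for a byte, or 0 if the map has no entry. */
static unsigned int
lookup_ucs(const map_entry *map, size_t n, unsigned char code)
{
    for (size_t i = 0; i < n; i++)
        if (map[i].code == code)
            return map[i].ucs;
    return 0;
}

/* Return the byte for a code point, or 0x00 if it is not representable. */
static unsigned char
lookup_code(const map_entry *map, size_t n, unsigned int ucs)
{
    for (size_t i = 0; i < n; i++)
        if (map[i].ucs == ucs)
            return map[i].code;
    return 0x00;
}

/* WIN1250 byte -> LATIN2 byte via Unicode; 0x00 means "no translation". */
unsigned char
win1250_to_latin2(unsigned char c)
{
    unsigned int ucs = lookup_ucs(win1250_map, lengthof(win1250_map), c);

    if (ucs == 0)
        return 0x00;
    return lookup_code(latin2_map, lengthof(latin2_map), ucs);
}
```

With these subsets, win1250_to_latin2(0x8A) gives 0xA9 and win1250_to_latin2(0xA1) gives 0xB7, matching the table in the diff, while win1250_to_latin2(0x96) gives 0x00.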