Re: Errors in our encoding conversion tables - Mailing list pgsql-hackers
From | Albe Laurenz
Subject | Re: Errors in our encoding conversion tables
Msg-id | A737B7A37273E048B164557ADEF4A58B50FECB63@ntex2010i.host.magwien.gv.at
In response to | Errors in our encoding conversion tables (Tom Lane <tgl@sss.pgh.pa.us>)
List | pgsql-hackers
Tom Lane wrote:
> There's a discussion over at
> http://www.postgresql.org/message-id/flat/2sa.Dhu5.1hk1yrpTNFy.1MLOlb@seznam.cz
> of an apparent error in our WIN1250 -> LATIN2 conversion. I looked into this
> and found that indeed, the code will happily translate certain characters
> for which there seems to be no justification. I made up a quick script
> that would recompute the conversion tables in latin2_and_win1250.c from
> the Unicode mapping files in src/backend/utils/mb/Unicode, and what it
> computes is shown in the attached diff. (Zeroes in the tables indicate
> codes with no translation, for which an error should be thrown.)
>
> Having done that, I thought it would be a good idea to see if we had any
> other conversion tables that weren't directly based on the Unicode data.
> The only ones I could find were in cyrillic_and_mic.c, and those seem to
> be absolutely filled with errors, to the point where I wonder if they were
> made from the claimed encodings or some other ones. The attached patch
> recomputes those from the Unicode data, too.
>
> None of this data seems to have been touched since Tatsuo-san's original
> commit 969e0246, so it looks like we simply didn't vet that submission
> closely enough.
>
> I have not attempted to reverify the files in utils/mb/Unicode against the
> original Unicode Consortium data, but maybe we ought to do that before
> taking any further steps here.
>
> Anyway, what are we going to do about this? I'm concerned that simply
> shoving in corrections may cause problems for users. Almost certainly,
> we should not back-patch this kind of change.

Thanks for picking this up.

I agree with your proposed fix; the only thing that makes me feel uncomfortable is that you get error messages like:

ERROR: character with byte sequence 0x96 in encoding "WIN1250" has no equivalent in encoding "MULE_INTERNAL"

which is a bit misleading. But the main thing is that no corrupt data can be entered.
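To see why byte 0x96 is the example that trips the error, here is a small sanity check outside PostgreSQL; it is only a sketch, assuming Python's codec names `cp1250` and `iso-8859-2` as stand-ins for the WIN1250 and LATIN2 encodings (it does not exercise PostgreSQL's own conversion code):

```python
# WIN1250 byte 0x96 decodes to U+2013 (EN DASH), which has no
# representation in LATIN2 -- so a strict conversion must fail.
ch = b"\x96".decode("cp1250")       # cp1250 ~ WIN1250
print(f"U+{ord(ch):04X}")           # U+2013

try:
    ch.encode("iso-8859-2")         # iso-8859-2 ~ LATIN2
except UnicodeEncodeError:
    print("no LATIN2 equivalent")   # the error the fixed tables should raise
```

This matches the diagnosis in the quoted mail: the old table silently translated such characters instead of raising an error.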
I can understand the reluctance to back-patch; nobody likes their application to suddenly fail after a minor database upgrade.

However, the people who would fail if this were back-patched are people who will certainly run into trouble if they a) upgrade to a release where this is fixed or b) try to convert their database to, say, UTF8.

The least we should do is put a fat warning into the release notes of the first version where this is fixed, along with some guidelines on what to do (though I am afraid that there is not much more helpful to say than "If your database encoding is X and data have been entered with client_encoding Y, fix your data in the old system").

But I think that this fix should be applied to 9.6. PostgreSQL has a strong reputation for being strict about correct encoding (not saying that everybody appreciates that), and I think we shouldn't mar that reputation.

Yours,
Laurenz Albe
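As a rough aid for the "fix your data in the old system" advice above, one could enumerate which WIN1250 code points would start failing under the corrected tables. This is a hypothetical sketch, again assuming Python's `cp1250`/`iso-8859-2` codecs as proxies for WIN1250/LATIN2 rather than PostgreSQL's actual conversion tables:

```python
# List WIN1250 bytes that have no LATIN2 equivalent (or are undefined
# in WIN1250 itself); data containing these bytes would need fixing
# before a conversion to LATIN2 can succeed.
bad = []
for b in range(256):
    try:
        bytes([b]).decode("cp1250").encode("iso-8859-2")
    except (UnicodeDecodeError, UnicodeEncodeError):
        bad.append(b)

print([f"0x{b:02X}" for b in bad])   # includes 0x96, among others
```

A user could then grep their dump for these byte values to locate the rows that need repair before upgrading.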