Re: Patch for bug #12845 (GB18030 encoding) - Mailing list pgsql-hackers
From | Arjen Nienhuis |
---|---|
Subject | Re: Patch for bug #12845 (GB18030 encoding) |
Date | |
Msg-id | CAG6W84+FnfkFj2G_OXeLEnUWAoK2QGHNmy-72E_Yza7J-LfvAQ@mail.gmail.com Whole thread Raw |
In response to | Re: Patch for bug #12845 (GB18030 encoding) (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Patch for bug #12845 (GB18030 encoding)
|
List | pgsql-hackers |
On Thu, May 14, 2015 at 11:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> On Wed, May 6, 2015 at 11:13 AM, Alvaro Herrera >>> <alvherre@2ndquadrant.com> wrote: >>>> Maybe not, but at the very least we should consider getting it fixed in >>>> 9.5 rather than waiting a full development cycle. Same as in >>>> https://www.postgresql.org/message-id/20150428131549.GA25925@momjian.us >>>> I'm not saying we MUST include it in 9.5, but we should at least >>>> consider it. If we simply stash it in the open CF we guarantee that it >>>> will linger there for a year. > >>> Sure, if somebody has the time to put into it now, I'm fine with that. >>> I'm afraid it won't be me, though: even if I had the time, I don't >>> know enough about encodings. > >> I concur that we should at least consider this patch for 9.5. I've >> added it to >> https://wiki.postgresql.org/wiki/PostgreSQL_9.5_Open_Items > > I looked at this patch a bit, and read up on GB18030 (thank you > wikipedia). I concur we have a problem to fix. I do not like the way > this patch went about it though, ie copying-and-pasting LocalToUtf and > UtfToLocal and their supporting routines into utf8_and_gb18030.c. > Aside from being duplicative, this means the improved mapping capability > isn't available to use with anything except GB18030. (I do not know > whether there are any linear mapping ranges in other encodings, but > seeing that the Unicode crowd went to the trouble of defining a notation > for it in http://www.unicode.org/reports/tr22/, I'm betting there are.) > > What I think would be a better solution, if slightly more invasive, > is to extend LocalToUtf and UtfToLocal to add a callback function > argument for a function of signature "uint32 translate(uint32)". > This function, if provided, would be called after failing to find a > mapping in the mapping table(s), and it could implement any translation > that would be better handled by code than as a boatload of mapping-table > entries. If it returns zero then it doesn't know a translation either, > so throw error as before. > > An alternative definition that could be proposed would be to call the > function before consulting the mapping tables, not after, on the grounds > that the function can probably exit cheaply if the input's not in a range > that it cares about. However, consulting the mapping table first wins > if you have ranges that mostly work but contain a few exceptions: put > the exceptions in the mapping table and then the function need not worry > about handling them. > > Another alternative approach would be to try to define linear mapping > ranges in a tabular fashion, for more consistency with what's there now. > But that probably wouldn't work terribly well because the bytewise > character representations used in this logic have to be converted into > code points before you can do any sort of linear mapping. We could > hard-wire that conversion for UTF8, but the conversion in the other code > space would be encoding-specific. So we might as well just treat the > whole linear mapping behavior as a black box function for each encoding. > > I'm also discounting the possibility that someone would want an > algorithmic mapping for cases involving "combined" codes (ie pairs of > UTF8 characters). Of the encodings we support, only EUC_JIS_2004 and > SHIFT_JIS_2004 need such cases at all, and those have only a handful of > cases; so it doesn't seem popular enough to justify the extra complexity. > > I also notice that pg_gb18030_verifier isn't even close to strict enough; > it basically relies on pg_gb18030_mblen which contains no checks > whatsoever on the third and fourth bytes. So that needs to be fixed. > > The verification tightening would definitely not be something to > back-patch, and I'm inclined to think that the additional mapping > capability shouldn't be either, in view of the facts that (a) we've > had few if any field complaints yet, and (b) changing the signatures > of LocalToUtf/UtfToLocal might possibly break third-party code. > So I'm seeing this as a HEAD-only patch, but I do want to try to > squeeze it into 9.5 rather than wait another year. > > Barring objections, I'll go make this happen. GB18030 is a special case, because it's a full mapping of all unicode characters, and most of it is algorithmically defined. This makes UtfToLocal a bad choice to implement it. UtfToLocal assumes a sparse array with only the defined characters. It uses binary search to find a character. The 2 tables it uses now are huge (the .so file is 1MB). Adding the rest of the valid characters to this scheme is possible, but would make the problem worse. I think fixing UtfToLocal only for the new characters is not optimal. I think the best solution is to get rid of UtfToLocal for GB18030. Use a specialized algorithm: - For characters > U+FFFF use the algorithm from my patch - For charcaters <= U+FFFF use special mapping tables to map from/to UTF32. Those tables would be smaller, and the code would be faster (I assume). For example (256 KB): unsigned int utf32_to_gb18030[65536] = { /* 0x0 */ 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, /* 0x8 */ 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, -- /* 0xdb08 */ 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, -- /* 0xfff0 */ 0x8431a334, 0x8431a335, 0x8431a336, 0x8431a337, 0x8431a338, 0x8431a339, 0x8431a430, 0x8431a431, /* 0xfff8 */ 0x8431a432, 0x8431a433, 0x8431a434, 0x8431a435, 0x8431a436, 0x8431a437, 0x8431a438, 0x8431a439 }; Instead of (500KB): static pg_utf_to_local ULmapGB18030[ 63360 ] = { {0xc280, 0x81308130}, {0xc281, 0x81308131}, -- {0xefbfbe, 0x8431a438}, {0xefbfbf, 0x8431a439} }; See the attachment for a python script to generate those mappings. Gr. Arjen
Attachment
pgsql-hackers by date: