Re: Patch for bug #12845 (GB18030 encoding) - Mailing list pgsql-hackers
From | Tom Lane
---|---
Subject | Re: Patch for bug #12845 (GB18030 encoding)
Date |
Msg-id | 22735.1431717506@sss.pgh.pa.us
In response to | Re: Patch for bug #12845 (GB18030 encoding) (Arjen Nienhuis <a.g.nienhuis@gmail.com>)
Responses | Re: Patch for bug #12845 (GB18030 encoding), Re: Patch for bug #12845 (GB18030 encoding)
List | pgsql-hackers
Arjen Nienhuis <a.g.nienhuis@gmail.com> writes:
> On Fri, May 15, 2015 at 4:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> According to that, about half of the characters below U+FFFF can be
>> processed via linear conversions, so I think we ought to save table
>> space by doing that. However, the remaining stuff that has to be
>> processed by lookup still contains a pretty substantial number of
>> characters that map to 4-byte GB18030 characters, so I don't think
>> we can get any table size savings by adopting a bespoke table format.
>> We might as well use UtfToLocal. (Worth noting in this connection
>> is that we haven't seen fit to sweat about UtfToLocal's use of 4-byte
>> table entries for other encodings, even though most of the others
>> are not concerned with characters outside the BMP.)

> It's not about 4 vs 2 bytes, it's about using 8 bytes vs 4. UtfToLocal
> uses a sparse array:
> map = {{0, x}, {1, y}, {2, z}, ...}
> v.s.
> map = {x, y, z, ...}
> That's fine when not every code point is used, but it's different for
> GB18030 where almost all code points are used. Using a plain array
> saves space and saves a binary search.

Well, it doesn't save any space: if we get rid of the additional linear
ranges in the lookup table, what remains is 30733 entries requiring about
256K, same as (or a bit less than) what you suggest.

The point about possibly being able to do this with a simple lookup table
instead of binary search is valid, but I still say it's a mistake to
suppose that we should consider that only for GB18030. With the reduced
table size, the GB18030 conversion tables are not all that far out of
line with the other Far Eastern conversions:

$ size utf8*.so | sort -n
   text    data     bss     dec     hex filename
   1880     512      16    2408     968 utf8_and_ascii.so
   2394     528      16    2938     b7a utf8_and_iso8859_1.so
   6674     512      16    7202    1c22 utf8_and_cyrillic.so
  24318     904      16   25238    6296 utf8_and_win.so
  28750     968      16   29734    7426 utf8_and_iso8859.so
 121110     512      16  121638   1db26 utf8_and_euc_cn.so
 123458     512      16  123986   1e452 utf8_and_sjis.so
 133606     512      16  134134   20bf6 utf8_and_euc_kr.so
 185014     512      16  185542   2d4c6 utf8_and_sjis2004.so
 185522     512      16  186050   2d6c2 utf8_and_euc2004.so
 212950     512      16  213478   341e6 utf8_and_euc_jp.so
 221394     512      16  221922   362e2 utf8_and_big5.so
 274772     512      16  275300   43364 utf8_and_johab.so
 277776     512      16  278304   43f20 utf8_and_uhc.so
 332262     512      16  332790   513f6 utf8_and_euc_tw.so
 350640     512      16  351168   55bc0 utf8_and_gbk.so
 496680     512      16  497208   79638 utf8_and_gb18030.so

If we were to get excited about reducing the conversion time for GB18030,
it would clearly make sense to use similar infrastructure for GBK, and
perhaps the EUC encodings too. However, I'm not that excited about
changing it. We have not heard field complaints about these converters
being too slow. What's more, there doesn't seem to be any practical way
to apply the same idea to the other conversion direction, which means if
you do feel there's a speed problem this would only halfway fix it.

So my feeling is that the most practical and maintainable answer is to
keep GB18030 using code that is mostly shared with the other encodings.
I've committed a fix that does it that way for 9.5. If you want to pursue
the idea of a faster conversion using direct lookup tables, I think that
would be 9.6 material at this point.

            regards, tom lane
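For readers following the table-layout argument in the exchange above, here is a minimal, self-contained C sketch contrasting the two approaches: a sparse array of (code point, value) pairs that is binary-searched, in the style of UtfToLocal, versus the dense, directly indexed array Arjen proposes for GB18030. The names (`SparseEntry`, `sparse_lookup`, `dense_lookup`) and the range constants are hypothetical illustrations, not PostgreSQL's actual data structures or table contents.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Illustrative sketch only -- the identifiers and range constants below
 * are hypothetical and do not reflect PostgreSQL's real conversion tables.
 */

/*
 * Sparse layout (UtfToLocal style): one 8-byte entry per mapped code
 * point, kept sorted by the Unicode value so it can be binary-searched.
 */
typedef struct
{
    uint32_t utf;               /* Unicode code point */
    uint32_t code;              /* mapped GB18030 value */
} SparseEntry;

static int
sparse_cmp(const void *a, const void *b)
{
    uint32_t ka = ((const SparseEntry *) a)->utf;
    uint32_t kb = ((const SparseEntry *) b)->utf;

    return (ka > kb) - (ka < kb);
}

static uint32_t
sparse_lookup(const SparseEntry *map, size_t nentries, uint32_t utf)
{
    SparseEntry key = {utf, 0};
    const SparseEntry *hit = bsearch(&key, map, nentries,
                                     sizeof(SparseEntry), sparse_cmp);

    return hit ? hit->code : 0; /* 0 = no mapping found */
}

/*
 * Dense layout: when nearly every code point in a range is mapped, store
 * only the 4-byte target values and index directly -- a single array
 * reference instead of O(log n) comparisons, at 4 bytes per covered code
 * point whether it is mapped or not.
 */
#define DENSE_BASE  0x0080      /* first covered code point (hypothetical) */
#define DENSE_SIZE  0xFF80      /* number of covered code points (hypothetical) */

static uint32_t dense_map[DENSE_SIZE];  /* filled from the mapping data */

static uint32_t
dense_lookup(uint32_t utf)
{
    if (utf < DENSE_BASE || utf >= DENSE_BASE + DENSE_SIZE)
        return 0;               /* outside the covered range */
    return dense_map[utf - DENSE_BASE];
}
```

The space trade-off discussed in the thread follows directly from these layouts: the sparse form costs 8 bytes per mapped code point, while the dense form costs 4 bytes per code point in the covered range, occupied or not, which is why the two come out roughly even once the linear ranges are stripped from the sparse table and only the lookup speed differs.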