Re: Patch for bug #12845 (GB18030 encoding) - Mailing list pgsql-hackers
From | Tom Lane
---|---
Subject | Re: Patch for bug #12845 (GB18030 encoding)
Date |
Msg-id | 22735.1431717506@sss.pgh.pa.us
In response to | Re: Patch for bug #12845 (GB18030 encoding) (Arjen Nienhuis <a.g.nienhuis@gmail.com>)
Responses | Re: Patch for bug #12845 (GB18030 encoding), Re: Patch for bug #12845 (GB18030 encoding)
List | pgsql-hackers
Arjen Nienhuis <a.g.nienhuis@gmail.com> writes:
> On Fri, May 15, 2015 at 4:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> According to that, about half of the characters below U+FFFF can be
>> processed via linear conversions, so I think we ought to save table
>> space by doing that. However, the remaining stuff that has to be
>> processed by lookup still contains a pretty substantial number of
>> characters that map to 4-byte GB18030 characters, so I don't think
>> we can get any table size savings by adopting a bespoke table format.
>> We might as well use UtfToLocal. (Worth noting in this connection
>> is that we haven't seen fit to sweat about UtfToLocal's use of 4-byte
>> table entries for other encodings, even though most of the others
>> are not concerned with characters outside the BMP.)

> It's not about 4 vs 2 bytes, it's about using 8 bytes vs 4. UtfToLocal
> uses a sparse array:
> map = {{0, x}, {1, y}, {2, z}, ...}
> v.s.
> map = {x, y, z, ...}
> That's fine when not every code point is used, but it's different for
> GB18030 where almost all code points are used. Using a plain array
> saves space and saves a binary search.

Well, it doesn't save any space: if we get rid of the additional linear
ranges in the lookup table, what remains is 30733 entries requiring about
256K, same as (or a bit less than) what you suggest.

The point about possibly being able to do this with a simple lookup table
instead of binary search is valid, but I still say it's a mistake to
suppose that we should consider that only for GB18030. With the reduced
table size, the GB18030 conversion tables are not all that far out of
line with the other Far Eastern conversions:

$ size utf8*.so | sort -n
   text    data     bss     dec     hex filename
   1880     512      16    2408     968 utf8_and_ascii.so
   2394     528      16    2938     b7a utf8_and_iso8859_1.so
   6674     512      16    7202    1c22 utf8_and_cyrillic.so
  24318     904      16   25238    6296 utf8_and_win.so
  28750     968      16   29734    7426 utf8_and_iso8859.so
 121110     512      16  121638   1db26 utf8_and_euc_cn.so
 123458     512      16  123986   1e452 utf8_and_sjis.so
 133606     512      16  134134   20bf6 utf8_and_euc_kr.so
 185014     512      16  185542   2d4c6 utf8_and_sjis2004.so
 185522     512      16  186050   2d6c2 utf8_and_euc2004.so
 212950     512      16  213478   341e6 utf8_and_euc_jp.so
 221394     512      16  221922   362e2 utf8_and_big5.so
 274772     512      16  275300   43364 utf8_and_johab.so
 277776     512      16  278304   43f20 utf8_and_uhc.so
 332262     512      16  332790   513f6 utf8_and_euc_tw.so
 350640     512      16  351168   55bc0 utf8_and_gbk.so
 496680     512      16  497208   79638 utf8_and_gb18030.so

If we were to get excited about reducing the conversion time for GB18030,
it would clearly make sense to use similar infrastructure for GBK, and
perhaps the EUC encodings too. However, I'm not that excited about
changing it. We have not heard field complaints about these converters
being too slow. What's more, there doesn't seem to be any practical way
to apply the same idea to the other conversion direction, which means if
you do feel there's a speed problem this would only halfway fix it.

So my feeling is that the most practical and maintainable answer is to
keep GB18030 using code that is mostly shared with the other encodings.
I've committed a fix that does it that way for 9.5. If you want to pursue
the idea of a faster conversion using direct lookup tables, I think that
would be 9.6 material at this point.

            regards, tom lane
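For readers following the table-layout argument in the exchange above, here is a minimal, self-contained C sketch contrasting the two approaches: a sparse array of (code point, value) pairs that is binary-searched, in the style of UtfToLocal, versus the dense, directly indexed array Arjen proposes for GB18030. The names (`SparseEntry`, `sparse_lookup`, `dense_lookup`) and the range constants are hypothetical illustrations, not PostgreSQL's actual data structures or table contents.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Illustrative sketch only -- the identifiers and range constants below
 * are hypothetical and do not reflect PostgreSQL's real conversion tables.
 */

/*
 * Sparse layout (UtfToLocal style): one 8-byte entry per mapped code
 * point, kept sorted by the Unicode value so it can be binary-searched.
 */
typedef struct
{
    uint32_t utf;               /* Unicode code point */
    uint32_t code;              /* mapped GB18030 value */
} SparseEntry;

static int
sparse_cmp(const void *a, const void *b)
{
    uint32_t ka = ((const SparseEntry *) a)->utf;
    uint32_t kb = ((const SparseEntry *) b)->utf;

    return (ka > kb) - (ka < kb);
}

static uint32_t
sparse_lookup(const SparseEntry *map, size_t nentries, uint32_t utf)
{
    SparseEntry key = {utf, 0};
    const SparseEntry *hit = bsearch(&key, map, nentries,
                                     sizeof(SparseEntry), sparse_cmp);

    return hit ? hit->code : 0; /* 0 = no mapping found */
}

/*
 * Dense layout: when nearly every code point in a range is mapped, store
 * only the 4-byte target values and index directly -- a single array
 * reference instead of O(log n) comparisons, at 4 bytes per covered code
 * point whether it is mapped or not.
 */
#define DENSE_BASE  0x0080      /* first covered code point (hypothetical) */
#define DENSE_SIZE  0xFF80      /* number of covered code points (hypothetical) */

static uint32_t dense_map[DENSE_SIZE];  /* filled from the mapping data */

static uint32_t
dense_lookup(uint32_t utf)
{
    if (utf < DENSE_BASE || utf >= DENSE_BASE + DENSE_SIZE)
        return 0;               /* outside the covered range */
    return dense_map[utf - DENSE_BASE];
}
```

The space trade-off discussed in the thread follows directly from these layouts: the sparse form costs 8 bytes per mapped code point, while the dense form costs 4 bytes per code point in the covered range, occupied or not, which is why the two come out roughly even once the linear ranges are stripped from the sparse table and only the lookup speed differs.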