Thread: Wrong charset mappings

Wrong charset mappings

From: Thomas O'Dowd
Date: 06 February 2003, 23:05:35

Hi all,

One Japanese character has been causing my head to swim lately. I've
finally tracked down the problem to both Java 1.3 and Postgresql.

The problem character is namely:
utf-16: 0x301C
utf-8: 0xE3809C
SJIS: 0x8160
EUC_JP: 0xA1C1
Otherwise known as the WAVE DASH character.

The confusion stems from a very similar character 0xFF5E (utf-16) or
0xEFBD9E (utf-8) the FULLWIDTH TILDE.
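
To make the byte values above concrete, here is a small Python sketch that prints both characters in each encoding. It uses Python's built-in `shift_jis` and `euc_jp` codecs purely as a stand-in. That choice is an assumption for illustration; these are not PostgreSQL's or Java's conversion tables, and the helper name `hex_bytes` is made up for this example.

```python
# Code points of the two look-alike characters discussed above.
WAVE_DASH = "\u301c"        # U+301C WAVE DASH
FULLWIDTH_TILDE = "\uff5e"  # U+FF5E FULLWIDTH TILDE


def hex_bytes(text: str, encoding: str) -> str:
    """Return the encoded bytes of `text` as an uppercase hex string.

    Assumption: Python's bundled codecs are used here only for illustration;
    they are independent of the PostgreSQL and Java mapping tables.
    """
    return "0x" + text.encode(encoding).hex().upper()


if __name__ == "__main__":
    # WAVE DASH exists in the JIS-based encodings as well as in Unicode.
    for encoding in ("utf-16-be", "utf-8", "shift_jis", "euc_jp"):
        print(f"WAVE DASH       {encoding:10s} {hex_bytes(WAVE_DASH, encoding)}")
    # Only the Unicode forms of FULLWIDTH TILDE are printed here.
    for encoding in ("utf-16-be", "utf-8"):
        print(f"FULLWIDTH TILDE {encoding:10s} {hex_bytes(FULLWIDTH_TILDE, encoding)}")
```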

Java only recently (1.4.1) fixed its mappings so that 0x301C maps to the
correct character in both SJIS and EUC-JP. Previously (at least in 1.3.1)
it mapped the SJIS character to 0xFF5E and the EUC one to 0x301C,
causing all sorts of trouble.

PostgreSQL at least picked one of the two characters, namely 0xFF5E, so
conversions in and out of the database to/from sjis/euc seemed to be
working. The problem shows up when you try to view utf-8 from the database,
or when you read the data into java (utf-16) and try converting to euc or
sjis from there.

Anyway, I think postgresql needs to be fixed for this character. In my
opinion what needs to be done is to change the mappings...

euc-jp -> utf-8    -> euc-jp
======    ========    ======
0xA1C1 -> 0xE3809C    0xA1C1

sjis   -> utf-8    -> sjis
======    ========    ======
0x8160 -> 0xE3809C    0x8160
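
The round trip proposed in the two tables above can be sketched as a quick check. This is only an illustration under the assumption that Python's `euc_jp` and `shift_jis` codecs follow the same JIS X 0208 mapping being proposed. The function name `round_trip` is hypothetical and nothing here touches PostgreSQL's own tables.

```python
from typing import Tuple


def round_trip(raw: bytes, legacy_encoding: str) -> Tuple[bytes, bytes]:
    """Convert legacy bytes to UTF-8 and back again.

    Returns (utf8_bytes, legacy_bytes_after_round_trip).
    Assumption: Python's codec stands in for the proposed mapping.
    """
    text = raw.decode(legacy_encoding)      # legacy bytes -> code points
    utf8 = text.encode("utf-8")             # code points  -> UTF-8
    back = utf8.decode("utf-8").encode(legacy_encoding)  # and back again
    return utf8, back


if __name__ == "__main__":
    for raw, encoding in ((b"\xa1\xc1", "euc_jp"), (b"\x81\x60", "shift_jis")):
        utf8, back = round_trip(raw, encoding)
        print(encoding, raw.hex(), "->", utf8.hex(), "->", back.hex())
```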

As to what to do with the current mapping of 0xEFBD9E (utf-8)? It
probably should be removed. Maybe you could keep the mapping back to the
sjis/euc characters to help backward compatibility though. I'm not sure
what is the correct approach there.

If anyone can tell me how to edit the mappings under:
    src/backend/utils/mb/Unicode/

and rebuild postgres to use them, then I can test this out locally.

Looking forward to your replies.

Tom.

Re: Wrong charset mappings

From: Tatsuo Ishii
Date: 12 February 2003, 08:29:02

I think the problem you see is due to the mapping table changes
between 7.2 and 7.3. It seems there are more changes than just
u301c. Moreover, according to a recent discussion on the Japanese local
mailing list, 7.3's JDBC driver now relies on the encoding conversion
performed by the backend, i.e. the driver issues "set client_encoding =
'UNICODE'". This problem is very complex and I need time to find a good
solution. I don't think simply backing out the changes to the mapping
table will solve the problem.

> If anyone can tell me how to edit the mappings under:
>     src/backend/utils/mb/Unicode/
> 
> and rebuild postgres to use them, then I can test this out locally.

Just edit src/backend/utils/mb/Unicode/*.map and rebuild
PostgreSQL. You will probably want to modify utf8_to_euc_jp.map and
euc_jp_to_utf8.map.
--
Tatsuo Ishii

Re: Wrong charset mappings

From: Thomas O'Dowd
Date: 12 February 2003, 10:14:21

Hi Ishii-san,

Thanks for the reply. Why was that particular change made between 7.2 and
7.3? It seems to have moved away from the standard. I found the
following file...

src/backend/utils/mb/Unicode/UCS_to_EUC_JP.pl

It generates the mappings, and it references three files from unicode.org,
namely:

http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0201.TXT
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0208.TXT
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0212.TXT

The JIS0208.TXT has the line...

0x8160 0x2141 0x301C # WAVE DASH

The 1st column is sjis, the 2nd is the EUC code minus 0x8080, and the 3rd
is utf16.
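
As a worked example of that column layout, the sketch below parses the quoted line and recovers the EUC-JP value by adding 0x8080 to the second column. The function name `parse_jis0208_line` and the dictionary keys are hypothetical, and this is not how UCS_to_EUC_JP.pl itself is written.

```python
def parse_jis0208_line(line: str) -> dict:
    """Parse one data line of JIS0208.TXT into integer code values.

    Column layout as described above: SJIS, EUC minus 0x8080, UTF-16.
    """
    data = line.split("#", 1)[0]             # drop the trailing comment
    sjis, jis, ucs = (int(field, 16) for field in data.split())
    return {
        "sjis": sjis,
        "euc_jp": jis + 0x8080,              # set the high bit of both bytes
        "unicode": ucs,
    }


if __name__ == "__main__":
    entry = parse_jis0208_line("0x8160 0x2141 0x301C # WAVE DASH")
    print({name: hex(value) for name, value in entry.items()})
```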

Incidentally, those mapping files are marked obsolete, but I guess the old
mappings still hold.

I guess if I run the perl script it will generate a mapping file
different from what PostgreSQL is currently using. It might be interesting
to pull out the diffs and see what's right/wrong. I guess it's not run
anymore?

I can't see how the change will affect the JDBC driver. It should only
improve the situation. Right now it's not possible to go from sjis ->
database (utf8) -> java (jdbc/utf16) -> sjis for the WAVE DASH character
because the mapping is wrong in PostgreSQL. I'll cc the JDBC list and
maybe we'll find out if it's a real problem to change the mapping.
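
The failing path described above can be simulated roughly as follows. This is a sketch under loud assumptions: Python's `cp932` codec stands in for a table that maps 0x8160 to 0xFF5E (the behaviour attributed to PostgreSQL in this thread), and Python's `shift_jis` codec stands in for a table that maps it to 0x301C (as Java 1.4.1 is said to do). Neither is the real PostgreSQL or Java table, and `store_then_fetch` is a made-up name.

```python
from typing import Optional


def store_then_fetch(raw_sjis: bytes, backend_codec: str,
                     client_codec: str) -> Optional[bytes]:
    """Simulate sjis -> database (utf8) -> application string -> sjis.

    Returns the final SJIS bytes, or None if the last step cannot encode
    the character. Both codec arguments are stand-ins (assumptions).
    """
    # Stage 1: the backend converts incoming SJIS bytes to UTF-8 for storage.
    stored_utf8 = raw_sjis.decode(backend_codec).encode("utf-8")
    # Stage 2: the driver turns the stored UTF-8 into an in-memory string.
    text = stored_utf8.decode("utf-8")
    # Stage 3: the application converts back to SJIS with its own table.
    try:
        return text.encode(client_codec)
    except UnicodeEncodeError:
        return None


if __name__ == "__main__":
    wave_dash_sjis = b"\x81\x60"
    # Both sides agree on U+301C: the character survives the trip.
    print(store_then_fetch(wave_dash_sjis, "shift_jis", "shift_jis"))
    # The backend picks U+FF5E while the client expects U+301C.
    print(store_then_fetch(wave_dash_sjis, "cp932", "shift_jis"))
```

When both stages use the same table the bytes come back unchanged. With mismatched tables the last step is expected either to fail or to produce something other than what went in, which is worth verifying locally rather than taking from this sketch.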

Changing the mapping is, I think, the correct thing to do, judging from
what I see in different tools like iconv, java 1.4.1, a utf-8 terminal and
any unicode reference on the web.

What do you think?

Tom.

--
Thomas O'Dowd <tom@nooper.com>
Nooper.com Mobile Services Inc

Re: [JDBC] Wrong charset mappings

From: Barry Lind
Date: 12 February 2003, 12:54:09

I don't see any jdbc specific requirements here, other than the fact
that jdbc assumes that the following conversions are done correctly:

dbcharset <-> utf8 <-> java/utf16

where the dbcharset to/from utf8 conversion is done by the backend and
the utf8 to/from java/utf16 is done in the jdbc driver.
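
The driver-side half of that chain is a fixed transformation with no mapping table involved, so whichever code point the backend chose is exactly what Java ends up seeing. A minimal Python sketch of that step follows; the function name `utf8_to_utf16_units` is hypothetical and this is not the JDBC driver's code.

```python
from typing import List


def utf8_to_utf16_units(utf8: bytes) -> List[int]:
    """Decode UTF-8 bytes and return the resulting UTF-16 code units."""
    text = utf8.decode("utf-8")
    raw = text.encode("utf-16-be")           # big-endian, no byte order mark
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


if __name__ == "__main__":
    # WAVE DASH and FULLWIDTH TILDE each pass through unchanged.
    print([hex(unit) for unit in utf8_to_utf16_units(b"\xe3\x80\x9c")])
    print([hex(unit) for unit in utf8_to_utf16_units(b"\xef\xbd\x9e")])
```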

Prior to 7.3 the jdbc driver did the entire conversion itself.  However,
versions of the jdk prior to 1.4 perform that conversion very slowly.
So for a significant speedup in 7.3 we moved most of the work to the
backend.

thanks,
--Barry

