Re: 8.3 can't convert cyrillic text from 'iso-8859-5' to other cyrillic 8-bit encoding - Mailing list pgsql-bugs
From | Heikki Linnakangas |
---|---|
Subject | Re: 8.3 can't convert cyrillic text from 'iso-8859-5' to other cyrillic 8-bit encoding |
Date | |
Msg-id | 47E190C2.80504@enterprisedb.com |
In response to | Re: 8.3 can't convert cyrillic text from 'iso-8859-5' to other cyrillic 8-bit encoding ("Heikki Linnakangas" <heikki@enterprisedb.com>) |
Responses | Re: 8.3 can't convert cyrillic text from 'iso-8859-5' to other cyrillic 8-bit encoding |
List | pgsql-bugs |
Heikki Linnakangas wrote:
> Sergey Burladyan wrote:
>> src/backend/utils/mb/conversion_procs/cyrillic_and_mic/cyrillic_and_mic.c
>> does not have the cyrillic letter 'IO' in the ISO-8859-5 to mule
>> internal code translation table (function iso2mic(const unsigned char *l,
>> unsigned char *p, int len)). This is a bug, because the letter is widely
>> used; it is a main letter like A, B or C in English :) and it exists in
>> all Russian cyrillic encodings (koi8-r, iso-8859-5, windows-1251, cp866).
>> For example, in Russian, the words 'all', 'hedgehog', 'Christmas-tree'
>> and many others must be written with it.
>>
>> Here is the patch to add it to the ISO-8859-5 to mule internal code
>> translation table. I don't know whether this is OK and doesn't break any
>> internal rule or code?
>
> You'd need to modify the mic->ISO-8859-5 translation table as well, for
> converting in the other direction.

Here's a patch that does the conversion in the other direction as well.
As I'm not too familiar with cyrillic, can you double-check that this
works? I tested it using the convert() function between different
encodings, and it seems ok to me.

>> By the way, as I understand it, you are using the koi8-r encoding for
>> the internal representation of cyrillic charsets. This also has another
>> problem: the second "widely" used char is <U2116> NUMERO SIGN
>> (many accountants and managers use it :) in the cyrillic windows world),
>> and it exists in the windows-1251, cp866 and iso-8859-5 encodings, but
>> not in koi8-r...
>
> Hmm. We use KOI8-R (or rather, MULE_INTERNAL with KOI8-R) as an
> intermediate encoding, because there's no direct conversion table
> between ISO-8859-5 and the other cyrillic encodings. Ideally there would
> be. Another possibility would be to use UTF-8 as the intermediate
> encoding; that'd probably be much slower, but UTF-8 should have all the
> characters needed.
>
> Are there any other characters like "YO" that are missing, that exist in
> all the encodings? Looking at the character set table for KOI8-R, it
> looks like the "YO" is in an odd place in the table, compared to all
> other cyrillic characters. Perhaps that's why it was missed.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Index: src/backend/utils/mb/conversion_procs/cyrillic_and_mic/cyrillic_and_mic.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/mb/conversion_procs/cyrillic_and_mic/cyrillic_and_mic.c,v
retrieving revision 1.16
diff -c -r1.16 cyrillic_and_mic.c
*** src/backend/utils/mb/conversion_procs/cyrillic_and_mic/cyrillic_and_mic.c 1 Jan 2008 19:45:53 -0000 1.16
--- src/backend/utils/mb/conversion_procs/cyrillic_and_mic/cyrillic_and_mic.c 19 Mar 2008 21:04:40 -0000
***************
*** 483,489 ****
  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
! 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  0xe1, 0xe2, 0xf7, 0xe7, 0xe4, 0xe5, 0xf6, 0xfa,
  0xe9, 0xea, 0xeb, 0xec, 0xed, 0xee, 0xef, 0xf0,
--- 483,489 ----
  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
! 0x00, 0xb3, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  0xe1, 0xe2, 0xf7, 0xe7, 0xe4, 0xe5, 0xf6, 0xfa,
  0xe9, 0xea, 0xeb, 0xec, 0xed, 0xee, 0xef, 0xf0,
***************
*** 493,499 ****
  0xc9, 0xca, 0xcb, 0xcc, 0xcd, 0xce, 0xcf, 0xd0,
  0xd2, 0xd3, 0xd4, 0xd5, 0xc6, 0xc8, 0xc3, 0xde,
  0xdb, 0xdd, 0xdf, 0xd9, 0xd8, 0xdc, 0xc0, 0xd1,
! 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
  };
--- 493,499 ----
  0xc9, 0xca, 0xcb, 0xcc, 0xcd, 0xce, 0xcf, 0xd0,
  0xd2, 0xd3, 0xd4, 0xd5, 0xc6, 0xc8, 0xc3, 0xde,
  0xdb, 0xdd, 0xdf, 0xd9, 0xd8, 0xdc, 0xc0, 0xd1,
! 0x00, 0xa3, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
  };
***************
*** 509,517 ****
  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
! 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
! 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  0xee, 0xd0, 0xd1, 0xe6, 0xd4, 0xd5, 0xe4, 0xd3,
  0xe5, 0xd8, 0xd9, 0xda, 0xdb, 0xdc, 0xdd, 0xde,
--- 509,517 ----
  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0xf1, 0x00, 0x00, 0x00, 0x00,
  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
! 0x00, 0x00, 0x00, 0xa1, 0x00, 0x00, 0x00, 0x00,
  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  0xee, 0xd0, 0xd1, 0xe6, 0xd4, 0xd5, 0xe4, 0xd3,
  0xe5, 0xd8, 0xd9, 0xda, 0xdb, 0xdc, 0xdd, 0xde,