Re: Illegal SJIS mapping - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: Illegal SJIS mapping |
Date | |
Msg-id | 9c544547-7214-aebe-9b04-57624aedde96@iki.fi |
In response to | Illegal SJIS mapping (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>) |
Responses | Re: Illegal SJIS mapping, Re: Illegal SJIS mapping |
List | pgsql-hackers |
On 09/07/2016 09:50 AM, Kyotaro HORIGUCHI wrote:
> Hi,
>
> I found a useless entry in utf8_to_sjis.map
>
>> {0xc19c, 0x815f},
>
> which is apparently illegal as UTF-8, which PostgreSQL
> deliberately refuses. So it should be removed, and the attached
> patch does that. 0x815f (SJIS) is also mapped from 0xefbcbc (U+FF3C
> FULLWIDTH REVERSE SOLIDUS), and that is the right mapping.

Yes, I think you're right. Committed, thanks!

> By the way, the file comment at the beginning of UCS_to_SJIS.pl
> is the following.
>
> # Generate UTF-8 <--> SJIS code conversion tables from
> # map files provided by Unicode organization.
> # Unfortunately it is prohibited by the organization
> # to distribute the map files. So if you try to use this script,
> # you have to obtain SHIFTJIS.TXT from
> # the organization's ftp site.
>
> The file was found at the following place thanks to google.
>
> ftp://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/
>
> As the URL shows, and as written in the file
> Public/MAPPINGS/EASTASIA/ReadMe.txt, it is already obsolete and
> the *live* definition *may* be found in the Unicode Character
> Database. But I haven't found SJIS-related information there.
>
> If I'm not missing anything, the only available authority would
> be JIS X 0208/0213, but what should be implemented seems to be
> a maybe-modified MS932, for which I don't know the authority.
>
> Anyway, I ran UCS_to_SJIS.pl with the SHIFTJIS.TXT above and I got
> quite different mapping files from the current ones.
>
> So, I wonder how the mappings related to SJIS (and/or EUC-JP) are
> maintained. If no authoritative information is available, the
> generating script is no longer usable. If any other authority is
> chosen, it is to be modified according to whatever the new
> source format is.

The script is clearly intended to read CP932.TXT, rather than
SHIFTJIS.TXT, despite the comments in it.
CP932.TXT can be found at
ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT

However, running the script with that doesn't produce exactly what we
have in utf8_to_sjis.map, either. It's otherwise the same, but we have
some extra mappings:

- {0xc2a5, 0x5c},
- {0xc2ac, 0x81ca},
- {0xe28096, 0x8161},
- {0xe280be, 0x7e},
- {0xe28892, 0x817c},
- {0xe3809c, 0x8160},

Those mappings were added in commit
a8bd7e1c6e026678019b2f25cffc0a94ce62b24b, back in 2002. The bogus
mapping for the invalid 0xc19c UTF-8 byte sequence was also added by
that commit, as well as a few valid mappings that UCS_to_SJIS.pl also
produces.

I can't judge if those mappings make sense. If we can't find an
authoritative source for them, I suggest that we leave them as they
are, but also hard-code them into UCS_to_SJIS.pl, so that running that
script produces those mappings in utf8_to_sjis.map, even though they
are not present in the CP932.TXT source file.

- Heikki
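
For readers who want to check the map entries discussed above, here is a minimal Python sketch (not part of the PostgreSQL tree, and not how UCS_to_SJIS.pl works). It assumes that the first field of each entry is the UTF-8 byte sequence packed into an integer, as the entries quoted in this thread suggest. The helper name `describe_utf8_key` is hypothetical. It decodes each key with a strict UTF-8 decoder and reports the code point, or reports that the bytes are not valid UTF-8.

```python
import unicodedata


def describe_utf8_key(key: int) -> str:
    """Describe a map key such as 0xe28096 (UTF-8 bytes packed into an int).

    Hypothetical helper for illustration only.
    """
    # Unpack the integer into its big-endian byte sequence.
    raw = key.to_bytes((key.bit_length() + 7) // 8, "big")
    try:
        text = raw.decode("utf-8")  # strict decoding rejects overlong forms
    except UnicodeDecodeError:
        return "invalid UTF-8"
    if len(text) != 1:
        return "not a single character"
    return f"U+{ord(text):04X} {unicodedata.name(text, '<unnamed>')}"


# Entries quoted in this thread: (UTF-8 key, SJIS value).
ENTRIES = [
    (0xc19c, 0x815f),    # the entry removed by the patch
    (0xefbcbc, 0x815f),  # the valid mapping to the same SJIS code
    (0xc2a5, 0x5c),
    (0xc2ac, 0x81ca),
    (0xe28096, 0x8161),
    (0xe280be, 0x7e),
    (0xe28892, 0x817c),
    (0xe3809c, 0x8160),
]

if __name__ == "__main__":
    for utf8_key, sjis in ENTRIES:
        print(f"{{0x{utf8_key:x}, 0x{sjis:x}}}  {describe_utf8_key(utf8_key)}")
```

The bytes 0xc1 0x9c would be a two-byte (overlong) encoding of U+005C REVERSE SOLIDUS, which is why a strict decoder rejects them. The six extra keys all decode to ordinary code points (for example, 0xc2a5 is U+00A5 YEN SIGN and 0xe3809c is U+301C WAVE DASH), so whether to keep them is a question about the mapping source, not about UTF-8 validity.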