Re: Radix tree for character conversion - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: Radix tree for character conversion |
Date | |
Msg-id | 08e7892a-d55c-eefe-76e6-7910bc8dd1f3@iki.fi |
In response to | Re: Radix tree for character conversion (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>) |
Responses | Re: Radix tree for character conversion |
List | pgsql-hackers |
On 10/21/2016 11:33 AM, Kyotaro HORIGUCHI wrote:
> Hello, this is new version of radix charconv.
>
> At Sat, 8 Oct 2016 00:37:28 +0300, Heikki Linnakangas <hlinnaka@iki.fi> wrote in <6d85d710-9554-a928-29ff-b2d3b80b01c9@iki.fi>
>> What I don't want is that the current *.map files are turned into the
>> authoritative source files, that we modify by hand. There are no
>> comments in them, for starters, which makes hand-editing
>> cumbersome. It seems that we have edited some of them by hand already,
>> but we should rectify that.
>
> Agreed. So, I identified source files of each character for EUC_JP
> and SJIS conversions to clarify what has been done on them.
>
> SJIS conversion is made from CP932.TXT and 8 additional
> conversions for UTF8->SJIS and none for SJIS->UTF8.
>
> EUC_JP is made from CP932.TXT and JIS0212.TXT. JIS0201.TXT and
> JIS0208.TXT are useless. It adds 83 or 86 (different by
> direction) conversion entries.
>
> http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
> http://unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0212.TXT
>
> Now the generator scripts don't use *.map as source and in turn
> generates old-style map files as well as radix tree files.
>
> For convenience, UCS_to_(SJIS|EUC_JP).pl takes parameter --flat and
> -v. The former generates the old-style flat map as well as radix
> map file and additional -v adds source description for each line
> in the flat map file.
>
> During working on this, EUC_JP map lacks some conversions but it
> is another issue.

Thanks!

I'd really like to clean up all the current perl scripts, before we start to do the radix tree stuff. I worked through the rest of the conversions, and fixed/hacked the perl scripts so that they faithfully reproduce the mapping tables that we have in the repository currently.
Whether those are the best mappings or not, or whether we should update them based on some authoritative source is another question, but let's try to nail down the process of creating the mapping tables.

Tom Lane looked into this in Nov 2015 (https://www.postgresql.org/message-id/28825.1449076551%40sss.pgh.pa.us). This is a continuation of that, to actually fix the scripts. This patch series doesn't change any of the mappings, only the way we produce the mapping tables.

Our UHC conversion tables contained a lot more characters than the CP949.TXT file they're supposedly based on. I rewrote the script to use the "windows-949-2000.xml" file, from the ICU project, as the source instead. It's a much closer match to our mapping tables, containing all but one of the additional characters. We were already using gb-18030-2000.xml as the source in UCS_GB18030.pl, so parsing ICU's XML files isn't a new thing.

The GB2312.TXT source file seems to have disappeared from the Unicode consortium's FTP site. I changed the UCS_to_EUC_CN.pl script to use gb-18030-2000.xml as the source instead. GB-18030 is an extension of GB-2312; UCS_to_EUC_CN.pl filters out the additional characters that are not in GB-2312.

This now forms a reasonable basis for switching to radix tree. Every mapping table is now generated by the print_tables() perl function in convutils.pm. To switch to a radix tree, you just need to swap that function with one that produces a radix tree instead of the current-format mapping tables.

The perl scripts are still quite messy. For example, I lost the checks for duplicate mappings somewhere along the way - that ought to be put back. My Perl skills are limited.

This is now an orthogonal discussion, and doesn't need to block the radix tree work, but we should consider what we want to base our mapping tables on. Perhaps we could use the XML files from ICU as the source for all of the mappings?
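
To make the "ICU XML as source, then filter" idea concrete, here is a rough sketch in Python (not the Perl the scripts are written in, and not taken from the patches). It assumes the XML lists one mapping per line as `<a u="HEX" b="HEX HEX ..."/>`, and it approximates "is in GB-2312" as "two bytes, both in 0xA1..0xFE". That filter is an assumption made for illustration; the actual check in UCS_to_EUC_CN.pl may well differ. The names `read_assignments`, `looks_like_euc_cn`, and `filter_euc_cn` are hypothetical.

```python
import re

# Small hand-written excerpt in the assumed style of an ICU XML mapping file.
# The real gb-18030-2000.xml is far larger; these entries only illustrate
# a one-byte, a GB-2312 two-byte, a GBK-only two-byte, and a four-byte code.
SAMPLE_XML = """\
<characterMapping id="sample" version="1">
 <assignments sub="1A">
  <a u="0041" b="41"/>
  <a u="4E00" b="D2 BB"/>
  <a u="4E02" b="81 40"/>
  <a u="0080" b="81 30 81 30"/>
 </assignments>
</characterMapping>
"""

# Assumed line format: <a u="<code point in hex>" b="<space-separated hex bytes>"/>
ASSIGNMENT_RE = re.compile(r'<a u="([0-9A-Fa-f]+)" b="([0-9A-Fa-f ]+)"')


def read_assignments(text):
    """Return {Unicode code point: encoded bytes} for every <a .../> line."""
    mapping = {}
    for line in text.splitlines():
        match = ASSIGNMENT_RE.search(line)
        if match is None:
            continue  # skip headers, comments, and anything else
        ucs = int(match.group(1), 16)
        code = bytes(int(part, 16) for part in match.group(2).split())
        mapping[ucs] = code
    return mapping


def looks_like_euc_cn(code):
    """Crude stand-in for 'is in GB-2312' (an assumption, see lead-in)."""
    return len(code) == 2 and all(0xA1 <= byte <= 0xFE for byte in code)


def filter_euc_cn(mapping):
    """Drop every mapping whose encoded form fails the EUC_CN check."""
    return {ucs: code for ucs, code in mapping.items() if looks_like_euc_cn(code)}


full = read_assignments(SAMPLE_XML)
subset = filter_euc_cn(full)
```

With the sample above, `full` holds all four entries and `subset` keeps only U+4E00. The same reader could in principle feed any of the per-encoding scripts, which is what would make a single XML-based source attractive.
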
ICU seems to use a BSD-like license, so we could even include the XML files in our repository. Actually, looking at http://www.unicode.org/copyright.html#License, I think we could include the *.TXT files in our repository, too, if we wanted to. The *.TXT files are found under www.unicode.org/Public/, so that license applies. I think that has changed somewhat recently, because the comments in our perl scripts claim that the license didn't allow that.

- Heikki