Unicode combining characters - Mailing list pgsql-hackers
From | Patrice Hédé |
---|---|
Subject | Unicode combining characters |
Date | |
Msg-id | 20010924235204.L1316@idf.net Whole thread Raw |
Responses |
Re: Unicode combining characters
Re: Unicode combining characters Re: Unicode combining characters |
List | pgsql-hackers |
Hi all, while working on a new project involving PostgreSQL and making some tests, I have come up with the following output from psql : lang | length | length | text | text ------+--------+--------+-----------+-----------isl | 7 | 6 | álíta | áleitisl | 7 | 7 | álíta | álitumisl | 7 | 7 | álíta | álitiðisl | 5 | 4 | maður | mannisl | 5 | 7 | maður | mönnumisl | 5 | 5 | maður | mannaisl | 5 | 4 | óska | -aði [the misalignment is what I got, it's not a copy-paste error] This is pasted from a UTF-8 xterm running psql under a UTF-8 locale, querying a database created with -E UNICODE (by the way, these are icelandic words :) ). What you see above is misleading, since it's not possible to see that 'á', 'í', 'ó' and 'ö' are using combining marks, while 'ð' is not. As a reminder, a combining mark in Unicode is that á is actually encoded as a + ' (where ' is the acute combining mark). Encoded in UTF-8, it's then <61 cc 81> [UTF16: 0061 0301], instead of <c3 a1> [UTF16: 00E1]. The "length" fields are what is returned by length(a.text) and length(b.text). So, this shows two problems : - length() on the server side doesn't handle correctly Unicode [I have the same result with char_length()], and returns thenumber of chars (as it is however advertised to do), rather the length of the string. - the psql frontend makes the same mistake. I am using version 7.1.3 (debian sid), so it may have been corrected in the meantime (in this case, I apologise, but I have only recently started again to use PostgreSQL and I haven't followed -hackers long enough). => I think fixing psql shouldn't be too complicated, as the glibc should be providing the locale, and return the right values (is this the case ? and what happens for combined latin + chinese characters for example ? I'll have to try that later). If it's not fixed already, do you want me to look at this ? [it will take some time, as I haven't set up any development environment for postgres yet, and I'm away for one week from thursday]. => regarding the backend, it may be more complex, as the underlaying system may not provide any UTF-8 locale to use (!= from being UTF-8 aware : an administrator may have decided that UTF-8 locales are useless on a server, as only root connections are made, and he wants only the C locale on the console - I've seen that quite often ;) ). This brings me to another subject : I will need to support the full Unicode collation algorithm (UCA, as described in TR#10 [1] of the Unicode consortium), and I will need to be able to sort according to locales which may not be installed on the backend server (some of which may not even be recognised by GNU libc, which supports already more than 140 locales -- artificial languages would be an example). I will also need to be able to normalise the unicode strings (TR#15 [2]) so that I don't have some characters in legacy codepoints [as 00E1 above], and others with combining marks. There is today an implementation in perl of the needed functionality, in Unicode::Collate and Unicode::Normalize (which I haven't tried yet :( ). But as they are Perl modules, the untrusted version of perl, plperlu, will be needed, and it's a pity for what I consider a core functionality in the future (not that plperlu isn't a good thing - I can't wait for it ! - but that an untrusted pl language is needed to support normalisation and collation). Note also that there are a lot of data associated with these algorithms, as you could expect. I was wondering if some people have already thought about this, or already done something, or if some of you are interested in this. If nobody does anything, I'll do something eventually, probably before Christmas (I don't have much time for this, and I don't need the functionality right now), but if there is an interest, I could team with others and develop it faster :) Anyway, I'm open to suggestions : - implement it in C, in the core, - implement it in C, as contributed custom functions, - implement it in perl (by reusing Unicode:: work), in a trusted plperl, - implement it in perl, calling Unicode:: modules, in an untrusted plperl. and then : - provide the data in tables (system and/or user) - which should be available across databases, - load the data from the original text files provided in Unicode (and other as needed), if the functionality is compiledinto the server. - I believe the basic unicode information should be standard, and the locales should be provided as contrib/ files to beplugged in as needed. I can't really accept a solution which would rely on the underlaying libc, as it may not provide the necessary locales (or maybe, then, have a way to override the collating tables by user tables - actually, this would be certainly the best solution if it's in the core, as the tables will put an extra burden on the distribution and the installation footprint, especially if the tables are already there, for glibc, for perl5.6+, for other software dealing with Unicode). The main functions I foresee are : - provide a normalisation function to all 4 forms, - provide a collation_key(text, language) function, as the calculation of the key may be expensive, some may want to indexon the result (I would :) ), - provide a collation algorithm, using the two previous facilities, which can do primary to tertiary collation (cf TR#10for a detailed explanation). I haven't looked at PostgreSQL code yet (shame !), so I may be completely off-track, in which case I'll retract myself and won't bother you again (on that subject, that is ;) )... Comments ? Patrice. [1] http://www.unicode.org/unicode/reports/tr10/ [2] http://www.unicode.org/unicode/reports/tr15/ -- Patrice HÉDÉ ------------------------------- patrice à islande.org ----- -- Isn't it weird how scientists can imagine all the matter of the universe exploding out of a dot smaller than the head of a pin, but they can't come up with a more evocative name for it than "The Big Bang" ? -- What would _you_ call the creation of the universe? -- "The HORRENDOUS SPACE KABLOOIE !" - Calvin and Hobbes ------------------------------------------ http://www.islande.org/ -----
pgsql-hackers by date: