Home > mailing lists
Unicode combining characters - Mailing list pgsql-hackers

From	Patrice Hédé
Subject	Unicode combining characters
Date	September 24, 2001 17:52:28
Msg-id	20010924235204.L1316@idf.net Whole thread
Responses	Re: Unicode combining characters Re: Unicode combining characters Re: Unicode combining characters
List	pgsql-hackers
Tree view
Hi all,

while working on a new project involving PostgreSQL and making some
tests, I have come up with the following output from psql :
lang | length | length |   text    |   text
------+--------+--------+-----------+-----------isl  |      7 |      6 | álíta | áleitisl  |      7 |      7 | álíta |
álitumisl |      7 |      7 | álíta | álitiðisl  |      5 |      4 | maður    | mannisl  |      5 |      7 | maður    |
mönnumisl |      5 |      5 | maður    | mannaisl  |      5 |      4 | óska    | -aði
 

[the misalignment is what I got, it's not a copy-paste error]

This is pasted from a UTF-8 xterm running psql under a UTF-8 locale,
querying a database created with -E UNICODE (by the way, these are
icelandic words :) ).

What you see above is misleading, since it's not possible to see that
'á', 'í', 'ó' and 'ö' are using combining marks, while 'ð' is not.

As a reminder, a combining mark in Unicode is that á is actually
encoded as a + ' (where ' is the acute combining mark).

Encoded in UTF-8, it's then <61 cc 81> [UTF16: 0061 0301],
instead of <c3 a1> [UTF16: 00E1].

The "length" fields are what is returned by length(a.text) and
length(b.text).

So, this shows two problems :

- length() on the server side doesn't handle correctly Unicode [I have the same result with char_length()], and returns
thenumber of chars (as it is however advertised to do), rather the length of the string.
 

- the psql frontend makes the same mistake.

I am using version 7.1.3 (debian sid), so it may have been corrected
in the meantime (in this case, I apologise, but I have only recently
started again to use PostgreSQL and I haven't followed -hackers long
enough).


=> I think fixing psql shouldn't be too complicated, as the glibc
should be providing the locale, and return the right values (is this
the case ? and what happens for combined latin + chinese characters
for example ? I'll have to try that later). If it's not fixed already,
do you want me to look at this ? [it will take some time, as I haven't
set up any development environment for postgres yet, and I'm away for
one week from thursday].

=> regarding the backend, it may be more complex, as the underlaying
system may not provide any UTF-8 locale to use (!= from being UTF-8
aware : an administrator may have decided that UTF-8 locales are
useless on a server, as only root connections are made, and he wants
only the C locale on the console - I've seen that quite often ;) ).


This brings me to another subject : I will need to support the full
Unicode collation algorithm (UCA, as described in TR#10 [1] of the
Unicode consortium), and I will need to be able to sort according to
locales which may not be installed on the backend server (some of
which may not even be recognised by GNU libc, which supports already
more than 140 locales -- artificial languages would be an example). I
will also need to be able to normalise the unicode strings (TR#15 [2])
so that I don't have some characters in legacy codepoints [as 00E1
above], and others with combining marks.

There is today an implementation in perl of the needed functionality,
in Unicode::Collate and Unicode::Normalize (which I haven't tried yet
:( ). But as they are Perl modules, the untrusted version of perl,
plperlu, will be needed, and it's a pity for what I consider a core
functionality in the future (not that plperlu isn't a good thing - I
can't wait for it ! - but that an untrusted pl language is needed to
support normalisation and collation).

Note also that there are a lot of data associated with these
algorithms, as you could expect.

I was wondering if some people have already thought about this, or
already done something, or if some of you are interested in this. If
nobody does anything, I'll do something eventually, probably before
Christmas (I don't have much time for this, and I don't need the
functionality right now), but if there is an interest, I could team
with others and develop it faster :)

Anyway, I'm open to suggestions :

- implement it in C, in the core,

- implement it in C, as contributed custom functions,

- implement it in perl (by reusing Unicode:: work), in a trusted plperl,

- implement it in perl, calling Unicode:: modules, in an untrusted plperl.

and then :

- provide the data in tables (system and/or user) - which should be available across databases,

- load the data from the original text files provided in Unicode (and other as needed), if the functionality is
compiledinto the server.
 

- I believe the basic unicode information should be standard, and the locales should be provided as contrib/ files to
beplugged in as needed.
 

I can't really accept a solution which would rely on the underlaying
libc, as it may not provide the necessary locales (or maybe, then,
have a way to override the collating tables by user tables - actually,
this would be certainly the best solution if it's in the core, as the
tables will put an extra burden on the distribution and the
installation footprint, especially if the tables are already there,
for glibc, for perl5.6+, for other software dealing with Unicode).

The main functions I foresee are :

- provide a normalisation function to all 4 forms,

- provide a collation_key(text, language) function, as the calculation of the key may be expensive, some may want to
indexon the result (I would :) ),
 

- provide a collation algorithm, using the two previous facilities, which can do primary to tertiary collation (cf
TR#10for a detailed explanation).
 

I haven't looked at PostgreSQL code yet (shame !), so I may be
completely off-track, in which case I'll retract myself and won't
bother you again (on that subject, that is ;) )...

Comments ?


Patrice.

[1] http://www.unicode.org/unicode/reports/tr10/

[2] http://www.unicode.org/unicode/reports/tr15/

-- 
Patrice HÉDÉ ------------------------------- patrice à islande.org ----- --  Isn't it weird  how scientists  can
imagine all the matter of the
 
universe exploding out of a dot smaller than the head of a pin, but they
can't come up with a more evocative name for it than "The Big Bang" ? -- What would _you_ call the creation of the
universe? -- "The HORRENDOUS SPACE KABLOOIE !"               - Calvin and Hobbes
 
------------------------------------------ http://www.islande.org/ -----
pgsql-hackers by date:
From: Jan Wieck
Date: 24 September 2001, 15:27:47
Subject: Re: CHECK problem really OK now...
From: "Gowey, Geoffrey"
Date: 24 September 2001, 20:14:21
Subject: ODBC driver flakieness
Unicode combining characters - Mailing list pgsql-hackers

Previous

Next