Re: Unicode support - Mailing list pgsql-odbc
From | Marko Ristola |
---|---|
Subject | Re: Unicode support |
Date | |
Msg-id | 4321BB0D.5040207@kolumbus.fi |
In response to | Re: Unicode support ("Dave Page" <dpage@vale-housing.co.uk>) |
List | pgsql-odbc |
Marc Herbert wrote:

>On Thu, Sep 08, 2005 at 08:22:50PM +0300, Marko Ristola wrote:
>
>>Marc Herbert wrote:
>>
>>>Marko Ristola <Marko.Ristola@kolumbus.fi> writes:
>>>
>
>Actually my question was just: what do you mean by 'internal'?
>
>Usually 'internal' means 'in memory', and I really don't think there
>is any application/system using UTF-8 in memory, is there?

I'm sorry. I hope that this time I answer the right question. Unfortunately this is also a lengthy answer.

By internal Unicode I meant the wchar_t type on Linux and Unix systems, or the Windows internal Unicode type (TCHAR).

I tried to find out more about the wchar_t (TCHAR) implementations, that is, the internal Unicode representations.

According to the libc info pages, under GNU/Linux wchar_t is implemented as 32-bit UCS-4 characters.

My earlier assumption that Linux uses UCS-2 is wrong :(
My earlier assumption that UCS-2 is 32-bit is wrong :(

Some Unix systems implement wchar_t as 16-bit UCS-2 characters. This means that if they want to represent characters outside the 16-bit range, they can do so by using pairs of certain reserved 16-bit values (surrogate pairs). This is the UTF-16 format (a variable-length extension of UCS-2).

- If I remember correctly, Java uses 16-bit Unicode, meaning that it uses UCS-2 or UTF-16.
- According to the psqlodbc implementation, Windows uses UCS-2 as its internal format. This also implies that Windows might actually use the UTF-16 format internally, because UCS-2 is a subset of UTF-16. Of course, Windows is capable of creating its own private standards.

The PostgreSQL ODBC driver for Windows already has the UTF-8 to UCS-2 character set conversion functions. The PostgreSQL ODBC driver for Linux still lacks UTF-8 to UCS-4 character set conversion functions.

The PostgreSQL server delivers the query results as UTF-8 to the Windows ODBC driver. The ODBC driver then converts the UTF-8 data into UCS-2. So the psqlodbc driver uses UTF-8 internally under Windows.

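
To make the difference between the 16-bit and 32-bit representations concrete, here is a minimal sketch of the two conversions discussed above: UTF-8 to 16-bit units (what a Windows driver would hand to the application) and UTF-8 to 32-bit UCS-4 values (what a glibc wchar_t holds). It is only an illustration in Python using the built-in codecs. The function names are hypothetical and this is not the psqlodbc code, which is written in C.

```python
# Illustrative sketch only: the function names are hypothetical and this is
# not the psqlodbc implementation. It relies on Python's built-in codecs.


def utf8_to_utf16_units(data):
    """Decode UTF-8 bytes and return the resulting 16-bit code units."""
    raw = data.decode("utf-8").encode("utf-16-le")
    # Split the little-endian byte string into 2-byte units.
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def utf8_to_ucs4_units(data):
    """Decode UTF-8 bytes and return one 32-bit code point per character."""
    return [ord(ch) for ch in data.decode("utf-8")]


if __name__ == "__main__":
    samples = (
        ("a with diaeresis, U+00E4", b"\xc3\xa4"),
        ("musical G clef, U+1D11E", b"\xf0\x9d\x84\x9e"),
    )
    for label, data in samples:
        print(label)
        print("  16-bit units:", [hex(u) for u in utf8_to_utf16_units(data)])
        print("  32-bit units:", [hex(u) for u in utf8_to_ucs4_units(data)])
```

A character inside the 16-bit range comes out as a single unit in both forms. A character outside it needs a pair of 16-bit units but still only one 32-bit value, which is why a conversion routine written for a 16-bit wchar_t cannot simply be reused where wchar_t is 32 bits wide.
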
UTF-8 use cases under Linux:
- OpenOffice files are stored as UTF-8.
- Emacs and many other editors store files as UTF-8.
- LATIN1 and other non-Unicode character sets still work, but are less used.
- Some programs still don't work with UTF-8.

File names and file contents are nowadays stored as UTF-8. Many programs use UTF-8 internally, not the wchar_t format (UCS-4). There might be some new programs and editors (Gnome, KDE ??) that are written from scratch and might use wchar_t internally (UCS-4).

Java programs always use UCS-2 (or UTF-16) as their internal format. So it isn't wchar_t Unicode. The Java way is standardized.

Terminals nowadays use UTF-8.

Over the network, UTF-8 works (ideally. Reality??) from Windows to Linux, and from Mac to Linux. So a common format is good.

So, under Linux every program may choose:
- to store file names as UTF-8 or as LATIN1. This is actually a bad thing. The behaviour depends on which character set you selected before logging in.
- to store files as UTF-8 or as LATIN1. This again depends on the character set selected at console login.
- to do any character conversions it likes, with the libiconv library.

>>>locale
>>>
>>The reason for the popularity of UTF-8 under Linux is, that each
>>program needs to be adjusted very little to be able to move
>>from LATIN1 style encoding into UTF-8.
>>
>
>Again, are you talking about memory, disk/network?
>
>This is definitely not the same thing IMHO.

So when you ask about memory and disk: the answer is that each application chooses its own character set formats. Usually the environment variables affect the selection.

So when you ask about the network character set: network interaction is standardized. Many Unixes, Linux and Windows try to conform to these network standards.

Regards,
Marko Ristola