Re: Unicode support - Mailing list pgsql-odbc

From	Marko Ristola
Subject	Re: Unicode support
Date	September 9, 2005 13:41:05
Msg-id	4321BB0D.5040207@kolumbus.fi
In response to	Re: Unicode support ("Dave Page" <dpage@vale-housing.co.uk>)
List	pgsql-odbc

Marc Herbert wrote:

>On Thu, Sep 08, 2005 at 08:22:50PM +0300, Marko Ristola wrote:
>
>
>>Marc Herbert wrote:
>>
>>
>>
>>>Marko Ristola <Marko.Ristola@kolumbus.fi> writes:
>>>
>>>
>>>
>>>
>
>Actually my question was just: what do you mean by 'internal'?
>
>Usually 'internal' means 'in memory', and I really don't think there
>is any application/system using UTF-8 in memory, is there?
>
>
>
I'm sorry.

I hope that this time I am answering the right question.
Unfortunately, this is also a lengthy answer.

By internal Unicode I meant the wchar_t type on Linux and other Unixes,
or the Windows internal Unicode type (TCHAR, which maps to the 16-bit
WCHAR type in Unicode builds).

I tried to find out more about the wchar_t (TCHAR) implementations, that
is, the internal Unicode representations.

According to the libc info pages:

Under GNU/Linux, wchar_t is implemented as 32-bit UCS-4 characters.
My earlier assumption that Linux uses UCS-2 was wrong :(
My earlier assumption that UCS-2 is 32-bit was also wrong :(
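
The width of wchar_t can be checked directly on a given system. A minimal sketch in Python, assuming that ctypes.c_wchar mirrors the platform's C wchar_t (the variable name is only illustrative):

```python
import ctypes

# ctypes.c_wchar corresponds to the C wchar_t type of the platform
# the interpreter was built for.
width_bits = ctypes.sizeof(ctypes.c_wchar) * 8

# Typically 32 on GNU/Linux and 16 on Windows.
print(f"wchar_t is {width_bits} bits wide on this platform")
```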

Some Unix systems implement wchar_t as 16-bit UCS-2 characters.
This means that if they want to support characters beyond the 16-bit
range (up to U+10FFFF, which is less than the full 31-bit UCS-4 space),
they can do so by using pairs of reserved 16-bit values (surrogate
pairs). This is the UTF-16 format (a variable-length extension of UCS-2).
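
To illustrate how one character above U+FFFF becomes a pair of 16-bit units, here is a small sketch of the surrogate arithmetic defined for UTF-16. The function name to_surrogate_pair is made up for this example and is not part of any driver or library:

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into a (high, low) pair of 16-bit units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need a pair")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low


# U+1D11E (MUSICAL SYMBOL G CLEF) does not fit into a single 16-bit unit.
high, low = to_surrogate_pair(0x1D11E)
print(f"{high:04X} {low:04X}")  # prints "D834 DD1E"
```

Python's built-in "utf-16-be" codec produces the same two units for this character, which is a quick way to cross-check the arithmetic.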

- If I remember correctly, Java uses 16-bit Unicode, meaning that it
uses UCS-2 or UTF-16.
- According to the psqlodbc implementation, Windows uses UCS-2 as its
internal format. This also implies that Windows might actually use the
UTF-16 format internally, because UCS-2 is a subset of UTF-16.
Of course, Windows is free to create its own private standards.

The PostgreSQL ODBC driver for Windows already has UTF-8 to UCS-2
character set conversion functions.

The PostgreSQL ODBC driver for Linux still lacks UTF-8 to UCS-4
character set conversion functions.

The PostgreSQL server delivers query results as UTF-8 to the Windows
ODBC driver. The driver then converts the UTF-8 data into UCS-2. So the
psqlodbc driver uses UTF-8 internally under Windows.
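
The difference between the two conversions can be sketched in a few lines of Python. This is only an illustration and not the psqlodbc code: the function name utf8_to_wide is hypothetical, and little-endian byte order is an assumption.

```python
def utf8_to_wide(data, wchar_bits):
    """Convert UTF-8 bytes into 16-bit or 32-bit wide characters.

    Hypothetical helper for illustration; little-endian output is assumed.
    """
    text = data.decode("utf-8")
    if wchar_bits == 16:
        return text.encode("utf-16-le")   # Windows-style 16-bit units
    if wchar_bits == 32:
        return text.encode("utf-32-le")   # UCS-4, as glibc's wchar_t
    raise ValueError("wchar_bits must be 16 or 32")


# Two characters as the server would send them in UTF-8.
server_bytes = "Ä€".encode("utf-8")
wide16 = utf8_to_wide(server_bytes, 16)
wide32 = utf8_to_wide(server_bytes, 32)
print(len(server_bytes), len(wide16), len(wide32))
```

The same two characters take a different number of bytes in each format, so a driver has to size its output buffers according to the width of wchar_t on the target platform.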

UTF-8 use cases under Linux:

- OpenOffice files are stored as UTF-8.
- Emacs and many other editors store files as UTF-8.
- LATIN1 and other non-Unicode character sets still work, but are used less.
- Some programs still don't work with UTF-8.

File names and file contents are nowadays stored as UTF-8.
Many programs use UTF-8 internally rather than the wchar_t format (UCS-4).

There might be some new programs and editors (Gnome, KDE ??) that are
written from scratch and might use wchar_t internally (UCS-4). Java
programs always use UCS-2 (or UTF-16) as their internal format, so it
isn't wchar_t Unicode. The Java way is standardized.

Terminals nowadays use UTF-8. Over the network, UTF-8 works (ideally;
in reality??) from Windows to Linux, and from Mac to Linux. So a common
format is good.

So, under Linux every program may choose:
- to store file names as UTF-8 or as LATIN1. This is actually a bad
thing. The behaviour depends on which character set you selected before
logging in.
- to store files as UTF-8 or as LATIN1. This again depends on the
character set selected at login.
- to do any character conversions it likes with the libiconv library.
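
A short sketch of why mixing the two encodings for file names is a problem. It uses Python's built-in codecs in place of libiconv, and the file name is just an example:

```python
name = "päivä.txt"

# The same file name yields different bytes depending on the encoding.
latin1_bytes = name.encode("latin-1")
utf8_bytes = name.encode("utf-8")
print(len(latin1_bytes), len(utf8_bytes))  # prints "9 11"

# An iconv-style LATIN1 -> UTF-8 conversion: decode with the source
# character set, then encode with the target one.
converted = latin1_bytes.decode("latin-1").encode("utf-8")

# Reading LATIN1 bytes as if they were UTF-8 fails on the first "ä".
try:
    latin1_bytes.decode("utf-8")
    misread_failed = False
except UnicodeDecodeError:
    misread_failed = True
```

Plain ASCII names are byte-for-byte identical in both encodings, so the problem only shows up for names with accented or other non-ASCII characters.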

>
>
>
>
>>>locale
>>>
>>>
>>The reason for the popularity of UTF-8 under Linux is, that each
>>program needs to be adjusted very little to be able to move
>>from LATIN1 style encoding into UTF-8.
>>
>>
>
>Again, are you talking about memory, disk/network?
>
>This is definitely not the same thing IMHO.
>
>
>
So when you ask about memory and disk: the answer is that each
application chooses its own character set formats. Usually the
environment variables affect the selection.

As for the network character set: network interaction is standardized.
Many Unixes, Linux and Windows try to conform to these network standards.
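
As a rough sketch of how the environment variables drive that choice: POSIX locale names have the form language[_territory][.codeset][@modifier], and LC_ALL takes precedence over LC_CTYPE, which takes precedence over LANG. The helper below is hypothetical and only reads the codeset part. Real programs usually ask the C library (setlocale and nl_langinfo) instead of parsing the variables themselves.

```python
def codeset_from_env(env):
    """Return the codeset named by the locale variables, or None.

    Hypothetical helper: checks LC_ALL, then LC_CTYPE, then LANG,
    treating empty values as unset.
    """
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            locale_name = value.split("@", 1)[0]   # drop any @modifier
            if "." in locale_name:
                return locale_name.split(".", 1)[1]
            return None   # no explicit codeset in the locale name
    return None


print(codeset_from_env({"LANG": "fi_FI.UTF-8"}))  # prints "UTF-8"
```

Passing os.environ instead of a literal dictionary shows what the current login session has selected.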

Regards, Marko Ristola
