Thread: Unicode vs SQL_ASCII DBs
Hi

I need to store accented characters in a postgres (7.4) database, and access the data (mostly) using the postgres JDBC driver (from a web app).

Does anyone know if:

1) Is there a performance loss using (multibyte) UNICODE vs (single byte) SQL_ASCII/LATINxxx character encoding? (In terms of extra data, and searching/sorting speeds.)

2) Can SQL_ASCII be used for accented characters?

3) If I want accented characters to sort correctly, must I select UNICODE (or the appropriate ISO 8859 char set) over SQL_ASCII?

4) I'm not initially expecting arabic, chinese, cyrillic or other language types to be stored in the database. But if they were, would UNICODE be the best encoding scheme to use for future-proofing the data?

5) If I decide not to support/store non-latin languages, is it better to use a LATIN encoding over UNICODE?

6) What is MULE_INTERNAL? Does it offer performance advantages?

[This next question probably belongs on the JDBC list, but I'll ask anyway]

7) Because the database is being used to back a java web application, are there other issues that I need to be aware of? For example, do I have to convert all data received to UTF-8 before writing it into the database? And do I have to ensure that the response (from the webserver) content-type is always set to UTF-8 for it to be rendered correctly in a user's browser?

Thanks for any help/advice.

John Sidney-Woollett

ps I did some tests between two databases; one created using UNICODE encoding, and the other using SQL_ASCII encoding. The initdb command specified no encoding, so I guess that SQL_ASCII is the default encoding. The results are below.

I created the following table in both databases:

         Table "public.table1"
 Column |          Type          | Modifiers
--------+------------------------+-----------
 id     | integer                | not null
 data   | character varying(255) |

Here are the databases; one with UNICODE encoding, and the other with SQL_ASCII encoding:

        List of databases
     Name     |  Owner   | Encoding
--------------+----------+-----------
 johntest     | postgres | UNICODE
 johntest2    | postgres | SQL_ASCII
 template0    | postgres | SQL_ASCII
 template1    | postgres | SQL_ASCII

I inserted the data into both databases in psql (first setting the client encoding to LATIN1). Here is the data (the funny character is the word "test" with the e replaced by an e-acute, character 233):

set client_encoding to 'LATIN1';
insert into table1 values (1, 'tést');
insert into table1 values (2, 'tast');
insert into table1 values (3, 'tost');
insert into table1 values (4, 'test');

Now select the data ordered (UNICODE DB):

select * from table1 order by data;
 id | data
----+------
  2 | tast
  4 | test
  1 | tést
  3 | tost

(SQL_ASCII DB):

select * from table1 order by data;
 id | data
----+------
  2 | tast
  4 | test
  3 | tost
  1 | tést

NOTE: Bad ordering using SQL_ASCII...

Then I did some other tests: I set the client encoding to UNICODE, and retrieved the data from both databases:

set client_encoding to 'UNICODE';

(SQL_ASCII DB):

select * from table1 order by data;
 id | data
----+------
  2 | tast
  4 | test
  3 | tost
  1 | tst

NOTE: You can see that the e-acute has been "lost"...

(UNICODE DB):

select * from table1 order by data;
 id | data
----+------
  2 | tast
  4 | test
  1 | tést
  3 | tost
On Sat, 31 Jan 2004, John Sidney-Woollett wrote:

> Hi
>
> I need to store accented characters in a postgres (7.4) database, and
> access the data (mostly) using the postgres JDBC driver (from a web app).
>
> Does anyone know if:
>
> 2) Can SQL_ASCII be used for accented characters?

Not with the JDBC driver. A client which is blissfully unaware of encoding issues can pass data into and out of an ascii db without knowing what the encoding is, but java must know.

> 3) If I want accented characters to sort correctly, must I select UNICODE
> (or the appropriate ISO 8859 char set) over SQL_ASCII?

You are confusing encoding with locale. The locale determines the correct sort order, and you must choose an encoding that works with your locale.

> 4) I'm not initially expecting arabic, chinese, cyrillic or other language
> types to be stored in the database. But if they were, would UNICODE be the
> best encoding scheme to use for future-proofing the data?

Yes.

> 7) Because the database is being used to back a java web application,
> are there other issues that I need to be aware of? For example, do I have
> to convert all data received to UTF-8 before writing it into the database?
> And do I have to ensure that the response (from the webserver)
> content-type is always set to UTF-8 for it to be rendered correctly in a
> user's browser?

The JDBC driver will correctly handle conversions between the database encoding and the encoding the JVM runs under. Receiving data from a web application is a little different, because you must convert data from the client's encoding to the JVM's encoding for this to work. The simplest way to do this is just to make sure that you are using unicode in all three places (server, JVM, and client).

Other things to note:

LOWER()/UPPER() only work correctly in a single byte encoding (not unicode).

If using binary data (bytea) via JDBC you may need to use a unicode db. I don't know if this has been fixed, but the server would attempt to do an encoding conversion on the binary data:

http://archives.postgresql.org/pgsql-jdbc/2004-01/msg00045.php

Kris Jurka
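(A minimal sketch of the lower()/upper() limitation above. The query is made up for illustration, and the outputs shown are assumptions based on the byte-by-byte case folding described in the thread, not verified results.)

set client_encoding to 'LATIN1';
select lower('TÉST');
-- LATIN1-encoded database: 'tést'
-- UNICODE-encoded database: the ASCII letters fold, but the multibyte
-- É is typically left unchanged (or mangled), giving e.g. 'tÉst'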
On Jan 31, 2004, at 7:32 AM, John Sidney-Woollett wrote:

> 4) I'm not initially expecting arabic, chinese, cyrillic or other
> language types to be stored in the database. But if they were, would
> UNICODE be the best encoding scheme to use for future-proofing the data?

If there is a remote chance that you might ever need CJK support you should start now with unicode encoding. Switching the encoding of your database is a major, major pain. I have a SQL_ASCII database that I wish were UNICODE, but the conversion is such an ugly undertaking that I haven't even tried.

culley
Kris, thanks for your feedback. Can you give me any further info on the questions below?

Kris Jurka said:
>> 3) If I want accented characters to sort correctly, must I select
>> UNICODE (or the appropriate ISO 8859 char set) over SQL_ASCII?
>
> You are confusing encoding with locale. The locale determines the
> correct sort order, and you must choose an encoding that works with
> your locale.

Except that in my test, the two differently encoded databases were in the same 7.4.1 cluster with the same locale, yet they sorted the *same* data differently - implying the encoding is a factor. Any idea why that would be?

Here is the output from pg_controldata:

pg_control version number:            72
Catalog version number:               200310211
Database cluster state:               in production
pg_control last modified:             Mon 02 Feb 2004 11:21:29 GMT
Current log file ID:                  0
Next log file segment:                2
Latest checkpoint location:           0/124B958
Prior checkpoint location:            0/1149DFC
Latest checkpoint's REDO location:    0/124B958
Latest checkpoint's UNDO location:    0/0
Latest checkpoint's StartUpID:        16
Latest checkpoint's NextXID:          527327
Latest checkpoint's NextOID:          26472
Time of latest checkpoint:            Mon 02 Feb 2004 11:21:27 GMT
Database block size:                  8192
Blocks per segment of large relation: 131072
Maximum length of identifiers:        64
Maximum number of function arguments: 32
Date/time type storage:               floating-point numbers
Maximum length of locale name:        128
LC_COLLATE:                           en_GB.UTF-8
LC_CTYPE:                             en_GB.UTF-8

and

     Name      |  Owner   | Encoding
---------------+----------+-----------
 johntest      | postgres | UNICODE
 johntest2     | postgres | SQL_ASCII
 template0     | postgres | SQL_ASCII
 template1     | postgres | SQL_ASCII

> Other things to note:
>
> LOWER()/UPPER() only work correctly in a single byte encoding (not
> unicode)

Are there any other gotchas that I need to be aware of with a UNICODE encoded database? I saw mention by Tom Lane of a bug:

[QUOTE]
The bug turns out not to be Fedora-specific at all. I believe it will happen on any platform if you are using both a multibyte database encoding (such as Unicode) *and* a non-C locale. PG 7.4 has a more restricted form of the bug --- it's not locale specific but does still require a multibyte encoding.
[END QUOTE]

I basically need "english" sorting, and accented character support without any JDBC access/conversion problems. Do you think that my current DB locale (en_GB.UTF-8) and a UNICODE encoded database are the best solution? Or can you suggest something better?

Thanks

John Sidney-Woollett
On Mon, 2 Feb 2004, John Sidney-Woollett wrote:

> Kris, thanks for your feedback. Can you give me any further info on the
> questions below?
>
> Kris Jurka said:
> >> 3) If I want accented characters to sort correctly, must I select
> >> UNICODE (or the appropriate ISO 8859 char set) over SQL_ASCII?
> >
> > You are confusing encoding with locale. The locale determines the
> > correct sort order, and you must choose an encoding that works with
> > your locale.
>
> Except that in my test, the two differently encoded databases were in the
> same 7.4.1 cluster with the same locale, yet they sorted the *same* data
> differently - implying the encoding is a factor.

Right, note the "and you must choose an encoding that works with your locale" clause. A SQL_ASCII encoding and a UTF-8 locale don't work together.

> I basically need "english" sorting, and accented character support
> without any JDBC access/conversion problems. Do you think that my
> current DB locale (en_GB.UTF-8) and a UNICODE encoded database are the
> best solution? Or can you suggest something better?

If you need "english" sorting like "en_GB" then that is the best option, but if you just need regular sorting the C locale might be better. It is sometimes confusing how en_US (I assume GB is similar) sorts strings with spaces and punctuation and so on.

Kris Jurka
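(A small sketch of the sorting difference Kris mentions. The table name and data are made up; and since the locale is fixed per cluster at initdb time in 7.4, seeing both orderings requires two clusters.)

create table sorttest (data varchar(255));
insert into sorttest values ('carpet');
insert into sorttest values ('car park');
insert into sorttest values ('car-pool');
insert into sorttest values ('Card');

select data from sorttest order by data;
-- C locale (strict byte order, so all uppercase sorts before lowercase):
--   Card, car park, car-pool, carpet
-- en_GB/en_US dictionary order (spaces and punctuation largely ignored;
-- the exact order depends on the platform's locale tables):
--   Card, car park, carpet, car-pool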
Kris Jurka said:
> If you need "english" sorting like "en_GB" then that is the best option,
> but if you just need regular sorting the C locale might be better. It is
> sometimes confusing how en_US (I assume GB is similar) sorts strings with
> spaces and punctuation and so on.

If I switch from the "en_GB" locale to the "C" locale (by recreating the cluster using "initdb --no-locale"), will I still get accented characters sorted correctly, ie tast, test, tést, tost (if the DB encoding is UNICODE)?

From what I've read, the "C" locale will give me better performance and optimization for certain functions/expressions (eg "like"). Do you think that the "C" locale (which affects the LC_CTYPE setting) will allow me to assume that searching and sorting will operate as I hope/expect, and that all my existing functions etc will work as expected?

> LOWER()/UPPER() only work correctly in a single byte encoding (not
> unicode)

In one pl/pgsql function, I need the ability to lowercase (LOWER) the three character file extension of a filename stored in the database. Will the LOWER call do nothing with a char/varchar value in a unicode database? If so, is there some kind of workaround? I "know" the file extensions that I'm interested in - could I replace the call to LOWER with a *huge* if/then/elsif block of code which does a string match to work out the replacement extension (in lowercase)?

Thanks again for your patience and help

John Sidney-Woollett

ps I will do some more testing, but with this topic I'm groping around in the dark...
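(Since three-character file extensions are plain ASCII, one workaround is to skip lower() entirely and use translate(), which substitutes characters one-for-one and so is unaffected by the multibyte issue. A sketch, with a made-up function name, assuming 7.4 behaves as described above:)

create or replace function lower_ascii(varchar) returns varchar as '
    -- translate() replaces each character found in the second argument
    -- with the character at the same position in the third, so plain
    -- ASCII letters are lowercased regardless of the database encoding.
    select translate($1, ''ABCDEFGHIJKLMNOPQRSTUVWXYZ'',
                         ''abcdefghijklmnopqrstuvwxyz'')::varchar
' language sql;

select lower_ascii('JPG');  -- returns 'jpg'

This also avoids maintaining a giant if/elsif list, and it keeps working if a new extension turns up.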
Kris Jurka <books@ejurka.com> writes:
> On Mon, 2 Feb 2004, John Sidney-Woollett wrote:
>> Except that in my test, the two differently encoded databases were in the
>> same 7.4.1 cluster with the same locale, yet they sorted the *same* data
>> differently - implying the encoding is a factor.

> Right, note the "and you must choose an encoding that works with your
> locale" clause. A SQL_ASCII encoding and a UTF-8 locale don't work together.

In practice, any given locale setting assumes a particular encoding and will not work if some other encoding is used. For instance, on recent Red Hat releases:

$ locale -a | grep ^de_DE
de_DE
de_DE.iso88591
de_DE.iso885915@euro
de_DE.utf8
de_DE.utf8@euro
de_DE@euro

I'm not too sure which encoding "de_DE" uses, but the other two are clearly named to reflect their expected encoding.

It is really a bug that PG allows you to select incompatible locale and encoding settings. We'd fix it if we could figure out a portable way of determining which encoding a locale expects --- unfortunately the standard APIs for libc omit this information ...

regards, tom lane
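(Given that PG won't stop you from mismatching the two, the safe habit is to state the encoding explicitly when creating each database so that it matches the cluster locale. A sketch for the en_GB.UTF-8 cluster discussed earlier in the thread:)

create database johntest with encoding = 'UNICODE';
-- verify the pairing afterwards:
select datname, pg_encoding_to_char(encoding) from pg_database;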