Thread: Unicode vs SQL_ASCII DBs
Hi

I need to store accented characters in a postgres (7.4) database, and access the data (mostly) using the postgres JDBC driver (from a web app).

Does anyone know if:

1) Is there a performance loss using (multibyte) UNICODE vs (single byte) SQL_ASCII/LATINxxx character encoding? (In terms of extra data, and searching/sorting speeds.)

2) Can SQL_ASCII be used for accented characters?

3) If I want accented characters to sort correctly, must I select UNICODE (or the appropriate ISO 8859 char set) over SQL_ASCII?

4) I'm not initially expecting arabic, chinese, cyrillic or other language types to be stored in the database. But if they were, would UNICODE be the best encoding scheme to use for future-proofing the data?

5) If I decide not to support/store non-latin languages, is it better to use a LATIN encoding over UNICODE?

6) What is MULE_INTERNAL? Does it offer performance advantages?

[This next question probably belongs on the JDBC list, but I'll ask anyway]

7) Because the database is being used to back a java web application, are there other issues that I need to be aware of? For example, do I have to convert all data received to UTF-8 before writing it into the database? And do I have to ensure that the response (from the webserver) content-type is always set to UTF-8 for it to be rendered correctly in a user's browser?

Thanks for any help/advice.

John Sidney-Woollett

ps I did some tests between two databases; one created using UNICODE encoding, and the other using SQL_ASCII encoding. The initdb command specified no encoding, so I guess that SQL_ASCII is the default encoding. The results are below.

I created the following table in both databases:

         Table "public.table1"
 Column |          Type          | Modifiers
--------+------------------------+-----------
 id     | integer                | not null
 data   | character varying(255) |

Here are the databases; one with UNICODE encoding, and the other with SQL_ASCII encoding:

        List of databases
     Name     |  Owner   | Encoding
--------------+----------+-----------
 johntest     | postgres | UNICODE
 johntest2    | postgres | SQL_ASCII
 template0    | postgres | SQL_ASCII
 template1    | postgres | SQL_ASCII

I inserted the data into both databases in psql (first setting the client encoding to LATIN1). Here is the data (the funny character is the word "test" with the e replaced by an e-acute, character 233):

set client_encoding to 'LATIN1';
insert into table1 values (1, 'tést');
insert into table1 values (2, 'tast');
insert into table1 values (3, 'tost');
insert into table1 values (4, 'test');

Now select the data ordered (UNICODE DB):

select * from table1 order by data;
 id | data
----+------
  2 | tast
  4 | test
  1 | tést
  3 | tost

(SQL_ASCII DB):

select * from table1 order by data;
 id | data
----+------
  2 | tast
  4 | test
  3 | tost
  1 | tést

NOTE: Bad ordering using SQL_ASCII...

Then I did some other tests: I set the client encoding to UNICODE, and retrieved the data from both databases:

set client_encoding to 'UNICODE';

(SQL_ASCII DB):

select * from table1 order by data;
 id | data
----+------
  2 | tast
  4 | test
  3 | tost
  1 | tst

NOTE: You can see that the e-acute has been "lost"...

(UNICODE DB):

select * from table1 order by data;
 id | data
----+------
  2 | tast
  4 | test
  1 | tést
  3 | tost
On Sat, 31 Jan 2004, John Sidney-Woollett wrote:

> Hi
>
> I need to store accented characters in a postgres (7.4) database, and
> access the data (mostly) using the postgres JDBC driver (from a web app).
>
> Does anyone know if:
>
> 2) Can SQL_ASCII be used for accented characters?

Not with the JDBC driver. A client which is blissfully unaware of encoding issues can pass data into and out of an ascii db without knowing what the encoding is, but java must know.

> 3) If I want accented characters to sort correctly, must I select UNICODE
> (or the appropriate ISO 8859 char set) over SQL_ASCII?

You are confusing encoding with locale. The locale determines the correct sort order, and you must choose an encoding that works with your locale.

> 4) I'm not initially expecting arabic, chinese, cyrillic or other language
> types to be stored in the database. But if they were, would UNICODE be the
> best encoding scheme to use for future-proofing the data?

Yes.

> 7) Because the database is being used to back a java web application,
> are there other issues that I need to be aware of? For example, do I have
> to convert all data received to UTF-8 before writing it into the database?
> And do I have to ensure that the response (from the webserver)
> content-type is always set to UTF-8 for it to be rendered correctly in a
> user's browser?

The JDBC driver will correctly handle conversions between the database encoding and the encoding the JVM runs under. Receiving data from a web application is a little different, because you must convert data from the client's encoding to the JVM's encoding for this to work. The simplest way to do this is just to make sure that you are using unicode in all three places (server, JVM, and client).

Other things to note:

LOWER()/UPPER() only work correctly in a single byte encoding (not unicode).

If using binary data (bytea) via JDBC you may need to use a unicode db. I don't know if this has been fixed, but the server would attempt to do an encoding conversion on the binary data:

http://archives.postgresql.org/pgsql-jdbc/2004-01/msg00045.php

Kris Jurka
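(A minimal sketch of the lower()/upper() limitation above. The query is made up for illustration, and the outputs shown are assumptions based on the byte-by-byte case folding described in the thread, not verified results.)

set client_encoding to 'LATIN1';
select lower('TÉST');
-- LATIN1-encoded database: 'tést'
-- UNICODE-encoded database: the ASCII letters fold, but the multibyte
-- É is typically left unchanged (or mangled), giving e.g. 'tÉst'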
On Jan 31, 2004, at 7:32 AM, John Sidney-Woollett wrote:

> 4) I'm not initially expecting arabic, chinese, cyrillic or other
> language types to be stored in the database. But if they were, would
> UNICODE be the best encoding scheme to use for future-proofing the data?

If there is a remote chance that you might ever need CJK support you should start now with unicode encoding. Switching the encoding of your database is a major, major pain. I have a SQL_ASCII database that I wish were UNICODE, but the conversion is such an ugly undertaking that I haven't even tried.

culley
Kris, thanks for your feedback. Can you give me any further info on the questions below?

Kris Jurka said:
>> 3) If I want accented characters to sort correctly, must I select
>> UNICODE (or the appropriate ISO 8859 char set) over SQL_ASCII?
>
> You are confusing encoding with locale. The locale determines the
> correct sort order, and you must choose an encoding that works with
> your locale.

Except that in my test, the two differently encoded databases were in the same 7.4.1 cluster with the same locale, yet they sorted the *same* data differently - implying the encoding is a factor. Any idea why that would be?

Here is the output from pg_controldata:

pg_control version number:            72
Catalog version number:               200310211
Database cluster state:               in production
pg_control last modified:             Mon 02 Feb 2004 11:21:29 GMT
Current log file ID:                  0
Next log file segment:                2
Latest checkpoint location:           0/124B958
Prior checkpoint location:            0/1149DFC
Latest checkpoint's REDO location:    0/124B958
Latest checkpoint's UNDO location:    0/0
Latest checkpoint's StartUpID:        16
Latest checkpoint's NextXID:          527327
Latest checkpoint's NextOID:          26472
Time of latest checkpoint:            Mon 02 Feb 2004 11:21:27 GMT
Database block size:                  8192
Blocks per segment of large relation: 131072
Maximum length of identifiers:        64
Maximum number of function arguments: 32
Date/time type storage:               floating-point numbers
Maximum length of locale name:        128
LC_COLLATE:                           en_GB.UTF-8
LC_CTYPE:                             en_GB.UTF-8

and

     Name      |  Owner   | Encoding
---------------+----------+-----------
 johntest      | postgres | UNICODE
 johntest2     | postgres | SQL_ASCII
 template0     | postgres | SQL_ASCII
 template1     | postgres | SQL_ASCII

> Other things to note:
>
> LOWER()/UPPER() only work correctly in a single byte encoding (not
> unicode)

Are there any other gotchas that I need to be aware of with a UNICODE encoded database? I saw mention by Tom Lane of a bug:

[QUOTE]
The bug turns out not to be Fedora-specific at all. I believe it will happen on any platform if you are using both a multibyte database encoding (such as Unicode) *and* a non-C locale. PG 7.4 has a more restricted form of the bug --- it's not locale specific but does still require a multibyte encoding.
[END QUOTE]

I basically need "english" sorting, and accented character support without any JDBC access/conversion problems. Do you think that my current DB locale (en_GB.UTF-8) and a UNICODE encoded database are the best solution? Or can you suggest something better?

Thanks

John Sidney-Woollett
On Mon, 2 Feb 2004, John Sidney-Woollett wrote:

> Kris, thanks for your feedback. Can you give me any further info on the
> questions below?
>
> Kris Jurka said:
> >> 3) If I want accented characters to sort correctly, must I select
> >> UNICODE (or the appropriate ISO 8859 char set) over SQL_ASCII?
> >
> > You are confusing encoding with locale. The locale determines the
> > correct sort order, and you must choose an encoding that works with
> > your locale.
>
> Except that in my test, the two differently encoded databases were in the
> same 7.4.1 cluster with the same locale, yet they sorted the *same* data
> differently - implying the encoding is a factor.

Right, note the "and you must choose an encoding that works with your locale" clause. A SQL_ASCII encoding and a UTF-8 locale don't work together.

> I basically need "english" sorting, and accented character support
> without any JDBC access/conversion problems. Do you think that my
> current DB locale (en_GB.UTF-8) and a UNICODE encoded database are the
> best solution? Or can you suggest something better?

If you need "english" sorting like "en_GB" then that is the best option, but if you just need regular sorting the C locale might be better. It is sometimes confusing how en_US (I assume GB is similar) sorts strings with spaces and punctuation and so on.

Kris Jurka
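(A small sketch of the sorting difference Kris mentions. The table name and data are made up; and since the locale is fixed per cluster at initdb time in 7.4, seeing both orderings requires two clusters.)

create table sorttest (data varchar(255));
insert into sorttest values ('carpet');
insert into sorttest values ('car park');
insert into sorttest values ('car-pool');
insert into sorttest values ('Card');

select data from sorttest order by data;
-- C locale (strict byte order, so all uppercase sorts before lowercase):
--   Card, car park, car-pool, carpet
-- en_GB/en_US dictionary order (spaces and punctuation largely ignored;
-- the exact order depends on the platform's locale tables):
--   Card, car park, carpet, car-pool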
Kris Jurka said:
> If you need "english" sorting like "en_GB" then that is the best option,
> but if you just need regular sorting the C locale might be better. It is
> sometimes confusing how en_US (I assume GB is similar) sorts strings with
> spaces and punctuation and so on.

If I switch from the "en_GB" locale to the "C" locale (by recreating the cluster using "initdb --no-locale"), will I still get accented characters sorted correctly, ie tast, test, tést, tost (if the DB encoding is UNICODE)?

From what I've read, the "C" locale will give me better performance and optimization for certain functions/expressions (eg "like"). Do you think that the "C" locale (which affects the LC_CTYPE setting) will allow me to assume that searching and sorting will operate as I hope/expect, and that all my existing functions etc will work as expected?

> LOWER()/UPPER() only work correctly in a single byte encoding (not
> unicode)

In one pl/pgsql function, I need the ability to lowercase (LOWER) the three character file extension of a filename stored in the database. Will the LOWER call do nothing with a char/varchar value in a unicode database? If so, is there some kind of workaround? I "know" the file extensions that I'm interested in - could I replace the call to LOWER with a *huge* if/then/elsif block of code which does a string match to work out the replacement extension (in lowercase)?

Thanks again for your patience and help

John Sidney-Woollett

ps I will do some more testing, but with this topic I'm groping around in the dark...
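(Since three-character file extensions are plain ASCII, one workaround is to skip lower() entirely and use translate(), which substitutes characters one-for-one and so is unaffected by the multibyte issue. A sketch, with a made-up function name, assuming 7.4 behaves as described above:)

create or replace function lower_ascii(varchar) returns varchar as '
    -- translate() replaces each character found in the second argument
    -- with the character at the same position in the third, so plain
    -- ASCII letters are lowercased regardless of the database encoding.
    select translate($1, ''ABCDEFGHIJKLMNOPQRSTUVWXYZ'',
                         ''abcdefghijklmnopqrstuvwxyz'')::varchar
' language sql;

select lower_ascii('JPG');  -- returns 'jpg'

This also avoids maintaining a giant if/elsif list, and it keeps working if a new extension turns up.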
Kris Jurka <books@ejurka.com> writes:
> On Mon, 2 Feb 2004, John Sidney-Woollett wrote:
>> Except that in my test, the two differently encoded databases were in the
>> same 7.4.1 cluster with the same locale, yet they sorted the *same* data
>> differently - implying the encoding is a factor.

> Right, note the "and you must choose an encoding that works with your
> locale" clause. A SQL_ASCII encoding and a UTF-8 locale don't work together.

In practice, any given locale setting assumes a particular encoding and will not work if some other encoding is used. For instance, on recent Red Hat releases:

$ locale -a | grep ^de_DE
de_DE
de_DE.iso88591
de_DE.iso885915@euro
de_DE.utf8
de_DE.utf8@euro
de_DE@euro

I'm not too sure which encoding "de_DE" uses, but the other two are clearly named to reflect their expected encoding.

It is really a bug that PG allows you to select incompatible locale and encoding settings. We'd fix it if we could figure out a portable way of determining which encoding a locale expects --- unfortunately the standard APIs for libc omit this information ...

regards, tom lane
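(Given that PG won't stop you from mismatching the two, the safe habit is to state the encoding explicitly when creating each database so that it matches the cluster locale. A sketch for the en_GB.UTF-8 cluster discussed earlier in the thread:)

create database johntest with encoding = 'UNICODE';
-- verify the pairing afterwards:
select datname, pg_encoding_to_char(encoding) from pg_database;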