Re: Sorting Problem - Mailing list pgsql-general
From | Dennis Gearon |
---|---|
Subject | Re: Sorting Problem |
Date | |
Msg-id | 3F3A629C.5090307@cvc.net |
In response to | Re: Sorting Problem (Dennis Björklund <db@zigo.dhs.org>) |
Responses | Re: Sorting Problem |
List | pgsql-general |
Dennis Björklund wrote:

> In the future we need indexes that depend on the locale (and a lot of other changes).

I agree. I've been looking at the web on this subject a lot lately. I am **NOT** a Microsoft fan, but SQL Server even lets a user define a language (maybe the encoding) down to the column level!

I've been reading up on GNU C and on languages, encoding, and localization:

http://pauillac.inria.fr/~lang/hotlist/free/licence/fsf96/drepper/paper-1.html

http://h21007.www2.hp.com/dspp/tech/tech_TechSingleTipDetailPage_IDX/1,2366,1222,00.html

There are three basic approaches to handling different languages in computerized text:

A/ Various adaptations of the 8-bit character set, i.e. the ISO-8859-x series. One byte per character.

- Easy storing, small size for a string.
- Easy storing: for English characters, 100% efficient use of storage space.
- Easy processing between applications; works well in the stream model of *nix.
- Easy processing in applications: a byte is a character.
- Easy string handling: NO NULL bytes in a string, except at the end of the string.
- NOT easy to tell the encoding from the document itself.

This is not the way of the future.

B/ Wide characters: UTF16, UTF32, SHIFT-JIS-16, others. Each character is the same width, 2 or 4 bytes (2 bytes handles 99% of all languages; note that UTF16 needs a second 2-byte unit, a surrogate pair, for the characters outside that range).

- Not so easy storing: for English characters, 50% to 75% loss of storage space.
- Difficult processing between applications; does NOT work well in the stream model of *nix.
- Easy processing in applications: a set width of bits/bytes is a character.
- Difficult string handling: MANY NULL bytes in a string, especially if in English.
- Moderately easy to tell the encoding/language in the document.

********This should be how Postgres stores data internally.********

C/ Multibyte characters: UTF8. Variable width for different characters, 1 to 4 bytes under the current Unicode definition (the original design allowed up to 6).

- Not so easy storing: for non-English characters, 50% to 80% loss of storage space (in reality, most common Western languages hover around 5-20% loss of storage space, and most common non-Western languages hover around 40-60%).
- Easy processing between applications; works well in the stream model of *nix.
- Difficult processing in applications: a variable number of bytes is a character.
- Easy string handling: ONE NULL byte in a string.
- Moderately easy to tell the encoding/language in the document.

********This is how Postgres should default to sending data OUT of the application, i.e. to the display or the web, or other system applications.********
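To make the storage and NULL-byte claims above a bit more concrete, here is a rough sketch that encodes a few sample strings in each family and counts the bytes and the zero bytes. It assumes Python 3 and its standard codecs; the sample strings and the helper name `encoding_report` are made up for illustration and are not from this thread. It says nothing about how Postgres itself stores text.

```python
# Rough comparison of the three encoding families discussed above.
# The sample strings are illustrative assumptions, not data from the thread.
SAMPLES = {
    "English": "Sorting Problem",
    "Swedish": "Björklund",
    "Japanese": "並べ替えの問題",
}

# One 8-bit codec, two wide codecs, one multibyte codec.
# The "-le" variants are used so that no byte-order mark is added.
ENCODINGS = ["iso-8859-1", "utf-16-le", "utf-32-le", "utf-8"]


def encoding_report(text, encoding):
    """Return (byte_count, zero_byte_count) for text in the given encoding.

    Returns None when the encoding cannot represent the text at all,
    which is the usual failure mode of the 8-bit character sets.
    The counts exclude any C-style string terminator.
    """
    try:
        data = text.encode(encoding)
    except UnicodeEncodeError:
        return None
    return len(data), data.count(0)


if __name__ == "__main__":
    for label, text in SAMPLES.items():
        print(f"{label}: {len(text)} characters")
        for enc in ENCODINGS:
            report = encoding_report(text, enc)
            if report is None:
                print(f"  {enc:12} cannot represent this text")
            else:
                size, zeros = report
                print(f"  {enc:12} {size:3} bytes, {zeros:3} zero bytes")
```

For the English sample, half of the UTF-16 bytes and three quarters of the UTF-32 bytes come out as zero bytes, which is where the 50% to 75% figure for wide characters comes from, while the UTF-8 form has no zero bytes in the body of the string.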