Thread: unicode and sorting(at least)
This is what has to be eventually done:(as sybase, and probably others do it) http://www.ianywhere.com/whitepapers/unicode.html I'm not sure how that will affect LIKE and REGEX.
On Wed, 23 Jun 2004, Dennis Gearon wrote: > This is what has to be eventually done:(as sybase, and probably others do it) > > http://www.ianywhere.com/whitepapers/unicode.html Actually, what probably has to be eventually done is what's in the SQL spec. Which is AFAICS basically: Allow multiple encodings Allow multiple character sets (within an encoding) Allow one or more collations per character set Allow columns to specify character set and collation Allow literals in multiple character sets Allow translations and encoding conversions (as supported) Allow explicit COLLATE clauses to control ordering and comparisons. Handle identifiers in multiple character sets plus some misc things like allowing sets that control the default character set for literals for this session and such.
> On Wed, 23 Jun 2004, Dennis Gearon wrote: > > > This is what has to be eventually done:(as sybase, and probably others do it) > > > > http://www.ianywhere.com/whitepapers/unicode.html > > Actually, what probably has to be eventually done is what's in the SQL > spec. > > Which is AFAICS basically: > Allow multiple encodings > Allow multiple character sets (within an encoding) Could Please explain more details for above. In my understanding a character set can have multiple encodings but... -- Tatsuo Ishii > Allow one or more collations per character set > Allow columns to specify character set and collation > Allow literals in multiple character sets > Allow translations and encoding conversions (as supported) > Allow explicit COLLATE clauses to control ordering and comparisons. > Handle identifiers in multiple character sets > > plus some misc things like allowing sets that control the default > character set for literals for this session and such. > > > ---------------------------(end of broadcast)--------------------------- > TIP 9: the planner will ignore your desire to choose an index scan if your > joining column's datatypes do not match >
All of the ISO 8xxx encodings and LATINX encodings can handle two langauges, English and at least one other. Sometimes theycan handle several langauges besides English, and are actually designed to handle a family of langauges. The ONLY encodings that can handle a significant amount of multiple langauges and character sets are the ISO/UTF/UCS series.(UCS is giving way to UTF). In fact they can handle every human langauge ever used, plus some esoteric ones postulated,and there is room for future languages. So, for a column to handle multiple langauges/character sets, the languages/character sets have to be in the family thatthe database's encoding was defined for(in postgres currently, choosing encoding down to the column level is availableon several databases and is the SQL spec), OR, the encoding for the database has to be UTF8 (since we don't haveUTF16 or UTF32 available) Right now, the SORTING algorithm and functionality is fixed for the database cluster, which contains databases of any kindof encodings. It really does not do much good to have a different locale than the encoding, except for UTF8, which asan encoding is langauge/character set neutral, or SQL_ASCII and an ISO8xxx or LatinX encoding. Since a running instanceof Postgres can only be connected to one cluster, a database engine has FIXED sorting, no matter what language/characterset encoding is chosen for the database. It so happens that most non UTF encodings are designed to sort well in an extended ascii/8 bit environment, which is whatthe ISO8xxxx and LatinX encodings actually are. I'm not sure that it's perfect though. So, if SQL_ASCII is chosen forthe LOCALE, and the encoding is ISO8xxx or LATINx, it will probably sort OK. UTF8/16/32 is built the same way. However, this only applies per character, and only works painlessly on UTF32, which hasfixed width characters. UTF8/16 OTOH, have variable length characters (in multiples of 8 bits). Since SQL_ASCII sortsin a binary fashion, UTF8/16 won't sort correctly under SQL_ASCII locale, I believe. Tatsuo Ishii wrote: >>On Wed, 23 Jun 2004, Dennis Gearon wrote: >> >> >>>This is what has to be eventually done:(as sybase, and probably others do it) >>> >>> http://www.ianywhere.com/whitepapers/unicode.html >> >>Actually, what probably has to be eventually done is what's in the SQL >>spec. >> >>Which is AFAICS basically: >> Allow multiple encodings >> Allow multiple character sets (within an encoding) > > > Could Please explain more details for above. In my understanding a > character set can have multiple encodings but... > -- > Tatsuo Ishii > > >> Allow one or more collations per character set >> Allow columns to specify character set and collation >> Allow literals in multiple character sets >> Allow translations and encoding conversions (as supported) >> Allow explicit COLLATE clauses to control ordering and comparisons. >> Handle identifiers in multiple character sets >> >>plus some misc things like allowing sets that control the default >>character set for literals for this session and such. >> >> >>---------------------------(end of broadcast)--------------------------- >>TIP 9: the planner will ignore your desire to choose an index scan if your >> joining column's datatypes do not match >> > >
> All of the ISO 8xxx encodings and LATINX encodings can handle two langauges, English and at least one other. Sometimesthey can handle several langauges besides English, and are actually designed to handle a family of langauges. ISO 8xxx series are not encodings but character sets. For example, ISO-8859-1 can be expressed in 8-bit encoding form, it also can be expressed in 7-bit encoding form. This is called ISO-2022. I know that PostgreSQL treats ISO-8859-1 as an encoding but it's just a short hand for "8-bit encoded ISO-8859-1". Also, let's not mix together "languages" and "character sets". Langugaes are defined by human, not by computers. While character sets are perfectly definable by computers. More important thing is that a language can be expressed in several character sets. For example language Japanese can be expressed in EUC-JP of cousrse. It also can be expressed in ASCII by using ROMAJI script. What I want to say here is talking about "languages" is almost useless and we have to talk about character sets and encodings. > The ONLY encodings that can handle a significant amount of multiple langauges and character sets are the ISO/UTF/UCS series.(UCS is giving way to UTF). In fact they can handle every human langauge ever used, plus some esoteric ones postulated,and there is room for future languages. > > So, for a column to handle multiple langauges/character sets, the languages/character sets have to be in the family thatthe database's encoding was defined for(in postgres currently, choosing encoding down to the column level is availableon several databases and is the SQL spec), OR, the encoding for the database has to be UTF8 (since we don't haveUTF16 or UTF32 available) > > Right now, the SORTING algorithm and functionality is fixed for the database cluster, which contains databases of any kindof encodings. It really does not do much good to have a different locale than the encoding, except for UTF8, which asan encoding is langauge/character set neutral, or SQL_ASCII and an ISO8xxx or LatinX encoding. Since a running instanceof Postgres can only be connected to one cluster, a database engine has FIXED sorting, no matter what language/characterset encoding is chosen for the database. The sorting order problem is not neccessary limited to "clutser vs. locale" one. My example about ROMAJI above raises another question "How to sort ROMAJI Japanese?" If we regard it just ASCII strings, we could sort it in alphabetical order. But if we regard it as Japanaese, probably sorting in alphabetical order is not appropreate. This example shows that the sorting order should be defined by users or applications, not by systems or DBMSs. This is why the SQL standard has "COLLATION" concept IMO. > It so happens that most non UTF encodings are designed to sort well in an extended ascii/8 bit environment, which is whatthe ISO8xxxx and LatinX encodings actually are. I'm not sure that it's perfect though. So, if SQL_ASCII is chosen forthe LOCALE, and the encoding is ISO8xxx or LATINx, it will probably sort OK. > > UTF8/16/32 is built the same way. However, this only applies per character, and only works painlessly on UTF32, which hasfixed width characters. UTF8/16 OTOH, have variable length characters (in multiples of 8 bits). Since SQL_ASCII sortsin a binary fashion, UTF8/16 won't sort correctly under SQL_ASCII locale, I believe. > Tatsuo Ishii wrote: > > >>On Wed, 23 Jun 2004, Dennis Gearon wrote: > >> > >> > >>>This is what has to be eventually done:(as sybase, and probably others do it) > >>> > >>> http://www.ianywhere.com/whitepapers/unicode.html > >> > >>Actually, what probably has to be eventually done is what's in the SQL > >>spec. > >> > >>Which is AFAICS basically: > >> Allow multiple encodings > >> Allow multiple character sets (within an encoding) > > > > > > Could Please explain more details for above. In my understanding a > > character set can have multiple encodings but... > > -- > > Tatsuo Ishii > > > > > >> Allow one or more collations per character set > >> Allow columns to specify character set and collation > >> Allow literals in multiple character sets > >> Allow translations and encoding conversions (as supported) > >> Allow explicit COLLATE clauses to control ordering and comparisons. > >> Handle identifiers in multiple character sets > >> > >>plus some misc things like allowing sets that control the default > >>character set for literals for this session and such. > >> > >> > >>---------------------------(end of broadcast)--------------------------- > >>TIP 9: the planner will ignore your desire to choose an index scan if your > >> joining column's datatypes do not match > >> > > > > >
On Thu, 24 Jun 2004 09:06:44 -0700 Dennis Gearon <gearond@fireserve.net> wrote > All of the ISO 8xxx encodings and LATINX encodings can handle two > langauges, English and at least one other. Sometimes they can handle > several langauges besides English, and are actually designed to handle a > family of langauges. And that's where the confusion with locales started, I think. But it's not really true. Every encoding can handle the Latin/English based computer (C) locale plus one other. In one specific case, it's the C locale plus (almost real) English, even. > The ONLY encodings that can handle a significant amount of multiple > langauges and character sets are the ISO/UTF/UCS series. (UCS is giving > way to UTF). Providing a little emphasis, here: Unicode is a character set. There is currently one encoding defined for it, but there are several transformations of that encoding, which we refer to as UTF-n where n tells us something about the bit width that the transformation is optimized for. In theory, there could be other encodings of Unicode. (No one expects it to actually happen, but they did try to leave a door open, just in case.) > In fact they can handle every human langauge ever used, Again, adding some emphasis here, (almost) every language currently known in our modern society can be handled about as well as or better wth Unicode than we were able to handle English with just ASCII. (I have a personal interest in some of the dark corners, but that's OT.) > plus some esoteric ones postulated, and there is room for future > languages. (Can't you just feel the chills run up and down your spine? This could be a wild ride, boys! heh. A little drama for your lunch hour.) > ... > > UTF8/16/32 is built the same way. However, this only applies per > character, and only works painlessly on UTF32, which has fixed width > characters. Again, a little point of emphasis, UTF32 is fixed width on the code points, but there are still composite characters. (You only thought it was safe to go back to the beach.) > UTF8/16 OTOH, have variable length characters (in multiples > of 8 or 16 >bits). Since SQL_ASCII sorts in a binary fashion, UTF8/16 won't > sort correctly under SQL_ASCII locale, I believe. It might almost sort well enough to cause you real pain later, too. > Tatsuo Ishii wrote: > > >>On Wed, 23 Jun 2004, Dennis Gearon wrote: > >> > >> > >>>This is what has to be eventually done:(as sybase, and probably others do it) > >>> > >>> http://www.ianywhere.com/whitepapers/unicode.html > >> > >>Actually, what probably has to be eventually done is what's in the SQL > >>spec. > >> > >>Which is AFAICS basically: > >> Allow multiple encodings > >> Allow multiple character sets (within an encoding) > > > > > > Could Please explain more details for above. In my understanding a > > character set can have multiple encodings but... Well, UTF-8 was originally, IIRC, intended to be a _Universal_ly applicable transformation. (The scheme would apply as easily to the JIS character sets.) But I don't think that's what they were talking about. I think they are talking about multiple major locales (or character subsets) within Unicode. (I could be wrong, of course.) >>> ... -- Joel <rees@ddcom.co.jp>
On Fri, 25 Jun 2004 10:19:05 +0900 (JST) Tatsuo Ishii <t-ishii@sra.co.jp> wrote > > All of the ISO 8xxx encodings and LATINX encodings can handle two langauges, English and at least one other. Sometimesthey can handle several langauges besides English, and are actually designed to handle a family of langauges. > > ISO 8xxx series are not encodings but character sets. For example, > ISO-8859-1 can be expressed in 8-bit encoding form, it also can be > expressed in 7-bit encoding form. This is called ISO-2022. I know that > PostgreSQL treats ISO-8859-1 as an encoding but it's just a short hand > for "8-bit encoded ISO-8859-1". > > Also, let's not mix together "languages" and "character > sets". Langugaes are defined by human, not by computers. While > character sets are perfectly definable by computers. More important > thing is that a language can be expressed in several character > sets. For example language Japanese can be expressed in EUC-JP of > cousrse. It also can be expressed in ASCII by using ROMAJI script. (Which isn't to say that everyone will find romanized Japanese easy to read for meaning.) But we should point out that there are several variations on the romanization of Japanese (some of which are anything but regular). > What I want to say here is talking about "languages" is almost > useless and we have to talk about character sets and encodings. > > > The ONLY encodings that can handle a significant amount of multiple langauges and character sets are the ISO/UTF/UCSseries. (UCS is giving way to UTF). In fact they can handle every human langauge ever used, plus some esotericones postulated, and there is room for future languages. > > > > So, for a column to handle multiple langauges/character sets, the languages/character sets have to be in the family thatthe database's encoding was defined for(in postgres currently, choosing encoding down to the column level is availableon several databases and is the SQL spec), OR, the encoding for the database has to be UTF8 (since we don't haveUTF16 or UTF32 available) > > > > Right now, the SORTING algorithm and functionality is fixed for the database cluster, which contains databases of anykind of encodings. It really does not do much good to have a different locale than the encoding, except for UTF8, whichas an encoding is langauge/character set neutral, or SQL_ASCII and an ISO8xxx or LatinX encoding. Since a running instanceof Postgres can only be connected to one cluster, a database engine has FIXED sorting, no matter what language/characterset encoding is chosen for the database. > > The sorting order problem is not neccessary limited to "clutser > vs. locale" one. My example about ROMAJI above raises another question > "How to sort ROMAJI Japanese?" If we regard it just ASCII strings, we > could sort it in alphabetical order. But if we regard it as Japanaese, > probably sorting in alphabetical order is not appropreate. I think we should say that, while there are some contexts in which ordinary alphabetic order would be okay, there are some, for instance, in which we'd want to mirror the kana order as much as possible. (Not exactly a straightforward map-this-code-point-to-this-collation-value exercise, but should be doable.) > This > example shows that the sorting order should be defined by users or > applications, not by systems or DBMSs. This is why the SQL standard > has "COLLATION" concept IMO. > ... -- Joel <rees@ddcom.co.jp>