Re: String encoding during connection "handshake" - Mailing list pgsql-hackers
From | sulfinu@gmail.com |
---|---|
Subject | Re: String encoding during connection "handshake" |
Date | |
Msg-id | 200711282017.53764.sulfinu@gmail.com |
In response to | Re: String encoding during connection "handshake" (Alvaro Herrera <alvherre@alvh.no-ip.org>) |
Responses |
Re: String encoding during connection "handshake"
Re: String encoding during connection "handshake" Re: String encoding during connection "handshake" |
List | pgsql-hackers |
On Wednesday 28 November 2007, Alvaro Herrera wrote:
> sulfinu@gmail.com wrote:
> > Martijn,
> >
> > :) don't take it personally, I am just trying to obtain confirmation that I
> > understood the problem well. After all, it's just that C has a very
> > outdated notion of "char"s (and no notion of Unicode). I was naively
> > under the impression that "char"s have evolved in nowadays C.
>
> This is not the language's fault in any way. We support plenty of
> encodings beyond UTF-8.

Yes, you support (and worry about) encodings simply because of a C limitation dating from 1974, if I recall correctly...

In Java, for example, a "char" is a very well defined datum, namely a Unicode point. While in C it can be some char or another (or an error!) depending on what encoding was used. The only definition that stands up is that a "char" is a byte. Its interpretation is unsure and unsafe (see my original problem).

On Wednesday 28 November 2007, Martijn van Oosterhout wrote:
> On Wed, Nov 28, 2007 at 05:54:05PM +0200, sulfinu@gmail.com wrote:
> > Regarding the problem of "One True Encoding", the answer seems obvious to
> > me: use only one encoding per database cluster, either UTF-8 or UTF-16 or
> > another Unicode-aware scheme, whichever yields a statistically smaller
> > database for the languages employed by the users in their data. This
> > encoding should be a one time choice! De facto, this is already happening
> > now, because one cannot change collation rules after a cluster has been
> > created.
>
> Umm, each database in a cluster can have a different encoding, so there
> is no such thing as the "cluster's encoding".

I implied that a cluster should have a single encoding that covers the whole Unicode set. That would certainly satisfy everybody.

> You can certainly argue
> that it should be a one time choice, but I doubt you'll get people to
> remove the possibilities we have now. In fact, if anything we'd probably
> go the other way, allow you to select the collation on a per
> database/table/column level (SQL compliance requires this).

The collation order is implemented in close relationship with the byte representation of strings, but conceptually depends on the locale solely and has nothing to do with the encoding.

> This has nothing to do with C by the way. C has many features that
> allow you to work with different encodings. It just doesn't force you
> to use any particular one.

Yes, my point exactly! C forces you to worry about encoding. I mean, if you're not an ASCII-only user ;)

Think of it this way: if I give you a Java String you will perfectly know what I meant; if I send you a C char* you don't know what it is in the absence of extra information - you can even use it as a uint8*, as it is actually done in md5.c.

I consider this matter closed from my point of view and I have modified the JDBC driver according to my needs. Thank you all for the help.
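The point about a raw byte sequence needing extra information can be illustrated with a small Java sketch. It is only an illustration, not code from the JDBC driver or from PostgreSQL: the class `EncodingDemo` and its `encode`/`decode` helpers are hypothetical names, and the sample text "café" is an assumption chosen because it contains one non-ASCII character. It shows that the same String produces different bytes under different encodings, and that decoding those bytes with the wrong charset silently yields different text.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; none of these names come from the thread.
final class EncodingDemo {
    // Turn text into bytes under an explicitly named charset.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // Interpret raw bytes as text; the result depends entirely on the charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "café", written with an escape to stay ASCII-safe

        byte[] utf8 = encode(text, StandardCharsets.UTF_8);        // 5 bytes
        byte[] latin1 = encode(text, StandardCharsets.ISO_8859_1); // 4 bytes
        System.out.println(utf8.length + " vs " + latin1.length);

        // Right charset: the original text comes back.
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals(text));

        // Wrong charset: no error is raised, the text is just different.
        String garbled = decode(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.equals(text)); // false
    }
}
```

In this sketch the byte array plays the role of the C char*: without knowing which charset produced it, the receiver cannot tell which of the two decodings was intended.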