Home > mailing lists

Re: a strange order by behavior - Mailing list pgsql-sql

From	Pavel Stehule
Subject	Re: a strange order by behavior
Date	June 22, 2011 05:46:31
Msg-id	BANLkTim9sFLt3e2+auF_oTr68UETiW83VA@mail.gmail.com Whole thread Raw
In response to	Re: a strange order by behavior (Samuel Gendler <sgendler@ideasculptor.com>)
Responses	Re: a strange order by behavior
List	pgsql-sql

Tree view

Hello

a equalent of C collate for UTF8 is ucs_basic

Regards

Pavel Stehule

2011/6/22 Samuel Gendler <sgendler@ideasculptor.com>:
>
>
> On Tue, Jun 21, 2011 at 3:37 AM, Eyal Wilde <eyal@impactsoft.co.il> wrote:
>>
>> the database collation is: en_US.UTF-8
>> drop table t1;
>> create table t1 (recid int ,f1 varchar(20));
>> insert into t1 values (1,'a');
>> insert into t1 values (2,' ');
>> insert into t1 values (3,'aa');
>> insert into t1 values (4,' a');
>> select * from t1 order by f1
>> result:
>> recid  f1
>> 2      " "
>> 1      "a"        -- 2 comes before 1 because space is smaller then 'a'.
>> fine.
>> 4      " a"       -- now is see that 1 comes before 4 because space is
>> greater then 'a' !?
>> 3      "aa"       -- now again, 4 comes before 3 because space is smaller
>> the 'a' !?!
>
> I seem to recall a thread here about it ignoring spaces entirely in that
> collation (and maybe ignoring capitalization, too?).  I went to go test that
> assertion by initializing a database with C collation and got some complaint
> about it being incompatible with my template1 template database.  I
> initialized a db off of template0 and then got side tracked and you've only
> just reminded me of it.  I was planning to test whether it is safe to use
> UTF-8 for encoding but use C collation, and then maybe investigate other
> collations.
> This worked:
> createdb  -E UTF-8 --lc-collate=C some_db
> so it should be easy enough to play around with it some.  I'm not sure how
> to get a list of valid collations for any given charset, and it seems like C
> collation would generate somewhat strange results with non-ascii characters
> (french accented characters are supposed to alphabetize in some unexpected
> manner, I believe), so there must be something better - closer to UTF-8
> collation but without ignoring whitespace and such.  A quick google search
> reveals that there is some kind of standard for unicode collation
> (http://www.unicode.org/reports/tr10/ ) and I have no idea if that is what
> is represented by the en_US.UTF-8 collation or not.  I've got no real
> experience with this stuff.
> It appears that there are differences regarding collation in recent versions
> - the beta docs for 9.1 show that you can set collation on individual
> operations or differently for different columns
> (http://www.postgresql.org/docs/9.1/static/collation.html ).  I see nothing
> like that in 8.4 docs.
> It definitely looks like we both need to have a read of the localization
> chapter of the docs for our database version as there is a bunch of stuff in
> there that I was surprised to read when I just did a quick scan - like using
> anything but C or posix is much slower and can produce incorrect results in
> a 'like' query
> It looks like the docs prior to 9.1beta have no real reference to collation
> at all, so it's down to trial and error unless someone in the know speaks
> up.
> --sam
>
>

pgsql-sql by date:

From: Samuel Gendler
Date: 22 June 2011, 05:43:28
Subject: Re: a strange order by behavior

From: Samuel Gendler
Date: 22 June 2011, 06:47:08
Subject: Re: a strange order by behavior

Re: a strange order by behavior - Mailing list pgsql-sql

Previous

Next