Home > mailing lists

Re: a strange order by behavior - Mailing list pgsql-sql

From	Samuel Gendler
Subject	Re: a strange order by behavior
Date	June 22, 2011 05:43:28
Msg-id	BANLkTi=PXV+FrAJHHX56D41Xo4DzG7vADA@mail.gmail.com Whole thread Raw
In response to	a strange order by behavior (Eyal Wilde <eyal@impactsoft.co.il>)
Responses	Re: a strange order by behavior Re: a strange order by behavior
List	pgsql-sql

Tree view

On Tue, Jun 21, 2011 at 3:37 AM, Eyal Wilde <eyal@impactsoft.co.il> wrote:

the database collation is: en_US.UTF-8

drop table t1;
create table t1 (recid int ,f1 varchar(20));
insert into t1 values (1,'a');
insert into t1 values (2,' ');
insert into t1 values (3,'aa');
insert into t1 values (4,' a');
select * from t1 order by f1

result:

recid f1
2 " "
1 "a" -- 2 comes before 1 because space is smaller then 'a'. fine.
4 " a" -- now is see that 1 comes before 4 because space is greater then 'a' !?
3 "aa" -- now again, 4 comes before 3 because space is smaller the 'a' !?!

I seem to recall a thread here about it ignoring spaces entirely in that collation (and maybe ignoring capitalization, too?). I went to go test that assertion by initializing a database with C collation and got some complaint about it being incompatible with my template1 template database. I initialized a db off of template0 and then got side tracked and you've only just reminded me of it. I was planning to test whether it is safe to use UTF-8 for encoding but use C collation, and then maybe investigate other collations.

This worked:

createdb -E UTF-8 --lc-collate=C some_db

so it should be easy enough to play around with it some. I'm not sure how to get a list of valid collations for any given charset, and it seems like C collation would generate somewhat strange results with non-ascii characters (french accented characters are supposed to alphabetize in some unexpected manner, I believe), so there must be something better - closer to UTF-8 collation but without ignoring whitespace and such. A quick google search reveals that there is some kind of standard for unicode collation (http://www.unicode.org/reports/tr10/ ) and I have no idea if that is what is represented by the en_US.UTF-8 collation or not. I've got no real experience with this stuff.

It appears that there are differences regarding collation in recent versions - the beta docs for 9.1 show that you can set collation on individual operations or differently for different columns (http://www.postgresql.org/docs/9.1/static/collation.html ). I see nothing like that in 8.4 docs.

It definitely looks like we both need to have a read of the localization chapter of the docs for our database version as there is a bunch of stuff in there that I was surprised to read when I just did a quick scan - like using anything but C or posix is much slower and can produce incorrect results in a 'like' query

It looks like the docs prior to 9.1beta have no real reference to collation at all, so it's down to trial and error unless someone in the know speaks up.

--sam

pgsql-sql by date:

From: Eyal Wilde
Date: 22 June 2011, 03:00:26
Subject: a strange order by behavior

From: Pavel Stehule
Date: 22 June 2011, 05:46:31
Subject: Re: a strange order by behavior

Re: a strange order by behavior - Mailing list pgsql-sql

Previous

Next