Re: a strange order by behavior - Mailing list pgsql-sql
From | Samuel Gendler |
---|---|
Subject | Re: a strange order by behavior |
Date | |
Msg-id | BANLkTi=PXV+FrAJHHX56D41Xo4DzG7vADA@mail.gmail.com Whole thread Raw |
In response to | a strange order by behavior (Eyal Wilde <eyal@impactsoft.co.il>) |
Responses |
Re: a strange order by behavior
Re: a strange order by behavior |
List | pgsql-sql |
On Tue, Jun 21, 2011 at 3:37 AM, Eyal Wilde <eyal@impactsoft.co.il> wrote:
the database collation is: en_US.UTF-8drop table t1;create table t1 (recid int ,f1 varchar(20));insert into t1 values (1,'a');insert into t1 values (2,' ');insert into t1 values (3,'aa');insert into t1 values (4,' a');select * from t1 order by f1result:recid f12 " "1 "a" -- 2 comes before 1 because space is smaller then 'a'. fine.4 " a" -- now is see that 1 comes before 4 because space is greater then 'a' !?3 "aa" -- now again, 4 comes before 3 because space is smaller the 'a' !?!
I seem to recall a thread here about it ignoring spaces entirely in that collation (and maybe ignoring capitalization, too?). I went to go test that assertion by initializing a database with C collation and got some complaint about it being incompatible with my template1 template database. I initialized a db off of template0 and then got side tracked and you've only just reminded me of it. I was planning to test whether it is safe to use UTF-8 for encoding but use C collation, and then maybe investigate other collations.
This worked:
createdb -E UTF-8 --lc-collate=C some_db
so it should be easy enough to play around with it some. I'm not sure how to get a list of valid collations for any given charset, and it seems like C collation would generate somewhat strange results with non-ascii characters (french accented characters are supposed to alphabetize in some unexpected manner, I believe), so there must be something better - closer to UTF-8 collation but without ignoring whitespace and such. A quick google search reveals that there is some kind of standard for unicode collation (http://www.unicode.org/reports/tr10/ ) and I have no idea if that is what is represented by the en_US.UTF-8 collation or not. I've got no real experience with this stuff.
It appears that there are differences regarding collation in recent versions - the beta docs for 9.1 show that you can set collation on individual operations or differently for different columns (http://www.postgresql.org/docs/9.1/static/collation.html ). I see nothing like that in 8.4 docs.
It definitely looks like we both need to have a read of the localization chapter of the docs for our database version as there is a bunch of stuff in there that I was surprised to read when I just did a quick scan - like using anything but C or posix is much slower and can produce incorrect results in a 'like' query
It looks like the docs prior to 9.1beta have no real reference to collation at all, so it's down to trial and error unless someone in the know speaks up.
--sam