Re: a strange order by behavior - Mailing list pgsql-sql
From | Pavel Stehule |
---|---|
Subject | Re: a strange order by behavior |
Date | |
Msg-id | BANLkTim9sFLt3e2+auF_oTr68UETiW83VA@mail.gmail.com Whole thread Raw |
In response to | Re: a strange order by behavior (Samuel Gendler <sgendler@ideasculptor.com>) |
Responses |
Re: a strange order by behavior
|
List | pgsql-sql |
Hello a equalent of C collate for UTF8 is ucs_basic Regards Pavel Stehule 2011/6/22 Samuel Gendler <sgendler@ideasculptor.com>: > > > On Tue, Jun 21, 2011 at 3:37 AM, Eyal Wilde <eyal@impactsoft.co.il> wrote: >> >> the database collation is: en_US.UTF-8 >> drop table t1; >> create table t1 (recid int ,f1 varchar(20)); >> insert into t1 values (1,'a'); >> insert into t1 values (2,' '); >> insert into t1 values (3,'aa'); >> insert into t1 values (4,' a'); >> select * from t1 order by f1 >> result: >> recid f1 >> 2 " " >> 1 "a" -- 2 comes before 1 because space is smaller then 'a'. >> fine. >> 4 " a" -- now is see that 1 comes before 4 because space is >> greater then 'a' !? >> 3 "aa" -- now again, 4 comes before 3 because space is smaller >> the 'a' !?! > > I seem to recall a thread here about it ignoring spaces entirely in that > collation (and maybe ignoring capitalization, too?). I went to go test that > assertion by initializing a database with C collation and got some complaint > about it being incompatible with my template1 template database. I > initialized a db off of template0 and then got side tracked and you've only > just reminded me of it. I was planning to test whether it is safe to use > UTF-8 for encoding but use C collation, and then maybe investigate other > collations. > This worked: > createdb -E UTF-8 --lc-collate=C some_db > so it should be easy enough to play around with it some. I'm not sure how > to get a list of valid collations for any given charset, and it seems like C > collation would generate somewhat strange results with non-ascii characters > (french accented characters are supposed to alphabetize in some unexpected > manner, I believe), so there must be something better - closer to UTF-8 > collation but without ignoring whitespace and such. A quick google search > reveals that there is some kind of standard for unicode collation > (http://www.unicode.org/reports/tr10/ ) and I have no idea if that is what > is represented by the en_US.UTF-8 collation or not. I've got no real > experience with this stuff. > It appears that there are differences regarding collation in recent versions > - the beta docs for 9.1 show that you can set collation on individual > operations or differently for different columns > (http://www.postgresql.org/docs/9.1/static/collation.html ). I see nothing > like that in 8.4 docs. > It definitely looks like we both need to have a read of the localization > chapter of the docs for our database version as there is a bunch of stuff in > there that I was surprised to read when I just did a quick scan - like using > anything but C or posix is much slower and can produce incorrect results in > a 'like' query > It looks like the docs prior to 9.1beta have no real reference to collation > at all, so it's down to trial and error unless someone in the know speaks > up. > --sam > >