Inconsistent string comparison using modified ICU collations - Mailing list pgsql-hackers
From | Oleg Tselebrovskiy |
---|---|
Subject | Inconsistent string comparison using modified ICU collations |
Date | |
Msg-id | 7461f00ebf361d41731df5b913a35b0b@postgrespro.ru Whole thread Raw |
Responses |
Re: Inconsistent string comparison using modified ICU collations
Re: Inconsistent string comparison using modified ICU collations |
List | pgsql-hackers |
Greetings, everyone! I've discovered a bug with string comparison using modified ICU collations Using a direct comparison and sorting values gives different results The easiest way to reproduce is the following: postgres=# create collation "en-US-u-kr-latn-digit-x-icu" (provider = icu, locale = 'en-US-u-kr-latn-digit'); CREATE COLLATION postgres=# select ('a' < '0' collate "en-US-u-kr-latn-digit-x-icu"); ?column? ---------- f (1 row) postgres=# select * from (values ('0'),('a')) t(x) order by x collate "en-US-u-kr-latn-digit-x-icu"; x --- a 0 (2 rows) Why does this happen: In the first example of simple comparison, function varstr_cmp is called and it uses ucol_strcoll[UTF8] function to compare two strings, and it seems to ignore reordering of character groups in collation; In the second example of sorting values, function varstr_abbrev_convert is called and somewhere deep it uses ucol_getSortKey/ucol_nextSortKeyPart to transform source string to SortKey and this transformation takes reordering of character groups into account Other way to reproduce this behaviour is to create table, insert the data into it, create btree index over this table and sometimes you wouldn't get the data that is definitely in the table (if you force postgres to use seqscan the query works): postgres=# create collation "en-US-u-kr-latn-digit-x-icu" (provider = icu, locale = 'en-US-u-kr-latn-digit'); create table test (col text COLLATE "en-US-u-kr-latn-digit-x-icu"); insert into test values ('a'), ('0'); create index test_idx ON test USING btree (col); set enable_seqscan = off; select * from test where col = 'a'; select * from test where col = '0'; CREATE COLLATION CREATE TABLE INSERT 0 2 CREATE INDEX SET col ----- (0 rows) col ----- (0 rows) This happens because selecting one row triggers a binary search over the index, rows in the index are stored according to modified collation (letters before digits), since the creating of index triggers sorting of rows and, therefore, usage of ucol_getSortKey/ucol_nextSortKeyPart. But the WHERE filter uses ucol_strcoll[UTF-8], that doesn't take modified collation into account This can be reproduced from REL_13_STABLE up to the current master (41084409f635453efce03f1114880189b4f6ce4c) I've opened an issue in ICU Jira[1] where I have reproduced this behaviour using minimal C code To compose the collation name I have read and used an article by Peter Eisentraut on ICU collation settings[2] Unfortunately, I don't have any proposed solution for this issue, but I thought it was important to highlight it Oleg Tselebrovskiy, Postgres Pro [1] https://unicode-org.atlassian.net/browse/ICU-23016 [2] http://peter.eisentraut.org/blog/2023/05/16/overview-of-icu-collation-settings
pgsql-hackers by date: