Home > mailing lists

Inconsistent string comparison using modified ICU collations - Mailing list pgsql-hackers

From	Oleg Tselebrovskiy
Subject	Inconsistent string comparison using modified ICU collations
Date	January 22, 2025 12:03:39
Msg-id	7461f00ebf361d41731df5b913a35b0b@postgrespro.ru Whole thread Raw
Responses	Re: Inconsistent string comparison using modified ICU collations Re: Inconsistent string comparison using modified ICU collations
List	pgsql-hackers

Tree view

Greetings, everyone!

I've discovered a bug with string comparison using modified ICU 
collations
Using a direct comparison and sorting values gives different results

The easiest way to reproduce is the following:

    postgres=# create collation "en-US-u-kr-latn-digit-x-icu" (provider = 
icu, locale = 'en-US-u-kr-latn-digit');
    CREATE COLLATION
    postgres=# select ('a' < '0' collate "en-US-u-kr-latn-digit-x-icu");
    ?column?
    ----------
    f
    (1 row)
    postgres=# select * from (values ('0'),('a')) t(x) order by x collate 
"en-US-u-kr-latn-digit-x-icu";
    x
    ---
    a
    0
    (2 rows)

Why does this happen:
In the first example of simple comparison, function varstr_cmp is called 
and it
uses ucol_strcoll[UTF8] function to compare two strings, and it seems to 
ignore
reordering of character groups in collation;
In the second example of sorting values, function varstr_abbrev_convert 
is called
and somewhere deep it uses ucol_getSortKey/ucol_nextSortKeyPart to 
transform
source string to SortKey and this transformation takes reordering of
character groups into account

Other way to reproduce this behaviour is to create table, insert the 
data into it,
create btree index over this table and sometimes you wouldn't get the 
data that is
definitely in the table (if you force postgres to use seqscan the query 
works):

    postgres=# create collation "en-US-u-kr-latn-digit-x-icu" (provider = 
icu, locale = 'en-US-u-kr-latn-digit');
    create table test (col text COLLATE "en-US-u-kr-latn-digit-x-icu");
    insert into test values ('a'), ('0');
    create index test_idx ON test USING btree (col);
    set enable_seqscan = off;
    select * from test where col = 'a';
    select * from test where col = '0';
    CREATE COLLATION
    CREATE TABLE
    INSERT 0 2
    CREATE INDEX
    SET
    col
    -----
    (0 rows)

    col
    -----
    (0 rows)

This happens because selecting one row triggers a binary search over the 
index,
rows in the index are stored according to modified collation (letters 
before digits),
since the creating of index triggers sorting of rows and, therefore, 
usage of
ucol_getSortKey/ucol_nextSortKeyPart. But the WHERE filter uses 
ucol_strcoll[UTF-8],
that doesn't take modified collation into account

This can be reproduced from REL_13_STABLE up to the current master
(41084409f635453efce03f1114880189b4f6ce4c)

I've opened an issue in ICU Jira[1] where I have reproduced this 
behaviour using
minimal C code

To compose the collation name I have read and used an article by Peter 
Eisentraut
on ICU collation settings[2]

Unfortunately, I don't have any proposed solution for this issue, but I 
thought
it was important to highlight it

Oleg Tselebrovskiy, Postgres Pro

[1] https://unicode-org.atlassian.net/browse/ICU-23016
[2] 
http://peter.eisentraut.org/blog/2023/05/16/overview-of-icu-collation-settings

pgsql-hackers by date:

From: Benoit Lobréau
Date: 22 January 2025, 11:54:19
Subject: Re: Doc: Move standalone backup section, mention -X argument

From: Daniel Gustafsson
Date: 22 January 2025, 12:17:13
Subject: Re: Windows: openssl & gssapi dislike each other

Inconsistent string comparison using modified ICU collations - Mailing list pgsql-hackers

Previous

Next