On 09/01/2026 14:06, David Geier wrote:
> On 06.01.2026 18:00, Heikki Linnakangas wrote:
>> On 05/01/2026 17:01, David Geier wrote:
>>> v1-0008-Add-ASCII-fastpath-to-generate_trgm_only.patch: Typically lots
>>> of text is actually ASCII. Hence, we provide a fast path for this case
>>> which is exercised if the MSB of the current character is unset.
>>
>> This uses pg_ascii_tolower() for ASCII characters when built with
>> IGNORECASE. I don't think that's correct if the proper collation
>> would do something more complicated than what pg_ascii_tolower() does.
>
> Oh, that's evil. I had tested that specifically. But it only worked
> because the code in master uses str_tolower() with
> DEFAULT_COLLATION_OID. So using a different locale like in the following
> example does something different than when creating a database with the
> same locale.
>
> postgres=# select lower('III' COLLATE "tr_TR");
> lower
> -------
> ııı
>
> postgres=# select show_trgm('III' COLLATE "tr_TR");
> show_trgm
> -------------------------
> {" i"," ii","ii ",iii}
> (1 row)
>
> But when using tr_TR as default locale of the database the following
> happens:
>
> postgres=# select lower('III' COLLATE "tr_TR");
> lower
> -------
> ııı
>
> postgres=# select show_trgm('III');
> show_trgm
> ---------------------------------------
> {0xbbd8dd,0xf26fab,0xf31e1a,0x2af4f1}
>
> I'm wondering if that's intentional to begin with. Shouldn't the code
> instead pass PG_GET_COLLATION() to str_tolower()? Might require some
> research to see how other index types handle locales.
>
> Coming back to the original problem: the lengthy comment at the top of
> pg_locale_libc.c suggests that in some cases ASCII characters are
> handled the pg_ascii_tolower() way for the default locale. See for
> example tolower_libc_mb(). So a character by character conversion using
> that function will yield a different result than strlower_libc_mb(). I'm
> wondering why that is.

Hmm, yeah, that feels funny. The trigram code predates per-column
collation support, so I guess we never really thought through how it
should interact with COLLATE clauses.

> Anyways, we could limit the optimization to only kick in when the used
> locale follows the same rules as pg_ascii_tolower(). We could test that
> when creating the locale and store that info in pg_locale_struct.

I think that's only possible for libc locales, which operate one
character at a time. In ICU locales, lower-casing a character can depend
on the surrounding characters, so you cannot just test the conversion of
every ASCII character individually. It would make sense for libc locales
though, and I hope the ICU functions are a little faster anyway.
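
For libc, the check at locale creation time could look roughly like the
sketch below. This is only an illustration, not a patch: ascii_tolower()
is a stand-in for pg_ascii_tolower(), locale_lowers_ascii_like_c() is a
made-up name, and it assumes a POSIX 2008 libc (towlower_l) on which the
wchar_t value of a 7-bit character equals its ASCII code.

```c
#define _POSIX_C_SOURCE 200809L	/* for locale_t and towlower_l() */
#include <assert.h>
#include <locale.h>
#include <stdbool.h>
#include <wchar.h>
#include <wctype.h>

/* Stand-in for pg_ascii_tolower(): only maps 'A'..'Z' to 'a'..'z'. */
static unsigned char
ascii_tolower(unsigned char ch)
{
	if (ch >= 'A' && ch <= 'Z')
		ch += 'a' - 'A';
	return ch;
}

/*
 * Hypothetical helper: returns true if the given libc locale lower-cases
 * every 7-bit character exactly the way ascii_tolower() does.  The result
 * could be cached per locale so the fast path is only taken when it is
 * known to be equivalent.
 */
static bool
locale_lowers_ascii_like_c(locale_t loc)
{
	assert(loc != (locale_t) 0);

	for (int c = 0; c < 128; c++)
	{
		/* Assumption: wide-character value == ASCII code for c < 128. */
		if (towlower_l((wint_t) c, loc) != (wint_t) ascii_tolower((unsigned char) c))
			return false;
	}
	return true;
}
```

With a locale like tr_TR I would expect this to return false because of
'I', but I have not verified that on any particular platform.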

Although, we probably should be using case-folding rather than
lower-casing with ICU locales anyway. Case-folding is designed for
string matching. It'd be a backwards-compatibility breaking change, though.

- Heikki