Re: Reduce build times of pg_trgm GIN indexes - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Reduce build times of pg_trgm GIN indexes
Msg-id 2e11134f-02c3-43da-8c39-fb520a1a251d@iki.fi
In response to Re: Reduce build times of pg_trgm GIN indexes  (David Geier <geidav.pg@gmail.com>)
List pgsql-hackers
On 09/01/2026 14:06, David Geier wrote:
> On 06.01.2026 18:00, Heikki Linnakangas wrote:
>> On 05/01/2026 17:01, David Geier wrote:
>>> v1-0008-Add-ASCII-fastpath-to-generate_trgm_only.patch: Typically lots
>>> of text is actually ASCII. Hence, we provide a fast path for this case
>>> which is exercised if the MSB of the current character is unset.
>>
>> This uses pg_ascii_tolower() for ASCII characters when built with
>> IGNORECASE. I don't think that's correct if the proper collation
>> would do something more complicated than what pg_ascii_tolower() does.
> 
> Oh, that's evil. I had tested that specifically. But it only worked
> because the code in master uses str_tolower() with
> DEFAULT_COLLATION_OID. So using a different locale like in the following
> example does something different than when creating a database with the
> same locale.
> 
> postgres=# select lower('III' COLLATE "tr_TR");
>   lower
> -------
>   ııı
> 
> postgres=# select show_trgm('III' COLLATE "tr_TR");
>          show_trgm
> -------------------------
>   {"  i"," ii","ii ",iii}
> (1 row)
> 
> But when using tr_TR as default locale of the database the following
> happens:
> 
> postgres=# select lower('III' COLLATE "tr_TR");
>   lower
> -------
>   ııı
> 
> postgres=# select show_trgm('III');
>                 show_trgm
> ---------------------------------------
>   {0xbbd8dd,0xf26fab,0xf31e1a,0x2af4f1}
> 
> I'm wondering if that's intentional to begin with. Shouldn't the code
> instead pass PG_GET_COLLATION() to str_tolower()? Might require some
> research to see how other index types handle locales.
> 
> Coming back to the original problem: the lengthy comment at the top of
> pg_locale_libc.c suggests that in some cases ASCII characters are
> handled the pg_ascii_tolower() way for the default locale. See for
> example tolower_libc_mb(). So a character by character conversion using
> that function will yield a different result than strlower_libc_mb(). I'm
> wondering why that is.

Hmm, yeah, that feels funny. The trigram code predates per-column 
collation support, so I guess we never really thought through how it 
should interact with COLLATE clauses.

> Anyways, we could limit the optimization to only kick in when the used
> locale follows the same rules as pg_ascii_tolower(). We could test that
> when creating the locale and store that info in pg_locale_struct.

I think that's only possible for libc locales, which operate one 
character at a time. In ICU locales, lower-casing a character can depend 
on the surrounding characters, so you cannot just test the conversion of 
every ASCII character individually. It would make sense for libc locales 
though, and I hope the ICU functions are a little faster anyway.
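
For libc, the per-locale test could look roughly like the sketch below. 
It uses plain setlocale()/tolower() instead of PostgreSQL's locale_t 
plumbing, and both function names are made up for illustration; 
ascii_tolower_sketch() is only a stand-in for pg_ascii_tolower().

```c
#include <ctype.h>
#include <locale.h>
#include <stdbool.h>
#include <stdio.h>

/* Stand-in for PostgreSQL's pg_ascii_tolower(): maps only A-Z to a-z. */
static unsigned char
ascii_tolower_sketch(unsigned char ch)
{
	if (ch >= 'A' && ch <= 'Z')
		ch += 'a' - 'A';
	return ch;
}

/*
 * Hypothetical helper: returns true if the named libc locale lower-cases
 * every 7-bit ASCII character exactly like ascii_tolower_sketch().
 * Returns false if the locale cannot be selected.
 */
static bool
libc_locale_matches_ascii_tolower(const char *locale_name)
{
	char		saved[256];
	const char *cur = setlocale(LC_CTYPE, NULL);
	bool		matches = true;

	/* setlocale() may reuse its return buffer, so copy the current name. */
	snprintf(saved, sizeof(saved), "%s", cur ? cur : "C");

	if (setlocale(LC_CTYPE, locale_name) == NULL)
		return false;

	for (int c = 0; c < 128; c++)
	{
		if (tolower(c) != ascii_tolower_sketch((unsigned char) c))
		{
			matches = false;
			break;
		}
	}

	/* Restore whatever LC_CTYPE was in effect before the check. */
	setlocale(LC_CTYPE, saved);
	return matches;
}
```

The result would presumably be computed once when the locale is created 
and cached, e.g. as a flag in pg_locale_struct as David suggested, so 
that the fast path only has to test a boolean.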

Although, we probably should be using case-folding rather than 
lower-casing with ICU locales anyway. Case-folding is designed for 
string matching. It'd be a backwards-compatibility breaking change, though.

- Heikki



