On Thu, 2025-01-09 at 16:19 -0800, Jeff Davis wrote:
> On Mon, 2024-12-02 at 23:58 -0800, Jeff Davis wrote:
> > On Mon, 2024-12-02 at 16:39 +0100, Andreas Karlsson wrote:
> > > I feel your first patch in the series is something you can just
> > > commit.
> >
> > Done.
> >
> > I combined your patches and mine into the attached v10 series.
>
> Here's v12 after committing a few of the earlier patches.

I collected some performance numbers for a worst case on UTF8. This is
where each row is a million characters wide and every character is
greater than MAX_SIMPLE_CHAR (U+07FF):

  create table wide (t text);
  insert into wide
    select repeat('カ', 1048576)
    from generate_series(1,1000) g;
  select 1 from wide where t ~ '([[:punct:]]|[[:lower:]])'
    collate "the_collation";

results:

                 master   patched
  C                3736      3589
  pg_c_utf8       19500     23404
  en_US           10251     12396
  en-US-x-icu     10264     11963

And a separate test for ILIKE on en_US.iso885915 where each character
is beyond the ASCII range and needs to be lowercased using the
optimization for single-byte encodings in Generic_Text_IC_like:

  create table sb (t text);
  insert into sb
    select repeat('É', 1048576)
    from generate_series(1, 3000) g;
  select 1 from sb where t ilike '%á%';

results:

                 master   patched
  C                2900      2812
  en_US            2203      3702
  en-US-x-icu     17483     18123

The numbers from both tests show a slowdown for everything except C.
The worst one is en_US in the ILIKE test (2203 -> 3702, roughly 68%
slower). That is probably tolower() for libc in LATIN9, which appears
to be heavily optimized, so the extra indirection for a method call
slows things down quite a bit.
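
To make the indirection concrete, here is a minimal sketch of the two
call shapes being compared. It is not the patch's code: the names
demo_ctype_methods, demo_libc_tolower, lower_direct and
lower_via_methods are hypothetical and exist only for illustration.
The direct loop calls tolower() itself, while the method-table loop
goes through a function pointer and a wrapper for every byte.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical method table; illustrative only, not the patch's struct. */
typedef struct demo_ctype_methods
{
	int			(*char_tolower) (int c);
} demo_ctype_methods;

/* Wrapper that a libc-backed provider might install in the table. */
static int
demo_libc_tolower(int c)
{
	return tolower(c);
}

static const demo_ctype_methods demo_libc_methods = {demo_libc_tolower};

/* Direct call: the call site names tolower() explicitly. */
static void
lower_direct(unsigned char *s, size_t n)
{
	for (size_t i = 0; i < n; i++)
		s[i] = (unsigned char) tolower(s[i]);
}

/* Method call: one function-pointer call (plus the wrapper) per byte. */
static void
lower_via_methods(const demo_ctype_methods *m, unsigned char *s, size_t n)
{
	assert(m != NULL && m->char_tolower != NULL);
	for (size_t i = 0; i < n; i++)
		s[i] = (unsigned char) m->char_tolower(s[i]);
}
```

Both loops produce the same output; the only difference is how the
per-character call is dispatched, which is the cost in question when
the underlying tolower() is already very cheap.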

This is a bit unfortunate because the method table feels like the right
code organization. Having special cases at the call sites (aside from
ctype_is_c) is not great. Are the above numbers bad enough that we need
to give up on this method-ization approach? Or should we say that the
above cases don't represent reality, and a moderate regression there is
OK?

Or perhaps someone has an idea how to mitigate the regression? I could
imagine another cache of character properties, like an extensible
pg_char_properties. I'm not sure if the extra complexity is worth it,
though.
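
One possible shape for such a cache, sketched only for the single-byte
case and under stated assumptions: call the provider's lowercase method
once for each of the 256 byte values, then use a table lookup in the
hot loop. The names demo_sb_lower_cache, demo_cache_build and
demo_cache_lower are hypothetical, and this is not a proposal for how
an extensible pg_char_properties would actually look.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical per-locale cache for a single-byte encoding. */
typedef struct demo_sb_lower_cache
{
	unsigned char lower[256];
} demo_sb_lower_cache;

/*
 * Fill the cache by calling the (possibly indirect) lowercase function
 * once per byte value. This pays the method-call cost 256 times in
 * total rather than once per input character.
 */
static void
demo_cache_build(demo_sb_lower_cache *cache, int (*lower_fn) (int))
{
	assert(cache != NULL && lower_fn != NULL);
	for (int c = 0; c < 256; c++)
		cache->lower[c] = (unsigned char) lower_fn(c);
}

/* Hot loop: a plain array lookup per byte, with no function call. */
static void
demo_cache_lower(const demo_sb_lower_cache *cache, unsigned char *s, size_t n)
{
	assert(cache != NULL);
	for (size_t i = 0; i < n; i++)
		s[i] = cache->lower[s[i]];
}
```

A caller would build the cache once per locale, for example with
demo_cache_build(&cache, tolower), and reuse it across rows. A
multibyte encoding such as UTF8 would need a different structure, and
the sketch does not address invalidation or memory ownership.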

Regards,
Jeff Davis