Re: Improve the performance of Unicode Normalization Forms. - Mailing list pgsql-hackers

From	Victor Yegorov
Subject	Re: Improve the performance of Unicode Normalization Forms.
Date	September 10 21:50:12
Msg-id	CAGnEbohehx6sty5LFBkXqYKs7sB1qpy2YV1yu=n9X63ereosjQ@mail.gmail.com
In response to	Re: Improve the performance of Unicode Normalization Forms. (Alexander Borisov <lex.borisov@gmail.com>)
Responses	Re: Improve the performance of Unicode Normalization Forms.
List	pgsql-hackers

On Wed, Sep 3, 2025 at 09:35, Alexander Borisov <lex.borisov@gmail.com> wrote:

Hi, Jeff, hackers!

As promised, refactoring the C code for Unicode Normalization Forms.

In general terms, here's what has changed:
1. Recursion has been removed; now data is generated using
a Perl script.
2. Memory is no longer allocated as uint32 for the entire size;
instead, uint8 is allocated for the entire size for the CCC cache,
which boosts performance significantly.
3. The code for the unicode_normalize() function has been completely
rewritten.
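
As an aside on point 2: canonical combining class (CCC) values lie in the range 0..254, so one byte per code point is enough to hold them. The following is only an illustrative Python sketch of that idea, not the patch's C code; the helper name `ccc_cache` is hypothetical, and what the patch actually caches is not shown in this message.

```python
import unicodedata

def ccc_cache(text):
    """Return one canonical combining class per code point, one byte each."""
    # unicodedata.combining() returns the CCC as an int in 0..254,
    # so every value fits into a single bytearray element.
    return bytearray(unicodedata.combining(ch) for ch in text)

# e + combining acute, a + combining dot below + combining circumflex
sample = "e\u0301a\u0323\u0302"
cache = ccc_cache(sample)
print(list(cache))  # [0, 230, 0, 220, 230]
```

A byte-sized cache is four times smaller than a uint32 one for the same number of code points, which is presumably where part of the gain comes from.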

I am confident that we have achieved excellent results.

Hey.

I've looked into these patches.

The patches apply, compilation succeeds, and make check and make installcheck
show no errors.

Code quality is good, although I suggest having a native English speaker review
the comments and commit messages; they are a bit difficult to follow.

The Sparse Array approach is described in the newly introduced
GenerateSparseArray.pm module. Perhaps it would be valuable to add a section to
src/common/unicode/README as well, where it would get more visibility.
(Not insisting here.)

For performance testing I've used an approach by Jeff Davis. [1]
I've prepared NFC and NFD files, loaded them into UNLOGGED tables and measured
normalize() calls.
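
How the input files were produced is described in the referenced message [1], not here. Purely as an assumption about what such a preparation step could look like, here is a minimal Python sketch that builds the same pseudo-random strings in both forms; the alphabet, the helper names and the sizes are all hypothetical.

```python
import random
import unicodedata

# Hypothetical sample alphabet: ASCII letters plus characters that have
# canonical decompositions (accented Latin letters, two Hangul syllables).
ALPHABET = "abcxyz" + "\u00e1\u00e9\u00ee\u00f5\u00fc\u00e7\u00f1\u00c5" + "\ud55c\uae00"

def make_lines(count, length, seed=0):
    """Return `count` pseudo-random strings of `length` code points each."""
    rng = random.Random(seed)  # fixed seed keeps the data reproducible
    return ["".join(rng.choice(ALPHABET) for _ in range(length))
            for _ in range(count)]

def normalize_lines(lines, form):
    """Normalize every string to `form` ('NFC' or 'NFD')."""
    return [unicodedata.normalize(form, line) for line in lines]

def write_lines(path, lines):
    """Write one string per line as UTF-8, suitable for COPY in text format."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.writelines(line + "\n" for line in lines)

# Example use (file names mirror the ones loaded by COPY):
#   lines = make_lines(count=100000, length=40)
#   write_lines("strings.nfc.txt", normalize_lines(lines, "NFC"))
#   write_lines("strings.nfd.txt", normalize_lines(lines, "NFD"))
```

The alphabet deliberately avoids backslashes, tabs and newlines, which would need escaping in COPY's text format.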

CREATE UNLOGGED TABLE strings_nfd (
    str text STORAGE PLAIN NOT NULL
);
COPY strings_nfd FROM '/var/lib/postgresql/strings.nfd.txt';

CREATE UNLOGGED TABLE strings_nfc (
    str text STORAGE PLAIN NOT NULL
);
COPY strings_nfc FROM '/var/lib/postgresql/strings.nfc.txt';

SELECT count( normalize( str, NFD ) ) FROM strings_nfd, generate_series( 1, 10 ) x;
SELECT count( normalize( str, NFC ) ) FROM strings_nfc, generate_series( 1, 10 ) x;

And I've got the following numbers:

            Total (10 passes)   Per pass   Gain vs. master
Master
  NFD       2954.630 ms         295 ms
  NFC       3929.939 ms         330 ms

Patched
  NFD       1658.345 ms         166 ms     +78%
  NFC       1862.757 ms         186 ms     +77%
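
The percentages appear to be the throughput gain computed from the rounded per-pass timings, i.e. master / patched - 1. A small sketch of that arithmetic, with `speedup_percent` being a hypothetical helper name:

```python
def speedup_percent(master_ms, patched_ms):
    """Throughput gain of the patched build relative to master, in percent."""
    return (master_ms / patched_ms - 1.0) * 100.0

# Per-pass timings in milliseconds, as reported above.
nfd = speedup_percent(295, 166)
nfc = speedup_percent(330, 186)
print(f"NFD: +{nfd:.0f}%  NFC: +{nfc:.0f}%")  # NFD: +78%  NFC: +77%
```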

Overall, I find these patches and the performance gains very nice and valuable.
I've added myself as a reviewer and marked this patch as Ready for Committer.

[1] https://postgr.es/m/adffa1fbdb867d5a11c9a8211cde3bdb1e208823.camel@j-davis.com

Victor Yegorov
