Re: Improve the performance of Unicode Normalization Forms. - Mailing list pgsql-hackers

From Victor Yegorov
Subject Re: Improve the performance of Unicode Normalization Forms.
Date
Msg-id CAGnEbohehx6sty5LFBkXqYKs7sB1qpy2YV1yu=n9X63ereosjQ@mail.gmail.com
In response to Re: Improve the performance of Unicode Normalization Forms.  (Alexander Borisov <lex.borisov@gmail.com>)
Responses Re: Improve the performance of Unicode Normalization Forms.
List pgsql-hackers
On Wed, 3 Sep 2025 at 09:35, Alexander Borisov <lex.borisov@gmail.com> wrote:
Hi, Jeff, hackers!

As promised, here is the refactored C code for Unicode Normalization Forms.

In general terms, here's what has changed:
1. Recursion has been removed; the data is now generated by
     a Perl script.
2. Memory is no longer allocated as uint32 for the entire size;
     instead, uint8 is allocated for the entire size for the CCC cache,
     which boosts performance significantly.
3. The code for the unicode_normalize() function has been completely
     rewritten.
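
To illustrate item 2: canonical combining class (CCC) values fit in one
byte, so a per-code-point CCC cache can use uint8 slots. The following is
a minimal Python sketch of that idea, not the patch's C code. The name
canonical_reorder is hypothetical, and the stdlib unicodedata.combining()
stands in for the generated lookup tables.

```python
import unicodedata

def canonical_reorder(code_points):
    """Reorder combining marks by CCC, caching each CCC in one byte."""
    # One byte per code point: CCC values lie in 0..254, so uint8 is enough.
    ccc = bytearray(unicodedata.combining(chr(cp)) for cp in code_points)
    out = list(code_points)
    # Stable insertion sort: a mark moves left only past marks with a
    # strictly higher CCC, and never past a starter (CCC == 0).
    for i in range(1, len(out)):
        j = i
        while j > 0 and ccc[j] != 0 and ccc[j - 1] > ccc[j]:
            out[j - 1], out[j] = out[j], out[j - 1]
            ccc[j - 1], ccc[j] = ccc[j], ccc[j - 1]
            j -= 1
    return out
```

For example, canonical_reorder([0x61, 0x301, 0x323]) returns
[0x61, 0x323, 0x301], the same order that
unicodedata.normalize("NFD", "a\u0301\u0323") produces.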

I am confident that we have achieved excellent results.

Hey.

I've looked into these patches.

The patches apply, compilation succeeds, and make check and make installcheck
show no errors.
 
Code quality is good, although I suggest having a native English speaker review
the comments and commit messages, as they are a bit difficult to follow.

The Sparse Array approach is described in the newly introduced
GenerateSparseArray.pm module.  Perhaps it would be valuable to add a section
to src/common/unicode/README, where it would get more visibility.
(Not insisting here.)
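
For readers unfamiliar with the general idea, here is a generic two-level
lookup table sketched in Python. It is only an illustration of sparse
storage for per-code-point data and is not taken from
GenerateSparseArray.pm, whose actual layout may differ. BLOCK, build_sparse
and lookup are hypothetical names chosen for this sketch.

```python
BLOCK = 256  # assumed block size, chosen only for this illustration

def build_sparse(mapping, max_cp=0x10FFFF):
    """Pack a {code_point: value} dict into an index table plus blocks."""
    blocks = [[0] * BLOCK]                 # block 0: shared all-zero block
    index = [0] * (max_cp // BLOCK + 1)    # every range starts at block 0
    for cp, value in mapping.items():
        hi, lo = divmod(cp, BLOCK)
        if index[hi] == 0:
            # First non-zero entry in this range: give it its own block.
            blocks.append([0] * BLOCK)
            index[hi] = len(blocks) - 1
        blocks[index[hi]][lo] = value
    return index, blocks

def lookup(index, blocks, cp):
    """Two array reads: index table first, then the block it points to."""
    hi, lo = divmod(cp, BLOCK)
    return blocks[index[hi]][lo]
```

Ranges with no data all share block 0, so only populated ranges cost a
full block of storage.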

For performance testing I used the approach by Jeff Davis [1]: I prepared NFC
and NFD files, loaded them into UNLOGGED tables, and measured normalize()
calls.

    CREATE UNLOGGED TABLE strings_nfd (
      str   text STORAGE PLAIN NOT NULL
    );
    COPY strings_nfd FROM '/var/lib/postgresql/strings.nfd.txt';
   
    CREATE UNLOGGED TABLE strings_nfc (
      str   text STORAGE PLAIN NOT NULL
    );
    COPY strings_nfc FROM '/var/lib/postgresql/strings.nfc.txt';
   
    SELECT count( normalize( str, NFD ) ) FROM strings_nfd, generate_series( 1, 10 ) x;
    SELECT count( normalize( str, NFC ) ) FROM strings_nfc, generate_series( 1, 10 ) x;

And I've got the following numbers:

Master
NFD Time: 2954.630 ms / 295ms
NFC Time: 3929.939 ms / 330ms

Patched
NFD Time: 1658.345 ms / 166ms / +78%
NFC Time: 1862.757 ms / 186ms / +77%
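
The percentages can be reproduced with a short Python sketch, assuming they
are throughput gains computed as master / patched minus 1 from the second
figure on each line (apparently the time per pass). speedup_percent is a
hypothetical helper name.

```python
def speedup_percent(master_ms, patched_ms):
    """Throughput gain of patched over master, in percent."""
    return (master_ms / patched_ms - 1.0) * 100.0

# Second figure from each line of the results above, in milliseconds.
nfd = speedup_percent(295, 166)
nfc = speedup_percent(330, 186)
print(f"NFD: +{nfd:.0f}%  NFC: +{nfc:.0f}%")
```

With these inputs the sketch prints +78% for NFD and +77% for NFC.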

Overall, I find these patches and performance very nice and valuable.
I've added myself as a reviewer and marked this patch as Ready for Committer.

[1] https://postgr.es/m/adffa1fbdb867d5a11c9a8211cde3bdb1e208823.camel@j-davis.com

--
Victor Yegorov
