From: Alexander Borisov
Subject: Improve the performance of Unicode Normalization Forms.
Msg-id: 844d3dd7-2955-4794-95d1-7f4c13cb89fc@gmail.com
List: pgsql-hackers

Hi, hackers!

As promised, I'm continuing to improve and speed up Unicode handling
in Postgres. Last time we improved the lower(), upper(), and
casefold() functions [1]. Now it's the turn of the Unicode
Normalization Forms, specifically the normalize() function.
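
For context, normalize() is the SQL-level entry point these patches
speed up. A quick illustration of the standard syntax (not part of
the patches, shown only to make the target concrete):

    -- compose "a" + U+0301 (combining acute) into U+00E9
    SELECT normalize(U&'\0061\0301', NFC);
    -- check whether a string is already in a given form
    SELECT U&'\00E9' IS NFC NORMALIZED;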

The attachment contains three patches:
1. Moves the functions that build a range index into a separate
    Perl module.
2-small. Keeps the mapping tables small, at the cost of more branches
    when searching for the desired value.
2-big. Uses larger tables, but needs fewer branches when searching
    for the desired value.

In other words, we have to choose between the two variants of
patch 2: small or big.

What did we manage to achieve?
1. The uint32 field that stored the Unicode codepoint was removed
    from the main structure/table; it had been kept there for perfect
    hashing.
2. Thanks to that, we could eliminate duplicates and shrink the main
    table from 6843 to 3943 entries.
3. The table of decomposition codepoints was also deduplicated,
    shrinking from 5138 to 4304 uint32 entries.
4. These changes increase speed by up to 40% in some cases
    (benchmarks below).
5. The server side "lost weight" in the binary, while the frontend
    "gained weight" a little.

I read the old commits, which say that binary size is very important
for the frontend and that speed is not (speed matters on the server).
I'm not quite sure what to do if that is really the case. Perhaps we
should keep the slower version for the frontend.

Benchmarks.

How was it tested?
Four files were created, one for each normalization form: NFC, NFD,
NFKC, and NFKD.
The files were run through pgbench. Each file contains all the code
points that need to be normalized.
Unfortunately, the patches are already quite large, but if necessary,
I can send these files in a separate email or upload them somewhere.
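
To make the setup concrete, here is a minimal sketch of what one such
test file might look like (the file name and the two codepoint pairs
are made up for illustration; the real files cover every codepoint
that needs normalization):

    -- nfc_test.sql (hypothetical): exercise normalize() under NFC
    -- run with something like: pgbench -n -f nfc_test.sql -T 30 <db>
    SELECT normalize(U&'\0061\0301\0041\030A', NFC);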

Legend.
Patch (big table): larger mapping tables, but fewer branches when
     searching for the desired value.
Patch (small table): smaller tables, but more branches when searching
     for the desired value.
Without: the original Postgres without patches.

* Ubuntu 24.04.1 (Intel(R) Xeon(R) Gold 6140) (gcc version 13.3.0)

NFC:
Patch (big table): tps = 5057.624509
Patch (small table): tps = 4727.158048
Without: tps = 3268.654867

NFD:
Patch (big table): tps = 1145.730247
Patch (small table): tps = 1073.084633
Without: tps = 689.627239

NFKC:
Patch (big table): tps = 4821.700702
Patch (small table): tps = 4662.640629
Without: tps = 3450.254581

NFKD:
Patch (big table): tps = 1142.644341
Patch (small table): tps = 951.405893
Without: tps = 636.161496

How the size of object files and libraries has changed (in bytes):
Patch (big table):
unicode_norm.o: 138896
unicode_norm_srv.o: 191336
libpq.a: 364782
libpq.so.5.18: 480944
libpgcommon.a: 584432
libpgcommon_shlib.a: 568724

Patch (small table):
unicode_norm.o: 86416
unicode_norm_srv.o: 138824
libpq.a: 364782
libpq.so.5.18: 423600
libpgcommon.a: 531952
libpgcommon_shlib.a: 516244

Without:
unicode_norm.o: 79488
unicode_norm_srv.o: 165504
libpq.a: 364782
libpq.so.5.18: 419288
libpgcommon.a: 525024
libpgcommon_shlib.a: 509316

[1] 
https://www.postgresql.org/message-id/flat/7cac7e66-9a3b-4e3f-a997-42aa0c401f80%40gmail.com

--
Best regards,
Alexander Borisov
