Hi, hackers!
As promised, I'm continuing to improve and speed up Unicode handling
in Postgres.
Last time, we improved the lower(), upper(), and casefold()
functions. [1]
Now it's time for Unicode Normalization Forms, specifically
the normalize() function.
The attachment contains three patches:
1. Moves the functions that build a range index into a separate
Perl module.
2-small. The mapping tables are smaller, at the cost of more branches
when searching for the desired value.
2-big. The tables are larger, but fewer branches are needed to find
the desired value.
In other words, we have to choose between the two: either the
small-table or the big-table variant. A rough sketch of the lookup
trade-off follows.
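To show what that choice means in practice, here is a simplified
sketch (the table layout, names, and types are mine, not the actual
generated code): the small variant binary-searches many compact
ranges, while the big variant spends the extra bytes to index almost
directly.

#include <stdint.h>

/* Hypothetical range-index entry: maps a contiguous codepoint range
 * to a block of consecutive slots in the main table. */
typedef struct
{
    uint32_t    first;      /* first codepoint of the range */
    uint32_t    last;       /* last codepoint of the range */
    uint16_t    offset;     /* index of the range's first entry */
} range_entry;

/*
 * Small-table variant: binary search over many compact ranges.
 * Less static data, but more branches per lookup.
 */
static inline int
lookup_small(const range_entry *ranges, int nranges, uint32_t cp)
{
    int         lo = 0;
    int         hi = nranges - 1;

    while (lo <= hi)
    {
        int         mid = (lo + hi) / 2;

        if (cp < ranges[mid].first)
            hi = mid - 1;
        else if (cp > ranges[mid].last)
            lo = mid + 1;
        else
            return ranges[mid].offset + (cp - ranges[mid].first);
    }
    return -1;              /* no normalization data for cp */
}

/*
 * Big-table variant: a couple of hard-coded range checks that index
 * directly into wide (partly sparse) per-plane tables, where a
 * reserved value such as 0xFFFF marks "no data".  More static data,
 * but only a branch or two per lookup.
 */
static inline int
lookup_big(const uint16_t *plane0, const uint16_t *plane1, uint32_t cp)
{
    uint16_t    slot;

    if (cp <= 0xFFFF)
        slot = plane0[cp];
    else if (cp <= 0x1FFFF)
        slot = plane1[cp - 0x10000];
    else
        return -1;

    return (slot == 0xFFFF) ? -1 : slot;
}

Both variants return the same kind of index into the main table; the
only difference is how much static data backs the lookup versus how
many comparisons it costs.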
What did we manage to achieve?
1. The uint32 field that stored the Unicode codepoint was removed from
the main structure/table; it was only there for the sake of perfect
hashing.
2. Thanks to that, duplicate records could be eliminated, and the main
table shrank from 6843 to 3943 entries (see the sketch after this
list).
3. Duplicates were also removed from the table of decomposition
codepoints, shrinking it from 5138 to 4304 uint32 values.
4. These changes increase speed by up to 40% in some cases (benchmarks
below).
5. The server side got smaller in the binary, while the frontend grew
a little.
I read through the old commits, which say that frontend size is very
important and that speed there is not (speed matters on the server).
I'm not quite sure what to do if that is still the case; perhaps we
should keep the slower, smaller version for the frontend.
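If it comes to that, one option (purely a sketch on my side; the two
header names below do not exist) would be to pick the table variant at
compile time, since the shared code already distinguishes frontend
builds via the FRONTEND macro:

/* Hypothetical: keep the small tables for libpq and the other
 * frontend binaries, and the big, branch-light tables for the server.
 * FRONTEND is already defined when src/common files are built for
 * frontend code; the headers are made up for this sketch. */
#ifdef FRONTEND
#include "common/unicode_norm_table_small.h"
#else
#include "common/unicode_norm_table_big.h"
#endif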
Benchmarks.
How was it tested?
Four files were created, one for each normalization form: NFC, NFD,
NFKC, and NFKD.
The files were run through pgbench; they contain all the code points
that need to be normalized.
Unfortunately, the patches are already quite large, so I'm not
attaching these files here; if necessary, I can send them in a
separate email or upload them somewhere.
Legend.
Patch (big table): the larger mapping tables, with fewer branches for
searching.
Patch (small table): the smaller tables, with more branches for
searching.
Without: the original Postgres without the patches.
* Ubuntu 24.04.1 (Intel(R) Xeon(R) Gold 6140) (gcc version 13.3.0)
NFC:
Patch (big table): tps = 5057.624509
Patch (small table): tps = 4727.158048
Without: tps = 3268.654867
NFD:
Patch (big table): tps = 1145.730247
Patch (small table): tps = 1073.084633
Without: tps = 689.627239
NFKC:
Patch (big table): tps = 4821.700702
Patch (small table): tps = 4662.640629
Without: tps = 3450.254581
NFKD:
Patch (big table): tps = 1142.644341
Patch (small table): tps = 951.405893
Without: tps = 636.161496
How the size of object files and libraries has changed (in bytes):
Patch (big table):
unicode_norm.o: 138896
unicode_norm_srv.o: 191336
libpq.a: 364782
libpq.so.5.18: 480944
libpgcommon.a: 584432
libpgcommon_shlib.a: 568724
Patch (small table):
unicode_norm.o: 86416
unicode_norm_srv.o: 138824
libpq.a: 364782
libpq.so.5.18: 423600
libpgcommon.a: 531952
libpgcommon_shlib.a: 516244
Without:
unicode_norm.o: 79488
unicode_norm_srv.o: 165504
libpq.a: 364782
libpq.so.5.18: 419288
libpgcommon.a: 525024
libpgcommon_shlib.a: 509316
[1]
https://www.postgresql.org/message-id/flat/7cac7e66-9a3b-4e3f-a997-42aa0c401f80%40gmail.com
--
Best regards,
Alexander Borisov