As promised, here is the refactored C code for Unicode Normalization Forms.
In general terms, here's what has changed:
1. Recursion has been removed; the data is now generated by a Perl script.
2. Memory is no longer allocated as uint32 for the entire size; instead, uint8 is allocated for the entire size for the CCC cache, which boosts performance significantly.
3. The code for the unicode_normalize() function has been completely rewritten.
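To illustrate the CCC cache idea, here is a minimal sketch. It is not the code from the patch. It assumes the cache is a uint8 array parallel to the code point buffer, filled once, so that canonical reordering compares cached classes instead of repeating table lookups. All demo_* names are hypothetical, and demo_get_ccc() is a tiny stand-in for the generated tables.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the generated lookup tables: it knows only
 * three combining marks and treats everything else as a starter (CCC 0).
 */
static uint8_t
demo_get_ccc(uint32_t cp)
{
	switch (cp)
	{
		case 0x0301:			/* COMBINING ACUTE ACCENT */
			return 230;
		case 0x0323:			/* COMBINING DOT BELOW */
			return 220;
		case 0x0327:			/* COMBINING CEDILLA */
			return 202;
		default:
			return 0;
	}
}

/* Fill the cache: one byte per code point, one lookup per code point. */
static void
demo_fill_ccc_cache(const uint32_t *cps, uint8_t *ccc, size_t n)
{
	for (size_t i = 0; i < n; i++)
		ccc[i] = demo_get_ccc(cps[i]);
}

/*
 * Canonical ordering as a stable insertion sort. Only the cached classes
 * are compared. Starters (CCC 0) never move and nothing moves across them.
 */
static void
demo_canonical_order(uint32_t *cps, uint8_t *ccc, size_t n)
{
	for (size_t i = 1; i < n; i++)
	{
		uint32_t	cp = cps[i];
		uint8_t		cls = ccc[i];
		size_t		j = i;

		if (cls == 0)
			continue;

		/* A preceding starter has class 0, so the loop stops there. */
		while (j > 0 && ccc[j - 1] > cls)
		{
			cps[j] = cps[j - 1];
			ccc[j] = ccc[j - 1];
			j--;
		}
		cps[j] = cp;
		ccc[j] = cls;
	}
}
```

With the input U+0061 U+0301 U+0323, the sketch swaps the two marks, because the dot below (class 220) sorts before the acute accent (class 230).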
I am confident that we have achieved excellent results.
Hey.
I've looked into these patches.
The patches apply, compilation succeeds, and make check and make installcheck show no errors.
Code quality is good, although I suggest that a native English speaker review the comments and commit messages, which are a bit difficult to follow.
The Sparse Array approach is described in the newly introduced GenerateSparseArray.pm module. Perhaps it would be valuable to add a section to src/common/unicode/README, where it would get more visibility. (Not insisting here.)
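For readers who have not opened the module, here is a generic sketch of what a sparse array lookup can look like. It is a plain two-level table, not the layout produced by GenerateSparseArray.pm. The block size, the table contents and all demo_* names are assumptions made up for illustration, and the values are toy data rather than real Unicode properties.

```c
#include <stdint.h>

#define DEMO_BLOCK_BITS 4
#define DEMO_BLOCK_SIZE (1 << DEMO_BLOCK_BITS)
#define DEMO_MAX_CP     0x3F	/* toy range, four blocks of 16 entries */

/*
 * Data blocks. Block 0 is all zeroes and is shared by every range that
 * has no interesting values, which is where the space saving comes from.
 */
static const uint8_t demo_blocks[2][DEMO_BLOCK_SIZE] = {
	{0},
	{0, 230, 0, 220}			/* remaining entries are zero */
};

/* First level: maps the high bits of a code point to a data block. */
static const uint8_t demo_index[(DEMO_MAX_CP >> DEMO_BLOCK_BITS) + 1] = {
	0, 0, 1, 0
};

/* Two array accesses, no search and no recursion. */
static uint8_t
demo_lookup(uint32_t cp)
{
	if (cp > DEMO_MAX_CP)
		return 0;
	return demo_blocks[demo_index[cp >> DEMO_BLOCK_BITS]]
					  [cp & (DEMO_BLOCK_SIZE - 1)];
}
```

In this toy table only the range 0x20..0x2F points at a non-empty block; the other three ranges all reuse block 0.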
For performance testing I used the approach by Jeff Davis [1]. I prepared NFC and NFD files, loaded them into UNLOGGED tables, and measured normalize() calls.
CREATE UNLOGGED TABLE strings_nfd ( str text STORAGE PLAIN NOT NULL );
COPY strings_nfd FROM '/var/lib/postgresql/strings.nfd.txt';

CREATE UNLOGGED TABLE strings_nfc ( str text STORAGE PLAIN NOT NULL );
COPY strings_nfc FROM '/var/lib/postgresql/strings.nfc.txt';