Re: Improve the performance of Unicode Normalization Forms. - Mailing list pgsql-hackers

From Alexander Borisov
Subject Re: Improve the performance of Unicode Normalization Forms.
Date
Msg-id cffe5df6-b2de-4f72-ad22-023208409aa5@gmail.com
In response to Re: Improve the performance of Unicode Normalization Forms.  (Jeff Davis <pgsql@j-davis.com>)
List pgsql-hackers
09.08.2025 02:17, Jeff Davis wrote:
> On Tue, 2025-07-08 at 22:42 +0300, Alexander Borisov wrote:
>> Version 3 patches. In version 2 "make -s headerscheck" did not work.
> 
> I ran my own performance tests. What I did was get some test data from
> ICU v76.1 by doing:
> 

[..]

> Results with perfect hashing (100 iterations):
> 
>    Normalization from NFC to  NFD with  PG: 010.009
>    Normalization from NFC to  NFD with ICU: 001.580
>    Normalization from NFC to NFKD with  PG: 009.376
>    Normalization from NFC to NFKD with ICU: 000.857
>    Normalization from NFD to  NFC with  PG: 016.026
>    Normalization from NFD to  NFC with ICU: 001.205
>    Normalization from NFD to NFKC with  PG: 015.903
>    Normalization from NFD to NFKC with ICU: 000.654
> 
> Results with your code (100 iterations):
> 
>    Normalization from NFC to  NFD with  PG: 004.626
>    Normalization from NFC to  NFD with ICU: 001.577
>    Normalization from NFC to NFKD with  PG: 004.024
>    Normalization from NFC to NFKD with ICU: 000.861
>    Normalization from NFD to  NFC with  PG: 006.846
>    Normalization from NFD to  NFC with ICU: 001.209
>    Normalization from NFD to NFKC with  PG: 006.655
>    Normalization from NFD to NFKC with ICU: 000.651
> 
> Your patches are a major improvement, but I'm trying to figure out why
> ICU still wins by so much. Thoughts? I didn't investigate much myself
> yet, so it's quite possible there's a bug in my test or something.

I experimented a bit and took a look at the ICU code.

1. The performance test is not entirely accurate.

The thing is that in ICU (unorm_normalize()) we pass in the output
buffer and its size.
ICU therefore does not have to calculate how much output the data
will produce; it writes into the buffer we give it and simply checks
whether the result fits.

In our case, PG (unicode_normalize()) accepts only the input buffer
and first calculates the size of the output buffer itself.
This means we do roughly double the work: we make at least one extra
pass over the input data.

A fairer comparison would first ask ICU to calculate the required
buffer size and only then call the normalization itself, as sketched
below.
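
Roughly, the fairer sequence would look like this (just a sketch, not
the actual test harness; the function name and error handling are
illustrative, only unorm_normalize() and its standard preflighting
convention are real):

/*
 * Sketch: preflight ICU to compute the required output length before
 * normalizing, so ICU also pays for a sizing pass over the input, the
 * same way unicode_normalize() in PG does.
 */
#include <stdlib.h>
#include <unicode/utypes.h>
#include <unicode/unorm.h>

static UChar *
normalize_nfd_with_preflight(const UChar *src, int32_t src_len,
                             int32_t *dst_len)
{
    UErrorCode  status = U_ZERO_ERROR;
    UChar      *dst;
    int32_t     needed;

    /* First pass: no output buffer, ICU reports the required length. */
    needed = unorm_normalize(src, src_len, UNORM_NFD, 0, NULL, 0, &status);
    if (status != U_BUFFER_OVERFLOW_ERROR && U_FAILURE(status))
        return NULL;

    dst = malloc(sizeof(UChar) * needed);
    if (dst == NULL)
        return NULL;

    /* Second pass: the actual normalization into the sized buffer. */
    status = U_ZERO_ERROR;
    *dst_len = unorm_normalize(src, src_len, UNORM_NFD, 0,
                               dst, needed, &status);
    if (U_FAILURE(status))
    {
        free(dst);
        return NULL;
    }
    return dst;
}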

2. In PG, there are unnecessary table lookups (get_code_entry()) that
    can clearly be avoided.
3. In PG, the algorithm itself is not optimized.

We could eliminate the decompose_code() function, which is called
for each code point. Its recursion can be removed by preparing the
data in advance, that is, by performing the recursive expansion in
the Perl script that generates the tables. If we do that, the
decompositions are already fully expanded and the C code can copy
them in place, without recursion (a sketch of the idea follows).
And then there are all sorts of other minor details.
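
To illustrate the idea (this is not the actual PG code; the table
layout and names below are hypothetical): if the generator script has
already applied decompositions recursively, the C side only needs a
flat copy of the final sequence.

/*
 * Hypothetical layout, only to illustrate pre-expanded decompositions:
 * each entry points at the fully expanded code point sequence produced
 * by the Perl generator, so there is no decompose_code() recursion at
 * run time, just a copy.
 */
#include <stddef.h>
#include <stdint.h>

typedef struct
{
    uint32_t    codepoint;      /* the character being decomposed */
    uint8_t     dec_size;       /* number of code points in the expansion */
    uint16_t    dec_index;      /* offset into expanded_decomp[] */
} ExpandedDecompEntry;

extern const uint32_t expanded_decomp[];    /* generated by the script */

static size_t
copy_decomposition(const ExpandedDecompEntry *entry, uint32_t *out)
{
    /* Flat copy of the already fully expanded decomposition. */
    for (int i = 0; i < entry->dec_size; i++)
        out[i] = expanded_decomp[entry->dec_index + i];
    return entry->dec_size;
}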

Of course, there are other approaches.

I did this comparison (your test, Jeff):

1. Without patch.

Normalization from NFC to  NFD with  PG: 009.121
Normalization from NFC to  NFD with ICU: 000.954
Normalization from NFC to NFKD with  PG: 009.048
Normalization from NFC to NFKD with ICU: 000.965
Normalization from NFD to  NFC with  PG: 014.525
Normalization from NFD to  NFC with ICU: 000.623
Normalization from NFD to NFKC with  PG: 014.380
Normalization from NFD to NFKC with ICU: 000.666

2. With patch.

Normalization from NFC to  NFD with  PG: 005.743
Normalization from NFC to  NFD with ICU: 001.005
Normalization from NFC to NFKD with  PG: 005.902
Normalization from NFC to NFKD with ICU: 000.963
Normalization from NFD to  NFC with  PG: 007.837
Normalization from NFD to  NFC with ICU: 000.640
Normalization from NFD to NFKC with  PG: 008.036
Normalization from NFD to NFKC with ICU: 000.667

3. With the patch, but with direct access to the tables: I created one
    large index table (uint16_t) so there is no search at all, only
    direct access to the data.

Normalization from NFC to  NFD with  PG: 002.792
Normalization from NFC to  NFD with ICU: 000.953
Normalization from NFC to NFKD with  PG: 002.865
Normalization from NFC to NFKD with ICU: 000.958
Normalization from NFD to  NFC with  PG: 004.361
Normalization from NFD to  NFC with ICU: 000.651
Normalization from NFD to NFKC with  PG: 004.474
Normalization from NFD to NFKC with ICU: 000.668

It is evident that direct access gives roughly another 2x over the
patch, but adds about 1.5MB to the object file size.
This was only done to check the difference in lookup time; a sketch of
the idea is below.
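
Roughly what I mean by "direct access" (a sketch only, not the
experimental code itself; the names, sizes and entry layout are made
up for illustration):

/*
 * One large uint16_t index covering the whole code point range maps
 * each code point straight to its entry, so get_code_entry()'s search
 * is replaced by a single array read, at the cost of a much larger
 * table.
 */
#include <stdint.h>

#define UNICODE_LAST_CODEPOINT 0x10FFFF

typedef struct
{
    uint8_t     comb_class;     /* canonical combining class */
    uint8_t     dec_size_flags; /* decomposition size and flags */
    uint16_t    dec_index;      /* offset into the decomposition data */
} entry_t;                      /* stand-in for the real entry type */

extern const uint16_t direct_index[UNICODE_LAST_CODEPOINT + 1]; /* generated */
extern const entry_t  entries[];                                /* generated */

static inline const entry_t *
get_code_entry_direct(uint32_t code)
{
    uint16_t    idx;

    if (code > UNICODE_LAST_CODEPOINT)
        return NULL;

    idx = direct_index[code];   /* no search, one array read */
    if (idx == 0)               /* index 0 reserved for "no entry" */
        return NULL;

    return &entries[idx];
}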

As a result, I would not look to ICU at the moment, especially since
we take a different approach.
I am currently working on optimizing unicode_normalize().
I am trying to have an improved version of the algorithm in C ready
for the next commitfest (which will be in September).

P.S.: In general, I strive to surpass ICU in terms of speed.
I think we're making good progress. Let's see what happens.


--
Regards,
Alexander Borisov


