Re: Improve the performance of Unicode Normalization Forms. - Mailing list pgsql-hackers
From | Alexander Borisov |
---|---|
Subject | Re: Improve the performance of Unicode Normalization Forms. |
Date | |
Msg-id | cffe5df6-b2de-4f72-ad22-023208409aa5@gmail.com Whole thread Raw |
In response to | Re: Improve the performance of Unicode Normalization Forms. (Jeff Davis <pgsql@j-davis.com>) |
Responses |
Re: Improve the performance of Unicode Normalization Forms.
|
List | pgsql-hackers |
09.08.2025 02:17, Jeff Davis пишет: > On Tue, 2025-07-08 at 22:42 +0300, Alexander Borisov wrote: >> Version 3 patches. In version 2 "make -s headerscheck" did not work. > > I ran my own performance tests. What I did was get some test data from > ICU v76.1 by doing: > [..] > Results with perfect hashing (100 iterations): > > Normalization from NFC to NFD with PG: 010.009 > Normalization from NFC to NFD with ICU: 001.580 > Normalization from NFC to NFKD with PG: 009.376 > Normalization from NFC to NFKD with ICU: 000.857 > Normalization from NFD to NFC with PG: 016.026 > Normalization from NFD to NFC with ICU: 001.205 > Normalization from NFD to NFKC with PG: 015.903 > Normalization from NFD to NFKC with ICU: 000.654 > > Results with your code (100 iterations): > > Normalization from NFC to NFD with PG: 004.626 > Normalization from NFC to NFD with ICU: 001.577 > Normalization from NFC to NFKD with PG: 004.024 > Normalization from NFC to NFKD with ICU: 000.861 > Normalization from NFD to NFC with PG: 006.846 > Normalization from NFD to NFC with ICU: 001.209 > Normalization from NFD to NFKC with PG: 006.655 > Normalization from NFD to NFKC with ICU: 000.651 > > Your patches are a major improvement, but I'm trying to figure out why > ICU still wins by so much. Thoughts? I didn't investigate much myself > yet, so it's quite possible there's a bug in my test or something. I had some experimented a bit, and took a look at the ICU code. 1. The performance test is not entirely accurate. The thing is that in ICU (unorm_normalize()), we pass the output buffer and its size. That is, ICU does not calculate how much data we will have at the output; we pass it our buffer. ICU simply checks whether the data fits into the buffer or not. In our case, PG (unicode_normalize()) only accepts the input buffer and first calculates the size of the output buffer. This means we are doing double the work, as we have already gone through the input data at least once too many times. Here, it would be more honest to call the function for calculating the buffer in ICU, and only then call data normalization. 2. In PG, unnecessary table lookups (get_code_entry()) that can clearly be avoided. 3. In PG, the algorithm is not optimized. We could eliminate the decompose_code() function, which is called for each code point. In this function, we can remove recursion and prepare the data in advance. That is, we can perform data expansion in the Perl script. If we do this, we will have code that is in place and without recursion. And then there are all sorts of other minor details. Of course, there are other approaches. I did this comparison (your test, Jeff): 1. Without patch. Normalization from NFC to NFD with PG: 009.121 Normalization from NFC to NFD with ICU: 000.954 Normalization from NFC to NFKD with PG: 009.048 Normalization from NFC to NFKD with ICU: 000.965 Normalization from NFD to NFC with PG: 014.525 Normalization from NFD to NFC with ICU: 000.623 Normalization from NFD to NFKC with PG: 014.380 Normalization from NFD to NFKC with ICU: 000.666 2. With patch. Normalization from NFC to NFD with PG: 005.743 Normalization from NFC to NFD with ICU: 001.005 Normalization from NFC to NFKD with PG: 005.902 Normalization from NFC to NFKD with ICU: 000.963 Normalization from NFD to NFC with PG: 007.837 Normalization from NFD to NFC with ICU: 000.640 Normalization from NFD to NFKC with PG: 008.036 Normalization from NFD to NFKC with ICU: 000.667 3. With a patch, but! With direct access to tables, i.e., I created one large table (index table, unit16_t) where there is no search, direct access to data. Normalization from NFC to NFD with PG: 002.792 Normalization from NFC to NFD with ICU: 000.953 Normalization from NFC to NFKD with PG: 002.865 Normalization from NFC to NFKD with ICU: 000.958 Normalization from NFD to NFC with PG: 004.361 Normalization from NFD to NFC with ICU: 000.651 Normalization from NFD to NFKC with PG: 004.474 Normalization from NFD to NFKC with ICU: 000.668 It is evident that direct access provides x2 from the patch, but adds +1.5MB to the object file size. This is just to check the time difference in access. As a result, I would not look into ICU at the moment, especially since we have a different approach. I am currently working on optimizing unicode_normalize(). I am trying to come up with an improved version of the algorithm in C by the next commitfest (which will be in September). P.S.: In general, I strive to surpass ICU in terms of speed. I think we're making good progress. Let's see what happens. -- Regards, Alexander Borisov
pgsql-hackers by date: