Re: [POC] verifying UTF-8 using SIMD instructions - Mailing list pgsql-hackers

From John Naylor
Subject Re: [POC] verifying UTF-8 using SIMD instructions
Date
Msg-id CAFBsxsFU7C5cHCLfERcf+nNTvCJcW-hBboJP4shwKVvm-qegbA@mail.gmail.com
In response to Re: [POC] verifying UTF-8 using SIMD instructions  (John Naylor <john.naylor@enterprisedb.com>)
List pgsql-hackers


I wrote:
>
> On Mon, Feb 8, 2021 at 6:17 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> One of his earlier demos [1] (in simdutf8check.h) had a version that used mostly SSE2 with just three intrinsics from SSSE3. That's widely available by now. He measured that at 0.7 cycles per byte, which is still good compared to AVX2 0.45 cycles per byte [2].
>
> Testing for three SSSE3 intrinsics in autoconf is pretty easy. I would assume that if that check (and the corresponding runtime check) passes, we can assume SSE2. That code has three licenses to choose from -- Apache 2, Boost, and MIT. Something like that might be straightforward to start from. I think the only obstacles to worry about are license and getting it to fit into our codebase. Adding more than zero high-level comments with a good description of how it works in detail is also a bit of a challenge.

I double-checked, and it's actually two SSSE3 intrinsics and one SSE4.1 intrinsic, but the SSE4.1 one can be emulated with a few SSE2 intrinsics. Alternatively, we could probably fold all three into the SSE4.2 CRC check and have a single symbol, which would save on boilerplate.
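To illustrate the kind of emulation meant here, below is a minimal sketch. It assumes the SSE4.1 intrinsic in question is _mm_testz_si128 (the message above doesn't name it), and the helper name testz_sse2 is made up for this example. It only shows that an "is the AND of two vectors all zero?" test can be written with SSE2 alone; it is not taken from the attached patch.

```c
#include <emmintrin.h>			/* SSE2 intrinsics */

/*
 * Hypothetical SSE2-only stand-in for SSE4.1's _mm_testz_si128(a, b).
 * Returns 1 if (a & b) has no bits set, otherwise 0.
 */
static inline int
testz_sse2(__m128i a, __m128i b)
{
	/* bitwise AND of the two inputs */
	__m128i		masked = _mm_and_si128(a, b);

	/* each byte becomes 0xFF where the masked byte is zero, else 0x00 */
	__m128i		iszero = _mm_cmpeq_epi8(masked, _mm_setzero_si128());

	/* movemask collects the top bit of all 16 bytes into a 16-bit mask */
	return _mm_movemask_epi8(iszero) == 0xFFFF;
}
```

This costs three instructions instead of one, which is likely negligible if the test only runs once per input, at the end of validation.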

I hacked that demo [1] into wchar.c (very ugly patch attached), and got the following:

master:

 mixed | ascii
-------+-------
   757 |   366

Lemire demo:

 mixed | ascii
-------+-------
   172 |   168

This one lacks an ASCII fast path, but the AVX2 version in the same file has one that could probably be adapted easily. With that, I think this would be worth adapting to our codebase and license. Thoughts?
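For readers unfamiliar with the idea, here is a rough sketch of what an ASCII fast path does: before running the full validator on a chunk, check whether any byte has its high bit set, and if none does, skip the chunk. The sketch uses plain 64-bit words rather than SIMD registers so it stays portable; the function name is_ascii_chunked is hypothetical, and this is not the code from the AVX2 version mentioned above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 if all len bytes are 7-bit ASCII
 * (no byte has its high bit set), otherwise 0.
 */
static int
is_ascii_chunked(const unsigned char *s, size_t len)
{
	/* examine 8 bytes at a time */
	while (len >= sizeof(uint64_t))
	{
		uint64_t	chunk;

		/* memcpy avoids unaligned-access and aliasing problems */
		memcpy(&chunk, s, sizeof(chunk));

		/* any high bit set means a non-ASCII byte is in this chunk */
		if (chunk & UINT64_C(0x8080808080808080))
			return 0;

		s += sizeof(chunk);
		len -= sizeof(chunk);
	}

	/* handle the remaining 0-7 bytes one at a time */
	while (len > 0)
	{
		if (*s & 0x80)
			return 0;
		s++;
		len--;
	}

	return 1;
}
```

In a real validator, a chunk that fails this test would be handed to the full UTF-8 check rather than rejected, since a set high bit only means the chunk is not pure ASCII.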

The advantage of this demo is that it's not buried in a mountain of modern C++.
 
Simdjson can use AVX -- do you happen to know which target it got compiled to? AVX vectors are 256 bits wide, and using them requires OS support. The OSes we care most about were updated 8-12 years ago, but that would still be something to check, in addition to more configure checks.
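As a sketch of what such a runtime check could look like: the CPU must report AVX and OSXSAVE via CPUID, and the OS must have enabled saving of the XMM and YMM registers, which is visible in XCR0 via XGETBV. The function name avx2_usable is hypothetical, and the code assumes GCC or Clang (for <cpuid.h> and inline assembly). On non-x86 targets it simply reports no support.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>				/* GCC/Clang CPUID helpers */
#endif

/*
 * Hypothetical helper: returns 1 if AVX2 is supported by the CPU and the
 * OS has enabled the 256-bit register state, otherwise 0.
 */
static int
avx2_usable(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;
	uint32_t	xcr0_lo, xcr0_hi;

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return 0;

	/* CPUID.1:ECX bit 27 = OSXSAVE, bit 28 = AVX */
	if ((ecx & (1u << 27)) == 0 || (ecx & (1u << 28)) == 0)
		return 0;

	/* read XCR0; this is only safe once OSXSAVE is known to be set */
	__asm__ volatile ("xgetbv" : "=a" (xcr0_lo), "=d" (xcr0_hi) : "c" (0));
	(void) xcr0_hi;

	/* XCR0 bit 1 = XMM state, bit 2 = YMM state; both must be enabled */
	if ((xcr0_lo & 0x6) != 0x6)
		return 0;

	/* CPUID.(EAX=7,ECX=0):EBX bit 5 = AVX2 */
	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return 0;

	return (ebx & (1u << 5)) != 0;
#else
	return 0;
#endif
}
```

The 128-bit SSE-family instructions don't need the XCR0 step, which is one reason the SSSE3/SSE4.1 route would need less checking than AVX2.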

