RE: Popcount optimization using AVX512 - Mailing list pgsql-hackers
| From | Amonson, Paul D |
|---|---|
| Subject | RE: Popcount optimization using AVX512 |
| Date | |
| Msg-id | BL1PR11MB5304165E7A81E0107E70B069DC242@BL1PR11MB5304.namprd11.prod.outlook.com Whole thread Raw |
| In response to | Re: Popcount optimization using AVX512 (Nathan Bossart <nathandbossart@gmail.com>) |
| Responses |
Re: Popcount optimization using AVX512
|
| List | pgsql-hackers |
> -----Original Message-----
> From: Nathan Bossart <nathandbossart@gmail.com>
> Sent: Thursday, March 7, 2024 1:36 PM
> Subject: Re: Popcount optimization using AVX512
I will be splitting the request into 2 patches. I am attaching the first patch (refactoring only) and I updated the
commitfestentry to match this patch. I have a question however:
Do I need to wait for the refactor patch to be merged before I post the AVX portion of this feature in this thread?
> > + PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq -mavx512f])
>
> I'm curious why we need both -mavx512vpopcntdq and -mavx512f. On my
> machine, -mavx512vpopcntdq alone is enough to pass this test, so if there are
> other instructions required that need -mavx512f, then we might need to
> expand the test.
First, nice catch on the required flags to build! When I changed my algorithm, dependence on the -mavx512f flag was no
longerneeded, In the second patch (AVX specific) I will fix this.
> I still think it's worth breaking this change into at least 2 patches. In particular,
> I think there's an opportunity to do the refactoring into pg_popcnt_choose.c
> and pg_popcnt_x86_64_accel.c prior to adding the AVX512 stuff. These
> changes are likely straightforward, and getting them out of the way early
> would make it easier to focus on the more interesting changes. IMHO there
> are a lot of moving parts in this patch.
As stated above I am doing this in 2 patches. :)
> > +#undef HAVE__GET_CPUID_COUNT
> > +
> > +/* Define to 1 if you have immintrin. */ #undef HAVE__IMMINTRIN
>
> Is this missing HAVE__CPUIDEX?
Yes I missed it, I will include in the second patch (AVX specific) of the 2 patches.
> > uint64
> > -pg_popcount(const char *buf, int bytes)
> > +pg_popcount_slow(const char *buf, int bytes)
> > {
> > uint64 popcnt = 0;
> >
> > -#if SIZEOF_VOID_P >= 8
> > +#if SIZEOF_VOID_P == 8
> > /* Process in 64-bit chunks if the buffer is aligned. */
> > if (buf == (const char *) TYPEALIGN(8, buf))
> > {
> > @@ -311,7 +224,7 @@ pg_popcount(const char *buf, int bytes)
> >
> > buf = (const char *) words;
> > }
> > -#else
> > +#elif SIZEOF_VOID_P == 4
> > /* Process in 32-bit chunks if the buffer is aligned. */
> > if (buf == (const char *) TYPEALIGN(4, buf))
> > {
>
> Apologies for harping on this, but I'm still not seeing the need for these
> SIZEOF_VOID_P changes. While it's unlikely that this makes any practical
> difference, I see no reason to more strictly check SIZEOF_VOID_P here.
I got rid of the second occurrence as I agree it is not needed but unless you see something I don't how to know which
functionto call between a 32-bit and 64-bit architecture? Maybe I am missing something obvious? What exactly do you
suggesthere? I am happy to always call either pg_popcount32() or pg_popcount64() with the understanding that it may not
beoptimal, but I do need to know which to use.
> > + /* Process any remaining bytes */
> > + while (bytes--)
> > + popcnt += pg_number_of_ones[(unsigned char) *buf++];
> > + return popcnt;
> > +#else
> > + return pg_popcount_slow(buf, bytes);
> > +#endif /* USE_AVX512_CODE */
>
> nitpick: Could we call pg_popcount_slow() in a common section for these
> "remaining bytes?"
Agreed, will fix in the second patch as well.
> > +#if defined(_MSC_VER)
> > + pg_popcount_indirect = pg_popcount512_fast; #else
> > + pg_popcount = pg_popcount512_fast; #endif
> These _MSC_VER sections are interesting. I'm assuming this is the
> workaround for the MSVC linking issue you mentioned above. I haven't
> looked too closely, but I wonder if the CRC32C code (see
> src/include/port/pg_crc32c.h) is doing something different to avoid this issue.
Using the latest master branch, I see what the needed changes are, I will implement using PGDLLIMPORT macro in the
secondpatch.
> Upthread, Alvaro suggested a benchmark [0] that might be useful. I scanned
> through this thread and didn't see any recent benchmark results for the latest
> form of the patch. I think it's worth verifying that we are still seeing the
> expected improvements.
I will get new benchmarks using the same process I used before (from Akash) so I get apples to apples. These are
pendingcompletion of the second patch which is still in progress.
Just a reminder, I asked questions above about 1) multi-part dependent patches and, 2) What specifically to do about
theSIZE_VOID_P checks. :)
Thanks,
Paul
Attachment
pgsql-hackers by date: