Re: define pg_structiszero(addr, s, r) - Mailing list pgsql-hackers

From Ranier Vilela
Subject Re: define pg_structiszero(addr, s, r)
Date
Msg-id CAEudQAqGprE1QjctScmJqpSM_vgqOmKsChMiPUeKX1R8MrPZ-A@mail.gmail.com
In response to Re: define pg_structiszero(addr, s, r)  (David Rowley <dgrowleyml@gmail.com>)
List pgsql-hackers


On Wed, Nov 13, 2024 at 04:50, Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote:
Hi,

On Wed, Nov 13, 2024 at 09:25:37AM +0900, Michael Paquier wrote:
> So that seems worth the addition, especially for
> smaller sizes where this is 6 times faster here.

So, something like v12 in pg_memory_is_all_zeros_v12() in allzeros_small.c
attached?

I ran the latest version (allzeros_small.c) with v12.
 

If so, that gives us:

== with BLCKSZ 32

$ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
byte per byte: done in 22421 nanoseconds
size_t: done in 7269 nanoseconds (3.08447 times faster than byte per byte)
SIMD v10: done in 6349 nanoseconds (3.53142 times faster than byte per byte)
SIMD v11: done in 22080 nanoseconds (1.01544 times faster than byte per byte)
SIMD v12: done in 5595 nanoseconds (4.00733 times faster than byte per byte)
$ gcc -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
byte per byte: done in 43882 nanoseconds
size_t: done in 8845 nanoseconds (4.96122 times faster than byte per byte)
SIMD v10: done in 10673 nanoseconds (4.1115 times faster than byte per byte)
SIMD v11: done in 29177 nanoseconds (1.50399 times faster than byte per byte)
SIMD v12: done in 9992 nanoseconds (4.39171 times faster than byte per byte)


== with BLCKSZ 63

$ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
byte per byte: done in 29525 nanoseconds
size_t: done in 11232 nanoseconds (2.62865 times faster than byte per byte)
SIMD v10: done in 10828 nanoseconds (2.72673 times faster than byte per byte)
SIMD v11: done in 42056 nanoseconds (0.70204 times faster than byte per byte)
SIMD v12: done in 10468 nanoseconds (2.8205 times faster than byte per byte)
$ gcc -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
byte per byte: done in 68887 nanoseconds
size_t: done in 20147 nanoseconds (3.41922 times faster than byte per byte)
SIMD v10: done in 21410 nanoseconds (3.21752 times faster than byte per byte)
SIMD v11: done in 56987 nanoseconds (1.20882 times faster than byte per byte)
SIMD v12: done in 25102 nanoseconds (2.74428 times faster than byte per byte)
 

== with BLCKSZ 256

$ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
byte per byte: done in 120483 nanoseconds
size_t: done in 23098 nanoseconds (5.21617 times faster than byte per byte)
SIMD v10: done in 6737 nanoseconds (17.8838 times faster than byte per byte)
SIMD v11: done in 6621 nanoseconds (18.1971 times faster than byte per byte)
SIMD v12: done in 6519 nanoseconds (18.4818 times faster than byte per byte)
$ gcc -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
byte per byte: done in 211759 nanoseconds
size_t: done in 45879 nanoseconds (4.6156 times faster than byte per byte)
SIMD v10: done in 12262 nanoseconds (17.2695 times faster than byte per byte)
SIMD v11: done in 12018 nanoseconds (17.6202 times faster than byte per byte)
SIMD v12: done in 11993 nanoseconds (17.6569 times faster than byte per byte)
 

== with BLCKSZ 8192

$ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
byte per byte: done in 3393459 nanoseconds
size_t: done in 707304 nanoseconds (4.79774 times faster than byte per byte)
SIMD v10: done in 233559 nanoseconds (14.5293 times faster than byte per byte)
SIMD v11: done in 225951 nanoseconds (15.0186 times faster than byte per byte)
SIMD v12: done in 225766 nanoseconds (15.0309 times faster than byte per byte)
$ gcc -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
byte per byte: done in 12786295 nanoseconds
size_t: done in 1071590 nanoseconds (11.9321 times faster than byte per byte)
SIMD v10: done in 413219 nanoseconds (30.9431 times faster than byte per byte)
SIMD v11: done in 423469 nanoseconds (30.1942 times faster than byte per byte)
SIMD v12: done in 414106 nanoseconds (30.8769 times faster than byte per byte)
 

That's better for small sizes, but given the extra len checks that
have been added, I think we're back to David's point in [1]: what if the function
is not inlined for some reason?

So, out of curiosity, let's see what happens when it is not inlined, in [2] (see the
-O2 -DNOT_INLINE compiler window):

- if a[3]: it looks like gcc is smart enough to create an optimized version
for that size using constant propagation
- if a[63]: Same as above
- if a[256]: Same as above
- if a[8192]: Same as above

I did a quick check with clang, and it looks like it is not as smart as gcc
in the non-inlined case.
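
For anyone who wants to reproduce the non-inlined case locally, here is a minimal sketch. It assumes that the NOT_INLINE macro from the compiler flags above simply switches on a noinline attribute. The names ZCHECK_ATTR, bytes_are_zero, and check_a are hypothetical and are not taken from allzeros_small.c or from [2].

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumption: building with -DNOT_INLINE forbids inlining of the check. */
#if defined(NOT_INLINE) && (defined(__GNUC__) || defined(__clang__))
#define ZCHECK_ATTR __attribute__((noinline))
#else
#define ZCHECK_ATTR
#endif

/* Hypothetical stand-in for the function under test: plain byte loop. */
static ZCHECK_ATTR bool
bytes_are_zero(const void *ptr, size_t len)
{
    const unsigned char *p = ptr;

    for (size_t i = 0; i < len; i++)
    {
        if (p[i] != 0)
            return false;
    }
    return true;
}

/* File-scope array, so it starts out zero-initialized. */
static unsigned char a[63];

/*
 * Single call site with a compile-time constant length. Even when the
 * callee is not inlined, a compiler may clone it and specialize the clone
 * for len == sizeof(a) (constant propagation).
 */
static bool
check_a(void)
{
    return bytes_are_zero(a, sizeof(a));
}
```

Compiling this with and without -DNOT_INLINE at -O2 and comparing the generated assembly should show whether a given compiler specializes the callee for the constant size.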

Anyway, it's not like we have a choice: we need (at least) one len check for
safety reasons (to avoid crashing or reading invalid data).
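
To illustrate why a len check is needed before reading in word-sized chunks, here is a minimal sketch. It is not the v12 code from allzeros_small.c: the function name is_all_zeros_sketch is hypothetical, and it uses memcpy-based size_t loads rather than SIMD.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch, not pg_memory_is_all_zeros_v12(). */
static bool
is_all_zeros_sketch(const void *ptr, size_t len)
{
    const unsigned char *p = ptr;
    const unsigned char *end = p + len;

    /*
     * Length guard: inputs shorter than one word are checked byte by byte,
     * so the word-sized loads below never read past the buffer.
     */
    if (len < sizeof(size_t))
    {
        while (p < end)
        {
            if (*p++ != 0)
                return false;
        }
        return true;
    }

    /* Word at a time; memcpy sidesteps alignment and aliasing concerns. */
    while ((size_t) (end - p) >= sizeof(size_t))
    {
        size_t word;

        memcpy(&word, p, sizeof(word));
        if (word != 0)
            return false;
        p += sizeof(word);
    }

    /* Remaining tail of fewer than sizeof(size_t) bytes. */
    while (p < end)
    {
        if (*p++ != 0)
            return false;
    }
    return true;
}
```

Without the first branch or the loop condition on the remaining length, a 3-byte input would be read as a full word, which is the kind of invalid read the len check prevents.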

So, I'd vote for pg_memory_is_all_zeros_v12() then, thoughts?

I think that's good enough.

best regards,
Ranier Vilela
