Thread: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture

Hi,
we would like to propose the removal of the Instruction
Synchronization Barrier (isb) for aarch64 architectures. Based on our
testing on Graviton instances (m7g.16xlarge), we can see on average
over multiple iterations up to 12% better performance using PGBench
select-only and up to 9% with Sysbench oltp_read_only workloads. On
Graviton4 (m8g.24xlarge) results are up to 8% better using PGBench
select-only and up to 6% with Sysbench oltp_read_only workloads.
We have also tested it putting more pressure on the spin_delay
function, enabling pg_stat_statements.track_planning with PGBench
read-only [0] and, on average, the patch shows up to 27% better
performance on m6g.16xlarge and up to 37% on m7g.16xlarge.

Testing environment:
- PostgreSQL version: 17.2
- Operating System: Ubuntu 22.04
- Test Platform: AWS Graviton instances (m6g.16xlarge, m7g.16xlarge
and m8g.24xlarge)

Our benchmark results on PGBench select-only without
pg_stat_statements.track_planning:
```
# Load DB on m7g.16xlarge
$ pgbench -i --fillfactor=90 --scale=5644 --host=172.31.32.85
--username=postgres pgtest

# Without patch
$ pgbench --host 172.31.32.85 --username=postgres --protocol=prepared
-P 10 -b select-only --time=600 --client=256 --jobs=96 pgtest
...
    "transaction type: <builtin: select only>",
    "scaling factor: 5644",
    "query mode: prepared",
    "number of clients: 256",
    "number of threads: 96",
    "duration: 600 s",
    "number of transactions actually processed: 359864937",
    "latency average = 0.420 ms",
    "latency stddev = 1.755 ms",
    "tps = 599770.727316 (including connections establishing)",
    "tps = 599826.788919 (excluding connections establishing)"


# With patch
$ pgbench --host 172.31.32.85 --username=postgres --protocol=prepared
-P 10 -b select-only --time=600 --client=256 --jobs=96 pgtest
...
    "transaction type: <builtin: select only>",
    "scaling factor: 5644",
    "query mode: prepared",
    "number of clients: 256",
    "number of threads: 96",
    "duration: 600 s",
    "number of transactions actually processed: 405891881",
    "latency average = 0.371 ms",
    "latency stddev = 0.569 ms",
    "tps = 676480.900049 (including connections establishing)",
    "tps = 676523.557293 (excluding connections establishing)"
```

[0] https://www.postgresql.org/message-id/ZxgDEb_VpWyNZKB_%40nathan

Attachment
On Wed, Apr 30, 2025 at 4:53 AM Salvatore Dipietro
<dipietro.salvatore@gmail.com> wrote:
> we would like to propose the removal of the Instruction
> Synchronization Barrier (isb) for aarch64 architectures. Based on our
> testing on Graviton instances (m7g.16xlarge), we can see on average
> over multiple iterations up to 12% better performance using PGBench
> select-only and up to 9% with Sysbench oltp_read_only workloads. On
> Graviton4 (m8g.24xlarge) results are up to 8% better using PGBench
> select-only and up to 6% with Sysbench oltp_read_only workloads.
> We have also tested it putting more pressure on the spin_delay
> function, enabling pg_stat_statements.track_planning with PGBench
> read-only [0] and, on average, the patch shows up to 27% better
> performance on m6g.16xlarge and up to 37% on m7g.16xlarge.

Hmm. This was added only 3 years ago, supposedly because it made
performance better:

commit a82a5eee314df52f3183cedc0ecbcac7369243b1
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Wed Apr 6 18:57:57 2022 -0400

    Use ISB as a spin-delay instruction on ARM64.

    This seems beneficial on high-core-count machines, and not harmful
    on lesser hardware.  However, older ARM32 gear doesn't have this
    instruction, so restrict the patch to ARM64.

    Geoffrey Blake

    Discussion:
https://postgr.es/m/78338F29-9D7F-4DC8-BD71-E9674CE71425@amazon.com

I think you should make some kind of argument about why the previous
conclusion was wrong, or why something's changed between then and now.

--
Robert Haas
EDB: http://www.enterprisedb.com



Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Apr 30, 2025 at 4:53 AM Salvatore Dipietro
> <dipietro.salvatore@gmail.com> wrote:
>> we would like to propose the removal of the Instruction
>> Synchronization Barrier (isb) for aarch64 architectures. Based on our
>> testing on Graviton instances (m7g.16xlarge), we can see on average
>> over multiple iterations up to 12% better performance using PGBench
>> select-only and up to 9% with Sysbench oltp_read_only workloads. On
>> Graviton4 (m8g.24xlarge) results are up to 8% better using PGBench
>> select-only and up to 6% with Sysbench oltp_read_only workloads.
>> We have also tested it putting more pressure on the spin_delay
>> function, enabling pg_stat_statements.track_planning with PGBench
>> read-only [0] and, on average, the patch shows up to 27% better
>> performance on m6g.16xlarge and up to 37% on m7g.16xlarge.

> I think you should make some kind of argument about why the previous
> conclusion was wrong, or why something's changed between then and now.

TBH, those numbers are large enough that I flat out don't believe
them.  As noted in the previous thread, we've managed to squeeze out
a lot of our dependencies on spinlock performance, via algorithmic
changes, migration to atomic ops, etc.  So I think ten-percent-ish
improvement on a plain pgbench test just isn't very plausible.
We certainly didn't see that kind of effect when we were doing that
earlier round of testing --- we had to use a custom testing lashup
to get numbers that were outside the noise at all.

Of course, microbenchmarking is a tricky business, so it's possible
that a different custom testing lashup would show the opposite
results.  But what's quoted above is sufficiently unlike our prior
results that I can't help thinking something is wrong.

One other thing that comes to mind is that pg_stat_statements
has stretched the intention of "short straight-line code segment"
to the point of unrecognizability.  Maybe it needs to stop using
spinlocks to protect pgssEntry updates.  But I'm not sure if that
would move the needle on whether ISB is a good idea or not.

            regards, tom lane



On Thu, May 01, 2025 at 02:48:59PM -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, Apr 30, 2025 at 4:53 AM Salvatore Dipietro
>> <dipietro.salvatore@gmail.com> wrote:
>>> we would like to propose the removal of the Instruction
>>> Synchronization Barrier (isb) for aarch64 architectures. Based on our
>>> testing on Graviton instances (m7g.16xlarge), we can see on average
>>> over multiple iterations up to 12% better performance using PGBench
>>> select-only and up to 9% with Sysbench oltp_read_only workloads. On
>>> Graviton4 (m8g.24xlarge) results are up to 8% better using PGBench
>>> select-only and up to 6% with Sysbench oltp_read_only workloads.
>>> We have also tested it putting more pressure on the spin_delay
>>> function, enabling pg_stat_statements.track_planning with PGBench
>>> read-only [0] and, on average, the patch shows up to 27% better
>>> performance on m6g.16xlarge and up to 37% on m7g.16xlarge.
> 
>> I think you should make some kind of argument about why the previous
>> conclusion was wrong, or why something's changed between then and now.
> 
> TBH, those numbers are large enough that I flat out don't believe
> them.  As noted in the previous thread, we've managed to squeeze out
> a lot of our dependencies on spinlock performance, via algorithmic
> changes, migration to atomic ops, etc.  So I think ten-percent-ish
> improvement on a plain pgbench test just isn't very plausible.
> We certainly didn't see that kind of effect when we were doing that
> earlier round of testing --- we had to use a custom testing lashup
> to get numbers that were outside the noise at all.

I'm not sure there's actually any hot spinlock involved in a regular
select-only pgbench (unless perhaps pg_stat_statements was enabled or
something).  But I do know pg_stat_statements.track_planning stresses the
spinlock code quite a bit, and commit 3d0b4b1 recently added a non-locking
initial test in AArch64's TAS_SPIN, so I wonder if the ISB is still
appropriate.  It'd be interesting to see the performance difference of
removing the ISB with and without commit 3d0b4b1 applied.

-- 
nathan



Nathan Bossart <nathandbossart@gmail.com> writes:
> ... commit 3d0b4b1 recently added a non-locking
> initial test in AArch64's TAS_SPIN, so I wonder if the ISB is still
> appropriate.  It'd be interesting to see the performance difference of
> removing the ISB with and without commit 3d0b4b1 applied.

Oh!  That's an excellent point.  The OP didn't mention if their tests
were done before or after 3d0b4b1, but that might well matter.

I still think pgbench is a very blunt tool for this type of testing,
though.  I recommend resurrecting the test_shm_mq-based hack discussed
in the prior thread and seeing what that shows.

            regards, tom lane



On Thu, May 01, 2025 at 04:08:06PM -0400, Tom Lane wrote:
> Nathan Bossart <nathandbossart@gmail.com> writes:
>> ... commit 3d0b4b1 recently added a non-locking
>> initial test in AArch64's TAS_SPIN, so I wonder if the ISB is still
>> appropriate.  It'd be interesting to see the performance difference of
>> removing the ISB with and without commit 3d0b4b1 applied.
> 
> Oh!  That's an excellent point.  The OP didn't mention if their tests
> were done before or after 3d0b4b1, but that might well matter.
> 
> I still think pgbench is a very blunt tool for this type of testing,
> though.  I recommend resurrecting the test_shm_mq-based hack discussed
> in the prior thread and seeing what that shows.

Well, I have interesting results.  This is all on a c8g.24xlarge (96 cores,
Neoverse-V2, Armv9.0-a).

For the first test_shm_mq test, I ran the following:

    SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 1);
    SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 2);
    SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 4);
    ...

This gave me the following results (values are in seconds):

            w/o 3d0b4b1     w/ 3d0b4b1
            ISB     no ISB  ISB     no ISB
    1       1.4     1.6     1.5     1.6
    2       2.1     2.0     2.1     2.1
    4       3.2     3.5     3.3     3.5
    8       7.4     8.1     7.2     8.4
    16      18.0    35.9    22.7    23.4
    32      35.7    85.6    53.7    49.5
    64      85.1    ?       147.6   100.1

For the second test_shm_mq test, I ran at higher concurrency, so I had to
reduce the loop counts:

    SELECT test_shm_mq_pipelined(16384, 'xyzzy', 100000, 32);
    ...

That gave me the following:

            w/o 3d0b4b1     w/ 3d0b4b1
            ISB     no ISB  ISB     no ISB
    32      0.4     0.8     0.5     0.6
    64      2.0     4.8     1.3     1.1
    128     6.1     29.3    7.5     2.1
    256     43.0    66.4    24.4    4.5

Finally, I ran the pgbench select-only test with
pg_stat_statements.track_planning enabled (values are in thousands of
transactions per second):

    w/o 3d0b4b1     w/ 3d0b4b1
    ISB     no ISB  ISB     no ISB
    71.4    67.4    538.2   891.2

So...

* The ISB does seem to have a positive effect without commit 3d0b4b1
  applied.

* With commit 3d0b4b1 applied, removing the ISB seems to have a positive
  effect at high concurrencies.  This is especially pronounced in the
  pgbench test.

* With commit 3d0b4b1 applied, removing the ISB doesn't change much at
  lower concurrencies, and there might even be a small regression.

* At mostly lower concurrencies, commit 3d0b4b1 actually seems to regress
  some test_shm_mq tests.  Removing the ISB instruction appears to help in
  some cases, but not all.

-- 
nathan



On Thu, 1 May 2025 at 13:08, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Oh!  That's an excellent point.  The OP didn't mention if their tests
> were done before or after 3d0b4b1, but that might well matter.

The benchmarks we conducted are based on REL_17_2 branch which do not
include the TAS_SPIN(lock) change for ARM yet.