Thread: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture
Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture
From
Salvatore Dipietro
Date:
Hi, we would like to propose the removal of the Instruction Synchronization Barrier (isb) for aarch64 architectures. Based on our testing on Graviton instances (m7g.16xlarge), we can see on average over multiple iterations up to 12% better performance using PGBench select-only and up to 9% with Sysbench oltp_read_only workloads. On Graviton4 (m8g.24xlarge) results are up to 8% better using PGBench select-only and up to 6% with Sysbench oltp_read_only workloads. We have also tested it putting more pressure on the spin_delay function, enabling pg_stat_statements.track_planning with PGBench read-only [0] and, on average, the patch shows up to 27% better performance on m6g.16xlarge and up to 37% on m7g.16xlarge. Testing environment: - PostgreSQL version: 17.2 - Operating System: Ubuntu 22.04 - Test Platform: AWS Graviton instances (m6g.16xlarge, m7g.16xlarge and m8g.24xlarge) Our benchmark results on PGBench select-only without pg_stat_statements.track_planning: ``` # Load DB on m7g.16xlarge $ pgbench -i --fillfactor=90 --scale=5644 --host=172.31.32.85 --username=postgres pgtest # Without patch $ pgbench --host 172.31.32.85 --username=postgres --protocol=prepared -P 10 -b select-only --time=600 --client=256 --jobs=96 pgtest ... "transaction type: <builtin: select only>", "scaling factor: 5644", "query mode: prepared", "number of clients: 256", "number of threads: 96", "duration: 600 s", "number of transactions actually processed: 359864937", "latency average = 0.420 ms", "latency stddev = 1.755 ms", "tps = 599770.727316 (including connections establishing)", "tps = 599826.788919 (excluding connections establishing)" # With patch $ pgbench --host 172.31.32.85 --username=postgres --protocol=prepared -P 10 -b select-only --time=600 --client=256 --jobs=96 pgtest ... "transaction type: <builtin: select only>", "scaling factor: 5644", "query mode: prepared", "number of clients: 256", "number of threads: 96", "duration: 600 s", "number of transactions actually processed: 405891881", "latency average = 0.371 ms", "latency stddev = 0.569 ms", "tps = 676480.900049 (including connections establishing)", "tps = 676523.557293 (excluding connections establishing)" ``` [0] https://www.postgresql.org/message-id/ZxgDEb_VpWyNZKB_%40nathan
Attachment
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture
From
Robert Haas
Date:
On Wed, Apr 30, 2025 at 4:53 AM Salvatore Dipietro <dipietro.salvatore@gmail.com> wrote: > we would like to propose the removal of the Instruction > Synchronization Barrier (isb) for aarch64 architectures. Based on our > testing on Graviton instances (m7g.16xlarge), we can see on average > over multiple iterations up to 12% better performance using PGBench > select-only and up to 9% with Sysbench oltp_read_only workloads. On > Graviton4 (m8g.24xlarge) results are up to 8% better using PGBench > select-only and up to 6% with Sysbench oltp_read_only workloads. > We have also tested it putting more pressure on the spin_delay > function, enabling pg_stat_statements.track_planning with PGBench > read-only [0] and, on average, the patch shows up to 27% better > performance on m6g.16xlarge and up to 37% on m7g.16xlarge. Hmm. This was added only 3 years ago, supposedly because it made performance better: commit a82a5eee314df52f3183cedc0ecbcac7369243b1 Author: Tom Lane <tgl@sss.pgh.pa.us> Date: Wed Apr 6 18:57:57 2022 -0400 Use ISB as a spin-delay instruction on ARM64. This seems beneficial on high-core-count machines, and not harmful on lesser hardware. However, older ARM32 gear doesn't have this instruction, so restrict the patch to ARM64. Geoffrey Blake Discussion: https://postgr.es/m/78338F29-9D7F-4DC8-BD71-E9674CE71425@amazon.com I think you should make some kind of argument about why the previous conclusion was wrong, or why something's changed between then and now. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, Apr 30, 2025 at 4:53 AM Salvatore Dipietro > <dipietro.salvatore@gmail.com> wrote: >> we would like to propose the removal of the Instruction >> Synchronization Barrier (isb) for aarch64 architectures. Based on our >> testing on Graviton instances (m7g.16xlarge), we can see on average >> over multiple iterations up to 12% better performance using PGBench >> select-only and up to 9% with Sysbench oltp_read_only workloads. On >> Graviton4 (m8g.24xlarge) results are up to 8% better using PGBench >> select-only and up to 6% with Sysbench oltp_read_only workloads. >> We have also tested it putting more pressure on the spin_delay >> function, enabling pg_stat_statements.track_planning with PGBench >> read-only [0] and, on average, the patch shows up to 27% better >> performance on m6g.16xlarge and up to 37% on m7g.16xlarge. > I think you should make some kind of argument about why the previous > conclusion was wrong, or why something's changed between then and now. TBH, those numbers are large enough that I flat out don't believe them. As noted in the previous thread, we've managed to squeeze out a lot of our dependencies on spinlock performance, via algorithmic changes, migration to atomic ops, etc. So I think ten-percent-ish improvement on a plain pgbench test just isn't very plausible. We certainly didn't see that kind of effect when we were doing that earlier round of testing --- we had to use a custom testing lashup to get numbers that were outside the noise at all. Of course, microbenchmarking is a tricky business, so it's possible that a different custom testing lashup would show the opposite results. But what's quoted above is sufficiently unlike our prior results that I can't help thinking something is wrong. One other thing that comes to mind is that pg_stat_statements has stretched the intention of "short straight-line code segment" to the point of unrecognizability. Maybe it needs to stop using spinlocks to protect pgssEntry updates. But I'm not sure if that would move the needle on whether ISB is a good idea or not. regards, tom lane
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture
From
Nathan Bossart
Date:
On Thu, May 01, 2025 at 02:48:59PM -0400, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Wed, Apr 30, 2025 at 4:53 AM Salvatore Dipietro >> <dipietro.salvatore@gmail.com> wrote: >>> we would like to propose the removal of the Instruction >>> Synchronization Barrier (isb) for aarch64 architectures. Based on our >>> testing on Graviton instances (m7g.16xlarge), we can see on average >>> over multiple iterations up to 12% better performance using PGBench >>> select-only and up to 9% with Sysbench oltp_read_only workloads. On >>> Graviton4 (m8g.24xlarge) results are up to 8% better using PGBench >>> select-only and up to 6% with Sysbench oltp_read_only workloads. >>> We have also tested it putting more pressure on the spin_delay >>> function, enabling pg_stat_statements.track_planning with PGBench >>> read-only [0] and, on average, the patch shows up to 27% better >>> performance on m6g.16xlarge and up to 37% on m7g.16xlarge. > >> I think you should make some kind of argument about why the previous >> conclusion was wrong, or why something's changed between then and now. > > TBH, those numbers are large enough that I flat out don't believe > them. As noted in the previous thread, we've managed to squeeze out > a lot of our dependencies on spinlock performance, via algorithmic > changes, migration to atomic ops, etc. So I think ten-percent-ish > improvement on a plain pgbench test just isn't very plausible. > We certainly didn't see that kind of effect when we were doing that > earlier round of testing --- we had to use a custom testing lashup > to get numbers that were outside the noise at all. I'm not sure there's actually any hot spinlock involved in a regular select-only pgbench (unless perhaps pg_stat_statements was enabled or something). But I do know pg_stat_statements.track_planning stresses the spinlock code quite a bit, and commit 3d0b4b1 recently added a non-locking initial test in AArch64's TAS_SPIN, so I wonder if the ISB is still appropriate. It'd be interesting to see the performance difference of removing the ISB with and without commit 3d0b4b1 applied. -- nathan
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture
From
Tom Lane
Date:
Nathan Bossart <nathandbossart@gmail.com> writes: > ... commit 3d0b4b1 recently added a non-locking > initial test in AArch64's TAS_SPIN, so I wonder if the ISB is still > appropriate. It'd be interesting to see the performance difference of > removing the ISB with and without commit 3d0b4b1 applied. Oh! That's an excellent point. The OP didn't mention if their tests were done before or after 3d0b4b1, but that might well matter. I still think pgbench is a very blunt tool for this type of testing, though. I recommend resurrecting the test_shm_mq-based hack discussed in the prior thread and seeing what that shows. regards, tom lane
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture
From
Nathan Bossart
Date:
On Thu, May 01, 2025 at 04:08:06PM -0400, Tom Lane wrote: > Nathan Bossart <nathandbossart@gmail.com> writes: >> ... commit 3d0b4b1 recently added a non-locking >> initial test in AArch64's TAS_SPIN, so I wonder if the ISB is still >> appropriate. It'd be interesting to see the performance difference of >> removing the ISB with and without commit 3d0b4b1 applied. > > Oh! That's an excellent point. The OP didn't mention if their tests > were done before or after 3d0b4b1, but that might well matter. > > I still think pgbench is a very blunt tool for this type of testing, > though. I recommend resurrecting the test_shm_mq-based hack discussed > in the prior thread and seeing what that shows. Well, I have interesting results. This is all on a c8g.24xlarge (96 cores, Neoverse-V2, Armv9.0-a). For the first test_shm_mq test, I ran the following: SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 1); SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 2); SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 4); ... This gave me the following results (values are in seconds): w/o 3d0b4b1 w/ 3d0b4b1 ISB no ISB ISB no ISB 1 1.4 1.6 1.5 1.6 2 2.1 2.0 2.1 2.1 4 3.2 3.5 3.3 3.5 8 7.4 8.1 7.2 8.4 16 18.0 35.9 22.7 23.4 32 35.7 85.6 53.7 49.5 64 85.1 ? 147.6 100.1 For the second test_shm_mq test, I ran at higher concurrency, so I had to reduce the loop counts: SELECT test_shm_mq_pipelined(16384, 'xyzzy', 100000, 32); ... That gave me the following: w/o 3d0b4b1 w/ 3d0b4b1 ISB no ISB ISB no ISB 32 0.4 0.8 0.5 0.6 64 2.0 4.8 1.3 1.1 128 6.1 29.3 7.5 2.1 256 43.0 66.4 24.4 4.5 Finally, I ran the pgbench select-only test with pg_stat_statements.track_planning enabled (values are in thousands of transactions per second): w/o 3d0b4b1 w/ 3d0b4b1 ISB no ISB ISB no ISB 71.4 67.4 538.2 891.2 So... * The ISB does seem to have a positive effect without commit 3d0b4b1 applied. * With commit 3d0b4b1 applied, removing the ISB seems to have a positive effect at high concurrencies. This is especially pronounced in the pgbench test. * With commit 3d0b4b1 applied, removing the ISB doesn't change much at lower concurrencies, and there might even be a small regression. * At mostly lower concurrencies, commit 3d0b4b1 actually seems to regress some test_shm_mq tests. Removing the ISB instruction appears to help in some cases, but not all. -- nathan
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture
From
Salvatore Dipietro
Date:
On Thu, 1 May 2025 at 13:08, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Oh! That's an excellent point. The OP didn't mention if their tests > were done before or after 3d0b4b1, but that might well matter. The benchmarks we conducted are based on REL_17_2 branch which do not include the TAS_SPIN(lock) change for ARM yet.