Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture
Msg-id fgsf5ofxte7er3z6t2womog6t3nlhiwklyy5bg6jfshj3maln2@enb6qeculxlm
In response to Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture  (Nathan Bossart <nathandbossart@gmail.com>)
Responses Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture
List pgsql-hackers
Hi,

On 2025-08-15 12:57:52 -0500, Nathan Bossart wrote:
> On Fri, Aug 15, 2025 at 01:39:52PM -0400, Andres Freund wrote:
> > On 2025-08-14 11:29:08 +0200, Álvaro Herrera wrote:
> >> However, changing that spinlock to an lwlock doesn't look easy, because of
> >> the way each pgss entry is created as a dynahash entry, and then deallocated
> >> from there.  With spinlocks we can just reinit the spinlock each time, but
> >> that doesn't work with lwlocks.  We have no easy way to associate then
> >> disassociate each entry from a specific lwlock.
> > 
> > I'm not following? The lwlock can just be inside the struct, just like the
> > spinlock is? "Association" is just LWLockInitialize() and deassociation is not
> > needed.
> 
> Indeed.  I rebased an old patch that I had lying around to demonstrate.  If
> my past testing [0] is to be trusted, this actually hurts performance,
> unfortunately.

FWIW, rather interesting result of testing the patch briefly:

On my older workstation, the patch is a substantial *gain* when there's a lot
of contention. But on my newer workstation it's a *loss*.

The penalty from enabling pg_stat_statements for read-only pgbench on the newer
workstation is rather bad - about 1/3 the throughput.


I think the main reason that lwlocks lose on the newer machine is that we
lose spinning. The newer machine has more cores and more NUMA domains and the
fairer locks lead to more cacheline pingpong...


IMO, the only way to actually make pg_stat_statements scale is to move to a
model much more like our regular stats. I.e. accumulate counters in backend
local memory and only occasionally update the shared stats. Even if you were
to move pgss successfully to atomics, the cacheline contention still would be
terrible for performance.

FWIW, I'd not be surprised if moving to atomics would often cause *slowdowns*
compared to using the spinlocks. You'd replace one atomic operation with
dozens, to update all those fields individually. With loads of cacheline
pingpong in between.

Greetings,

Andres Freund
