Re: Adding basic NUMA awareness - Mailing list pgsql-hackers
From | Burd, Greg |
---|---|
Subject | Re: Adding basic NUMA awareness |
Date | |
Msg-id | 628EE169-6901-466E-9191-B33DBAB05B26@burd.me |
In response to | Re: Adding basic NUMA awareness (Andres Freund <andres@anarazel.de>) |
List | pgsql-hackers |
> On Jul 9, 2025, at 1:23 PM, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2025-07-09 12:55:51 -0400, Greg Burd wrote:
>> On Jul 9 2025, at 12:35 pm, Andres Freund <andres@anarazel.de> wrote:
>>
>>> FWIW, I've started to wonder if we shouldn't just get rid of the freelist
>>> entirely. While clocksweep is perhaps minutely slower in a single thread than
>>> the freelist, clock sweep scales *considerably* better [1]. As it's rather
>>> rare to be bottlenecked on clock sweep speed for a single thread (rather then
>>> IO or memory copy overhead), I think it's worth favoring clock sweep.
>>
>> Hey Andres, thanks for spending time on this. I've worked before on
>> freelist implementations (last one in LMDB) and I think you're onto
>> something. I think it's an innovative idea and that the speed
>> difference will either be lost in the noise or potentially entirely
>> mitigated by avoiding duplicate work.
>
> Agreed. FWIW, just using clock sweep actually makes things like DROP TABLE
> perform better because it doesn't need to maintain the freelist anymore...
>
>
>>> Also needing to switch between getting buffers from the freelist and the sweep
>>> makes the code more expensive. I think just having the buffer in the sweep,
>>> with a refcount / usagecount of zero would suffice.
>>
>> If you're not already coding this, I'll jump in. :)
>
> My experimental patch is literally a four character addition ;), namely adding
> "0 &&" to the relevant code in StrategyGetBuffer().
>
> Obviously a real patch would need to do some more work than that. Feel free
> to take on that project, I am not planning on tackling that in near term.
>

I started on this last night and am making good progress. Thanks for the
inspiration. I'll create a new thread to track the work and cross-reference
it here when I have something reasonable to show (hopefully later today).

> There's other things around this that could use some attention. It's not hard
> to see clock sweep be a bottleneck in concurrent workloads - partially due to
> the shared maintenance of the clock hand. A NUMAed clock sweep would address
> that.

Working on it. Other than NUMA-fying clocksweep, there is a function,
have_free_buffer(), that might be a tad tricky to re-implement efficiently
and/or make NUMA aware. Or maybe I can remove that too? It is used in
autoprewarm.c and possibly other extensions, but nowhere else in core.

> However, we also maintain StrategyControl->numBufferAllocs, which is a
> significant contention point and would not necessarily be removed by a
> NUMAificiation of the clock sweep.

Yep, I noted this counter and its potential for contention too. Fortunately,
it seems to be used only so that "bgwriter can estimate the rate of buffer
consumption", which to me opens the door to a less accurate partitioned
counter, perhaps something lock-free (no mutex/CAS) that is bucketed and then
combined when read.

A quick look at bufmgr.c indicates that recent_allocs (which is
StrategyControl->numBufferAllocs) is used to track a "moving average" and
other voodoo there that I've yet to fully grok. Any thoughts on this
approximate count approach?

Also, what are your thoughts on updating the algorithm to CLOCK-Pro [1] while
I'm there? I guess I'd have to try it out, measure it a lot, and see if there
are any material benefits. Maybe I'll keep that for a future patch, or at
least layer it... back to work!

> Greetings,
>
> Andres Freund

best.

-greg

[1] https://www.usenix.org/legacy/publications/library/proceedings/usenix05/tech/general/full_papers/jiang/jiang_html/html.html
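
For illustration only, the bucketed "approximate count" idea for numBufferAllocs might look roughly like the sketch below. It is not taken from any patch in this thread. Every name in it (PartitionedCounter, pc_add, pc_read_and_reset, the bucket count, the 64-byte cache line) is hypothetical, and it uses C11 atomics rather than PostgreSQL's own pg_atomic_* wrappers. Each writer bumps its own cache-line-padded bucket with a relaxed fetch-add, which is still an atomic read-modify-write but needs no mutex and no CAS retry loop. The reader drains the buckets one at a time, so the total is only approximate while writers are active.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define PC_NUM_BUCKETS 16   /* hypothetical; must be a power of two */
#define PC_CACHE_LINE  64   /* assumed cache line size in bytes */

/* One counter per bucket, aligned so that buckets never share a cache line. */
typedef struct
{
    _Alignas(PC_CACHE_LINE) atomic_uint_fast64_t count;
} PCBucket;

typedef struct
{
    PCBucket buckets[PC_NUM_BUCKETS];
} PartitionedCounter;

static inline void
pc_init(PartitionedCounter *pc)
{
    assert(pc != NULL);
    for (int i = 0; i < PC_NUM_BUCKETS; i++)
        atomic_init(&pc->buckets[i].count, 0);
}

/*
 * Add n to the bucket selected by "hint". The hint is an assumption of this
 * sketch: it could be a backend number, CPU id, or NUMA node id.
 */
static inline void
pc_add(PartitionedCounter *pc, unsigned hint, uint64_t n)
{
    atomic_fetch_add_explicit(&pc->buckets[hint & (PC_NUM_BUCKETS - 1)].count,
                              n, memory_order_relaxed);
}

/*
 * Sum all buckets and reset each to zero. Buckets are drained one by one,
 * so increments racing with the loop land in either this total or the next
 * one; none are lost, but no single total is an exact snapshot.
 */
static inline uint64_t
pc_read_and_reset(PartitionedCounter *pc)
{
    uint64_t total = 0;

    for (int i = 0; i < PC_NUM_BUCKETS; i++)
        total += atomic_exchange_explicit(&pc->buckets[i].count, 0,
                                          memory_order_relaxed);
    return total;
}
```

In this sketch a caller would invoke pc_add() once per buffer allocation, and a single periodic reader would call pc_read_and_reset() to obtain the allocation count since its previous read. Whether that level of inaccuracy is acceptable for the bgwriter's rate estimate is exactly the open question raised above.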