Re: Page replacement algorithm in buffer cache - Mailing list pgsql-hackers
From: Merlin Moncure
Subject: Re: Page replacement algorithm in buffer cache
Msg-id: CAHyXU0xsN4p1H4CqQMCR0zDnXNNX=xKFaxKSJ19XUsnF1ys84Q@mail.gmail.com
In response to: Re: Page replacement algorithm in buffer cache (Ants Aasma <ants@cybertec.at>)
Responses: Re: Page replacement algorithm in buffer cache
List: pgsql-hackers
On Fri, Mar 22, 2013 at 7:27 PM, Ants Aasma <ants@cybertec.at> wrote:
> On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> well if you do a non-locking test first you could at least avoid some
>> cases (and, if you get the answer wrong, so what?) by jumping to the
>> next buffer immediately. if the non-locking test comes back good, only
>> then do you do a hardware TAS.
>>
>> you could in fact go further and dispense with all locking in front of
>> usage_count, on the premise that it's only advisory and not a real
>> refcount. so you only lock if/when it's time to select a candidate
>> buffer, and even then only after a non-locking test first. this would
>> of course require some amusing adjustments to various logical checks
>> (usage_count <= 0, heh).
>
> Moreover, if the buffer happens to miss a decrement due to a data
> race, there's a good chance that the buffer is heavily used and
> wouldn't need to be evicted soon anyway. (if you arrange it to be a
> read-test-inc/dec-store operation then you will never go out of
> bounds)

yeah. There's something to be said for having an upper bound on the
length of time it takes to get a page out (except in the special case
when most of them are pinned). Right now, contention on a buffer header
for any reason can shut down buffer allocation, and that's just not
good. It's obviously not very likely to happen, but I think it can and
does happen. The more I think about it, the more I think it's a bad
idea to spin during buffer allocation for any reason, ever.
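[For illustration, a minimal sketch in C of the unlocked, bounds-checked
usage_count handling discussed above: a plain read-test-store on the
count, with the header spinlock (the hardware TAS) taken only once the
non-locking test shows a candidate. BufferDescLite, sweep_tick, and
bump_usage are hypothetical simplifications for the sketch, not
PostgreSQL's actual buffer header or clock sweep code.]

    #include <stdbool.h>

    #define BM_MAX_USAGE_COUNT 5        /* cap on the advisory count */

    /* Hypothetical, simplified stand-in for a buffer header. */
    typedef struct BufferDescLite
    {
        volatile int usage_count;       /* advisory: 0..BM_MAX_USAGE_COUNT */
        /* ... refcount, flags, per-header spinlock, etc. ... */
    } BufferDescLite;

    /*
     * Pin-side bump with no lock: because the stored value is derived
     * from the value just read (read-test-store), a racing update can
     * at worst be lost; the count can never go out of bounds.
     */
    static void
    bump_usage(BufferDescLite *buf)
    {
        int count = buf->usage_count;       /* plain read, no spinlock */

        if (count < BM_MAX_USAGE_COUNT)
            buf->usage_count = count + 1;
    }

    /*
     * One clock-sweep tick, also lock-free.  A lost decrement means the
     * buffer looked hot, and a hot buffer didn't need evicting anyway.
     * Returns true when the buffer is a candidate for eviction.
     */
    static bool
    sweep_tick(BufferDescLite *buf)
    {
        int count = buf->usage_count;       /* non-locking test */

        if (count > 0)
        {
            buf->usage_count = count - 1;   /* store tested value - 1 */
            return false;                   /* still warm; move the hand on */
        }

        /*
         * Only here would the caller take the header spinlock (a
         * hardware TAS) and re-check usage_count and the pin count
         * before evicting: the test-and-test-and-set pattern from the
         * quoted discussion.
         */
        return true;
    }

[The "usage_count <= 0" adjustments the quoted text jokes about would
matter if lost updates could ever drive the count negative; the
read-test-store discipline above rules that out by construction.]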
> However, clocksweep and usage_count maintenance is not what is
> causing contention, because that workload is distributed. The issue
> is pinning and unpinning. There we need an accurate count, and there
> are some pages, like index roots, that get hit very heavily. The
> things to do there, in my opinion, are to convert to a futex-based
> spinlock, so that contention doesn't completely kill performance, and
> then to try to get rid of the contention. Converting to lock-free
> pinning won't help much here, as what is killing us is the cacheline
> bouncing.

Yup -- futexes are another way to go. They are Linux-specific, though.

> One way to get rid of contention is the buffer nailing idea that
> Robert came up with. If some buffer gets so hot that maintaining the
> refcount on the buffer header leads to contention, promote that
> buffer to nailed status, let everyone keep their pin counts locally,
> and sometime later revisit the nailing decision, converting the pins
> back to the buffer header if necessary.

Yeah, this is a more general (albeit more complicated) solution and
would likely be fantastic. Is it safe to assume that refcounting is the
only likely cause of contention?

> One other interesting idea I have seen is closeable scalable nonzero
> indication (C-SNZI) from scalable rw-locks [1]. The idea there is to
> use a tree structure to dynamically stripe access to the shared lock
> counter when contention is detected. The downside is that a
> considerable amount of shared memory is needed, so there needs to be
> some way to limit the resource usage. This is actually somewhat
> isomorphic to the nailing idea.
>
> The issue with the current buffer management algorithm is that it
> seems to scale badly with increasing shared_buffers. I think the
> improvements should concentrate on finding out what the problem is
> there and figuring out how to fix it. A simple idea to test would be
> to just partition shared buffers, along with the whole clock sweep
> machinery, into smaller pieces, the way the buffer mapping hash
> tables already are. This should at the very least reduce contention
> for the clock sweep, even if it doesn't reduce the work done per page
> miss.
>
> [1] http://people.csail.mit.edu/mareko/spaa09-scalablerwlocks.pdf

I'll have to take a look. Removing *all spinning* from page allocation,
though, feels like it might be worthwhile to test (got to give some
bonus points for it being a very local change that's simple to
implement). I wonder if with more shared buffers you tend to sweep more
buffers per allocation (IIRC Jeff J. was skeptical of that). The
partitioned-sweep idea is sketched below.

merlin
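[For illustration, a minimal sketch in C of the partitioning idea quoted
above: give each slice of shared buffers its own clock hand, by analogy
with the way the buffer mapping hash table is already split into
partitions. All names here (SweepPartition, NUM_SWEEP_PARTITIONS,
clock_sweep_next) are hypothetical, not existing PostgreSQL code.]

    #include <stdatomic.h>

    #define NUM_SWEEP_PARTITIONS 16     /* assumed tunable; analogous to
                                         * the buffer mapping partitions */

    typedef struct SweepPartition
    {
        int         first_buffer;   /* first buffer id in this slice */
        int         num_buffers;    /* number of buffers in this slice */
        atomic_uint next_victim;    /* per-partition clock hand */
    } SweepPartition;

    /* Assumed to live in shared memory and be initialized at startup so
     * the partitions cover shared_buffers without overlap. */
    static SweepPartition sweep_parts[NUM_SWEEP_PARTITIONS];

    /*
     * Return the next buffer id for this backend's sweep.  Clock-hand
     * contention is spread across NUM_SWEEP_PARTITIONS counters instead
     * of one global one; the trade-off is that each backend only ever
     * considers one slice of shared buffers, so victim quality may
     * suffer a little.
     */
    static int
    clock_sweep_next(int my_backend_id)
    {
        SweepPartition *p =
            &sweep_parts[my_backend_id % NUM_SWEEP_PARTITIONS];
        unsigned int tick = atomic_fetch_add(&p->next_victim, 1u);

        return p->first_buffer + (int) (tick % (unsigned) p->num_buffers);
    }

[Picking the partition by backend id is just one option; any scheme
that spreads backends evenly over the slices would serve the same
purpose.]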