Re: Page replacement algorithm in buffer cache - Mailing list pgsql-hackers
From | Ants Aasma |
---|---|
Subject | Re: Page replacement algorithm in buffer cache |
Date | |
Msg-id | CA+CSw_sMmHt-Up+SiyL4OPkrgfECQVnMeOpyJz+=M7x=DNTzWg@mail.gmail.com |
In response to | Re: Page replacement algorithm in buffer cache (Merlin Moncure <mmoncure@gmail.com>) |
Responses |
Re: Page replacement algorithm in buffer cache
Re: Page replacement algorithm in buffer cache Re: Page replacement algorithm in buffer cache Re: Page replacement algorithm in buffer cache |
List | pgsql-hackers |
On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> well if you do a non-locking test first you could at least avoid some
> cases (and, if you get the answer wrong, so what?) by jumping to the
> next buffer immediately. if the non locking test comes good, only
> then do you do a hardware TAS.
>
> you could in fact go further and dispense with all locking in front of
> usage_count, on the premise that it's only advisory and not a real
> refcount. so you only then lock if/when it's time to select a
> candidate buffer, and only then when you did a non locking test first.
> this would of course require some amusing adjustments to various
> logical checks (usage_count <= 0, heh).

Moreover, if the buffer happens to miss a decrement due to a data race, there's a good chance that the buffer is heavily used and wouldn't need to be evicted soon anyway. (If you arrange it to be a read-test-inc/dec-store operation, then you will never go out of bounds.)

However, clocksweep and usage_count maintenance is not what is causing contention, because that workload is distributed. The issue is pinning and unpinning. There we need an accurate count, and there are some pages, like index roots, that get hit very heavily. Things to do there would be, in my opinion, to convert to a futex-based spinlock so that contention doesn't completely kill performance, and then to try to get rid of the contention. Converting to lock-free pinning won't help much here, as what is killing us is the cacheline bouncing.

One way to get rid of contention is the buffer nailing idea that Robert came up with. If some buffer gets so hot that maintaining the refcount on the buffer header leads to contention, promote that buffer to a nailed status, let everyone keep their pin counts locally, and sometime later revisit the nailing decision and, if necessary, convert pins back to the buffer header.
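
To make the unlocked usage_count handling discussed at the top of this message concrete, here is a minimal C11 sketch. It is only an illustration under stated assumptions: SketchBuf, MAX_USAGE_COUNT and the sketch_* functions are hypothetical names, not PostgreSQL's BufferDesc or bufmgr API. usage_count is updated with a racy read-test-inc/dec-store, so a lost update is possible but the stored value always stays in bounds, and the sweep takes the per-buffer TAS lock only after an unlocked pre-check says the buffer looks like a victim.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_USAGE_COUNT 5          /* hypothetical cap on the advisory counter */

/* Hypothetical, simplified buffer header (not PostgreSQL's BufferDesc). */
typedef struct
{
    atomic_flag lock;              /* per-buffer spinlock, taken with a TAS */
    atomic_int  usage_count;       /* advisory only: racy updates tolerated */
    atomic_int  refcount;          /* pin count: changed only under the lock */
} SketchBuf;

static void sketch_init(SketchBuf *buf)
{
    atomic_flag_clear(&buf->lock);
    atomic_init(&buf->usage_count, 0);
    atomic_init(&buf->refcount, 0);
}

/* Read-test-inc-store: a concurrent update may be lost, but the value
 * written is always within [0, MAX_USAGE_COUNT]. */
static void sketch_bump_usage(SketchBuf *buf)
{
    int u = atomic_load_explicit(&buf->usage_count, memory_order_relaxed);

    assert(u >= 0 && u <= MAX_USAGE_COUNT);
    if (u < MAX_USAGE_COUNT)
        atomic_store_explicit(&buf->usage_count, u + 1, memory_order_relaxed);
}

/* Read-test-dec-store, same reasoning: never goes below zero. */
static void sketch_decay_usage(SketchBuf *buf)
{
    int u = atomic_load_explicit(&buf->usage_count, memory_order_relaxed);

    assert(u >= 0 && u <= MAX_USAGE_COUNT);
    if (u > 0)
        atomic_store_explicit(&buf->usage_count, u - 1, memory_order_relaxed);
}

/* One clock-sweep visit. Returns true if the buffer was chosen as the
 * victim, in which case it is left pinned for the caller. */
static bool sketch_sweep_visit(SketchBuf *buf)
{
    bool victim;

    /* Unlocked pre-check: a wrong answer only costs a skipped buffer. */
    if (atomic_load_explicit(&buf->refcount, memory_order_relaxed) != 0)
        return false;
    if (atomic_load_explicit(&buf->usage_count, memory_order_relaxed) > 0)
    {
        sketch_decay_usage(buf);
        return false;
    }

    /* Looks like a candidate: only now pay for the hardware TAS. */
    while (atomic_flag_test_and_set_explicit(&buf->lock, memory_order_acquire))
        ;                          /* spin */

    /* Recheck under the lock, since the pre-check may be stale. */
    victim = atomic_load_explicit(&buf->refcount, memory_order_relaxed) == 0 &&
             atomic_load_explicit(&buf->usage_count, memory_order_relaxed) == 0;
    if (victim)
        atomic_store_explicit(&buf->refcount, 1, memory_order_relaxed);

    atomic_flag_clear_explicit(&buf->lock, memory_order_release);
    return victim;
}
```

Note that this only removes locking from the advisory counter. As argued above, it does not address the pin/unpin contention on hot pages, where the count has to be exact.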

One other interesting idea I have seen is the closable scalable nonzero indicator (C-SNZI) from scalable rw-locks [1]. The idea there is to use a tree structure to dynamically stripe access to the shared lock counter when contention is detected. The downside is that a considerable amount of shared memory is needed, so there needs to be some way to limit the resource usage. This is actually somewhat isomorphic to the nailing idea.

The issue with the current buffer management algorithm is that it seems to scale badly with increasing shared_buffers. I think the improvements should concentrate on finding out what the problem is there and figuring out how to fix it. A simple idea to test would be to just partition shared buffers, along with the whole clock sweep machinery, into smaller pools, like the buffer mapping hash tables already are. This should at the very least reduce contention for the clock sweep, even if it doesn't reduce the work done per page miss.

[1] http://people.csail.mit.edu/mareko/spaa09-scalablerwlocks.pdf

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
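
The partitioning idea from the last paragraph of the message can be sketched roughly as follows. This is a single-threaded C illustration under assumptions that are not in the message: the pool size, the partition count, and the names SketchPool and sketch_partition_sweep are all hypothetical, and pinning is left out. In a concurrent version each partition would carry its own lock around its clock hand, so sweeps in different partitions would not contend with each other.

```c
#include <assert.h>

#define SKETCH_NBUFFERS    16      /* hypothetical pool size */
#define SKETCH_NPARTITIONS 4       /* hypothetical partition count */
#define SKETCH_PART_SIZE   (SKETCH_NBUFFERS / SKETCH_NPARTITIONS)

/* Hypothetical buffer pool split into equal, contiguous partitions. */
typedef struct
{
    int usage_count[SKETCH_NBUFFERS];
    int hand[SKETCH_NPARTITIONS];  /* one clock hand per partition,
                                    * stored as an offset in the partition */
} SketchPool;

/* Run the clock sweep inside a single partition and return the global
 * index of the victim buffer. Only that partition's hand and usage
 * counts are touched. */
static int sketch_partition_sweep(SketchPool *pool, int part)
{
    int base;

    assert(part >= 0 && part < SKETCH_NPARTITIONS);
    base = part * SKETCH_PART_SIZE;

    for (;;)
    {
        int idx = base + pool->hand[part];

        /* Advance this partition's hand, wrapping within the partition. */
        pool->hand[part] = (pool->hand[part] + 1) % SKETCH_PART_SIZE;

        if (pool->usage_count[idx] == 0)
            return idx;            /* cold buffer: evict it */
        pool->usage_count[idx]--;  /* warm buffer: decay and move on */
    }
}
```

A backend would pick a partition by some cheap rule (for example its backend number modulo the partition count, which is again an assumption) and sweep only there. The work per page miss stays the same, but the hand is no longer one shared hot spot.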