Re: Clock sweep not caching enough B-Tree leaf pages? - Mailing list pgsql-hackers
From | Greg Stark |
---|---|
Subject | Re: Clock sweep not caching enough B-Tree leaf pages? |
Date | |
Msg-id | CAM-w4HMnOJ=ACBjW9GuPMgYpjqD1QH_tcLyjCOquXCetdtWR6A@mail.gmail.com Whole thread Raw |
In response to | Re: Clock sweep not caching enough B-Tree leaf pages? (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: Clock sweep not caching enough B-Tree leaf pages?
|
List | pgsql-hackers |
On Wed, Apr 16, 2014 at 12:44 AM, Robert Haas <robertmhaas@gmail.com> wrote: > This isn't a fundamental property of the usage-count idea; it's an > artifact of the fact that usage count decreases are tied to eviction > pressure rather than access pressure. For example, suppose we made a > rule that if the total usage counts of all buffers exceed 3 * > NBuffers, then every time you bump the usage count of a buffer from N > to N+1, you're required to advance the clock sweep far enough to > decrease the reference count of a buffer by one. This sounds like the right way to reason about it. From what I remember in school the idea with the clock sweep is to set the usage flags to the maximum whenever the buffer is used and decrement (actually iirc typically shift right) it when the clock sweep goes by. Ie, simulate a LRU where when the buffer is accessed it jumps to the head of the list and when the clock comes by it moves gradually down the list. What you're pointing out is that the clock might not come by very often resulting everything being at the head of the list. In that case I'm not clear it really matters what gets evicted though. And the cpu effort of running the clock n times sounds bad but doing the work earlier doesn't really change the amount of work being done, it just amortizes it over more calls. But if you want to do that it seems to me the way to do it is every time a buffer is pinned set to the maximum and then run the clock max_value - previous_value. So the total usage counts of all buffers remains constant. If that results in contention one way to reduce it is to do this probabilistically. Run the clock 1% of the time but run it 100x as much as you would normally. But I think you've misidentified the problem and what those other algorithms are trying to solve. The problem is not that Postgres will pick a bad buffer to evict. If all the buffers have been since the last time the clock came around then they're all "hot" anyways and it doesn't really matter which one we evict. The problem is that we expend an inordinate amount of work finding the few non-hot buffers. When you have a really large amount of memory and 99.9% of it is hot but 0.1% is whatever random non-hot page was needed last then there's an obvious buffer to evict when you need a new one. But we spend a lot of work decrementing every hot buffer's usage count 4 times only to have them immediately incremented again just to find the 1 buffer where the usage count was 4 or 3. The goal of these algorithms that divide the buffers into groups is to avoid having to do so much work to find the colder buffers. Once the hot buffers migrate to the hot pool we only need to run the clock there when we find we have new hot pages that we want to promote. All the thrashing in the cold pool can be more efficient because there's many fewer pages to consider. -- greg
pgsql-hackers by date: