Re: Clock sweep not caching enough B-Tree leaf pages? - Mailing list pgsql-hackers
From: Ants Aasma
Subject: Re: Clock sweep not caching enough B-Tree leaf pages?
Msg-id: CA+CSw_tkkZfrdNKbA8BbDX_WDJFaO8PcVXVHiVkajo9dvWZWGg@mail.gmail.com
In response to: Clock sweep not caching enough B-Tree leaf pages? (Peter Geoghegan <pg@heroku.com>)
List: pgsql-hackers
On Mon, Apr 14, 2014 at 8:11 PM, Peter Geoghegan <pg@heroku.com> wrote:
> PostgreSQL implements a clock sweep algorithm, which gets us something
> approaching an LRU for the buffer manager in trade-off for less
> contention on core structures. Buffers have a usage_count/"popularity"
> that currently saturates at 5 (BM_MAX_USAGE_COUNT). The classic CLOCK
> algorithm only has one bit for what approximates our "usage_count" (so
> it's either 0 or 1). I think that at its core CLOCK is an algorithm
> that has some very desirable properties that I am sure must be
> preserved. Actually, I think it's more accurate to say we use a
> variant of clock pro, a refinement of the original CLOCK.

PostgreSQL's replacement algorithm is more similar to Generalized CLOCK, or GCLOCK, as described in [1]. CLOCK-Pro [2] is a different algorithm that approximates LIRS [3]. LIRS is what MySQL implements [4], and CLOCK-Pro is implemented by NetBSD [5]; there has also been some work on trying it in Linux [6]. Both LIRS and CLOCK-Pro work by keeping metadata for double the cache size, which lets them detect pages that have been referenced recently. Basically, they provide an adaptive trade-off between LRU and LFU.

> In the past, various hackers have noted problems they've observed with
> this scheme. A common pathology is to see frantic searching for a
> victim buffer only to find all buffer usage_count values at 5. It may
> take multiple revolutions of the clock hand before a victim buffer is
> found, as usage_count is decremented for each and every buffer. Also,
> BufFreelistLock contention is considered a serious bottleneck [1],
> which is related.

There's a paper on a non-blocking GCLOCK algorithm that does lock-free clock sweep and buffer pinning [7].
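For reference, the GCLOCK mechanics under discussion can be sketched as follows. This is a toy model, not PostgreSQL's actual implementation (the class and function names are made up for illustration); it only shows how saturating usage counts and the decrementing sweep produce the multiple-revolution pathology described above.

```python
# Toy sketch of GCLOCK: usage_count saturates at a maximum (5 in
# PostgreSQL, BM_MAX_USAGE_COUNT) and the sweep decrements counts
# instead of clearing a single reference bit as classic CLOCK does.

BM_MAX_USAGE_COUNT = 5

class Buffer:
    def __init__(self, page):
        self.page = page
        self.usage_count = 1  # set when the page is first brought in

def pin(buf):
    # Each access bumps the buffer's popularity, saturating at the max.
    buf.usage_count = min(buf.usage_count + 1, BM_MAX_USAGE_COUNT)

def gclock_evict(buffers, hand):
    # Sweep the clock hand until a buffer with usage_count == 0 is
    # found, decrementing counts along the way. If every buffer sits at
    # the maximum, this takes several full revolutions -- the pathology
    # noted in the quoted text.
    while True:
        buf = buffers[hand]
        if buf.usage_count == 0:
            return hand  # victim slot found
        buf.usage_count -= 1
        hand = (hand + 1) % len(buffers)
```

With four buffers all at usage_count 1 and one of them pinned twice more, the sweep decrements every buffer once and wraps around before finding a victim, illustrating how uniformly high counts delay eviction.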
If we decide to stay with GCLOCK, that paper may be interesting, although I still believe that some variant of buffer nailing [8] is a better idea; my experience shows that most of the locking overhead is cache line bouncing, ignoring the extreme cases where our naive spinlock implementation blows up.

> Let's leave aside inner/root pages though, because they're so
> dramatically useful when in a primary index on a TPC-B table that
> they'll always be cached by any non-terrible algorithm. It beggars
> belief that the still relatively dense (and consequently *popular*)
> B+Tree leaf pages get so little credit for being of such long-term
> utility (in the view of our clock sweep algorithm as implemented). The
> algorithm has what could be loosely described as an excessively
> short-term perspective. There is clearly a better balance to be had
> here. I don't think the answer is that we have the B-Tree code give
> its pages some special treatment that makes them harder to evict,
> although I will admit that I briefly entertained the idea.

There has been some research indicating that for TPC-A workloads, giving index pages higher weights increases hit rates [1]. I think the hardest hurdle for any changes in this area will be showing that we don't have any nasty regressions. The best way to do that would be to study separately the performance overhead of the replacement algorithm and the optimality of the replacement choices. If we capture a bunch of buffer reference traces by instrumenting PinBuffer, we can pretty accurately simulate the behavior of different algorithms and tuning choices with different shared buffer sizes. Obviously full-scale tests are still needed due to interactions with OS, controller, and disk caches and other miscellaneous influences. But even so, simulation would get us much better coverage of various workloads and at least some confidence that it's a good change overall.
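The trace-driven simulation proposed above might look something like this minimal sketch: replay a reference trace (as could be captured from PinBuffer instrumentation) against a GCLOCK cache of a given size and report the hit rate. The function name and the trace format are assumptions for illustration, not anything that exists today.

```python
# Replay a buffer reference trace against a GCLOCK cache of nbuffers
# slots and return the hit rate. Running the same trace through
# different algorithms or cache sizes lets them be compared without
# full-scale benchmarks.

def simulate_gclock(trace, nbuffers, max_usage=5):
    pages = [None] * nbuffers      # page id held by each slot
    usage = [0] * nbuffers         # per-slot usage counts
    where = {}                     # page id -> slot index
    hand = 0
    hits = 0
    for page in trace:
        if page in where:
            # Hit: bump the usage count, saturating at max_usage.
            hits += 1
            slot = where[page]
            usage[slot] = min(usage[slot] + 1, max_usage)
            continue
        # Miss: sweep the clock hand, decrementing counts, until a
        # slot with usage count zero turns up.
        while usage[hand] > 0:
            usage[hand] -= 1
            hand = (hand + 1) % nbuffers
        victim = hand
        if pages[victim] is not None:
            del where[pages[victim]]
        pages[victim] = page
        where[page] = victim
        usage[victim] = 1
        hand = (hand + 1) % nbuffers
    return hits / len(trace)
```

For example, the trace [1, 2, 1, 2, 3, 1, 2] against a two-slot cache scores two hits out of seven references; comparing that number across candidate algorithms on real captured traces is the kind of evidence the paragraph above argues for.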
It will be very hard and time-consuming to gather equivalent evidence with full-scale tests.

[1] http://www.csd.uoc.gr/~hy460/pdf/p35-nicola.pdf
[2] http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-05-3.pdf
[3] http://www.ece.eng.wayne.edu/~sjiang/pubs/papers/jiang02_LIRS.pdf
[4] http://lists.mysql.com/commits/28601
[5] http://fxr.watson.org/fxr/source/uvm/uvm_pdpolicy_clockpro.c?v=NETBSD
[6] http://lwn.net/Articles/147879/
[7] http://derby-nb.googlecode.com/svn-history/r41/trunk/derby-nb/ICDE10_conf_full_409.pdf
[8] http://www.postgresql.org/message-id/CA+TgmoZYPeYHWAUeJVYy9A5aNDoULcF33WTnprfR9SYcw30vAg@mail.gmail.com

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de