Re: Clock with Adaptive Replacement - Mailing list pgsql-hackers
From:           ben.manes
Subject:        Re: Clock with Adaptive Replacement
Msg-id:         1526057854777-0.post@n3.nabble.com
In response to: Re: [HACKERS] Clock with Adaptive Replacement (Thomas Munro <thomas.munro@enterprisedb.com>)
List:           pgsql-hackers
I have been working on caching as a hobby project for the last few years and would be happy to collaborate on an analysis and design. I've tackled a lot of these issues, written widely used implementations, and co-authored a paper for ACM's Transactions on Storage on an eviction policy. A short summary of my results is described in this HighScalability article [1].

I very much agree with the requests for traces so that real-world data can be used. I wrote a simulator that supports a large number of policies and traces [2]. It was invaluable. I believe ARC's DS1 should be the closest to a Postgres workload. Their OLTP trace appears to be the I/O of the WAL, making it recency-biased and not very interesting.

** Eviction **

ARC-based policies (CAR, CART) are only modestly better than LRU in most workloads. They do much better on Zipf, but poorly in loopy workloads like glimpse. I think the problem is that they do not capture and use the history (ghost lists) very effectively.

LIRS-based policies (ClockPro, FRD) are very strong. Unfortunately the papers describe them poorly, they are complex to implement, and most implementations are flawed in ways that diminish the hit rate. As described in its paper, LIRS has an unbounded history that can cause memory leaks if not capped, and the pruning is O(n).

I try to avoid Clock-based policies in my designs due to their O(n) scans. These can cause latency spikes that are difficult to reproduce because they depend on the policy's state. Random sampling policies don't have this problem, but tend to underperform due to a lack of history. I prefer amortized O(1) policies, but those are difficult due to concurrency (solved below).

I chose TinyLFU with an admission window, called W-TinyLFU. This uses a frequency sketch with saturating counters for its history (e.g. a CountMinSketch with 4-bit counters). The counters are halved every sample period, e.g. 10x the maximum size, which efficiently ages the LFU. Instead of trying to order entries for eviction, TinyLFU avoids cache pollution by rejecting candidates with a low probability of reuse. This is done by checking the candidate's frequency against that of the eviction policy's victim, and admitting the candidate only if it offers a better hit rate (a rough sketch of the admission check and counter aging appears below, after the concurrency discussion). The LRU admission window helps for recency-biased traces, where the candidate is otherwise rejected prematurely or, when accepted, is unlikely to be used again soon. This simple and compact scheme has the highest hit rate in most real-world workloads that I have tested. To bypass the ACM paywall, you can use the author link on my github project's README [3].

We are currently working on making W-TinyLFU adaptive by dynamically resizing the admission window. I can send the draft paper privately if anyone is interested. I am a Bay Area engineer working with researchers at Israel's Technion, so the mixture has been fruitful.

** Concurrency **

As you know, LRU's O(1) nature is very attractive, but the list manipulations are not friendly to multi-threading. The lock hold times are small, but the access rate is high and causes a lot of contention. The solution is the WAL approach: record the event into a buffer and replay it under a tryLock. That way, adding to the ring buffer is a cheap CAS and threads won't block, since their work has been handed off. When the buffers fill up they are replayed under the lock in a batch, thereby isolating the policy from concurrent mutations. I use multiple, lossy ring buffers to capture reads, and replaying is delayed until beneficial.
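Here is the promised sketch of the TinyLFU admission check and counter aging. It is only illustrative: the class name, table sizing, and hash mixing are invented for clarity and are not the production implementation, which also guards against hash flooding and other details glossed over here.

// Illustrative sketch: a tiny count-min sketch with 4-bit saturating counters
// packed into longs, halved every sample period to age the history, plus the
// TinyLFU admission check. Names and sizes are invented for this example.
final class FrequencySketch {
  private final long[] table;     // each long packs sixteen 4-bit counters
  private final int sampleSize;   // aging threshold, e.g. 10x the maximum cache size
  private int additions;

  FrequencySketch(int maximumSize) {
    int size = Math.max(1, Integer.highestOneBit(maximumSize));  // power-of-two table
    table = new long[size];
    sampleSize = 10 * maximumSize;
  }

  // Estimated frequency (0..15) of the item: the minimum over four counters.
  int frequency(int hash) {
    int min = 15;
    for (int i = 0; i < 4; i++) {
      min = Math.min(min, counterAt(indexOf(hash, i), offsetOf(hash, i)));
    }
    return min;
  }

  // Records an access; when the sample period elapses, every counter is halved.
  void increment(int hash) {
    boolean added = false;
    for (int i = 0; i < 4; i++) {
      added |= incrementAt(indexOf(hash, i), offsetOf(hash, i));
    }
    if (added && (++additions == sampleSize)) {
      reset();
    }
  }

  // TinyLFU admission: keep the candidate only if it looks more valuable than the victim.
  boolean admit(int candidateHash, int victimHash) {
    return frequency(candidateHash) > frequency(victimHash);
  }

  private void reset() {
    for (int i = 0; i < table.length; i++) {
      table[i] = (table[i] >>> 1) & 0x7777777777777777L;  // halve each 4-bit counter
    }
    additions /= 2;
  }

  private int indexOf(int hash, int i) {
    int h = (hash ^ (hash >>> 16)) * (0x9E3779B9 + i);    // cheap per-function mixing
    return (h ^ (h >>> 13)) & (table.length - 1);
  }

  private int offsetOf(int hash, int i) {
    return ((hash >>> (i << 2)) & 15) << 2;               // bit offset of a 4-bit counter
  }

  private int counterAt(int index, int offset) {
    return (int) ((table[index] >>> offset) & 15);
  }

  private boolean incrementAt(int index, int offset) {
    if (counterAt(index, offset) == 15) {
      return false;                                       // saturated
    }
    table[index] += (1L << offset);
    return true;
  }
}

In use, every access records its key with increment(), and on a full cache the eviction path calls admit() to decide whether the candidate replaces the policy's victim or is simply dropped.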
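And here is a rough sketch of the read-buffer replay just described: accesses are recorded with a cheap CAS into a small, lossy ring buffer and drained into the single-threaded policy under a tryLock, so readers never block on the policy. The names and thresholds are invented for illustration, and striping across multiple buffers is omitted.

// Illustrative sketch of the read-path buffering: reads are recorded into a
// fixed-size, lossy ring buffer with a CAS and drained into the (non-thread-safe)
// eviction policy while holding a tryLock. Names and sizes are invented.
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Consumer;

final class ReadBuffer<E> {
  private static final int SIZE = 128;                    // power of two
  private static final int MASK = SIZE - 1;

  private final AtomicReferenceArray<E> buffer = new AtomicReferenceArray<>(SIZE);
  private final AtomicLong writeCounter = new AtomicLong();
  private volatile long readCounter;                      // advanced only while draining
  private final ReentrantLock evictionLock = new ReentrantLock();
  private final Consumer<E> policy;                       // e.g. reorder the entry in the LRU/SLRU lists

  ReadBuffer(Consumer<E> policy) {
    this.policy = policy;
  }

  // Records a read. If the buffer is full the event is simply dropped: losing
  // an occasional read is harmless to the policy, and the reader never blocks.
  void record(E event) {
    long writes = writeCounter.get();
    if (writes - readCounter >= SIZE) {
      tryDrain();                                         // full: help out if the lock is free
      return;
    }
    if (writeCounter.compareAndSet(writes, writes + 1)) {
      buffer.lazySet((int) (writes & MASK), event);       // publish into the claimed slot
    }
    if (writes - readCounter >= SIZE / 2) {
      tryDrain();                                         // replay lazily once half full
    }
  }

  // Replays buffered events into the policy if no one else is already doing so.
  void tryDrain() {
    if (!evictionLock.tryLock()) {
      return;                                             // someone else will replay; don't block
    }
    try {
      long writes = writeCounter.get();
      long reads = readCounter;
      while (reads < writes) {
        int index = (int) (reads & MASK);
        E event = buffer.get(index);
        if (event == null) {
          break;                                          // slot claimed but not yet published
        }
        buffer.lazySet(index, null);
        policy.accept(event);                             // policy sees events single-threaded
        reads++;
      }
      readCounter = reads;
    } finally {
      evictionLock.unlock();
    }
  }
}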
For writes, I use a ring buffer that is replayed immediately and, when full, causes back pressure (though it is rarely filled). In my benchmarks [4] the overhead is quite small on a Zipf workload (to simulate hot entries). The read throughput is 33% of a lock-free hash table and scales linearly. The write throughput is nearly the same as the hash table's, due to contention when updating the hot entries. The read throughput is high enough to meet performance budgets, and its overhead is only observable in synthetic micro-benchmarks.

The benefit of this approach is that, for a very small penalty to reads, the policy is mostly isolated from the multi-threading. This allows efficient and complex O(1) algorithms that are not thread-safe. For example, I use W-TinyLFU (LRU+SLRU) for eviction and a Hierarchical TimerWheel for expiration. While the initial infrastructure is more work, it has been a boon due to composability, a simpler mental model, and not having to use algorithms that degrade in difficult-to-debug ways.

Hope that helps,
Ben

[1] http://highscalability.com/blog/2016/1/25/design-of-a-modern-cache.html
[2] https://github.com/ben-manes/caffeine/wiki/Simulator
[3] https://github.com/ben-manes/caffeine
[4] https://github.com/ben-manes/caffeine/wiki/Benchmarks