Thinking about breaking up the BufMgrLock - Mailing list pgsql-hackers
| From | Tom Lane | 
|---|---|
| Subject | Thinking about breaking up the BufMgrLock | 
| Date | |
| Msg-id | 17716.1107736237@sss.pgh.pa.us Whole thread Raw | 
| Responses | Re: Thinking about breaking up the BufMgrLock Re: Thinking about breaking up the BufMgrLock Re: Thinking about breaking up the BufMgrLock Re: Thinking about breaking up the BufMgrLock | 
| List | pgsql-hackers | 
We've been speculating for awhile about ways to reduce contention for the BufMgrLock, in hopes of fixing the "context swap storm" problem exhibited by certain patterns of concurrent queries. The traditional way to fix such problems is to divide what is protected by the lock into finer-grain lockable objects. (I'm not totally convinced that this approach can help us, because more locks means more low-level locking overhead. But let's pursue it and see where it goes; we sure don't have any other ideas at the moment.) Neil Conway took a stab at this about a year ago: http://archives.postgresql.org/pgsql-hackers/2004-01/msg00173.php but that approach had some problems IMHO, and in any case he never got the patch to the point of working. Here's a bit of fresh thinking about the problem. To fix the test cases we have been looking at, there are two things we need to fix the performance of: ReadBuffer for the case where the requested page is already in a shared buffer, and ReleaseBuffer. The test cases essentially put multiple backends into tight loops doing these operations. These cases do not require I/O or any kernel call, so ideally they should be fast, but what we are seeing is that they suffer from excessive contention for the BufMgrLock. ReadBuffer needs to do a lookup to map the page ID to a buffer ID, which in principle requires only a shared lock on the page-to-buffer mapping (embodied in the buf_table hash table). Assuming success, it also needs to mark the buffer pinned and update the LRU-list position of the buffer. Marking pinned is certainly a buffer-local change, so we could imagine splitting up the BufMgrLock like this: 1. A global LWLock for the page-to-buffer mapping, call it say BufMappingLock. Share lock on this is sufficient to allow reading the hash table; must get exclusive lock when reassigning any buffer's page identity. 2. A global LWLock for the LRU strategy information, say BufLRULock or BufStrategyLock. 3. Per-buffer LWLocks (or maybe just spinlocks) protecting the state of each buffer header; you need this lock to examine or change a buffer header. ReleaseBuffer has no locking problems in this formulation: it just grabs the per-buffer lock, decrements its refcount, releases the lock. ReadBuffer looks like: * Acquire share lock on BufMappingLock.* Search hash table for desired ID. (we assume success)* acquire per-buffer lock.*increment buffer refcount.* release per-buffer lock.* release share lock on BufMappingLock.* update the LRU state. (We have to bump the pin count on the target buffer before releasing the BufMappingLock, otherwise someone could reassign the buffer as soon as we release BufMappingLock.) This would be pretty good from a locking point of view, except that "update the LRU state" seems to require taking an exclusive lock on a global data structure, which puts us about back where we were. Contention for a global BufLRULock might be a bit better than for the existing overall BufMgrLock, but it'll still cripple the performance of ReadBuffer. Perhaps someone knows of a neat data structure that can maintain an LRU list with only local updates? I don't though. The best idea I've had about it is to add flag bits to the per-buffer state that essentially indicate "pending move to head of T1" or "pending move to head of T2". We would make ReadBuffer set the appropriate bit when it grabs a buffer, but not actually update the global LRU state. Operations such as the periodic bgwriter scan for work to do, or an attempt to locate a reusable buffer, would scan the LRU list from tail to front. When they hit a buffer that's got one of these bits set, they would not process it immediately, but would move it to the list head and clear the bit. Doing this would require write lock on the LRU list, as well as per-buffer locks on each buffer as it's examined or changed, but *not* any lock on BufMappingLock. So these operations wouldn't interfere too badly with the success path in ReadBuffer. This would convert the existing "strict LRU" behavior into an "approximate LRU". I'm worried that the change might be penny-wise and pound-foolish: if a poorer cache management algorithm causes us to have to do more I/O, it's probably a net loss to save some lock contention. But without a smart idea about data structures I don't see how to do better. Thoughts? regards, tom lane
pgsql-hackers by date: