Analysis of ganged WAL writes - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Analysis of ganged WAL writes |
Date | |
Msg-id | 6433.1033863379@sss.pgh.pa.us |
Responses | Re: Analysis of ganged WAL writes |
List | pgsql-hackers |
I do not think the situation for ganging of multiple commit-record writes is quite as dire as has been painted. There is a simple error in the current code that is easily corrected: in XLogFlush(), the wait to acquire WALWriteLock should occur before, not after, we try to acquire WALInsertLock and advance our local copy of the write request pointer. (To be exact, xlog.c lines 1255-1269 in CVS tip ought to be moved down to before line 1275, inside the "if" that tests whether we are going to call XLogWrite.)

Given that change, what will happen during heavy commit activity is like this:

1. Transaction A is ready to commit. It calls XLogInsert to insert its commit record into the WAL buffers (thereby transiently acquiring WALInsertLock) and then it calls XLogFlush to write and sync the log through the commit record. XLogFlush acquires WALWriteLock and begins issuing the needed I/O request(s).

2. Transaction B is ready to commit. It gets through XLogInsert and then blocks on WALWriteLock inside XLogFlush.

3. Transactions C, D, E likewise insert their commit records and then block on WALWriteLock.

4. Eventually, transaction A finishes its I/O, advances the "known flushed" pointer past its own commit record, and releases the WALWriteLock.

5. Transaction B now acquires WALWriteLock. Given the code change I recommend, it will choose to flush the WAL *through the last queued commit record as of this instant*, not the WAL endpoint as of when it started to wait. Therefore, this WAL write will handle all of the so-far-queued commits.

6. More transactions F, G, H, ... arrive to be committed. They will likewise insert their COMMIT records into the buffer and block on WALWriteLock.

7. When B finishes its write and releases WALWriteLock, it will have set the "known flushed" pointer past E's commit record. Therefore, transactions C, D, E will fall through their tests without calling XLogWrite at all. When F gets the lock, it will conclude that it should write the data queued up to that time, and so it will handle the commit records for G, H, etc. (The fact that lwlock.c will release waiters in order of arrival is important here --- we want C, D, E to get out of the queue before F decides it needs to write.)

It seems to me that this behavior will provide fairly effective ganging of COMMIT flushes under load. And it's self-tuning; no need to fiddle with weird parameters like commit_siblings. We automatically gang as many COMMITs as arrive during the time it takes to write and flush the previous gang of COMMITs.

Comments?

regards, tom lane
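The sequence in steps 1-7 can be checked with a toy model. What follows is a minimal sketch in Python, not PostgreSQL code: the function name `simulate_flushes`, its parameters, and the fixed per-write cost are all assumptions made for illustration. It models only a first-come-first-served write lock and a "known flushed" counter, and it compares choosing the flush target after acquiring the lock (the proposed ordering) with choosing it before waiting.

```python
from bisect import bisect_right


def simulate_flushes(arrivals, flush_time, decide_after_lock=True):
    """Toy model of commit-record flushing behind a FIFO write lock.

    arrivals: sorted times at which each backend has inserted its commit
        record and asks for a flush (hypothetical units).
    flush_time: assumed fixed cost of one physical write-and-sync.
    decide_after_lock: True picks the flush target once the lock is held;
        False picks it when the backend starts waiting.

    Returns a list of (writer_index, records_flushed_through), one entry
    per physical write that was actually issued.
    """
    flushed = 0        # count of commit records known to be on disk
    lock_free_at = 0   # time at which the write lock is next released
    writes = []

    # Iterating in arrival order models waiters being released in order.
    for i, arrived in enumerate(arrivals):
        my_record = i + 1
        acquired = max(arrived, lock_free_at)

        # An earlier writer already covered this record: no write needed.
        if flushed >= my_record:
            continue

        # Flush through every record inserted as of the decision time.
        decide_at = acquired if decide_after_lock else arrived
        target = bisect_right(arrivals, decide_at)

        writes.append((i, target))
        flushed = target
        lock_free_at = acquired + flush_time

    return writes
```

With arrivals `[0, 1, 2, 3, 4, 11, 12, 13]` standing in for transactions A through H and a write cost of 10, the model with `decide_after_lock=True` issues three writes: A flushes its own record, B flushes through E, and F flushes through H. With `decide_after_lock=False` every backend issues its own write, eight in total. Real behavior also depends on I/O timing and lock scheduling, which this sketch does not capture.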