Re: [PERFORM] DELETE vs TRUNCATE explanation - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: [PERFORM] DELETE vs TRUNCATE explanation |
Date | |
Msg-id | 10862.1342724246@sss.pgh.pa.us |
In response to | Re: [PERFORM] DELETE vs TRUNCATE explanation (Robert Haas <robertmhaas@gmail.com>) |
Responses | Re: [PERFORM] DELETE vs TRUNCATE explanation |
List | pgsql-hackers |
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Jul 19, 2012 at 10:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> What about checking just the immediately previous entry?  This would
>> at least fix the problem for bulk-load situations, and the cost ought
>> to be about negligible compared to acquiring the LWLock.

> 2. You say "fix the problem" but I'm not exactly clear what problem
> you think this fixes.

What I'm concerned about is that there is going to be a great deal more
fsync request queue traffic in 9.2 than there ever was before, as a
consequence of the bgwriter/checkpointer split.  The design expectation
for this mechanism was that most fsync requests would be generated
locally inside the bgwriter and thus go straight into the hash table
without having to go through the shared-memory queue.  I admit that we
have seen no benchmarks showing that there's a problem, but that's
because up till yesterday the bgwriter was failing to transmit such
messages at all.  So I'm looking for ways to cut the overhead (a sketch
of the previous-entry check appears below).  But having said that,
maybe we should not panic until we actually see some benchmarks showing
the problem.

Meanwhile, we do know there's a problem with FORGET_RELATION_FSYNC.  I
have been looking at the two-hash-tables design I suggested before, and
realized that there's a timing issue: if we just stuff "forget"
requests into a separate table, there is no method for determining
whether a given fsync request arrived before or after a given forget
request.  This is problematic if the relfilenode gets recycled: we need
to be able to guarantee that a previously-posted forget request won't
cancel a valid fsync for the new relation.  I believe this is soluble
though, if we merge the "forget" requests with unlink requests, because
a relfilenode can't be recycled until we do the unlink.

So as far as the code goes (sketches of each step follow this list):

1. Convert the PendingUnlinkEntry linked list to a hash table keyed by
RelFileNode.  It acts the same as before, and shouldn't be materially
slower to process, but now we can determine in O(1) time whether there
is a pending unlink for a relfilenode.

2. Treat the existence of a pending unlink request as a relation-wide
fsync cancel; so the loop in mdsync needs one extra hashtable lookup to
determine validity of a PendingOperationEntry.  As before, this should
not matter much considering that we're about to do an fsync().

3. Tweak mdunlink so that it does not send a FORGET_RELATION_FSYNC
message if it is sending an UNLINK_RELATION_REQUEST.  (A side benefit
is that this gives us another 2X reduction in fsync queue traffic, and
not just any queue traffic but the type of traffic that we must not
fail to queue.)  The FORGET_RELATION_FSYNC code path will still exist,
and will still require a full hashtable scan, but we don't care because
it isn't being used in common situations.  It would only be needed for
stuff like killing an init fork.
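For concreteness, the previous-entry check quoted at the top might look
like this.  It is a minimal, untested sketch: the queue declarations
are abbreviated stand-ins for the checkpointer's real shared-memory
structures, and fsync_request_is_duplicate is a hypothetical helper,
not an existing function.

```c
#include "postgres.h"

#include "storage/block.h"			/* BlockNumber */
#include "storage/relfilenode.h"	/* RelFileNode, ForkNumber,
									 * RelFileNodeEquals */

/* Abbreviated stand-ins for the real shared-memory queue declarations. */
typedef struct
{
	RelFileNode rnode;			/* which relation */
	ForkNumber	forknum;		/* which fork of it */
	BlockNumber segno;			/* which 1GB segment */
} FsyncRequest;

typedef struct
{
	int			num_requests;	/* current queue length */
	FsyncRequest requests[1];	/* variable-length array */
} FsyncRequestQueue;

static FsyncRequestQueue *RequestQueue;		/* points into shared memory */

/*
 * Return true if the new request is identical to the one most recently
 * queued.  The checkpointer would absorb both into the same hash table
 * entry anyway, and a bulk load emits long runs of identical requests,
 * so this one comparison removes most duplicate queue traffic.  Caller
 * must already hold the queue's LWLock in exclusive mode.
 */
static bool
fsync_request_is_duplicate(RelFileNode rnode, ForkNumber forknum,
						   BlockNumber segno)
{
	FsyncRequest *prev;

	if (RequestQueue->num_requests == 0)
		return false;

	prev = &RequestQueue->requests[RequestQueue->num_requests - 1];
	return RelFileNodeEquals(prev->rnode, rnode) &&
		prev->forknum == forknum &&
		prev->segno == segno;
}
```

The added cost is a few integer comparisons made while we already hold
the lock, which is the "negligible compared to acquiring the LWLock"
part of the argument.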
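Points 1 and 2 might come out roughly as below.  Again a sketch, with a
simplified entry layout and a hypothetical helper name; the dynahash
calls (hash_create, hash_search, tag_hash) are the same ones the
existing pendingOpsTable in md.c already uses.

```c
#include "postgres.h"

#include "storage/relfilenode.h"
#include "utils/hsearch.h"

typedef uint16 CycleCtr;		/* md.c's checkpoint cycle counter type */

/* Point 1: pending unlinks keyed by RelFileNode instead of a list. */
typedef struct PendingUnlinkEntry
{
	RelFileNode rnode;			/* hash key: relation awaiting unlink */
	CycleCtr	cycle_ctr;		/* checkpoint cycle of the request */
} PendingUnlinkEntry;

static HTAB *pendingUnlinks = NULL;

static void
init_pending_unlinks(void)
{
	HASHCTL		hash_ctl;

	MemSet(&hash_ctl, 0, sizeof(hash_ctl));
	hash_ctl.keysize = sizeof(RelFileNode);
	hash_ctl.entrysize = sizeof(PendingUnlinkEntry);
	hash_ctl.hash = tag_hash;
	pendingUnlinks = hash_create("Pending Unlinks", 100L, &hash_ctl,
								 HASH_ELEM | HASH_FUNCTION);
}

/*
 * Point 2: one extra O(1) lookup in mdsync's scan.  A pending unlink
 * for the relfilenode means the whole relation is going away, so any
 * fsync request for it is treated as canceled.
 */
static bool
pending_unlink_cancels_fsync(RelFileNode rnode)
{
	return hash_search(pendingUnlinks, &rnode, HASH_FIND, NULL) != NULL;
}
```

Processing the table at checkpoint time can use hash_seq_search, so it
shouldn't be materially slower than walking the old linked list.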
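And point 3, together with the ordering that the safety argument just
below depends on.  Every helper here (relpath_sketch,
register_unlink_request, unlink_extra_segments) is a hypothetical
placeholder, not a real md.c function; only the ordering matters.

```c
#include "postgres.h"

#include <fcntl.h>
#include <unistd.h>

#include "storage/relfilenode.h"

extern char *relpath_sketch(RelFileNode rnode);			/* hypothetical */
extern void register_unlink_request(RelFileNode rnode); /* hypothetical */
extern void unlink_extra_segments(const char *path);	/* hypothetical */

static void
md_unlink_sketch(RelFileNode rnode)
{
	char	   *path = relpath_sketch(rnode);
	int			fd;

	/*
	 * Truncate segment 0 of the main fork first, rather than unlinking
	 * it.  A concurrent fsync by the checkpointer cannot fail because
	 * of a truncate, and keeping the file in existence prevents the
	 * relfilenode from being recycled before the unlink request is
	 * processed at the next checkpoint.
	 */
	fd = open(path, O_RDWR);
	if (fd >= 0)
	{
		(void) ftruncate(fd, 0);
		close(fd);
	}

	/*
	 * Register the unlink with the checkpointer.  Because a pending
	 * unlink now doubles as a relation-wide fsync cancel (point 2),
	 * no separate FORGET_RELATION_FSYNC message is sent on this path.
	 */
	register_unlink_request(rnode);

	/*
	 * Only now remove the remaining segments (path.1, path.2, ...).
	 * The request is already queued, so mdsync can recover if a
	 * concurrent unlink of these files makes an fsync fail.
	 */
	unlink_extra_segments(path);

	pfree(path);
}
```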
The argument that this is safe involves these points:

* mdunlink cannot send UNLINK_RELATION_REQUEST until it's done
ftruncate on the main fork's first segment, because otherwise that
segment could theoretically get unlinked from under it before it can do
the truncate.  But this is okay since the ftruncate won't cause any
fsync the checkpointer might concurrently be doing to fail.  The
request *will* be sent before we unlink any other files, so mdsync will
be able to recover if it gets an fsync failure due to concurrent
unlink.

* Because a relfilenode cannot be recycled until we process and delete
the PendingUnlinkEntry during mdpostckpt, it is not possible for valid
new fsync requests to arrive while the PendingUnlinkEntry still exists
to cause them to be considered canceled.

* Because we only process and delete PendingUnlinkEntrys that have been
there since before the checkpoint started, we can be sure that any
PendingOperationEntrys referring to the relfilenode will have been
scanned and deleted by mdsync before we remove the PendingUnlinkEntry.

Unless somebody sees a hole in this logic, I'll go make this happen.

			regards, tom lane