Re: [PERFORM] DELETE vs TRUNCATE explanation - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: [PERFORM] DELETE vs TRUNCATE explanation |
Date | |
Msg-id | 2358.1342646246@sss.pgh.pa.us |
In response to | Re: [PERFORM] DELETE vs TRUNCATE explanation (Robert Haas <robertmhaas@gmail.com>) |
Responses | Re: [PERFORM] DELETE vs TRUNCATE explanation |
List | pgsql-hackers |
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Jul 16, 2012 at 12:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Well, that argument is exactly why the code is designed the way it is...
>> but we are now finding out that sending useless fsync requests isn't as
>> cheap as all that.

> I agree, but I think the problem can be solved for a pretty modest
> amount of effort without needing to make fsync PGC_POSTMASTER. Your
> proposal to refactor the pendingOpsTable representation seems like it
> will help a lot. Perhaps you should do that first and then we can
> reassess.
> ...
> In my view, the elephant in the room here is that it's dramatically
> inefficient for every backend to send an fsync request on every block
> write. For many users, in many workloads, all of those requests will
> be for just a tiny handful of relation segments. The fsync queue
> compaction code works as well as it does for precisely that reason -
> when it triggers, we typically can compact a list of thousands or
> millions of entries down to less than two dozen. In other words, as I
> see it, the issue here is not so much that 100% of the fsync requests
> are useless when fsync=off, but rather that 99.9% of them are useless
> even when fsync=on.
> In any case, I'm still of the opinion that we ought to try making one
> fix (your proposed refactoring of the pendingOpsTable) and then see
> where we're at.

I've been chewing on this issue some more, and no longer like my previous proposal, which was

>>> ... What I'm thinking about
>>> is reducing the hash key to just RelFileNodeBackend + ForkNumber,
>>> so that there's one hashtable entry per fork, and then storing a
>>> bitmap to indicate which segment numbers need to be sync'd. At
>>> one gigabyte to the bit, I think we could expect the bitmap would
>>> not get terribly large. We'd still have a "cancel" flag in each
>>> hash entry, but it'd apply to the whole relation fork not each
>>> segment.

The reason that's not so attractive is the later observation that what we really care about optimizing is FORGET_RELATION_FSYNC for all the forks of a relation at once, which we could produce just one request for with trivial refactoring of smgrunlink/mdunlink. The above representation doesn't help for that.

So what I'm now thinking is that we should create a second hash table, with key RelFileNode only, carrying two booleans: a cancel-previous-fsyncs bool and a please-unlink-after-checkpoint bool. (The latter field would allow us to drop the separate pending-unlinks data structure.) Entries would be made in this table when we got a FORGET_RELATION_FSYNC or UNLINK_RELATION_REQUEST message -- note that in 99% of cases we'd get both message types for each relation, since they're both created during DROP. (Maybe we could even combine these request types.) To use the table, as we scan the existing per-fork-and-segment hash table, we'd have to do a lookup in the per-relation table to see if there was a later cancel message for that relation. Now this does add a few cycles to the processing of each pendingOpsTable entry in mdsync ... but considering that the major work in that loop is an fsync call, it is tough to believe that anybody would notice an extra hashtable lookup.

However, I also came up with an entirely different line of thought, which unfortunately seems incompatible with either of the improved table designs above.
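Below is a minimal standalone sketch of that two-table scheme: the existing per-fork-and-segment pending-fsync entries plus a second, relation-keyed table carrying the two booleans, consulted once per entry during an mdsync-style scan. The names here (RelFileNodeKey, PendingRelEntry, lookup_rel) and the toy linear lookup are illustrative only; real code would use hash_search on the actual RelFileNode type rather than flat arrays.

    /*
     * Standalone model (not PostgreSQL source) of the proposed second,
     * per-relation table consulted while scanning the per-fork-and-segment
     * pending-fsync table.  All names are illustrative.
     */
    #include <stdbool.h>
    #include <stdio.h>

    typedef unsigned int Oid;

    typedef struct RelFileNodeKey      /* stand-in for RelFileNode */
    {
        Oid spcNode;
        Oid dbNode;
        Oid relNode;
    } RelFileNodeKey;

    typedef struct PendingFsyncEntry   /* per fork+segment, as in pendingOpsTable */
    {
        RelFileNodeKey rnode;
        int  forknum;
        int  segno;
        bool canceled;                 /* per-entry cancel flag */
    } PendingFsyncEntry;

    typedef struct PendingRelEntry     /* proposed per-relation table entry */
    {
        RelFileNodeKey rnode;
        bool cancel_fsyncs;            /* FORGET_RELATION_FSYNC seen */
        bool unlink_requested;         /* UNLINK_RELATION_REQUEST seen */
    } PendingRelEntry;

    /* Toy linear lookup standing in for a HASH_FIND on the per-relation table */
    static PendingRelEntry *
    lookup_rel(PendingRelEntry *tab, int n, RelFileNodeKey key)
    {
        for (int i = 0; i < n; i++)
            if (tab[i].rnode.spcNode == key.spcNode &&
                tab[i].rnode.dbNode == key.dbNode &&
                tab[i].rnode.relNode == key.relNode)
                return &tab[i];
        return NULL;
    }

    int
    main(void)
    {
        PendingFsyncEntry ops[] = {
            {{1663, 1, 16384}, 0, 0, false},
            {{1663, 1, 16385}, 0, 0, false},
        };
        PendingRelEntry rels[] = {
            {{1663, 1, 16385}, true, true},   /* relation 16385 was dropped */
        };

        /* The scan: skip entries whose relation has a pending cancel */
        for (int i = 0; i < 2; i++)
        {
            PendingRelEntry *rel = lookup_rel(rels, 1, ops[i].rnode);

            if (ops[i].canceled || (rel && rel->cancel_fsyncs))
                printf("skip fsync for rel %u\n", ops[i].rnode.relNode);
            else
                printf("fsync rel %u segment %d\n",
                       ops[i].rnode.relNode, ops[i].segno);
        }
        return 0;
    }

The extra cost being weighed in the text is the single lookup_rel call per scanned entry, which is negligible next to the fsync the scan is about to issue.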
It is this: instead of having a request queue that feeds into a hash table hidden within the checkpointer process, what about storing the pending-fsyncs table as a shared hash table in shared memory? That is, ForwardFsyncRequest would not simply try to add the request to a linear array, but would do a HASH_ENTER call on a shared hash table. This means the de-duplication occurs for free and we no longer need CompactCheckpointerRequestQueue at all. Basically, this would amount to saying that the original design was wrong to try to micro-optimize the time spent in ForwardFsyncRequest, and that we'd rather pay a little more per ForwardFsyncRequest call to avoid the enormous response-time spike that will occur when CompactCheckpointerRequestQueue has to run. (Not to mention that the checkpointer would eventually have to do HASH_ENTER anyway.) I think this would address your observation above that the request queue tends to contain an awful lot of duplicates.

But I only see how to make that work with the existing hash table structure, because with either of the other table designs, it's difficult to set a predetermined limit on the amount of shared memory needed. The segment-number bitmaps could grow uncomfortably large in the first design, while in the second there's no good way to know how large the per-relation table has to be to cover a given size for the per-fork-and-segment table. (The sore spot here is that once we've accepted a per-fork entry, failing to record a relation-level cancel for it is not an option, so we can't just return failure.) So if we go that way it seems like we still have the problem of having to do hash_seq_search to implement a cancel. We could possibly arrange for that to be done under shared rather than exclusive lock of the hash table, but nonetheless it's not really fixing the originally complained-of O(N^2) problem.

Another issue, which might be fatal to the whole thing, is that it's not clear that a shared hash table similar in size to the existing request array is big enough. The entries basically need to live for about one checkpoint cycle, and with a slow cycle you could need an arbitrarily large number of them.

A variant that might work a little better is to keep the main request table still in checkpointer private memory, but to have *both* a small hash table and a request queue in shared memory. The idea is that you first try to enter your request in the hash table; if successful, done (and de-duping has happened automatically). If no room left in the hash table, add it to the request queue as normal. The checkpointer periodically empties both the hash table and the queue. The hash table probably doesn't have to be too huge to be effective at de-duping requests ... but having said that, I have no idea exactly how to size it.

So that's a brain dump of some half-baked ideas. Thoughts anyone?

        regards, tom lane
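For the hybrid variant described in the second-to-last paragraph, a minimal standalone model looks like the sketch below: a small fixed-size table absorbs and de-duplicates requests, and the ordinary request queue is used only as overflow when the table is full. All names (FsyncTag, forward_fsync_request, DEDUP_SLOTS) are illustrative, and in the real thing both structures would sit in shared memory under the same lock that protects today's request array rather than in plain static variables.

    /*
     * Standalone model (not PostgreSQL source) of the hybrid idea:
     * a small de-dup table tried first, with a plain request queue
     * as overflow.  All names and sizes are illustrative.
     */
    #include <stdbool.h>
    #include <stdio.h>

    #define DEDUP_SLOTS 8          /* "probably doesn't have to be too huge" */
    #define QUEUE_SLOTS 1024

    typedef struct FsyncTag        /* stand-in for (RelFileNode, fork, segno) */
    {
        unsigned relNode;
        int      forknum;
        int      segno;
        bool     used;
    } FsyncTag;

    static FsyncTag dedup[DEDUP_SLOTS];
    static FsyncTag queue[QUEUE_SLOTS];
    static int      queue_len = 0;

    /* Model of the forwarding step: try the de-dup table first, then the queue. */
    static void
    forward_fsync_request(unsigned relNode, int forknum, int segno)
    {
        int free_slot = -1;

        for (int i = 0; i < DEDUP_SLOTS; i++)
        {
            if (dedup[i].used &&
                dedup[i].relNode == relNode &&
                dedup[i].forknum == forknum &&
                dedup[i].segno == segno)
                return;            /* duplicate: de-duplication happened for free */
            if (!dedup[i].used && free_slot < 0)
                free_slot = i;
        }

        if (free_slot >= 0)
        {
            dedup[free_slot] = (FsyncTag) {relNode, forknum, segno, true};
            return;
        }

        /* No room left in the table: fall back to the plain request queue. */
        if (queue_len < QUEUE_SLOTS)
            queue[queue_len++] = (FsyncTag) {relNode, forknum, segno, true};
    }

    int
    main(void)
    {
        /* Many writes aimed at the same few segments mostly collapse. */
        for (int blk = 0; blk < 10000; blk++)
            forward_fsync_request(16384 + blk % 3, 0, 0);

        int used = 0;
        for (int i = 0; i < DEDUP_SLOTS; i++)
            used += dedup[i].used;
        printf("de-dup table entries: %d, overflow queue entries: %d\n",
               used, queue_len);   /* expect 3 and 0 */
        return 0;
    }

Running the toy main() makes the sizing intuition concrete: thousands of requests aimed at a handful of segments collapse into three table slots, and the overflow queue stays empty until the number of distinct dirty segments exceeds DEDUP_SLOTS.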