Re: [PERFORM] DELETE vs TRUNCATE explanation - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: [PERFORM] DELETE vs TRUNCATE explanation |
Date | |
Msg-id | 2358.1342646246@sss.pgh.pa.us |
In response to | Re: [PERFORM] DELETE vs TRUNCATE explanation (Robert Haas <robertmhaas@gmail.com>) |
Responses | Re: [PERFORM] DELETE vs TRUNCATE explanation |
List | pgsql-hackers |
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Jul 16, 2012 at 12:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Well, that argument is exactly why the code is designed the way it is...
>> but we are now finding out that sending useless fsync requests isn't as
>> cheap as all that.

> I agree, but I think the problem can be solved for a pretty modest
> amount of effort without needing to make fsync PGC_POSTMASTER. Your
> proposal to refactor the pendingOpsTable representation seems like it
> will help a lot. Perhaps you should do that first and then we can
> reassess.
> ...
> In my view, the elephant in the room here is that it's dramatically
> inefficient for every backend to send an fsync request on every block
> write. For many users, in many workloads, all of those requests will
> be for just a tiny handful of relation segments. The fsync queue
> compaction code works as well as it does for precisely that reason -
> when it triggers, we typically can compact a list of thousands or
> millions of entries down to less than two dozen. In other words, as I
> see it, the issue here is not so much that 100% of the fsync requests
> are useless when fsync=off, but rather that 99.9% of them are useless
> even when fsync=on.
> In any case, I'm still of the opinion that we ought to try making one
> fix (your proposed refactoring of the pendingOpsTable) and then see
> where we're at.

I've been chewing on this issue some more, and no longer like my previous proposal, which was

>>> ... What I'm thinking about
>>> is reducing the hash key to just RelFileNodeBackend + ForkNumber,
>>> so that there's one hashtable entry per fork, and then storing a
>>> bitmap to indicate which segment numbers need to be sync'd. At
>>> one gigabyte to the bit, I think we could expect the bitmap would
>>> not get terribly large. We'd still have a "cancel" flag in each
>>> hash entry, but it'd apply to the whole relation fork not each
>>> segment.

The reason that's not so attractive is the later observation that what we really care about optimizing is FORGET_RELATION_FSYNC for all the forks of a relation at once, which we could produce just one request for with trivial refactoring of smgrunlink/mdunlink. The above representation doesn't help for that.

So what I'm now thinking is that we should create a second hash table, with key RelFileNode only, carrying two booleans: a cancel-previous-fsyncs bool and a please-unlink-after-checkpoint bool. (The latter field would allow us to drop the separate pending-unlinks data structure.) Entries would be made in this table when we got a FORGET_RELATION_FSYNC or UNLINK_RELATION_REQUEST message -- note that in 99% of cases we'd get both message types for each relation, since they're both created during DROP. (Maybe we could even combine these request types.) To use the table, as we scan the existing per-fork-and-segment hash table, we'd have to do a lookup in the per-relation table to see if there was a later cancel message for that relation. Now this does add a few cycles to the processing of each pendingOpsTable entry in mdsync ... but considering that the major work in that loop is an fsync call, it is tough to believe that anybody would notice an extra hashtable lookup.

However, I also came up with an entirely different line of thought, which unfortunately seems incompatible with either of the improved table designs above.
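Below is a minimal standalone sketch of that two-table scheme: the existing per-fork-and-segment pending-fsync entries plus a second, relation-keyed table carrying the two booleans, consulted once per entry during an mdsync-style scan. The names here (RelFileNodeKey, PendingRelEntry, lookup_rel) and the toy linear lookup are illustrative only; real code would use hash_search on the actual RelFileNode type rather than flat arrays.

    /*
     * Standalone model (not PostgreSQL source) of the proposed second,
     * per-relation table consulted while scanning the per-fork-and-segment
     * pending-fsync table.  All names are illustrative.
     */
    #include <stdbool.h>
    #include <stdio.h>

    typedef unsigned int Oid;

    typedef struct RelFileNodeKey      /* stand-in for RelFileNode */
    {
        Oid spcNode;
        Oid dbNode;
        Oid relNode;
    } RelFileNodeKey;

    typedef struct PendingFsyncEntry   /* per fork+segment, as in pendingOpsTable */
    {
        RelFileNodeKey rnode;
        int  forknum;
        int  segno;
        bool canceled;                 /* per-entry cancel flag */
    } PendingFsyncEntry;

    typedef struct PendingRelEntry     /* proposed per-relation table entry */
    {
        RelFileNodeKey rnode;
        bool cancel_fsyncs;            /* FORGET_RELATION_FSYNC seen */
        bool unlink_requested;         /* UNLINK_RELATION_REQUEST seen */
    } PendingRelEntry;

    /* Toy linear lookup standing in for a HASH_FIND on the per-relation table */
    static PendingRelEntry *
    lookup_rel(PendingRelEntry *tab, int n, RelFileNodeKey key)
    {
        for (int i = 0; i < n; i++)
            if (tab[i].rnode.spcNode == key.spcNode &&
                tab[i].rnode.dbNode == key.dbNode &&
                tab[i].rnode.relNode == key.relNode)
                return &tab[i];
        return NULL;
    }

    int
    main(void)
    {
        PendingFsyncEntry ops[] = {
            {{1663, 1, 16384}, 0, 0, false},
            {{1663, 1, 16385}, 0, 0, false},
        };
        PendingRelEntry rels[] = {
            {{1663, 1, 16385}, true, true},   /* relation 16385 was dropped */
        };

        /* The scan: skip entries whose relation has a pending cancel */
        for (int i = 0; i < 2; i++)
        {
            PendingRelEntry *rel = lookup_rel(rels, 1, ops[i].rnode);

            if (ops[i].canceled || (rel && rel->cancel_fsyncs))
                printf("skip fsync for rel %u\n", ops[i].rnode.relNode);
            else
                printf("fsync rel %u segment %d\n",
                       ops[i].rnode.relNode, ops[i].segno);
        }
        return 0;
    }

The extra cost being weighed in the text is the single lookup_rel call per scanned entry, which is negligible next to the fsync the scan is about to issue.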
It is this: instead of having a request queue that feeds into a hash table hidden within the checkpointer process, what about storing the pending-fsyncs table as a shared hash table in shared memory? That is, ForwardFsyncRequest would not simply try to add the request to a linear array, but would do a HASH_ENTER call on a shared hash table. This means the de-duplication occurs for free and we no longer need CompactCheckpointerRequestQueue at all. Basically, this would amount to saying that the original design was wrong to try to micro-optimize the time spent in ForwardFsyncRequest, and that we'd rather pay a little more per ForwardFsyncRequest call to avoid the enormous response-time spike that will occur when CompactCheckpointerRequestQueue has to run. (Not to mention that the checkpointer would eventually have to do HASH_ENTER anyway.) I think this would address your observation above that the request queue tends to contain an awful lot of duplicates.

But I only see how to make that work with the existing hash table structure, because with either of the other table designs, it's difficult to set a predetermined limit on the amount of shared memory needed. The segment-number bitmaps could grow uncomfortably large in the first design, while in the second there's no good way to know how large the per-relation table has to be to cover a given size for the per-fork-and-segment table. (The sore spot here is that once we've accepted a per-fork entry, failing to record a relation-level cancel for it is not an option, so we can't just return failure.) So if we go that way it seems like we still have the problem of having to do hash_seq_search to implement a cancel. We could possibly arrange for that to be done under shared rather than exclusive lock of the hash table, but nonetheless it's not really fixing the originally complained-of O(N^2) problem.

Another issue, which might be fatal to the whole thing, is that it's not clear that a shared hash table similar in size to the existing request array is big enough. The entries basically need to live for about one checkpoint cycle, and with a slow cycle you could need an arbitrarily large number of them.

A variant that might work a little better is to keep the main request table still in checkpointer private memory, but to have *both* a small hash table and a request queue in shared memory. The idea is that you first try to enter your request in the hash table; if successful, done (and de-duping has happened automatically). If no room left in the hash table, add it to the request queue as normal. The checkpointer periodically empties both the hash table and the queue. The hash table probably doesn't have to be too huge to be effective at de-duping requests ... but having said that, I have no idea exactly how to size it.

So that's a brain dump of some half-baked ideas. Thoughts anyone?

        regards, tom lane
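For the hybrid variant described in the second-to-last paragraph, a minimal standalone model looks like the sketch below: a small fixed-size table absorbs and de-duplicates requests, and the ordinary request queue is used only as overflow when the table is full. All names (FsyncTag, forward_fsync_request, DEDUP_SLOTS) are illustrative, and in the real thing both structures would sit in shared memory under the same lock that protects today's request array rather than in plain static variables.

    /*
     * Standalone model (not PostgreSQL source) of the hybrid idea:
     * a small de-dup table tried first, with a plain request queue
     * as overflow.  All names and sizes are illustrative.
     */
    #include <stdbool.h>
    #include <stdio.h>

    #define DEDUP_SLOTS 8          /* "probably doesn't have to be too huge" */
    #define QUEUE_SLOTS 1024

    typedef struct FsyncTag        /* stand-in for (RelFileNode, fork, segno) */
    {
        unsigned relNode;
        int      forknum;
        int      segno;
        bool     used;
    } FsyncTag;

    static FsyncTag dedup[DEDUP_SLOTS];
    static FsyncTag queue[QUEUE_SLOTS];
    static int      queue_len = 0;

    /* Model of the forwarding step: try the de-dup table first, then the queue. */
    static void
    forward_fsync_request(unsigned relNode, int forknum, int segno)
    {
        int free_slot = -1;

        for (int i = 0; i < DEDUP_SLOTS; i++)
        {
            if (dedup[i].used &&
                dedup[i].relNode == relNode &&
                dedup[i].forknum == forknum &&
                dedup[i].segno == segno)
                return;            /* duplicate: de-duplication happened for free */
            if (!dedup[i].used && free_slot < 0)
                free_slot = i;
        }

        if (free_slot >= 0)
        {
            dedup[free_slot] = (FsyncTag) {relNode, forknum, segno, true};
            return;
        }

        /* No room left in the table: fall back to the plain request queue. */
        if (queue_len < QUEUE_SLOTS)
            queue[queue_len++] = (FsyncTag) {relNode, forknum, segno, true};
    }

    int
    main(void)
    {
        /* Many writes aimed at the same few segments mostly collapse. */
        for (int blk = 0; blk < 10000; blk++)
            forward_fsync_request(16384 + blk % 3, 0, 0);

        int used = 0;
        for (int i = 0; i < DEDUP_SLOTS; i++)
            used += dedup[i].used;
        printf("de-dup table entries: %d, overflow queue entries: %d\n",
               used, queue_len);   /* expect 3 and 0 */
        return 0;
    }

Running the toy main() makes the sizing intuition concrete: thousands of requests aimed at a handful of segments collapse into three table slots, and the overflow queue stays empty until the number of distinct dirty segments exceeds DEDUP_SLOTS.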