Home > mailing lists

Re: Performance Improvement by reducing WAL for Update Operation - Mailing list pgsql-hackers

From	Noah Misch
Subject	Re: Performance Improvement by reducing WAL for Update Operation
Date	October 26, 2012 22:32:42
Msg-id	20121026223233.GD24744@tornado.leadboat.com Whole thread
In response to	Re: Performance Improvement by reducing WAL for Update Operation (Amit kapila <amit.kapila@huawei.com>)
Responses	Re: Performance Improvement by reducing WAL for Update Operation
List	pgsql-hackers

Tree view

On Thu, Oct 25, 2012 at 05:34:14PM +0000, Amit kapila wrote:
> I have fixed all the review comments raised in delta encoding approach raised by you (for needs toast, for now I have
keptthe code as it will not go in patch of encoding for it. however it can be changed.).

> I have also fixed the major comment about this patch by Heikki and Tom [use memcmp to find modified columns]. 

Could you elaborate on your reason for continuing to treat TOAST as a special
case?  As best I recall, the only reason to do so before was the fact that
TOAST can change the physical representation of a column even when executor
did not change its logical content.  Since you're no longer relying on the
executor's opinion of what changed, a TOASTed value is not special.

> The patch containing review comments fixed for delta encoding method is attached with this mail.

In my previous review, I said:
Given [not relying on the executor to know which columns changed], why nottreat the tuple as an opaque series of bytes
andnot worry about datumboundaries?  When several narrow columns change together, say a sequenceof sixteen smallint
columns,you will use fewer binary delta commands byrepresenting the change with a single 32-byte substitution.  If an
UPDATEchangesjust part of a long datum, the delta encoding algorithm will stillbe able to save considerable space.
Thatcase arises in many forms:changing one word in a long string, changing one element in a long array,changing one
fieldof a composite-typed column.  Granted, this makes thechoice of delta encoding algorithm more important.

We may be leaving considerable savings on the table by assuming that column
boundaries are the only modified-range boundaries worth recognizing.  What is
your willingness to explore general algorithms for choosing such boundaries?
Such an investigation may, of course, be a dead end.

If you conclude that finding sub-column similarity is not worthwhile, at least
teach your algorithm to aggregate runs of changing or unchanging columns into
fewer delta instructions.  If a table contains twenty unchanging bool columns,
you currently use at least 80 bytes to encode that fact.  By treating the run
of columns as a unit for delta encoding purposes, you could encode it in 23
bytes.  The same applies for runs of changing columns.
Like Heikki, I'm left wondering why your custom delta encoding ispreferable to an encoding from the literature.  Your
encodinghas much incommon with VCDIFF, even sharing two exact command names.  If a customencoding is the right thing,
codecomments or a README section should atleast discuss the advantages over an established alternative.

Reflecting on those comments, I'm no longer too concerned about your use of a
novel format to encode the delta.  For example, VCDIFF is designed to be
flexible and self-describing; we don't need that.  But I would still review it
for useful ideas when designing the encoding for PostgreSQL.
Idle thought: it might pay off to use 1-byte sizes and offsets most of thetime.  Tuples shorter than 256 bytes are
common;for longer tuples, we canafford wider offsets.

I still think this deserves attention.  You currently need two bits to select
the delta command.  Here are a few strategies to consider:

1) Store the commands in groups of eight.  Each group starts with two bytes
representing the eight commands, followed by the corresponding 2-byte lengths
and ADD data blocks.  The last group may end early.

2) Add extra delta commands HEAP_UPDATE_WAL_OPT_COPY16, _ADD16, _IGN16.  Now
each command specifier needs three bits.  Again store commands in groups of
eight, this time three bytes per group header.  Expect a 1-byte length for the
original commands and a 2-byte length for _*16 commands.

2) Store each 2-bit command with a 14-bit length in a uint16.  We would need
to do something grotty for --with-blocksize=32, where a 14-bit length is not
always enough.

I mentioned that heap_delta_{encode,decode} should have essentially-inverse
argument lists.  That remains unchanged in this version.

Review the comments added and moved in your patch; some have become obsolete.
At least one is missing detail I requested in the previous review.

> + /*
> +  * get_tuple_info - Gets the tuple offset and value.
> +  *
> +  * calculates the attribute value and offset, where the attribute ends in the
> +  * tuple based on the attribute number and previous fetched attribute info.
> +  *
> +  * offset (I/P and O/P variable) - Input as end of previous attribute offset
> +  *        and incase if it is a first attribute then it's value is zero.
> +  *        Output as end of the current attribute in the tuple.
> +  * usecacheoff (I/P and O/P variable) - Attribute cacheoff can be used or not.
> +  */
> + static void
> + get_tuple_info(Form_pg_attribute *att, HeapTuple tuple, bits8 *bp,
> +                bool hasnulls, int attnum, Datum *value, uint16 *offset,
> +                bool *usecacheoff)

This name is too generic and not particularly tied to what the function does.
As far as I can see, this is a variant of heap_getattr() that also returns the
offset and has some redundant arguments.  I suggest heap_getattr_with_offset()
for a name and making its argument list more like heap_getattr().

> +     if (wal_offset < heaptup->t_len)
> +     {
> +         wal_tup->t_len = wal_offset;
> +         wal_tup->t_self = heaptup->t_self;
> +         wal_tup->t_tableOid = heaptup->t_tableOid;
> + 
> +         return true;
> +     }

The passed-in wal_tup happens to always have MaxHeapTupleSize of storage.
However, heap_delta_encode() cannot verify that and does not document it as an
precondition.  Furthermore, it makes no effort to avoid overrunning the buffer
before finally checking the accumulated length here.  The encoded data could
have grown well beyond MaxHeapTupleSize.

> + 
> +     memcpy(wal_tup, heaptup, sizeof(HeapTuple));
> +     return false;

You would need heap_copytuple_with_tuple() here, but perhaps just specify the
contents of wal_tup as undefined when heap_delta_encode() returns false.

> + }

> + #define MinHeapTupleSizeForDeltaUpdate 64

This is now unused.

Overall, there's still plenty to do before this patch is ready.  I'm marking
it Returned with Feedback.  I look forward to the benefits it will bring when
finished.

nm

pgsql-hackers by date:

From: Andres Freund
Date: 26 October 2012, 22:07:13
Subject: Re: foreign key locks

From: Robert Haas
Date: 27 October 2012, 02:19:02
Subject: Re: splitting *_desc routines

Re: Performance Improvement by reducing WAL for Update Operation - Mailing list pgsql-hackers

Previous

Next