Thread: [WIP] [B-Tree] Keep indexes sorted by heap physical location
The attached patch tries to maintain the initial status of B-Tree indexes, which are created with equal-key runs in physical order, during the whole life of the B-Tree, and make key-tid pairs efficiently searchable in the process.

The patch doesn't add an interface to perform key-tid searches, nor does it address all potential issues that could arise from the implementation, or backward compatibility with earlier-format indexes, but it passes all tests, and seems to work well with the workloads I have tried until now, which include:

- dump/restore/compare (with lots of indexes, both unique and non)
- sanity check of index lookups
- running a complex application on top of it, no crashes or corruption (detected)

Recovery hasn't been tested at all, and I'd expect it to have issues - though I have tried to make the necessary changes, I may have missed more than one. Nor have I tested high-concurrency workloads (although the application mentioned does produce some of it), so this is an early WIP in search of feedback on the overall approach. Still, it should be enough to measure the merits of the idea, namely index size, performance and code impact.

A couple of points make me uneasy about this patch, yet I can think of no better alternative, so I seek feedback:

- introducing a different format for inner index tuples makes for an invasive patch and quite difficult-to-reason code (it's easy to forget whether a page is leaf or inner and that now causes assertion failures or segfaults)
- the ascent-descent to check for uniqueness when there are large dead tuple runs could have locking subtleties that escape me. Perhaps there's better ways to solve this.
- the patch changes the output of regression tests in nondeterministic ways, making index scan results always follow physical order. While that's not only not incorrect but desirable, lots of tests depended on the almost deterministic order that resulted from the way a suitable insertion point was selected before. The first patch addresses that, but at the expense of some "coverage" (ignoring some lines of reference output, the minimum I could ignore).
- while I don't see why it would cause more bloat under normal usage than the previous method, there are probably new/different pathological cases to account for, and maybe I'm missing one that is a common pattern
- it's btree, it doesn't get any more critical than that, it really needs a heapload of testing before going in, and some of it (crash recovery) may be tough to get right

On the performance side, I don't see a significant impact. If anything, it has the potential to speed things up, since scans over long runs of equal keys will now be done in physical order, and insertions will be evenly spread around the index leaf pages (due to the spread of heap tuples), but I haven't done any benchmarks beyond pgbench to back that up.

It is also an important enabling feature for making vacuum more efficient, and for implementing WARM tuples in a more general way, so that has to be accounted for when evaluating the potential benefit of this patch.

So, the numbers.
I measured the size impact on a dump of a test database, and it seems under the noise, which is what I expected since it only affects non-leaf pages:

patched db: 23G
  hourly_full: 562M + 790M (118M + 118M + 554M)
  bid_logs:    591M + 268M (134M + 134M)
  uas:         285M + 496M (20M + 138M + 159M + 20M + 159M)

unpatched db: 23G
  hourly_full: 562M + 789M (118M + 118M + 553M)
  bid_logs:    591M + 268M (134M + 134M)
  uas:         285M + 496M (20M + 138M + 159M + 20M + 159M)

Here, numbers are HEAP + INDEXES (i1 + i2 + i3 ...), for tables that are a mix of wide and narrow, with wide indexed columns (varchar) and narrow ones (int). Only a few indexes seem to have grown, but this is before any bloat accumulates (I haven't devised a test yet that produces the same amount of bloat on both databases to make them comparable).

pgbench, in-memory, patched:

transaction type: <builtin: select only>
scaling factor: 100
query mode: simple
number of clients: 8
number of threads: 1
duration: 300 s
number of transactions actually processed: 13271118
latency average: 0.181 ms
tps = 44237.026822 (including connections establishing)
tps = 44237.227678 (excluding connections establishing)

pgbench, in-memory, unpatched:

transaction type: <builtin: select only>
scaling factor: 100
query mode: simple
number of clients: 8
number of threads: 1
duration: 300 s
number of transactions actually processed: 13336128
latency average: 0.180 ms
tps = 44453.718954 (including connections establishing)
tps = 44453.895436 (excluding connections establishing)

pgbench, bigger-than-RAM, patched:

transaction type: <builtin: select only>
scaling factor: 1000
query mode: simple
number of clients: 8
number of threads: 1
duration: 300 s
number of transactions actually processed: 71194
latency average: 33.711 ms
tps = 237.269144 (including connections establishing)
tps = 237.270122 (excluding connections establishing)

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 8
number of threads: 1
duration: 300 s
number of transactions actually processed: 26748
latency average: 89.726 ms
tps = 89.043800 (including connections establishing)
tps = 89.044218 (excluding connections establishing)

pgbench, bigger-than-RAM, unpatched:

transaction type: <builtin: select only>
scaling factor: 1000
query mode: simple
number of clients: 8
number of threads: 1
duration: 300 s
number of transactions actually processed: 86495
latency average: 27.747 ms
tps = 288.256959 (including connections establishing)
tps = 288.258202 (excluding connections establishing)

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 8
number of threads: 1
duration: 300 s
number of transactions actually processed: 27145
latency average: 88.414 ms
tps = 90.461600 (including connections establishing)
tps = 90.462105 (excluding connections establishing)

Which is expected, since pgbench exercises none of the potentially improved paths. I/O here (just a laptop) is horrible, so the difference is probably just noise. The drop in select-only seems strangely high; I think that's also noise, but I haven't run the tests again yet.
On Wed, Aug 17, 2016 at 10:54 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> The attached patch tries to maintain the initial status of B-Tree
> indexes, which are created with equal-key runs in physical order,
> during the whole life of the B-Tree, and make key-tid pairs
> efficiently searchable in the process.

I have two pretty important questions that don't seem to be addressed in your email.

1. How does it do that?

2. What's the benefit?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Aug 18, 2016 at 5:05 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Aug 17, 2016 at 10:54 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> The attached patch tries to maintain the initial status of B-Tree
>> indexes, which are created with equal-key runs in physical order,
>> during the whole life of the B-Tree, and make key-tid pairs
>> efficiently searchable in the process.
>
> I have two pretty important questions that don't seem to be
> addressed in your email.

I believe I addressed that in the README, but if I wasn't clear, let me know and I'll expand there too. Summarizing here:

> 1. How does it do that?

It adds the block number and the offset number of the index tuple's t_tid (that is, the pointer into the heap) as a final (implicit) sort key, as is done already in tuplesort. It just keeps doing that while comparing keys in the B-Tree when searching the leaf page for insertion, so the insertion point now points to the exact point where the insertion has to happen to maintain that condition, and not just to the first valid position within the same-key run.

In fact, that's why non-leaf index tuples need a different format, because while leaf index tuples contain the heap pointer already, non-leaf ones contain only the downlink, not the pointer into the heap. To be able to do comparisons and pick the right downlink, when splitting pages the original heap pointer in the leaf index tuple is copied into the downlink index tuple, in an additional IndexTupleData header that is prepended only to non-leaf index tuples. This representation is chosen to minimize code changes, such that (itup+1) is usable as before, and also because it's reasonably compact. But it *is* necessary to check whether it's a leaf or non-leaf tuple to choose whether to use (itup+1) or just itup, which is why the patch requires so many changes in so many places.

> 2. What's the benefit?

The idea came up in the discussion about WARM updates. Basically, in various situations it is convenient to be able to find the index tuples that point to a specific heap tuple. Without this, doing so isn't efficient - the only way to find such index tuples is to find the leftmost index tuple that has the same key, and walk the index right links until the right tid is found. For non-unique indexes, but also for unique indexes with heavy bloat, this could involve a large amount of I/O, so it isn't efficient in several contexts. Think vacuum and WARM updates mostly (like HOT updates, but where only some indexes don't need updating, and some do).

Having the ability to find an index tuple by key-ctid pair is thus beneficial because it allows efficient implementations of those operations. It may benefit other parts of the code, but those are the ones that come to mind.

It also makes index scans of the form

SELECT * FROM t WHERE some_col = some_const;

scan the heap in sequential order, even if some_col has low cardinality (ie: the query returns lots of tuples), which is a nice performance side effect.
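[To illustrate the comparison rule being described, here is a minimal sketch (an editorial illustration, not code from the patch) of a key comparison with the heap-TID tie-break; keycmp is assumed to already hold the result of the regular scankey comparison:]

    #include "postgres.h"
    #include "access/itup.h"
    #include "storage/itemptr.h"

    /*
     * Sketch only: break key ties with the heap pointer -- block number
     * first, then offset number -- the same order tuplesort already uses
     * for equal keys during CREATE INDEX.  Assumes itup points at the
     * start of the on-page item, so that for non-leaf tuples t_tid is the
     * heap pointer stored in the prepended IndexTupleData header.
     */
    static int32
    bt_compare_with_tid(int32 keycmp, ItemPointer instid, IndexTuple itup)
    {
        if (keycmp != 0)
            return keycmp;          /* keys differ: the TID never matters */

        /* equal keys: the heap TID acts as the final, implicit sort key */
        return ItemPointerCompare(instid, &itup->t_tid);
    }

[A nice property of the prepended-header layout, if it works as described, is that the heap pointer then sits at the same offset for leaf and non-leaf tuples, so a tie-break like this doesn't need to know which kind it was handed.]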
On Thu, Aug 18, 2016 at 3:41 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> It also makes index scans of the form
>
> SELECT * FROM t WHERE some_col = some_const;
>
> scan the heap in sequential order, even if some_col has low
> cardinality (ie: the query returns lots of tuples), which is a nice
> performance side effect.

Speaking of performance side effects, does this avoid O(N^2) performance on index tuple insertion with duplicate values, for all insertion orderings? For example, does it descend directly to the right leaf page for the insert, rather than starting at the front of the block of duplicate values and scanning to the right for a block with space, with a random chance to split a full block on each page it moves through?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Aug 18, 2016 at 6:04 PM, Kevin Grittner <kgrittn@gmail.com> wrote:
> On Thu, Aug 18, 2016 at 3:41 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>
>> It also makes index scans of the form
>>
>> SELECT * FROM t WHERE some_col = some_const;
>>
>> scan the heap in sequential order, even if some_col has low
>> cardinality (ie: the query returns lots of tuples), which is a nice
>> performance side effect.
>
> Speaking of performance side effects, does this avoid O(N^2)
> performance on index tuple insertion with duplicate values, for all
> insertion orderings? For example, does it descend directly to the
> right leaf page for the insert, rather than starting at the front of
> the block of duplicate values and scanning to the right for a
> block with space, with a random chance to split a full block on
> each page it moves through?

Yes, but only on non-unique indexes. Unique indexes still need to scan all duplicates to check visibility and will become O(N^2) there.

The code with the random chance to split is still there, but it should never have a chance to run, because the comparison against the heap tuple pointer would stop it very quickly (before walking a single offset, I believe). I thought about cleaning that up, but to make review easier I thought I'd leave it there for now. Removing it would add a lot of diff noise.
On Thu, Aug 18, 2016 at 1:41 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> In fact, that's why non-leaf index tuples need a different format,
> because while leaf index tuples contain the heap pointer already,
> non-leaf ones contain only the downlink, not the pointer into the
> heap. To be able to do comparisons and pick the right downlink, when
> splitting pages the original heap pointer in the leaf index tuple is
> copied into the downlink index tuple, in an additional IndexTupleData
> header that is prepended only to non-leaf index tuples.

I think that this is a bad idea. We need to implement suffix truncation of internal page index tuples at some point, to make them contain less information from the original leaf page index tuple. That's an important optimization, because it increases fan-in. This seems like a move in the opposite direction.

ISTM that the way to address this problem is with a duplicate list and/or prefix compression in leaf pages.

--
Peter Geoghegan
Claudio Freire wrote:
> Unique indexes still need to scan all duplicates to check visibility
> and will become O(N^2) there.

That scenario doesn't matter, because on unique indexes there aren't many duplicate values anyway -- only one can be a live tuple.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Aug 18, 2016 at 6:15 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Aug 18, 2016 at 1:41 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> In fact, that's why non-leaf index tuples need a different format,
>> because while leaf index tuples contain the heap pointer already,
>> non-leaf ones contain only the downlink, not the pointer into the
>> heap. To be able to do comparisons and pick the right downlink, when
>> splitting pages the original heap pointer in the leaf index tuple is
>> copied into the downlink index tuple, in an additional IndexTupleData
>> header that is prepended only to non-leaf index tuples.
>
> I think that this is a bad idea. We need to implement suffix
> truncation of internal page index tuples at some point, to make them
> contain less information from the original leaf page index tuple.
> That's an important optimization, because it increases fan-in. This
> seems like a move in the opposite direction.

I see that. I could try to measure average depth to measure the impact this had on fan-in.

While it should cut it in half for narrow indexes, half of very high is still high. Wide indexes, which are the ones that would suffer from poor fan-in, would feel this far less, since the overhead is constant.

Even if it does have an impact, I don't see an alternative without also implementing suffix truncation. Perhaps I could try to avoid adding the leaf tid header if it isn't necessary, though I would have to use up the last available flag bit in t_info for that.

> ISTM that the way to address this problem is with a duplicate list
> and/or prefix compression in leaf pages.

Prefix compression is another one I will be looking into eventually, but last time I tried it was far more invasive, so I abandoned it until I could find a less invasive way to do it.
Claudio Freire <klaussfreire@gmail.com> writes:
> On Thu, Aug 18, 2016 at 6:04 PM, Kevin Grittner <kgrittn@gmail.com> wrote:
>> Speaking of performance side effects, does this avoid O(N^2)
>> performance on index tuple insertion with duplicate values, for all
>> insertion orderings? For example, does it descend directly to the
>> right leaf page for the insert, rather than starting at the front of
>> the block of duplicate values and scanning to the right for a
>> block with space, with a random chance to split a full block on
>> each page it moves through?

> Yes, but only on non-unique indexes.

How's that work if the existing entries aren't in TID order (which they will not be, in a pre-existing index)? Or are you assuming you can blow off on-disk compatibility of indexes?

			regards, tom lane
On Thu, Aug 18, 2016 at 6:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Claudio Freire <klaussfreire@gmail.com> writes:
>> Yes, but only on non-unique indexes.
>
> How's that work if the existing entries aren't in TID order (which they
> will not be, in a pre-existing index)? Or are you assuming you can blow
> off on-disk compatibility of indexes?
>
> regards, tom lane

This patch does blow off on-disk compatibility, but the plan is to re-add it later on. A flag in the meta page would indicate whether it's a sorted index or not, and the only way to get a sorted index would be with a reindex. The code would have to support both for a while, until whatever point we'd figure we can drop support for old-format indexes.

Since this is a sort order change, I see no alternative: either the whole index is sorted by TID or not.
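[For concreteness, that kind of compatibility gating could look something like the sketch below -- entirely an editorial illustration: BTMetaPageData has no btm_flags field today, so both the field and the flag bit are assumptions:]

    /*
     * Hypothetical metapage flag distinguishing TID-sorted indexes from
     * old-format ones; only REINDEX would ever set it.
     */
    #define BTM_SORTED_BY_TID   (1 << 0)    /* assumed new flag bit */

    static bool
    bt_index_is_tid_sorted(Relation rel)
    {
        Buffer      metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
        Page        metapg = BufferGetPage(metabuf);
        BTMetaPageData *metad = BTPageGetMeta(metapg);
        /* btm_flags is an assumed addition to BTMetaPageData */
        bool        sorted = (metad->btm_flags & BTM_SORTED_BY_TID) != 0;

        _bt_relbuf(rel, metabuf);
        return sorted;
    }

[Insertion and search paths would then branch on this once per descent, falling back to the old comparison rules for old-format indexes.]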
On Thu, Aug 18, 2016 at 2:26 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> I see that. I could try to measure average depth to measure the impact
> this had on fan-in.
>
> While it should cut it in half for narrow indexes, half of very high
> is still high. Wide indexes, which are the ones that would suffer
> from poor fan-in, would feel this far less, since the overhead is
> constant.

You'll also have to consider the impact on the choice of split point, FWIW. There is currently leeway about what page an index tuple can land on, if and when it happens to be equal to the high-key. I'm not sure how important that is, but it's a consideration.

> Even if it does have an impact, I don't see an alternative without
> also implementing suffix truncation. Perhaps I could try to avoid
> adding the leaf tid header if it isn't necessary, though I would have
> to use up the last available flag bit in t_info for that.

To be clear, I'm particularly worried about painting ourselves into a corner, suffix-truncation-wise. It's a very common optimization that we should really have.

>> ISTM that the way to address this problem is with a duplicate list
>> and/or prefix compression in leaf pages.
>
> Prefix compression is another one I will be looking into eventually,
> but last time I tried it was far more invasive, so I abandoned it until
> I could find a less invasive way to do it.

Unfortunately, sometimes it is necessary to be very ambitious in order to solve a problem. The understandable and usually well-reasoned approach of making progress as incremental as possible occasionally works against contributors. It's worth considering if this is such a case.

--
Peter Geoghegan
On Thu, Aug 18, 2016 at 6:38 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Aug 18, 2016 at 2:26 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> I see that. I could try to measure average depth to measure the impact
>> this had on fan-in.
>>
>> While it should cut it in half for narrow indexes, half of very high
>> is still high. Wide indexes, which are the ones that would suffer
>> from poor fan-in, would feel this far less, since the overhead is
>> constant.
>
> You'll also have to consider the impact on the choice of split point,
> FWIW. There is currently leeway about what page an index tuple can
> land on, if and when it happens to be equal to the high-key. I'm not
> sure how important that is, but it's a consideration.

When I read the code, it seemed generic enough, using ItemIdGetLength (which works on non-leaf index tuples correctly, reporting the right size), so it should still work. But the choice of split point may make a difference for future insertions, so I'll look into that.

>> Even if it does have an impact, I don't see an alternative without
>> also implementing suffix truncation. Perhaps I could try to avoid
>> adding the leaf tid header if it isn't necessary, though I would have
>> to use up the last available flag bit in t_info for that.
>
> To be clear, I'm particularly worried about painting ourselves into a
> corner, suffix-truncation-wise. It's a very common optimization that
> we should really have.

Well, page version numbers could be used to switch between semantically similar page formats later on. So if another format is necessary (say, prefix compression, suffix truncation, etc), it can be changed on-the-fly on existing indexes by checking page version numbers and doing the proper switch. So we won't be painting ourselves into a corner, but picking the wrong format may indeed be a headache.

I can go the way of the flag in t_info if that's preferable. Both would work. It's just that it's the last flag available, and that would be painting ourselves into a corner regarding flags. To avoid that, the flag itself could be "has extra data" (instead of "has leaf tid"), with a whole set of flags in the extra data. That could also help for prefix compression and suffix truncation (a sketch of the idea follows below).

>>> ISTM that the way to address this problem is with a duplicate list
>>> and/or prefix compression in leaf pages.
>>
>> Prefix compression is another one I will be looking into eventually,
>> but last time I tried it was far more invasive, so I abandoned it until
>> I could find a less invasive way to do it.
>
> Unfortunately, sometimes it is necessary to be very ambitious in order
> to solve a problem. The understandable and usually well-reasoned
> approach of making progress as incremental as possible occasionally
> works against contributors. It's worth considering if this is such a
> case.

I'd agree if this regressed performance considerably without the other improvements, so I guess this will depend on the fan-in measurements.
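[To make the "has extra data" idea concrete, a rough sketch -- an editorial illustration; the mask value, struct, and flag names are all assumptions, not the patch's:]

    /*
     * Sketch: spend the one remaining t_info bit on "has extra data", and
     * keep an extensible flag word in the extra data itself, so future
     * features don't need more t_info bits.
     */
    #define INDEX_EXTRA_MASK    0x2000      /* the last free bit in t_info */

    typedef struct BTExtraData
    {
        uint16      btx_flags;              /* BTX_* flags below */
        /* optional payloads (heap TID, etc.) follow, per btx_flags */
    } BTExtraData;

    #define BTX_HAS_HEAP_TID    0x0001      /* a heap TID follows */
    #define BTX_SUFFIX_TRUNC    0x0002      /* key is suffix-truncated */
    #define BTX_PREFIX_COMP     0x0004      /* key is prefix-compressed */

    #define IndexTupleHasExtra(itup) \
        (((itup)->t_info & INDEX_EXTRA_MASK) != 0)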
On Thu, Aug 18, 2016 at 5:04 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> But the choice of split point may make a difference for future
> insertions, so I'll look into that.

A database product I worked on a long time ago had an interesting optimization for indexes that had multiple insertion points, each of which was increasing. (Picture, for example, 100 case types, with case numbers inserted in sequence within each case type -- so the first three felony cases would be CF000001, CF000002, and CF000003; while the first three small claims cases would be SC000001, SC000002, and SC000003.) That's not a terribly rare usage pattern, in my experience. An index on such data is most efficient if you split at the point of the index tuple being inserted -- making it the last tuple on the left-hand page. I don't know whether we might want to consider making that an option somehow -- it just came to mind because of this discussion.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
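[A minimal sketch of the heuristic being described -- an editorial illustration only; the real _bt_findsplitloc() interface is more involved, and no such option exists:]

    /*
     * Sketch: when the incoming tuple extends a strictly ascending run,
     * split exactly at the insertion point, leaving the new tuple as the
     * last item on the left page.  (Rightmost-page splits already lean
     * right in a similar spirit via fillfactor.)
     */
    static OffsetNumber
    choose_split_offset(Page page, OffsetNumber newitemoff, bool ascending_run)
    {
        if (ascending_run)
            return newitemoff;      /* new tuple ends up last on the left */

        /* otherwise fall back to a middle split */
        return PageGetMaxOffsetNumber(page) / 2;
    }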
On Thu, Aug 18, 2016 at 6:26 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Thu, Aug 18, 2016 at 6:15 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> On Thu, Aug 18, 2016 at 1:41 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>>> In fact, that's why non-leaf index tuples need a different format,
>>> because while leaf index tuples contain the heap pointer already,
>>> non-leaf ones contain only the downlink, not the pointer into the
>>> heap. To be able to do comparisons and pick the right downlink, when
>>> splitting pages the original heap pointer in the leaf index tuple is
>>> copied into the downlink index tuple, in an additional IndexTupleData
>>> header that is prepended only to non-leaf index tuples.
>>
>> I think that this is a bad idea. We need to implement suffix
>> truncation of internal page index tuples at some point, to make them
>> contain less information from the original leaf page index tuple.
>> That's an important optimization, because it increases fan-in. This
>> seems like a move in the opposite direction.
>
> I see that. I could try to measure average depth to measure the impact
> this had on fan-in.
>
> While it should cut it in half for narrow indexes, half of very high
> is still high. Wide indexes, which are the ones that would suffer
> from poor fan-in, would feel this far less, since the overhead is
> constant.
>
> Even if it does have an impact, I don't see an alternative without
> also implementing suffix truncation. Perhaps I could try to avoid
> adding the leaf tid header if it isn't necessary, though I would have
> to use up the last available flag bit in t_info for that.

So, I did some tests. TL;DR version is, as expected, there is a big impact on narrow indexes, but I don't believe it's something to be worried about.

**Begin LONG version** (skip to end long version if not interested)

Regardless of the fanout (or fanin, if you look at it that way), what matters is tree depth, so I used pageinspect to check how the depth of indexes changed after patching. This is all on pristine, recently built indexes. I will try to measure the same on bloated indexes later on, but the tests are painfully slow.

I used the following query to inspect all user indexes, since pageinspect failed to work for toast indexes for some reason (it probably needed qualified relnames or something like that). Anyway, user indexes should be enough.

select level, count(*), pg_size_pretty(max(relpages) * 8192.0) as biggest
from pg_class, bt_metap(relname)
where relkind = 'i' and relnamespace = 2200 and relam = 403
group by level order by level desc;

I tested it on both patched and unpatched versions of my test database (roughly 23GB), pgbench scale 1000 (15GB) and pgbench scale 10000 (146GB). I believe pgbench is a worst-case example, because it only has very narrow PK indexes, which are the ones that will suffer the most from this patch.
The numbers:

mat, patched
level;count;biggest
3;18;"1369 MB"
2;60;"322 MB"
1;26;"808 kB"
0;50;"16 kB"

mat, unpatched
level;count;biggest
3;15;"1367 MB"
2;63;"334 MB"
1;26;"800 kB"
0;49;"16 kB"

pgbench, scale 1000, patched
level;count;biggest
3;1;"2145 MB"
1;2;"456 kB"

pgbench, scale 1000, unpatched
level;count;biggest
3;1;"2142 MB"
1;2;"456 kB"

pgbench, scale 10000, patched
level;count;biggest
3;1;"21 GB"
2;1;"2224 kB"
1;1;"240 kB"

pgbench, scale 10000, unpatched
level;count;biggest
3;1;"21 GB"
1;2;"2208 kB"

So, clearly some indexes are pushed to the next level, but it isn't a widespread effect - it looks more like the threshold index size for needing a root split is slightly pushed downwards. It totally agrees with the math. The index tuple header is 8 bytes, and the datums need to be maxaligned so they will also be 8 bytes at least. That means the B in B-tree gets cut to 2/3 at the worst (inner tuples grow by 50%), but only for inner tuples, and so you get that if

    a * log_(B)(N/B) = log_(B/(3/2))(N/B)

then for big enough N

    a < (3/2) * log_(B)(N) - 2

Which means the depth of huge B-trees is increased by 50%, but the effect only begins to be seen at level 2, which is what we have here, and level 2 for such a narrow index seems to be about 2GB. So we're talking about very big indexes already. If level 2 can hold 2GB of index, level 3 can hold around 2G x 8k / 16 = 1TB of index, and I have yet to see (even in very big databases) a 1TB index on a PK. If anyone has such an index, yes, this patch will hurt performance by 25% (4 page reads instead of 3). Bad, but since it only affects very extreme cases, it doesn't seem so bad.

**End LONG version**

So, that said, I started mocking up a version of the patch that uses a "has aux data" flag to signal another set of flags from which variable-size attributes can be read, which would enable implementation of suffix truncation and prefix compression in the future with minimal data format changes. I was surprised to find that it seems not only more compact, but cleaner. So I will probably go that way.
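[To make the fan-in arithmetic concrete, a rough worked example under the same assumptions -- 8kB pages, 8-byte maxaligned key, ignoring per-page and line-pointer overhead:]

    inner tuple, unpatched: 8B header + 8B key          = 16 B
    inner tuple, patched:   8B extra TID header + 16 B  = 24 B

    fan-in, unpatched: ~8192 / 16 = ~512 downlinks per inner page
    fan-in, patched:   ~8192 / 24 = ~341 downlinks per inner page (2/3)

    depth for N = 10^9 leaf tuples (order of magnitude):
      unpatched: log(10^9) / log(512) = ~3.3 levels
      patched:   log(10^9) / log(341) = ~3.6 levels

[Which is consistent with the observation above: the extra header mostly just lowers the index size at which one more level becomes necessary.]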
On Thu, Aug 18, 2016 at 8:24 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
>
> A couple of points make me uneasy about this patch, yet I can think of
> no better alternative, so I seek feedback:
>
> - introducing a different format for inner index tuples makes for an
> invasive patch and quite difficult-to-reason code (it's easy to forget
> whether a page is leaf or inner and that now causes assertion failures
> or segfaults)
> - the ascent-descent to check for uniqueness when there are large
> dead tuple runs could have locking subtleties that escape me. Perhaps
> there's better ways to solve this.

I have tried to study this part of your patch and it seems a somewhat non-trivial and risky part of the proposal.

+	} else {
+		/*
+		 * We found the actual first item on another block, so we have to perform
+		 * a two-step search - first half, until our write-locked buffer, then another
+		 * starting from our write-locked buffer.
+		 */
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+		LockBuffer(buf, BT_WRITE);
+
+		buf = _bt_moveright(rel, buf, natts, itup_scankey, &(itup->t_tid), false,
+							true, stack, BT_WRITE, NULL);
+
+		first_offset = _bt_binsrch(rel, buf, natts, itup_scankey, NULL, false, NULL);
+
+		xwait = _bt_check_unique(rel, itup, heapRel, nbuf, buf, first_offset, itup_scankey,
+								 checkUnique, &is_unique, &speculativeToken);
+
+		_bt_relbuf(rel, nbuf);
+	}

The idea for uniqueness check is that we hold the write lock on buffer/page on which we are trying to operate (we do take read locks on the consecutive pages during this check). Here, in your changed proposal, you have two buffers (one to which the search led you and one buffer previous in the chain) and before checking uniqueness, on one of the buffers you seem to have write lock and on other you have read lock. This seems to change the way uniqueness check works till now; can you explain how this works (can we safely assume that all searches for uniqueness check will lead to the same buffer first)?

With this new mechanism, do we have two types of search interfaces, where one would work for keys (as now) and the other for key-ctid, or will it be just a single interface which works both ways? I think either way there are pros and cons.

Also, in the thread below you are talking about using the last bit in t_info; I want to bring it to your notice that there is a patch of mine [1] in which I have used it to avoid changing on-disk compatibility of hash indexes. I am not saying that we should not plan it for some other use, but just to keep in mind that there are other proposals to use it. We can decide what is the best way to proceed if we are aware of all the proposals that want to use it.

[1] - https://www.postgresql.org/message-id/CAA4eK1LkQ_Udism-Z2Dq6cUvjH3dB5FNFNnEzZBPsRjw0haFqA@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, Aug 20, 2016 at 1:05 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Aug 18, 2016 at 8:24 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
>>
>> A couple of points make me uneasy about this patch, yet I can think of
>> no better alternative, so I seek feedback:
>>
>> - introducing a different format for inner index tuples makes for an
>> invasive patch and quite difficult-to-reason code (it's easy to forget
>> whether a page is leaf or inner and that now causes assertion failures
>> or segfaults)
>> - the ascent-descent to check for uniqueness when there are large
>> dead tuple runs could have locking subtleties that escape me. Perhaps
>> there's better ways to solve this.
>
> I have tried to study this part of your patch and it seems a somewhat
> non-trivial and risky part of the proposal.
>
> +	} else {
> +		/*
> +		 * We found the actual first item on another block, so we have to perform
> +		 * a two-step search - first half, until our write-locked buffer, then another
> +		 * starting from our write-locked buffer.
> +		 */
> +		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
> +		LockBuffer(buf, BT_WRITE);
> +
> +		buf = _bt_moveright(rel, buf, natts, itup_scankey, &(itup->t_tid), false,
> +							true, stack, BT_WRITE, NULL);
> +
> +		first_offset = _bt_binsrch(rel, buf, natts, itup_scankey, NULL, false, NULL);
> +
> +		xwait = _bt_check_unique(rel, itup, heapRel, nbuf, buf, first_offset, itup_scankey,
> +								 checkUnique, &is_unique, &speculativeToken);
> +
> +		_bt_relbuf(rel, nbuf);
> +	}
>
> The idea for uniqueness check is that we hold the write lock on
> buffer/page on which we are trying to operate (we do take read locks
> on the consecutive pages during this check). Here, in your changed
> proposal, you have two buffers (one to which the search led you and
> one buffer previous in the chain) and before checking uniqueness, on
> one of the buffers you seem to have write lock and on other you have
> read lock. This seems to change the way uniqueness check works till
> now; can you explain how this works (can we safely assume that all
> searches for uniqueness check will lead to the same buffer first)?

All uniqueness checks will find the same "nbuf" (read-locked), which is the first one that matches the key.

They could have different "buf" buffers (the write-locked one). But since read locks cannot be held while a write-lock is held, I believe it doesn't matter. All uniqueness checks will try to read-lock nbuf unless the uniqueness check for that insertion is done. When they do, they will block until the other insertion is finished, and when that happens, either the insert happened, or it returned an xid to wait for and retry. If it succeeded, the check attempting the read-lock will see the inserted tuple and return the xid of the earlier transaction and wait on it. If it failed, no insert happened because there's a visible tuple there, and the read-lock will succeed, find the visible tuple, and the insert will fail too.

In essence, it is still the case, I believe, that only one insert can succeed at a time, all the others will retry. It only happens that with the patch, the contention will be between a read lock and a write lock, and not two write locks, and there's a little less contention (some more work can be done in parallel).

> With this new mechanism, do we have two types of search interfaces,
> where one would work for keys (as now) and the other for key-ctid,
> or will it be just a single interface which works both ways? I think
> either way there are pros and cons.
The patch doesn't add another interface, but the idea is to add it. The implementation will probably fall on a common function that can do both, but some index AMs can't implement the key-ctid operation (hash indexes), so they will have to be separate interfaces, to be able to properly represent the separate capabilities (besides, other implementations may well use completely different paths for the two operations).

> Also, in the thread below you are talking about using the last bit in
> t_info; I want to bring it to your notice that there is a patch of
> mine [1] in which I have used it to avoid changing on-disk
> compatibility of hash indexes. I am not saying that we should not
> plan it for some other use, but just to keep in mind that there are
> other proposals to use it. We can decide what is the best way to
> proceed if we are aware of all the proposals that want to use it.
>
> [1] - https://www.postgresql.org/message-id/CAA4eK1LkQ_Udism-Z2Dq6cUvjH3dB5FNFNnEzZBPsRjw0haFqA@mail.gmail.com

I wasn't aware of that particular thread, but there's another (the WARM one) where there's another proposed use of that bit.

IMHO, any use of the bit that doesn't leave room for further extensions of the bit space is a bad idea, because it closes the door on (easy) further flags being added. And the fact that there are already several proposals to use that bit for different purposes only reinforces that idea.

The idea I'm working on allows further extension, and, although it uses more space, I believe it's a better way. But we'll discuss that when I have the implementation, I guess.
On Sat, Aug 20, 2016 at 1:53 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> All uniqueness checks will try to read-lock nbuf
> unless the uniqueness check for that insertion is done.

That should read "all uniqueness checks will try to read-lock the buf unless the uniqueness check for that insertion is done." (nbuf -> buf)
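[Putting the last few messages together, the protocol reads roughly like this -- an editorial paraphrase as a comment outline, not code from the patch:]

    /*
     * Uniqueness check with TID-ordered keys, roughly:
     *
     * 1. Descend with (key, TID) and write-lock the insertion page.
     * 2. If the first tuple matching the key lives on an earlier page,
     *    re-descend by key alone and read-lock that first page ("buf").
     * 3. Run _bt_check_unique() scanning right from the first page;
     *    concurrent inserters of the same key serialize on that page's
     *    lock, so only one insert can succeed at a time -- the others
     *    either see the winner's tuple or get an xid to wait on, then
     *    retry.
     * 4. Release the first page and complete the insertion on the
     *    write-locked page.
     */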
On Sat, Aug 20, 2016 at 10:23 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Sat, Aug 20, 2016 at 1:05 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Thu, Aug 18, 2016 at 8:24 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
>>>
>>> A couple of points make me uneasy about this patch, yet I can think of
>>> no better alternative, so I seek feedback:
>>>
>>> - introducing a different format for inner index tuples makes for an
>>> invasive patch and quite difficult-to-reason code (it's easy to forget
>>> whether a page is leaf or inner and that now causes assertion failures
>>> or segfaults)
>>> - the ascent-descent to check for uniqueness when there are large
>>> dead tuple runs could have locking subtleties that escape me. Perhaps
>>> there's better ways to solve this.
>>
>> I have tried to study this part of your patch and it seems a somewhat
>> non-trivial and risky part of the proposal.
>>
>> +	} else {
>> +		/*
>> +		 * We found the actual first item on another block, so we have to perform
>> +		 * a two-step search - first half, until our write-locked buffer, then another
>> +		 * starting from our write-locked buffer.
>> +		 */
>> +		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
>> +		LockBuffer(buf, BT_WRITE);
>> +
>> +		buf = _bt_moveright(rel, buf, natts, itup_scankey, &(itup->t_tid), false,
>> +							true, stack, BT_WRITE, NULL);
>> +
>> +		first_offset = _bt_binsrch(rel, buf, natts, itup_scankey, NULL, false, NULL);
>> +
>> +		xwait = _bt_check_unique(rel, itup, heapRel, nbuf, buf, first_offset, itup_scankey,
>> +								 checkUnique, &is_unique, &speculativeToken);
>> +
>> +		_bt_relbuf(rel, nbuf);
>> +	}
>>
>> The idea for uniqueness check is that we hold the write lock on
>> buffer/page on which we are trying to operate (we do take read locks
>> on the consecutive pages during this check). Here, in your changed
>> proposal, you have two buffers (one to which the search led you and
>> one buffer previous in the chain) and before checking uniqueness, on
>> one of the buffers you seem to have write lock and on other you have
>> read lock. This seems to change the way uniqueness check works till
>> now; can you explain how this works (can we safely assume that all
>> searches for uniqueness check will lead to the same buffer first)?
>
> All uniqueness checks will find the same "nbuf" (read-locked), which
> is the first one that matches the key.
>
> They could have different "buf" buffers (the write-locked one). But
> since read locks cannot be held while a write-lock is held, I believe
> it doesn't matter. All uniqueness checks will try to read-lock nbuf
> unless the uniqueness check for that insertion is done. When they do,
> they will block until the other insertion is finished, and when that
> happens, either the insert happened, or it returned an xid to wait for
> and retry. If it succeeded, the check attempting the read-lock will
> see the inserted tuple and return the xid of the earlier transaction
> and wait on it. If it failed, no insert happened because there's a
> visible tuple there, and the read-lock will succeed, find the visible
> tuple, and the insert will fail too.
>
> In essence, it is still the case, I believe, that only one insert can
> succeed at a time, all the others will retry. It only happens that
> with the patch, the contention will be between a read lock and a write
> lock, and not two write locks, and there's a little less contention
> (some more work can be done in parallel).
>
>> With this new mechanism, do we have two types of search interfaces,
>> where one would work for keys (as now) and the other for key-ctid,
>> or will it be just a single interface which works both ways? I think
>> either way there are pros and cons.
>
> The patch doesn't add another interface, but the idea is to add it.
> The implementation will probably fall on a common function that can do
> both

That makes sense, but this means there is a chance that the searches could lead to different buffers in case of uniqueness checks (the search with key-ctid could lead to a different buffer than the search with just the key). I am not clear whether we have a requirement for doing this uniqueness check for the key-ctid search API, because as I understand it, you want to do it mainly for vacuum and WARM tuples. Vacuum won't add new tuples, so is this required for WARM tuples?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, Aug 20, 2016 at 4:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> All uniqueness checks will find the same "nbuf" (read-locked), which
>> is the first one that matches the key.
>>
>> They could have different "buf" buffers (the write-locked one). But
>> since read locks cannot be held while a write-lock is held, I believe
>> it doesn't matter. All uniqueness checks will try to read-lock nbuf
>> unless the uniqueness check for that insertion is done. When they do,
>> they will block until the other insertion is finished, and when that
>> happens, either the insert happened, or it returned an xid to wait for
>> and retry. If it succeeded, the check attempting the read-lock will
>> see the inserted tuple and return the xid of the earlier transaction
>> and wait on it. If it failed, no insert happened because there's a
>> visible tuple there, and the read-lock will succeed, find the visible
>> tuple, and the insert will fail too.
>>
>> In essence, it is still the case, I believe, that only one insert can
>> succeed at a time, all the others will retry. It only happens that
>> with the patch, the contention will be between a read lock and a write
>> lock, and not two write locks, and there's a little less contention
>> (some more work can be done in parallel).
>>
>>> With this new mechanism, do we have two types of search interfaces,
>>> where one would work for keys (as now) and the other for key-ctid,
>>> or will it be just a single interface which works both ways? I think
>>> either way there are pros and cons.
>>
>> The patch doesn't add another interface, but the idea is to add it.
>> The implementation will probably fall on a common function that can do
>> both
>
> That makes sense, but this means there is a chance that the searches
> could lead to different buffers in case of uniqueness checks (the
> search with key-ctid could lead to a different buffer than the search
> with just the key). I am not clear whether we have a requirement for
> doing this uniqueness check for the key-ctid search API, because as I
> understand it, you want to do it mainly for vacuum and WARM tuples.
> Vacuum won't add new tuples, so is this required for WARM tuples?

Well, I'm not really sure what exactly would need to be done when doing the WARM conditional insert in the case of unique indexes, and that's a strong motivation to not add the interface for inserts just yet.

Vacuum will only need to delete, and in the case of deletes the operation would be quite straightforward.
On Sat, Aug 20, 2016 at 9:58 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Sat, Aug 20, 2016 at 4:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> That makes sense, but this means there is a chance that the searches
>> could lead to different buffers in case of uniqueness checks (the
>> search with key-ctid could lead to a different buffer than the search
>> with just the key). I am not clear whether we have a requirement for
>> doing this uniqueness check for the key-ctid search API, because as I
>> understand it, you want to do it mainly for vacuum and WARM tuples.
>> Vacuum won't add new tuples, so is this required for WARM tuples?
>
> Well, I'm not really sure what exactly would need to be done when doing
> the WARM conditional insert in the case of unique indexes, and that's
> a strong motivation to not add the interface for inserts just yet.
>
> Vacuum will only need to delete, and in the case of deletes the
> operation would be quite straightforward.

Right. I think if you initially modify the interface only for deletes, that will reduce the complexity of the patch as well. We can later enhance it to handle the WARM tuple case.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
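[For illustration, a delete-only key-ctid entry point might look roughly like this -- entirely hypothetical: bt_search_ktid() and bt_binsrch_ktid() don't exist, they stand in for TID-aware versions of _bt_search()/_bt_binsrch(); WAL-logging and page-level cleanup are omitted:]

    /*
     * Hypothetical: delete the index tuple matching (key, heaptid)
     * exactly, using the TID-sorted ordering to descend straight to the
     * one leaf page that can contain it.
     */
    static bool
    bt_delete_ktid(Relation rel, ScanKey key, int natts, ItemPointer heaptid)
    {
        Buffer      buf;
        Page        page;
        OffsetNumber off;
        bool        found = false;

        (void) bt_search_ktid(rel, natts, key, heaptid, &buf, BT_WRITE);
        page = BufferGetPage(buf);
        off = bt_binsrch_ktid(rel, buf, natts, key, heaptid);

        if (off <= PageGetMaxOffsetNumber(page))
        {
            IndexTuple  itup = (IndexTuple)
                PageGetItem(page, PageGetItemId(page, off));

            /* exact TID match required: no walking right over duplicates */
            if (ItemPointerEquals(&itup->t_tid, heaptid))
            {
                PageIndexTupleDelete(page, off);
                found = true;
            }
        }

        _bt_relbuf(rel, buf);
        return found;
    }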
On Thu, Aug 18, 2016 at 2:15 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I think that this is a bad idea. We need to implement suffix
> truncation of internal page index tuples at some point, to make them
> contain less information from the original leaf page index tuple.
> That's an important optimization, because it increases fan-in. This
> seems like a move in the opposite direction.

Maybe I was too hasty, here. Can you rebase this patch, Claudio?

It's possible that this idea could have further non-obvious benefits. I see some potential for this to help with nbtree page deletion, item recycling, etc. For example, maybe VACUUM can manage to do something much closer to a regular index scan when that seems to make sense -- Andres (CC'd) has talked about this before.

And, maybe we get a benefit from localizing where the LP_DEAD bit can be set for the kill_prior_tuple stuff to work in UPDATE-heavy workloads. Naturally, there'd often be some kind of locality in terms of older row versions belonging earlier in the heap, versions which can be concentrated onto one leaf page with this patch. As newer updates make it look like the leaf page needs to be split, there is a concentration of LP_DEAD-set line items to reclaim space from, letting some UPDATEs (index tuple inserters) defer the split. There is a lot less value in the LP_DEAD stuff if reclaimable line items are sprayed all over the place, which seems quite possible to me on current versions.

I also think that this might help with bitmap index scans.

--
Peter Geoghegan
On Fri, Nov 18, 2016 at 10:05 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Aug 18, 2016 at 2:15 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> I think that this is a bad idea. We need to implement suffix
>> truncation of internal page index tuples at some point, to make them
>> contain less information from the original leaf page index tuple.
>> That's an important optimization, because it increases fan-in. This
>> seems like a move in the opposite direction.
>
> Maybe I was too hasty, here. Can you rebase this patch, Claudio?

I've been changing the on-disk format considerably, to one that makes more sense.

I haven't posted it because it still doesn't have suffix (and prefix) truncation, but it should be compatible with it (ie: it could be implemented as an optimization that doesn't change the on-disk format).

However, I haven't tested the current state of the patch thoroughly.

If a WIP is fine, I can post what I have, rebased.

> It's possible that this idea could have further non-obvious benefits.
> I see some potential for this to help with nbtree page deletion, item
> recycling, etc. For example, maybe VACUUM can manage to do something
> much closer to a regular index scan when that seems to make sense --

That was on my mind too.

> I also think that this might help with bitmap index scans.

How so? AFAIK it helps regular scans on low-cardinality indexes more than bitmap index scans.

With low-cardinality indexes, the resulting heap access pattern will be an unordered series of sequential range scans, which is better than the fully random scan the current layout causes. Bitmap index scans, however, deny that benefit.
On Fri, Nov 18, 2016 at 5:27 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> I've been changing the on-disk format considerably, to one that makes
> more sense.

I can see how that would be necessary. I'm going to suggest some more things that need a new btree version number now, too.

I realized today, quite suddenly, why I disliked your original patch: it didn't go far enough with embracing the idea of heap-item-pointer-as-key. In other words, I didn't like that you didn't change anything in the internal pages. Maybe you should put heap TIDs someplace in new internal pages, so that even there we treat them as part of the key. This may significantly benefit UPDATE-heavy workloads where HOT doesn't work out so well. Particularly for non-unique indexes, we currently have to wade through a lot of bloat from the leaf level, rather than jumping straight to the correct leaf page when updating or inserting.

You might not want to do the same with unique indexes, where we must initially buffer lock "the first leaf page that a duplicate could be on" while we potentially look in one or more additional sibling pages (but bloat won't be so bad there, perhaps). Or, you might want to make sure that B-Tree tuple insertions still find that "first page" to check, despite still generally treating heap item pointer as part of the key proper (in unique indexes, it can still behave like NULL, and not participate in uniqueness checking, while still essentially being part of the key/sort order).

Of course, it isn't great to add stuff to the internal pages, but if you're also implementing suffix truncation, there is probably no harm at all. You're only going to affect cases that will also greatly benefit from having that extra information, to guide insertions of new values to the right place more efficiently (by inserters, I mostly mean insertions from UPDATEs). It also occurs to me that our performance for updates in the event of many NULL values in indexes is probably really bad, and could be helped by this. You'll want to test all of this with a workload that isn't very sympathetic to HOT, of course -- most of these benefits are seen when HOT doesn't help.

> However, I haven't tested the current state of the patch thoroughly.
>
> If a WIP is fine, I can post what I have, rebased.

I'm mostly curious about the effects on bloat. I now feel like you haven't sufficiently examined the potential benefits there, since you never made heap item pointer a first class part of the key space. Maybe you'd like to look into that yourself first?

>> I also think that this might help with bitmap index scans.
>
> How so?

I was thinking about teaching nbtree to start the scan at the level above the leaf level. If the heap TID was part of the key proper, one can imagine it being much cheaper to build a bitmap for a moderate cardinality value (e.g., NULL values needed for "WHERE foo IS NULL" index scans). The exact details would need to be worked out, but one can imagine building a bitmap that is much larger than necessary one level up from the leaf, then lazily unsetting bits as we descend the B-Tree to the leaf level and find pages whose bitmap bits can be unset. We might balance the avoidance of I/O during the scan against how much we might expect to save in a subsequent bitmap heap scan, and so on. This might be based on a selectivity estimate. That's all fairly hand-wavy, certainly, but I see significant potential along these lines.

--
Peter Geoghegan
On Fri, Nov 18, 2016 at 11:09 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Fri, Nov 18, 2016 at 5:27 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> I've been changing the on-disk format considerably, to one that makes
>> more sense.
>
> I can see how that would be necessary. I'm going to suggest some more
> things that need a new btree version number now, too.
>
> I realized today, quite suddenly, why I disliked your original patch:
> it didn't go far enough with embracing the idea of
> heap-item-pointer-as-key. In other words, I didn't like that you
> didn't change anything in the internal pages.

But it did. In fact it only changed internal pages.

> Maybe you should put
> heap TIDs someplace in new internal pages, so that even there we treat
> them as part of the key.

That is indeed what's happening (both in the old and new version).

The new version also opens up to the possibility of omitting the heap TID in inner pages, if they're redundant (ie: if two consecutive keys are different already, the heap TID part of the key is redundant, since it's not necessary to make tree descent decisions).

> This may significantly benefit UPDATE-heavy
> workloads where HOT doesn't work out so well. Particularly for
> non-unique indexes, we currently have to wade through a lot of bloat
> from the leaf level, rather than jumping straight to the correct leaf
> page when updating or inserting.

That is improved in the patch as well. The lookup for an insertion point for a heavily bloated (unique or not) index is done efficiently, instead of the O(N^2) way it was done before.

> You might not want to do the same with unique indexes, where we must
> initially buffer lock "the first leaf page that a duplicate could be
> on" while we potentially look in one or more additional sibling pages
> (but bloat won't be so bad there, perhaps).

It's done even for unique indexes. Locking in that case becomes complex, true, but I believe it's not an insurmountable problem.

> Or, you might want to make
> sure that B-Tree tuple insertions still find that "first page" to
> check, despite still generally treating heap item pointer as part of
> the key proper (in unique indexes, it can still behave like NULL, and
> not participate in uniqueness checking, while still essentially being
> part of the key/sort order).

It behaves like that (even though it's not coded like that).

> It also occurs to me that our performance for updates in the event of
> many NULL values in indexes is probably really bad, and could be
> helped by this. You'll want to test all of this with a workload that
> isn't very sympathetic to HOT, of course -- most of these benefits are
> seen when HOT doesn't help.

I haven't really tested with an overabundance of NULLs; I'll add that to the tests. I have tested low-cardinality indexes, though, but only for correctness.

>> However, I haven't tested the current state of the patch thoroughly.
>>
>> If a WIP is fine, I can post what I have, rebased.
>
> I'm mostly curious about the effects on bloat. I now feel like you
> haven't sufficiently examined the potential benefits there, since you
> never made heap item pointer a first class part of the key space.
> Maybe you'd like to look into that yourself first?

What do you mean with "first class part"?

It's not included in the scankey for a variety of reasons, not the least of which is not breaking the interface for current uses, and because the tid type is quite limited in its capabilities ATM.
Also, the heap TID is usually there on most AM calls that care about it (ie: insertions have it, of course), so adding it to the scankey also seemed redundant.

If not adding to the scankey, what do you mean by "first class" then?
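[To illustrate the redundancy rule mentioned earlier in this message's thread (the heap TID can be omitted from an inner-page key whenever the keys around the split point already differ), a sketch; keys_equal() is an assumed helper comparing all indexed attributes:]

    /*
     * Sketch: the downlink key only needs to carry the heap TID when the
     * last tuple on the left page and the first on the right compare as
     * equal -- otherwise the key alone decides every descent.
     */
    static bool
    pivot_needs_heap_tid(Relation rel, IndexTuple lastleft, IndexTuple firstright)
    {
        return keys_equal(rel, lastleft, firstright);   /* assumed helper */
    }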
On Mon, Nov 21, 2016 at 9:42 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> I realized today, quite suddenly, why I disliked your original patch:
>> it didn't go far enough with embracing the idea of
>> heap-item-pointer-as-key. In other words, I didn't like that you
>> didn't change anything in the internal pages.
>
> But it did. In fact it only changed internal pages.

Oh, okay.

>> Maybe you should put
>> heap TIDs someplace in new internal pages, so that even there we treat
>> them as part of the key.
>
> That is indeed what's happening (both in the old and new version).

Good.

> The new version also opens up to the possibility of omitting the heap
> TID in inner pages, if they're redundant (ie: if two consecutive keys
> are different already, the heap TID part of the key is redundant,
> since it's not necessary to make tree descent decisions).

Makes sense; this is just another aspect of suffix truncation.

>> This may significantly benefit UPDATE-heavy
>> workloads where HOT doesn't work out so well. Particularly for
>> non-unique indexes, we currently have to wade through a lot of bloat
>> from the leaf level, rather than jumping straight to the correct leaf
>> page when updating or inserting.
>
> That is improved in the patch as well. The lookup for an insertion
> point for a heavily bloated (unique or not) index is done efficiently,
> instead of the O(N^2) way it was done before.

Cool.

> It's done even for unique indexes. Locking in that case becomes
> complex, true, but I believe it's not an insurmountable problem.

I actually don't think that that would be all that complicated. It's just one case where you have to mostly match the existing behavior.

>> Or, you might want to make
>> sure that B-Tree tuple insertions still find that "first page" to
>> check, despite still generally treating heap item pointer as part of
>> the key proper (in unique indexes, it can still behave like NULL, and
>> not participate in uniqueness checking, while still essentially being
>> part of the key/sort order).
>
> It behaves like that (even though it's not coded like that).

Why not? What do you mean?

What I'm interested in evaluating is an implementation where an index on (foo) has a sort order/key space that is precisely the same as an index on (foo, ctid), with zero exceptions.

>> It also occurs to me that our performance for updates in the event of
>> many NULL values in indexes is probably really bad, and could be
>> helped by this. You'll want to test all of this with a workload that
>> isn't very sympathetic to HOT, of course -- most of these benefits are
>> seen when HOT doesn't help.
>
> I haven't really tested with an overabundance of NULLs; I'll add that
> to the tests. I have tested low-cardinality indexes, though, but only
> for correctness.

I think we'll have to model data distribution to evaluate the idea well. After all, if there is a uniform distribution of values anyway, you're going to end up with many dirty leaf pages. Whereas, in the more realistic case where updates are localized, we can hope to better amortize the cost of inserting index tuples, and recycling index tuples.

> What do you mean with "first class part"?
>
> It's not included in the scankey for a variety of reasons, not the
> least of which is not breaking the interface for current uses, and
> because the tid type is quite limited in its capabilities ATM. Also,
> the heap TID is usually there on most AM calls that care about it (ie:
> insertions have it, of course), so adding it to the scankey also
> seemed redundant.
I mean that insertions into indexes that are low cardinality should be largely guided by TID comparisons.

> If not adding to the scankey, what do you mean by "first class" then?

Just what I said about the sort order of an index on "(foo)" now precisely matching what we'd get for "(foo, ctid)".

There are a couple of tricky issues there that you'd have to look out for, like making sure that the high key continues to hold a real TID, which at a glance looks like something that just happens already. I'm not sure that we preserve the item pointer TID in all cases, since the current assumption is that a high key's TID is just filler -- maybe we can lose that at some point.

You should use amcheck to specifically verify that that happens reliably in all cases. Presumably, its use of an insertion scankey would automatically pick up the TID tie-breaker under patched Postgres, and so amcheck will work for this purpose unmodified.
--
Peter Geoghegan
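To make the "(foo, ctid)" ordering concrete, here is a minimal sketch of the tie-breaker comparison under discussion. This is not code from the patch: compare_foo() stands in for whatever opclass comparator applies, and only ItemPointerCompare() is existing PostgreSQL API.

#include "postgres.h"
#include "storage/itemptr.h"

extern int compare_foo(Datum a, Datum b);   /* assumed opclass comparator */

static int
compare_with_tid_tiebreak(Datum foo1, ItemPointer tid1,
                          Datum foo2, ItemPointer tid2)
{
    int     cmp = compare_foo(foo1, foo2);

    if (cmp != 0)
        return cmp;

    /*
     * Keys are equal: fall back to the heap TID.  This makes the ordering
     * total, so every entry has exactly one correct position in the tree.
     */
    return ItemPointerCompare(tid1, tid2);
}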
On Mon, Nov 21, 2016 at 5:42 PM, Peter Geoghegan <pg@heroku.com> wrote:
>>> Or, you might want to make sure that B-Tree tuple insertions still find that "first page" to check, despite still generally treating the heap item pointer as part of the key proper (in unique indexes, it can still behave like NULL, and not participate in uniqueness checking, while still essentially being part of the key/sort order).
>>
>> It behaves like that (even though it's not coded like that)
>
> Why not? What do you mean?
>
> What I'm interested in evaluating is an implementation where an index on (foo) has a sort order/key space that is precisely the same as an index on (foo, ctid), with zero exceptions.

Well, the patch does the same sorting, but without explicitly adding the ctid to the scankey.

When inserting into a unique index, the lookup for the insertion point proceeds as it would if the key were (foo, ctid), finding a page in the middle of the range that contains item pointers for foo. When that happens on a regular index, the insertion is done at that point and nothing else needs to be done. But when it happens on a unique index, the code first checks whether the first item pointer for foo is on the same page (presumably a frequent case). If so, the uniqueness check is done in a very straightforward and efficient manner. If not, the code retraces its steps up the tree to the first point where it followed a key other than foo (that's the only way it could land at a middle page), and then follows the downlinks using a truncated key (just foo, ignoring ctid).

Kind of what you describe, but without the explicit NULL.

>>> It also occurs to me that our performance for updates in the event of many NULL values in indexes is probably really bad, and could be helped by this. You'll want to test all of this with a workload that isn't very sympathetic to HOT, of course -- most of these benefits are seen when HOT doesn't help.
>>
>> I haven't really tested with an overabundance of NULLs; I'll add that to the tests. I have tested low-cardinality indexes, though only for correctness.
>
> I think we'll have to model data distribution to evaluate the idea well. After all, if there is a uniform distribution of values anyway, you're going to end up with many dirty leaf pages. Whereas, in the more realistic case where updates are localized, we can hope to better amortize the cost of inserting and recycling index tuples.

Quite possibly.

>> What do you mean by "first class part"?
>>
>> It's not included in the scankey for a variety of reasons, not the least of which is not breaking the interface for current uses, and because the tid type is quite limited in its capabilities ATM. Also, the heap TID is usually there on most AM calls that care about it (insertions have it, of course), so adding it to the scankey also seemed redundant.
>
> I mean that insertions into indexes that are low cardinality should be largely guided by TID comparisons.

...

>> If not adding to the scankey, what do you mean by "first class" then?
>
> Just what I said about the sort order of an index on "(foo)" now precisely matching what we'd get for "(foo, ctid)".

It is the case in both versions of the patch.

> There are a couple of tricky issues there that you'd have to look out for, like making sure that the high key continues to hold a real TID, which at a glance looks like something that just happens already.
> I'm not sure that we preserve the item pointer TID in all cases, since the current assumption is that a high key's TID is just filler -- maybe we can lose that at some point.

I thought long about that. Inner pages don't need a real TID in their keys, as those keys only specify keyspace cutoff points, not real tids; and high keys in leaf pages always carry real TIDs. I haven't had any issue with that, aside from the headaches that came from thinking about it for protracted periods of time.

> You should use amcheck to specifically verify that that happens reliably in all cases. Presumably, its use of an insertion scankey would automatically pick up the TID tie-breaker under patched Postgres, and so amcheck will work for this purpose unmodified.

It's totally on my roadmap. I'll have to adapt it for the new index format, though; it doesn't work without modification.
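The insertion-point walk Claudio describes above can be sketched roughly as follows. All three helper names are placeholders invented to mirror the prose; none of them exist in the patch, and only the surrounding types are real PostgreSQL API.

#include "postgres.h"
#include "storage/buf.h"
#include "utils/rel.h"

/* Hypothetical helpers, named only to mirror the description above. */
extern Buffer descend_with_tid(Relation index, Datum foo, ItemPointer tid);
extern bool first_equal_key_on_page(Buffer buf, Datum foo);
extern Buffer redescend_with_truncated_key(Relation index, Datum foo);

static Buffer
locate_insertion_point(Relation index, Datum foo, ItemPointer newtid,
                       bool is_unique)
{
    /* Descend as though the key were (foo, ctid). */
    Buffer      buf = descend_with_tid(index, foo, newtid);

    /* Non-unique index: insert right here, nothing else to check. */
    if (!is_unique)
        return buf;

    /* Presumed-frequent case: the run of equal keys starts on this page. */
    if (first_equal_key_on_page(buf, foo))
        return buf;             /* cheap, page-local uniqueness check */

    /*
     * Otherwise retrace the descent to the first level where a key other
     * than foo was followed (the only way to land on a middle page), and
     * re-descend comparing only the truncated key (foo, ignoring ctid)
     * to find where the run of equal keys begins.
     */
    return redescend_with_truncated_key(index, foo);
}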
On Mon, Nov 21, 2016 at 5:15 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> There are a couple of tricky issues there that you'd have to look out for, like making sure that the high key continues to hold a real TID, which at a glance looks like something that just happens already. I'm not sure that we preserve the item pointer TID in all cases, since the current assumption is that a high key's TID is just filler -- maybe we can lose that at some point.
>
> I thought long about that. Inner pages don't need a real TID in their keys, as those keys only specify keyspace cutoff points, not real tids; and high keys in leaf pages always carry real TIDs.

Not if there are duplicates in the inner pages. Then you have to add in the TID, and separate the key space, to guide insertions to the right leaf page directly (particularly for non-HOT UPDATEs). That's what I'm mostly interested in investigating here.
--
Peter Geoghegan
Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location
From: Peter Geoghegan
On Wed, Aug 17, 2016 at 7:54 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> The attached patch tries to maintain the initial status of B-Tree indexes, which are created with equal-key runs in physical order, during the whole life of the B-Tree, and make key-tid pairs efficiently searchable in the process.

I don't see an entry for this in the next CF. Do you have a plan for it?

BTW, I did post some remarks on this patch on another thread recently [1]. Not sure if any of what I said there is news to you at this point.

[1] postgr.es/m/CAH2-Wzn=6Lc3OVA88x=E6SKG72ojNUE6ut6RZAqNnQx-YLcw=Q@mail.gmail.com
--
Peter Geoghegan
Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location
From: Claudio Freire
On Fri, Jul 21, 2017 at 10:31 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Aug 17, 2016 at 7:54 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> The attached patch tries to maintain the initial status of B-Tree indexes, which are created with equal-key runs in physical order, during the whole life of the B-Tree, and make key-tid pairs efficiently searchable in the process.
>
> I don't see an entry for this in the next CF. Do you have a plan for it?
>
> BTW, I did post some remarks on this patch on another thread recently [1]. Not sure if any of what I said there is news to you at this point.
>
> [1] postgr.es/m/CAH2-Wzn=6Lc3OVA88x=E6SKG72ojNUE6ut6RZAqNnQx-YLcw=Q@mail.gmail.com

I plan to restart work on it soonish, but ATM most of my time is dedicated to the vacuum patch, which is almost done.
Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location
From: Claudio Freire
On Wed, Nov 23, 2016 at 12:27 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Mon, Nov 21, 2016 at 5:15 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>>> There are a couple of tricky issues there that you'd have to look out for, like making sure that the high key continues to hold a real TID, which at a glance looks like something that just happens already. I'm not sure that we preserve the item pointer TID in all cases, since the current assumption is that a high key's TID is just filler -- maybe we can lose that at some point.
>>
>> I thought long about that. Inner pages don't need a real TID in their keys, as those keys only specify keyspace cutoff points, not real tids; and high keys in leaf pages always carry real TIDs.
>
> Not if there are duplicates in the inner pages. Then you have to add in the TID, and separate the key space, to guide insertions to the right leaf page directly (particularly for non-HOT UPDATEs). That's what I'm mostly interested in investigating here.

My point was that the TID doesn't have to point to an actual tuple. It's more of a keyspace thing, so it doesn't need to match real tuples; it can just divide the keyspace with an arbitrary cutoff, and it should be cheaper to maintain without that requirement.
Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location
From: Peter Geoghegan
On Mon, Jul 24, 2017 at 9:51 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> My point was that the TID doesn't have to point to an actual tuple. It's more of a keyspace thing, so it doesn't need to match real tuples; it can just divide the keyspace with an arbitrary cutoff, and it should be cheaper to maintain without that requirement.

I agree, but unless you're using normalized keys, I don't see that you get much useful leeway from using fake or truncated TID values. Presumably the comparison logic will be based on comparing an ItemPointerData field, which is impractical to truncate.
--
Peter Geoghegan
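For reference, the struct in question, from src/include/storage/itemptr.h (abridged, with comments added here): a fixed six-byte value that is compared as a unit, which is why there is no cheap way to keep "part of" a TID the way suffix truncation drops whole attributes.

/* From src/include/storage/itemptr.h (abridged; comments added). */
typedef struct ItemPointerData
{
    BlockIdData     ip_blkid;   /* physical block number of the tuple */
    OffsetNumber    ip_posid;   /* line-pointer index within that block */
} ItemPointerData;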
Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location
From: Claudio Freire
On Mon, Jul 24, 2017 at 2:02 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Jul 24, 2017 at 9:51 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> My point was that the TID doesn't have to point to an actual tuple. It's more of a keyspace thing, so it doesn't need to match real tuples; it can just divide the keyspace with an arbitrary cutoff, and it should be cheaper to maintain without that requirement.
>
> I agree, but unless you're using normalized keys, I don't see that you get much useful leeway from using fake or truncated TID values. Presumably the comparison logic will be based on comparing an ItemPointerData field, which is impractical to truncate.

In most cases, the TID itself can be omitted when the key itself already differs.

When a split happens, a TID referring to a real tid from a child page will be added iff necessary.

This gives a lot of leeway in choosing the cutoff point, since a cutoff point is only added when necessary.

Vacuum *might* be able to redistribute tuples too, while holding locks on all three pages (the two children and the parent page), since it can move the TID to the midpoint of two actual child TIDs, mindlessly, without checking whether such a TID actually exists; it just needs to choose a TID cutoff point that will distribute item pointers evenly. I haven't tried this, but it is theoretically possible.
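An illustrative sketch of such a synthetic cutoff, assuming, purely for illustration, that a block-number midpoint is acceptable. make_separator_tid() is a made-up name; only the ItemPointer accessors are real PostgreSQL API, and a real implementation would have to guarantee left < separator <= right in all cases.

#include "postgres.h"
#include "storage/itemptr.h"
#include "storage/off.h"

static void
make_separator_tid(ItemPointer left, ItemPointer right, ItemPointer sep)
{
    BlockNumber lo = ItemPointerGetBlockNumber(left);
    BlockNumber hi = ItemPointerGetBlockNumber(right);

    if (hi > lo + 1)
    {
        /*
         * Plenty of room: any block strictly between the two will do, so
         * no tuple lookup is needed.  With FirstOffsetNumber, the result
         * sorts after everything in block lo and before everything in hi.
         */
        ItemPointerSet(sep, lo + (hi - lo) / 2, FirstOffsetNumber);
    }
    else
    {
        /* Adjacent (or equal) blocks: fall back to the real right TID. */
        ItemPointerCopy(right, sep);
    }
}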
Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location
From: Peter Geoghegan
On Mon, Jul 24, 2017 at 10:13 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> In most cases, the TID itself can be omitted when the key itself already differs.
>
> When a split happens, a TID referring to a real tid from a child page will be added iff necessary.
>
> This gives a lot of leeway in choosing the cutoff point, since a cutoff point is only added when necessary.

I agree with all that. That just sounds like a basic implementation of suffix truncation, to support making heap TID a part of the keyspace without paying a big cost in fan-in.

> Vacuum *might* be able to redistribute tuples too, while holding locks on all three pages (the two children and the parent page), since it can move the TID to the midpoint of two actual child TIDs, mindlessly, without checking whether such a TID actually exists; it just needs to choose a TID cutoff point that will distribute item pointers evenly. I haven't tried this, but it is theoretically possible.

I don't understand what you mean here. As long as the TIDs are a first class part of the keyspace, what VACUUM does or doesn't do should not matter. Page deletion stashes a TID in the highkey position of a leaf page within _bt_mark_page_halfdead(), but that shouldn't matter to you.

I guess you're talking about contriving a TID value that is the mean of two real TID values -- the two that are on each side of the split point during a leaf page split. While that seems possible, I don't see much value in it. Unless you have normalized keys, you can only really truncate whole attributes. And I think it's a bad idea to truncate anything other than the new high key for leaf pages, with or without normalized keys. Changing the keys once they're in internal pages is never going to be worth it.
--
Peter Geoghegan
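A hedged sketch of the whole-attribute truncation being described: at a leaf split, keep only as many leading attributes of the would-be high key as are needed to tell the last left-hand tuple apart from the first right-hand tuple. compare_indexed_attribute() is a placeholder for a per-attribute opclass comparison, not a real PostgreSQL function.

#include "postgres.h"
#include "access/itup.h"

/* Hypothetical per-attribute comparator; not part of PostgreSQL. */
extern int compare_indexed_attribute(IndexTuple a, IndexTuple b, int attnum);

static int
attrs_needed_in_high_key(IndexTuple lastleft, IndexTuple firstright,
                         int natts)
{
    int     keep;

    for (keep = 1; keep <= natts; keep++)
    {
        if (compare_indexed_attribute(lastleft, firstright, keep) != 0)
            return keep;    /* attributes 1..keep already separate them */
    }

    /*
     * All attributes are equal, so only the heap TID distinguishes the two
     * halves: the new high key must then keep a TID as part of the key.
     */
    return natts + 1;
}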
Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location
From: Claudio Freire
On Mon, Jul 24, 2017 at 2:43 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Jul 24, 2017 at 10:13 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> Vacuum *might* be able to redistribute tuples too, while holding locks on all three pages (the two children and the parent page), since it can move the TID to the midpoint of two actual child TIDs, mindlessly, without checking whether such a TID actually exists; it just needs to choose a TID cutoff point that will distribute item pointers evenly. I haven't tried this, but it is theoretically possible.
>
> I don't understand what you mean here. As long as the TIDs are a first class part of the keyspace, what VACUUM does or doesn't do should not matter. Page deletion stashes a TID in the highkey position of a leaf page within _bt_mark_page_halfdead(), but that shouldn't matter to you.
>
> I guess you're talking about contriving a TID value that is the mean of two real TID values -- the two that are on each side of the split point during a leaf page split. While that seems possible, I don't see much value in it. Unless you have normalized keys, you can only really truncate whole attributes. And I think it's a bad idea to truncate anything other than the new high key for leaf pages, with or without normalized keys. Changing the keys once they're in internal pages is never going to be worth it.

That's what I'm saying: it might not be worth it. I haven't tested it yet.