Thread: Making strxfrm() blobs in indexes work
On more occasions than I care to recall, someone has suggested that it would be valuable to do something with strxfrm() blobs in order to have cheaper locale-aware text comparisons. One obvious place to do so would be in indexes, but in the past that has been dismissed on the following grounds:

* Index-only scans need fully formed datums to work, and strxfrm() is a one-way function (or so I believe). There is no reason to think that the original string can be reassembled from the blob, so clearly that won't fly.

* There is a high cost to be paid in storage overhead. Even for collations like "en_US.UTF-8", the blob can be as much as 3-4 times larger than the original text string. Who is to say that we'll come out ahead even with the savings of just doing a strcmp() rather than a strcoll()?

So, even though using strxfrm() blobs can be much faster, it seems hard to find a place to use them. A compelling improvement in performance for index scans is so close, and yet so far. But having mulled it over during my recent work on various aspects of the B-Tree code, I believe that there is a way that we can cheat and come out ahead with strxfrm() blobs in indexes.

I have a sample dataset that I like to work off of: the database of the Mouse Genome Database, available from http://www.informatics.jax.org/. This is real data, from a real Postgres database, so I recommend working off one of their weekly dumps if you're looking for a real dataset that is big enough to exercise most interesting things, and yet not too big. It seems that there is an entire ecosystem of tools for medical researchers around the dataset, and they have the right attitude about opening it all up, which is pretty cool.
Anyway, I quickly threw together the following query, which shows the break-down of leaf pages, inner pages and root pages among the largest indexes, and the proportion of each per index:

[local] pg@mouse=# with tots as (
  select count(*) c, type, relname
  from (select relname, relpages, generate_series(1, relpages - 1) i
        from pg_class c
        join pg_namespace n on c.relnamespace = n.oid
        where relkind = 'i' and nspname = 'mgd') r,
  lateral (select * from bt_page_stats('mgd.' || relname, i)) u
  group by relname, type)
select tots.relname, relpages - 1 as non_meta_pages, c,
       c / sum(c) over (partition by tots.relname) as prop_of_index, type
from tots
join pg_class c on c.relname = tots.relname
order by 2 desc, 1, type;

           relname            | non_meta_pages |   c    |       prop_of_index        | type
------------------------------+----------------+--------+----------------------------+------
 acc_accession_0              |         106181 |    520 |     0.00489729801000178940 | i
 acc_accession_0              |         106181 | 105660 |     0.99509328410920974562 | l
 acc_accession_0              |         106181 |      1 | 0.000009417880788464979610 | r
 acc_accession_idx_accid      |         106181 |    520 |     0.00489729801000178940 | i
 acc_accession_idx_accid      |         106181 | 105660 |     0.99509328410920974562 | l
 acc_accession_idx_accid      |         106181 |      1 | 0.000009417880788464979610 | r
 mgi_notechunk_idx_note       |         101127 |   4817 |     0.04763317412758214918 | i
 mgi_notechunk_idx_note       |         101127 |  96309 |     0.95235693731644367973 | l
 mgi_notechunk_idx_note       |         101127 |      1 | 0.000009888555974171091795 | r
 acc_accession_1              |          80507 |    302 |     0.00375122660141354168 | i
 acc_accession_1              |          80507 |  80204 |     0.99623635211844932739 | l
 acc_accession_1              |          80507 |      1 | 0.000012421280137130932714 | r
 acc_accession_idx_prefixpart |          80507 |    302 |     0.00375122660141354168 | i
 acc_accession_idx_prefixpart |          80507 |  80204 |     0.99623635211844932739 | l
 acc_accession_idx_prefixpart |          80507 |      1 | 0.000012421280137130932714 | r

***SNIP***

To the surprise of no one, typically ~99.5% of B-Tree index pages are leaf pages.
I do see one index on a column of very large text strings (a "notes" column) that is only about 95% leaf pages. But the general picture is that over 99% of non-meta pages are leaf pages. Again, this is hardly surprising: that's more or less why B-Tree indexes work so well when partially cached.

I'm sure anyone that has read this far knows where I'm going with this: why can't we just have strxfrm() blobs in the inner pages, implying large savings for a big majority of text comparisons that service index scans, without bloating the indexes too badly, and without breaking anything? We only use inner pages to find leaf pages. They're redundant copies of the data within the index.

My reading of the page split code is that inserting a downlink into the parent post-split is totally naive of the details (including the size) of the pointed-to IndexTuple - _bt_insert_parent() does this:

    /* Recursively update the parent */
    _bt_insertonpg(rel, pbuf, stack->bts_parent,
                   new_item, stack->bts_offset + 1, is_only);

So ISTM that we could come up with an infrastructure, possibly just for insertion scanKeys (limiting the code footprint of all of this), in order to inner-page-process datums at this juncture, and store a blob instead, for later savings on walking the tree. I believe that this technique has applicability beyond strxfrm(), and a type-naive infrastructure could be developed to support it, with a degree of complexity not too far in excess of, say, the SortSupport infrastructure. The realization that we don't really need to recover the original information directly from the inner pages gives us scope to push things in several useful directions, I suspect.

Furthermore, because it takes a long time to fill inner pages, bloating them might not be that bad in larger indexes. My guess is that on the whole it would be well worth it. This is obviously pretty hand-wavy, but I thought it was still worth sharing while the basic idea was fresh in my mind.
Thoughts?
--
Peter Geoghegan
Peter Geoghegan <pg@heroku.com> writes:
> On more occasions than I care to recall, someone has suggested that it
> would be valuable to do something with strxfrm() blobs in order to
> have cheaper locale-aware text comparisons. One obvious place to do so
> would be in indexes, but in the past that has been dismissed on the
> following grounds:
> * Index-only scans need fully formed datums to work, and strxfrm() is
> a one-way function (or so I believe). There is no reason to think that
> the original string can be reassembled from the blob, so clearly that
> won't fly.
> * There is a high cost to be paid in storage overhead. Even for
> collations like "en_US.UTF-8", that can mean that the blob is as much
> as 3-4 times larger than the original text string. Who is to say that
> we'll come out ahead even with the savings of just doing a strcmp()
> rather than a strcoll()?

Quite aside from the index bloat risk, this effect means a 3-4x reduction in the maximum string length that can be indexed before getting the dreaded "Values larger than 1/3 of a buffer page cannot be indexed" error. Worse, a value insertion might well succeed, with the failure happening only (much?) later when that entry is chosen as a page split boundary.

It's possible that TOAST compression of the strings would save you, but I'm not convinced of that; it certainly doesn't seem like we could guarantee no post-insertion failures that way. Also, detoasting of strings that hadn't been long enough to need toasting before could easily eat any savings.

> I'm sure anyone that has read this far knows where I'm going with
> this: why can't we just have strxfrm() blobs in the inner pages,
> implying large savings for a big majority of text comparisons that
> service index scans, without bloating the indexes too badly, and
> without breaking anything? We only use inner pages to find leaf pages.
> They're redundant copies of the data within the index.
It's a cute idea though, and perhaps worth pursuing as long as you've got the pitfalls in mind.

			regards, tom lane
On Thu, Jan 30, 2014 at 4:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Quite aside from the index bloat risk, this effect means a 3-4x reduction
> in the maximum string length that can be indexed before getting the
> dreaded "Values larger than 1/3 of a buffer page cannot be indexed" error.
> Worse, a value insertion might well succeed, with the failure happening
> only (much?) later when that entry is chosen as a page split boundary.

That's not hard to prevent. If that should happen, we don't go with the strxfrm() datum. We have a spare IndexTuple bit we could use to mark when the optimization was applied. So we consider the appropriateness of a regular strcoll() or a strxfrm() in all contexts (in a generic and extensible manner, but that's essentially what we do). I'm not too discouraged by this restriction, because in practice it won't come up very often.

>> I'm sure anyone that has read this far knows where I'm going with
>> this: why can't we just have strxfrm() blobs in the inner pages,
>> implying large savings for a big majority of text comparisons that
>> service index scans, without bloating the indexes too badly, and
>> without breaking anything? We only use inner pages to find leaf pages.
>> They're redundant copies of the data within the index.
>
> It's a cute idea though, and perhaps worth pursuing as long as you've
> got the pitfalls in mind.

I'll think about pursuing it. I might prefer to declare it as fair game for anyone else that wants to do it.
--
Peter Geoghegan
Peter Geoghegan <pg@heroku.com> writes:
> On Thu, Jan 30, 2014 at 4:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Quite aside from the index bloat risk, this effect means a 3-4x reduction
>> in the maximum string length that can be indexed before getting the
>> dreaded "Values larger than 1/3 of a buffer page cannot be indexed" error.
>> Worse, a value insertion might well succeed, with the failure happening
>> only (much?) later when that entry is chosen as a page split boundary.

> That's not hard to prevent. If that should happen, we don't go with
> the strxfrm() datum. We have a spare IndexTuple bit we could use to
> mark when the optimization was applied.

You'd need a bit per column, no?

			regards, tom lane
On Thu, Jan 30, 2014 at 4:45 PM, Peter Geoghegan <pg@heroku.com> wrote:
> So we consider the
> appropriateness of a regular strcoll() or a strxfrm() in all contexts
> (in a generic and extensible manner, but that's essentially what we
> do). I'm not too discouraged by this restriction, because in practice
> it won't come up very often.

I meant: we consider the appropriateness of a strxfrm() + strcmp() against the pre-strxfrm()'d scanKey datum when the optimization is not in force, as against just a plain strcmp() when it is.
--
Peter Geoghegan
On Thu, Jan 30, 2014 at 5:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> That's not hard to prevent. If that should happen, we don't go with
>> the strxfrm() datum. We have a spare IndexTuple bit we could use to
>> mark when the optimization was applied.
>
> You'd need a bit per column, no?

I don't think so. It would be no big deal if it was all or nothing.
--
Peter Geoghegan
On Thu, Jan 30, 2014 at 3:49 PM, Peter Geoghegan <pg@heroku.com> wrote:
> So ISTM that we could come up with an infrastructure, possibly just
> for insertion scanKeys (limiting the code footprint of all of this) in
> order to inner-page-process datums at this juncture, and store a blob
> instead, for later savings on walking the tree. I believe that this
> technique has applicability beyond strxfrm(), and a type-naive
> infrastructure could be developed to support it, with a degree of
> complexity not too far in excess of, say, the SortSupport
> infrastructure. The realization that we don't really need to recover
> the original information directly from the inner pages gives us scope
> to push things in several useful directions, I suspect.

Techniques around B-Trees that are useful for common workloads may have problems around something that rhymes with "combatants", since it stands to reason that many were developed within industry, and some of the major vendors are known to be very combative. This whole area seems safe, though. After all, Singleton wrote in a 1969 paper, "ALGORITHM 347: AN EFFICIENT ALGORITHM FOR SORTING WITH MINIMAL STORAGE", that ACM members can download:

"This FORTRAN subroutine was tested on a CDC 6400 computer. For random uniform numbers, sorting times divided by n log^2 n were nearly constant at 20.2 X 10^-6 for 100 < n < 10,000, with a time of 0.202 seconds for 1000 items. This subroutine was also hand-compiled for the same computer to produce a more efficient machine code. In this version the constant of proportionality was 5.2 X 10^-6, with a time of 0.052 seconds for 1000 items. In both cases, integer comparisons were used to order normalized floating-point numbers."

This looks very much like prior art for our purposes. I'm pretty sure that the problem back then was that they didn't have dedicated floating-point units, so they had to do what a contemporary embedded systems engineer would call floating point emulation, at great expense.
Better to avoid more expensive comparisons when you can.
--
Peter Geoghegan
On Thu, Jan 30, 2014 at 5:05 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Jan 30, 2014 at 5:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> That's not hard to prevent. If that should happen, we don't go with
>>> the strxfrm() datum. We have a spare IndexTuple bit we could use to
>>> mark when the optimization was applied.
>>
>> You'd need a bit per column, no?
>
> I don't think so. It would be no big deal if it was all or nothing.

I've done some more digging. It turns out that the 1977 paper "An Encoding Method for Multifield Sorting and Indexing" describes a technique that involves concatenating multiple column values and comparing them using a simple strcmp(). Apparently this is something that was in System R back in the 1970s. Here is a description of that paper:

"Sequences of character strings with an order relation imposed between sequences are considered. An encoding scheme is described which produces a single, order-preserving string from a sequence of strings. The original sequence can be recovered from the encoded string, and one sequence of strings precedes another if and only if the encoding of the first precedes the encoding of the second. The strings may be variable length, without a maximum length restriction, and no symbols need be reserved for control purposes. Hence any symbol may occur in any string. The scheme is useful for multifield sorting, multifield indexing, and other applications where ordering on more than one field is important."

I think we should implement the scheme in this paper, for inner B-Tree pages. The paper describes how lexicographic sort order must be preserved, so it's pretty much identical to what I've described, except it doesn't say anything about inner pages, presumably because there was no Unicode back then, and I don't think that B+Trees/B-link trees had fully caught on yet.
We're already suffering from the fact that strcoll() mandates a NUL-terminated string, since we have to copy text datums to a buffer to deal with the impedance mismatch. Furthermore, multi-column comparisons probably have a lot of overhead at present from all the indirection alone.
--
Peter Geoghegan
On Thu, Jan 30, 2014 at 8:51 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I've done some more digging. It turns out that the 1977 paper "An
> Encoding Method for Multifield Sorting and Indexing" describes a
> technique that involves concatenating multiple column values and
> comparing them using a simple strcmp().

So I thought about it again, and realized that this probably can't be made to work given our present assumptions. Commit 656beff59 added the varstr_cmp() strcmp() tie-breaker. If we cannot trust strcoll() to indicate equality, why should we be able to trust strxfrm() + strcmp() to do so, without an equivalent tie-breaker on the original string? Now, I guess it might be good enough that the comparison in the inner page indicated "equivalence", provided the comparison on leaf pages indicated "equality proper". That seems like the kind of distinction that I'd rather not have to make, though. You could end up with two non-equal but equivalent tuples in an inner page. Seems pretty woolly to me.

However, it also occurs to me that strxfrm() blobs have another useful property: we (as, say, the author of an equality operator on text, an operator intended for a btree operator class) *can* trust a strcmp()'s result on blobs, provided the result isn't 0/equal, *even if the blobs are truncated*. So maybe a better scheme, and certainly a simpler one, would be to have a pseudo-attribute in inner pages only, with, say, the first 8 bytes of a strxfrm() blob formed from the logically-leading text attribute of the same IndexTuple. Because we're only doing this on inner pages, there is a very good chance that that will be good enough most of the time. This also allows us to reduce bloat very effectively.

As I touched on before, other schemes for other types do seem to suggest themselves.
We could support a pseudo leading attribute for numeric too, for example, where we use a 64-bit integer as a proxy for the whole number part (with the implementation having special handling of "out of range" values by means of an internal magic number - as always with the scheme, equality means "inconclusive, ask logically leading attribute"). That could win just by mostly avoiding the serialization overhead.

Now, the obvious counter-argument here is to point out that there are worst cases where none of this helps. I suspect that those are so rare in the real world, and the cost of doing all this is so low, that it's still likely to be well worth it for any given use-case at the macro level.
--
Peter Geoghegan
On Sun, Feb 02, 2014 at 05:09:06PM -0800, Peter Geoghegan wrote:
> However, it also occurs to me that strxfrm() blobs have another useful
> property: We (as, say, the author of an equality operator on text, an
> operator intended for a btree operator class) *can* trust a strcmp()'s
> result on blobs, provided the result isn't 0/equal, *even if the blobs
> are truncated*. So maybe a better scheme, and certainly a simpler one
> would be to have a pseudo-attribute in inner pages only with, say, the
> first 8 bytes of a strxfrm() blob formed from the logically-leading
> text attribute of the same indexTuple. Because we're only doing this
> on inner pages, there is a very good chance that that will be good
> enough most of the time. This also allows us to reduce bloat very
> effectively.

(A bit late to the party.) This idea has come up before, and the most annoying thing is that braindead strxfrm() API: namely, to strxfrm() a large string you need to transform it completely even if you only want the first 8 bytes. That said, since btree index tuples are limited to <3k anyway, the overhead probably isn't that bad. I think it would make a noticeable difference if it can be made to work.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts. -- Arthur Schopenhauer
On Wed, Feb 12, 2014 at 3:30 PM, Martijn van Oosterhout <kleptog@svana.org> wrote:
> (A bit late to the party.) This idea has come up before, and the most
> annoying thing is that braindead strxfrm() API: namely, to strxfrm()
> a large string you need to transform it completely even if you only
> want the first 8 bytes.

I think that in general strxfrm() must have that property. However, it does not obligate us to store the entire string, provided that we don't trust blob comparisons that indicate equality (since I believe that we cannot do so generally even with a non-truncated blob, we might as well take advantage of this, particularly given we're doing this with already naturally dissimilar inner pages, where we'll mostly get away with truncation provided there is a reliable tie-breaker).

Besides all this, I'm not particularly worried about the cost of calling strxfrm() if that only has to occur when a page split occurs, as we insert a downlink into the parent page. The big picture here is that we can exploit the properties of inner pages to do the smallest amount of work for the largest amount of benefit.
--
Peter Geoghegan