Thread: Making strxfrm() blobs in indexes work
On more occasions than I care to recall, someone has suggested that it would be valuable to do something with strxfrm() blobs in order to have cheaper locale-aware text comparisons. One obvious place to do so would be in indexes, but in the past that has been dismissed on the following grounds:

* Index-only scans need fully formed datums to work, and strxfrm() is a one-way function (or so I believe). There is no reason to think that the original string can be reassembled from the blob, so clearly that won't fly.

* There is a high cost to be paid in storage overhead. Even for collations like "en_US.UTF-8", the blob can be as much as 3-4 times larger than the original text string. Who is to say that we'll come out ahead even with the savings of just doing a strcmp() rather than a strcoll()?

So, even though using strxfrm() blobs can be much faster, it seems hard to find a place to use them. A compelling improvement in performance for index scans is so close, and yet so far. But having mulled it over during my recent work on various aspects of the B-Tree code, I believe that there is a way that we can cheat and come out ahead with strxfrm() blobs in indexes.

I have a sample dataset that I like to work off of: the database of the Mouse Genome Database, available from http://www.informatics.jax.org/. This is real data, from a real Postgres database, so I recommend working off one of their weekly dumps if you're looking for a real dataset that is big enough to exercise most interesting things, and yet not too big. It seems that there is an entire ecosystem of tools for medical researchers around the dataset, and they have the right attitude about opening it all up, which is pretty cool.
Anyway, I quickly threw together the following query, which shows the break-down of leaf pages, inner pages and root pages among the largest indexes, and the proportion of each per index:

[local] pg@mouse=# with tots as (
  select count(*) c, type, relname
  from (select relname, relpages, generate_series(1, relpages - 1) i
        from pg_class c
        join pg_namespace n on c.relnamespace = n.oid
        where relkind = 'i' and nspname = 'mgd') r,
  lateral (select * from bt_page_stats('mgd.' || relname, i)) u
  group by relname, type)
select tots.relname, relpages - 1 as non_meta_pages, c,
       c / sum(c) over (partition by tots.relname) as prop_of_index, type
from tots
join pg_class c on c.relname = tots.relname
order by 2 desc, 1, type;

           relname            | non_meta_pages |   c    |       prop_of_index        | type
------------------------------+----------------+--------+----------------------------+------
 acc_accession_0              |         106181 |    520 |     0.00489729801000178940 | i
 acc_accession_0              |         106181 | 105660 |     0.99509328410920974562 | l
 acc_accession_0              |         106181 |      1 | 0.000009417880788464979610 | r
 acc_accession_idx_accid      |         106181 |    520 |     0.00489729801000178940 | i
 acc_accession_idx_accid      |         106181 | 105660 |     0.99509328410920974562 | l
 acc_accession_idx_accid      |         106181 |      1 | 0.000009417880788464979610 | r
 mgi_notechunk_idx_note       |         101127 |   4817 |     0.04763317412758214918 | i
 mgi_notechunk_idx_note       |         101127 |  96309 |     0.95235693731644367973 | l
 mgi_notechunk_idx_note       |         101127 |      1 | 0.000009888555974171091795 | r
 acc_accession_1              |          80507 |    302 |     0.00375122660141354168 | i
 acc_accession_1              |          80507 |  80204 |     0.99623635211844932739 | l
 acc_accession_1              |          80507 |      1 | 0.000012421280137130932714 | r
 acc_accession_idx_prefixpart |          80507 |    302 |     0.00375122660141354168 | i
 acc_accession_idx_prefixpart |          80507 |  80204 |     0.99623635211844932739 | l
 acc_accession_idx_prefixpart |          80507 |      1 | 0.000012421280137130932714 | r

***SNIP***

To the surprise of no one, typically ~99.5% of B-Tree index pages are leaf pages.
I do see one index on a column of very large text strings (a "notes" column) that is only about 95% leaf pages. But the general picture is that over 99% of non-meta pages are leaf pages. Again, this is hardly surprising: that's more or less why B-Tree indexes work so well when partially cached.

I'm sure anyone that has read this far knows where I'm going with this: why can't we just have strxfrm() blobs in the inner pages, implying large savings for a big majority of text comparisons that service index scans, without bloating the indexes too badly, and without breaking anything? We only use inner pages to find leaf pages. They're redundant copies of the data within the index.

My reading of the page split code is that inserting a downlink into the parent post-split is totally naive of the details (including the size) of the pointed-to IndexTuple - _bt_insert_parent() does this:

    /* Recursively update the parent */
    _bt_insertonpg(rel, pbuf, stack->bts_parent,
                   new_item, stack->bts_offset + 1, is_only);

So ISTM that we could come up with an infrastructure, possibly just for insertion scanKeys (limiting the code footprint of all of this), in order to inner-page-process datums at this juncture, and store a blob instead, for later savings on walking the tree. I believe that this technique has applicability beyond strxfrm(), and a type-naive infrastructure could be developed to support it, with a degree of complexity not too far in excess of, say, the SortSupport infrastructure. The realization that we don't really need to recover the original information directly from the inner pages gives us scope to push things in several useful directions, I suspect.

Furthermore, because it takes a long time to fill inner pages, bloating them might not be that bad in larger indexes. My guess is that on the whole it would be well worth it. This is obviously pretty hand-wavy, but I thought it was still worth sharing while the basic idea was fresh in my mind.
Thoughts?
--
Peter Geoghegan
Peter Geoghegan <pg@heroku.com> writes:
> On more occasions than I care to recall, someone has suggested that it
> would be valuable to do something with strxfrm() blobs in order to
> have cheaper locale-aware text comparisons. One obvious place to do so
> would be in indexes, but in the past that has been dismissed on the
> following grounds:
> * Index-only scans need fully formed datums to work, and strxfrm() is
> a one-way function (or so I believe). There is no reason to think that
> the original string can be reassembled from the blob, so clearly that
> won't fly.
> * There is a high cost to be paid in storage overhead. Even for
> collations like "en_US.UTF-8", that can mean that the blob is as much
> as 3-4 times larger than the original text string. Who is to say that
> we'll come out ahead even with the savings of just doing a strcmp()
> rather than a strcoll()?

Quite aside from the index bloat risk, this effect means a 3-4x reduction in the maximum string length that can be indexed before getting the dreaded "Values larger than 1/3 of a buffer page cannot be indexed" error. Worse, a value insertion might well succeed, with the failure happening only (much?) later when that entry is chosen as a page split boundary.

It's possible that TOAST compression of the strings would save you, but I'm not convinced of that; it certainly doesn't seem like we could guarantee no post-insertion failures that way. Also, detoasting of strings that hadn't been long enough to need toasting before could easily eat any savings.

> I'm sure anyone that has read this far knows where I'm going with
> this: why can't we just have strxfrm() blobs in the inner pages,
> implying large savings for a big majority of text comparisons that
> service index scans, without bloating the indexes too badly, and
> without breaking anything? We only use inner pages to find leaf pages.
> They're redundant copies of the data within the index.
It's a cute idea though, and perhaps worth pursuing as long as you've got the pitfalls in mind.

			regards, tom lane
On Thu, Jan 30, 2014 at 4:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Quite aside from the index bloat risk, this effect means a 3-4x reduction
> in the maximum string length that can be indexed before getting the
> dreaded "Values larger than 1/3 of a buffer page cannot be indexed" error.
> Worse, a value insertion might well succeed, with the failure happening
> only (much?) later when that entry is chosen as a page split boundary.

That's not hard to prevent. If that should happen, we don't go with the strxfrm() datum. We have a spare IndexTuple bit we could use to mark when the optimization was applied. So we consider the appropriateness of a regular strcoll() or a strxfrm() in all contexts (in a generic and extensible manner, but that's essentially what we do). I'm not too discouraged by this restriction, because in practice it won't come up very often.

>> I'm sure anyone that has read this far knows where I'm going with
>> this: why can't we just have strxfrm() blobs in the inner pages,
>> implying large savings for a big majority of text comparisons that
>> service index scans, without bloating the indexes too badly, and
>> without breaking anything? We only use inner pages to find leaf pages.
>> They're redundant copies of the data within the index.
>
> It's a cute idea though, and perhaps worth pursuing as long as you've
> got the pitfalls in mind.

I'll think about pursuing it. I might prefer to declare it as fair game for anyone else that wants to do it.
--
Peter Geoghegan
Peter Geoghegan <pg@heroku.com> writes:
> On Thu, Jan 30, 2014 at 4:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Quite aside from the index bloat risk, this effect means a 3-4x reduction
>> in the maximum string length that can be indexed before getting the
>> dreaded "Values larger than 1/3 of a buffer page cannot be indexed" error.
>> Worse, a value insertion might well succeed, with the failure happening
>> only (much?) later when that entry is chosen as a page split boundary.

> That's not hard to prevent. If that should happen, we don't go with
> the strxfrm() datum. We have a spare IndexTuple bit we could use to
> mark when the optimization was applied.

You'd need a bit per column, no?

			regards, tom lane
On Thu, Jan 30, 2014 at 4:45 PM, Peter Geoghegan <pg@heroku.com> wrote:
> So we consider the
> appropriateness of a regular strcoll() or a strxfrm() in all contexts
> (in a generic and extensible manner, but that's essentially what we
> do). I'm not too discouraged by this restriction, because in practice
> it won't come up very often.

I meant: we consider the appropriateness of a strxfrm() + strcmp() against the pre-strxfrm()'d scanKey datum when the optimization is not in force, as against just a plain strcmp() when it is.
--
Peter Geoghegan
On Thu, Jan 30, 2014 at 5:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> That's not hard to prevent. If that should happen, we don't go with
>> the strxfrm() datum. We have a spare IndexTuple bit we could use to
>> mark when the optimization was applied.
>
> You'd need a bit per column, no?

I don't think so. It would be no big deal if it was all or nothing.
--
Peter Geoghegan
On Thu, Jan 30, 2014 at 3:49 PM, Peter Geoghegan <pg@heroku.com> wrote:
> So ISTM that we could come up with an infrastructure, possibly just
> for insertion scanKeys (limiting the code footprint of all of this) in
> order to inner-page-process datums at this juncture, and store a blob
> instead, for later savings on walking the tree. I believe that this
> technique has applicability beyond strxfrm(), and a type-naive
> infrastructure could be developed to support it, with a degree of
> complexity not too far in excess of, say, the SortSupport
> infrastructure. The realization that we don't really need to recover
> the original information directly from the inner pages gives us scope
> to push things in several useful directions, I suspect.

Techniques around B-Trees that are useful for common workloads may have problems around something that rhymes with "combatants", since it stands to reason that many were developed within industry, and some of the major vendors are known to be very combative. This whole area seems safe, though. After all, Singleton wrote in a 1969 paper, "ALGORITHM 347: AN EFFICIENT ALGORITHM FOR SORTING WITH MINIMAL STORAGE", that ACM members can download:

"This FORTRAN subroutine was tested on a CDC 6400 computer. For random uniform numbers, sorting times divided by n log^2 n were nearly constant at 20.2 X 10^-6 for 100 < n < 10,000, with a time of 0.202 seconds for 1000 items. This subroutine was also hand-compiled for the same computer to produce a more efficient machine code. In this version the constant of proportionality was 5.2 X 10^-6, with a time of 0.052 seconds for 1000 items. In both cases, integer comparisons were used to order normalized floating-point numbers."

This looks very much like prior art for our purposes. I'm pretty sure that the problem back then was that they didn't have dedicated floating-point units, so they had to do what a contemporary embedded systems engineer would call floating point emulation, at great expense.
Better to avoid more expensive comparisons when you can.
--
Peter Geoghegan
On Thu, Jan 30, 2014 at 5:05 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Jan 30, 2014 at 5:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> That's not hard to prevent. If that should happen, we don't go with
>>> the strxfrm() datum. We have a spare IndexTuple bit we could use to
>>> mark when the optimization was applied.
>>
>> You'd need a bit per column, no?
>
> I don't think so. It would be no big deal if it was all or nothing.

I've done some more digging. It turns out that the 1977 paper "An Encoding Method for Multifield Sorting and Indexing" describes a technique that involves concatenating multiple column values and comparing them using a simple strcmp(). Apparently this is something that was in System R back in the 1970s. Here is a description of that paper:

"Sequences of character strings with an order relation imposed between sequences are considered. An encoding scheme is described which produces a single, order-preserving string from a sequence of strings. The original sequence can be recovered from the encoded string, and one sequence of strings precedes another if and only if the encoding of the first precedes the encoding of the second. The strings may be variable length, without a maximum length restriction, and no symbols need be reserved for control purposes. Hence any symbol may occur in any string. The scheme is useful for multifield sorting, multifield indexing, and other applications where ordering on more than one field is important."

I think we should implement the scheme in this paper, for inner B-Tree pages. The paper describes how lexicographic sort order must be preserved, so it's pretty much identical to what I've described, except it doesn't say anything about inner pages, presumably because there was no Unicode back then, and I don't think that B+Trees/B-link trees had fully caught on yet.
We're already suffering from the fact that strcoll() mandates a NUL-terminated string, since we have to copy text datums to a buffer to deal with the impedance mismatch. Furthermore, multi-column comparisons probably have a lot of overhead at present from all the indirection alone.
--
Peter Geoghegan
On Thu, Jan 30, 2014 at 8:51 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I've done some more digging. It turns out that the 1977 paper "An
> Encoding Method for Multifield Sorting and Indexing" describes a
> technique that involves concatenating multiple column values and
> comparing them using a simple strcmp().

So I thought about it again, and realized that this probably can't be made to work given our present assumptions. Commit 656beff59 added the varstr_cmp() strcmp() tie-breaker. If we cannot trust strcoll() to indicate equality, why should we be able to trust strxfrm() + strcmp() to do so, without an equivalent tie-breaker on the original string? Now, I guess it might be good enough that the comparison in the inner page indicated "equivalence", provided the comparison on leaf pages indicated "equality proper". That seems like the kind of distinction that I'd rather not have to make, though. You could end up with two non-equal but equivalent tuples in an inner page. Seems pretty woolly to me.

However, it also occurs to me that strxfrm() blobs have another useful property: we (as, say, the author of an equality operator on text, an operator intended for a btree operator class) *can* trust a strcmp()'s result on blobs, provided the result isn't 0/equal, *even if the blobs are truncated*. So maybe a better scheme, and certainly a simpler one, would be to have a pseudo-attribute in inner pages only, with, say, the first 8 bytes of a strxfrm() blob formed from the logically-leading text attribute of the same IndexTuple. Because we're only doing this on inner pages, there is a very good chance that that will be good enough most of the time. This also allows us to reduce bloat very effectively.

As I touched on before, other schemes for other types do seem to suggest themselves.
We could support a pseudo leading attribute for numeric too, for example, where we use a 64-bit integer as a proxy for the whole number part (with the implementation having special handling of "out of range" values by means of an internal magic number - as always with the scheme, equality means "inconclusive, ask logically leading attribute"). That could win just by mostly avoiding the serialization overhead.

Now, the obvious counter-argument here is to point out that there are worst cases where none of this helps. I suspect that those are so rare in the real world, and the cost of doing all this is so low, that it's still likely to be well worth it for any given use-case at the macro level.
--
Peter Geoghegan
On Sun, Feb 02, 2014 at 05:09:06PM -0800, Peter Geoghegan wrote:
> However, it also occurs to me that strxfrm() blobs have another useful
> property: We (as, say, the author of an equality operator on text, an
> operator intended for a btree operator class) *can* trust a strcmp()'s
> result on blobs, provided the result isn't 0/equal, *even if the blobs
> are truncated*. So maybe a better scheme, and certainly a simpler one
> would be to have a pseudo-attribute in inner pages only with, say, the
> first 8 bytes of a strxfrm() blob formed from the logically-leading
> text attribute of the same indexTuple. Because we're only doing this
> on inner pages, there is a very good chance that that will be good
> enough most of the time. This also allows us to reduce bloat very
> effectively.

(A bit late to the party.) This idea has come up before, and the most annoying thing is that braindead strxfrm() API: namely, to strxfrm() a large string you need to transform it completely even if you only want the first 8 bytes. That said, since btree index tuples are limited to <3k anyway, the overhead probably isn't that bad. I think it would make a noticeable difference if it can be made to work.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts. -- Arthur Schopenhauer
On Wed, Feb 12, 2014 at 3:30 PM, Martijn van Oosterhout <kleptog@svana.org> wrote:
> (A bit late to the party.) This idea has come up before, and the most
> annoying thing is that braindead strxfrm() API: namely, to strxfrm()
> a large string you need to transform it completely even if you only
> want the first 8 bytes.

I think that in general strxfrm() must have that property. However, it does not obligate us to store the entire string, provided that we don't trust blob comparisons that indicate equality (since I believe that we cannot do so generally even with a non-truncated blob, we might as well take advantage of this, particularly given we're doing this with already naturally dissimilar inner pages, where we'll mostly get away with truncation provided there is a reliable tie-breaker).

Besides all this, I'm not particularly worried about the cost of calling strxfrm() if that only has to occur when a page split occurs, as we insert a downlink into the parent page. The big picture here is that we can exploit the properties of inner pages to do the smallest amount of work for the largest amount of benefit.
--
Peter Geoghegan