From: Peter Geoghegan
Subject: Adding skip scan (including MDAM style range skip scan) to nbtree
Msg-id: CAH2-Wzmn1YsLzOGgjAQZdn1STSG_y8qP__vggTaPAYXJP+G4bw@mail.gmail.com
List: pgsql-hackers

Attached is a POC patch that adds skip scan to nbtree. The patch teaches nbtree index scans to efficiently use a composite index on '(a, b)' for queries with a predicate such as "WHERE b = 5". This is feasible in cases where the total number of distinct values in the column 'a' is reasonably small (think tens or hundreds, perhaps even thousands for very large composite indexes).

In effect, a skip scan treats this composite index on '(a, b)' as if it were a series of subindexes -- one subindex per distinct value in 'a'. We can exhaustively "search every subindex" using an index qual that behaves just like "WHERE a = ANY(<every possible 'a' value>) AND b = 5" would behave. This approach might be much less efficient than an index scan that can use an index on 'b' alone, but skip scanning can still be orders of magnitude faster than a sequential scan. The user may very well not have a dedicated index on 'b' alone, for whatever reason.

Note that the patch doesn't just target these simpler "leading index column omitted from the predicate" cases. It's far more general than that -- skipped attributes (or what the patch refers to as skip arrays) can be freely mixed with SAOPs/conventional arrays, in any order you can think of. They can also be combined with inequalities to form range skip arrays.

This patch is a direct follow-up to the Postgres 17 work that became commit 5bf748b8. Making everything work well together is an important design goal here. I'll talk more about that further down, and will show a benchmark example query that'll give a good general sense of the value of the patch with these more complicated cases.
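
To make the basic case concrete, here's a minimal sketch (the toy table and names here are mine, not the patch's):

    CREATE TABLE t (a int, b int);
    CREATE INDEX t_a_b_idx ON t (a, b);

    -- With the patch, this query can use t_a_b_idx, despite not constraining "a":
    SELECT * FROM t WHERE b = 5;

    -- nbtree executes it as though the qual had been written as:
    -- SELECT * FROM t WHERE a = ANY (<every possible value of "a">) AND b = 5;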
"Skip scan" might be considered one out of the several optimizations from the 1995 paper "Efficient Search of Multidimensional B-Trees" [1] -- the paper describes skip scan under its "Missing Key Predicate" subsection. I collectively refer to the optimizations from the paper as the "MDAM techniques". Alternatively, you could define these MDAM techniques as each implementing some particular flavor of skip scan, since they all do rather similar things under the hood. In fact, that's how I've chosen to describe things in my patch: it talks about skip scan, and about range skip scan, which is considered a minor variant of skip scan. (Note that the term "skip scan" is never used in the MDAM paper.) MDAM is short for "multidimensional access method". In the context of the paper, "dimension" refers to dimensions in a decision support system. These dimensions are represented by low cardinality columns, each of which appear in a large composite B-Tree index. The emphasis in the paper (and for my patch) is DSS and data warehousing; OLTP apps typically won't benefit as much. Note: Loose index scan *isn't* described by the paper at all. I also wouldn't classify loose index scan as one of the MDAM techniques. I think of it as being in a totally different category, due to the way that it applies semantic information. No MDAM technique will ever apply high-level semantic information about what is truly required by the plan tree, one level up. And so my patch simply teaches nbtree to find the most efficient way of navigating through an index, based solely on information that is readily available to the scan. The same principles apply to all of the other MDAM techniques; they're basically all just another flavor of skip scan (that do some kind of clever preprocessing/transformation that enables reducing the scan to a series of disjunctive accesses, and that could be implemented using the new abstraction I'm calling skip arrays). The paper more or less just applies one core idea, again and again. It's surprising how far that one idea can take you. But it is still just one core idea (don't overlook that point). Range skip scan --------------- To me, the most interesting MDAM technique is probably one that I refer to as "range skip scan" in the patch. This is the technique that the paper introduces first, in its "Intervening Range Predicates" subsection. The best way of explaining it is through an example (you could also just read the paper, which has an example of its own). Imagine a table with just one index: a composite index on "(pdate, customer_id)". Further suppose we have a query such as: SELECT * FROM payments WHERE pdate BETWEEN '2024-01-01' AND '2024-01-30' AND customer_id = 5; -- both index columns (pdate and customer_id) appear in predicate The patch effectively makes the nbtree code execute the index scan as if the query had been written like this instead: SELECT * FROM payments WHERE pdate = ANY ('2024-01-01', '2024-01-02', ..., '2024-01-30') AND customer_id = 5; The use of a "range skip array" within nbtree allows the scan to skip when that makes sense, locating the next date with customer_id = 5 each time (we might skip over many irrelevant leaf pages each time). The scan must also *avoid* skipping when it *doesn't* make sense. As always (since commit 5bf748b8 went in), whether and to what extent we skip using array keys depends in large part on the physical characteristics of the index at runtime. 

The scan must also *avoid* skipping when it *doesn't* make sense. As always (since commit 5bf748b8 went in), whether and to what extent we skip using array keys depends in large part on the physical characteristics of the index at runtime. If the tuples that we need to return are all clustered closely together, across only a handful of leaf pages, then we shouldn't be skipping at all. When skipping does make sense, we should skip constantly. I'll discuss the trade-offs in this area a little more below, under "Design".

Using multiple MDAM techniques within the same index scan (includes benchmark)
------------------------------------------------------------------------------

I recreated the data in the MDAM paper's "sales" table by making inferences from the paper. It's very roughly the same data set as the paper's (close enough to get the general idea across). The table size is about 51GB, and the index is about 25GB (most of the attributes from the table are used as index columns). There is nothing special about this data set -- I just thought it would be cool to "recreate" the queries from the paper, as best I could, and that this approach might make my points about the design easier to follow.

The index we'll be using for this can be created via:

    create index mdam_idx on sales_mdam_paper(dept, sdate, item_class, store);

Again, this is per the paper. It's also the order that the columns appear in every WHERE clause in every query from the paper. (That said, the particular column order from the index definition mostly doesn't matter. Every index column is a low cardinality column, so unless the order used completely obviates the need to skip a column that would otherwise need to be skipped, such as "dept", the effect on query execution time from varying column order is in the noise. Obviously that's very much not how users are used to thinking about composite indexes.)

The MDAM paper has numerous example queries, each of which builds on the last, adding one more complication each time -- each of which is addressed by another MDAM technique. The query I'll focus on here is an example query that's towards the end of the paper, and so combines multiple techniques together -- it's the query that appears in the "IN Lists" subsection:

    select dept, sdate, item_class, store,
      sum(total_sales)
    from sales_mdam_paper
    where
      -- omitted: leading "dept" column from composite index
      sdate between '1995-06-01' and '1995-06-30'
      and item_class in (20, 35, 50)
      and store in (200, 250)
    group by dept, sdate, item_class, store
    order by dept, sdate, item_class, store;

On HEAD, when we run this query we either get a sequential scan (which is very slow) or a full index scan (which is almost as slow). With the patch, nbtree will execute the query as a succession of a few thousand very selective primitive index scans (which usually only scan one leaf page, though some may scan two neighboring leaf pages).

Results: The full index scan on HEAD takes about 32 seconds. With the patch, the query takes just under 52ms to execute. That works out to be about 630x faster with the patch. See the attached SQL file for full details. It provides all you'll need to recreate this test result with the patch.
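
Spelled out in the style of the earlier payments rewrite, the patch effectively executes this query's qual as follows (the dept and sdate values are enumerated by the skip arrays at runtime -- nothing like this is ever generated by the planner):

    where
      dept = any (<every distinct dept in the index>)
      and sdate = any ('1995-06-01', '1995-06-02', ..., '1995-06-30')
      and item_class in (20, 35, 50)
      and store in (200, 250)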

Nobody would put up with such an inefficient full index scan in the first place, so the behavior on HEAD is not really a sensible baseline -- 630x isn't very meaningful. I could have come up with a case that showed an even larger improvement if I'd felt like it, but that wouldn't have proven anything. The important point is that the patch makes a huge composite index like the one I've built for this actually make sense, for the first time. So we're not so much making something faster as enabling a whole new approach to indexing -- particularly for data warehousing use cases. The way that Postgres DBAs choose which indexes they'll need to create is likely to be significantly changed by this optimization.

I'll break down how this is possible. This query makes use of 3 separate MDAM techniques:

1. A "simple" skip scan (on "dept").

2. A "range" skip scan (on "sdate").

3. The pair of IN() lists/SAOPs on item_class and on store. (Nothing new here, except that nbtree needs these regular SAOP arrays to roll over the higher-order skip arrays, to trigger moving on to the next dept/date.)

Internally, we're just doing a series of several thousand distinct, non-overlapping accesses, in index key space order (so as to preserve the appearance of one continuous index scan). These accesses start out like this:

    dept=INT_MIN, date='1995-06-01', item_class=20, store=200
    (Here _bt_advance_array_keys discovers that the actual lowest dept is 1, not INT_MIN)
    dept=1, date='1995-06-01', item_class=20, store=200
    dept=1, date='1995-06-01', item_class=20, store=250
    dept=1, date='1995-06-01', item_class=35, store=200
    dept=1, date='1995-06-01', item_class=35, store=250
    ...

(Side note: as I mentioned, the two "store" values usually appear together on the same leaf page in practice. Arguably I should have shown 2 lines/accesses here (for "dept=1"), rather than showing 4. The 4 "dept=1" lines shown required only 2 primitive index scans/index descents/leaf page reads. Disjunctive accesses don't necessarily map 1:1 with primitive/physical index scans.)

About another ten thousand similar accesses occur (omitted for brevity). Execution of the scan within nbtree finally ends with these primitive index scans/accesses:

    ...
    dept=100, date='1995-06-30', item_class=50, store=250
    dept=101, date='1995-06-01', item_class=20, store=200
    STOP

There is no "dept=101" entry in the index (the highest department in the index happens to be 100). The index scan therefore terminates at this point, having run out of leaf pages to scan (we've reached the rightmost point of the rightmost leaf page, as the scan attempts to locate non-existent dept=101 tuples).

Design
======

Since index scans with skip arrays work just like index scans with regular arrays (as of Postgres 17), there are, naturally, no special restrictions. Associated optimizer index paths have path keys, and so could (just for example) appear in a merge join, or feed into a group aggregate, while avoiding a sort node. Index scans that skip could also feed into a relocatable cursor. As I mentioned already, the skipping mechanism that the patch adds is purely an additive thing. I think that this will turn out to be an important enabler of using these optimizations, even when there's much uncertainty about how much they'll actually help at runtime.
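
As a sketch of what that buys us, consider the earlier payments example again. A scan that skips on pdate still returns tuples in (pdate, customer_id) order, so one would expect a query like the following to be able to feed a group aggregate with no sort step (my example, not one from the patch):

    SELECT pdate, count(*)
    FROM payments
    WHERE customer_id = 5
    GROUP BY pdate
    ORDER BY pdate;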

Optimizer
---------

We make a broad assumption that skipping is always to our advantage during nbtree preprocessing -- preprocessing generates as many skip arrays as could possibly make sense, based on static rules (rules that don't apply any kind of information about data distribution). Of course, skipping isn't necessarily the best approach in all cases, but that's okay. We only actually skip when physical index characteristics show that it makes sense. The real decisions about skipping are all made dynamically. That approach seems far more practicable than preempting the problem during planning or during nbtree preprocessing. It seems like it'd be very hard to model the costs statistically. We need revisions to btcostestimate, of course, but the less we rely on btcostestimate the better. As I said, there are no new index paths generated by the optimizer for any of this.

What do you call an index scan where 90% of all index tuples are 1 of only 3 distinct values, while the remaining 10% of index tuples are all perfectly unique in respect of a leading column? Clearly the best strategy when skipping using the leading column is to use skip scan for 90% of the index, and a conventional range scan for the remaining 10%. Skipping generally makes sense, but we legitimately need to vary our strategy *during the same index scan*. It makes no sense to think of skip scan as a discrete sort of index scan.

I have yet to prove that always having the option of skipping (even when it's very unlikely to help) really does "come for free" -- for now I'm just asserting that that's possible. I'll need proof. I expect to hear some principled skepticism on this point. It's probably not quite there in this v1 of the patch -- there'll be some regressions (I haven't looked very carefully just yet). However, we seem to already be quite close to avoiding regressions from excessive/useless skipping.

Extensible infrastructure/support functions
-------------------------------------------

Currently, the patch only supports skip scan for a subset of all opclasses -- those that have the required support function #6, or "skip support" function. This provides the opclass with (among other things) a way to increment the current skip array value (or to decrement it, in the case of backward scans). In practice we only have this for a handful of discrete integer (and integer-ish) types. Note that the patch currently cannot skip over an index column that happens to be text. Note also that even this v1 supports skip scans that involve unsupported types, provided that the opclasses of the specific columns we'll need to skip have support.

The patch should be able to support every type/opclass as a target for skipping, regardless of whether an opclass support function happens to be available. That could work by teaching the nbtree code to explicitly probe the index for the next skip array value, only then combining that new value with the qual from the input scan keys/query. I've put that off for now because it seems less important -- it doesn't really affect anything I've said about the core design, which is what I'm focusing on for now.

It makes sense to use increment/decrement whenever feasible, even though it isn't strictly necessary (or won't be, once the patch has the required explicit probe support). The only reason not to apply increment/decrement opclass skip support (that I can see) is that it just isn't practical (this is generally the case for continuous types). While it's slightly onerous to have to invent all this new opclass infrastructure, it definitely makes sense. There is a performance advantage to having skip arrays that can increment through each distinct possible indexable value (this increment/decrement stuff comes from the MDAM paper). The MDAM techniques inherently work best when "skipping" columns of discrete types like integer and date, which is why the paper's examples all look like that. If you look at my example query and its individual accesses, you'll see why this is so.
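
To illustrate the discrete-type point (nothing patch-specific here, just the underlying idea):

    -- Discrete types have a cheap, well-defined successor, so a skip array can
    -- step straight to the next candidate value rather than probing the index
    -- for it:
    SELECT 100 + 1 AS next_dept,                  -- integer: trivial successor
           '1995-06-30'::date + 1 AS next_sdate;  -- date: also discrete
    -- Continuous types (numeric, text, ...) have no such successor, which is
    -- why they'd need the explicit probe strategy described above.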

Thoughts?

[1] https://vldb.org/conf/1995/P710.PDF

--
Peter Geoghegan