Thread: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
I've been working on a variety of improvements to nbtree's native ScalarArrayOpExpr execution. This builds on Tom's work in commit 9e8da0f7.

Attached patch is still at the prototype stage. I'm posting it as v1 a little earlier than I usually would because there has been much back and forth about it on a couple of other threads involving Tomas Vondra and Jeff Davis -- seems like it would be easier to discuss with working code available.

The patch adds two closely related enhancements to ScalarArrayOp execution by nbtree:

1. Execution of quals with ScalarArrayOpExpr clauses during nbtree index scans (for equality-strategy SK_SEARCHARRAY scan keys) can now "advance the scan's array keys locally", which sometimes avoids significant amounts of unneeded pinning/locking of the same set of index pages.

SAOP index scans become capable of eliding primitive index scans for the next set of array keys in line in cases where it isn't truly necessary to descend the B-Tree again. Index scans are now capable of "sticking with the existing leaf page for now" when it is determined that the end of the current set of array keys is physically close to the start of the next set of array keys (the next set in line to be materialized by the _bt_advance_array_keys state machine). This is often possible.

Naturally, we still prefer to advance the array keys in the traditional way ("globally") much of the time. That means we'll perform another _bt_first/_bt_search descent of the index, starting a new primitive index scan. Whether we try to skip pages on the leaf level or stick with the current primitive index scan (by advancing array keys locally) is likely to vary a great deal. Even during the same index scan. Everything is decided dynamically, which is the only approach that really makes sense.

This optimization can significantly lower the number of buffers pinned and locked in cases with significant locality, and/or with many array keys with no matches. The savings (when measured in buffers pinned/locked) can be as high as 10x, 100x, or even more. Benchmarking has shown that variants of "pgbench -S" designed to stress the implementation (hundreds of array constants) can have up to 5.5x higher transaction throughput under concurrent load with the patch. Less extreme cases (10 array constants, spaced apart) see about a 20% improvement in throughput. There are similar improvements to latency for the patch, in each case.

2. The optimizer now produces index paths with multiple SAOP clauses (or other clauses we can safely treat as "equality constraints") on each of the leading columns from a composite index -- all while preserving index ordering/useful pathkeys in most cases.

The nbtree work from item 1 is useful even with the simplest IN() list query involving a scan of a single column index. Obviously, it's very inefficient for the nbtree code to use 100 primitive index scans when 1 is sufficient. But that's not really why I'm pursuing this project. My real goal is to implement (or to enable the implementation of) a whole family of useful techniques for multi-column indexes. I call these "MDAM techniques", after the 1995 paper "Efficient Search of Multidimensional B-Trees" [1][2] -- MDAM is short for "multidimensional access method". In the context of the paper, "dimension" refers to dimensions in a decision support system.
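To give a minimal sense of what "the next set of array keys in line" means in practice, here is a sketch (the index is hypothetical; the real test cases appear below):

-- assume a composite index on tenk1 (two, four)
select count(*) from tenk1 where two in (0, 1) and four in (1, 2);

-- The _bt_advance_array_keys state machine steps through the cartesian
-- product of the arrays: (two, four) = (0, 1), (0, 2), (1, 1), (1, 2).
-- On master, each of those four combinations gets its own primitive index
-- scan (another _bt_first/_bt_search descent). With the patch, the scan
-- can often advance the array keys locally instead, and simply continue
-- on the current leaf page when the next combination is physically close by.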
The most compelling cases for the patch all involve multiple index columns with multiple SAOP clauses (especially where each column represents a separate "dimension", in the DSS sense). It's important that index sort order be preserved whenever possible, too. Sometimes this is directly useful (e.g., because the query has an ORDER BY), but it's always indirectly needed, on the nbtree side (when the optimizations are applicable at all). The new nbtree code now has special requirements surrounding SAOP search type scan keys with composite indexes. These requirements make changes in the optimizer all but essential.

Index order
===========

As I said, there are cases where preserving index order is immediately and obviously useful, in and of itself. Let's start there. Here's a test case that you can run against the regression test database:

pg@regression:5432 =# create index order_by_saop on tenk1(two,four,twenty);
CREATE INDEX

pg@regression:5432 =# EXPLAIN (ANALYZE, BUFFERS)
select ctid, thousand from tenk1
where two in (0,1) and four in (1,2) and twenty in (1,2)
order by two, four, twenty limit 20;

With the patch, this query gets 13 buffer hits. On the master branch, it gets 1377 buffer hits -- which exceeds the number you'll get from a sequential scan by about 4x. No coaxing was required to get the planner to produce this plan on the master branch.

Almost all of the savings shown here are related to heap page buffer hits -- the nbtree changes don't directly help in this particular example (strictly speaking, you only need the optimizer changes to get this result). Obviously, the immediate reason why the patch wins by so much is because it produces a plan that allows the LIMIT to terminate the scan far sooner.

Benoit Tigeot (CC'd) happened to run into this issue organically -- that was also due to heap hits, a LIMIT, and so on. As luck would have it, I stumbled upon his problem report (in the Postgres slack channel) while I was working on this patch. He produced a fairly complete test case, which was helpful [3]. This example is more or less just a distillation of his test case, designed to be easy for a Postgres hacker to try out for themselves.

There are also variants of this query where a LIMIT isn't the crucial factor, and where index page hits are the problem. This query uses an index-only scan, both on master and with the patch (same index as before):

select count(*), two, four, twenty from tenk1
where two in (0, 1)
and four in (1, 2, 3, 4)
and twenty in (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
group by two, four, twenty
order by two, four, twenty;

The patch gets 18 buffer hits for this query. That outcome makes intuitive sense, since this query is highly unselective -- it's approaching the selectivity of the query "select count(*) from tenk1". The simple count(*) query gets 19 buffer hits for its own index-only scan, confirming that the patch managed to skip only one or two leaf pages in the complicated "group by" variant of the count(*) query. Overall, the GroupAggregate plan used by the patch is slower than the simple count(*) case (despite touching fewer pages). But both plans have *approximately* the same execution cost, which makes sense, since they both have very similar selectivities.

The master branch gets 245 buffer hits for the same group by query. This is almost as many hits as a sequential scan would require -- even though there are precisely zero heap accesses needed by the underlying index-only scan.
As with the first example, no planner coaxing was required to get this outcome on master. It is inherently very difficult to predict how selective a query like this will be using conventional statistics. But that's not actually the problem in this example -- the planner gets that part right, on this occasion. The real problem is that there is a multiplicative factor to worry about on master, when executing multiple SAOPs. That makes it almost impossible to predict the number of pages we'll pin. While with the patch, scans with multiple SAOPs are often fairly similar to scans that happen to just have one on the leading column.

With the patch, it is simply impossible for an SAOP index scan to visit any single leaf page more than once. Just like a conventional index scan. Whereas right now, on master, using more than one SAOP clause for a multi-column index seems to me to be a wildly risky proposition. You can easily have cases that work just fine on master, while only slight variations of the same query see costs explode (especially likely with a LIMIT). ISTM that there is significant value in having a pretty accurate idea of the worst case in the planner. Giving nbtree the ability to skip or not skip dynamically, based on actual conditions in the index (not on statistics), seems like it has a lot of potential as a way of improving performance *stability*. Personally I'm most interested in this aspect of the project.

Note: we can visit internal pages more than once, but that seems to make a negligible difference to the overall cost profile of scans. Our policy is to not charge an I/O cost for those pages. Plus, the number of internal page accesses is dramatically reduced (it's just not guaranteed that there won't be any repeat accesses for internal pages, is all).

Note also: there are hard-to-pin-down interactions between the immediate problem on the nbtree side, and the use of filter quals rather than true index quals, where the use of index quals is possible in principle. Some problematic cases see excessive amounts of heap page hits only (as with my first example query). Other problematic cases see excessive amounts of index page hits, with little to no impact on heap page hits at all (as with my second example query). Some combination of the two is also possible.

Safety
======

As mentioned already, the ability to "advance the current set of array keys locally" during a scan (the nbtree work in item 1) actually relies on the optimizer work in item 2 -- it's not just a question of unlocking the potential of the nbtree work. Now I'll discuss those aspects in a bit more detail.

Without the optimizer work, nbtree will produce wrong answers to queries, in a way that resembles the complaint addressed by historical bugfix commit 807a40c5. This incorrect behavior (if the optimizer were to permit it) would only be seen when there are multiple arrays/columns, and an inequality on a leading column -- just like with that historical bug. (It works both ways, though -- the nbtree changes also make the optimizer changes safe by limiting the worst case, which would otherwise be too much of a risk to countenance. You can't separate one from the other.)

The primary change on the optimizer side is the addition of logic to differentiate between the following two cases when building an index path in indxpath.c:

* Unsafe: Cases where it's fundamentally unsafe to treat multi-column-with-SAOP-clause index paths as returning tuples in a useful sort order. For example, the test case committed as part of that bugfix involves an inequality, so it continues to be treated as unsafe (the query appears just after this list).

* Safe: Cases where (at least in theory) bugfix commit 807a40c5 went further than it really had to. Those cases get to use the optimization, and usually get to have useful pathkeys.
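An example of the unsafe kind of case (this is the query from the regression test added by that bugfix):

explain (costs off) SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
ORDER BY thousand;

Each primitive index scan (one per "tenthous" array key) returns its own rows in "thousand" order, but the concatenation of the two scans generally doesn't, so the path can't be treated as having useful pathkeys here.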
My optimizer changes are very kludgey. I came up with various ad-hoc rules to distinguish between the safe and unsafe cases, without ever really placing those changes into some kind of larger framework. That was enough to validate the general approach in nbtree, but it certainly has problems -- glaring problems.

The biggest problem of all may be my whole "safe vs unsafe" framing itself. I know that many of the ostensibly unsafe cases are in fact safe (with the right infrastructure in place), because the MDAM paper says just that. The optimizer can't support inequalities right now, but the paper describes how to support "NOT IN ( )" lists -- clearly an inequality! The current ad-hoc rules are at best incomplete, and at worst are addressing the problem in fundamentally the wrong way.

CNF -> DNF conversion
=====================

Like many great papers, the MDAM paper takes one core idea, and finds ways to leverage it to the hilt. Here the core idea is to take predicates in conjunctive normal form (an "AND of ORs"), and convert them into disjunctive normal form (an "OR of ANDs"). DNF quals are logically equivalent to CNF quals, but ideally suited to SAOP-array style processing by an ordered B-Tree index scan -- they reduce everything to a series of non-overlapping primitive index scans that can be processed in keyspace order. We already do this today in the case of SAOPs, in effect. The nbtree "next array keys" state machine already materializes values that can be seen as MDAM style DNF single value predicates. The state machine works by outputting the cartesian product of the arrays as a multi-column index is scanned, but that could be taken a lot further in the future. We can use essentially the same kind of state machine to do everything described in the paper -- ultimately, it just needs to output a list of disjuncts, like the DNF clauses that the paper shows in "Table 3".

In theory, anything can be supported via a sufficiently complete CNF -> DNF conversion framework. There will likely always be the potential for unsafe/unsupported clauses and/or types in an extensible system like Postgres, though. So we will probably need to retain some notion of safety. It seems like more of a job for nbtree preprocessing (or some suitably index-AM-agnostic version of the same idea) than the optimizer, in any case. But that's not entirely true, either (that would be far too easy).

The optimizer still needs to optimize. It can't very well do that without having some kind of advance notice of what is and is not supported by the index AM. And, the index AM cannot just unilaterally decide that index quals actually should be treated as filter/qpquals, after all -- it doesn't get a veto. So there is a mutual dependency that needs to be resolved. I suspect that there needs to be a two-way conversation between the optimizer and nbtree code to break the dependency -- a callback that does some of the preprocessing work during planning. Tom said something along the same lines in passing, when discussing the MDAM paper last year [2]. Much work remains here.
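As a minimal illustration of the CNF -> DNF idea, using the same tenk1 columns as the examples above (this only shows the shape of the rewrite, not how any real preprocessing would be implemented):

-- CNF, an "AND of ORs", as the query is written:
where two in (0, 1) and four in (1, 2)

-- Logically equivalent DNF, an "OR of ANDs" -- one disjunct per distinct
-- combination of array keys, in keyspace order for an index on (two, four):
where (two = 0 and four = 1)
   or (two = 0 and four = 2)
   or (two = 1 and four = 1)
   or (two = 1 and four = 2)

Each disjunct is the kind of "single value predicate" that the state machine already materializes today.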
Skip Scan
=========

MDAM encompasses something that people tend to call "skip scan" -- terminology with a great deal of baggage. These days I prefer to call it "filling in missing key predicates", per the paper. That's much more descriptive, and makes it less likely that people will conflate the techniques with InnoDB style "loose index scans" -- the latter is a much more specialized/targeted optimization. (I now believe that these are very different things, though I was thrown off by the superficial similarities for a long time. It's pretty confusing.)

I see this work as a key enabler of "filling in missing key predicates". MDAM describes how to implement this technique by applying the same principles that it applies everywhere else: it proposes a scheme that converts predicates from CNF to DNF, with just a little extra logic required to do index probes to feed the DNF-generating state machine, on demand.

More concretely, in Postgres terms: skip scan can be implemented by inventing a new placeholder clause that can be composed alongside ScalarArrayOpExprs, in the same way that multiple ScalarArrayOpExprs can be composed together in the patch already. I'm thinking of a type of clause that makes the nbtree code materialize a set of "array keys" for a SK_SEARCHARRAY scan key dynamically, via ad-hoc index probes (perhaps static approaches would be better for types like boolean, which the paper contemplates). It should be possible to teach the _bt_advance_array_keys state machine to generate those values in approximately the same fashion as it already does for ScalarArrayOpExprs -- and, it shouldn't be too hard to do it in a localized fashion, allowing everything else to continue to work in the same way without any special concern. This separation of concerns is a nice consequence of the way that the MDAM design really leverages preprocessing/DNF for everything. Both types of clauses can be treated as part of a general class of ScalarArrayOpExpr-like clauses. Making the rules around "composability" simple will be important.

Although skip scan gets a lot of attention, it's not necessarily the most compelling MDAM technique. It's also not especially challenging to implement on top of everything else. It really isn't that special.

Right now I'm focussed on the big picture, in any case. I want to emphasize the very general nature of these techniques. Although I'm focussed on SAOPs in the short term, many queries that don't make use of SAOPs should ultimately see similar benefits. For example, the paper also describes transformations that apply to BETWEEN/range predicates. We might end up needing a third type of expression for those. They're all just DNF single value predicates, under the hood.

Thoughts?

[1] http://vldb.org/conf/1995/P710.PDF
[2] https://www.postgresql.org/message-id/2587523.1647982549@sss.pgh.pa.us
[3] https://gist.github.com/benoittgt/ab72dc4cfedea2a0c6a5ee809d16e04d

--
Peter Geoghegan
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Matthias van de Meent
Date:
On Tue, 25 Jul 2023 at 03:34, Peter Geoghegan <pg@bowt.ie> wrote:
>
> I've been working on a variety of improvements to nbtree's native ScalarArrayOpExpr execution. This builds on Tom's work in commit 9e8da0f7.

Cool.

> Attached patch is still at the prototype stage. I'm posting it as v1 a little earlier than I usually would because there has been much back and forth about it on a couple of other threads involving Tomas Vondra and Jeff Davis -- seems like it would be easier to discuss with working code available.
>
> The patch adds two closely related enhancements to ScalarArrayOp execution by nbtree:
>
> 1. Execution of quals with ScalarArrayOpExpr clauses during nbtree index scans (for equality-strategy SK_SEARCHARRAY scan keys) can now "advance the scan's array keys locally", which sometimes avoids significant amounts of unneeded pinning/locking of the same set of index pages.
>
> SAOP index scans become capable of eliding primitive index scans for the next set of array keys in line in cases where it isn't truly necessary to descend the B-Tree again. Index scans are now capable of "sticking with the existing leaf page for now" when it is determined that the end of the current set of array keys is physically close to the start of the next set of array keys (the next set in line to be materialized by the _bt_advance_array_keys state machine). This is often possible.
>
> Naturally, we still prefer to advance the array keys in the traditional way ("globally") much of the time. That means we'll perform another _bt_first/_bt_search descent of the index, starting a new primitive index scan. Whether we try to skip pages on the leaf level or stick with the current primitive index scan (by advancing array keys locally) is likely to vary a great deal. Even during the same index scan. Everything is decided dynamically, which is the only approach that really makes sense.
>
> This optimization can significantly lower the number of buffers pinned and locked in cases with significant locality, and/or with many array keys with no matches. The savings (when measured in buffers pinned/locked) can be as high as 10x, 100x, or even more. Benchmarking has shown that variants of "pgbench -S" designed to stress the implementation (hundreds of array constants) can have up to 5.5x higher transaction throughput under concurrent load with the patch. Less extreme cases (10 array constants, spaced apart) see about a 20% improvement in throughput. There are similar improvements to latency for the patch, in each case.

Considering that it caches/reuses the page across SAOP operations, can (or does) this also improve performance for index scans on the outer side of a join if the order of join columns matches the order of the index?

That is, I believe this caches (leaf) pages across scan keys, but can (or does) it also reuse these already-cached leaf pages across restarts of the index scan/across multiple index lookups in the same plan node, so that retrieval of nearby index values does not need to do an index traversal?

> [...]
> Skip Scan
> =========
>
> MDAM encompasses something that people tend to call "skip scan" -- terminology with a great deal of baggage. These days I prefer to call it "filling in missing key predicates", per the paper.
> That's much more descriptive, and makes it less likely that people will conflate the techniques with InnoDB style "loose index scans" -- the latter is a much more specialized/targeted optimization. (I now believe that these are very different things, though I was thrown off by the superficial similarities for a long time. It's pretty confusing.)

I'm not sure I understand. MDAM seems to work on an index level to return full ranges of values, while "skip scan" seems to try to allow systems to signal to the index to skip to some other index condition based on arbitrary cutoffs. This would usually be based on information that is not stored in the index, such as "SELECT user_id FROM orders GROUP BY user_id HAVING COUNT(*) > 10", where the scan would go through the user_id index and skip to the next user_id value when it gets enough rows of a matching result (where "enough" is determined above the index AM's plan node, or otherwise is impossible to determine with only the scan key info in the index AM). I'm not sure how this could work without specifically adding skip scan-related index AM functionality, and I don't see how it fits in with this MDAM/SAOP system.

> [...]
>
> Thoughts?

MDAM seems to require exponential storage for "scan key operations" for conditions on N columns (to be precise, the product of the number of distinct conditions on each column); e.g. an index on mytable (a,b,c,d,e,f,g,h) with conditions "a IN (1, 2) AND b IN (1, 2) AND ... AND h IN (1, 2)" would require 2^8 entries. If 4 conditions were used for each column, that'd be 4^8, etc... With an index column limit of 32, that's quite a lot of memory potentially needed to execute the statement.

So, this begs the question: does this patch have the same issue? Does it fail with OOM, does it gracefully fall back to the old behaviour when the clauses are too complex to linearize/compose/fold into the btree ordering clauses, or are scan keys dynamically constructed using just-in-time or generator patterns?

Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Wed, Jul 26, 2023 at 5:29 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> Considering that it caches/reuses the page across SAOP operations, can (or does) this also improve performance for index scans on the outer side of a join if the order of join columns matches the order of the index?

It doesn't really cache leaf pages at all. What it does is advance the array keys locally, while the original buffer lock is still held on that same page.

> That is, I believe this caches (leaf) pages across scan keys, but can (or does) it also reuse these already-cached leaf pages across restarts of the index scan/across multiple index lookups in the same plan node, so that retrieval of nearby index values does not need to do an index traversal?

I'm not sure what you mean. There is no reason why you need to do more than one single descent of an index to scan many leaf pages using many distinct sets of array keys. Obviously, this depends on being able to observe that we really don't need to redescend the index to advance the array keys, again and again. Note in particular that this usually works across leaf pages.

> I'm not sure I understand. MDAM seems to work on an index level to return full ranges of values, while "skip scan" seems to try to allow systems to signal to the index to skip to some other index condition based on arbitrary cutoffs. This would usually be based on information that is not stored in the index, such as "SELECT user_id FROM orders GROUP BY user_id HAVING COUNT(*) > 10", where the scan would go through the user_id index and skip to the next user_id value when it gets enough rows of a matching result (where "enough" is determined above the index AM's plan node, or otherwise is impossible to determine with only the scan key info in the index AM). I'm not sure how this could work without specifically adding skip scan-related index AM functionality, and I don't see how it fits in with this MDAM/SAOP system.

I think of that as being quite a different thing.

Basically, the patch that added that feature had to revise the index AM API, in order to support a mode of operation where scans return groupings rather than tuples. Whereas this patch requires none of that. It makes affected index scans as similar as possible to conventional index scans.

> > [...]
> >
> > Thoughts?
>
> MDAM seems to require exponential storage for "scan key operations" for conditions on N columns (to be precise, the product of the number of distinct conditions on each column); e.g. an index on mytable (a,b,c,d,e,f,g,h) with conditions "a IN (1, 2) AND b IN (1, 2) AND ... AND h IN (1, 2)" would require 2^8 entries.

Note that I haven't actually changed anything about the way that the state machine generates new sets of single value predicates -- it's still just cycling through each distinct set of array keys in the patch.

What you describe is a problem in theory, but I doubt that it's a problem in practice. You don't actually have to materialize the predicates up-front, or at all. Plus you can skip over them using the next index tuple. So skipping works both ways.

--
Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Matthias van de Meent
Date:
On Wed, 26 Jul 2023 at 15:42, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, Jul 26, 2023 at 5:29 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> > Considering that it caches/reuses the page across SAOP operations, can (or does) this also improve performance for index scans on the outer side of a join if the order of join columns matches the order of the index?
>
> It doesn't really cache leaf pages at all. What it does is advance the array keys locally, while the original buffer lock is still held on that same page.

Hmm, then I had a mistaken understanding of what we do in _bt_readpage with _bt_saveitem.

> > That is, I believe this caches (leaf) pages across scan keys, but can (or does) it also reuse these already-cached leaf pages across restarts of the index scan/across multiple index lookups in the same plan node, so that retrieval of nearby index values does not need to do an index traversal?
>
> I'm not sure what you mean. There is no reason why you need to do more than one single descent of an index to scan many leaf pages using many distinct sets of array keys. Obviously, this depends on being able to observe that we really don't need to redescend the index to advance the array keys, again and again. Note in particular that this usually works across leaf pages.

In a NestedLoop(inner=seqscan, outer=indexscan), the index gets repeatedly scanned from the root, right? It seems that right now, we copy matching index entries into a local cache (that is deleted on amrescan), then we drop our locks and pins on the buffer, and then start returning values from our local cache (in _bt_saveitem).

We could cache the last accessed leaf page across amrescan operations to reduce the number of index traversals needed when the join key of the left side is highly (but not necessarily strictly) correlated. The worst case overhead of this would be 2 _bt_compares (to check if the value is supposed to be fully located on the cached leaf page) plus one memcpy( , , BLCKSZ) in the previous loop. With some smart heuristics (e.g. page fill factor, number of distinct values, and whether we previously hit this same leaf page in the previous scan of this Node) we can probably also reduce this overhead to a minimum if the joined keys are not correlated, but accelerate the query significantly when we find out they are correlated.

Of course, in the cases where we'd expect very few distinct join keys the planner would likely put a Memoize node above the index scan, but for mostly unique join keys I think this could save significant amounts of time, if only on buffer pinning and locking.

I guess I'll try to code something up when I have the time, as it sounds not quite exactly related to your patch but an interesting improvement nonetheless.

Kind regards,

Matthias van de Meent
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Wed, Jul 26, 2023 at 12:07 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> We could cache the last accessed leaf page across amrescan operations to reduce the number of index traversals needed when the join key of the left side is highly (but not necessarily strictly) correlated.

That sounds like block nested loop join. It's possible that that could reuse some infrastructure from this patch, but I'm not sure.

In general, SAOP execution/MDAM performs "duplicate elimination before it reads the data" by sorting and deduplicating the arrays up front. While my patch sometimes elides a primitive index scan, primitive index scans are already disjuncts that are combined to create what can be considered one big index scan (that's how the planner and executor think of them). The patch takes that one step further by recognizing that it could quite literally be one big index scan in some cases (or fewer, larger scans, at least). It's a natural incremental improvement, as opposed to inventing a new kind of index scan. If anything the patch makes SAOP execution more similar to traditional index scans, especially when costing them.

Like InnoDB style loose index scan (for DISTINCT and GROUP BY optimization), block nested loop join would require inventing a new type of index scan. Both of these other two optimizations involve the use of semantic information that spans multiple levels of abstraction. Loose scan requires duplicate elimination (that's the whole point), while IIUC block nested loop join needs to "simulate multiple inner index scans" by deliberately returning duplicates for each would-be inner index scan. These are specialized things.

To be clear, I think that all of these ideas are reasonable. I just find it useful to classify these sorts of techniques according to whether or not the index AM API would have to change, and the general nature of any required changes. MDAM can do a lot of cool things without requiring any revisions to the index AM API, which should allow it to play nice with everything else (index path clause safety issues notwithstanding).

--
Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Matthias van de Meent
Date:
On Wed, 26 Jul 2023 at 15:42, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, Jul 26, 2023 at 5:29 AM Matthias van de Meent
> > I'm not sure I understand. MDAM seems to work on an index level to return full ranges of values, while "skip scan" seems to try to allow systems to signal to the index to skip to some other index condition based on arbitrary cutoffs. This would usually be based on information that is not stored in the index, such as "SELECT user_id FROM orders GROUP BY user_id HAVING COUNT(*) > 10", where the scan would go through the user_id index and skip to the next user_id value when it gets enough rows of a matching result (where "enough" is determined above the index AM's plan node, or otherwise is impossible to determine with only the scan key info in the index AM). I'm not sure how this could work without specifically adding skip scan-related index AM functionality, and I don't see how it fits in with this MDAM/SAOP system.
>
> I think of that as being quite a different thing.
>
> Basically, the patch that added that feature had to revise the index AM API, in order to support a mode of operation where scans return groupings rather than tuples. Whereas this patch requires none of that. It makes affected index scans as similar as possible to conventional index scans.

Hmm, yes. I see now where my confusion started. You called it out in your first paragraph of the original mail, too, but that didn't help me then: The wiki does not distinguish "Index Skip Scans" and "Loose Index Scans", but these are not the same.

In the one page on "Loose indexscan", it refers to MySQL's "loose index scan" documentation, which does handle groupings, and this was targeted with the previous, mislabeled, "Index skipscan" patchset. However, crucially, it also refers to other databases' Index Skip Scan documentation, which documents and implements this approach of 'skipping to the next potential key range to get efficient non-prefix qual results', giving me a false impression that those two features are one and the same when they are not.

It seems like I'll have to wait a bit longer for the functionality of Loose Index Scans.

> > > [...]
> > >
> > > Thoughts?
> >
> > MDAM seems to require exponential storage for "scan key operations" for conditions on N columns (to be precise, the product of the number of distinct conditions on each column); e.g. an index on mytable (a,b,c,d,e,f,g,h) with conditions "a IN (1, 2) AND b IN (1, 2) AND ... AND h IN (1, 2)" would require 2^8 entries.
>
> Note that I haven't actually changed anything about the way that the state machine generates new sets of single value predicates -- it's still just cycling through each distinct set of array keys in the patch.
>
> What you describe is a problem in theory, but I doubt that it's a problem in practice. You don't actually have to materialize the predicates up-front, or at all.

Yes, that's why I asked: The MDAM paper's examples seem to materialize the full predicate up-front, which would require a product of all indexed columns' quals in size, so that materialization has a good chance to get really, really large. But if we're not doing that materialization upfront, then there is no issue with resource consumption (except CPU time, which can likely be improved with other methods).

Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Matthias van de Meent
Date:
On Thu, 27 Jul 2023 at 06:14, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, Jul 26, 2023 at 12:07 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> > We could cache the last accessed leaf page across amrescan operations to reduce the number of index traversals needed when the join key of the left side is highly (but not necessarily strictly) correlated.
>
> That sounds like block nested loop join. It's possible that that could reuse some infrastructure from this patch, but I'm not sure.

My idea is not quite block nested loop join. It's more 'restart the index scan at the location the previous index scan ended, if heuristics say there's a good chance that might save us time'. I'd say it is comparable to the fast tree descent optimization that we have for endpoint queries, and comparable to this patch's scankey optimization, but across AM-level rescans instead of internal rescans. See also the attached prototype and loosely coded patch. It passes tests, but it might not be without bugs.

The basic design of that patch is this: We keep track of how many times we've rescanned, and the end location of the index scan. If a new index scan hits the same page after _bt_search as the previous scan ended, we register that. Those two values - num_rescans and num_samepage - are used as heuristics for the following: If 50% or more of rescans hit the same page as the end location of the previous scan, we start saving the scan's end location's buffer into the BTScanOpaque, so that the next _bt_first can check whether that page might be the right leaf page, and if so, immediately go to that buffer instead of descending the tree - saving one tree descent in the process.

Further optimizations of this mechanism could easily be implemented by e.g. only copying the min/max index tuples instead of the full index page, reducing the overhead at scan end.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Thu, Jul 27, 2023 at 7:59 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> > Basically, the patch that added that feature had to revise the index AM API, in order to support a mode of operation where scans return groupings rather than tuples. Whereas this patch requires none of that. It makes affected index scans as similar as possible to conventional index scans.
>
> Hmm, yes. I see now where my confusion started. You called it out in your first paragraph of the original mail, too, but that didn't help me then:
>
> The wiki does not distinguish "Index Skip Scans" and "Loose Index Scans", but these are not the same.

A lot of people (myself included) were confused on this point for quite a while. To make matters even more confusing, one of the really compelling cases for the MDAM design is scans that feed into GroupAggregates -- preserving index sort order for naturally big index scans will tend to enable it. One of my examples from the start of this thread showed just that. (It just so happened that that example was faster because of all the "skipping" that nbtree *wasn't* doing with the patch.)

> Yes, that's why I asked: The MDAM paper's examples seem to materialize the full predicate up-front, which would require a product of all indexed columns' quals in size, so that materialization has a good chance to get really, really large. But if we're not doing that materialization upfront, then there is no issue with resource consumption (except CPU time, which can likely be improved with other methods)

I get why you asked. I might have asked the same question.

As I said, the MDAM paper has *surprisingly* little to say about B-Tree executor stuff -- it's almost all just describing the preprocessing/transformation process. It seems as if optimizations like the one from my patch were considered too obvious to talk about and/or out of scope by the authors. Thinking about the MDAM paper like that was what made everything fall into place for me. Remember, "missing key predicates" isn't all that special.

--
Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Matthias van de Meent
Date:
On Thu, 27 Jul 2023 at 16:01, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Thu, Jul 27, 2023 at 7:59 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> > > Basically, the patch that added that feature had to revise the index AM API, in order to support a mode of operation where scans return groupings rather than tuples. Whereas this patch requires none of that. It makes affected index scans as similar as possible to conventional index scans.
> >
> > Hmm, yes. I see now where my confusion started. You called it out in your first paragraph of the original mail, too, but that didn't help me then:
> >
> > The wiki does not distinguish "Index Skip Scans" and "Loose Index Scans", but these are not the same.
>
> A lot of people (myself included) were confused on this point for quite a while.

I've taken the liberty of updating the "Loose indexscan" wiki page [0], adding detail that Loose indexscans are distinct from Skip scans, and showing some high-level distinguishing properties. I also split the TODO entry for `` "loose" or "skip" scan `` into two, and added links to the relevant recent threads so that it's clear these are different (and that some previous efforts may have had a confusing name).

I hope this will reduce the chance of future confusion between the two different approaches to improving index scan performance.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0]: https://wiki.postgresql.org/wiki/Loose_indexscan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Alena Rybakina
Date:
Hi, all!

> CNF -> DNF conversion
> =====================
>
> Like many great papers, the MDAM paper takes one core idea, and finds ways to leverage it to the hilt. Here the core idea is to take predicates in conjunctive normal form (an "AND of ORs"), and convert them into disjunctive normal form (an "OR of ANDs"). DNF quals are logically equivalent to CNF quals, but ideally suited to SAOP-array style processing by an ordered B-Tree index scan -- they reduce everything to a series of non-overlapping primitive index scans that can be processed in keyspace order. We already do this today in the case of SAOPs, in effect. The nbtree "next array keys" state machine already materializes values that can be seen as MDAM style DNF single value predicates. The state machine works by outputting the cartesian product of the arrays as a multi-column index is scanned, but that could be taken a lot further in the future. We can use essentially the same kind of state machine to do everything described in the paper -- ultimately, it just needs to output a list of disjuncts, like the DNF clauses that the paper shows in "Table 3".
>
> In theory, anything can be supported via a sufficiently complete CNF -> DNF conversion framework. There will likely always be the potential for unsafe/unsupported clauses and/or types in an extensible system like Postgres, though. So we will probably need to retain some notion of safety. It seems like more of a job for nbtree preprocessing (or some suitably index-AM-agnostic version of the same idea) than the optimizer, in any case. But that's not entirely true, either (that would be far too easy).
>
> The optimizer still needs to optimize. It can't very well do that without having some kind of advance notice of what is and is not supported by the index AM. And, the index AM cannot just unilaterally decide that index quals actually should be treated as filter/qpquals, after all -- it doesn't get a veto. So there is a mutual dependency that needs to be resolved. I suspect that there needs to be a two-way conversation between the optimizer and nbtree code to break the dependency -- a callback that does some of the preprocessing work during planning. Tom said something along the same lines in passing, when discussing the MDAM paper last year [2]. Much work remains here.

Honestly, I'm just reading and delving into this thread and other topics related to it, so excuse me if I ask you a few obvious questions.

I noticed that you are going to add the CNF->DNF transformation at the index path construction stage. If I understand correctly, you will rewrite the restrictinfo node, changing boolean "AND" expressions to "OR" expressions, but would it be possible to apply such a procedure earlier? Otherwise I suppose you could face the problem of an incorrect selectivity calculation and, consequently, an incorrect cardinality calculation? I can't clearly understand at what stage it becomes clear that such a transformation needs to be applied?

--
Regards,
Alena Rybakina
Postgres Professional
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Thu, Jul 27, 2023 at 10:00 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> My idea is not quite block nested loop join. It's more 'restart the index scan at the location the previous index scan ended, if heuristics say there's a good chance that might save us time'. I'd say it is comparable to the fast tree descent optimization that we have for endpoint queries, and comparable to this patch's scankey optimization, but across AM-level rescans instead of internal rescans.

Yeah, I see what you mean. Seems related, even though what you've shown in your prototype patch doesn't seem like it fits into my taxonomy very neatly.

(BTW, I was a little confused by the use of the term "endpoint" at first, since there is a function that uses that term to refer to a descent of the tree that happens without any insertion scan key. This path is used whenever the best we can do in _bt_first is to descend to the rightmost or leftmost page.)

> The basic design of that patch is this: We keep track of how many times we've rescanned, and the end location of the index scan. If a new index scan hits the same page after _bt_search as the previous scan ended, we register that.

I can see one advantage that block nested loop join would retain here: it does block-based accesses on both sides of the join. Since it "looks ahead" on both sides of the join, more repeat accesses are likely to be avoided. Not too sure how much that matters in practice, though.

--
Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Mon, Jul 31, 2023 at 12:24 PM Alena Rybakina <lena.ribackina@yandex.ru> wrote:
> I noticed that you are going to add the CNF->DNF transformation at the index path construction stage. If I understand correctly, you will rewrite the restrictinfo node, changing boolean "AND" expressions to "OR" expressions, but would it be possible to apply such a procedure earlier?

Sort of. I haven't really added any new CNF->DNF transformations. The code you're talking about is really just checking that every index path has clauses that we know nbtree can handle. That's a big, ugly modularity violation -- many of these details are quite specific to the nbtree index AM (in theory we could have other index AMs that are amsearcharray).

At most, v1 of the patch makes greater use of an existing transformation that takes place in the nbtree index AM, as it preprocesses scan keys for these types of queries (it's not inventing new transformations at all). This is a slightly creative interpretation, too. Tom's commit 9e8da0f7 didn't actually say anything about CNF/DNF.

> Otherwise I suppose you could face the problem of an incorrect selectivity calculation and, consequently, an incorrect cardinality calculation?

I can't think of any reason why that should happen as a direct result of what I have done here. Multi-column index paths + multiple SAOP clauses are not a new thing. The number of rows returned does not depend on whether we have some columns as filter quals or not.

Of course that doesn't mean that the costing has no problems. The costing definitely has several problems right now. It also isn't necessarily okay that it's "just as good as before" if it turns out that it needs to be better now. But I don't see why it would be. (Actually, my hope is that selectivity estimation might be *less* important as a practical matter with the patch.)

> I can't clearly understand at what stage it becomes clear that such a transformation needs to be applied?

I don't know either. I think that most of this work needs to take place in the nbtree code, during preprocessing. But it's not so simple. There is a mutual dependency between the code that generates index paths in the planner and nbtree scan key preprocessing. The planner needs to know what kinds of index paths are possible/safe up-front, so that it can choose the fastest plan (the fastest that the index AM knows how to execute correctly). But, there are lots of small annoying nbtree implementation details that might matter, and can change.

I think we need to have nbtree register a callback, so that the planner can initialize some preprocessing early. I think that we require a "two-way conversation" between the planner and the index AM.

--
Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Wed, Jul 26, 2023 at 6:41 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > MDAM seems to require exponential storage for "scan key operations" for conditions on N columns (to be precise, the product of the number of distinct conditions on each column); e.g. an index on mytable (a,b,c,d,e,f,g,h) with conditions "a IN (1, 2) AND b IN (1, 2) AND ... AND h IN (1, 2)" would require 2^8 entries.
>
> What you describe is a problem in theory, but I doubt that it's a problem in practice. You don't actually have to materialize the predicates up-front, or at all. Plus you can skip over them using the next index tuple. So skipping works both ways.

Attached is v2, which makes all array key advancement take place using the "next index tuple" approach (using binary searches to find array keys using index tuple values). This approach was necessary for fairly mundane reasons (it limits the amount of work required while holding a buffer lock), but it also solves quite a few other problems that I find far more interesting.

It's easy to imagine the state machine from v2 of the patch being extended for skip scan. My approach "abstracts away" the arrays. For skip scan, it would more or less behave as if the user had written a query "WHERE a in (<Every possible value for this column>) AND b = 5 ..." -- without actually knowing what the so-called array keys for the high-order skipped column are (not up front, at least). We'd only need to track the current "array key" for the scan key on the skipped column, "a". The state machine would notice when the scan had reached the next-greatest "a" value in the index (whatever that might be), and then make that the current value. Finally, the state machine would effectively instruct its caller to consider repositioning the scan via a new descent of the index. In other words, almost everything for skip scan would work just like regular SAOPs -- and any differences would be well encapsulated.
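Spelled out with the index from earlier in the thread (this is purely a hypothetical sketch of the behavior, not something the patch implements yet):

-- order_by_saop is on tenk1 (two, four, twenty); there is no qual on
-- the leading column "two" here:
select ctid from tenk1 where four = 1 and twenty = 1;

-- Under the scheme described above, the scan would behave roughly as if
-- the query had an "array" on the skipped column:
select ctid from tenk1
where two in (0, 1)   -- discovered on the fly, via index probes
and four = 1 and twenty = 1;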
But it's not just skip scan. This approach also enables thinking of SAOP index scans (using nbtree) as just another type of indexable clause, without any special restrictions (compared to true indexable operators such as "=", say). Particularly in the planner. That was always the general thrust of teaching nbtree about SAOPs, from the start. But it's something that should be totally embraced IMV. That's just what the patch proposes to do. In particular, the patch now:

1. Entirely removes the long-standing restriction on generating pathkeys for index paths with SAOPs, even when there are inequalities on a high-order column present. You can mix SAOPs together with other clause types, arbitrarily, and everything still works and works efficiently.

For example, the regression test expected output for this query/test (from bugfix commit 807a40c5) is updated by the patch, as shown here:

explain (costs off)
SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
ORDER BY thousand;
-                                       QUERY PLAN
----------------------------------------------------------------------------------------
- Sort
-   Sort Key: thousand
-   ->  Index Scan using tenk1_thous_tenthous on tenk1
-         Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
-(4 rows)
+                                    QUERY PLAN
+--------------------------------------------------------------------------------
+ Index Scan using tenk1_thous_tenthous on tenk1
+   Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(2 rows)

We don't need a sort node anymore -- even though the leading column here (thousand) uses an inequality, a particularly tricky case. Now it's an index scan, much like any other, with no particular restrictions caused by using a SAOP.

2. Adds an nbtree strategy for non-required equality array scan keys, which is built on the same state machine, with only minor differences to deal with column values "appearing out of keyspace order".

3. Simplifies the optimizer side of things by consistently avoiding filter quals (except when it's truly unavoidable). The optimizer doesn't even consider alternative index paths with filter quals for lower-order SAOP columns, because they have no possible advantage anymore. On the other hand, as we saw already, upthread, filter quals have huge disadvantages. By always using true index quals, we automatically avoid any question of getting excessive amounts of heap page accesses just to eliminate non-matching rows. AFAICT we don't need to make a trade-off here.

The first version of the patch added some crufty code to the optimizer, to account for various restrictions on sort order. This revised version actually removes existing cruft from the same place (indxpath.c) instead.

Items 1, 2, and 3 are all closely related. Take the query I've shown for item 1. Bugfix commit 807a40c5 (which added the test query in question) dealt with an oversight in the then-recent original nbtree SAOP patch (commit 9e8da0f7): when nbtree combines two primitive index scans with an inequality on their leading column, we cannot be sure that the output will appear in the same order as the order that one big continuous index scan returns rows in. We can only expect to maintain the illusion that we're doing one continuous index scan when individual primitive index scans access earlier columns via the equality strategy -- we need "equality constraints". In practice, the optimizer (indxpath.c) is very conservative (more conservative than it really needs to be) when it comes to trusting the index scan to output rows in index order, in the presence of SAOPs.

All of that now seems totally unnecessary. Again, I don't see a need to make a trade-off here. My observation about this query (and others like it) is: why not literally perform one continuous index scan instead (not multiple primitive index scans)? That is strictly better, given all the specifics here. Once we have a way to do that (which the nbtree executor work listed under item 2 provides), it becomes safe to assume that the tuples will be output in index order -- there is no illusion left to preserve. Who needs an illusion that isn't actually helping us? We actually do less I/O by using this strategy, for the usual reasons (we can avoid repeating index page accesses).
A more concrete benefit of the non-required-scankeys stuff can be seen by running Benoit Tigeot's test case [1] with v2. He had a query like this:

SELECT * FROM docs
WHERE status IN ('draft', 'sent')
AND sender_reference IN ('Custom/1175', 'Client/362', 'Custom/280')
ORDER BY sent_at DESC NULLS LAST LIMIT 20;

And, his test case had an index on "sent_at DESC NULLS LAST, sender_reference, status". This variant was a weak spot for v1. v2 of the patch is vastly more efficient here, since we don't have to go to the heap to eliminate non-matching tuples -- that can happen in the index AM instead. This can easily be 2x-3x faster on a warm cache, and have *hundreds* of times fewer buffer accesses (which Benoit verified with an early version of this v2). All because we now require vastly less heap access -- the quals are fairly selective here, and we have to scan hundreds of leaf pages before the scan can terminate. Avoiding filter quals is a huge win.

This particular improvement is hard to squarely attribute to any one of my 3 items. The immediate problem that the query presents us with on the master branch is the problem of filter quals that require heap accesses to do visibility checks (a problem that index quals can never have). That makes it tempting to credit my item 3. But you can't really have item 3 without also having items 1 and 2. Taken together, they eliminate all possible downsides from using index quals. That high level direction (try to have one good choice for the optimizer) seems important to me. Both for this project, and in general.

Other changes in v2:

* Improved costing, which takes advantage of the fact that nbtree now promises to not repeat any leaf page accesses (unless the scan is restarted or the direction of the scan changes). This makes the worst case far more predictable, and more related to selectivity estimation -- you can't scan more pages than you have in the whole index. Just like with every other sort of index scan.

* Support for parallel index scans. The existing approach to array keys for parallel index scans has been adapted to work with individual primitive index scans, not individual array keys. I haven't tested this very thoroughly just yet, but it seems to work well enough already. I think that it's important to not have very much variation between parallel and serial index scans, which I seem to have mostly avoided.

[1] https://gist.github.com/benoittgt/ab72dc4cfedea2a0c6a5ee809d16e04d?permalink_comment_id=4690491#gistcomment-4690491

--
Peter Geoghegan
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Sun, Sep 17, 2023 at 4:47 PM Peter Geoghegan <pg@bowt.ie> wrote: > Attached is v2, which makes all array key advancement take place using > the "next index tuple" approach (using binary searches to find array > keys using index tuple values). Attached is v3, which fixes bitrot caused by today's bugfix commit 714780dc. No notable changes here compared to v2. -- Peter Geoghegan
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Thu, Sep 28, 2023 at 5:32 PM Peter Geoghegan <pg@bowt.ie> wrote: > On Sun, Sep 17, 2023 at 4:47 PM Peter Geoghegan <pg@bowt.ie> wrote: > > Attached is v2, which makes all array key advancement take place using > > the "next index tuple" approach (using binary searches to find array > > keys using index tuple values). > > Attached is v3, which fixes bitrot caused by today's bugfix commit 714780dc. Attached is v4, which applies cleanly on top of HEAD. This was needed due to Alexandar Korotkov's commit e0b1ee17, "Skip checking of scan keys required for directional scan in B-tree". Unfortunately I have more or less dealt with the conflicts on HEAD by disabling the optimization from that commit, for the time being. The commit in question is rather poorly documented, and it's not immediately clear how to integrate it with my work. I just want to make sure that there's a testable patch available. -- Peter Geoghegan
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Sun, Oct 15, 2023 at 1:50 PM Peter Geoghegan <pg@bowt.ie> wrote: > Attached is v4, which applies cleanly on top of HEAD. This was needed > due to Alexandar Korotkov's commit e0b1ee17, "Skip checking of scan > keys required for directional scan in B-tree". > > Unfortunately I have more or less dealt with the conflicts on HEAD by > disabling the optimization from that commit, for the time being. Attached is v5, which deals with the conflict with the optimization added by Alexandar Korotkov's commit e0b1ee17 sensibly: the optimization is now only disabled in cases without array scan keys. (It'd be very hard to make it work with array scan keys, since an important principle for my patch is that we can change search-type scan keys right in the middle of any _bt_readpage() call). v5 also fixes a longstanding open item for the patch: we no longer call _bt_preprocess_keys() with a buffer lock held, which was a bad idea at best, and unsafe (due to the syscache lookups within _bt_preprocess_keys) at worst. A new, minimal version of the function (called _bt_preprocess_keys_leafbuf) is called at the same point instead. That change, combined with the array binary search stuff (which was added back in v2), makes the total amount of work performed with a buffer lock held totally reasonable in all cases. It's even okay in extreme or adversarial cases with many millions of array keys. Making this _bt_preprocess_keys_leafbuf approach work has a downside: it requires that _bt_preprocess_keys be a little less aggressive about removing redundant scan keys, in order to meet certain assumptions held by the new _bt_preprocess_keys_leafbuf function. Essentially, _bt_preprocess_keys must now worry about current and future array key values when determining redundancy among scan keys -- not just the current array key values. _bt_preprocess_keys knows nothing about SK_SEARCHARRAY scan keys on HEAD, because on HEAD there is a strict 1:1 correspondence between the number of primitive index scans and the number of array keys (actually, the number of distinct combinations of array keys). Obviously that's no longer the case with the patch (that's the whole point of the patch). It's easiest to understand how elimination of redundant quals needs to work in v5 by way of an example. Consider the following query: select count(*), two, four, twenty, hundred from tenk1 where two in (0, 1) and four in (1, 2, 3) and two < 1; Notice that "two" appears in the where clause twice. First it appears as an SAOP, and then as an inequality. Right now, on HEAD, the primitive index scan where the SAOP's scankey is "two = 0" renders "two < 1" redundant. However, the subsequent primitive index scan where "two = 1" does *not* render "two < 1" redundant. This has implications for the mechanism in the patch, since the patch will perform one big primitive index scan for all array constants, with only a single _bt_preprocess_keys call at the start of its one and only _bt_first call (but with multiple _bt_preprocess_keys_leafbuf calls once we reach the leaf level). The compromise that I've settled on in v5 is to teach _bt_preprocess_keys to *never* treat "two < 1" as redundant with such a query -- even though there is some squishy sense in which "two < 1" is indeed still redundant (for the first SAOP key of value 0). My approach is reasonably well targeted in that it mostly doesn't affect queries that don't need it. But it will add cycles to some badly written queries that wouldn't have had them in earlier Postgres versions. 
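To make the compromise concrete, here is a runnable variant of that query (a GROUP BY is added here so that the aggregate is valid SQL; everything else is as above, using the regression test schema):

select count(*), two, four, twenty, hundred
from tenk1
where two in (0, 1) and four in (1, 2, 3) and two < 1
group by two, four, twenty, hundred;
-- with v5, "two < 1" is kept as a scan key rather than being proven
-- redundant, because it is only redundant for the array element 0 (not
-- for 1); the possible cost is a few extra comparisons per tuple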
I'm not entirely sure how much this matters, but my current sense is that it doesn't matter all that much. This is the kind of thing that is hard to test and poorly tested, so simplicity is even more of a virtue than usual. Note that the changes to _bt_preprocess_keys in v5 *don't* affect how we determine if the scan has contradictory quals, which is generally more important. With contradictory quals, _bt_first can avoid reading any data from the index. OTOH eliminating redundant quals (i.e. the thing that v5 *does* change) merely makes evaluating index quals less expensive via preprocessing-away unneeded scan keys. In other words, while it's possible that the approach taken by v5 will add CPU cycles in a small number of cases, it should never result in more page accesses. -- Peter Geoghegan
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Matthias van de Meent
Date:
On Sat, 21 Oct 2023 at 00:40, Peter Geoghegan <pg@bowt.ie> wrote: > > On Sun, Oct 15, 2023 at 1:50 PM Peter Geoghegan <pg@bowt.ie> wrote: > > Attached is v4, which applies cleanly on top of HEAD. This was needed > > due to Alexandar Korotkov's commit e0b1ee17, "Skip checking of scan > > keys required for directional scan in B-tree". > > > > Unfortunately I have more or less dealt with the conflicts on HEAD by > > disabling the optimization from that commit, for the time being. > > Attached is v5, which deals with the conflict with the optimization > added by Alexandar Korotkov's commit e0b1ee17 sensibly: the > optimization is now only disabled in cases without array scan keys. > (It'd be very hard to make it work with array scan keys, since an > important principle for my patch is that we can change search-type > scan keys right in the middle of any _bt_readpage() call). I'm planning on reviewing this patch tomorrow, but in an initial scan through the patch I noticed there's little information about how the array keys state machine works in this new design. Do you have a more toplevel description of the full state machine used in the new design? If not, I'll probably be able to discover my own understanding of the mechanism used in the patch, but if there is a framework to build that understanding on (rather than having to build it from scratch) that'd be greatly appreciated. Kind regards, Matthias van de Meent Neon (https://neon.tech)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Mon, Nov 6, 2023 at 1:28 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > I'm planning on reviewing this patch tomorrow, but in an initial scan > through the patch I noticed there's little information about how the > array keys state machine works in this new design. Do you have a more > toplevel description of the full state machine used in the new design? This is an excellent question. You're entirely right: there isn't enough information about the design of the state machine. In v1 of the patch, from all the way back in July, the "state machine" advanced in the hackiest way possible: via repeated "incremental" advancement (using logic from the function that we call _bt_advance_array_keys() on HEAD) in a loop -- we just kept doing that until the function I'm now calling _bt_tuple_before_array_skeys() eventually reported that the array keys were now sufficiently advanced. v2 greatly improved matters by totally overhauling _bt_advance_array_keys(): it was taught to use binary searches to advance the array keys, with limited remaining use of "incremental" array key advancement. However, version 2 (and all later versions to date) have somewhat wonky state machine transitions, in one important respect: calls to the new _bt_advance_array_keys() won't always advance the array keys to the maximum extent possible (possible while still getting correct behavior, that is). There were still various complicated scenarios involving multiple "required" array keys (SK_BT_REQFWD + SK_BT_REQBKWD scan keys that use BTEqualStrategyNumber), where one single call to _bt_advance_array_keys() would advance the array keys to a point that was still < caller's tuple. AFAICT this didn't cause wrong answers to queries (that would require failing to find a set of exactly matching array keys where a matching set exists), but it was kludgey. It was sloppy in roughly the same way as the approach in my v1 prototype was sloppy (just to a lesser degree). I should be able to post v6 later this week. My current plan is to commit the other nbtree patch first (the backwards scan "boundary cases" one from the ongoing CF) -- since I saw your review earlier today. I think that you should probably wait for this v6 before starting your review. The upcoming version will have simple preconditions and postconditions for the function that advances the array key state machine (the new _bt_advance_array_keys). These are enforced by assertions at the start and end of the function. So the rules for the state machine become crystal clear and fairly easy to keep in your head (e.g., tuple must be >= required array keys on entry and <= required array keys on exit, the array keys must always either advance by one increment or be completely exhausted for the top-level scan in the current scan direction). Unsurprisingly, I found that adding and enforcing these invariants led to a simpler and more general design within _bt_advance_array_keys. That code is still the most complicated part of the patch, but it's much less of a bag of tricks. Another reason for you to hold off for a few more days. -- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Matthias van de Meent
Date:
On Tue, 7 Nov 2023 at 00:03, Peter Geoghegan <pg@bowt.ie> wrote: > > On Mon, Nov 6, 2023 at 1:28 PM Matthias van de Meent > <boekewurm+postgres@gmail.com> wrote: > > I'm planning on reviewing this patch tomorrow, but in an initial scan > > through the patch I noticed there's little information about how the > > array keys state machine works in this new design. Do you have a more > > toplevel description of the full state machine used in the new design? > > This is an excellent question. You're entirely right: there isn't > enough information about the design of the state machine. > > I should be able to post v6 later this week. My current plan is to > commit the other nbtree patch first (the backwards scan "boundary > cases" one from the ongoing CF) -- since I saw your review earlier > today. I think that you should probably wait for this v6 before > starting your review. Okay, thanks for the update, then I'll wait for v6 to be posted. Kind regards, Matthias van de Meent Neon (https://neon.tech)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Tue, Nov 7, 2023 at 4:20 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > On Tue, 7 Nov 2023 at 00:03, Peter Geoghegan <pg@bowt.ie> wrote: > > I should be able to post v6 later this week. My current plan is to > > commit the other nbtree patch first (the backwards scan "boundary > > cases" one from the ongoing CF) -- since I saw your review earlier > > today. I think that you should probably wait for this v6 before > > starting your review. > > Okay, thanks for the update, then I'll wait for v6 to be posted. On second thought, I'll just post v6 now (there won't be conflicts against the master branch once the other patch is committed anyway). Highlights: * Major simplifications to the array key state machine, already described by my recent email. * Added preprocessing of "redundant and contradictory" array elements to _bt_preprocess_array_keys(). This makes the special preprocessing pass just for array keys ("preprocessing preprocessing") within _bt_preprocess_array_keys() make this query into a no-op: select * from tab where a in (180, 345) and a in (230, 300); -- contradictory Similarly, it can make this query only attempt one single primitive index scan for "230": select * from tab where a in (180, 230) and a in (230, 300); -- has redundancies, plus some individual elements contradict each other This duplicates some of what _bt_preprocess_keys can do already. But _bt_preprocess_keys can only do this stuff at the level of individual array elements/primitive index scans. Whereas this works "one level up", allowing preprocessing to see the full picture rather than just seeing the start of one particular primitive index scan. It explicitly works across array keys, saving repeat work inside _bt_preprocess_keys. That could really add up with thousands of array keys and/or multiple SAOPs. (Note that _bt_preprocess_array_keys already does something like this, to deal with SAOP inequalities such as "WHERE my_col >= any (array[1, 2])" -- it's a little surprising that this obvious optimization wasn't part of the original nbtree SAOP patch.) This reminds me: you might want to try breaking the patch by coming up with adversarial cases, Matthias. The patch needs to be able to deal with absurdly large amounts of array keys reasonably well, because it proposes to normalize passing those to the nbtree code. It's especially important that the patch never takes too much time to do something (e.g., binary searching through array keys) while holding a buffer lock -- even with very silly adversarial queries. So, for example, queries like this one (specifically designed to stress the implementation) *need* to work reasonably well: with a as ( select i from generate_series(0, 500000) i ) select count(*), thousand, tenthous from tenk1 where thousand = any (array[(select array_agg(i) from a)]) and tenthous = any (array[(select array_agg(i) from a)]) group by thousand, tenthous order by thousand, tenthous; (You can run this yourself after the regression tests finish, of course.) This takes about 130ms on my machine, hardly any of which takes place in the nbtree code with the patch (think tens of microseconds per _bt_readpage call, at most) -- the plan is an index-only scan that gets only 30 buffer hits. On the master branch, it's vastly slower -- 1000025 buffer hits. The query as a whole takes about 3 seconds there. 
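One hedged way to reproduce those buffer numbers yourself (this is the same query as above, just wrapped in EXPLAIN; run it against the regression test database once the tests finish):

explain (analyze, buffers, costs off)
with a as (
  select i from generate_series(0, 500000) i
)
select count(*), thousand, tenthous
from tenk1
where thousand = any (array[(select array_agg(i) from a)])
  and tenthous = any (array[(select array_agg(i) from a)])
group by thousand, tenthous
order by thousand, tenthous;
-- expect "Buffers: shared hit" on the index-only scan node to be ~30
-- with the patch, versus ~1,000,025 on master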
If you have 3 or 4 SAOPs (with a composite index that has as many columns) you can quite easily DoS the master branch, since the planner makes a generic assumption that each of these SAOPs will have only 10 elements. The planner makes the same assumption with the patch applied, but with one important difference: it doesn't matter to nbtree. The cost of scanning each index page should be practically independent of the total size of each array, at least past a certain point. Similarly, the maximum cost of an index scan should be approximately fixed: it should be capped at the cost of a full index scan (with the added cost of these relatively expensive quals still capped, still essentially independent of array sizes past some point).

I notice that if I remove the "thousand = any (array[(select array_agg(i) from a)]) and" line from the adversarial query, executing the resulting query still gets 30 buffer hits with the patch -- though it only takes 90ms this time (it's faster for reasons that likely have less to do with nbtree overheads than you'd think). This is just another way of getting roughly the same full index scan. That's a completely new way of thinking about nbtree SAOPs from a planner perspective (also from a user's perspective, I suppose).

It's important that the planner's new optimistic assumptions about the cost profile of SAOPs (that it can expect reasonable performance/access patterns with wildly unreasonable/huge/arbitrarily complicated SAOPs) always be met by nbtree -- no repeat index page accesses, no holding a buffer lock for more than (say) a small fraction of 1 millisecond (no matter the complexity of the query), and possibly other things I haven't thought of yet. If you end up finding a bug in this v6, it'll most likely be a case where nbtree fails to live up to that. This project is as much about robust/predictable performance as anything else -- nbtree needs to be able to cope with practically anything. I suggest that your review start by trying to break the patch along these lines. -- Peter Geoghegan
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Tue, Nov 7, 2023 at 5:53 PM Peter Geoghegan <pg@bowt.ie> wrote: > If you end up finding a bug in this v6, it'll most likely be a case > where nbtree fails to live up to that. This project is as much about > robust/predictable performance as anything else -- nbtree needs to be > able to cope with practically anything. I suggest that your review > start by trying to break the patch along these lines. I spent some time on this myself today (which I'd already planned on). Attached is an adversarial stress-test, which shows something that must be approaching the worst case for the patch in terms of time spent with a buffer lock held, due to spending so much time evaluating unusually expensive SAOP index quals. The array binary searches that take place with a buffer lock held aren't quite like anything else that nbtree can do right now, so it's worthy of special attention. I thought of several factors that maximize both the number of binary searches within any given _bt_readpage, as well as the cost of each binary search -- the SQL file has full details. My test query is *extremely* unrealistic, since it combines multiple independent unrealistic factors, all of which aim to make life hard for the implementation in one way or another. I hesitate to say that it couldn't be much worse (I only spent a few hours on this), but I'm prepared to say that it seems very unlikely that any real world query could make the patch spend as many cycles in _bt_readpage/_bt_checkkeys as this one does. Perhaps you can think of some other factor that would make this test case even less sympathetic towards the patch, Matthias? The only thing I thought of that I've left out was the use of a custom btree opclass, "unrealistically_slow_ops". Something that calls pg_usleep in its order proc. (I left it out because it wouldn't prove anything.) On my machine, custom instrumentation shows that each call to _bt_readpage made while this query executes (on a patched server) takes just under 1.4 milliseconds. While that is far longer than it usually takes, it's basically acceptable IMV. It's not significantly longer than I'd expect heap_index_delete_tuples() to take on an average day with EBS (or other network-attached storage). But that's a process that happens all the time, with an exclusive buffer lock held on the leaf page throughout -- whereas this is only a shared buffer lock, and involves a query that's just absurd . Another factor that makes this seem acceptable is just how sensitive the test case is to everything going exactly and perfectly wrong, all at the same time, again and again. The test case uses a 32 column index (the INDEX_MAX_KEYS maximum), with a query that has 32 SAOP clauses (one per index column). If I reduce the number of SAOP clauses in the query to (say) 8, I still have a test case that's almost as silly as my original -- but now we only spend ~225 microseconds in each _bt_readpage call (i.e. we spend over 6x less time in each _bt_readpage call). (Admittedly if I also make the CREATE INDEX use only 8 columns, we can fit more index tuples on one page, leaving us at ~800 microseconds). I'm a little surprised that it isn't a lot worse than this, given how far I went. I was a little concerned that it would prove necessary to lock this kind of thing down at some higher level (e.g., in the planner), but that now looks unnecessary. There are much better ways to DOS the server than this. For example, you could run this same query while forcing a sequential scan! 
That appears to be quite a lot less responsive to interrupts (in addition to being hopelessly slow), probably because it uses parallel workers, each of which will use wildly expensive filter quals that just do a linear scan of the SAOP. -- Peter Geoghegan
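For anyone reading along without the attachment, a much-simplified, hedged sketch of the general shape of this kind of stress test (the real test uses a 32 column index -- the INDEX_MAX_KEYS maximum -- with 32 SAOP clauses and far nastier data; the table, columns, and constants below are invented):

set enable_seqscan = off;

create table wide (
  c1 int, c2 int, c3 int, c4 int,
  c5 int, c6 int, c7 int, c8 int
);
insert into wide
select i % 3, i % 5, i % 7, i % 11, i % 13, i % 17, i % 19, i % 23
from generate_series(1, 100000) i;
create index wide_idx on wide (c1, c2, c3, c4, c5, c6, c7, c8);
vacuum analyze wide;

-- one SAOP per index column, so each _bt_readpage call has to binary
-- search many arrays for many tuples
select count(*)
from wide
where c1 = any (array(select generate_series(0, 200)))
  and c2 = any (array(select generate_series(0, 200)))
  and c3 = any (array(select generate_series(0, 200)))
  and c4 = any (array(select generate_series(0, 200)))
  and c5 = any (array(select generate_series(0, 200)))
  and c6 = any (array(select generate_series(0, 200)))
  and c7 = any (array(select generate_series(0, 200)))
  and c8 = any (array(select generate_series(0, 200)));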
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Matthias van de Meent
Date:
On Fri, 10 Nov 2023 at 00:58, Peter Geoghegan <pg@bowt.ie> wrote: > On Tue, Nov 7, 2023 at 5:53 PM Peter Geoghegan <pg@bowt.ie> wrote: > > If you end up finding a bug in this v6, it'll most likely be a case > > where nbtree fails to live up to that. This project is as much about > > robust/predictable performance as anything else -- nbtree needs to be > > able to cope with practically anything. I suggest that your review > > start by trying to break the patch along these lines. > > I spent some time on this myself today (which I'd already planned on). > > Attached is an adversarial stress-test, which shows something that > must be approaching the worst case for the patch in terms of time > spent with a buffer lock held, due to spending so much time evaluating > unusually expensive SAOP index quals. The array binary searches that > take place with a buffer lock held aren't quite like anything else > that nbtree can do right now, so it's worthy of special attention. > > I thought of several factors that maximize both the number of binary > searches within any given _bt_readpage, as well as the cost of each > binary search -- the SQL file has full details. My test query is > *extremely* unrealistic, since it combines multiple independent > unrealistic factors, all of which aim to make life hard for the > implementation in one way or another. I hesitate to say that it > couldn't be much worse (I only spent a few hours on this), but I'm > prepared to say that it seems very unlikely that any real world query > could make the patch spend as many cycles in > _bt_readpage/_bt_checkkeys as this one does. > > Perhaps you can think of some other factor that would make this test > case even less sympathetic towards the patch, Matthias? The only thing > I thought of that I've left out was the use of a custom btree opclass, > "unrealistically_slow_ops". Something that calls pg_usleep in its > order proc. (I left it out because it wouldn't prove anything.) Have you tried using text index columns that are sorted with non-default locales? I've seen non-default locales use significantly more resources during compare operations than any other ordering operation I know of (which has mostly been in finding the locale), and use it extensively to test improvements for worst index shapes over at my btree patchsets because locales are dynamically loaded in text compare and nondefault locales are not cached at all. I suspect that this would be even worse if a somehow even worse locale path is available than what I'm using for test right now; this could be the case with complex custom ICU locales. > On my machine, custom instrumentation shows that each call to > _bt_readpage made while this query executes (on a patched server) > takes just under 1.4 milliseconds. While that is far longer than it > usually takes, it's basically acceptable IMV. It's not significantly > longer than I'd expect heap_index_delete_tuples() to take on an > average day with EBS (or other network-attached storage). But that's a > process that happens all the time, with an exclusive buffer lock held > on the leaf page throughout -- whereas this is only a shared buffer > lock, and involves a query that's just absurd . > > Another factor that makes this seem acceptable is just how sensitive > the test case is to everything going exactly and perfectly wrong, all > at the same time, again and again. The test case uses a 32 column > index (the INDEX_MAX_KEYS maximum), with a query that has 32 SAOP > clauses (one per index column). 
> If I reduce the number of SAOP clauses in the query to (say) 8, I
> still have a test case that's almost as silly as my original -- but
> now we only spend ~225 microseconds in each _bt_readpage call (i.e.
> we spend over 6x less time in each _bt_readpage call). (Admittedly if
> I also make the CREATE INDEX use only 8 columns, we can fit more
> index tuples on one page, leaving us at ~800 microseconds).

A quick update of the table definition -- using the various installed 'fr-%-x-icu' locales on text hash columns (a different collation for each column) instead of numeric -- gets me to EXPLAIN (ANALYZE) showing 2.07ms spent on every buffer hit inside the index scan node, as opposed to 1.76ms when using numeric. But, as you mention, the value of this metric is probably not very high.

As for the patch itself, I'm probably about 50% through the patch now. While reviewing, I noticed the following two user-visible items, related to SAOP but not broken by or touched upon in this patch:

1. We don't seem to plan `column opr ALL (...)` as index conditions, while this should be trivial to optimize for at least btree. Example:

SET enable_bitmapscan = OFF;
WITH a AS (select generate_series(1, 1000) a)
SELECT * FROM tenk1
WHERE thousand = ANY (array(table a))
  AND thousand < ALL (array(table a));

This will never return any rows, but it does hit 9990 buffers in the new btree code, while I expected that to be 0 buffers based on the query and index (that is, I expected to hit 0 buffers, until I realized that we don't push ALL into index filters). I shall assume ALL isn't used all that often (heh), but it sure feels like we're missing out on performance here.

2. We also don't seem to support array keys for row compares, which probably is an even more niche use case:

SELECT count(*) FROM tenk1
WHERE (thousand, tenthous) = ANY (ARRAY[(1, 1), (1, 2), (2, 1)]);

This is no different from master, but it'd be nice if there was support for arrays of row operations too, just so that composite primary keys can also be looked up with SAOPs.

Kind regards, Matthias van de Meent
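As a hedged sketch of the locale idea mentioned above (this assumes an ICU-enabled build where the fr-*-x-icu collations were imported at initdb time; the table and column names are invented):

create table loc_test (
  c1 text collate "fr-FR-x-icu",
  c2 text collate "fr-BE-x-icu"
);
insert into loc_test
select md5(i::text), md5((i + 1)::text)
from generate_series(1, 100000) i;
create index loc_test_idx on loc_test (c1, c2);
vacuum analyze loc_test;

-- SAOP quals on c1/c2 now pay the nondefault-locale comparison cost in
-- every binary search that _bt_readpage performs with the buffer lock held
select count(*)
from loc_test
where c1 = any (array(select md5(i::text) from generate_series(1, 500) i))
  and c2 = any (array(select md5(i::text) from generate_series(1, 500) i));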
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Matthias van de Meent
Date:
On Wed, 8 Nov 2023 at 02:53, Peter Geoghegan <pg@bowt.ie> wrote: > > On Tue, Nov 7, 2023 at 4:20 AM Matthias van de Meent > <boekewurm+postgres@gmail.com> wrote: > > On Tue, 7 Nov 2023 at 00:03, Peter Geoghegan <pg@bowt.ie> wrote: > > > I should be able to post v6 later this week. My current plan is to > > > commit the other nbtree patch first (the backwards scan "boundary > > > cases" one from the ongoing CF) -- since I saw your review earlier > > > today. I think that you should probably wait for this v6 before > > > starting your review. > > > > Okay, thanks for the update, then I'll wait for v6 to be posted. > > On second thought, I'll just post v6 now (there won't be conflicts > against the master branch once the other patch is committed anyway). Thanks. Here's my review of the btree-related code: > +++ b/src/backend/access/nbtree/nbtsearch.c > @@ -1625,8 +1633,9 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum) > * set flag to true if all required keys are satisfied and false > * otherwise. > */ > - (void) _bt_checkkeys(scan, itup, indnatts, dir, > - &requiredMatchedByPrecheck, false); > + _bt_checkkeys(scan, &pstate, itup, false, false); > + requiredMatchedByPrecheck = pstate.continuescan; > + pstate.continuescan = true; /* reset */ The comment above the updated section needs to be updated. > @@ -1625,8 +1633,9 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum) > * set flag to true if all required keys are satisfied and false > * otherwise. > */ > - (void) _bt_checkkeys(scan, itup, indnatts, dir, > - &requiredMatchedByPrecheck, false); > + _bt_checkkeys(scan, &pstate, itup, false, false); This 'false' finaltup argument is surely wrong for the rightmost page's rightmost tuple, no? > +++ b/src/backend/access/nbtree/nbtutils.c > @@ -357,6 +431,46 @@ _bt_preprocess_array_keys(IndexScanDesc scan) > + /* We could pfree(elem_values) after, but not worth the cycles */ > + num_elems = _bt_merge_arrays(scan, cur, > + (indoption[cur->sk_attno - 1] & INDOPTION_DESC) != 0, > + prev->elem_values, prev->num_elems, > + elem_values, num_elems); This code can get hit several times when there are multiple = ANY clauses, which may result in repeated leakage of these arrays during this scan. I think cleaning up may well be worth the cycles when the total size of the arrays is large enough. > @@ -496,6 +627,48 @@ _bt_sort_array_elements(IndexScanDesc scan, ScanKey skey, > _bt_compare_array_elements, &cxt); > +_bt_merge_arrays(IndexScanDesc scan, ScanKey skey, bool reverse, > + Datum *elems_orig, int nelems_orig, > + Datum *elems_next, int nelems_next) > [...] > + /* > + * Incrementally copy the original array into a temp buffer, skipping over > + * any items that are missing from the "next" array > + */ Given that we only keep the members that both arrays have in common, the result array will be a strict subset of the original array. So, I don't quite see why we need the temporary buffer here - we can reuse the entries of the elems_orig array that we've already compared against the elems_next array. We may want to optimize this further by iterating over only the smallest array: With the current code, [1, 2] + [1....1000] is faster to merge than [1..1000] + [1000, 1001], because 2 * log(1000) is much smaller than 1000*log(2). In practice this may matter very little, though. An even better optimized version would do a merge join on the two arrays, rather than loop + binary search. 
> @@ -515,6 +688,161 @@ _bt_compare_array_elements(const void *a, const void *b, void *arg) > [...] > +_bt_binsrch_array_skey(FmgrInfo *orderproc, Is there a reason for this complex initialization of high/low_elem, rather than the this easier to understand and more compact initialization?: + low_elem = 0; + high_elem = array->num_elems - 1; + if (cur_elem_start) + { + if (ScanDirectionIsForward(dir)) + low_elem = array->cur_elem; + else + high_elem = array->cur_elem; + } > @@ -661,20 +1008,691 @@ _bt_restore_array_keys(IndexScanDesc scan) > [...] > + _bt_array_keys_remain(IndexScanDesc scan, ScanDirection dir) > [...] > + if (scan->parallel_scan != NULL) > + _bt_parallel_done(scan); > + > + /* > + * No more primitive index scans. Terminate the top-level scan. > + */ > + return false; I think the conditional _bt_parallel_done(scan) feels misplaced here, as the comment immediately below indicates the scan is to be terminated after that comment. So, either move this _bt_parallel_done call outside the function (which by name would imply it is read-only, without side effects like this) or move it below the comment "terminate the top-level scan". > +_bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate, > [...] > + * Set up ORDER 3-way comparison function and array state > [...] > + * Optimization: Skip over non-required scan keys when we know that These two sections should probably be swapped, as the skip makes the setup useless. Also, the comment here is wrong; the scan keys that are skipped are 'required', not 'non-required'. > +++ b/src/test/regress/expected/join.out > @@ -8620,10 +8620,9 @@ where j1.id1 % 1000 = 1 and j2.id1 % 1000 = 1 and j2.id1 >= any (array[1,5]); > Merge Cond: (j1.id1 = j2.id1) > Join Filter: (j2.id2 = j1.id2) > -> Index Scan using j1_id1_idx on j1 > - -> Index Only Scan using j2_pkey on j2 > + -> Index Scan using j2_id1_idx on j2 > Index Cond: (id1 >= ANY ('{1,5}'::integer[])) > - Filter: ((id1 % 1000) = 1) > -(7 rows) > +(6 rows) I'm a bit surprised that we don't have the `id1 % 1000 = 1` filter anymore. The output otherwise matches (quite possibly because the other join conditions don't match) and I don't have time to investigate the intricacies between IOS vs normal IS, but this feels off. ---- As for the planner changes, I don't think I'm familiar enough with the planner to make any authorative comments on this. However, it does look like you've changed the meaning of 'amsearcharray', and I'm not sure it's OK to assume all indexes that support amsearcharray will also support for this new assumption of ordered retrieval of SAOPs. For one, the pgroonga extension [0] does mark amcanorder+amsearcharray. Kind regards, Matthias van de Meent Neon (https://neon.tech) [0] https://github.com/pgroonga/pgroonga/blob/115414723c7eb8ce9eb667da98e008bd10fbae0a/src/pgroonga.c#L8782-L8788
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Sat, Nov 11, 2023 at 1:08 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > Thanks. Here's my review of the btree-related code: Attached is v7. The main focus in v7 is making the handling of required inequality-strategy scan keys more explicit -- now there is an understanding of inequalities shared by _bt_check_compare (the function that becomes the guts of _bt_checkkeys) and the new _bt_advance_array_keys function/state machine. The big idea for v7 is to generalize how we handle required equality-strategy scan keys (always required in both scan directions), extending the same concept to deal with required inequality strategy scan keys (only ever required in one direction, which may or may not be the scan direction). This led to my discovering and fixing a couple of bugs related to inequality handling. These issues were of the same general character as many others I've dealt with before now: they involved subtle confusion about when and how to start another primitive index scan, leading to the scan reading many more pages than strictly necessary (potentially many more than master). In other words, cases where we didn't give up and start another primitive index scan, even though (with a repro of the issue) it's obviously not sensible. An accidental full index scan. While I'm still not completely happy with the way that inequalities are handled, things in this area are much improved in v7. It should be noted that the patch isn't strictly guaranteed to always read fewer index pages than master, for a given query plan and index. This is by design. Though the patch comes close, it's not quite a certainty. There are known cases where the patch reads the occasional extra page (relative to what master would have done under the same circumstances). These are cases where the implementation just cannot know for sure whether the next/sibling leaf page has key space covered by any of the scan's array keys (at least not in a way that seems practical). The implementation has simple heuristics that infer (a polite word for "make an educated guess") about what will be found on the next page. Theoretically we could be more conservative in how we go about this, but that seems like a bad idea to me. It's really easy to find cases where the maximally conservative approach loses by a lot, and really hard to show cases where it wins at all. These heuristics are more or less a limited form of the heuristics that skip scan would need. A *very* limited form. We're still conservative. Here's how it works, at a high level: if the scan can make it all the way to the end of the page without having to start a new primitive index scan (before reaching the end), and then finds that "finaltup" itself (which is usually the page high key) advances the array keys, we speculate: we move on to the sibling page. It's just about possible that we'll discover (once on the next page) that finaltup actually advanced the array keys by so much (in one single advancement step) that the current/new keys cover key space beyond the sibling page we just arrived at. The sibling page access will have been wasted (though I prefer to think of it as a cost of doing business). I go into a lot of detail on the trade-offs in this area in comments at the end of the new _bt_checkkeys(), just after it calls _bt_advance_array_keys(). Hopefully this is reasonably clear. It's always much easier to understand these things when you've written lots of test cases, though. 
So I wouldn't at all be surprised to hear that my explanation needs more work. I suspect that I'm spending more time on the topic than it actually warrants, but you have to spend a lot of time on it for yourself to be able to see why that is. > > +++ b/src/backend/access/nbtree/nbtsearch.c > > @@ -1625,8 +1633,9 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum) > > * set flag to true if all required keys are satisfied and false > > * otherwise. > > */ > > - (void) _bt_checkkeys(scan, itup, indnatts, dir, > > - &requiredMatchedByPrecheck, false); > > + _bt_checkkeys(scan, &pstate, itup, false, false); > > + requiredMatchedByPrecheck = pstate.continuescan; > > + pstate.continuescan = true; /* reset */ > > The comment above the updated section needs to be updated. Updated. > > @@ -1625,8 +1633,9 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum) > > * set flag to true if all required keys are satisfied and false > > * otherwise. > > */ > > - (void) _bt_checkkeys(scan, itup, indnatts, dir, > > - &requiredMatchedByPrecheck, false); > > + _bt_checkkeys(scan, &pstate, itup, false, false); > > This 'false' finaltup argument is surely wrong for the rightmost > page's rightmost tuple, no? Not in any practical sense. Since finaltup means "the tuple that you should use to decide whether to go to the next page or not", and a rightmost page doesn't have a next page. There are exactly two ways that the top-level scan can end (not to be confused with the primitive scan), at least in v7. They are: 1. The state machine can exhaust the scan's array keys, ending the top-level scan. 2. The scan can just run out of pages, without ever running out of array keys (some array keys can sort higher than any real value from the index). This is just like how an index scan ends when it lacks any required scan keys to terminate the scan, and eventually runs out of pages to scan (think of an index-only scan that performs a full scan of the index, feeding into a group aggregate). Note that it wouldn't be okay if the design relied on _bt_checkkeys advancing and exhausting the array keys -- we really do need both 1 and 2 to deal with various edge cases. For example, there is no way that we'll ever be able to call _bt_checkkeys with a completely empty index. It simply doesn't have any tuples at all. In fact, it doesn't even have any pages (apart from the metapage), so clearly we can't expect any calls to _bt_readpage (much less _bt_checkkeys). > > +++ b/src/backend/access/nbtree/nbtutils.c > > @@ -357,6 +431,46 @@ _bt_preprocess_array_keys(IndexScanDesc scan) > > + /* We could pfree(elem_values) after, but not worth the cycles */ > > + num_elems = _bt_merge_arrays(scan, cur, > > + (indoption[cur->sk_attno - 1] & INDOPTION_DESC) != 0, > > + prev->elem_values, prev->num_elems, > > + elem_values, num_elems); > > This code can get hit several times when there are multiple = ANY > clauses, which may result in repeated leakage of these arrays during > this scan. I think cleaning up may well be worth the cycles when the > total size of the arrays is large enough. They won't leak because the memory is allocated in the same dedicated memory context. That said, I added a pfree(). It couldn't hurt. > > @@ -496,6 +627,48 @@ _bt_sort_array_elements(IndexScanDesc scan, ScanKey skey, > > _bt_compare_array_elements, &cxt); > > +_bt_merge_arrays(IndexScanDesc scan, ScanKey skey, bool reverse, > > + Datum *elems_orig, int nelems_orig, > > + Datum *elems_next, int nelems_next) > > [...] 
> > + /* > > + * Incrementally copy the original array into a temp buffer, skipping over > > + * any items that are missing from the "next" array > > + */ > > Given that we only keep the members that both arrays have in common, > the result array will be a strict subset of the original array. So, I > don't quite see why we need the temporary buffer here - we can reuse > the entries of the elems_orig array that we've already compared > against the elems_next array. This code path is only hit when the query was written on autopilot, since it must have contained redundant SAOPs for the same index column -- a glaring inconsistency. Plus these arrays just aren't very big in practice (despite my concerns about huge arrays). Plus there is only one of these array-specific preprocessing steps per btrescan. So I don't think that it's worth going to too much trouble here. > We may want to optimize this further by iterating over only the > smallest array: With the current code, [1, 2] + [1....1000] is faster > to merge than [1..1000] + [1000, 1001], because 2 * log(1000) is much > smaller than 1000*log(2). In practice this may matter very little, > though. > An even better optimized version would do a merge join on the two > arrays, rather than loop + binary search. v7 allocates the temp buffer using the size of whatever array is the smaller of the two, just because it's an easy marginal improvement. > > @@ -515,6 +688,161 @@ _bt_compare_array_elements(const void *a, const void *b, void *arg) > > [...] > > +_bt_binsrch_array_skey(FmgrInfo *orderproc, > > Is there a reason for this complex initialization of high/low_elem, > rather than the this easier to understand and more compact > initialization?: > > + low_elem = 0; > + high_elem = array->num_elems - 1; > + if (cur_elem_start) > + { > + if (ScanDirectionIsForward(dir)) > + low_elem = array->cur_elem; > + else > + high_elem = array->cur_elem; > + } I agree that it's better your way. Done that way in v7. > I think the conditional _bt_parallel_done(scan) feels misplaced here, > as the comment immediately below indicates the scan is to be > terminated after that comment. So, either move this _bt_parallel_done > call outside the function (which by name would imply it is read-only, > without side effects like this) or move it below the comment > "terminate the top-level scan". v7 moves the comment up, so that it's just before the _bt_parallel_done() call. > > +_bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate, > > [...] > > + * Set up ORDER 3-way comparison function and array state > > [...] > > + * Optimization: Skip over non-required scan keys when we know that > > These two sections should probably be swapped, as the skip makes the > setup useless. Not quite: we need to increment arrayidx for later loop iterations/scan keys. > Also, the comment here is wrong; the scan keys that are skipped are > 'required', not 'non-required'. Agreed. Fixed. > > +++ b/src/test/regress/expected/join.out > > @@ -8620,10 +8620,9 @@ where j1.id1 % 1000 = 1 and j2.id1 % 1000 = 1 and j2.id1 >= any (array[1,5]); > > Merge Cond: (j1.id1 = j2.id1) > > Join Filter: (j2.id2 = j1.id2) > > -> Index Scan using j1_id1_idx on j1 > > - -> Index Only Scan using j2_pkey on j2 > > + -> Index Scan using j2_id1_idx on j2 > > Index Cond: (id1 >= ANY ('{1,5}'::integer[])) > > - Filter: ((id1 % 1000) = 1) > > -(7 rows) > > +(6 rows) > > I'm a bit surprised that we don't have the `id1 % 1000 = 1` filter > anymore. 
The output otherwise matches (quite possibly because the > other join conditions don't match) and I don't have time to > investigate the intricacies between IOS vs normal IS, but this feels > off. This happens because the new plan uses a completely different index -- which happens to be a partial index whose predicate exactly matches the old plan's filter quals. That factor makes the filter quals unnecessary. That's all this is. > As for the planner changes, I don't think I'm familiar enough with the > planner to make any authorative comments on this. However, it does > look like you've changed the meaning of 'amsearcharray', and I'm not > sure it's OK to assume all indexes that support amsearcharray will > also support for this new assumption of ordered retrieval of SAOPs. > For one, the pgroonga extension [0] does mark > amcanorder+amsearcharray. The changes that I've made to the planner are subtractive. We more or less go back to how things were just after the initial work on nbtree amsearcharray support. That work was (at least tacitly) assumed to have no impact on ordered scans. Because why should it? What other type of index clause has ever affected what seems like a rather unrelated thing (namely the sort order of the scan)? The oversight was understandable. The kinds of plans that master cannot produce output for in standard index order are really silly plans, independent of this issue; it makes zero sense to allow a non-required array scan key to affect how or when we skip. The code that I'm removing from the planner is code that quite obviously assumes nbtree-like behavior. So I'm taking away code like that, rather than adding new code like that. That said, I am really surprised that any extension creates an index AM amcanorder=true (not to be confused with amcanorderbyop=true, which is less surprising). That means that it promises the planner that it behaves just like nbtree. To quote the docs, it must have "btree-compatible strategy numbers for their [its] equality and ordering operators". Is that really something that pgroonga even attempts? And if so, why? I also find it bizarre that pgroonga's handler-stated capabilities include "amcanunique=true". So pgroonga is a full text search engine, but also supports unique indexes? I find that particularly hard to believe, and suspect that the way that they set things up in the AM handler just isn't very well thought out. -- Peter Geoghegan
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Heikki Linnakangas
Date:
On 21/11/2023 04:52, Peter Geoghegan wrote: > Attached is v7. First, some high-level reactions before looking at the patch very closely: - +1 on the general idea. Hard to see any downsides if implemented right. - This changes the meaning of amsearcharray==true to mean that the ordering is preserved with ScalarArrayOps, right? You change B-tree to make that true, but what about any out-of-tree index AM extensions? I don't know if any such extensions exist, and I don't think we should jump through any hoops to preserve backwards compatibility here, but probably deserves a notice in the release notes if nothing else. - You use the term "primitive index scan" a lot, but it's not clear to me what it means. Does one ScalarArrayOps turn into one "primitive index scan"? Or each element in the array turns into a separate primitive index scan? Or something in between? Maybe add a new section to the README explain how that works. - _bt_preprocess_array_keys() is called for each btrescan(). It performs a lot of work like cmp function lookups and desconstructing and merging the arrays, even if none of the SAOP keys change in the rescan. That could make queries with nested loop joins like this slower than before: "select * from generate_series(1, 50) g, tenk1 WHERE g = tenk1.unique1 and tenk1.two IN (1,2);". - nbtutils.c is pretty large now. Perhaps create a new file nbtpreprocesskeys.c or something? - You and Matthias talked about an implicit state machine. I wonder if this could be refactored to have a more explicit state machine. The state transitions and interactions between _bt_checkkeys(), _bt_advance_array_keys() and friends feel complicated. And then some details: > --- a/doc/src/sgml/monitoring.sgml > +++ b/doc/src/sgml/monitoring.sgml > @@ -4035,6 +4035,19 @@ description | Waiting for a newly initialized WAL file to reach durable storage > </para> > </note> > > + <note> > + <para> > + Every time an index is searched, the index's > + <structname>pg_stat_all_indexes</structname>.<structfield>idx_scan</structfield> > + field is incremented. This usually happens once per index scan node > + execution, but might take place several times during execution of a scan > + that searches for multiple values together. Only queries that use certain > + <acronym>SQL</acronym> constructs to search for rows matching any value > + out of a list (or an array) of multiple scalar values are affected. See > + <xref linkend="functions-comparisons"/> for details. > + </para> > + </note> > + Is this true even without this patch? Maybe commit this separately. The "Only queries ..." sentence feels difficult. Maybe something like "For example, queries using IN (...) or = ANY(...) constructs.". > * _bt_preprocess_keys treats each primitive scan as an independent piece of > * work. That structure pushes the responsibility for preprocessing that must > * work "across array keys" onto us. This division of labor makes sense once > * you consider that we're typically called no more than once per btrescan, > * whereas _bt_preprocess_keys is always called once per primitive index scan. "That structure ..." is a garden-path sentence. I kept parsing "that must work" as one unit, the same way as "that structure", and it didn't make sense. Took me many re-reads to parse it correctly. Now that I get it, it doesn't bother me anymore, but maybe it could be rephrased. Is there _any_ situation where _bt_preprocess_array_keys() is called more than once per btrescan? > /* > * Look up the appropriate comparison operator in the opfamily. 
> * > * Note: it's possible that this would fail, if the opfamily is > * incomplete, but it seems quite unlikely that an opfamily would omit > * non-cross-type comparison operators for any datatype that it supports > * at all. ... > */ I agree that's unlikely. I cannot come up with an example where you would have cross-type operators between A and B, but no same-type operators between B and B. For any real-world opfamily, that would be an omission you'd probably want to fix. Still I wonder if we could easily fall back if it doesn't exist? And maybe add a check in the 'opr_sanity' test for that. In _bt_readpage(): > /* > * Prechecking the page with scan keys required for direction scan. We > * check these keys with the last item on the page (according to our scan > * direction). If these keys are matched, we can skip checking them with > * every item on the page. Scan keys for our scan direction would > * necessarily match the previous items. Scan keys required for opposite > * direction scan are already matched by the _bt_first() call. > * > * With the forward scan, we do this check for the last item on the page > * instead of the high key. It's relatively likely that the most > * significant column in the high key will be different from the > * corresponding value from the last item on the page. So checking with > * the last item on the page would give a more precise answer. > * > * We skip this for the first page in the scan to evade the possible > * slowdown of point queries. Never apply the optimization with a scans > * that uses array keys, either, since that breaks certain assumptions. > * (Our search-type scan keys change whenever _bt_checkkeys advances the > * arrays, invalidating any precheck. Tracking all that would be tricky.) > */ > if (!so->firstPage && !numArrayKeys && minoff < maxoff) > { It's sad to disable this optimization completely for array keys. It's actually a regression from current master, isn't it? There's no fundamental reason we couldn't do it for array keys so I think we should do it. _bt_checkkeys() is called in an assertion in _bt_readpage, but it has the side-effect of advancing the array keys. Side-effects from an assertion seems problematic. Vague idea: refactor _bt_checkkeys() into something that doesn't have side-effects, and have a separate function or an argument to _bt_checkkeys() to advance to next array key. The prechecking optimization and the Assertion could both use the side-effect-free function. -- Heikki Linnakangas Neon (https://neon.tech)
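As a hedged sketch of the opr_sanity-style check suggested above (an illustration of the idea only, not a proposed test addition; it uses the btree comparison support function as a proxy for the comparison operators the patch looks up):

select opf.opfname,
       p.amproclefttype::regtype as lefttype,
       p.amprocrighttype::regtype as righttype
from pg_amproc p
join pg_opfamily opf on opf.oid = p.amprocfamily
join pg_am am on am.oid = opf.opfmethod
where am.amname = 'btree'
  and p.amprocnum = 1
  and p.amproclefttype <> p.amprocrighttype
  and not exists (
    select 1
    from pg_amproc p2
    where p2.amprocfamily = p.amprocfamily
      and p2.amprocnum = 1
      and p2.amproclefttype = p.amprocrighttype
      and p2.amprocrighttype = p.amprocrighttype
  );
-- an empty result means every btree opfamily that has cross-type
-- comparison support for some right-hand type also has the same-type
-- support function for that type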
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Tomas Vondra
Date:
On 11/21/23 03:52, Peter Geoghegan wrote:
> On Sat, Nov 11, 2023 at 1:08 PM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
>> Thanks. Here's my review of the btree-related code:
>
> Attached is v7.

I haven't looked at the code, but I decided to do a bit of blackbox perf and stress testing, to get some feeling of what to expect in terms of performance improvements, and see if there happen to be some unexpected regressions. Attached are a couple of simple bash scripts doing a brute-force test with tables of different sizes / data distributions, different numbers of values in the SAOP expression, etc., plus a PDF comparing the results between master and a build with the patch applied. The first group of columns is master, then patched, and then the (patched/master) comparison, with green=faster, red=slower. The columns are for different numbers of values in the SAOP condition.

Overall, the results look pretty good, with consistent speedups of up to ~30% for large numbers of values (SAOP with 1000 elements). There are a couple of blips where the performance regresses, also by up to ~30%. It's too regular to be a random variation (it affects different combinations of parameters and table sizes), but it seems to only affect one of the two machines I used for testing. Interestingly enough, it's the newer one.

I'm not convinced this is a problem we have to solve. It's possible it only affects cases that are implausible in practice (the script forces a particular scan type, and maybe it would not be picked in practice). But maybe it's fixable ...

regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
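The real scripts are in the attachment; purely as a hedged illustration of the kind of probe being timed (the table name, sizes, and settings here are invented, not taken from the scripts):

set enable_seqscan = off;
set enable_bitmapscan = off;

create table t (a int, b int);
insert into t select i % 10000, i from generate_series(1, 1000000) i;
create index t_a_idx on t (a);
vacuum analyze t;

-- one timed probe: an index scan with a 1000-element SAOP built from
-- random values
select count(*)
from t
where a = any (array(select (random() * 10000)::int
                     from generate_series(1, 1000)));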
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Tue, Nov 28, 2023 at 4:29 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> I haven't looked at the code, but I decided to do a bit of blackbox perf
> and stress testing, to get some feeling of what to expect in terms of
> performance improvements, and see if there happen to be some unexpected
> regressions. Attached are a couple of simple bash scripts doing a
> brute-force test with tables of different sizes / data distributions,
> different numbers of values in the SAOP expression, etc.

My own stress-testing has focussed on the two obvious extremes for this patch, using variants of pgbench with SAOP SELECTs on pgbench_accounts:

1. The case where there is almost no chance of finding any two index tuples together on the same page, because the constants are completely random. This workload makes the patch's attempts at "coalescing together" index page reads pure overhead, with no possible benefit. Obviously that's a cost that needs to be kept under control.

2. The case where there are 255 tuples with distinct values that are clustered together (both in the key space and in physical index pages). Usually they'll span two index pages, but they might all fit together on one index page, allowing us to descend to it directly and read it only once.

With 32 clients, I typically see a regression of about 1.5% for the first case relative to master, measured in throughput/TPS. The second case typically sees throughput that's ~4.8x master (i.e. a ~380% increase). I consider both of these extremes to be fairly unrealistic. With fewer array constants, the speed-ups you'll see in sympathetic cases are still very significant, but nothing like 4.8x. They're more like the 30% numbers that you saw.

As you know, I'm not actually all that excited about cases like 2 -- it's not where users are likely to benefit from the patch. The truly interesting cases are those cases where we can completely avoid heap accesses in the first place (not just *repeat* accesses to the same index pages), due to the patch's ability to consistently use index quals rather than filter quals. It's not that hard to show cases where there are 100x+ fewer pages accessed -- often in cases that have very few array constants. It's just that these cases aren't that interesting from a performance validation point of view -- it's obvious that filter quals are terrible.

Another thing that the patch does particularly well on is cases where the array keys don't have any matches at all, but there is significant clustering (relatively common when multiple SAOPs are used as index quals, which becomes far more likely due to the planner changes). We don't just skip over parts of the index that aren't relevant -- we also skip over parts of the arrays that aren't relevant. Some of my adversarial test cases that take ~1 millisecond for the patch to execute will practically take forever on master (I had to have my test suite not run those tests against master). You just need lots of array keys. What master does on those adversarial cases with billions of distinct combinations of array keys (not that unlikely if there are 4 or 5 SAOPs with mere hundreds or thousands of array keys each) is so inefficient that we might as well call it infinitely slower than the patch.

This is interesting to me from a performance robustness/stability point of view. The slowest kind of SAOP index scan with the patch becomes a full index scan -- just as it would be if we were using any type of non-SAOP qual before now. The worst case is a lot easier to reason about.
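To make those two extremes a bit more concrete, hedged sketches of the two query shapes (these are simplified stand-ins, not the exact pgbench scripts that were used):

-- extreme 1: constants chosen at random across the whole key space, so
-- practically no two matches ever land on the same leaf page
select abalance from pgbench_accounts
where aid = any (array[18291, 902144, 37, 466012, 733890]);

-- extreme 2: a dense run of 255 adjacent keys, which typically fits on
-- one or two leaf pages that the patch can read just once
select abalance from pgbench_accounts
where aid = any (array(select generate_series(500000, 500254)));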
> I'm not convinced this is a problem we have to solve. It's possible it > only affects cases that are implausible in practice (the script forces a > particular scan type, and maybe it would not be picked in practice). But > maybe it's fixable ... I would expect the patch to do quite well (relative to what is actually possible) on cases like the two extremes that I've focussed on so far. It seems possible that it will do less well on cases that are somewhere in the middle (that also have lots of distinct values on each page). We effectively do a linear search on a page that we know has at least one more match (following a precheck that uses the high key). We hope that the next match (for the next array value) closely follows an initial match. But what if there are only 2 or 3 matches on each leaf page, that are spaced relatively far apart? You're going to have to grovel through the whole page. It's not obvious that that's a problem to be fixed -- we're still only descending the index once and still only locking the leaf page once, so we'll probably still win relative to master. And it's not that easy to imagine beating a linear search -- it's not like there is just one "next value" to search for in these cases. But it's something that deserves further consideration. -- Peter Geoghegan
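PS: If anybody wants to construct the kind of "somewhere in the middle" case described above, something along these lines ought to do it. The table and the spacing are invented purely for illustration -- the point is just that each leaf page ends up with only a couple of matches, spaced far apart within the page:

-- With very roughly 350-400 index tuples per 8KB leaf page, array constants
-- spaced ~150 apart leave only 2 or 3 matches on any given leaf page
CREATE TABLE middling AS SELECT g AS k FROM generate_series(1, 1000000) g;
CREATE INDEX ON middling (k);
VACUUM ANALYZE middling;
SELECT count(*) FROM middling
WHERE k = ANY (ARRAY(SELECT g FROM generate_series(1, 1000000, 150) g));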
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Tue, Nov 28, 2023 at 9:19 AM Peter Geoghegan <pg@bowt.ie> wrote: > > I'm not convinced this is a problem we have to solve. It's possible it > > only affects cases that are implausible in practice (the script forces a > > particular scan type, and maybe it would not be picked in practice). But > > maybe it's fixable ... > > I would expect the patch to do quite well (relative to what is > actually possible) on cases like the two extremes that I've focussed > on so far. It seems possible that it will do less well on cases that > are somewhere in the middle (that also have lots of distinct values on > each page). Actually, I think that it's more likely that the problems that you saw are related to low cardinality data, which seems like it might not be a great fit for the heuristics that the patch uses to decide whether to continue the ongoing primitive index scan, or start a new one instead. I'm referring to the heuristics I describe here: https://postgr.es/m/CAH2-WzmTHoCsOmSgLg=yyft9LoERtuCKXyG2GZn+28PzonFA_g@mail.gmail.com The patch itself discusses these heuristics in a large comment block after the point that _bt_checkkeys() calls _bt_advance_array_keys(). I hardly paid any attention to low cardinality data in my performance validation work -- it was almost always indexes that had few or no duplicates (just pgbench_accounts if we're talking pure stress-tests), just because those are more complicated, and so seemed more important. I'm not quite prepared to say that there is definitely a problem here, right this minute, but if there was then it wouldn't be terribly surprising (the problems are usually wherever it is that I didn't look for them). Attached is a sample of my debug instrumentation for one such query, based on running the test script that Tomas posted -- thanks for writing this script, Tomas (I'll use it as the basis for some of my own performance validation work going forward). I don't mind sharing the patch that outputs this stuff if anybody is interested (it's kind of a monstrosity, so I'm disinclined to post it with the patch until I have a reason). Even without this instrumentation, you can get some idea of the kinds of issues I'm talking about just by viewing EXPLAIN ANALYZE output for a bitmap index scan -- that breaks out the index page accesses separately, which is a number that we should expect to remain lower than what the master branch shows in approximately all cases. While I still think that we need heuristics that apply speculative criteria to decide whether or not going to the next page directly (when we have a choice at all), that doesn't mean that the v7 heuristics can't be improved on, with a little more thought. It's a bit tricky, since we're probably also benefiting from the same heuristics all the time -- probably even for this same test case. We do lose against the master branch, on balance, and by enough to concern me, though. (I don't want to promise that it'll never happen at all, but it should be very limited, which this wasn't.) I didn't bother to ascertain how much longer it takes to execute the query, since that question seems rather beside the point. The important thing to me is whether or not this behavior actually makes sense, all things considered, and what exactly can be done about it if it doesn't make sense. I will need to think about this some more. This is just a status update. Thanks -- Peter Geoghegan
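PS: For anybody following along at home, something as simple as this is enough to see the index page accesses I'm talking about. The table here is the hypothetical one from the earlier "middle case" sketch (any indexed table will do), and forcing the bitmap plan is purely for illustration:

-- EXPLAIN (ANALYZE, BUFFERS) reports buffer hits/reads per plan node; with a
-- Bitmap Index Scan the index page accesses show up on their own node
SET enable_indexscan = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM middling WHERE k = ANY ('{1000, 250000, 500000, 750000}');
RESET enable_indexscan;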
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Tue, Nov 28, 2023 at 3:52 PM Peter Geoghegan <pg@bowt.ie> wrote: > While I still think that we need heuristics that apply speculative > criteria to decide whether or not going to the next page directly > (when we have a choice at all), that doesn't mean that the v7 > heuristics can't be improved on, with a little more thought. It's a > bit tricky, since we're probably also benefiting from the same > heuristics all the time -- probably even for this same test case. Correction: this particular test case happens to be one where the optimal strategy is to do *exactly* what the master branch does currently. The master branch is unbeatable, so the only reasonable goal for the patch is to not lose (or to lose by no more than a completely negligible amount). I'm now prepared to say that this behavior is not okay -- I definitely need to fix this. It's a bug. Because each distinct value never fits on one leaf page (it's more like 1.5 - 2 pages, even though we're deduplicating heavily), and because Postgres 12 optimizations are so effective with low cardinality/posting-list-heavy indexes such as this, we're bound to lose quite often. The only reason it doesn't happen _every single time_ we descend the index is because the test script uses CREATE INDEX, rather than using retail inserts (I tend to prefer the latter for this sort of analysis). Since nbtsort.c isn't as clever/aggressive about suffix truncation as the nbtsplitloc.c split strategies would have been, had we used them (had there been retail inserts), many individual leaf pages are left with high keys that aren't particularly good targets for the high key precheck optimization (see Postgres 12 commit 29b64d1d). If I wanted to produce a truly adversarial case for this issue (which this is already close to), I'd go with the following: 1. Retail inserts that leave each leaf page full of one single value, which will allow each high key to still make a "clean break" from the right sibling page -- it'll have the right sibling's value. Maybe insert 1200 - 1300 tuples per distinct index value for this. In other words, bulk loading that results in an index that never has to append a heap TID tiebreaker during suffix truncation, but comes very close to needing to. Bulk loading where nbtsplitloc.c needs to use SPLIT_MANY_DUPLICATES all the time, but never quite gets to the point of needing a SPLIT_SINGLE_VALUE split. 2. A SAOP query with an array with every second value in the index as an element. Something like "WHERE arr in (2, 4, 6, 8, ...)". The patch will read every single leaf page, whereas master will *reliably* only read every second leaf page. I didn't need to "trick" the patch in a contrived sort of way to get this bad outcome -- this scenario is fairly realistic. So this behavior is definitely not something that I'm prepared to defend. As I said, it's a bug. It'll be fixed in the next revision. -- Peter Geoghegan
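PS: Spelled out, the adversarial case looks roughly like this. The exact numbers are invented (anything in the 1200 - 1300 tuples-per-value range described above should behave similarly), and because the index already exists, the INSERT goes through ordinary retail index insertions rather than CREATE INDEX's bulk build. In reality the physical insert order matters, so treat this as a sketch rather than an exact reproducer:

-- Retail inserts, ~1250 tuples per distinct value, so that each value comes
-- close to filling a leaf page without ever forcing a SPLIT_SINGLE_VALUE split
CREATE TABLE adversarial (arr int4);
CREATE INDEX ON adversarial (arr);
INSERT INTO adversarial
SELECT val FROM generate_series(1, 1000) val, generate_series(1, 1250) dup;

-- An array with every second value in the index: master reliably reads only
-- every other leaf page, whereas the v7 patch reads every single leaf page
SELECT count(*) FROM adversarial
WHERE arr = ANY (ARRAY(SELECT generate_series(2, 1000, 2)));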
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Mon, Nov 27, 2023 at 5:39 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > - +1 on the general idea. Hard to see any downsides if implemented right. Glad you think so. The "no possible downside" perspective is one that the planner sort of relies on, so this isn't just a nice-to-have -- it might actually be a condition of committing the patch. It's important that the planner can be very aggressive about using SAOP index quals, without us suffering any real downside at execution time. > - This changes the meaning of amsearcharray==true to mean that the > ordering is preserved with ScalarArrayOps, right? You change B-tree to > make that true, but what about any out-of-tree index AM extensions? I > don't know if any such extensions exist, and I don't think we should > jump through any hoops to preserve backwards compatibility here, but > probably deserves a notice in the release notes if nothing else. My interpretation is that the planner changes affect amcanorder + amsearcharray index AMs, but have no impact on mere amsearcharray index AMs. If anything this is a step *away* from knowing about nbtree implementation details in the planner (though the planner's definition of amcanorder is very close to the behavior from nbtree, down to things like knowing all about nbtree strategy numbers). The planner changes from the patch are all subtractive -- I'm removing kludges that were added by bug fix commits. Things that weren't in the original feature commit at all. I used the term "my interpretation" here because it seems hard to think of this in abstract terms, and to write a compatibility note for this imaginary audience. I'm happy to go along with whatever you want, though. Perhaps you can suggest a wording for this? > - You use the term "primitive index scan" a lot, but it's not clear to > me what it means. Does one ScalarArrayOps turn into one "primitive index > scan"? Or each element in the array turns into a separate primitive > index scan? Or something in between? Maybe add a new section to the > README explain how that works. The term primitive index scan refers to the thing that happens each time _bt_first() is called -- with and without the patch. In other words, it's what happens when pg_stat_all_indexes.idx_scan is incremented. You could argue that that's not quite the right thing to be focussing on, with this new design. But it has precedent going for it. As I said, it's the thing that pg_stat_all_indexes.idx_scan counts, which is a pretty exact and tangible thing. So it's consistent with historical practice, but also with what other index AMs do when executing ScalarArrayOps non-natively. > - _bt_preprocess_array_keys() is called for each btrescan(). It performs > a lot of work like cmp function lookups and desconstructing and merging > the arrays, even if none of the SAOP keys change in the rescan. That > could make queries with nested loop joins like this slower than before: > "select * from generate_series(1, 50) g, tenk1 WHERE g = tenk1.unique1 > and tenk1.two IN (1,2);". But that's nothing new. _bt_preprocess_array_keys() isn't a new function, and the way that it's called isn't new in any way. That said, I certainly agree that we should be worried about any added overhead in _bt_first for nested loop joins with an inner index scan. In my experience that can be an important issue. I actually have a TODO item about this already. It needs to be included in my work on performance validation, on general principle. > - nbtutils.c is pretty large now. 
Perhaps create a new file > nbtpreprocesskeys.c or something? Let me get back to you on this. > - You and Matthias talked about an implicit state machine. I wonder if > this could be refactored to have a more explicit state machine. The > state transitions and interactions between _bt_checkkeys(), > _bt_advance_array_keys() and friends feel complicated. I agree that it's complicated. That's the main problem that the patch has, by far. It used to be even more complicated, but it's hard to see a way to make it a lot simpler at this point. If you can think of a way to simplify it then I'll definitely give it a go. Can you elaborate on "more explicit state machine"? That seems like it could have value by adding more invariants, and making things a bit more explicit in one or two areas. It could also help us to verify that they hold from assertions. But that isn't really the same thing as simplification. I wouldn't use that word, at least. > > + <note> > > + <para> > > + Every time an index is searched, the index's > > + <structname>pg_stat_all_indexes</structname>.<structfield>idx_scan</structfield> > > + field is incremented. This usually happens once per index scan node > > + execution, but might take place several times during execution of a scan > > + that searches for multiple values together. Only queries that use certain > > + <acronym>SQL</acronym> constructs to search for rows matching any value > > + out of a list (or an array) of multiple scalar values are affected. See > > + <xref linkend="functions-comparisons"/> for details. > > + </para> > > + </note> > > + > > Is this true even without this patch? Maybe commit this separately. Yes, it is. The patch doesn't actually change anything in this area. However, something in this area is new: it's a bit weird (but still perfectly consistent and logical) that the count shown by pg_stat_all_indexes.idx_scan will now show a value that is often influenced by low-level implementation details now. Things that are fairly far removed from the SQL query now affect that count -- that part is new. That's what I had in mind when I wrote this, FWIW. > The "Only queries ..." sentence feels difficult. Maybe something like > "For example, queries using IN (...) or = ANY(...) constructs.". I'll get back to you on this. > > * _bt_preprocess_keys treats each primitive scan as an independent piece of > > * work. That structure pushes the responsibility for preprocessing that must > > * work "across array keys" onto us. This division of labor makes sense once > > * you consider that we're typically called no more than once per btrescan, > > * whereas _bt_preprocess_keys is always called once per primitive index scan. > > "That structure ..." is a garden-path sentence. I kept parsing "that > must work" as one unit, the same way as "that structure", and it didn't > make sense. Took me many re-reads to parse it correctly. Now that I get > it, it doesn't bother me anymore, but maybe it could be rephrased. I'll do that ahead of the next revision. > Is there _any_ situation where _bt_preprocess_array_keys() is called > more than once per btrescan? No. Note that we don't know the scan direction within _bt_preprocess_array_keys(). We need a separate function to set up the array keys to their initial positions (this is nothing new). > > /* > > * Look up the appropriate comparison operator in the opfamily. 
> > * > > * Note: it's possible that this would fail, if the opfamily is > > * incomplete, but it seems quite unlikely that an opfamily would omit > > * non-cross-type comparison operators for any datatype that it supports > > * at all. ... > > */ > > I agree that's unlikely. I cannot come up with an example where you > would have cross-type operators between A and B, but no same-type > operators between B and B. For any real-world opfamily, that would be an > omission you'd probably want to fix. > > Still I wonder if we could easily fall back if it doesn't exist? And > maybe add a check in the 'opr_sanity' test for that. I'll see about an opr_sanity test. > In _bt_readpage(): > > if (!so->firstPage && !numArrayKeys && minoff < maxoff) > > { > > It's sad to disable this optimization completely for array keys. It's > actually a regression from current master, isn't it? There's no > fundamental reason we couldn't do it for array keys so I think we should > do it. I'd say whether or not there's any kind of regression in this area is quite ambiguous, though in a way that isn't really worth discussing now. If it makes sense to extend something like this optimization to array keys (or to add a roughly equivalent optimization), then we should do it. Otherwise we shouldn't. Note that the patch actually disables two distinct and independent optimizations when the scan has array keys. Both of these were added by recent commit e0b1ee17, but they are still independent things. They are: 1. This skipping thing inside _bt_readpage, which you've highlighted. This is only applied on the second or subsequent leaf page read by the scan. Right now, in the case of a scan with array keys, that means the second or subsequent page from the current primitive index scan -- which doesn't seem particularly principled to me. I'd need to invent a heuristic that works with my design to adapt the optimization. Plus I'd need to be able to invalidate the precheck whenever the array keys advanced. And I'd probably need a way of guessing whether or not it's likely that the arrays will advance, ahead of time, so that the precheck doesn't almost always go to waste, in a way that just doesn't make sense. Note that all required scan keys are relevant here. I like to think of plain required equality strategy scan keys without any array as "degenerate single value arrays". Something similar can be said of inequality strategy required scan keys (those required in the *current* scan direction), too. So it's not as if I can "just do the precheck stuff for the non-array scan keys". All required scan keys are virtually the same thing as required array-type scan keys -- they can trigger "roll over", affecting array key advancement for the scan keys that are associated with arrays. 2. The optimization that has _bt_checkkeys skip non-required scan keys that are *only* required in the direction *opposite* the current scan direction -- this can work even without any precheck from _bt_readpage. Note that this second optimization relies on various behaviors in _bt_first() that make it impossible for _bt_checkkeys() to ever see a tuple that could fail to satisfy such a scan key -- we must always have passed over non-matching tuples, thanks to _bt_first(). That presents my patch with a problem: the logic for determining whether or not we need a new primitive index scan only promises to never require the scan to grovel through many leaf pages that _bt_first() could and should just skip over instead.
This new logic makes no promises about skipping over small numbers of tuples. So it's possible that _bt_checkkeys() will see a handful of tuples "after the end of the _bt_first-wise primitive index scan", but "before the _bt_first-wise start of the next would-be primitive index scan". Note that this stuff matters even without bringing optimization 2 into it. There are similar considerations for required equality strategy scan keys, which (by definition) must be required in both scan directions. The new mechanism must never act as if it's past the end of matches in the current scan direction, when in fact it's really before the beginning of matches (that would lead to totally ignoring a group of equal matches). The existing _bt_checkkeys() logic can't really tell the difference on its own, since it only has an = operator to work with (well, I guess it knows about this context, since there is a comment about the dependency on _bt_first behaviors in _bt_checkkeys on HEAD -- very old comments). > _bt_checkkeys() is called in an assertion in _bt_readpage, but it has > the side-effect of advancing the array keys. Side-effects from an > assertion seems problematic. I agree that that's a concern, but just to be clear: there are no side-effects presently. You can't mix the array stuff with the optimization stuff. We won't actually call _bt_checkkeys() in an assertion when it can cause side-effects. Assuming that we ultimately conclude that the optimizations *aren't* worth preserving in any form, it might still be worth making it obvious that the assertions have no side-effects. But that question is unsettled right now. Thanks for the review! I'll try to get the next revision out soon. It'll also have bug fixes for mark + restore and for a similar issue seen when the scan changes direction in just the wrong way. (In short, the array key state machine can be confused about scan direction in certain corner cases.) -- Peter Geoghegan
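PS: The pg_stat_all_indexes.idx_scan accounting that came up in the docs discussion is easy enough to observe directly. A rough sketch, reusing the made-up adversarial table/index names from my earlier sketch (note that the statistics can lag slightly behind query execution):

-- idx_scan is incremented once per primitive index scan (once per _bt_first
-- descent), so a single SAOP query can add more than 1 to the counter
SELECT idx_scan FROM pg_stat_user_indexes
WHERE indexrelname = 'adversarial_arr_idx';
SELECT count(*) FROM adversarial WHERE arr = ANY ('{1, 2, 3}');
SELECT idx_scan FROM pg_stat_user_indexes
WHERE indexrelname = 'adversarial_arr_idx';
-- On master the counter typically advances by 3 here (one primitive scan per
-- array value); with the patch it can advance by just 1, since the matching
-- tuples for adjacent values are physically close together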
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Mon, Dec 4, 2023 at 7:25 PM Peter Geoghegan <pg@bowt.ie> wrote: > 2. The optimization that has _bt_checkkeys skip non-required scan keys > that are *only* required in the direction *opposite* the current scan > direction -- this can work even without any precheck from > _bt_readpage. > > Note that this second optimization relies on various behaviors in > _bt_first() that make it impossible for _bt_checkkeys() to ever see a > tuple that could fail to satisfy such a scan key -- we must always > have passed over non-matching tuples, thanks to _bt_first(). That > prevents my patch with a problem: the logic for determining whether or > not we need a new primitive index scan only promises to never require > the scan to grovel through many leaf pages that _bt_first() could and > should just skip over instead. This new logic makes no promises about > skipping over small numbers of tuples. So it's possible that > _bt_checkkeys() will see a handful of tuples "after the end of the > _bt_first-wise primitive index scan", but "before the _bt_first-wise > start of the next would-be primitive index scan". BTW, I have my doubts about this actually being correct without the patch. The following comment block appears above _bt_preprocess_keys: * Note that one reason we need direction-sensitive required-key flags is * precisely that we may not be able to eliminate redundant keys. Suppose * we have "x > 4::int AND x > 10::bigint", and we are unable to determine * which key is more restrictive for lack of a suitable cross-type operator. * _bt_first will arbitrarily pick one of the keys to do the initial * positioning with. If it picks x > 4, then the x > 10 condition will fail * until we reach index entries > 10; but we can't stop the scan just because * x > 10 is failing. On the other hand, if we are scanning backwards, then * failure of either key is indeed enough to stop the scan. (In general, when * inequality keys are present, the initial-positioning code only promises to * position before the first possible match, not exactly at the first match, * for a forward scan; or after the last match for a backward scan.) As I understand it, this might still be okay, because the optimization in question from Alexander's commit e0b1ee17 (what I've called optimization 2) is careful about NULLs, which were the one case that definitely had problems. Note that IS NOT NULL works kind of like WHERE foo < NULL here (see old bug fix commit 882368e8, "Fix btree stop-at-nulls logic properly", for more context on this NULLs behavior). In any case, my patch isn't compatible with "optimization 2" (as in my tests break in a rather obvious way) due to a behavior that these old comments claim is normal within any scan (or perhaps normal in any scan with scan keys that couldn't be deemed redundant due to a lack of cross-type support in the opfamily). Something has to be wrong here -- could just be the comment, I suppose. But I find it easy to believe that Alexander's commit e0b1ee17 might not have been properly tested with opfamilies that lack a suitable cross-type operator. That's a pretty niche thing. (My patch doesn't need that niche thing to be present to easily break when combined with "optimization 2", which could hint at an existing and far more subtle problem.) -- Peter Geoghegan
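PS: For the record, the scenario that comment describes looks like this when written as a query (t and x are hypothetical). Note that you won't actually hit it with the core integer_ops opfamily, which has complete cross-type support, so preprocessing can prove one of the keys redundant; you'd need an extension opfamily that's missing the relevant cross-type comparison:

-- Neither key can be proven redundant without a suitable cross-type
-- comparison, so _bt_first picks one arbitrarily and the other may keep
-- "failing" until the scan reaches entries > 10
SELECT * FROM t WHERE x > 4::int4 AND x > 10::int8;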
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Mon, Nov 20, 2023 at 6:52 PM Peter Geoghegan <pg@bowt.ie> wrote: > It should be noted that the patch isn't strictly guaranteed to always > read fewer index pages than master, for a given query plan and index. > This is by design. Though the patch comes close, it's not quite a > certainty. There are known cases where the patch reads the occasional > extra page (relative to what master would have done under the same > circumstances). These are cases where the implementation just cannot > know for sure whether the next/sibling leaf page has key space covered > by any of the scan's array keys (at least not in a way that seems > practical). The implementation has simple heuristics that infer (a > polite word for "make an educated guess") about what will be found on > the next page. Theoretically we could be more conservative in how we > go about this, but that seems like a bad idea to me. It's really easy > to find cases where the maximally conservative approach loses by a > lot, and really hard to show cases where it wins at all. Attached is v8, which pretty much rips all of this stuff out. I definitely had a point when I said that it made sense to be optimistic about finding matches on the next page in respect of any truncated -inf attributes in high keys, though. And so we still do that much in v8. But, there is no reason why we need to go any further than that -- there is no reason why we should *also* be optimistic about *untruncated* high key/finaltup attributes that *aren't* exact matches for any of the scan's required array keys finding exact matches once we move onto the next sibling page. I reached this conclusion when working on a fix for the low cardinality index regression that Tomas' tests brought to my attention [1]. I started out with the intention of just fixing that one case, in a very targeted way, but quickly realized that it made almost no sense to just limit myself to the low cardinality cases. Even with Tomas' problematic low cardinality test cases, I saw untruncated high key attributes that were "close by to matching tuples" -- they just weren't close enough (i.e. exactly on the next leaf page) for us to actually win (so v7 lost). Being almost correct again and again, but still losing again and again is a good sign that certain basic assumptions were faulty (at least if it's realistic, which it was in this instance). To be clear, and to repeat, even in v8 we'll still "make guesses" about -inf truncated attributes. But it's a much more limited form of guessing. If we didn't at least do the -inf thing, then backwards scans would weirdly work better than forward scans in some cases -- a particular concern with queries that have index quals for each of multiple columns. I don't think that this remaining "speculative" behavior needs to be discussed at very much length in code comments, though. That's why v8 is a great deal simpler than v7 was here. No more huge comment block at the end of the new _bt_checkkeys. Notable stuff that *hasn't* changed from v7: I'm posting this v8 having not yet worked through all of Heikki's feedback. In particular, v8 doesn't deal with the relatively hard question of what to do about the optimizations added by Alexander Korotkov's commit e0b1ee17 (should I keep them disabled, selectively re-enable one or both optimizations, or something else?). This is partly due to at least one of the optimizations having problems of their own that are still outstanding [2]. 
I also wanted to get a revision out before travelling to Prague for PGConf.EU, which will be followed by other Christmas travel. That's likely to keep me away from the patch for weeks (that, plus I'm moving to NYC in early January). So I just ran out of time to do absolutely everything. Other notable changes: * Bug fixes for cases where the array state machine gets confused by a change in the scan direction, plus similar cases involving mark and restore processing. I'm not entirely happy with my approach here (mostly referring to the new code in _bt_steppage here). Feels like it needs to be a little better integrated with mark/restore processing. No doubt Heikki will have his own doubts about this. I've included my test cases for the issues in this area. The problems are really hard to reproduce, and writing these tests took a surprisingly large amount of effort. The tests might not be suitable for commit, but you really need to see the test cases to be able to review the code efficiently. It's just fiddly. * I've managed to make the array state machine just a little more streamlined compared to v7. Minor code polishing, not really worth describing in detail. * Addressed concerns about incomplete opfamilies not working with the patch by updating the error message within _bt_preprocess_array_keys/_bt_sort_array_cmp_setup. It now exactly matches the similar one in _bt_first. I don't think that we need any new opr_sanity tests, since we already have one for this. Making the error message match the one in _bt_first ensures that anybody that runs into a problem here will see the same error message that they'd have seen on earlier versions, anyway. It's a more useful error compared to the one from v7 (in that it names the index and its attribute directly). Plus it's good to be consistent. I don't see any potential for the underlying _bt_sort_array_cmp_setup behavior to be seen as a regression, in terms of our ability to cope with incomplete opfamilies (compared to earlier Postgres versions). Opfamilies that lack a cross-type ORDER proc mixed with queries that use the corresponding cross-type = operator were always very dicey. That situation isn't meaningfully different with the patch. (Actually, this isn't 100% true in the case of queries + indexes with non-required arrays specifically -- they'll need a 3-way comparator in _bt_preprocess_array_keys/_bt_sort_array_cmp_setup, and yet *won't* need one moments later, in _bt_first. This is because non-required scan keys don't end up in _bt_first's insertion scan key at all, in general. This distinction just seems pedantic, though. We're talking about a case where things accidentally failed to fail in previous versions, for some queries but not most queries. Now it'll fail in exactly the same way in slightly more cases. In reality, affected opfamilies are practically non-existent, so this is a hypothetical upon a hypothetical.) [1] https://postgr.es/m/CAH2-WzmtV7XEWxf_rP1pw=vyDjGLi__zGOy6Me5MovR3e1kfdg@mail.gmail.com [2] https://postgr.es/m/CAH2-Wzn0LeLcb1PdBnK0xisz8NpHkxRrMr3NWJ+KOK-WZ+QtTQ@mail.gmail.com -- Peter Geoghegan
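PS: To give a flavor of the class of cases that the direction-change fixes and new tests target (an illustrative sketch only, with made-up names -- the actual regression tests have to be far more delicate to hit the exact window):

BEGIN;
DECLARE c SCROLL CURSOR FOR
  SELECT k FROM middling WHERE k = ANY ('{100, 200, 300}') ORDER BY k;
FETCH FORWARD 2 FROM c;    -- advances the array key state machine forwards
FETCH BACKWARD 1 FROM c;   -- scan direction flips; the array keys must back up
FETCH FORWARD 2 FROM c;    -- and then advance again, without skipping matches
COMMIT;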
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Sat, Dec 9, 2023 at 10:38 AM Peter Geoghegan <pg@bowt.ie> wrote: > Attached is v8, which pretty much rips all of this stuff out. Attached is v9, which I'm posting just to fix bitrot. The patch stopped cleanly applying against HEAD due to recent bugfix commit 7e6fb5da. No real changes here compared to v8. -- Peter Geoghegan
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Matthias van de Meent
Date:
On Thu, 28 Dec 2023 at 18:28, Peter Geoghegan <pg@bowt.ie> wrote: > > On Sat, Dec 9, 2023 at 10:38 AM Peter Geoghegan <pg@bowt.ie> wrote: > > Attached is v8, which pretty much rips all of this stuff out. > > Attached is v9, which I'm posting just to fix bitrot. The patch > stopped cleanly applying against HEAD due to recent bugfix commit > 7e6fb5da. No real changes here compared to v8. I found 2 major issues; one correctness issue in the arraykey processing code of _bt_preprocess_array_keys, and one assertion error in _bt_advance_array_keys; both discussed in the relevant sections below. > Subject: [PATCH v9] Enhance nbtree ScalarArrayOp execution. > [...] > Bugfix commit 807a40c5 taught the planner to avoid generating unsafe > path keys: path keys on a multicolumn index path, with a SAOP clause on > any attribute beyond the first/most significant attribute. These cases > are now all safe, so we go back to generating path keys without regard > for the presence of SAOP clauses (just like with any other clause type). > Also undo changes from follow-up bugfix commit a4523c5a, which taught > the planner to produce alternative index paths without any low-order > ScalarArrayOpExpr quals (making the SAOP quals into filter quals). > We'll no longer generate these alternative paths, which can no longer > offer any advantage over the index qual paths that we do still generate. Can you pull these planner changes into their own commit(s)? As mentioned upthread, it's a significant change in behavior that should have separate consideration and reference in the commit log. I really don't think it should be buried in the 5th paragraph of an "Enhance nbtree ScalarArrayOp execution" commit. Additionally, the changes of btree are arguably independent of the planner changes, as the btree changes improve performance even if we ignore that it implements strict result ordering. An independent thought when reviewing the finaltup / HIKEY scan optimization parts of this patch: The 'use highkey to check for next page's data range' optimization is useful, but can currently only be used for scans that go to the right. Could we use a similar optimization in the inverse direction if we marked the right page of a split with "the left page has a HIKEY based on this page's (un)truncated leftmost value" or "left page's HIKEY was truncated to N key attributes"? It'd give one bit of information, specifically that it does (or does not) share some (or all) key attributes with this page's minimum value, which allows for earlier scan termination in boundary cases. > +++ b/src/include/access/nbtree.h > +#define SK_BT_RDDNARRAY 0x00040000 /* redundant in array preprocessing */ I think "made redundant" is more appropriate than just "redundant"; the array key is not itself redundant in array preprocessing (indeed we actually work hard on that key during preprocessing to allow us to mark it redundant) > +++ b/src/backend/access/nbtree/nbtsearch.c > * We skip this for the first page in the scan to evade the possible > - * slowdown of the point queries. > + * slowdown of point queries. Never apply the optimization with a scan > + * that uses array keys, either, since that breaks certain assumptions. > + * (Our search-type scan keys change whenever _bt_checkkeys advances the > + * arrays, invalidating any precheck. Tracking all that would be tricky.) I think this was mentioned before, but can't we have an argument or version of _bt_checkkeys that makes it not advance the array keys, so that we can keep this optimization? 
Or, isn't that now _bt_check_compare? For an orderlines table with 1000s of lines per order, we would greatly benefit from a query `order_id = ANY ('{1, 3, 5}')` that handles scan keys more efficiently; the optimization which is being disabled here. > +++ b/src/backend/access/nbtree/nbtutils.c > _bt_preprocess_array_keys(IndexScanDesc scan) The _bt_preprocess_array_keys code is now broken due to type confusion, as it assumes there is only one array subtype being compared per attribute in so.orderProcs. Reproducer: CREATE TABLE test AS SELECT generate_series(1, 10000, 1::bigint) num; CREATE INDEX ON test (num); /* bigint typed */ SELECT num FROM test WHERE num = ANY ('{1}'::smallint[]) AND num = ANY ('{1}'::int[]) /* ensure matching lastEqualityArrayAtt, lastOrderProc for next qual AND num = ANY ('{65537}'::int[]); /* qual is broken due to application of smallint compare operator on int values that do equal mod 2^16, but do not equal in their own type */ num ----- 1 I'm also concerned about the implications of this in _bt_binsrch_array_skey, as this too assumes the same compare operator is always used for all array operations on each attribute. We may need one so->orderProcs entry for each array key, but at least one per sk_subtype of each array key. I also notice that the merging of values doesn't seem to be applied optimally with mixed typed array operations: num = int[] AND num = bigint[] AND num = int[] doesn't seem to merge the first and last array ops. I'm also concerned about being (un)able to merge > +/* > + * _bt_merge_arrays() -- merge together duplicate array keys > + * > + * Both scan keys have array elements that have already been sorted and > + * deduplicated. > + */ As I mentioned upthread, I find this function to be very wasteful, as it uses N binary searches to merge join two already sorted arrays, resulting in a O(n log(m)) complexity, whereas a normal merge join should be O(n + m) once the input datasets are sorted. Please fix this, as it shows up in profiling of large array merges. Additionally, as it merges two arrays of unique items into one, storing only matching entries, I feel that it is quite wasteful to do this additional allocation here. Why not reuse the original allocation immediately? > +_bt_tuple_before_array_skeys(IndexScanDesc scan, BTReadPageState *pstate, > + IndexTuple tuple, int sktrig, bool validtrig) I don't quite understand what the 'validtrig' argument is used for. There is an assertion that it is false under some conditions in this code, but it's not at all clear to me why that would have to be the case - it is called with `true` in one of the three call sites. Could the meaning of this be clarified? I also feel that this patch includes several optimizations such as this sktrig argument which aren't easy to understand. Could you pull that into a separately reviewable patch? Additionally, could you try to create a single point of entry for the array key stuff that covers the new systems? I've been trying to wrap my head around this, and it's taking a lot of effort. > _bt_advance_array_keys Thinking about the implementation here: We require transitivity for btree opclasses, where A < B implies NOT A = B, etc. Does this also carry over into cross-type operators? E.g. a type 'truncatedint4' that compares only the highest 16 bits of an integer would be strictly sorted, and could compare 0::truncatedint4 = 0::int4 as true, as well as 0::truncatedint4 = 2::int4, while 0::int4 = 2::int4 is false. 
Would it be valid to add support methods for truncatedint4 to an int4 btree opclass, or is transitivity also required for all operations? i.e. all values that one operator class considers unique within an opfamily must be considered unique for all additional operators in the opfamily, or is that not required? If not, then that would pose problems for this patch, as the ordering of A = ANY ('{1, 2, 3}'::int4[]) AND A = ANY ('{0,65536}'::truncatedint4[]) could potentially skip results. I'm also no fan of the (tail) recursion. I would agree that this is unlikely to consume a lot of stack, but it does consume stackspace nonetheless, and I'd prefer if it was not done this way. I notice an assertion error here: > + Assert(cur->sk_strategy != BTEqualStrategyNumber); > + Assert(all_required_sk_satisfied); > + Assert(!foundRequiredOppositeDirOnly); > + > + foundRequiredOppositeDirOnly = true; This assertion can be hit with the following test case: CREATE TABLE test AS SELECT i a, i b, i c FROM generate_series(1, 1000) i; CREATE INDEX ON test(a, b, c); ANALYZE; SELECT count(*) FROM test WHERE a = ANY ('{1,2,3}') AND b > 1 AND c > 1 AND b = ANY ('{1,2,3}'); > +_bt_update_keys_with_arraykeys(IndexScanDesc scan) I keep getting confused by the mixing of integer increments and pointer increments. Could you explain why in this code you chose to increment a pointer for "ScanKey cur", while using arrray indexing for other fields? It feels very arbitrary to me, and that makes the code difficult to follow. > +++ b/src/test/regress/sql/btree_index.sql > +-- Add tests to give coverage of various subtle issues. > +-- > +-- XXX This may not be suitable for commit, due to taking up too many cycles. > +-- > +-- Here we don't remember the scan's array keys before processing a page, only > +-- after processing a page (which is implicit, it's just the scan's current > +-- keys). So when we move the scan backwards we think that the top-level scan > +-- should terminate, when in reality it should jump backwards to the leaf page > +-- that we last visited. I notice this adds a complex test case that outputs many rows. Can we do with less rows if we build the index after data insertion, and with a lower (non-default) fillfactor? Note: I did not yet do any in-depth review of the planner changes in indxpath.c/selfuncs.c. Kind regards, Matthias van de Meent Neon (https://neon.tech)
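PS: To make the mixed-subtype merging point concrete, reusing the test table from the type-confusion reproducer above -- ideally the first and last arrays, which are both int4[] quals on num, would be recognized as duplicates of one another and merged (down to '{3}'::int[]) during preprocessing:

SELECT num FROM test
WHERE num = ANY ('{1,2,3}'::int[])
  AND num = ANY ('{2,3,4}'::bigint[])
  AND num = ANY ('{3,4,5}'::int[]);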
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Mon, Jan 15, 2024 at 2:32 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > Can you pull these planner changes into their own commit(s)? > As mentioned upthread, it's a significant change in behavior that > should have separate consideration and reference in the commit log. I > really don't think it should be buried in the 5th paragraph of an > "Enhance nbtree ScalarArrayOp execution" commit. Additionally, the > changes of btree are arguably independent of the planner changes, as > the btree changes improve performance even if we ignore that it > implements strict result ordering. I'm not going to break out the planner changes, because they're *not* independent in any way. You could say the same thing about practically any work that changes the planner. They're "buried" in the 5th paragraph of the commit message. If an interested party can't even read that far to gain some understanding of a legitimately complicated piece of work such as this, I'm okay with that. > An independent thought when reviewing the finaltup / HIKEY scan > optimization parts of this patch: > The 'use highkey to check for next page's data range' optimization is > useful, but can currently only be used for scans that go to the right. > Could we use a similar optimization in the inverse direction if we > marked the right page of a split with "the left page has a HIKEY based > on this page's (un)truncated leftmost value" or "left page's HIKEY was > truncated to N key attributes"? It'd give one bit of information, > specifically that it does (or does not) share some (or all) key > attributes with this page's minimum value, which allows for earlier > scan termination in boundary cases. That would have to be maintained in all sorts of different places in nbtree. And could be broken at any time by somebody inserting a non-pivot tuple before every existing one on the leaf page. Doesn't seem worth it to me. If we were to do something like this then it would be discussed as independent work. It's akin to adding a low key, which could be used in several different places. As you say, it's totally independent. > > +++ b/src/include/access/nbtree.h > > +#define SK_BT_RDDNARRAY 0x00040000 /* redundant in array preprocessing */ > > I think "made redundant" is more appropriate than just "redundant"; > the array key is not itself redundant in array preprocessing (indeed > we actually work hard on that key during preprocessing to allow us to > mark it redundant) Meh. I did it that way to fit under 78 chars while staying on the same line. I don't think that it matters. > I think this was mentioned before, but can't we have an argument or > version of _bt_checkkeys that makes it not advance the array keys, so > that we can keep this optimization? Or, isn't that now > _bt_check_compare? As I said to Heikki, thinking about this some more is on my todo list. I mean the way that this worked had substantial revisions on HEAD right before I posted v9. v9 was only to fix the bit rot that that caused. > For an orderlines table with 1000s of lines per order, we would > greatly benefit from a query `order_id = ANY ('{1, 3, 5}')` that > handles scan keys more efficiently; the optimization which is being > disabled here. The way that these optimizations might work with the mechanism from the patch isn't some kind of natural extension to what's there already. We'll need new heuristics to not waste cycles. Applying all of the optimizations together just isn't trivial, and it's not yet clear what really makes sense. 
Combining the two optimizations more or less adds another dimension of complexity. > > +++ b/src/backend/access/nbtree/nbtutils.c > > _bt_preprocess_array_keys(IndexScanDesc scan) > The _bt_preprocess_array_keys code is now broken due to type > confusion, as it assumes there is only one array subtype being > compared per attribute in so.orderProcs. I've been aware of this for some time, but didn't think that it was worth bringing up before I had a solution to present... > I'm also concerned about the implications of this in > _bt_binsrch_array_skey, as this too assumes the same compare operator > is always used for all array operations on each attribute. We may need > one so->orderProcs entry for each array key, but at least one per > sk_subtype of each array key. ...since right now it's convenient to make so->orderProcs have a 1:1 correspondence with index key columns.... > I also notice that the merging of values doesn't seem to be applied > optimally with mixed typed array operations: num = int[] AND num = > bigint[] AND num = int[] doesn't seem to merge the first and last > array ops. I'm also concerned about being (un)able to merge ...which ought to work and be robust, once the cross-type support is in place. That is, it should work once we really can assume that there really must be exactly one so->orderProcs entry per equality-strategy scan key in all cases -- including in highly obscure corner cases involving a mix of cross-type comparisons that are also redundant. (Though as I go into below, I can see at least 2 viable solutions to the problem.) > > +/* > > + * _bt_merge_arrays() -- merge together duplicate array keys > > + * > > + * Both scan keys have array elements that have already been sorted and > > + * deduplicated. > > + */ > > As I mentioned upthread, I find this function to be very wasteful, as > it uses N binary searches to merge join two already sorted arrays, > resulting in a O(n log(m)) complexity, whereas a normal merge join > should be O(n + m) once the input datasets are sorted. And as I mentioned upthread, I think that you're making a mountain out of a molehill here. This is not a merge join. Even single digit thousand of array elements should be considered huge. Plus this code path is only hit with poorly written queries. > Please fix this, as it shows up in profiling of large array merges. > Additionally, as it merges two arrays of unique items into one, > storing only matching entries, I feel that it is quite wasteful to do > this additional allocation here. Why not reuse the original allocation > immediately? Because that's adding more complexity for a code path that will hardly ever be used in practice. For something that happens once, during a preprocessing step. > > +_bt_tuple_before_array_skeys(IndexScanDesc scan, BTReadPageState *pstate, > > + IndexTuple tuple, int sktrig, bool validtrig) > > I don't quite understand what the 'validtrig' argument is used for. > There is an assertion that it is false under some conditions in this > code, but it's not at all clear to me why that would have to be the > case - it is called with `true` in one of the three call sites. Could > the meaning of this be clarified? Sure, I'll add a comment. > I also feel that this patch includes several optimizations such as > this sktrig argument which aren't easy to understand. Could you pull > that into a separately reviewable patch? 
It probably makes sense to add the extra preprocessing stuff out into its own commit, since I tend to agree that that's an optimization that can be treated as unrelated (and isn't essential to the main thrust of the patch). However, the sktrig thing isn't really like that. We need to do things that way for required inequality scan keys. It doesn't make sense to not just do it for all required scan keys (both equality and inequality strategy scan keys) right from the start. > Additionally, could you try to create a single point of entry for the > array key stuff that covers the new systems? I've been trying to wrap > my head around this, and it's taking a lot of effort. I don't understand what you mean here. > > _bt_advance_array_keys > > Thinking about the implementation here: > We require transitivity for btree opclasses, where A < B implies NOT A > = B, etc. Does this also carry over into cross-type operators? Yes, it carries like that. > Would it be valid to add support methods for truncatedint4 to an int4 > btree opclass, or is transitivity also required for all operations? > i.e. all values that one operator class considers unique within an > opfamily must be considered unique for all additional operators in the > opfamily, or is that not required? > If not, then that would pose problems for this patch, as the ordering > of A = ANY ('{1, 2, 3}'::int4[]) AND A = ANY > ('{0,65536}'::truncatedint4[]) could potentially skip results. There are roughly two ways to deal with this sort of thing (which sounds like a restatement of the issue shown by your test case?). They are: 1. Add support for detecting redundant scan keys, even when cross-type operators are involved. (Discussed earlier.) 2. Be prepared to have more than one scan key per index key column in rare edge-cases. This means that I can no longer subscript so->orderProcs using an attnum. I intend to use whichever approach works out to be the simplest and most maintainable. Note that there is usually nothing that stops me from having redundant scan keys if it can't be avoided via preprocessing -- just like today. Actually...I might have to use a hybrid of 1 and 2. Because we need to be prepared for the possibility that it just isn't possible to determine redundancy at all due to a lack of suitable cross-type operators -- I don't think that it would be okay to just throw an error there (that really would be a regression against Postgres 16). For that reason I will most likely need to find a way for so->orderProcs to be subscripted using something that maps to equality scan keys (rather than having it map to a key column). > I'm also no fan of the (tail) recursion. I would agree that this is > unlikely to consume a lot of stack, but it does consume stackspace > nonetheless, and I'd prefer if it was not done this way. I disagree. The amount of stack space it consumes in the worst case is fixed. > I notice an assertion error here: > > + Assert(cur->sk_strategy != BTEqualStrategyNumber); > > + Assert(all_required_sk_satisfied); > > + Assert(!foundRequiredOppositeDirOnly); > > + > > + foundRequiredOppositeDirOnly = true; > > This assertion can be hit with the following test case: > > CREATE TABLE test AS > SELECT i a, i b, i c FROM generate_series(1, 1000) i; > CREATE INDEX ON test(a, b, c); ANALYZE; > SELECT count(*) FROM test > WHERE a = ANY ('{1,2,3}') AND b > 1 AND c > 1 > AND b = ANY ('{1,2,3}'); Will fix. 
> > +_bt_update_keys_with_arraykeys(IndexScanDesc scan) > > I keep getting confused by the mixing of integer increments and > pointer increments. Could you explain why in this code you chose to > increment a pointer for "ScanKey cur", while using arrray indexing for > other fields? It feels very arbitrary to me, and that makes the code > difficult to follow. Because in one case we follow the convention for scan keys, whereas so->orderProcs is accessed via an attribute number subscript. > > +++ b/src/test/regress/sql/btree_index.sql > > +-- Add tests to give coverage of various subtle issues. > > +-- > > +-- XXX This may not be suitable for commit, due to taking up too many cycles. > > +-- > > +-- Here we don't remember the scan's array keys before processing a page, only > > +-- after processing a page (which is implicit, it's just the scan's current > > +-- keys). So when we move the scan backwards we think that the top-level scan > > +-- should terminate, when in reality it should jump backwards to the leaf page > > +-- that we last visited. > > I notice this adds a complex test case that outputs many rows. Can we > do with less rows if we build the index after data insertion, and with > a lower (non-default) fillfactor? Probably not. It was actually very hard to come up with these test cases, which tickle the implementation in just the right way to demonstrate that the code in places like _bt_steppage() is actually required. It took me a rather long time to just prove that much. Not sure that we really need this. But thought I'd include it for the time being, just so that reviewers could understand those changes. -- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Thu, Dec 28, 2023 at 12:28 PM Peter Geoghegan <pg@bowt.ie> wrote: > Attached is v9, which I'm posting just to fix bitrot. The patch > stopped cleanly applying against HEAD due to recent bugfix commit > 7e6fb5da. No real changes here compared to v8. Attached is v10, which is another revision mostly just intended to fix bitrot, this time from changes to selfuncs.c on HEAD. v10 also fixes the assertion failure reported by Matthias in passing, just because it was easy to include. No changes for the other open items, though. -- Peter Geoghegan
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Matthias van de Meent
Date:
On Tue, 16 Jan 2024 at 03:03, Peter Geoghegan <pg@bowt.ie> wrote: > > On Mon, Jan 15, 2024 at 2:32 PM Matthias van de Meent > <boekewurm+postgres@gmail.com> wrote: > > Can you pull these planner changes into their own commit(s)? > > As mentioned upthread, it's a significant change in behavior that > > should have separate consideration and reference in the commit log. I > > really don't think it should be buried in the 5th paragraph of an > > "Enhance nbtree ScalarArrayOp execution" commit. Additionally, the > > changes of btree are arguably independent of the planner changes, as > > the btree changes improve performance even if we ignore that it > > implements strict result ordering. > > I'm not going to break out the planner changes, because they're *not* > independent in any way. The planner changes depend on the btree changes, that I agree with. However, I don't think that the btree changes depend on the planner changes. > You could say the same thing about practically > any work that changes the planner. They're "buried" in the 5th > paragraph of the commit message. If an interested party can't even > read that far to gain some understanding of a legitimately complicated > piece of work such as this, I'm okay with that. I would agree with you if this was about new APIs and features, but here existing APIs are being repurposed without changing them. A maintainer of index AMs would not look at the commit title 'Enhance nbtree ScalarArrayOp execution' and think "oh, now I have to make sure my existing amsearcharray+amcanorder handling actually supports non-prefix arrays keys and return data in index order". There are also no changes in amapi.h that would signal any index AM author that expectations have changed. I really don't think you can just ignore all that, and I believe this to also be the basis of Heikki's first comment. > As I said to Heikki, thinking about this some more is on my todo list. > I mean the way that this worked had substantial revisions on HEAD > right before I posted v9. v9 was only to fix the bit rot that that > caused. Then I'll be waiting for that TODO item to be resolved. > > > +++ b/src/backend/access/nbtree/nbtutils.c > > > +/* > > > + * _bt_merge_arrays() -- merge together duplicate array keys > > > + * > > > + * Both scan keys have array elements that have already been sorted and > > > + * deduplicated. > > > + */ > > > > As I mentioned upthread, I find this function to be very wasteful, as > > it uses N binary searches to merge join two already sorted arrays, > > resulting in a O(n log(m)) complexity, whereas a normal merge join > > should be O(n + m) once the input datasets are sorted. > > And as I mentioned upthread, I think that you're making a mountain out > of a molehill here. This is not a merge join. We're merging two sorted sets and want to retain only the matching entries, while retaining the order of the data. AFAIK the best algorithm available for this is a sort-merge join. Sure, it isn't a MergeJoin plan node, but that's not what I was trying to argue. > Even single digit > thousand of array elements should be considered huge. Plus this code > path is only hit with poorly written queries. Would you accept suggestions? > > Please fix this, as it shows up in profiling of large array merges. > > Additionally, as it merges two arrays of unique items into one, > > storing only matching entries, I feel that it is quite wasteful to do > > this additional allocation here. Why not reuse the original allocation > > immediately? 
> > Because that's adding more complexity for a code path that will hardly > ever be used in practice. For something that happens once, during a > preprocessing step. Or many times, when we're in a parameterized loop, as was also pointed out by Heikki. While I do think it is rare, the existence of this path that merges these arrays implies the need for merging these arrays, which thus > > > +_bt_tuple_before_array_skeys(IndexScanDesc scan, BTReadPageState *pstate, > > > + IndexTuple tuple, int sktrig, bool validtrig) > > > > I don't quite understand what the 'validtrig' argument is used for. > > There is an assertion that it is false under some conditions in this > > code, but it's not at all clear to me why that would have to be the > > case - it is called with `true` in one of the three call sites. Could > > the meaning of this be clarified? > > Sure, I'll add a comment. Thanks! > > I also feel that this patch includes several optimizations such as > > this sktrig argument which aren't easy to understand. Could you pull > > that into a separately reviewable patch? > > It probably makes sense to add the extra preprocessing stuff out into > its own commit, since I tend to agree that that's an optimization that > can be treated as unrelated (and isn't essential to the main thrust of > the patch). > > However, the sktrig thing isn't really like that. We need to do things > that way for required inequality scan keys. It doesn't make sense to > not just do it for all required scan keys (both equality and > inequality strategy scan keys) right from the start. I understand that in some places the "triggered by scankey" concept is required, but in other places the use of it is explicitly flagged as an optimization (specifically, in _bt_advance_array_keys), which confused me. > > Additionally, could you try to create a single point of entry for the > > array key stuff that covers the new systems? I've been trying to wrap > > my head around this, and it's taking a lot of effort. > > I don't understand what you mean here. The documentation (currently mostly code comments) is extensive, but still spread around various inline comments and comments on functions; with no place where a clear top-level design is detailed. I'll agree that we don't have that for the general systems in _bt_next() either, but especially with this single large patch it's difficult to grasp the specific differences between the various functions. > > > _bt_advance_array_keys > > > > Thinking about the implementation here: > > We require transitivity for btree opclasses, where A < B implies NOT A > > = B, etc. Does this also carry over into cross-type operators? > > Yes, it carries like that. > > > Would it be valid to add support methods for truncatedint4 to an int4 > > btree opclass, or is transitivity also required for all operations? > > i.e. all values that one operator class considers unique within an > > opfamily must be considered unique for all additional operators in the > > opfamily, or is that not required? > > If not, then that would pose problems for this patch, as the ordering > > of A = ANY ('{1, 2, 3}'::int4[]) AND A = ANY > > ('{0,65536}'::truncatedint4[]) could potentially skip results. > > There are roughly two ways to deal with this sort of thing (which > sounds like a restatement of the issue shown by your test case?). It wasn't really meant as a restatement: It is always unsafe to ignore upper bits, as the index isn't organized by that. 
However, it *could* be safe to ignore the bits with lowest significance, as the index would still be organized correctly even in that case, for int4 at least. Similar to how you can have required scankeys for the prefix of an (int2, int2) index, but not the suffix (as of right now at least). The issue I was trying to show is that if you have a type that ignores some part of the key for comparison like truncatedint4 (which hypothetically would do ((a>>16) < (b>>16)) on int4 types), then this might cause issues if that key was advanced before more precise equality checks. This won't ever be an issue when there is a requirement that if a = b and b = c then a = c must hold for all triples of typed values and operations in a btree opclass, as you mention above. > They > are: > > 1. Add support for detecting redundant scan keys, even when cross-type > operators are involved. (Discussed earlier.) > > 2. Be prepared to have more than one scan key per index key column in > rare edge-cases. This means that I can no longer subscript > so->orderProcs using an attnum. > Actually...I might have to use a hybrid of 1 and 2. Because we need to > be prepared for the possibility that it just isn't possible to > determine redundancy at all due to a lack of suitable cross-type > operators -- I don't think that it would be okay to just throw an > error there (that really would be a regression against Postgres 16). Agreed. > > I'm also no fan of the (tail) recursion. I would agree that this is > > unlikely to consume a lot of stack, but it does consume stackspace > > nonetheless, and I'd prefer if it was not done this way. > > I disagree. The amount of stack space it consumes in the worst case is fixed. Is it fixed? Without digging very deep into it, it looks like it can iterate on the order of n_scankeys deep? The comment above does hint on 2 iterations, but is not very clear about the conditions and why. > > I notice an assertion error here: > > > + Assert(cur->sk_strategy != BTEqualStrategyNumber); > > > + Assert(all_required_sk_satisfied); > > > + Assert(!foundRequiredOppositeDirOnly); > > > + > > > + foundRequiredOppositeDirOnly = true; > > > > This assertion can be hit with the following test case: > > > > CREATE TABLE test AS > > SELECT i a, i b, i c FROM generate_series(1, 1000) i; > > CREATE INDEX ON test(a, b, c); ANALYZE; > > SELECT count(*) FROM test > > WHERE a = ANY ('{1,2,3}') AND b > 1 AND c > 1 > > AND b = ANY ('{1,2,3}'); > > Will fix. > > > > +_bt_update_keys_with_arraykeys(IndexScanDesc scan) > > > > I keep getting confused by the mixing of integer increments and > > pointer increments. Could you explain why in this code you chose to > > increment a pointer for "ScanKey cur", while using arrray indexing for > > other fields? It feels very arbitrary to me, and that makes the code > > difficult to follow. > > Because in one case we follow the convention for scan keys, whereas > so->orderProcs is accessed via an attribute number subscript. Okay, but how about this in _bt_merge_arrays? + Datum *elem = elems_orig + i; I'm not familiar with the scan key convention, as most other places use reference+subscripting. > > > +++ b/src/test/regress/sql/btree_index.sql > > > +-- Add tests to give coverage of various subtle issues. > > > +-- > > > +-- XXX This may not be suitable for commit, due to taking up too many cycles. 
> > > +-- > > > +-- Here we don't remember the scan's array keys before processing a page, only > > > +-- after processing a page (which is implicit, it's just the scan's current > > > +-- keys). So when we move the scan backwards we think that the top-level scan > > > +-- should terminate, when in reality it should jump backwards to the leaf page > > > +-- that we last visited. > > > > I notice this adds a complex test case that outputs many rows. Can we > > do with less rows if we build the index after data insertion, and with > > a lower (non-default) fillfactor? > > Probably not. It was actually very hard to come up with these test > cases, which tickle the implementation in just the right way to > demonstrate that the code in places like _bt_steppage() is actually > required. It took me a rather long time to just prove that much. Not > sure that we really need this. But thought I'd include it for the time > being, just so that reviewers could understand those changes. Makes sense, thanks for the explanation. Kind regards, Matthias van de Meent Neon (https://neon.tech)
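To make the sort-merge point above concrete, here is a minimal sketch of the O(n + m) intersection being suggested, with plain C ints standing in for Datums and the opclass comparator. This is only an illustration, not the patch's _bt_merge_arrays():

/*
 * Sketch of an O(n + m) merge of two sorted, deduplicated array-key lists,
 * keeping only the matching entries and writing them back into the first
 * array.  Illustration only -- the real _bt_merge_arrays() works on Datum
 * arrays and opclass comparison procs, not plain ints.
 */
#include <stdio.h>

/* Intersect sorted, duplicate-free a[] and b[]; result is written into a[] */
static int
merge_intersect(int *a, int na, const int *b, int nb)
{
    int     i = 0,
            j = 0,
            nout = 0;

    while (i < na && j < nb)
    {
        if (a[i] < b[j])
            i++;                /* a[i] has no match in b[] */
        else if (a[i] > b[j])
            j++;                /* b[j] has no match in a[] */
        else
        {
            a[nout++] = a[i];   /* match: keep it, reusing the original array */
            i++;
            j++;
        }
    }
    return nout;                /* number of surviving elements */
}

int
main(void)
{
    int     a[] = {1, 2, 3, 5, 8, 13};
    int     b[] = {2, 3, 4, 8, 21};
    int     n = merge_intersect(a, 6, b, 5);

    for (int k = 0; k < n; k++)
        printf("%d ", a[k]);    /* prints: 2 3 8 */
    printf("\n");
    return 0;
}

Each element of either input is visited at most once, which is where the O(n + m) bound comes from; doing one binary search per element of the first array costs O(n log(m)) comparisons instead.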
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Thu, Jan 18, 2024 at 11:39 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > > I'm not going to break out the planner changes, because they're *not* > > independent in any way. > > The planner changes depend on the btree changes, that I agree with. > However, I don't think that the btree changes depend on the planner > changes. Yeah, technically it would be possible to break out the indxpath.c changes -- that wouldn't leave the tree in a state with visibly-wrong behavior at any point. But that's far from the only thing that matters. If that was the standard that we applied, then I might have to split the patch into 10 or more patches. What it boils down to is this: it is totally natural for me to think of the planner changes as integral to the nbtree changes, so I'm going to include them together in one commit. That's just how the code was written -- I always thought about it as one single thing. The majority of the queries that I've used to promote the patch directly rely on the planner changes in one way or another (even back in v1, when the planner side of things had lots of kludges). I don't necessarily always follow that standard. Sometimes it is useful to further break things up. Like when it happens to make the high-level division of labor a little bit clearer. The example that comes to mind is commit dd299df8 and commit fab25024. These nbtree commits were essentially one piece of work that was broken into two, but committed within minutes of each other, and left the tree in a momentary state that was not-very-useful (though still correct). That made sense to me at the time because the code in question was mechanically distinct, while at the same time modifying some of the same nbtree files. Drawing a line under it seemed to make sense. I will admit that there is some amount of subjective judgement (gasp!) here. Plus I'll acknowledge that it's *slightly* odd that the most compelling cases for the SAOP patch almost all involve savings in heap page accesses, even though it is fundamentally a B-Tree patch. But that's just how it came out. Slightly odd things happen all the time. > I would agree with you if this was about new APIs and features, but > here existing APIs are being repurposed without changing them. A > maintainer of index AMs would not look at the commit title 'Enhance > nbtree ScalarArrayOp execution' and think "oh, now I have to make sure > my existing amsearcharray+amcanorder handling actually supports > non-prefix arrays keys and return data in index order". This is getting ridiculous. It is quite likely that there are exactly zero affected out-of-core index AMs. I don't count pgroonga as a counterexample (I don't think that it actually fulfills the definition of one). Basically, "amcanorder" index AMs more or less promise to be compatible with nbtree, down to having the same strategy numbers. So the idea that I'm going to upset amsearcharray+amcanorder index AM authors is a completely theoretical problem. The planner code evolved with nbtree, hand-in-glove. I'm more than happy to add a reminder to the commit message about adding a pro forma listing to the compatibility section of the Postgres 17 release notes. Can we actually agree on which theoretical index AM types are affected first, though? Isn't it actually amsearcharray+amcanorder+amcanmulticol external index AMs only? Do I have that right? BTW, how should we phrase this compatibility note, so that it can be understood? It's rather awkward. 
> > As I said to Heikki, thinking about this some more is on my todo list. > > I mean the way that this worked had substantial revisions on HEAD > > right before I posted v9. v9 was only to fix the bit rot that that > > caused. > > Then I'll be waiting for that TODO item to be resolved. My TODO item is to resolve the question of whether and to what extent these two optimizations should be combined. It's possible that I'll decide that it isn't worth it, for whatever reason. At this point I'm still totally neutral. > > Even single digit > > thousand of array elements should be considered huge. Plus this code > > path is only hit with poorly written queries. > > Would you accept suggestions? You mean that you want to have a go at it yourself? Sure, go ahead. I cannot promise that I'll accept your suggested revisions, but if they aren't too invasive/complicated compared to what I have now, then I will just accept them. I maintain that this isn't really necessary, but it might be the path of least resistance at this point. > > > Please fix this, as it shows up in profiling of large array merges. > > > Additionally, as it merges two arrays of unique items into one, > > > storing only matching entries, I feel that it is quite wasteful to do > > > this additional allocation here. Why not reuse the original allocation > > > immediately? > > > > Because that's adding more complexity for a code path that will hardly > > ever be used in practice. For something that happens once, during a > > preprocessing step. > > Or many times, when we're in a parameterized loop, as was also pointed > out by Heikki. That's not really what Heikki said. Heikki had a general concern about the startup costs with nestloop joins. > > > Additionally, could you try to create a single point of entry for the > > > array key stuff that covers the new systems? I've been trying to wrap > > > my head around this, and it's taking a lot of effort. > > > > I don't understand what you mean here. > > The documentation (currently mostly code comments) is extensive, but > still spread around various inline comments and comments on functions; > with no place where a clear top-level design is detailed. > I'll agree that we don't have that for the general systems in > _bt_next() either, but especially with this single large patch it's > difficult to grasp the specific differences between the various > functions. Between what functions? The guts of this work are in the new _bt_advance_array_keys and _bt_tuple_before_array_skeys (honorable mention for _bt_array_keys_remain). It seems pretty well localized to me. Granted, there are a few places where we rely on things being set up a certain way by other code that's quite some distance away (from the code doing the depending). For example, the new additions to _bt_preprocess_keys that we need later on, in _bt_advance_array_keys/_bt_update_keys_with_arraykeys. These bits and pieces of code are required, but are not particularly crucial to understanding the general design. For the most part the design is that we cede control of array key advancement to _bt_checkkeys() -- nothing much else changes. There are certainly some tricky details behind the scenes, which we should be verifying via testing and especially via robust invariants (verified with assertions). But this is almost completely hidden from the nbtsearch.c caller -- there are no real changes required there. > It wasn't really meant as a restatement: It is always unsafe to ignore > upper bits, as the index isn't organized by that. 
However, it *could* > be safe to ignore the bits with lowest significance, as the index > would still be organized correctly even in that case, for int4 at > least. Similar to how you can have required scankeys for the prefix of > an (int2, int2) index, but not the suffix (as of right now at least). It isn't safe to assume that types appearing together within an opfamily are binary coercible (or capable of being type-punned like this) in the general case. That often works, but it isn't reliable. Even with integer_ops, it breaks on big-endian machines. > This won't ever be an issue when there is a requirement that if a = b > and b = c then a = c must hold for all triples of typed values and > operations in a btree opclass, as you mention above. Right. It also doesn't matter if there are redundant equality conditions that we cannot formally prove are redundant during preprocessing, for want of appropriate cross-type comparison routines from the opfamily. The actual behavior of index scans (in terms of when and how we skip) won't be affected by that at all. The only problem that it creates is that we'll waste CPU cycles, relative to the case where we can somehow detect this redundancy. In case it wasn't clear before: the optimizations I've added to _bt_preprocess_array_keys are intended to compensate for the pessimizations added to _bt_preprocess_keys (both the optimizations and the pessimizations only affect equality type array scan keys). I don't think that this needs to be perfect; it just seems like a good idea. > > > I'm also no fan of the (tail) recursion. I would agree that this is > > > unlikely to consume a lot of stack, but it does consume stackspace > > > nonetheless, and I'd prefer if it was not done this way. > > > > I disagree. The amount of stack space it consumes in the worst case is fixed. > > Is it fixed? Without digging very deep into it, it looks like it can > iterate on the order of n_scankeys deep? The comment above does hint > on 2 iterations, but is not very clear about the conditions and why. The recursive call to _bt_advance_array_keys is needed to deal with unsatisfied-and-required inequalities that were not detected by an initial call to _bt_check_compare, following a second retry call to _bt_check_compare. We know that this recursive call to _bt_advance_array_keys cannot recurse again because we've already identified which specific inequality scan key is still unsatisfied, following an initial round of array key advancement. We're pretty much instructing _bt_advance_array_keys to perform "beyond_end_advance" type advancement for the specific known unsatisfied inequality scan key (which must be required in the current scan direction, per the assertions for that) here. So clearly it cannot end up recursing any further -- the recursive call is conditioned on "all_arraylike_sk_satisfied && arrays_advanced", but that'll be unset when "ikey == sktrig && !array" from inside the loop. This is a narrow edge-case -- it is an awkward one. Recursion does seem natural here, because we're essentially repeating the call to _bt_advance_array_keys from inside _bt_advance_array_keys, rather than leaving it up to the usual external caller to do later on. If you ripped this code out it would barely be noticeable at all. But it seems worth it so that we can make a uniform promise to always advance the array keys to the maximum extent possible, based on what the caller's tuple tells us about the progress of the scan. 
Since all calls to _bt_advance_array_keys are guaranteed to advance the array keys to the maximum extent that's safely possible (barring this one edge-case with recursive calls), it almost follows that there can't be another recursive call. This is the one edge case where the implementation requires a second pass over the tuple -- we advanced the array keys to all-matching values, but nevertheless couldn't finish there due to the unsatisfied required inequality (which must have become unsatisfied _at the same time_ as some earlier required equality scan key for us to end up requiring a recursive call). > > Because in one case we follow the convention for scan keys, whereas > > so->orderProcs is accessed via an attribute number subscript. > > Okay, but how about this in _bt_merge_arrays? > > + Datum *elem = elems_orig + i; > > I'm not familiar with the scan key convention, as most other places > use reference+subscripting. I meant the convention used in code like _bt_check_compare (which is what we call _bt_checkkeys on HEAD, basically). Note that the _bt_merge_arrays code that you've highlighted isn't iterating through so->keyData[] -- it is iterating through the function caller's elements array, which actually come from so->arrayKeys[]. Like every other Postgres contributor, I do my best to follow the conventions established by existing code. Sometimes that leads to pretty awkward results, where CamelCase and underscore styles are closely mixed together, because it works out to be the most consistent way of doing it overall. -- Peter Geoghegan
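For what it's worth, the hypothetical truncatedint4 operator discussed in this exchange (nothing like it exists in core) can be sketched in a few lines. The point it illustrates is only that such a coarse comparison disagrees with plain int4 equality, which is exactly the kind of disagreement the transitivity requirement for btree opfamilies is meant to rule out:

/*
 * Toy version of the hypothetical "truncatedint4" comparison, which compares
 * int4 values after discarding the low-order 16 bits.  Values that this
 * coarse operator reports as "equal" are not equal under the ordinary int4
 * operator, so mixing the two as if they were interchangeable equalities
 * would be unsafe.
 */
#include <stdio.h>
#include <stdint.h>

static int
truncated_cmp(int32_t a, int32_t b)
{
    int32_t ta = a >> 16,
            tb = b >> 16;

    return (ta > tb) - (ta < tb);
}

int
main(void)
{
    int32_t a = 65537;          /* 0x00010001 */
    int32_t b = 65536;          /* 0x00010000 */

    /* coarse operator says they are equal, int4 says they are not */
    printf("truncated a = b? %d\n", truncated_cmp(a, b) == 0);  /* 1 */
    printf("int4      a = b? %d\n", a == b);                    /* 0 */
    return 0;
}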
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Matthias van de Meent
Date:
On Fri, 19 Jan 2024 at 23:42, Peter Geoghegan <pg@bowt.ie> wrote: > Thank you for your replies so far. > On Thu, Jan 18, 2024 at 11:39 AM Matthias van de Meent > <boekewurm+postgres@gmail.com> wrote: > > I would agree with you if this was about new APIs and features, but > > here existing APIs are being repurposed without changing them. A > > maintainer of index AMs would not look at the commit title 'Enhance > > nbtree ScalarArrayOp execution' and think "oh, now I have to make sure > > my existing amsearcharray+amcanorder handling actually supports > > non-prefix arrays keys and return data in index order". > > This is getting ridiculous. > > It is quite likely that there are exactly zero affected out-of-core > index AMs. I don't count pgroonga as a counterexample (I don't think > that it actually fullfills the definition of a ). Basically, > "amcanorder" index AMs more or less promise to be compatible with > nbtree, down to having the same strategy numbers. So the idea that I'm > going to upset amsearcharray+amcanorder index AM authors is a > completely theoretical problem. The planner code evolved with nbtree, > hand-in-glove. And this is where I disagree with your (percieved) implicit argument that this should be and always stay this way. I don't mind changes in the planner for nbtree per se, but as I've mentioned before in other places, I really don't like how we handle amcanorder as if it were amisbtree. But it's not called that, so we shouldn't expect that to remain the case; and definitely not keep building on those expectations that it always is going to be the nbtree when amcanorder is set (or amsearcharray is set, or ..., or any combination of those that is currently used by btree). By keeping that expectation alive, this becomes a self-fulfilling prophecy, and I really don't like such restrictive self-fulfilling prophecies. It's nice that we have index AM feature flags, but if we only effectively allow one implementation for this set of flags by ignoring potential other users when changing behavior, then what is the API good for (apart from abstraction layering, which is nice)? /rant I'll see about the > I'm more than happy to add a reminder to the commit message about > adding a proforma listing to the compatibility section of the Postgres > 17 release notes. Can we actually agree on which theoretical index AM > types are affected first, though? Isn't it actually > amsearcharray+amcanorder+amcanmulticol external index AMs only? Do I > have that right? I think that may be right, but I could very well have built a btree-lite that doesn't have the historical baggage of having to deal with pages from before v12 (version 4) and some improvements that haven't made it to core yet. > BTW, how should we phrase this compatibility note, so that it can be > understood? It's rather awkward. Something like the following, maybe? """ Compatibility: The planner will now generate paths with array scan keys in any column for ordered indexes, rather than on only a prefix of the index columns. The planner still expects fully ordered data from those indexes. Historically, the btree AM couldn't output correctly ordered data for suffix array scan keys, which was "fixed" by prohibiting the planner from generating array scan keys without an equality prefix scan key up to that attribute. 
In this commit, the issue has been fixed, and the planner restriction has thus been removed as the only in-core IndexAM that reports amcanorder now supports array scan keys on any column regardless of what prefix scan keys it has. """ > > > As I said to Heikki, thinking about this some more is on my todo list. > > > I mean the way that this worked had substantial revisions on HEAD > > > right before I posted v9. v9 was only to fix the bit rot that that > > > caused. > > > > Then I'll be waiting for that TODO item to be resolved. > > My TODO item is to resolve the question of whether and to what extent > these two optimizations should be combined. It's possible that I'll > decide that it isn't worth it, for whatever reason. That's fine, this decision (and any work related to it) is exactly what I was referring to with that mention of the TODO. > > > Even single digit > > > thousand of array elements should be considered huge. Plus this code > > > path is only hit with poorly written queries. > > > > Would you accept suggestions? > > You mean that you want to have a go at it yourself? Sure, go ahead. > > I cannot promise that I'll accept your suggested revisions, but if > they aren't too invasive/complicated compared to what I have now, then > I will just accept them. I maintain that this isn't really necessary, > but it might be the path of least resistance at this point. Attached 2 patches: v11.patch-a and v11.patch-b. Both are incremental on top of your earlier set, and both don't allocate additional memory in the merge operation in non-assertion builds. patch-a is a trivial and clean implementation of mergesort, which tends to reduce the total number of compare operations if both arrays are of similar size and value range. It writes data directly back into the main array on non-assertion builds, and with assertions it reuses your binary search join on scratch space for algorithm validation. patch-b is an improved version of patch-a with reduced time complexity in some cases: It applies exponential search to reduce work done where there are large runs of unmatched values in either array, and does that while keeping O(n+m) worst-case complexity, now getting a best-case O(log(n)) complexity. > > > > I'm also no fan of the (tail) recursion. I would agree that this is > > > > unlikely to consume a lot of stack, but it does consume stackspace > > > > nonetheless, and I'd prefer if it was not done this way. > > > > > > I disagree. The amount of stack space it consumes in the worst case is fixed. > > > > Is it fixed? Without digging very deep into it, it looks like it can > > iterate on the order of n_scankeys deep? The comment above does hint > > on 2 iterations, but is not very clear about the conditions and why. > > The recursive call to _bt_advance_array_keys is needed to deal with > unsatisfied-and-required inequalities that were not detected by an > initial call to _bt_check_compare, following a second retry call to > _bt_check_compare. We know that this recursive call to > _bt_advance_array_keys won't cannot recurse again because we've > already identified which specific inequality scan key it is that's a > still-unsatisfied inequality, following an initial round of array key > advancement. So, it's a case of this? 
scankey: a > 1 AND a = ANY (1 (=current), 2, 3)) AND b < 4 AND b = ANY (2 (=current), 3, 4) tuple: a=2, b=4 We first match the 'a' inequality key, then find an array-key mismatch breaking out, then move the array keys forward to a=2 (of ANY (1,2,3)), b=4 (of ANY(2, 3, 4)); and now we have to recheck the later b < 4 inequality key, as that required inequality key wasn't checked because the earlier arraykey did not match in _bt_check_compare, causing it to stop at the a=1 condition as opposed to check the b < 4 condition? If so, then this visual explanation helped me understand the point and why it can't repeat more than once much better than all that text. Maybe this can be integrated? > This is a narrow edge-case -- it is an awkward one. Recursion does > seem natural here, because we're essentially repeating the call to > _bt_advance_array_keys from inside _bt_advance_array_keys, rather than > leaving it up to the usual external caller to do later on. If you > ripped this code out it would barely be noticeable at all. But it > seems worth it so that we can make a uniform promise to always advance > the array keys to the maximum extent possible, based on what the > caller's tuple tells us about the progress of the scan. Agreed on the awkwardness and recursion. > > > Because in one case we follow the convention for scan keys, whereas > > > so->orderProcs is accessed via an attribute number subscript. > > > > Okay, but how about this in _bt_merge_arrays? > > > > + Datum *elem = elems_orig + i; > > > > I'm not familiar with the scan key convention, as most other places > > use reference+subscripting. > > I meant the convention used in code like _bt_check_compare (which is > what we call _bt_checkkeys on HEAD, basically). > > Note that the _bt_merge_arrays code that you've highlighted isn't > iterating through so->keyData[] -- it is iterating through the > function caller's elements array, which actually come from > so->arrayKeys[]. It was exactly why I started asking about the use of pointer additions: _bt_merge_arrays is new code and I can't think of any other case of this style being used in new code, except maybe when surrounding code has this style. The reason I first highlighted _bt_update_keys_with_arraykeys is because it had a clear case of 2 different styles in the same function. > Like every other Postgres contributor, I do my best to follow the > conventions established by existing code. Sometimes that leads to > pretty awkward results, where CamelCase and underscore styles are > closely mixed together, because it works out to be the most consistent > way of doing it overall. I'm slightly surprised by that, as after this patch I can't find any code that wasn't touched by this patch in nbtutils that uses + for pointer offsets. Either way, that's the style for indexing into a ScanKey, apparently? Kind regards, Matthias van de Meent Neon (https://neon.tech)
Attachment
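A rough sketch of the exponential ("galloping") search idea described for v11.patch-b, again with plain ints rather than Datums and ORDER procs -- this is not the actual patch, just an illustration of the technique:

/*
 * When one input has a long run of values that cannot possibly match, gallop
 * ahead in doubling steps and then binary-search within the final bracket,
 * instead of stepping one element at a time.
 */
#include <stdio.h>

/* Return smallest index in [lo, n) with arr[idx] >= key, galloping from lo */
static int
gallop_lower_bound(const int *arr, int n, int lo, int key)
{
    int     step = 1,
            hi = lo;

    while (hi < n && arr[hi] < key)
    {
        lo = hi + 1;
        hi += step;
        step *= 2;              /* exponential probe: 1, 2, 4, 8, ... */
    }
    if (hi > n)
        hi = n;

    while (lo < hi)             /* binary search inside the bracket */
    {
        int     mid = lo + (hi - lo) / 2;

        if (arr[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Intersect sorted, duplicate-free a[] and b[], writing matches back into a[] */
static int
gallop_intersect(int *a, int na, const int *b, int nb)
{
    int     i = 0,
            j = 0,
            nout = 0;

    while (i < na && j < nb)
    {
        if (a[i] == b[j])
        {
            a[nout++] = a[i];
            i++;
            j++;
        }
        else if (a[i] < b[j])
            i = gallop_lower_bound(a, na, i + 1, b[j]);
        else
            j = gallop_lower_bound(b, nb, j + 1, a[i]);
    }
    return nout;
}

int
main(void)
{
    int     a[] = {1, 2, 3, 1000, 1001, 1002, 5000};
    int     b[] = {3, 4, 5, 1001, 5000, 9000};
    int     n = gallop_intersect(a, 7, b, 6);

    for (int k = 0; k < n; k++)
        printf("%d ", a[k]);    /* prints: 3 1001 5000 */
    printf("\n");
    return 0;
}

Long runs of unmatched values are skipped in a logarithmic number of comparisons, while fully interleaved inputs degrade gracefully to the plain O(n + m) merge.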
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Mon, Jan 22, 2024 at 4:13 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > On Fri, 19 Jan 2024 at 23:42, Peter Geoghegan <pg@bowt.ie> wrote: > And this is where I disagree with your (percieved) implicit argument > that this should be and always stay this way. I never said that, and didn't intend to imply it. As I outlined to you back in November, my general philosophy around APIs (such as the index AM API) is that ambiguity about the limits and extent of an abstract interface isn't necessarily a bad thing [1]. It can actually be a good thing! Ever hear of Hyrum's Law? Abstractions are very often quite leaky. APIs like the index AM API inevitably make trade-offs. Trade-offs are almost always political, in one way or another. Litigating every possible question up-front requires knowing ~everything before you really get started. This mostly ends up being a waste of time, since many theoretical contentious trade-offs just won't matter one bit in the long run, for one reason or another (not necessarily because there is only ever one consumer of an API, for all sorts of reasons). You don't have to agree with me. That's just what my experience indicates works best on average, and in the long run. I cannot justify it any further than that. > By keeping that expectation alive, > this becomes a self-fulfilling prophecy, and I really don't like such > restrictive self-fulfilling prophecies. It's nice that we have index > AM feature flags, but if we only effectively allow one implementation > for this set of flags by ignoring potential other users when changing > behavior, then what is the API good for (apart from abstraction > layering, which is nice)? I explicitly made it clear that I don't mean that. > I think that may be right, but I could very well have built a > btree-lite that doesn't have the historical baggage of having to deal > with pages from before v12 (version 4) and some improvements that > haven't made it to core yet. Let me know if you ever do that. Let me know what the problems you encounter are. I'm quite happy to revise my position on this in light of new information. I change my mind all the time! > Something like the following, maybe? > > """ > Compatibility: The planner will now generate paths with array scan > keys in any column for ordered indexes, rather than on only a prefix > of the index columns. The planner still expects fully ordered data > from those indexes. > Historically, the btree AM couldn't output correctly ordered data for > suffix array scan keys, which was "fixed" by prohibiting the planner > from generating array scan keys without an equality prefix scan key up > to that attribute. In this commit, the issue has been fixed, and the > planner restriction has thus been removed as the only in-core IndexAM > that reports amcanorder now supports array scan keys on any column > regardless of what prefix scan keys it has. > """ Even if I put something like this in the commit message, I doubt that Bruce would pick it up in anything like this form (I have my doubts about it happening no matter what wording is used, actually). I could include something less verbose, mentioning a theoretical risk to out-of-core amcanorder routines that coevolved with nbtree, inherited the same SAOP limitations, and then never got the same set of fixes. > patch-a is a trivial and clean implementation of mergesort, which > tends to reduce the total number of compare operations if both arrays > are of similar size and value range. 
It writes data directly back into > the main array on non-assertion builds, and with assertions it reuses > your binary search join on scratch space for algorithm validation. I'll think about this some more. > patch-b is an improved version of patch-a with reduced time complexity > in some cases: It applies exponential search to reduce work done where > there are large runs of unmatched values in either array, and does > that while keeping O(n+m) worst-case complexity, now getting a > best-case O(log(n)) complexity. This patch seems massively over-engineered, though. Over 200 new lines of code, all for a case that is only used when queries are written with obviously-contradictory quals? It's just not worth it. > > The recursive call to _bt_advance_array_keys is needed to deal with > > unsatisfied-and-required inequalities that were not detected by an > > initial call to _bt_check_compare, following a second retry call to > > _bt_check_compare. We know that this recursive call to > > _bt_advance_array_keys won't cannot recurse again because we've > > already identified which specific inequality scan key it is that's a > > still-unsatisfied inequality, following an initial round of array key > > advancement. > > So, it's a case of this? > > scankey: a > 1 AND a = ANY (1 (=current), 2, 3)) AND b < 4 AND b = ANY > (2 (=current), 3, 4) > tuple: a=2, b=4 I don't understand why your example starts with "scankey: a > 1" and uses redundant/contradictory scan keys (for both "a" and "b"). For a forward scan the > scan key won't be seen as required by _bt_advance_array_keys, which means that it cannot be relevant to the branch containing the recursive call to _bt_advance_array_keys. (The later branch that calls _bt_check_compare is another matter, but that doesn't call _bt_advance_array_keys recursively at all -- that's not what we're discussing.) I also don't get why you've added all of this tricky redundancy to the example you've proposed. That seems to make the example a lot more complicated, without any apparent advantage. This particular piece of code has nothing to do with redundant/contradictory scan keys. > We first match the 'a' inequality key, then find an array-key mismatch > breaking out, then move the array keys forward to a=2 (of ANY > (1,2,3)), b=4 (of ANY(2, 3, 4)); and now we have to recheck the later > b < 4 inequality key, as that required inequality key wasn't checked > because the earlier arraykey did not match in _bt_check_compare, > causing it to stop at the a=1 condition as opposed to check the b < 4 > condition? I think that that's pretty close, yes. Obviously, _bt_check_compare() is going to give up upon finding the most significant now-unsatisfied scan key (which must be a required scan key in cases relevant to the code in question) -- at that point we need to advance the array keys. If we go on to successfully advance all required arrays to values that all match the corresponding values from caller's tuple, we might still have to consider inequalities (that are required in the current scan direction). It might still turn out that the relevant second call to _bt_check_compare() (the first one inside _bt_advance_array_keys) still sets continuescan=false -- despite what we were able to do with the scan's array keys. At that point, the only possible explanation is that there is a required inequality that still isn't satisfied. 
We should use "beyond_end_advance" advancement to advance the array keys "incrementally" a second time (in a recursive call "against the same original tuple"). This can only happen when the value of each of two required scan keys (some equality scan key plus some later inequality scan key) both cease to be satisfied at the same time, *within the same tuple*. In practice this just doesn't happen very often. It could in theory be avoided altogether (without any behavioral change) by forcing _bt_check_compare to always assess every scan key, rather than giving up upon finding any unsatisfied scan key (though that would be very inefficient). It'll likely be much easier to see what I mean by considering a real example. See my test case for this, which I don't currently plan to commit: https://github.com/petergeoghegan/postgres/blob/saop-dynamic-skip-v10.0/src/test/regress/sql/dynamic_saop_advancement.sql#L3600 I think that only this test case is the only one that'll actually break when you just rip out the recursive _bt_advance_array_keys call. And it still won't give incorrect answers when it breaks -- it just accesses a single extra leaf page. As I went into a bit already, upthread, this recursive call business is a good example of making the implementation more complicated, in order to preserve interface simplicity and generality. I think that that's the right trade-off here, despite being kinda awkward. > If so, then this visual explanation helped me understand the point and > why it can't repeat more than once much better than all that text. > Maybe this can be integrated? An example seems like it'd be helpful as a code comment. Can you come up with a simplified one? > It was exactly why I started asking about the use of pointer > additions: _bt_merge_arrays is new code and I can't think of any other > case of this style being used in new code, except maybe when > surrounding code has this style. The reason I first highlighted > _bt_update_keys_with_arraykeys is because it had a clear case of 2 > different styles in the same function. Meh. [1] https://postgr.es/m/CAH2-WzmWn2_eS_4rWy90DRzC-NW-oponONR6PwMqy+OOuvVyFA@mail.gmail.com -- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Mon, Jan 22, 2024 at 4:13 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > Attached 2 patches: v11.patch-a and v11.patch-b. Both are incremental > on top of your earlier set, and both don't allocate additional memory > in the merge operation in non-assertion builds. > > patch-a is a trivial and clean implementation of mergesort, which > tends to reduce the total number of compare operations if both arrays > are of similar size and value range. It writes data directly back into > the main array on non-assertion builds, and with assertions it reuses > your binary search join on scratch space for algorithm validation. This patch fails some of my tests on non-assert builds only (assert builds pass all my tests, though). I'm using the first patch on its own here. While I tend to be relatively in favor of complicated assertions (I tend to think the risk of introducing side-effects is worth it), it looks like you're only performing certain steps in release builds. This is evident from just looking at the code (there is an #else block just for the release build in the loop). Note also that "Assert(_bt_compare_array_elements(&merged[merged_nelems++], orig, &cxt) == 0)" has side-effects in assert-enabled builds only (it increments merged_nelems). While it's true that you *also* increment merged_nelems *outside* of the assertion (or in an #else block used during non-assert builds), that is conditioned on some other thing (so it's in no way equivalent to the debug #ifdef USE_ASSERT_CHECKING case). It's also just really hard to understand what's going on here. If I was going to do this kind of thing, I'd use two completely separate loops, that were obviously completely separate (maybe even two functions). I'd then memcmp() each array at the end. -- Peter Geoghegan
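The underlying hazard is generic C rather than anything specific to this patch: any side effect written inside an assertion disappears in builds that compile assertions out (NDEBUG for <assert.h>; PostgreSQL's Assert() under USE_ASSERT_CHECKING behaves analogously). A minimal illustration:

/*
 * Demonstrates how a side effect inside an assertion makes assert-enabled
 * and assert-disabled builds diverge.
 */
#include <assert.h>
#include <stdio.h>

int
main(void)
{
    int     nelems = 0;

    /* BAD: the increment only happens in assert-enabled builds */
    assert(nelems++ == 0);

    /* GOOD: do the work unconditionally, assert on the result afterwards */
    int     prev = nelems;

    nelems++;
    assert(nelems == prev + 1);

    printf("nelems = %d\n", nelems);    /* 2 with assertions, 1 without */
    return 0;
}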
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Tue, Jan 23, 2024 at 3:22 PM Peter Geoghegan <pg@bowt.ie> wrote: > I could include something less verbose, mentioning a theoretical risk > to out-of-core amcanorder routines that coevolved with nbtree, > inherited the same SAOP limitations, and then never got the same set > of fixes. Attached is v11, which now says something like that in the commit message. Other changes: * Fixed buggy sorting of arrays using cross-type ORDER procs, by recognizing that we need to consistently use same-type ORDER procs for sorting and merging the arrays during array preprocessing. Obviously, when we sort, we compare array elements to other array elements (all of the same type). This is true independent of whether the query itself happens to use a cross type operator/ORDER proc, so we will need to do two separate ORDER proc lookups in cross-type scenarios. * No longer subscript the ORDER proc used for array binary searches using a scankey subscript. Now there is an additional indirection that works even in the presence of multiple redundant scan keys that cannot be detected as such due to a lack of appropriate cross-type support within an opfamily. This was subtly buggy before now. Requires a little more coordination between array preprocessing and standard/primitive index scan preprocessing, which isn't ideal but seems unavoidable. * Lots of code polishing, especially within _bt_advance_array_keys(). While _bt_advance_array_keys() still works in pretty much exactly the same way as it did back in v10, there are now better comments. Including something about why its recursive call to itself is guaranteed to use a low, fixed amount of stack space, verified using an assertion. That addresses a concern held by Matthias. Outlook ======= This patch is approaching being committable now. Current plan is to commit this within the next few weeks. All that really remains now is to research how we might integrate this work with the recently added continuescanPrechecked/haveFirstMatch stuff from Alexander Korotkov, if at all. I've put that off until now because it isn't particularly fundamental to what I'm doing here, and seems optional. I would also like to do more performance validation. Things like the parallel index scan code could stand to be revisited once again. Plus I should think about the overhead of array preprocessing when btrescan() is called many times, from a nested loop join -- I should have something to say about that concern (raised by Heikki at one point) before too long. -- Peter Geoghegan
Attachment
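As a hedged sketch of the ORDER proc distinction described above -- not the patch's actual code, and assuming the (indexed type, element type) argument order for the scan-side lookup -- the two lookups might look roughly like this inside the backend:

/*
 * Sorting/merging an array compares elements against other elements of the
 * same type, so it needs the same-type support function; comparisons against
 * index column values in a cross-type scan need the cross-type function.
 */
#include "postgres.h"
#include "access/nbtree.h"
#include "fmgr.h"
#include "utils/lsyscache.h"

static void
lookup_array_order_procs(Oid opfamily, Oid indexed_type, Oid elem_type,
                         FmgrInfo *sortproc, FmgrInfo *scanproc)
{
    Oid     sortfn;
    Oid     scanfn;

    /* same-type proc: used to sort, deduplicate, and merge the array itself */
    sortfn = get_opfamily_proc(opfamily, elem_type, elem_type, BTORDER_PROC);

    /* possibly cross-type proc: used to compare index values to elements */
    scanfn = get_opfamily_proc(opfamily, indexed_type, elem_type, BTORDER_PROC);

    if (!OidIsValid(sortfn) || !OidIsValid(scanfn))
        elog(ERROR, "missing support function in opfamily %u", opfamily);

    fmgr_info(sortfn, sortproc);
    fmgr_info(scanfn, scanproc);
}

In the common non-cross-type case the two lookups resolve to the same function, but they cannot be assumed to.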
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Thu, Feb 15, 2024 at 6:36 PM Peter Geoghegan <pg@bowt.ie> wrote: > Attached is v11, which now says something like that in the commit > message. Attached is v12. > All that really remains now is to research how we might integrate this > work with the recently added continuescanPrechecked/haveFirstMatch > stuff from Alexander Korotkov, if at all. The main change in v12 is that I've integrated both the continuescanPrechecked and the haveFirstMatch optimizations. Both of these fields are now page-level state, shared between the _bt_readpage caller and the _bt_checkkeys/_bt_advance_array_keys callees (so they appear next to the new home for _bt_checkkeys' continuescan field, in the new page state struct). The haveFirstMatch optimization (not a great name BTW) was the easier of the two to integrate. I just invalidate the flag (set it to false) when the array keys advance. It works in exactly the same way as in the simple no-arrays case, except that there can be more than one "first match" per page. (This approach was directly enabled by the new design that came out of Alexander's bugfix commit 7e6fb5da) The continuescanPrechecked optimization was trickier. The hard part was avoiding confusion when the start of matches for the current set of array keys starts somewhere in the middle of the page -- that needs to be a gating condition on applying the optimization, applied within _bt_readpage. continuescanPrechecked ====================== To recap, this precheck optimization is predicated on the idea that the precheck _bt_checkkeys call tells us all we need to know about required-in-same-direction scan keys for every other tuple on the page (assuming continuescan=true is set during the precheck). Critically, this assumes that there can be no confusion between "before the start of matching tuples" and "after the end of matching tuples" (in the presence of required equality strategy scan keys). In other words, it tacitly assumes that we can't break a very old rule that says that _bt_first must never call _bt_readpage with an offnum that's before the start of the first match (the leaf page position located by our insertion scan key). Although it isn't obvious that this is relevant at all right now (partly because the _bt_first call to _bt_readpage won't even use the optimization on the grounds that to do so would regress point queries), it becomes more obvious once my patch is added to the mix. To recap again, making the arrays advance dynamically necessarily requires that I teach _bt_checkkeys to avoid confusing "before the start of matching tuples" with "after the end of matching tuples" -- we must give _bt_checkkeys a set of 3-way ORDER procs (and the necessary context) to avoid that confusion -- even for any non-array required equality keys. Putting it all together (having laid the groundwork with my recaps): This means that it is only safe (for my patch) to even attempt the precheck optimization when we already know that the array keys cover keyspace at the start of the page (and if the attempt actually succeeds, it succeeds because the current array keys cover the entire page). And so in v12 I track that state in so->scanBehind. In practice, we only rarely need to avoid the continuescanPrechecked optimization as a result of this gating condition -- it hardly reduces the effectiveness of the optimization at all. In fact, it isn't even possible for so->scanBehind to ever be set during a backward scan. 
Saving cycles in _bt_checkkeys with so->scanBehind ------------------------------------------------- It turns out that this so->scanBehind business is generally useful. We now expect _bt_advance_array_keys to establish whether or not the _bt_readpage-wise scan will continue to the end of the current leaf page up-front, each time the arrays are advanced (advanced past the tuple being scanned by _bt_readpage). When _bt_advance_array_keys decides to move onto the next page, it might also set so->scanBehind. This structure allowed me to simplify the hot code paths within _bt_checkkeys. Now _bt_checkkeys doesn't need to check the page high key (or check the first non-pivot tuple in the case of backwards scans) at all in the very common case where so->scanBehind hasn't been set by _bt_advance_array_keys. I mentioned that only forward scans can ever set the so->scanBehind flag variable. That's because only forward scans can advance the array keys using the page high key, which (due to suffix truncation) creates the possibility that the next _bt_readpage will start out on a page whose earlier tuples are before the real start of matches (for the current array keys) -- see the comments in _bt_advance_array_keys for the full explanation. so->scanBehind summary ---------------------- In summary, so->scanBehind serves two quite distinct purposes: 1. Makes it safe to apply the continuescanPrechecked optimization during scans that happen to use array keys -- with hardly any changes required to _bt_readpage to make this work (just testing the so->scanBehind flag). 2. Gives _bt_checkkeys() a way to know when to check whether an earlier speculative choice (by _bt_advance_array_keys) -- the choice to move on to the current leaf page without it yet being 100% clear that its key space has exact matches for all the scan's equality/array keys -- actually worked out (such uncertainty is only possible when suffix truncation obscures the issue, hence the thing about so->scanBehind only ever being set during forward scans). This speculative choice can only be made during forward scans of a composite index, whose high key has one or more truncated suffix attributes that correspond to a required scan key. _bt_advance_array_keys deems that the truncated attribute "satisfies" the scan key, but remembers that it wasn't strictly speaking "satisfied", by setting so->scanBehind (so it is "satisfied" with the truncated attributes being a "match", but not so satisfied that it's willing to allow it without also giving _bt_checkkeys a way to back out if it turns out to have been the wrong choice once on the next page). _bt_preprocess_keys contradictory key array advancement ======================================================= The other big change in v12 concerns _bt_preprocess_keys (not to be confused with _bt_preprocess_array_keys). It has been taught to treat the scan's array keys as a distinct type of scan key for the purposes of detecting redundant and contradictory scan keys. In other words, it treats scan keys with arrays as their own special type of equality scan key -- it doesn't treat them as just another type of equality strategy key in v12. In other other words, _bt_preprocess_keys doesn't "work at the level of individual primitive index scans" in v12. Technically this is an optimization that targets cases with many distinct sets of array keys that have contradictory quals. 
But I mostly did this because it has what I now consider to be far better code structure than what we had in previous versions (I changed my mind, having previously considered the changes I'd made to _bt_preprocess_keys for arrays to be less than desirable). And because it defensively makes the B-Tree code tolerant of scans that have an absurdly large number of distinct sets of array keys (e.g., 3 arrays with 1,000 elements, which gives us a billion distinct sets of array keys in total). In v12 an index scan can use (say) a billion distinct sets of array keys, and still expect optimal performance -- even in the corner case where (for whatever reason) the scan's quals also happen to be contradictory. There is no longer any risk of calling _bt_preprocess_keys a billion times, making the query time explode, all because somebody added an "a = 5 AND a = 6" to the end of the query's qual. In fact, there isn't ever a need to call _bt_preprocess_keys() more than once, no matter the details of the scan -- not now that _bt_preprocess_keys literally operates on whole arrays as a separate thing to other equality keys. Once you start to think in terms of array scan keys (not primitive index scans), it becomes obvious that a single call to _bt_preprocess_keys is all you really need, (Note: While we actually do still always call _bt_preprocess_keys from _bt_first, every call to _bt_preprocess_keys after the first one can now simply reuse the existing scan keys that were output during the first call. So effectively there's only one call to _bt_preprocess_keys required per top-level scan.) Potential downsides ------------------- Admittedly, this new approach in _bt_preprocess_keys has some potential downsides. _bt_preprocess_keys is now largely incapable of treating quals like "WHERE a IN(1,2,3) AND a = 2" as "partially contradictory". What it'll actually do is allow the qual to contain duplicative scan keys (just like when we can't detect redundancy due to a lack of cross-type support), and leave it all up to _bt_checkkeys/_bt_advance_array_keys to figure it out later on. (At one point I actually taught _bt_preprocesses_keys to perform "incremental advancement" of the scan's arrays itself, which worked, but that still left the code vulnerable to having to call _bt_preprocesses_keys an enormous number of times in extreme cases. That was deemed unacceptable, because it still seemed like a POLA violation.) This business with just leaving in some extra scan keys might seem a little bit sloppy. I thought so myself, for a while, but now I think that it's actually fine. For the following reasons: * Even with arrays, contradictory quals like "WHERE a = 4 AND a = 5" are still detected as contradictory, while quals like "WHERE a = 5 and a = 5" are still detected as containing a redundancy. We'll only "fail to detect redundancy/contradictoriness" with cases such as "WHERE a IN (1, 2, 3) and a = 2" -- cases that (once you buy into this new way of thinking about arrays and preprocessing) aren't really contradictory/redundant at all. * We have _bt_preprocess_array_keys, which was first taught to merge together redundant array keys in an earlier version of this patch. So, for example, quals like "WHERE a IN (1, 2, 3) AND a IN (3,4,5)" can be preprocessed into "WHERE a in (3)". Plus quals like "WHERE a IN (1, 2, 3) AND a IN (18,72)" can be ruled contradictory there and then, in _bt_preprocess_array_keys , ending the scan immediately. These cases seem like the really important ones for array redundancy/contradictoriness. 
We're not going to miss these array redundancies. * The new approach to array key advancement is very good at quickly eliminating redundant subsets of the array keys that couldn't be detected as such by preprocessing, due to a lack of suitable infrastructure. While there is some downside from allowing preprocessing to fail to detect "partially contradictory" quals, the new code is so good at advancing the array keys efficiently at runtime the added overhead is hardly noticeable at all. We're talking maybe one or two extra descents of the index, for just a subset of all badly written queries that show some kind of redundancy/contradictoriness involving array scan keys. Even if somebody takes the position that this approach isn't acceptable (not that I expect that anybody will), then the problem won't be that I've gone too far. The problem will just be that I haven't gone far enough. If it somehow becomes a blocker to commit, I'd rather handle the problem by compensating for the minor downsides that the v12 _bt_preprocess_keys changes arguably create, by adding more types of array-specific preprocessing to _bt_preprocess_array_keys. You know, preprocessing that can actually eliminate a subset of array keys given a qual like "WHERE a IN (1, 2, 3) AND a < 2". To be clear, I really don't think it's worth adding new types of array preprocessing to _bt_preprocess_array_keys, just for these cases -- my current approach works just fine, by any relevant metric (even if you assume that the sorts of array redundancy/contradictoriness we can miss out on are common and important, it's small potatoes). Just bringing the idea of adding more stuff to _bt_preprocess_array_keys to the attention of the group, as an option (this idea might also provide useful context, that makes my high level thinking on this a bit easier to follow). -- Peter Geoghegan
Attachment
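To illustrate the kind of extra array preprocessing being floated at the end of that message (purely hypothetical -- nothing like it is claimed to exist in v12), filtering a sorted array against a scalar inequality is only a few lines, shown here with plain ints standing in for Datums:

/*
 * Toy filter that turns "a IN (1, 2, 3) AND a < 2" into "a IN (1)", and
 * detects the contradictory case where nothing survives.
 */
#include <stdio.h>

/* Keep only elements strictly below 'bound'; returns the new length */
static int
filter_array_below(int *elems, int nelems, int bound)
{
    int     nout = 0;

    for (int i = 0; i < nelems; i++)
    {
        if (elems[i] < bound)
            elems[nout++] = elems[i];
        else
            break;              /* sorted input: nothing later can qualify */
    }
    return nout;                /* 0 means the qual is contradictory */
}

int
main(void)
{
    int     arr[] = {1, 2, 3};
    int     n = filter_array_below(arr, 3, 2);

    if (n == 0)
        printf("contradictory qual -- scan can end immediately\n");
    else
    {
        for (int i = 0; i < n; i++)
            printf("%d ", arr[i]);      /* prints: 1 */
        printf("\n");
    }
    return 0;
}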
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Matthias van de Meent
Date:
On Sat, 2 Mar 2024 at 02:30, Peter Geoghegan <pg@bowt.ie> wrote: > > On Thu, Feb 15, 2024 at 6:36 PM Peter Geoghegan <pg@bowt.ie> wrote: > > Attached is v11, which now says something like that in the commit > > message. > > Attached is v12. Some initial comments on the documentation: > + that searches for multiple values together. Queries that use certain > + <acronym>SQL</acronym> constructs to search for rows matching any value > + out of a list (or an array) of multiple scalar values might perform > + multiple <quote>primitive</quote> index scans (up to one primitive scan > + per scalar value) at runtime. See <xref linkend="functions-comparisons"/> > + for details. I don't think the "see <functions-comparisons> for details" is correctly worded: The surrounding text implies that it would contain details about in which precise situations multiple primitive index scans would be consumed, while it only contains documentation about IN/NOT IN/ANY/ALL/SOME. Something like the following would fit better IMO: + that searches for multiple values together. Queries that use certain + <acronym>SQL</acronym> constructs to search for rows matching any value + out of a list or array of multiple scalar values (such as those described in + <functions-comparisons> might perform multiple <quote>primitive</quote> + index scans (up to one primitive scan per scalar value) at runtime. Then there is a second issue in the paragraph: Inverted indexes such as GIN might well decide to start counting more than one "primitive scan" per scalar value, because they may need to go through their internal structure more than once to produce results for a single scalar value; e.g. with queries WHERE textfield LIKE '%some%word%', a trigram index would likely use at least 4 descents here: one for each of "som", "ome", "wor", "ord". > > All that really remains now is to research how we might integrate this > > work with the recently added continuescanPrechecked/haveFirstMatch > > stuff from Alexander Korotkov, if at all. > > The main change in v12 is that I've integrated both the > continuescanPrechecked and the haveFirstMatch optimizations. Both of > these fields are now page-level state, shared between the _bt_readpage > caller and the _bt_checkkeys/_bt_advance_array_keys callees (so they > appear next to the new home for _bt_checkkeys' continuescan field, in > the new page state struct). Cool. I'm planning to review the rest of this patch this week/tomorrow, could you take some time to review some of my btree patches, too? Kind regards, Matthias van de Meent Neon (https://neon.tech)
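To make the trigram example above concrete, a toy sliding window is enough to show where the four descents come from (pg_trgm's real extraction also pads and case-folds, so this is only illustrative):

/*
 * The words "some" and "word" each yield two 3-character substrings, so a
 * trigram index could need four separate descents for
 * WHERE textfield LIKE '%some%word%'.
 */
#include <stdio.h>
#include <string.h>

static void
print_trigrams(const char *word)
{
    size_t  len = strlen(word);

    for (size_t i = 0; i + 3 <= len; i++)
        printf("\"%.3s\" ", word + i);
}

int
main(void)
{
    print_trigrams("some");     /* "som" "ome" */
    print_trigrams("word");     /* "wor" "ord" */
    printf("\n");
    return 0;
}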
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Mon, Mar 4, 2024 at 3:51 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > > + that searches for multiple values together. Queries that use certain > > + <acronym>SQL</acronym> constructs to search for rows matching any value > > + out of a list (or an array) of multiple scalar values might perform > > + multiple <quote>primitive</quote> index scans (up to one primitive scan > > + per scalar value) at runtime. See <xref linkend="functions-comparisons"/> > > + for details. > > I don't think the "see <functions-comparisons> for details" is > correctly worded: The surrounding text implies that it would contain > details about in which precise situations multiple primitive index > scans would be consumed, while it only contains documentation about > IN/NOT IN/ANY/ALL/SOME. > > Something like the following would fit better IMO: > > + that searches for multiple values together. Queries that use certain > + <acronym>SQL</acronym> constructs to search for rows matching any value > + out of a list or array of multiple scalar values (such as those > described in > + <functions-comparisons> might perform multiple <quote>primitive</quote> > + index scans (up to one primitive scan per scalar value) at runtime. I think that there is supposed to be a closing parenthesis here? So "... (such as those described in <functions-comparisons>") might perform...". FWM, if that's what you meant. > Then there is a second issue in the paragraph: Inverted indexes such > as GIN might well decide to start counting more than one "primitive > scan" per scalar value, because they may need to go through their > internal structure more than once to produce results for a single > scalar value; e.g. with queries WHERE textfield LIKE '%some%word%', a > trigram index would likely use at least 4 descents here: one for each > of "som", "ome", "wor", "ord". Calling anything other than an executor invocation of amrescan/index_rescan an "index scan" (or "primitive index scan") is a bit silly IMV, since, as you point out, whether the index AM manages to do things like descend the underlying physical index structure is really just an implementation detail. GIN has more than one B-Tree, so it's not like there's necessarily just one "index structure", anyway. It seems to me the only meaningful definition of index scan is something directly tied to how the index AM gets invoked. That's a logical, index-AM-agnostic concept. It has a fixed relationship to how EXPLAIN ANALYZE displays information. But we're not living in a perfect world. The fact is that the pg_stat_*_tables system views follow a definition that brings these implementation details into it. Its definition of index scan is probably a legacy of when SAOPs could only be executed in a way that was totally opaque to the index AM (which is how it still works for every non-nbtree index AM). Back then, "index scan" really was synonymous with an executor invocation of amrescan/index_rescan with SAOPs. I've described the issues in this area (in the docs) in a way that is most consistent with historical conventions. That seems to have the fewest problems, despite everything I've said about it. > > > All that really remains now is to research how we might integrate this > > > work with the recently added continuescanPrechecked/haveFirstMatch > > > stuff from Alexander Korotkov, if at all. > > > > The main change in v12 is that I've integrated both the > > continuescanPrechecked and the haveFirstMatch optimizations. 
Both of > > these fields are now page-level state, shared between the _bt_readpage > > caller and the _bt_checkkeys/_bt_advance_array_keys callees (so they > > appear next to the new home for _bt_checkkeys' continuescan field, in > > the new page state struct). > > Cool. I'm planning to review the rest of this patch this > week/tomorrow, could you take some time to review some of my btree > patches, too? Okay, I'll take a look again. Attached is v13. At one point Heikki suggested that I just get rid of BTScanOpaqueData.arrayKeyData (which has been there for as long as nbtree had native support for SAOPs), and use BTScanOpaqueData.keyData exclusively instead. I've finally got around to doing that now. These simplifications were enabled by my new approach within _bt_preprocess_keys, described when I posted v12. v13 goes even further than v12 did, by demoting _bt_preprocess_array_keys to a helper function for _bt_preprocess_keys. That means that we do all of our scan key preprocessing at the same time, at the start of _bt_first (though only during the first _bt_first, or to be more precise during the first per btrescan). If we need fancier preprocessing/normalization for arrays, then it ought to be a lot easier with this structure. Note that we no longer need to have an independent representation of so->qual_okay for array keys (the convention of setting so->numArrayKeys to -1 for unsatisfiable array keys is no longer required). There is no longer any need for a separate pass to carry over the contents of BTScanOpaqueData.arrayKeyData to BTScanOpaqueData.keyData, which was confusing. Are you still interested in working directly on the preprocessing stuff? I have a feeling that I was slightly too optimistic about how likely we were to be able to get away with not having certain kinds of array preprocessing, back when I posted v12. It's true that the propensity of the patch to not recognize "partial redundancies/contradictions" hardly matters with redundant equalities, but inequalities are another matter. I'm slightly worried about cases like this one: select * from multi_test where a in (1,99, 182, 183, 184) and a < 183; Maybe we need to do better with that. What do you think? -- Peter Geoghegan
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Matthias van de Meent
Date:
On Wed, 6 Mar 2024 at 01:50, Peter Geoghegan <pg@bowt.ie> wrote: > > On Mon, Mar 4, 2024 at 3:51 PM Matthias van de Meent > <boekewurm+postgres@gmail.com> wrote: > > > + that searches for multiple values together. Queries that use certain > > > + <acronym>SQL</acronym> constructs to search for rows matching any value > > > + out of a list (or an array) of multiple scalar values might perform > > > + multiple <quote>primitive</quote> index scans (up to one primitive scan > > > + per scalar value) at runtime. See <xref linkend="functions-comparisons"/> > > > + for details. > > > > I don't think the "see <functions-comparisons> for details" is > > correctly worded: The surrounding text implies that it would contain > > details about in which precise situations multiple primitive index > > scans would be consumed, while it only contains documentation about > > IN/NOT IN/ANY/ALL/SOME. > > > > Something like the following would fit better IMO: > > > > + that searches for multiple values together. Queries that use certain > > + <acronym>SQL</acronym> constructs to search for rows matching any value > > + out of a list or array of multiple scalar values (such as those > > described in > > + <functions-comparisons> might perform multiple <quote>primitive</quote> > > + index scans (up to one primitive scan per scalar value) at runtime. > > I think that there is supposed to be a closing parenthesis here? So > "... (such as those described in <functions-comparisons>") might > perform...". Correct. > FWM, if that's what you meant. WFM, yes? > > Then there is a second issue in the paragraph: Inverted indexes such > > as GIN might well decide to start counting more than one "primitive > > scan" per scalar value, [...] > I've described the issues in this area (in the docs) in a way that is > most consistent with historical conventions. That seems to have the > fewest problems, despite everything I've said about it. Clear enough, thank you for explaining your thoughts on this. > > > > All that really remains now is to research how we might integrate this > > > > work with the recently added continuescanPrechecked/haveFirstMatch > > > > stuff from Alexander Korotkov, if at all. > > > > > > The main change in v12 is that I've integrated both the > > > continuescanPrechecked and the haveFirstMatch optimizations. Both of > > > these fields are now page-level state, shared between the _bt_readpage > > > caller and the _bt_checkkeys/_bt_advance_array_keys callees (so they > > > appear next to the new home for _bt_checkkeys' continuescan field, in > > > the new page state struct). > > > > Cool. I'm planning to review the rest of this patch this > > week/tomorrow, could you take some time to review some of my btree > > patches, too? > > Okay, I'll take a look again. Thanks, greatly appreciated. > At one point Heikki suggested that I just get rid of > BTScanOpaqueData.arrayKeyData (which has been there for as long as > nbtree had native support for SAOPs), and use > BTScanOpaqueData.keyData exclusively instead. I've finally got around > to doing that now. I'm not sure if it was worth the reduced churn when the changes for the primary optimization are already 150+kB in size; every "small" addition increases the context needed to review the patch, and it's already quite complex. > These simplifications were enabled by my new approach within > _bt_preprocess_keys, described when I posted v12. v13 goes even > further than v12 did, by demoting _bt_preprocess_array_keys to a > helper function for _bt_preprocess_keys. 
That means that we do all of > our scan key preprocessing at the same time, at the start of _bt_first > (though only during the first _bt_first, or to be more precise during > the first per btrescan). If we need fancier > preprocessing/normalization for arrays, then it ought to be a lot > easier with this structure. Agreed. > Note that we no longer need to have an independent representation of > so->qual_okay for array keys (the convention of setting > so->numArrayKeys to -1 for unsatisfiable array keys is no longer > required). There is no longer any need for a separate pass to carry > over the contents of BTScanOpaqueData.arrayKeyData to > BTScanOpaqueData.keyData, which was confusing. I wasn't very confused by it, but sure. > Are you still interested in working directly on the preprocessing > stuff? If you mean my proposed change to merging two equality AOPs, then yes. I'll try to fit it in tomorrow with the rest of the review. > I have a feeling that I was slightly too optimistic about how > likely we were to be able to get away with not having certain kinds of > array preprocessing, back when I posted v12. It's true that the > propensity of the patch to not recognize "partial > redundancies/contradictions" hardly matters with redundant equalities, > but inequalities are another matter. I'm slightly worried about cases > like this one: > > select * from multi_test where a in (1,99, 182, 183, 184) and a < 183; > > Maybe we need to do better with that. What do you think? Let me come back to that when I'm done reviewing the final part of nbtutils. > Attached is v13. It looks like there are few issues remaining outside the changes in nbtutils. I've reviewed the changes to those files, and ~half of nbtutils (up to _bt_advance_array_keys_increment) now. I think I can get the remainder done tomorrow. > +++ b/src/backend/access/nbtree/nbtutils.c > /* > * _bt_start_array_keys() -- Initialize array keys at start of a scan > * > * Set up the cur_elem counters and fill in the first sk_argument value for > - * each array scankey. We can't do this until we know the scan direction. > + * each array scankey. > */ > void > _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir) > @@ -519,159 +888,1163 @@ _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir) > BTScanOpaque so = (BTScanOpaque) scan->opaque; > int i; > > + Assert(so->numArrayKeys); > + Assert(so->qual_ok); Has the requirement for a known scan direction been removed, or should this still have an Assert(dir != NoMovementScanDirection)? > +++ b/src/backend/access/nbtree/nbtsearch.c > @@ -1713,17 +1756,18 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum, > [...] > - _bt_checkkeys(scan, itup, truncatt, dir, &continuescan, false, false); > + pstate.prechecked = false; /* prechecked earlier tuple */ I'm not sure that comment is correct, at least it isn't as clear as can be. Maybe something more in the line of the following? + pstate.prechecked = false; /* prechecked didn't cover HIKEY */ +++ b/src/backend/access/nbtree/nbtutils.c > @@ -272,7 +319,32 @@ _bt_preprocess_array_keys(IndexScanDesc scan) > + elemtype = cur->sk_subtype; > + if (elemtype == InvalidOid) > + elemtype = rel->rd_opcintype[cur->sk_attno - 1]; Should we Assert() that this elemtype matches the array element type ARR_ELEMTYPE(arrayval) used to deconstruct the array? > + /* > + * If this scan key is semantically equivalent to a previous equality > + * operator array scan key, merge the two arrays together to eliminate > [...] 
> + if (prevArrayAtt == cur->sk_attno && prevElemtype == elemtype) This is not "a" previous equality key, but "the" previous equality operator array scan key. Do we want to expend some additional cycles for detecting duplicate equality array key types in interleaved order like =int[] =bigint[] =int[]? I don't think it would be very expensive considering the limited number of cross-type equality operators registered in default PostgreSQL, so a simple loop that checks matching element types starting at the first array key scankey for this attribute should suffice. We could even sort the keys by element type if we wanted to fix any O(n) issues for this type of behaviour (though this is _extremely_ unlikely to become a performance issue). > + * qual is unsatisfiable > + */ > + if (num_elems == 0) > + { > + so->qual_ok = false; > + return NULL; > + } I think there's a memory context switch back to oldContext missing here, as well as above at `if (num_nonnulls == 0)`. These were probably introduced by changing from break to return; and the paths aren't yet exercised in regression tests. > +++ b/src/backend/utils/adt/selfuncs.c has various changes in comments that result in spelling and style issues: > + * primitive index scans that will be performed for caller + * primitive index scans that will be performed for the caller. (missing "the", period) > - * share for each outer scan. (Don't pro-rate for ScalarArrayOpExpr, > - * since that's internal to the indexscan.) > + * share for each outer scan The trailing period was lost: + * share for each outer scan. Kind regards, Matthias van de Meent Neon (https://neon.tech)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Matthias van de Meent
Date:
On Wed, 6 Mar 2024 at 22:46, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > > On Wed, 6 Mar 2024 at 01:50, Peter Geoghegan <pg@bowt.ie> wrote: > > At one point Heikki suggested that I just get rid of > > BTScanOpaqueData.arrayKeyData (which has been there for as long as > > nbtree had native support for SAOPs), and use > > BTScanOpaqueData.keyData exclusively instead. I've finally got around > > to doing that now. > > I'm not sure if it was worth the reduced churn when the changes for > the primary optimization are already 150+kB in size; every "small" > addition increases the context needed to review the patch, and it's > already quite complex. To clarify, what I mean here is that merging the changes of both the SAOPs changes and the removal of arrayKeyData seems to increase the patch size and increases the maximum complexity of each component patch's review. Separate patches may make this more reviewable, or not, but no comment was given on why it is better to merge the changes into a single patch. - Matthias
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Benoit Tigeot
Date:
Hello,
I am not up to date with the last version of the patch, but I did a regular benchmark with version 11 and the typical issue we have at the moment, and the results are still very very good. [1]
In terms of performance improvement, the last proposals could be a real game changer for 2 of our biggest databases. We hope that Postgres 17 will contain those improvements.
Kind regards,
Benoit
[1] - https://gist.github.com/benoittgt/ab72dc4cfedea2a0c6a5ee809d16e04d?permalink_comment_id=4972955#gistcomment-4972955
__________
Benoit Tigeot
Le jeu. 7 mars 2024 à 15:36, Peter Geoghegan <pg@bowt.ie> a écrit :
On Tue, Jan 23, 2024 at 3:22 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I could include something less verbose, mentioning a theoretical risk
> to out-of-core amcanorder routines that coevolved with nbtree,
> inherited the same SAOP limitations, and then never got the same set
> of fixes.
Attached is v11, which now says something like that in the commit
message. Other changes:
* Fixed buggy sorting of arrays using cross-type ORDER procs, by
recognizing that we need to consistently use same-type ORDER procs for
sorting and merging the arrays during array preprocessing.
Obviously, when we sort, we compare array elements to other array
elements (all of the same type). This is true independent of whether
the query itself happens to use a cross type operator/ORDER proc, so
we will need to do two separate ORDER proc lookups in cross-type
scenarios.
* No longer subscript the ORDER proc used for array binary searches
using a scankey subscript. Now there is an additional indirection that
works even in the presence of multiple redundant scan keys that cannot
be detected as such due to a lack of appropriate cross-type support
within an opfamily.
This was subtly buggy before now. Requires a little more coordination
between array preprocessing and standard/primitive index scan
preprocessing, which isn't ideal but seems unavoidable.
* Lots of code polishing, especially within _bt_advance_array_keys().
While _bt_advance_array_keys() still works in pretty much exactly the
same way as it did back in v10, there are now better comments.
Including something about why its recursive call to itself is
guaranteed to use a low, fixed amount of stack space, verified using
an assertion. That addresses a concern held by Matthias.
Outlook
=======
This patch is approaching being committable now. Current plan is to
commit this within the next few weeks.
All that really remains now is to research how we might integrate this
work with the recently added continuescanPrechecked/haveFirstMatch
stuff from Alexander Korotkov, if at all. I've put that off until now
because it isn't particularly fundamental to what I'm doing here, and
seems optional.
I would also like to do more performance validation. Things like the
parallel index scan code could stand to be revisited once again. Plus
I should think about the overhead of array preprocessing when
btrescan() is called many times, from a nested loop join -- I should
have something to say about that concern (raised by Heikki at one
point) before too long.
--
Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Wed, Mar 6, 2024 at 4:46 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > On Wed, 6 Mar 2024 at 01:50, Peter Geoghegan <pg@bowt.ie> wrote: > > I think that there is supposed to be a closing parenthesis here? So > > "... (such as those described in <functions-comparisons>") might > > perform...". > > Correct. > > > FWM, if that's what you meant. > > WFM, yes? Then we're in agreement on this. > > At one point Heikki suggested that I just get rid of > > BTScanOpaqueData.arrayKeyData (which has been there for as long as > > nbtree had native support for SAOPs), and use > > BTScanOpaqueData.keyData exclusively instead. I've finally got around > > to doing that now. > > I'm not sure if it was worth the reduced churn when the changes for > the primary optimization are already 150+kB in size; every "small" > addition increases the context needed to review the patch, and it's > already quite complex. I agree that the patch is quite complex, especially relative to its size. > > Note that we no longer need to have an independent representation of > > so->qual_okay for array keys (the convention of setting > > so->numArrayKeys to -1 for unsatisfiable array keys is no longer > > required). There is no longer any need for a separate pass to carry > > over the contents of BTScanOpaqueData.arrayKeyData to > > BTScanOpaqueData.keyData, which was confusing. > > I wasn't very confused by it, but sure. > > > Are you still interested in working directly on the preprocessing > > stuff? > > If you mean my proposed change to merging two equality AOPs, then yes. > I'll try to fit it in tomorrow with the rest of the review. I didn't just mean that stuff. I was also suggesting that you could join the project directly (not just as a reviewer). If you're interested, you could do general work on the preprocessing of arrays. Fancier array-specific preprocessing. For example, something that can transform this: select * from foo where a in (1,2,3) and a < 3; Into this: select * from foo where a in (1,2); Or, something that can just set qual_okay=false, given a query such as: select * from foo where a in (1,2,3) and a < 5; This is clearly doable by reusing the binary search code during preprocessing. The first example transformation could work via a binary search for the constant "3", followed by a memmove() to shrink the array in-place (plus the inequality itself would need to be fully eliminated). The second example would work a little like the array merging thing that the patch has already, except that there'd only be one array involved (there wouldn't be a pair of arrays). > > select * from multi_test where a in (1,99, 182, 183, 184) and a < 183; > > > > Maybe we need to do better with that. What do you think? > > Let me come back to that when I'm done reviewing the final part of nbtutils. Even if this case doesn't matter (which I now doubt), it's probably easier to just get it right than it would be to figure out if we can live without it. > > void > > _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir) > > @@ -519,159 +888,1163 @@ _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir) > > BTScanOpaque so = (BTScanOpaque) scan->opaque; > > int i; > > > > + Assert(so->numArrayKeys); > > + Assert(so->qual_ok); > > Has the requirement for a known scan direction been removed, or should > this still have an Assert(dir != NoMovementScanDirection)? I agree that such an assertion is worth having. Added that locally. > I'm not sure that comment is correct, at least it isn't as clear as > can be. 
Maybe something more in the line of the following? > + pstate.prechecked = false; /* prechecked didn't cover HIKEY */ I agree that that's a little better than what I had. > +++ b/src/backend/access/nbtree/nbtutils.c > > @@ -272,7 +319,32 @@ _bt_preprocess_array_keys(IndexScanDesc scan) > > + elemtype = cur->sk_subtype; > > + if (elemtype == InvalidOid) > > + elemtype = rel->rd_opcintype[cur->sk_attno - 1]; > > Should we Assert() that this elemtype matches the array element type > ARR_ELEMTYPE(arrayval) used to deconstruct the array? Yeah, good idea. > This is not "a" previous equality key, but "the" previous equality > operator array scan key. > Do we want to expend some additional cycles for detecting duplicate > equality array key types in interleaved order like =int[] =bigint[] > =int[]? I don't think it would be very expensive considering the > limited number of cross-type equality operators registered in default > PostgreSQL, so a simple loop that checks matching element types > starting at the first array key scankey for this attribute should > suffice. We could even sort the keys by element type if we wanted to > fix any O(n) issues for this type of behaviour (though this is > _extremely_ unlikely to become a performance issue). Yes, I think that we should probably have this (though likely wouldn't bother sorting the scan keys themselves). You'd need to look-up cross-type operators for this. They could possibly fail, even when nothing else fails, because in principle you could have 3 types involved for only 2 scan keys: the opclass/on-disk type, plus 2 separate types. We could fail to find a cross-type operator for our "2 separate types", while still succeeding in finding 2 cross type operators for each of the 2 separate scan keys (each paired up the indexed column). When somebody writes a silly query with such obvious redundancies, there needs to be a sense of proportion about it. Fixed preprocessing costs don't seem that important. What's important is that we make a reasonable effort to avoid horrible runtime performance when it can be avoided, such as scanning the whole index. (If a user has a complaint about added cycles during preprocessing, then the fix for that is to just not write silly queries. I care much less about added cycles if they have only a fixed, small-ish added cost, that isn't borne by sensibly-written queries.) > > + * qual is unsatisfiable > > + */ > > + if (num_elems == 0) > > + { > > + so->qual_ok = false; > > + return NULL; > > + } > > I think there's a memory context switch back to oldContext missing > here, as well as above at `if (num_nonnulls == 0)`. These were > probably introduced by changing from break to return; and the paths > aren't yet exercised in regression tests. I agree that this is buggy. But it doesn't seem to actually fail in practice (it "accidentally fails to fail"). In practice, the whole index scan is shut down before anything breaks anyway. I mention this only because it's worth understanding that I do have coverage for this case (at least in my own expansive test suite). It just wasn't enough to detect this particular oversight. > > +++ b/src/backend/utils/adt/selfuncs.c > has various changes in comments that result in spelling and style issues: > > > + * primitive index scans that will be performed for caller > + * primitive index scans that will be performed for the caller. > (missing "the", period) I'll just keep the previous wording and punctuation, with one difference: "index scans" will be changed to "primitive index scans". 
> > - * share for each outer scan. (Don't pro-rate for ScalarArrayOpExpr, > > - * since that's internal to the indexscan.) > > + * share for each outer scan > > The trailing period was lost: > + * share for each outer scan. I'm not in the habit of including a period/full stop after a standalone sentence (actually, I'm in the habit of *not* doing so, specifically). But in the interest of consistency with surrounding code (and because it doesn't really matter), I'll add one back here. -- Peter Geoghegan
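A self-contained sketch of the array-shrinking transformation described above (illustrative only; the patch's real code works on Datums, compares them via the opfamily's support routines, and shrinks the array in place): a binary search locates the first element that contradicts the inequality, the array is truncated there, and if nothing survives the whole qual is unsatisfiable.

#include <stdbool.h>
#include <stdio.h>

/*
 * Apply "a < bound" (or "a <= bound" when strict is false) to a sorted,
 * duplicate-free array of equality keys.  Returns the number of surviving
 * elements; 0 means the qual is provably unsatisfiable.
 */
static int
truncate_array_for_upper_bound(const int *elems, int nelems, int bound, bool strict)
{
    int lo = 0, hi = nelems;

    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;
        bool satisfies = strict ? (elems[mid] < bound) : (elems[mid] <= bound);

        if (satisfies)
            lo = mid + 1;
        else
            hi = mid;
    }

    return lo;      /* length of the surviving prefix */
}

int
main(void)
{
    int elems[] = {1, 99, 182, 183, 184};

    /* WHERE a IN (1, 99, 182, 183, 184) AND a < 183 keeps only {1, 99, 182} */
    printf("%d elements survive\n",
           truncate_array_for_upper_bound(elems, 5, 183, true));
    return 0;
}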
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Wed, Mar 6, 2024 at 4:51 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > To clarify, what I mean here is that merging the changes of both the > SAOPs changes and the removal of arrayKeyData seems to increase the > patch size and increases the maximum complexity of each component > patch's review. Removing arrayKeyData probably makes the patch very slightly smaller, actually. But even if it's really the other way around, I'd still like to get rid of it as part of the same commit as everything else. It just makes sense that way. > Separate patches may make this more reviewable, or not, but no comment > was given on why it is better to merge the changes into a single > patch. Fair enough. Here's why: The original SAOP design (commit 9e8da0f7, "Teach btree to handle ScalarArrayOpExpr quals natively") added a layer of indirection between scan->keyData (input scan keys) and so->keyData (output scan keys): it added another scan key array, so->arrayKeyData. There was array-specific preprocessing in _bt_preprocess_array_keys, that happened before the first primitive index scan even began -- that transformed our true input scan keys (scan->keyData) into a copy of the array with limited amounts of array-specific preprocessing already performed (so->arrayKeyData). This made a certain amount of sense at the time, because _bt_preprocess_keys was intended to be called once per primitive index scan. Kind of like the inner side of a nested loop join's inner index scan, where we call _bt_preprocess_keys once per inner-side scan/btrescan call. (Actually, Tom's design has us call _bt_preprocess once per primitive index scan per btrescan call, which might matter in those rare cases where the inner side of a nestloop join had SAOP quals.) What I now propose to do is to just call _bt_preprocess_keys once per btrescan (actually, it's still called once per primitive index scan, but all calls after the first are just no-ops after v12 of the patch). This makes sense because SAOP array constants aren't like nestloop joins with an inner index scans, in one important way: we really can see everything up-front. We can see all of the array elements, and operate on whole arrays as necessary during preprocessing (e.g., performing the array merging thing I added to _bt_preprocess_array_keys). It's not like the next array element is only visible to prepocessing only after the outer side of a nestloop join runs, and next calls btrescan -- so why treat it like that? Conceptually, "WHERE a = 1" is almost the same thing as "WHERE a = any('{1,2}')", so why not use essentially the same approach to preprocessing in both cases? We don't need to copy the input keys into so->arrayKeyData, because the indirection (which is a bit like a fake nested loop join) doesn't buy us anything. v13 of the patch doesn't quite 100% eliminate so->arrayKeyData. While it removes the arrayKeyData field from the BTScanOpaqueData struct, we still use a temporary array (accessed via a pointer that's just a local variable) that's also called arrayKeyData. And that also stores array-preprocessing-only copies of the input scan keys. That happens within _bt_preprocess_keys. So we do "still need arrayKeyData", but we only need it for a brief period at the start of the index scan. It just doesn't make any sense to keep it around for longer than that, in a world where _bt_preprocess_keys "operates directly on arrays". 
That only made sense (a bit of sense) back when _bt_preprocess_keys was subordinate to _bt_preprocess_array_keys, but it's kinda the other way around now. We could probably even get rid of this remaining limited form of arrayKeyData, but that doesn't seem like it would add much. -- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Matthias van de Meent
Date:
On Wed, 6 Mar 2024 at 22:46, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > > On Wed, 6 Mar 2024 at 01:50, Peter Geoghegan <pg@bowt.ie> wrote: > > > > Are you still interested in working directly on the preprocessing > > stuff? > > If you mean my proposed change to merging two equality AOPs, then yes. > I'll try to fit it in tomorrow with the rest of the review. I've attached v14, where 0001 is v13, 0002 is a patch with small changes + some large comment suggestions, and 0003 which contains sorted merge join code for _bt_merge_arrays. I'll try to work a bit on v13/14's _bt_preprocess_keys, and see what I can make of it. > > I have a feeling that I was slightly too optimistic about how > > likely we were to be able to get away with not having certain kinds of > > array preprocessing, back when I posted v12. It's true that the > > propensity of the patch to not recognize "partial > > redundancies/contradictions" hardly matters with redundant equalities, > > but inequalities are another matter. I'm slightly worried about cases > > like this one: > > > > select * from multi_test where a in (1,99, 182, 183, 184) and a < 183; > > > > Maybe we need to do better with that. What do you think? > > Let me come back to that when I'm done reviewing the final part of nbtutils. > > > Attached is v13. > > It looks like there are few issues remaining outside the changes in > nbtutils. I've reviewed the changes to those files, and ~half of > nbtutils (up to _bt_advance_array_keys_increment) now. I think I can > get the remainder done tomorrow. > +_bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir) [...] > + if (!(cur->sk_flags & SK_SEARCHARRAY) && > + cur->sk_strategy != BTEqualStrategyNumber) > + continue; This should use ||, not &&: if it's not an array, or not an equality array key, it's not using an array key slot and we're not interested. Note that _bt_verify_arrays_bt_first does have the right condition already. > + if (readpagetup || result != 0) > + { > + Assert(result != 0); > + return false; > + } I'm confused about this. By asserting !readpagetup after this exit, we could save a branch condition for the !readpagetup result != 0 path. Can't we better assert the inverse just below, or is this specifically for defense-in-depth against bug? E.g. + if (result != 0) + return false; + + Assert(!readpagetup); > + /* > + * By here we have established that the scan's required arrays were > + * advanced, and that they haven't become exhausted. > + */ > + Assert(arrays_advanced || !arrays_exhausted); Should use &&, based on the comment. > + * We generally permit primitive index scans to continue onto the next > + * sibling page when the page's finaltup satisfies all required scan keys > + * at the point where we're between pages. This should probably describe that we permit primitive scans with array keys to continue until we get to the sibling page, rather than this rather obvious and generic statement that would cover even the index scan for id > 0 ORDER BY id asc; or this paragraph can be removed. + if (!all_required_satisfied && pstate->finaltup && + _bt_tuple_before_array_skeys(scan, dir, pstate->finaltup, false, 0, + &so->scanBehind)) + goto new_prim_scan; > +_bt_verify_arrays_bt_first(IndexScanDesc scan, ScanDirection dir) [...] 
> + if (((cur->sk_flags & SK_BT_REQFWD) && ScanDirectionIsForward(dir)) || > + ((cur->sk_flags & SK_BT_REQBKWD) && ScanDirectionIsBackward(dir))) > + continue; I think a simple check if any SK_BT_REQ flag is set should be OK here: The key must be an equality key, and those must be required either in both directions, or in neither direction. ----- Further notes: I have yet to fully grasp what so->scanBehind is supposed to mean. "/* Scan might be behind arrays? */" doesn't give me enough hints here. I find it weird that we call _bt_advance_array_keys for non-required sktrig. Shouldn't that be as easy as doing a binary search through the array? Why does this need to hit the difficult path? Kind regards, Matthias van de Meent
Attachment
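For reference, the essence of the _bt_merge_arrays idea discussed here can be sketched as a single pass over two sorted inputs. This is only an illustration with plain ints; the real function compares Datums using the appropriate ORDER proc. Two equality arrays on the same column reduce to their intersection, and an empty intersection proves the qual unsatisfiable.

#include <stdio.h>

/*
 * Intersect two sorted, duplicate-free arrays.  Returns the number of
 * elements written to out; 0 means no value can satisfy both quals.
 */
static int
merge_arrays_sketch(const int *a, int na, const int *b, int nb, int *out)
{
    int i = 0, j = 0, n = 0;

    while (i < na && j < nb)
    {
        if (a[i] < b[j])
            i++;
        else if (a[i] > b[j])
            j++;
        else
        {
            out[n++] = a[i];    /* element present in both arrays */
            i++;
            j++;
        }
    }
    return n;
}

int
main(void)
{
    /* a = ANY('{1,2,3,4,5}') AND a = ANY('{2,3,4,5,6}') -> a = ANY('{2,3,4,5}') */
    int a[] = {1, 2, 3, 4, 5};
    int b[] = {2, 3, 4, 5, 6};
    int out[5];

    printf("%d common elements\n", merge_arrays_sketch(a, 5, b, 5, out));
    return 0;
}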
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Fri, Mar 8, 2024 at 9:00 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > I've attached v14, where 0001 is v13, 0002 is a patch with small > changes + some large comment suggestions, and 0003 which contains > sorted merge join code for _bt_merge_arrays. I have accepted your changes from 0003. Agree that it's better that way. It's at least a little faster, but not meaningfully more complicated. This is part of my next revision, v15, which I've attached (along with a test case that you might find useful, explained below). > I'll try to work a bit on v13/14's _bt_preprocess_keys, and see what I > can make of it. That's been the big focus of this new v15, which now goes all out with teaching _bt_preprocess_keys with how to deal with arrays. We now do comprehensive normalization of even very complicated combinations of arrays and non-array scan keys in this version. For example, consider this query: select * from multi_test where a = any('{1, 2, 3, 4, 5}'::int[]) and a > 2::bigint and a = any('{2, 3, 4, 5, 6}'::bigint[]) and a < 6::smallint and a = any('{2, 3, 4, 5, 6}'::smallint[]) and a < 4::int; This has a total of 6 input scankeys -- 3 of which are arrays. But by the time v15's _bt_preprocess_keys is done with it, it'll have only 1 scan key -- which doesn't even have an array (not anymore). And so we won't even need to advance the array keys one single time -- there'll simply be no array left to advance. In other words, it'll be just as if the query was written this way from the start: select * from multi_test where a = 3::int; (Though of course the original query will spend more cycles on preprocessing, compared to this manually simplified variant.) In general, preprocessing can now simplify queries like this to the maximum extent possible (without bringing the optimizer into it), no matter how much crud like this is added -- even including adversarial cases, with massive arrays that have some amount of redundancy/contradictoriness to them. It turned out to not be terribly difficult to teach _bt_preprocess_keys everything it could possibly need to know about arrays, so that it can operate on them directly, as a variant of the standard equality strategy (we do still need _bt_preprocess_array_keys for basic preprocessing of arrays, mostly just merging them). This is better overall (in that it gets every little subtlety right), but it also simplified a number of related issues. For example, there is no longer any need to maintain a mapping between so->keyData[]-wise scan keys (output scan keys), and scan->keyData[]-wise scan keys (input scan keys). We can just add a step to fix-up the references to the end of _bt_preprocess_keys, to make life easier within _bt_advance_array_keys. This preprocessing work should all be happening during planning, not during query execution -- that's the design that makes the most sense. This is something we've discussed in the past in the context of skip scan (see my original email to this thread for the reference). It would be especially useful for the very fancy kinds of preprocessing that are described by the MDAM paper, like using an index scan for a NOT IN() list/array (this can actually make sense with a low cardinality index). The structure for preprocessing that I'm working towards (especially in v15) sets the groundwork for making those shifts in the planner, because we'll no longer treat each array constant as its own primitive index scan during preprocessing. 
Right now, on HEAD, preprocessing with arrays kinda treats each array constant like the parameter of an imaginary inner index scan, from an imaginary nested loop join. But the planner really needs to operate on the whole qual, all at once, including any arrays. An actual nestloop join's inner index scan naturally cannot do that, and so might still require runtime/dynamic preprocessing in a world where that mostly happens in the planner -- but that clearly not appropriate for arrays ("WHERE a = 5" and "WHERE a in(4, 5)" are almost the same thing, and so should be handled in almost the same way by preprocessing). > > +_bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir) > [...] > > + if (!(cur->sk_flags & SK_SEARCHARRAY) && > > + cur->sk_strategy != BTEqualStrategyNumber) > > + continue; > > This should use ||, not && Fixed. Yeah, that was a bug. > > + if (readpagetup || result != 0) > > + { > > + Assert(result != 0); > > + return false; > > + } > > I'm confused about this. By asserting !readpagetup after this exit, we > could save a branch condition for the !readpagetup result != 0 path. > Can't we better assert the inverse just below, or is this specifically > for defense-in-depth against bug? E.g. > > + if (result != 0) > + return false; > + > + Assert(!readpagetup); Yeah, that's what it was -- defensively. It seems slightly better as-is, because you'll get an assertion failure if a "readpagetup" caller gets "result == 0". That's never supposed to happen (if it did happen then our ORDER proc won't be in agreement with what our = operator indicated about the same tuple attribute value moments earlier, inside _bt_check_compare). > > + /* > > + * By here we have established that the scan's required arrays were > > + * advanced, and that they haven't become exhausted. > > + */ > > + Assert(arrays_advanced || !arrays_exhausted); > > Should use &&, based on the comment. Fixed by getting rid of the arrays_exhausted variable, which wasn't adding much anyway. > > + * We generally permit primitive index scans to continue onto the next > > + * sibling page when the page's finaltup satisfies all required scan keys > > + * at the point where we're between pages. > > This should probably describe that we permit primitive scans with > array keys to continue until we get to the sibling page, rather than > this rather obvious and generic statement that would cover even the > index scan for id > 0 ORDER BY id asc; or this paragraph can be > removed. It's not quite obvious. The scan's array keys change as the scan makes progress, up to once per tuple read. But even if it really was obvious, it's really supposed to frame the later discussion -- discussion of those less common cases where this isn't what happens. The exceptions. These exceptions are: 1. When a required scan key is deemed "satisfied" only because its value was truncated in a high key finaltup. (Technically this is what _bt_checkkeys has always done, but we're much more sensitive to this stuff now, because we won't necessarily get to make another choice about starting a new primitive index scan for a long time.) 2. When we apply the has_required_opposite_direction_only stuff, and decide to start a new primitive index scan, even though technically all of our required-in-this-direction scan keys are still satisfied. Separately, there is also the potential for undesirable interactions between 1 and 2, which is why we don't let them mix. 
(We have the "if (so->scanBehind && has_required_opposite_direction_only) goto new_prim_scan" gating condition.) > Further notes: > > I have yet to fully grasp what so->scanBehind is supposed to mean. "/* > Scan might be behind arrays? */" doesn't give me enough hints here. Yes, it is complicated. The best explanation is the one I've added to _bt_readpage, next to the precheck. But that does need more work. Note that the so->scanBehind thing solves two distinct problems for the patch (related, but still clearly distinct). These problem are: 1. It makes the existing precheck/continuescan optimization in _bt_readpage safe -- we'd sometimes get wrong answers to queries if we didn't limit application of the optimization to !so->scanBehind cases. I have a test case that proves that this is true -- the one I mentioned in my introduction. I'm attaching that as precheck_testcase.sql now. It might help you to understand so->scanBehind, particularly this point 1 about the basic correctness of the precheck thing (possibly point 2 also). 2. It is used to speculatively visit the next leaf page in corner cases where truncated -inf attributes from the high key are deemed "satisfied". Generally speaking, we don't want to visit the next leaf page unless we're already 100% sure that it might have matches (if any page has matches for our current array keys at all, that is). But in a certain sense we're only guessing. It isn't guaranteed (and fundamentally can't be guaranteed) to work out once on the next sibling page. We've merely assumed that our array keys satisfy truncated -inf columns, without really knowing what it is that we'll find on the next page when we actually visit it at that point (we're not going to see -inf in the non-pivot tuples on the next page, we'll see some real/non-sentinel low-sorting value). We have good practical reasons to not want to treat them as non-matching (though that would be correct), and to take this limited gamble (we can discuss that some more if you'd like). Once you accept that we have to do this, it follows that we need to be prepared to: A. Notice that we're really just guessing in the sense I've described (before leaving for the next sibling leaf page), by setting so->scanBehind. We'll remember that it happened. and: B. Notice that that limited gamble didn't pay off once on the next/sibling leaf page, so that we can cut our losses and start a new primitive index scan at that point. We do this by checking so->scanBehind (along with the sibling page's high key), once on the sibling page. (We don't need explicit handling for the case when it works out, as it almost always will.) If you want to include discussion of problem 1 here too (not just problem 2), then I should add a third thing that we need to notice: C. Notice that we can't do the precheck thing once in _bt_readpage, because it'd be wrong to allow it. > I find it weird that we call _bt_advance_array_keys for non-required > sktrig. Shouldn't that be as easy as doing a binary search through the > array? Why does this need to hit the difficult path? What difficult path? Advancing non-required arrays isn't that different to advancing required ones. We will never advance required arrays when called just to advance a non-required one, obviously. But we can do the opposite. In fact, we absolutely have to advance non-required arrays (as best we can) when advancing a required one (or when the call was triggered by a non-array required scan key). That said, it could have been clearer than it was in earlier versions. 
v15 makes the difference between the non-required scan key trigger and required scan key trigger cases clearer within _bt_advance_array_keys. -- Peter Geoghegan
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Matthias van de Meent
Date:
On Sat, 16 Mar 2024 at 01:12, Peter Geoghegan <pg@bowt.ie> wrote: > > On Fri, Mar 8, 2024 at 9:00 AM Matthias van de Meent > <boekewurm+postgres@gmail.com> wrote: > > I've attached v14, where 0001 is v13, 0002 is a patch with small > > changes + some large comment suggestions, and 0003 which contains > > sorted merge join code for _bt_merge_arrays. > > I have accepted your changes from 0003. Agree that it's better that > way. It's at least a little faster, but not meaningfully more > complicated. Thanks. > > I'll try to work a bit on v13/14's _bt_preprocess_keys, and see what I > > can make of it. > > That's been the big focus of this new v15, which now goes all out with > teaching _bt_preprocess_keys with how to deal with arrays. We now do > comprehensive normalization of even very complicated combinations of > arrays and non-array scan keys in this version. I was thinking about a more unified processing model, where _bt_preprocess_keys would iterate over all keys, including processing of array keys (by merging and reduction to normal keys) if and when found. This would also reduce the effort expended when there are contradictory scan keys, as array preprocessing is relatively more expensive than other scankeys and contradictions are then found before processing of later keys. As I wasn't very far into the work yet it seems I can reuse a lot of your work here. > For example, consider this query: [...] > This has a total of 6 input scankeys -- 3 of which are arrays. But by > the time v15's _bt_preprocess_keys is done with it, it'll have only 1 > scan key -- which doesn't even have an array (not anymore). And so we > won't even need to advance the array keys one single time -- there'll > simply be no array left to advance. In other words, it'll be just as > if the query was written this way from the start: > > select * > from multi_test > where > a = 3::int; > > (Though of course the original query will spend more cycles on > preprocessing, compared to this manually simplified variant.) That's a good improvement, much closer to an optimal access pattern. > It turned out to not be terribly difficult to teach > _bt_preprocess_keys everything it could possibly need to know about > arrays, so that it can operate on them directly, as a variant of the > standard equality strategy (we do still need _bt_preprocess_array_keys > for basic preprocessing of arrays, mostly just merging them). This is > better overall (in that it gets every little subtlety right), but it > also simplified a number of related issues. For example, there is no > longer any need to maintain a mapping between so->keyData[]-wise scan > keys (output scan keys), and scan->keyData[]-wise scan keys (input > scan keys). We can just add a step to fix-up the references to the end > of _bt_preprocess_keys, to make life easier within > _bt_advance_array_keys. > > This preprocessing work should all be happening during planning, not > during query execution -- that's the design that makes the most sense. > This is something we've discussed in the past in the context of skip > scan (see my original email to this thread for the reference). Yes, but IIRC we also agreed that it's impossible to do this fully in planning, amongst others due to joins on array fields. > It > would be especially useful for the very fancy kinds of preprocessing > that are described by the MDAM paper, like using an index scan for a > NOT IN() list/array (this can actually make sense with a low > cardinality index). Yes, indexes such as those on enums. 
Though, in those cases the NOT IN () could be transformed into IN()-lists by the planner, but not the index. > The structure for preprocessing that I'm working towards (especially > in v15) sets the groundwork for making those shifts in the planner, > because we'll no longer treat each array constant as its own primitive > index scan during preprocessing. I hope that's going to be a fully separate patch. I don't think I can handle much more complexity in this one. > Right now, on HEAD, preprocessing > with arrays kinda treats each array constant like the parameter of an > imaginary inner index scan, from an imaginary nested loop join. But > the planner really needs to operate on the whole qual, all at once, > including any arrays. An actual nestloop join's inner index scan > naturally cannot do that, and so might still require runtime/dynamic > preprocessing in a world where that mostly happens in the planner -- > but that clearly not appropriate for arrays ("WHERE a = 5" and "WHERE > a in(4, 5)" are almost the same thing, and so should be handled in > almost the same way by preprocessing). Yeah, if the planner could handle some of this that'd be great. At the same time, I think that this might need to be gated behind a guc for more expensive planner-time deductions. > > > + * We generally permit primitive index scans to continue onto the next > > > + * sibling page when the page's finaltup satisfies all required scan keys > > > + * at the point where we're between pages. > > > > This should probably describe that we permit primitive scans with > > array keys to continue until we get to the sibling page, rather than > > this rather obvious and generic statement that would cover even the > > index scan for id > 0 ORDER BY id asc; or this paragraph can be > > removed. > > It's not quite obvious. [...] > Separately, there is also the potential for undesirable interactions > between 1 and 2, which is why we don't let them mix. (We have the "if > (so->scanBehind && has_required_opposite_direction_only) goto > new_prim_scan" gating condition.) I see. > > Further notes: > > > > I have yet to fully grasp what so->scanBehind is supposed to mean. "/* > > Scan might be behind arrays? */" doesn't give me enough hints here. > > Yes, it is complicated. The best explanation is the one I've added to > _bt_readpage, next to the precheck. But that does need more work. Yeah. The _bt_readpage comment doesn't actually contain the search term scanBehind, so I wasn't expecting that to be documented there. > > I find it weird that we call _bt_advance_array_keys for non-required > > sktrig. Shouldn't that be as easy as doing a binary search through the > > array? Why does this need to hit the difficult path? > > What difficult path? "Expensive" would probably have been a better wording: we do a comparative lot of processing in the !_bt_check_compare() + !continuescan path; much more than the binary searches you'd need for non-required array key checks. > Advancing non-required arrays isn't that different to advancing > required ones. We will never advance required arrays when called just > to advance a non-required one, obviously. But we can do the opposite. > In fact, we absolutely have to advance non-required arrays (as best we > can) when advancing a required one (or when the call was triggered by > a non-array required scan key). I think it's a lot more expensive to do the non-required array key increment for non-required triggers. 
What are we protecting against (or improving) by always doing advance_array_keys on non-required trigger keys? I mean that we should just do the non-required array key binary search inside _bt_check_compare for non-required array keys, as that would skip a lot of the rather expensive other array key infrastructure, and only if we're outside the minimum or maximum bounds of the non-required scankeys should we trigger advance_array_keys (unless scan direction changed). A full review of the updated patch will follow soon. Kind regards, Matthias van de Meent Neon (https://neon.tech)
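The binary-search-in-_bt_check_compare idea suggested above can be sketched as a plain membership filter (simplified and self-contained, not proposed code; it assumes a sorted, duplicate-free array with at least one element): the tuple value is first compared against the array's bounds, and only a value that falls outside them would need to involve the array advancement machinery at all; anything inside the bounds is settled by a binary search and, on a mismatch, simply filtered out.

#include <stdio.h>

typedef enum { FILTER_MATCH, FILTER_NO_MATCH, FILTER_OUT_OF_BOUNDS } filter_result;

/* Membership test against a sorted, duplicate-free non-required array. */
static filter_result
check_nonrequired_array(const int *keys, int nkeys, int tupval)
{
    int lo = 0, hi = nkeys - 1;

    if (tupval < keys[0] || tupval > keys[nkeys - 1])
        return FILTER_OUT_OF_BOUNDS;    /* only here consider advancing */

    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;

        if (keys[mid] == tupval)
            return FILTER_MATCH;
        if (keys[mid] < tupval)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return FILTER_NO_MATCH;             /* inside bounds, just filter it out */
}

int
main(void)
{
    int keys[] = {10, 20, 30};

    printf("%d %d %d\n",
           (int) check_nonrequired_array(keys, 3, 20),   /* FILTER_MATCH */
           (int) check_nonrequired_array(keys, 3, 25),   /* FILTER_NO_MATCH */
           (int) check_nonrequired_array(keys, 3, 40));  /* FILTER_OUT_OF_BOUNDS */
    return 0;
}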
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Matthias van de Meent
Date:
On Sat, 16 Mar 2024 at 01:12, Peter Geoghegan <pg@bowt.ie> wrote: > > On Fri, Mar 8, 2024 at 9:00 AM Matthias van de Meent > <boekewurm+postgres@gmail.com> wrote: > > I've attached v14, where 0001 is v13, 0002 is a patch with small > > changes + some large comment suggestions, and 0003 which contains > > sorted merge join code for _bt_merge_arrays. > > This is part of my next revision, v15, which I've attached (along with > a test case that you might find useful, explained below). > > v15 makes the difference between the non-required scan key trigger and > required scan key trigger cases clearer within _bt_advance_array_keys. OK, here's a small additional review, with a suggestion for additional changes to _bt_preprocess: > @@ -1117,6 +3160,46 @@ _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op, > /* > * The opfamily we need to worry about is identified by the index column. > */ > Assert(leftarg->sk_attno == rightarg->sk_attno); > > + /* > + * If either leftarg or rightarg are equality-type array scankeys, we need > + * specialized handling (since by now we know that IS NULL wasn't used) > + */ > [...] > + } > + > opcintype = rel->rd_opcintype[leftarg->sk_attno - 1]; Here, you insert your code between the comment about which opfamily to choose and the code assigning the opfamily. I think this needs some cleanup. > + * Don't call _bt_preprocess_array_keys_final in this fast path > + * (we'll miss out on the single value array transformation, but > + * that's not nearly as important when there's only one scan key) Why is it OK to ignore it? Or, why don't we apply it here? --- Attached 2 patches for further optimization of the _bt_preprocess_keys path (on top of your v15), according to the following idea: Right now, we do "expensive" processing with xform merging for all keys when we have more than 1 keys in the scan. However, we only do per-attribute merging of these keys, so if there is only one key for any attribute, the many cycles spent in that loop are mostly wasted. By checking for single-scankey attributes, we can short-track many multi-column index scans because they usually have only a single scan key per attribute. The first implements that idea, the second reduces the scope of various variables so as to improve compiler optimizability. I'll try to work a bit more on merging the _bt_preprocess steps into a single main iterator, but this is about as far as I got with clean patches. Merging the steps for array preprocessing with per-key processing and post-processing is proving a bit more complex than I'd anticipated, so I don't think I'll be able to finish that before the feature freeze, especially with other things that keep distracting me. Matthias van de Meent
Attachment
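The fast-path idea in the first attached patch can be sketched roughly as follows (the names and structure are illustrative, not the patch's code): because input scan keys arrive grouped by attribute number, a run of length one for an attribute means there is nothing to merge for that attribute, so the expensive xform machinery only needs to run for attributes constrained by two or more keys.

#include <stdio.h>

typedef struct SketchScanKey
{
    int sk_attno;       /* index attribute number; keys grouped by attno */
} SketchScanKey;

/* Length of the run of consecutive keys on the same attribute, starting at 'start'. */
static int
keys_for_attribute(const SketchScanKey *keys, int nkeys, int start)
{
    int end = start;

    while (end < nkeys && keys[end].sk_attno == keys[start].sk_attno)
        end++;
    return end - start;
}

int
main(void)
{
    /* a = 1 AND b > 2 AND b < 10 AND c = 3 */
    SketchScanKey keys[] = {{1}, {2}, {2}, {3}};
    int nkeys = 4;

    for (int i = 0; i < nkeys;)
    {
        int n = keys_for_attribute(keys, nkeys, i);

        if (n == 1)
            printf("attno %d: single key, copy it through unchanged\n",
                   keys[i].sk_attno);
        else
            printf("attno %d: %d keys, run the full merging logic\n",
                   keys[i].sk_attno, n);
        i += n;
    }
    return 0;
}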
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Thu, Mar 7, 2024 at 10:42 AM Benoit Tigeot <benoit.tigeot@gmail.com> wrote: > I am not up to date with the last version of the patch, but I did a regular benchmark with version 11 and the typical issue we have at the moment, and the results are still very very good. [1] Thanks for providing the test case. It was definitely important back when the ideas behind the patch had not yet fully developed. It helped me to realize that my thinking around non-required arrays (meaning arrays that cannot reposition the scan, and just filter out non-matching tuples) was still sloppy. > In terms of performance improvement, the last proposals could be a real game changer for 2 of our biggest databases. We hope that Postgres 17 will contain those improvements. Current plan is to commit this patch in the next couple of weeks, ahead of Postgres 17 feature freeze. -- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Mon, Mar 18, 2024 at 9:25 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> I was thinking about a more unified processing model, where _bt_preprocess_keys would iterate over all keys, including processing of array keys (by merging and reduction to normal keys) if and when found. This would also reduce the effort expended when there are contradictory scan keys, as array preprocessing is relatively more expensive than other scankeys and contradictions are then found before processing of later keys.

Does it really matter? I just can't get excited about maybe getting one less syscache lookup in queries that involve *obviously* contradictory keys at the SQL level. Especially because we're so much better off with the new design here anyway; calling _bt_preprocess_keys once rather than once per distinct set of array keys is an enormous improvement, all on its own.

My concern with preprocessing overhead is almost entirely limited to pathological performance issues involving extreme (even adversarial) input scan keys/arrays. I feel that if we're going to guarantee that we won't choke in _bt_checkkeys, even given billions of distinct sets of array keys, we should make the same guarantee in _bt_preprocess_keys -- just to avoid POLA violations. But that's the only thing that seems particularly important in the general area of preprocessing performance.

(Preprocessing performance probably does matter quite a bit in more routine cases, where </<= and >/>= are mixed together on the same attribute. But there'll be no new _bt_preprocess_keys operator/function lookups for stuff like that.)

The advantage of not completely merging _bt_preprocess_array_keys with _bt_preprocess_keys is that preserving this much of the existing code structure allows us to still decide how many array keys we'll need for the scan up front (if the scan doesn't end up having an unsatisfiable qual). _bt_preprocess_array_keys will eliminate redundant arrays early, in practically all cases (the exception is when it has to deal with an incomplete opfamily lacking the appropriate cross-type ORDER proc), so we don't have to think about merging arrays after that point. Rather like how we don't have to worry about "WHERE a < any('{1, 2, 3}')" type inequalities after _bt_preprocess_array_keys does an initial round of array-specific preprocessing (_bt_preprocess_keys can deal with those in the same way as it will with standard inequalities).

> > This preprocessing work should all be happening during planning, not during query execution -- that's the design that makes the most sense. This is something we've discussed in the past in the context of skip scan (see my original email to this thread for the reference).
>
> Yes, but IIRC we also agreed that it's impossible to do this fully in planning, amongst others due to joins on array fields.

Even with a nested loop join's inner index scan, where the constants used for each btrescan are not really predictable in the planner, we can still do most preprocessing in the planner, at least most of the time. We could still easily do analysis that is capable of ruling out redundant or contradictory scan keys for any possible parameter value -- seen or unseen. I'd expect this to be the common case -- most of the time these inner index scans need only one simple = operator (maybe 2 = operators).

Obviously, that approach still requires that btrescan at least accept a new set of constants for each new inner index scan invocation.
But that's still much cheaper than calling _bt_preprocess_keys from scratch every time btrescan is called.

> > It would be especially useful for the very fancy kinds of preprocessing that are described by the MDAM paper, like using an index scan for a NOT IN() list/array (this can actually make sense with a low cardinality index).
>
> Yes, indexes such as those on enums. Though, in those cases the NOT IN () could be transformed into IN()-lists by the planner, but not the index.

I think that it would probably be built as just another kind of index path, like the ones we build for SAOPs. Anti-SAOPs? Just as with SAOPs, the disjunctive accesses wouldn't really be something that the planner would need too much direct understanding of (except during costing). I'd only expect the plan to use such an index scan when it wasn't too far off needing a full index scan anyway. Just like with skip scan, the distinction between these NOT IN() index scan paths and a simple full index scan path is supposed to be fuzzy (maybe they'll actually turn out to be full scans at runtime, since the number of primitive index scans is determined dynamically, based on the structure of the index at runtime).

> > The structure for preprocessing that I'm working towards (especially in v15) sets the groundwork for making those shifts in the planner, because we'll no longer treat each array constant as its own primitive index scan during preprocessing.
>
> I hope that's going to be a fully separate patch. I don't think I can handle much more complexity in this one.

Allowing the planner to hook into _bt_preprocess_keys is absolutely not in scope here. I was just making the point that treating array scan keys like just another type of scan key during preprocessing is going to help with that too.

> Yeah. The _bt_readpage comment doesn't actually contain the search term scanBehind, so I wasn't expecting that to be documented there.

Can you think of a better way of explaining it?

> > > I find it weird that we call _bt_advance_array_keys for non-required sktrig. Shouldn't that be as easy as doing a binary search through the array? Why does this need to hit the difficult path?
> >
> > What difficult path?
>
> "Expensive" would probably have been a better wording: we do a comparative lot of processing in the !_bt_check_compare() + !continuescan path; much more than the binary searches you'd need for non-required array key checks.

I've come around to your point of view on this -- at least to a degree. I'm now calling _bt_advance_array_keys from within _bt_check_compare for non-required array keys only. If nothing else, it's clearer this way because it makes it obvious that non-required arrays cannot end the (primitive) scan. There's no further need for the wart in _bt_check_compare's contract about "continuescan=false" being associated with a non-required scan key in this one special case.

> > Advancing non-required arrays isn't that different to advancing required ones. We will never advance required arrays when called just to advance a non-required one, obviously. But we can do the opposite. In fact, we absolutely have to advance non-required arrays (as best we can) when advancing a required one (or when the call was triggered by a non-array required scan key).
>
> I think it's a lot more expensive to do the non-required array key increment for non-required triggers. What are we protecting against (or improving) by always doing advance_array_keys on non-required trigger keys?

We give up on the first non-required array that fails to find an exact match, though, regardless of whether the scan was triggered by a required or a non-required scan key (it's equally useful in both types of array advancement, because they're not all that different). There were and are steps around starting a new primitive index scan at the end of _bt_advance_array_keys, that aren't required when array advancement was triggered by an unsatisfied non-required array scan key. But those steps were skipped given a non-required trigger scan key, anyway.

> I mean that we should just do the non-required array key binary search inside _bt_check_compare for non-required array keys, as that would skip a lot of the rather expensive other array key infrastructure, and only if we're outside the minimum or maximum bounds of the non-required scankeys should we trigger advance_array_keys (unless scan direction changed).

I've thought about optimizing non-required arrays along those lines. But it doesn't really seem to be necessary at all. If we were going to do it, then it ought to be done in a way that works independently of the trigger condition for array advancement (that is, it'd work for non-required arrays that have a required scan key trigger, too).

-- Peter Geoghegan
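As a concrete (if toy) illustration of the simplest kind of parameter-independent analysis mentioned earlier in this message, detecting that a pair of constant range bounds on the same column can never be satisfied needs nothing beyond comparing the two bounds, much as redundant or contradictory scan keys are compared pairwise. This is not planner or nbtree code; the names and integer keys are invented for the example, and it deliberately doesn't exploit the granularity of the key type:

#include <stdbool.h>
#include <stdio.h>

/*
 * True if "col >/>= lower AND col </<= upper" is provably empty from the
 * bounds alone.
 */
static bool
range_is_contradictory(int lower, bool lower_inclusive,
                       int upper, bool upper_inclusive)
{
    if (lower > upper)
        return true;
    if (lower == upper && (!lower_inclusive || !upper_inclusive))
        return true;
    return false;
}

int
main(void)
{
    /* WHERE aid > 5 AND aid < 3: contradictory for every possible aid */
    printf("%d\n", range_is_contradictory(5, false, 3, false));    /* 1 */
    /* WHERE aid >= 3 AND aid <= 5: satisfiable */
    printf("%d\n", range_is_contradictory(3, true, 5, true));      /* 0 */
    return 0;
}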
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Wed, Mar 20, 2024 at 3:26 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> OK, here's a small additional review, with a suggestion for additional changes to _bt_preprocess:

Attached is v16. This has some of the changes that you asked for. But my main focus has been on fixing regressions that were discovered during recent performance validation.

Up until now, my stress testing of the patch (not quite the same thing as benchmarking) more or less consisted of using pgbench to look at two different types of cases. These are:

1. Extreme cases where the patch obviously helps the most, at least insofar as navigating the index structure itself is concerned (note that avoiding heap accesses isn't in scope for the pgbench stress testing). These are all variants of pgbench's SELECT workload, with a huge IN() list containing contiguous integers (we can get maybe a 5.3x increase in TPS here), or integers that are spaced apart by maybe 20 - 50 tuples (where we can get maybe a 20% - 30% improvement in TPS here).

2. The opposite extreme, where an IN() list has integers that are essentially random, and so are spaced very far apart on average. Another variant of pgbench's SELECT workload, with an IN() list. This is a case that the patch has hardly any chance of helping, so the goal here is to just avoid regressions. Since I've been paying attention to this for a long time, this wasn't really a problem for v15.

The regressions showed up in a third type of case, new to this round of testing:

3. Cases where matching tuples are spaced apart by 100 - 300 tuples/integer values. These cases are "somewhere between 1 and 2". Cases where I would expect to win by a smaller amount (say a 5% improvement in TPS), but v15 nevertheless lost by as much as 12% here. This was a problem that seemed to demand a fix.

The underlying reason why all earlier versions of the patch had these regressions came down to their use of a pure linear search (by which I mean having _bt_readpage call _bt_checkkeys again and again on the same page). That was just too slow here. v15 could roughly halve the number of index descents compared to master, as expected -- but that wasn't enough to make up for the overhead of having to plow through hundreds of non-matching tuples on every leaf page. v15 couldn't even break even.

v16 of the patch fixes the problem by adding a limited additional form of "skipping", used *within* a page, as it is scanned by _bt_readpage. (As opposed to skipping over the index by redescending anew, starting from the root page.) In other words, v16 teaches the linear search process to occasionally "look ahead" when it seems like the linear search process has required too many comparisons at that point (comparisons by _bt_tuple_before_array_skeys made from within _bt_checkkeys). You can think of this new "look ahead" mechanism as bridging the gap between case 1 and case 2 (fixing case 3, a hybrid of 1 and 2). We still mostly want to use a "linear search" from within _bt_readpage, but we do benefit from having this fallback when that just isn't working very well. Now even these new type 3 stress tests see an increase in TPS of perhaps 5% - 12%, depending on just how far apart you arrange for matching tuples to be spaced (the total size of the IN list is also relevant, but much less so).
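To make the shape of that "look ahead" idea concrete, here is a toy, self-contained sketch of the general technique. This is deliberately not the patch's actual look-ahead code: the sorted int array stands in for a leaf page, the needle for the scan's current array key, and the miss threshold of 3 and seed distance of 5 are invented for the example.

#include <stdio.h>

/*
 * Return the index of the first "page" item >= needle, occasionally jumping
 * ahead once the plain linear scan has wasted too many comparisons.
 */
static int
find_first_geq(const int *page, int nitems, int needle)
{
    int     pos = 0;
    int     misses = 0;         /* consecutive comparisons that failed */
    int     distance = 0;       /* current look-ahead distance */

    while (pos < nitems && page[pos] < needle)
    {
        if (++misses >= 3)      /* linear scan isn't working very well */
        {
            int     ahead;

            /* ramp the look-ahead distance up with each successful jump */
            distance = (distance == 0) ? 5 : distance * 2;
            ahead = pos + distance;

            if (ahead < nitems && page[ahead] < needle)
            {
                /* sorted order guarantees everything in between fails too */
                pos = ahead;
                continue;
            }

            /* overshot (or ran off the page): resume plain stepping */
            misses = 0;
            distance = 0;
        }
        pos++;
    }
    return pos;                 /* == nitems when nothing qualifies */
}

int
main(void)
{
    int     page[1000];

    for (int i = 0; i < 1000; i++)
        page[i] = i * 2;        /* 0, 2, 4, ..., 1998 */

    printf("%d\n", find_first_geq(page, 1000, 700));    /* prints 350 */
    printf("%d\n", find_first_geq(page, 1000, 1501));   /* prints 751 */
    return 0;
}

The items that the jump skips over never need to be compared individually, which is where the savings in case 3 come from.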
Separately, I added a new optimization to the binary search process that selects the next array element via a binary search of the array (actually, to the function _bt_binsrch_array_skey), which lowers the cost of advancing required arrays that trigger advancement: we only really need one comparison per _bt_binsrch_array_skey call in almost all cases. In practice we'll almost certainly find that the next array element in line is <= the unsatisfied tuple datum (we already know from context that the current/former array element must now be < that same tuple datum at that point, so the conditions under which this doesn't work out right away are narrow). This second optimization is much more general than the first one. It helps with pretty much any kind of adversarial pgbench stress test.

> Here, you insert your code between the comment about which opfamily to choose and the code assigning the opfamily. I think this needs some cleanup.

Fixed.

> > + * Don't call _bt_preprocess_array_keys_final in this fast path
> > + * (we'll miss out on the single value array transformation, but
> > + * that's not nearly as important when there's only one scan key)
>
> Why is it OK to ignore it? Or, why don't we apply it here?

It doesn't seem worth adding any cycles to that fast path, given that the array transformation in question can only help us to convert "= any('{1}')" into "= 1", because there are no other scan keys that can cut down the number of array constants first (through earlier array-specific preprocessing steps). Note that even this can't be said for converting "IN (1)" into "= 1", since the planner already does that for us, without nbtree ever seeing it directly.

> Attached 2 patches for further optimization of the _bt_preprocess_keys path (on top of your v15), according to the following idea:
>
> Right now, we do "expensive" processing with xform merging for all keys when we have more than 1 keys in the scan. However, we only do per-attribute merging of these keys, so if there is only one key for any attribute, the many cycles spent in that loop are mostly wasted.

Why do you say that? I agree that calling _bt_compare_scankey_args() more than necessary might matter, but that doesn't come up when there's only one scan key per attribute. We'll only copy each scan key into the strategy-specific slot for the attribute's temp xform[] array. Actually, it doesn't necessarily come up when there's more than one scan key per index attribute, depending on the strategies in use. That seems like an existing problem on HEAD; I think we should be doing this more. We could stand to do better at preprocessing when inequalities are mixed together. Right now, we can't detect contradictory quals, given a case such as "SELECT * FROM pgbench_accounts WHERE aid > 5 AND aid < 3". It'll only work for something like "WHERE aid = 5 AND aid < 3", so we'll still uselessly descend the index once (not a disaster, but not ideal either).

Honestly, I don't follow what the goal is here. Does it really help to restructure the loop like this? Feels like I'm missing something. We do this when there's only one scan key, sort of, but that's probably not really that important anyway. Maybe it helps a bit for nestloop joins with an inner index scan, where one array scan key is common.

-- Peter Geoghegan
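The "one comparison in almost all cases" point from the start of this message can be illustrated with a similarly self-contained sketch. This is not the real _bt_binsrch_array_skey, whose interface and comparison rules differ; the assumption here is simply that probing the next element in line usually settles the search before a full binary search is needed.

#include <stdio.h>

/*
 * Find the first array element past "cur" that is >= tupdatum, probing the
 * immediately next element before falling back to a binary search.
 */
static int
advance_array_key(const int *arr, int nelems, int cur, int tupdatum)
{
    int     lo = cur + 1;
    int     hi = nelems;        /* exclusive upper bound */

    if (lo < hi)
    {
        /* fast path: one comparison settles it in the common case */
        if (arr[lo] >= tupdatum)
            return lo;
        lo++;                   /* next element is also behind the tuple */
    }

    /* slow path: conventional binary search over whatever remains */
    while (lo < hi)
    {
        int     mid = lo + (hi - lo) / 2;

        if (arr[mid] < tupdatum)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;                  /* == nelems when the array is exhausted */
}

int
main(void)
{
    int     keys[] = {10, 20, 30, 40, 50};

    /* current element is keys[1] = 20; a tuple datum of 25 advances us to
     * keys[2] = 30 with a single comparison */
    printf("%d\n", advance_array_key(keys, 5, 1, 25));  /* prints 2 */
    printf("%d\n", advance_array_key(keys, 5, 1, 55));  /* prints 5 */
    return 0;
}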
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Thu, Mar 21, 2024 at 8:32 PM Peter Geoghegan <pg@bowt.ie> wrote: > Attached is v16. This has some of the changes that you asked for. But > my main focus has been on fixing regressions that were discovered > during recent performance validation. Attached is v17. This revision deals with problems with parallel index scans that use array keys. This patch is now *very* close to being committable. My only outstanding concern is the costing/selfuncs.c aspects of the patch, which I haven't thought about in many months. I'd estimate that I'm now less than a week away from pushing this patch. (Plus I need to decide which tests from my massive test suite should be included in the committed patch, as standard regression tests.) Back to v17. Earlier versions dealt with parallel index scans in a way that could lead to very unbalanced utilization of worker processes (at least with the wrong sort of query/index), which was clearly unacceptable. The underlying issue was the continued use of BTParallelScanDescData.btps_arrayKeyCount field in shared memory, which was still used to stop workers from moving on to the next primitive scan. That old approach made little sense with this new high level design. (I have been aware of the problem for months, but didn't get around to it until now.) I wasn't specifically trying to win any benchmarks here -- I really just wanted to make the design for parallel index scans with SAOPs into a coherent part of the wider design, without introducing any regressions. I applied my usual charter for the patch when working on v17, by more or less removing any nbtree.c parallel index scan code that had direct awareness of array keys as distinct primitive index scans. This fixed the problem of unbalanced worker utilization, as expected, but it also seems to have benefits particular to parallel index scans with array keys -- which was unexpected. We now use an approach that's almost identical to the approach taken by every other type of parallel scan. My new approach naturally reduces contention among backends that want to advance the scan. (Plus parallel index scans get all of the same benefits as serial index scans, of course.) v17 replaces _bt_parallel_advance_array_keys with a new function, called _bt_parallel_primscan_advance, which is now called from _bt_advance_array_keys. The new function doesn't need to increment btscan->btps_arrayKeyCount (nor does it increment a local copy in BTScanOpaqueData.arrayKeyCount). It is called at the specific point that array key advancement *tries* to start another primitive index scan. The difference (relative to the serial case) is that it might not succeed in doing so. It might not be possible to start another primitive index scan, since it might turn out that some other backend (one page ahead, or even one page behind) already did it for everybody. At that point it's easy for the backend that failed to start the next primitive scan to have _bt_advance_array_keys back out of everything. Then the backend just goes back to consuming pages by seizing the scan. This approach preserves the ability of parallel scans to have _bt_readpage release the next page before _bt_readpage even starts examining tuples from the current page. It's possible that 2 or 3 backends (each of which scans its own leaf pages from a group of adjacent pages) will "independently" decide that it's time to start another scan -- but at most one backend can be granted the right to do so. 
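The "at most one backend can be granted the right" rule can be pictured with a tiny stand-alone analogy. This is emphatically not the parallel btree scan code, which keeps its state in the parallel scan's shared memory and also has to handle backing out and later re-seizing the scan; a single atomic flag and C11 atomics are used here purely for illustration.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* stands in for state kept in the parallel scan's shared memory */
static atomic_bool next_primscan_scheduled;

/* returns true iff the calling "backend" won the right to schedule it */
static bool
try_schedule_next_primscan(void)
{
    bool    expected = false;

    /* only the first caller flips the flag; everyone else backs out */
    return atomic_compare_exchange_strong(&next_primscan_scheduled,
                                          &expected, true);
}

int
main(void)
{
    /* two "backends" independently decide it's time for the next scan */
    printf("backend 1: %s\n",
           try_schedule_next_primscan() ? "schedules it" : "backs out");
    printf("backend 2: %s\n",
           try_schedule_next_primscan() ? "schedules it" : "backs out");
    return 0;
}

A backend that loses this kind of race simply goes back to consuming pages, as described above.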
It's even possible that one such backend among several (all of which might be expected to try to start the next primitive scan in unison) will *not* want to do so -- it could know better than the rest. This can happen when the backend lands one page ahead of the rest, and it just so happens to see no reason to not just continue the scan on the leaf level for now. Such a backend can effectively veto the idea of starting another primitive index scan for the whole top-level parallel scan. This probably doesn't really matter much, in and of itself (skipping ahead by only one or two pages is suboptimal but not really a problem), but that's beside the point. The point is that having very flexible rules like this works out to be both simpler and better performing than a more rigid, pessimistic approach would be. In particular, keeping the time that the scan is seized by any one backend to an absolute minimum is important -- it may be better to detect and recover from contention than to try to prevent them altogether. Just to be clear, all of this only comes up during parts of the scan where primitive index scans actually look like a good idea in the first place. It's perfectly possible for the scan to process thousands of array keys, without any backend ever considering starting another primitive index scan. Much of the time, every worker process can independently advance their own private array keys, without that requiring any sort of special coordination (nothing beyond the coordination required by *all* parallel index scans, including those without any SAOP arrays). As I've said many times already, the high level charter of this project is to make scans that have arrays (whether serial or parallel) as similar as possible to any other type of scan. -- Peter Geoghegan
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Tue, Mar 26, 2024 at 2:01 PM Peter Geoghegan <pg@bowt.ie> wrote: > v17 replaces _bt_parallel_advance_array_keys with a new function, > called _bt_parallel_primscan_advance, which is now called from > _bt_advance_array_keys. The new function doesn't need to increment > btscan->btps_arrayKeyCount (nor does it increment a local copy in > BTScanOpaqueData.arrayKeyCount). It is called at the specific point > that array key advancement *tries* to start another primitive index > scan. The difference (relative to the serial case) is that it might > not succeed in doing so. It might not be possible to start another > primitive index scan, since it might turn out that some other backend > (one page ahead, or even one page behind) already did it for > everybody. There was a bug in v17 around how we deal with parallel index scans. Under the right circumstances, a parallel scan with array keys put all of its worker processes to sleep, without subsequently waking them up (so the scan would hang indefinitely). The underlying problem was that v17 assumed that the backend that scheduled the next primitive index scan would also be the one that actually performs that same primitive scan. While it is natural and desirable for the backend that schedules the primitive scan to be the one that actually calls _bt_search, v17 tacitly relied on that to successfully finish the scan. It's particularly important that the leader process always be able to stop participating as a worker immediately, whenever it needs to, and for as long as it needs to. Attached is v18, which adds an additional layer of indirection to fix the bug: worker processes now schedule primitive index scans explicitly. They store their current array keys in shared memory when primitive scans are scheduled (during a parallel scan), allowing other backends to pick things up where the scheduling backend left off. And so with v18, any other backend can be the one that performs the actual descent of the index within _bt_search. The design is essentially the same as before, though. We'd still prefer that it be the backend that scheduled the primitive index scan. Other backends might well be busy finishing off their scan of immediately preceding leaf pages. Separately, v18 also adds many new regression tests. This greatly improves the general situation around test coverage, which I'd put off before now. Now there is coverage for parallel scans with array keys, as well as coverage for the new array-specific preprocessing code. Note: v18 doesn't have any adjustments to the costing, as originally planned. I'll probably need to post a revised patch with improved (or at least polished) costing in the next few days, so that others will have the opportunity to comment before I commit the patch. -- Peter Geoghegan
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Mon, Apr 1, 2024 at 6:33 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Note: v18 doesn't have any adjustments to the costing, as originally planned. I'll probably need to post a revised patch with improved (or at least polished) costing in the next few days, so that others will have the opportunity to comment before I commit the patch.

Attached is v19, which dealt with remaining concerns I had about the costing in selfuncs.c. My current plan is to commit this on Saturday morning (US Eastern time).

I ultimately concluded that selfuncs.c shouldn't be doing anything to cap the estimated number of primitive index scans for an SAOP clause, unless doing so is all but certain to make the estimate more accurate. And so v19 makes the changes in selfuncs.c much closer to the approach used to cost SAOP clauses on master. In short, v19 is a lot more conservative in the changes made to selfuncs.c.

v19 does hold onto the idea of aggressively capping the estimate in extreme cases -- cases where the estimate begins to approach the total number of leaf pages in the index. This is clearly safe (it's literally impossible to have more index descents than there are leaf pages in the index), and probably necessary given that index paths with SAOP filter quals (on indexed attributes) will no longer be generated by the planner.

To recap, since nbtree more or less behaves as if scans with SAOPs are one continuous index scan under the new design from the patch, it was tempting to make code in genericcostestimate/btcostestimate treat it like that, too (the selfuncs.c changes from earlier versions tried to do that). As I said already, I've now given up on that approach in v19.

This does mean that genericcostestimate continues to apply the Mackert-Lohman formula for btree SAOP scans. This is a bit odd in a world where nbtree specifically promises to not repeat any leaf page accesses. I was a bit uneasy about that aspect for a while, but ultimately concluded that it wasn't very important in the grand scheme of things.

The problem with teaching code in genericcostestimate/btcostestimate to treat nbtree scans with SAOPs as one continuous index scan is that it hugely biases genericcostestimate's simplistic approach to estimating numIndexPages. genericcostestimate does this by simply prorating using numIndexTuples, index->pages, and index->tuples. That naive approach probably works alright with simple scalar operators, but it's not going to work with SAOPs directly. I can't think of a good way of estimating numIndexPages with SAOPs, and suspect that the required information just isn't available. It seems best to treat it as a possible area for improvement in a later release. The really important thing for Postgres 17 is that we never wildly overestimate the number of descents when multiple SAOPs are used, each with a medium or large array.

v19 also makes sure that genericcostestimate doesn't allow a non-boundary SAOP clause/non-required array scan key to affect num_sa_scans -- it'll just use the num_sa_scans values used by btcostestimate in v19. This makes sense because nbtree is guaranteed to not start new primitive scans for these sorts of scan keys (plus there's no need to calculate num_sa_scans twice, in two places, using two slightly different approaches).

-- Peter Geoghegan
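Just to make the shape of that cap explicit, here is a minimal sketch (not the committed selfuncs.c code; the names are made up) of the invariant being relied on, namely that the scan can never descend the tree more times than there are leaf pages:

/* Illustrative only: clamp an estimated number of primitive index scans */
static double
clamp_primscan_estimate(double est_descents, double est_leaf_pages)
{
    if (est_descents > est_leaf_pages)
        return est_leaf_pages;
    return est_descents;
}

Everything else about the num_sa_scans estimate stays close to the approach already used on master.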
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Alexander Lakhin
Date:
Hello Peter,

03.04.2024 22:53, Peter Geoghegan wrote:
> On Mon, Apr 1, 2024 at 6:33 PM Peter Geoghegan <pg@bowt.ie> wrote:
>> Note: v18 doesn't have any adjustments to the costing, as originally planned. I'll probably need to post a revised patch with improved (or at least polished) costing in the next few days, so that others will have the opportunity to comment before I commit the patch.
> Attached is v19, which dealt with remaining concerns I had about the costing in selfuncs.c. My current plan is to commit this on Saturday morning (US Eastern time).

Please look at an assertion failure (reproduced starting from 5bf748b86), triggered by the following query:
CREATE TABLE t (a int, b int);
CREATE INDEX t_idx ON t (a, b);
INSERT INTO t (a, b) SELECT g, g FROM generate_series(0, 999) g;
ANALYZE t;
SELECT * FROM t WHERE a < ANY (ARRAY[1]) AND b < ANY (ARRAY[1]);

TRAP: failed Assert("so->numArrayKeys"), File: "nbtutils.c", Line: 560, PID: 3251267

Best regards,
Alexander
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Sun, Apr 7, 2024 at 1:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> Please look at an assertion failure (reproduced starting from 5bf748b86), triggered by the following query:
> CREATE TABLE t (a int, b int);
> CREATE INDEX t_idx ON t (a, b);
> INSERT INTO t (a, b) SELECT g, g FROM generate_series(0, 999) g;
> ANALYZE t;
> SELECT * FROM t WHERE a < ANY (ARRAY[1]) AND b < ANY (ARRAY[1]);
>
> TRAP: failed Assert("so->numArrayKeys"), File: "nbtutils.c", Line: 560, PID: 3251267

I immediately see what's up here. Will fix this in the next short while. There is no bug here in builds without assertions, but it's probably best to keep the assertion, and to just make sure not to call _bt_preprocess_array_keys_final() unless it has real work to do.

The assertion failure demonstrates that _bt_preprocess_array_keys_final can currently be called when there can be no real work for it to do. The problem is that we condition the call to _bt_preprocess_array_keys_final() on whether or not we had to do "real work" back when _bt_preprocess_array_keys() was called, at the start of _bt_preprocess_keys(). That usually leaves us with "so->numArrayKeys > 0", because the "real work" typically includes equality type array keys. But not here -- here you have two SAOP inequalities, which become simple scalar scan keys through early array-specific preprocessing in _bt_preprocess_array_keys(). There are no arrays left at this point, so "so->numArrayKeys == 0".

FWIW I missed this because the tests only cover cases with one SAOP inequality, which will always return early from _bt_preprocess_keys (by taking its generic single scan key fast path). Your test case has 2 scan keys, avoiding the fast path, allowing us to reach _bt_preprocess_array_keys_final().

-- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Tom Lane
Date:
Coverity pointed out something that looks like a potentially live problem in 5bf748b86:

/srv/coverity/git/pgsql-git/postgresql/src/backend/access/nbtree/nbtutils.c: 2950 in _bt_preprocess_keys()
2944                  * need to make sure that we don't throw away an array
2945                  * scan key. _bt_compare_scankey_args expects us to
2946                  * always keep arrays (and discard non-arrays).
2947                  */
2948                 Assert(j == (BTEqualStrategyNumber - 1));
2949                 Assert(xform[j].skey->sk_flags & SK_SEARCHARRAY);
>>>     CID 1596256: Null pointer dereferences (FORWARD_NULL)
>>>     Dereferencing null pointer "array".
2950                 Assert(xform[j].ikey == array->scan_key);
2951                 Assert(!(cur->sk_flags & SK_SEARCHARRAY));
2952             }
2953         }
2954         else if (j == (BTEqualStrategyNumber - 1))

Above this there is an assertion

    Assert(!array || array->num_elems > 0);

which certainly makes it look like array->scan_key could be a null-pointer dereference.

			regards, tom lane
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Sun, Apr 7, 2024 at 8:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Coverity pointed out something that looks like a potentially live > problem in 5bf748b86: > > /srv/coverity/git/pgsql-git/postgresql/src/backend/access/nbtree/nbtutils.c: 2950 in _bt_preprocess_keys() > 2944 * need to make sure that we don't throw away an array > 2945 * scan key. _bt_compare_scankey_args expects us to > 2946 * always keep arrays (and discard non-arrays). > 2947 */ > 2948 Assert(j == (BTEqualStrategyNumber - 1)); > 2949 Assert(xform[j].skey->sk_flags & SK_SEARCHARRAY); > >>> CID 1596256: Null pointer dereferences (FORWARD_NULL) > >>> Dereferencing null pointer "array". > 2950 Assert(xform[j].ikey == array->scan_key); > 2951 Assert(!(cur->sk_flags & SK_SEARCHARRAY)); > 2952 } > 2953 } > 2954 else if (j == (BTEqualStrategyNumber - 1)) > > Above this there is an assertion > > Assert(!array || array->num_elems > 0); > > which certainly makes it look like array->scan_key could be > a null-pointer dereference. But the "Assert(xform[j].ikey == array->scan_key)" assertion is located in a block where it's been established that the scan key (the one stored in xform[j] at this point in execution) must have an array. It has been marked SK_SEARCHARRAY, and uses the equality strategy, so it had better have one or we're in big trouble either way. This is probably very hard for tools like Coverity to understand. We also rely on the fact that only one of the two scan keys (only one of the pair of scan keys that were passed to _bt_compare_scankey_args) can have an array at the point of the assertion that Coverity finds suspicious. It's possible that both of those scan keys actually did have arrays, but _bt_compare_scankey_args just treats that as a case of being unable to prove which scan key was redundant/contradictory due to a lack of suitable cross-type support -- so the assertion won't be reached. Would Coverity stop complaining if I just removed the assertion? I could just do that, I suppose, but that seems backwards to me. -- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> On Sun, Apr 7, 2024 at 8:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Coverity pointed out something that looks like a potentially live problem in 5bf748b86:
>> ... which certainly makes it look like array->scan_key could be a null-pointer dereference.

> But the "Assert(xform[j].ikey == array->scan_key)" assertion is located in a block where it's been established that the scan key (the one stored in xform[j] at this point in execution) must have an array.

> This is probably very hard for tools like Coverity to understand.

It's not obvious to human readers either ...

> Would Coverity stop complaining if I just removed the assertion? I could just do that, I suppose, but that seems backwards to me.

Perhaps this'd help:

-        Assert(xform[j].ikey == array->scan_key);
+        Assert(array && xform[j].ikey == array->scan_key);

If that doesn't silence it, I'd be prepared to just dismiss the warning.

Some work in the comment to explain why we must have an array here wouldn't be out of place either, perhaps.

			regards, tom lane
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Sun, Apr 7, 2024 at 9:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Perhaps this'd help: > > - Assert(xform[j].ikey == array->scan_key); > + Assert(array && xform[j].ikey == array->scan_key); > > If that doesn't silence it, I'd be prepared to just dismiss the > warning. The assertions in question are arguably redundant. There are very similar assertions just a little earlier on, as we initially set up the array stuff (right before _bt_compare_scankey_args is called). I'll just remove the "Assert(xform[j].ikey == array->scan_key)" assertion that Coverity doesn't like, in addition to the "Assert(!array || array->scan_key == i)" assertion, on the grounds that they're redundant. > Some work in the comment to explain why we must have an array here > wouldn't be out of place either, perhaps. There is a comment block about this right above the assertion in question: /* * Both scan keys might have arrays, in which case we'll * arbitrarily pass only one of the arrays. That won't * matter, since _bt_compare_scankey_args is aware that two * SEARCHARRAY scan keys mean that _bt_preprocess_array_keys * failed to eliminate redundant arrays through array merging. * _bt_compare_scankey_args just returns false when it sees * this; it won't even try to examine either array. */ Do you think it needs more work? -- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> The assertions in question are arguably redundant. There are very similar assertions just a little earlier on, as we initially set up the array stuff (right before _bt_compare_scankey_args is called). I'll just remove the "Assert(xform[j].ikey == array->scan_key)" assertion that Coverity doesn't like, in addition to the "Assert(!array || array->scan_key == i)" assertion, on the grounds that they're redundant.

If you're doing that, then surely

    if (j != (BTEqualStrategyNumber - 1) ||
        !(xform[j].skey->sk_flags & SK_SEARCHARRAY))
    {
        ...
    }
    else
    {
        Assert(j == (BTEqualStrategyNumber - 1));
        Assert(xform[j].skey->sk_flags & SK_SEARCHARRAY);
        Assert(xform[j].ikey == array->scan_key);
        Assert(!(cur->sk_flags & SK_SEARCHARRAY));
    }

those first two Asserts are redundant with the "if" as well.

			regards, tom lane
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Sun, Apr 7, 2024 at 9:50 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > If you're doing that, then surely > > if (j != (BTEqualStrategyNumber - 1) || > !(xform[j].skey->sk_flags & SK_SEARCHARRAY)) > { > ... > } > else > { > Assert(j == (BTEqualStrategyNumber - 1)); > Assert(xform[j].skey->sk_flags & SK_SEARCHARRAY); > Assert(xform[j].ikey == array->scan_key); > Assert(!(cur->sk_flags & SK_SEARCHARRAY)); > } > > those first two Asserts are redundant with the "if" as well. I'll get rid of those other two assertions as well, then. -- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Sun, Apr 7, 2024 at 9:57 PM Peter Geoghegan <pg@bowt.ie> wrote: > On Sun, Apr 7, 2024 at 9:50 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > those first two Asserts are redundant with the "if" as well. > > I'll get rid of those other two assertions as well, then. Done that way. -- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Donghang Lin
Date:
Hi Peter
There seems to be an assertion failure with this change in HEAD
TRAP: failed Assert("leftarg->sk_attno == rightarg->sk_attno"), File: "../../src/backend/access/nbtree/nbtutils.c", Line: 3246, PID: 1434532
It can be reproduced by:
create table t(a int);
insert into t select 1 from generate_series(1,10);
create index on t (a desc);
set enable_seqscan = false;
select * from t where a IN (1,2) and a IN (1,2,3);
It's triggered when a scankey's strategy is set to invalid. For a descending ordered column, the strategy needs to get fixed to its commuted strategy, and that doesn't work if the strategy is invalid.
Attached a demo fix.
Regards,
Donghang Lin
(ServiceNow)
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Thu, Apr 18, 2024 at 2:13 AM Donghang Lin <donghanglin@gmail.com> wrote: > It's triggered when a scankey's strategy is set to invalid. While for a descending ordered column, > the strategy needs to get fixed to its commute strategy. That doesn't work if the strategy is invalid. The problem is that _bt_fix_scankey_strategy shouldn't have been doing anything with already-eliminated array scan keys in the first place (whether or not they're on a DESC index column). I just pushed a fix along those lines. Thanks for the report! -- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Alexander Lakhin
Date:
Hello Peter,

07.04.2024 20:18, Peter Geoghegan wrote:
> On Sun, Apr 7, 2024 at 1:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
>> SELECT * FROM t WHERE a < ANY (ARRAY[1]) AND b < ANY (ARRAY[1]);
>>
>> TRAP: failed Assert("so->numArrayKeys"), File: "nbtutils.c", Line: 560, PID: 3251267
> I immediately see what's up here. Will fix this in the next short while. There is no bug here in builds without assertions, but it's probably best to keep the assertion, and to just make sure not to call _bt_preprocess_array_keys_final() unless it has real work to do.

Please look at another case, where a similar Assert (but this time in _bt_preprocess_keys()) is triggered:
CREATE TABLE t (a text, b text);
INSERT INTO t (a, b) SELECT 'a', repeat('b', 100) FROM generate_series(1, 500) g;
CREATE INDEX t_idx ON t USING btree(a);
BEGIN;
DECLARE c CURSOR FOR SELECT a FROM t WHERE a = 'a';
FETCH FROM c;
FETCH RELATIVE 0 FROM c;

TRAP: failed Assert("so->numArrayKeys"), File: "nbtutils.c", Line: 2582, PID: 1130962

Best regards,
Alexander
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Mon, Apr 22, 2024 at 4:00 AM Alexander Lakhin <exclusion@gmail.com> wrote: > Please look at another case, where a similar Assert (but this time in > _bt_preprocess_keys()) is triggered: > CREATE TABLE t (a text, b text); > INSERT INTO t (a, b) SELECT 'a', repeat('b', 100) FROM generate_series(1, 500) g; > CREATE INDEX t_idx ON t USING btree(a); > BEGIN; > DECLARE c CURSOR FOR SELECT a FROM t WHERE a = 'a'; > FETCH FROM c; > FETCH RELATIVE 0 FROM c; > > TRAP: failed Assert("so->numArrayKeys"), File: "nbtutils.c", Line: 2582, PID: 1130962 I'm pretty sure that I could fix this by simply removing the assertion. But I need to think about it a bit more before I push a fix. The test case you've provided proves that _bt_preprocess_keys's new no-op path isn't just used during scans that have array keys (your test case doesn't have a SAOP at all). This was never intended. On the other hand, I think that it's still correct (or will be once the assertion is gone), and it seems like it would be simpler to allow this case (and document it) than to not allow it at all. The general idea that we only need one "real" _bt_preprocess_keys call per btrescan (independent of the presence of array keys) still seems sound. -- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Mon, Apr 22, 2024 at 11:13 AM Peter Geoghegan <pg@bowt.ie> wrote: > I'm pretty sure that I could fix this by simply removing the > assertion. But I need to think about it a bit more before I push a > fix. > > The test case you've provided proves that _bt_preprocess_keys's > new no-op path isn't just used during scans that have array keys (your > test case doesn't have a SAOP at all). This was never intended. On the > other hand, I think that it's still correct (or will be once the assertion is > gone), and it seems like it would be simpler to allow this case (and > document it) than to not allow it at all. Pushed a fix like that just now. Thanks for the report, Alexander. -- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
"Anton A. Melnikov"
Date:
Hi, Peter!

On 20.01.2024 01:41, Peter Geoghegan wrote:
> It is quite likely that there are exactly zero affected out-of-core index AMs. I don't count pgroonga as a counterexample (I don't think that it actually fullfills the definition of a ). Basically, "amcanorder" index AMs more or less promise to be compatible with nbtree, down to having the same strategy numbers. So the idea that I'm going to upset amsearcharray+amcanorder index AM authors is a completely theoretical problem. The planner code evolved with nbtree, hand-in-glove.

From the 5bf748b86bc commit message:
> There is a theoretical risk that removing restrictions on SAOP index paths from the planner will break compatibility with amcanorder-based index AMs maintained as extensions. Such an index AM could have the same limitations around ordered SAOP scans as nbtree had up until now. Adding a pro forma incompatibility item about the issue to the Postgres 17 release notes seems like a good idea.

It seems this commit broke our posted knn_btree patch. [1] If the point from which ORDER BY goes by distance is greater than the elements of the ScalarArrayOp, then the knn_btree algorithm will give only the first tuple. It sorts the elements of the ScalarArrayOp in descending order and starts searching from smaller to larger, and always expects that for each element of the ScalarArrayOp there will be a separate scan. That no longer works. Reproduction is described in [2].

It seems impossible to solve this problem from the knn_btree patch side alone. Could you advise how to deal with this? I would be very grateful.

With the best wishes,

-- Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

[1] https://commitfest.postgresql.org/48/4871/
[2] https://www.postgresql.org/message-id/47adb0b0-6e65-4b40-8d93-20dcecc21395%40postgrespro.ru
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Wed, Jul 31, 2024 at 12:47 AM Anton A. Melnikov <a.melnikov@postgrespro.ru> wrote:
> From the 5bf748b86bc commit message:
>
> > There is a theoretical risk that removing restrictions on SAOP index paths from the planner will break compatibility with amcanorder-based index AMs maintained as extensions. Such an index AM could have the same limitations around ordered SAOP scans as nbtree had up until now. Adding a pro forma incompatibility item about the issue to the Postgres 17 release notes seems like a good idea.

Here you've quoted the commit message's description of the risks around breaking third party index AMs maintained as extensions. FWIW that seems like a rather different problem to the one that you ran into with your KNN nbtree patch.

> It seems this commit broke our posted knn_btree patch. [1] If the point from which ORDER BY goes by distance is greater than the elements of the ScalarArrayOp, then the knn_btree algorithm will give only the first tuple. It sorts the elements of the ScalarArrayOp in descending order and starts searching from smaller to larger, and always expects that for each element of the ScalarArrayOp there will be a separate scan. That no longer works. Reproduction is described in [2].
>
> It seems impossible to solve this problem from the knn_btree patch side alone. Could you advise how to deal with this? I would be very grateful.

Well, you're actually patching nbtree. Your patch isn't upstream of the nbtree code. You're going to have to reconcile your design with the design for advancing arrays that is primarily implemented by _bt_advance_array_keys. I'm not entirely sure what the best way to go about doing that is -- that would require a lot of context about the KNN patch.

It wouldn't be particularly hard to selectively reenable the old Postgres 16 SAOP behavior, though. You could conditionally force "goto new_prim_scan" whenever the KNN patch's mechanism was in use.

That kind of approach might have unintended consequences elsewhere, though. For example it could break things like mark/restore processing by merge joins, which assumes that the scan's current array keys can always be reconstructed after restoring a marked position by resetting the arrays to their first elements (or final elements if it's a backwards scan). On the other hand, I think that you already shouldn't expect anything like that to work with your patch. After all, it changes the order of the tuples returned by the scan, which is already bound to break certain assumptions made by "regular" index scans.

-- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Alexander Lakhin
Date:
Hello Peter,

22.04.2024 20:59, Peter Geoghegan wrote:
>
> Pushed a fix like that just now.

I'm sorry to bother you again, but I've come across another assertion failure. Please try the following query (I use a clean "postgres" database, just after initdb):
EXPLAIN SELECT conname
FROM pg_constraint WHERE conname IN ('pkey', 'ID')
ORDER BY conname DESC;

SELECT conname
FROM pg_constraint WHERE conname IN ('pkey', 'ID')
ORDER BY conname DESC;

It fails for me as below:
                                                      QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Index Only Scan Backward using pg_constraint_conname_nsp_index on pg_constraint  (cost=0.14..4.18 rows=2 width=64)
   Index Cond: (conname = ANY ('{pkey,ID}'::name[]))
(2 rows)

server closed the connection unexpectedly
...
with the stack trace:
...
#5  0x000055a49f81148d in ExceptionalCondition (conditionName=0x55a49f8bb540 "ItemIdHasStorage(itemId)", fileName=0x55a49f8bb4a8 "../../../../src/include/storage/bufpage.h", lineNumber=355) at assert.c:66
#6  0x000055a49f0f2ddd in PageGetItem (page=0x7f97cbf17000 "", itemId=0x7f97cbf2f064) at ../../../../src/include/storage/bufpage.h:355
#7  0x000055a49f0f9367 in _bt_checkkeys_look_ahead (scan=0x55a4a0ac4548, pstate=0x7ffd1a103670, tupnatts=2, tupdesc=0x7f97cb5d7be8) at nbtutils.c:4105
#8  0x000055a49f0f8ac3 in _bt_checkkeys (scan=0x55a4a0ac4548, pstate=0x7ffd1a103670, arrayKeys=true, tuple=0x7f97cbf18890, tupnatts=2) at nbtutils.c:3612
#9  0x000055a49f0ebb4b in _bt_readpage (scan=0x55a4a0ac4548, dir=BackwardScanDirection, offnum=20, firstPage=true) at nbtsearch.c:1863
...
(gdb) f 7
#7  0x000055a49f0f9367 in _bt_checkkeys_look_ahead (scan=0x55a4a0ac4548, pstate=0x7ffd1a103670, tupnatts=2, tupdesc=0x7f97cb5d7be8) at nbtutils.c:4105
4105                    ahead = (IndexTuple) PageGetItem(pstate->page,
(gdb) p aheadoffnum
$1 = 24596
(gdb) p pstate->offnum
$2 = 20
(gdb) p pstate->targetdistance
$3 = -24576

Best regards,
Alexander
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Mon, Aug 26, 2024 at 10:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> I'm sorry to bother you again, but I've come across another assertion failure.

You've found a real bug. I should be the one apologizing - not you.

> Please try the following query (I use a clean "postgres" database, just after initdb):
> EXPLAIN SELECT conname
> FROM pg_constraint WHERE conname IN ('pkey', 'ID')
> ORDER BY conname DESC;
>
> SELECT conname
> FROM pg_constraint WHERE conname IN ('pkey', 'ID')
> ORDER BY conname DESC;

The problem is that _bt_checkkeys_look_ahead didn't quite get everything right with sanitizing the page offset number it uses to check if a later tuple is before the recently advanced array scan keys. The page offset itself was checked, but in a way that was faulty in cases where the int16 we use could overflow.

I can fix the bug by making sure that pstate->targetdistance (an int16 variable) can only be doubled when it actually makes sense. That way there can never be an int16 overflow, and so the final offnum cannot wrap around to a value that's much higher than the page's max offset number. This approach works:

    /*
     * The look ahead distance starts small, and ramps up as each call here
     * allows _bt_readpage to skip over more tuples
     */
    if (!pstate->targetdistance)
        pstate->targetdistance = LOOK_AHEAD_DEFAULT_DISTANCE;
-   else
+   else if (pstate->targetdistance < MaxIndexTuplesPerPage / 2)
        pstate->targetdistance *= 2;

I'll push a fix along these lines shortly. Thanks for the report!

-- Peter Geoghegan
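For anyone trying to follow along at home, the failure mode is easy to reproduce in isolation. The following stand-alone program uses assumed values (a seed distance of 5, which is at least consistent with the targetdistance of -24576 from the report, plus the offnum of 20 from the backtrace) to show how unbounded doubling of a 16-bit distance wraps negative, which then turns a backwards-scan offset adjustment into a garbage offset far beyond the page's maximum:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    int16_t     distance = 0;
    uint16_t    offnum = 20;    /* offset from the reported backtrace */

    /* simulate 14 look-ahead calls, doubling without the new cap */
    for (int call = 1; call <= 14; call++)
    {
        if (distance == 0)
            distance = 5;
        else
            distance *= 2;      /* wraps (implementation-defined) past 32767 */
    }

    printf("distance = %d\n", distance);                /* -24576 */
    printf("look-ahead offset = %u\n",
           (uint16_t) (offnum - distance));             /* 24596  */
    return 0;
}

Capping the doubling, as in the fix above, keeps the distance well inside the int16 range, so the computed offset can never escape the page.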