Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan - Mailing list pgsql-hackers
From: Peter Geoghegan
Subject: Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Msg-id: CAH2-Wzk0B0Ccso31cr1Eo8mfEE5jaggez6kD6WfmPhOQzhkSPA@mail.gmail.com
In response to: Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan (Matthias van de Meent <boekewurm+postgres@gmail.com>)
List: pgsql-hackers
On Thu, Jan 18, 2024 at 11:39 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> > I'm not going to break out the planner changes, because they're *not*
> > independent in any way.
>
> The planner changes depend on the btree changes, that I agree with.
> However, I don't think that the btree changes depend on the planner
> changes.

Yeah, technically it would be possible to break out the indxpath.c
changes -- that wouldn't leave the tree in a state with visibly-wrong
behavior at any point. But that's far from the only thing that matters.
If that were the standard we applied, then I might have to split the
patch into 10 or more patches.

What it boils down to is this: it is totally natural for me to think of
the planner changes as integral to the nbtree changes, so I'm going to
include them together in one commit. That's just how the code was
written -- I always thought about it as one single thing. The majority
of the queries that I've used to promote the patch directly rely on the
planner changes in one way or another (even back in v1, when the
planner side of things had lots of kludges).

I don't always follow that standard. Sometimes it is useful to break
things up further, such as when it happens to make the high-level
division of labor a little bit clearer. The example that comes to mind
is commit dd299df8 and commit fab25024. These nbtree commits were
essentially one piece of work that was broken into two, but committed
within minutes of each other, and left the tree in a momentary state
that was not-very-useful (though still correct). That made sense to me
at the time because the code in question was mechanically distinct,
while at the same time modifying some of the same nbtree files. Drawing
a line under it seemed to make sense. I will admit that there is some
amount of subjective judgement (gasp!) here.
Plus I'll acknowledge that it's *slightly* odd that the most compelling
cases for the SAOP patch almost all involve savings in heap page
accesses, even though it is fundamentally a B-Tree patch. But that's
just how it came out. Slightly odd things happen all the time.

> I would agree with you if this was about new APIs and features, but
> here existing APIs are being repurposed without changing them. A
> maintainer of index AMs would not look at the commit title 'Enhance
> nbtree ScalarArrayOp execution' and think "oh, now I have to make sure
> my existing amsearcharray+amcanorder handling actually supports
> non-prefix arrays keys and return data in index order".

This is getting ridiculous. It is quite likely that there are exactly
zero affected out-of-core index AMs. I don't count pgroonga as a
counterexample (I don't think that it actually fulfills the definition
of a ). Basically, "amcanorder" index AMs more or less promise to be
compatible with nbtree, down to having the same strategy numbers. So
the idea that I'm going to upset amsearcharray+amcanorder index AM
authors is a completely theoretical problem. The planner code evolved
with nbtree, hand-in-glove.

I'm more than happy to add a reminder to the commit message about
adding a proforma listing to the compatibility section of the Postgres
17 release notes. Can we actually agree on which theoretical index AM
types are affected first, though? Isn't it actually
amsearcharray+amcanorder+amcanmulticol external index AMs only? Do I
have that right? BTW, how should we phrase this compatibility note, so
that it can be understood? It's rather awkward.

> > As I said to Heikki, thinking about this some more is on my todo
> > list. I mean the way that this worked had substantial revisions on
> > HEAD right before I posted v9. v9 was only to fix the bit rot that
> > that caused.
>
> Then I'll be waiting for that TODO item to be resolved.
My TODO item is to resolve the question of whether and to what extent
these two optimizations should be combined. It's possible that I'll
decide that it isn't worth it, for whatever reason. At this point I'm
still totally neutral.

> > Even single-digit thousands of array elements should be considered
> > huge. Plus this code path is only hit with poorly written queries.
>
> Would you accept suggestions?

You mean that you want to have a go at it yourself? Sure, go ahead. I
cannot promise that I'll accept your suggested revisions, but if they
aren't too invasive/complicated compared to what I have now, then I
will just accept them. I maintain that this isn't really necessary, but
it might be the path of least resistance at this point.

> > > Please fix this, as it shows up in profiling of large array
> > > merges. Additionally, as it merges two arrays of unique items
> > > into one, storing only matching entries, I feel that it is quite
> > > wasteful to do this additional allocation here. Why not reuse the
> > > original allocation immediately?
> >
> > Because that's adding more complexity for a code path that will
> > hardly ever be used in practice. For something that happens once,
> > during a preprocessing step.
>
> Or many times, when we're in a parameterized loop, as was also
> pointed out by Heikki.

That's not really what Heikki said. Heikki had a general concern about
the startup costs with nestloop joins.

> > > Additionally, could you try to create a single point of entry for
> > > the array key stuff that covers the new systems? I've been trying
> > > to wrap my head around this, and it's taking a lot of effort.
> >
> > I don't understand what you mean here.
>
> The documentation (currently mostly code comments) is extensive, but
> still spread around various inline comments and comments on
> functions; with no place where a clear top-level design is detailed.
> I'll agree that we don't have that for the general systems in
> _bt_next() either, but especially with this single large patch it's
> difficult to grasp the specific differences between the various
> functions.

Between what functions? The guts of this work are in the new
_bt_advance_array_keys and _bt_tuple_before_array_skeys (honorable
mention for _bt_array_keys_remain). It seems pretty well localized to
me.

Granted, there are a few places where we rely on things being set up a
certain way by other code that's quite some distance away (from the
code doing the depending). For example, the new additions to
_bt_preprocess_keys that we need later on, in
_bt_advance_array_keys/_bt_update_keys_with_arraykeys. These bits and
pieces of code are required, but are not particularly crucial to
understanding the general design.

For the most part the design is that we cede control of array key
advancement to _bt_checkkeys() -- nothing much else changes. There are
certainly some tricky details behind the scenes, which we should be
verifying via testing and especially via robust invariants (verified
with assertions). But this is almost completely hidden from the
nbtsearch.c caller -- there are no real changes required there.

> It wasn't really meant as a restatement: It is always unsafe to ignore
> upper bits, as the index isn't organized by that. However, it *could*
> be safe to ignore the bits with lowest significance, as the index
> would still be organized correctly even in that case, for int4 at
> least. Similar to how you can have required scankeys for the prefix of
> an (int2, int2) index, but not the suffix (as of right now at least).

It isn't safe to assume that types appearing together within an
opfamily are binary coercible (or capable of being type-punned like
this) in the general case. That often works, but it isn't reliable.
Even with integer_ops, it breaks on big-endian machines.
> This won't ever be an issue when there is a requirement that if a = b
> and b = c then a = c must hold for all triples of typed values and
> operations in a btree opclass, as you mention above.

Right. It also doesn't matter if there are redundant equality
conditions that we cannot formally prove are redundant during
preprocessing, for want of appropriate cross-type comparison routines
from the opfamily. The actual behavior of index scans (in terms of when
and how we skip) won't be affected by that at all. The only problem
that it creates is that we'll waste CPU cycles, relative to the case
where we can somehow detect this redundancy.

In case it wasn't clear before: the optimizations I've added to
_bt_preprocess_array_keys are intended to compensate for the
pessimizations added to _bt_preprocess_keys (both the optimizations and
the pessimizations only affect equality type array scan keys). I don't
think that this needs to be perfect; it just seems like a good idea.

> > > I'm also no fan of the (tail) recursion. I would agree that this
> > > is unlikely to consume a lot of stack, but it does consume
> > > stackspace nonetheless, and I'd prefer if it was not done this
> > > way.
> >
> > I disagree. The amount of stack space it consumes in the worst case
> > is fixed.
>
> Is it fixed? Without digging very deep into it, it looks like it can
> iterate on the order of n_scankeys deep? The comment above does hint
> on 2 iterations, but is not very clear about the conditions and why.

The recursive call to _bt_advance_array_keys is needed to deal with
unsatisfied-and-required inequalities that were not detected by an
initial call to _bt_check_compare, following a second retry call to
_bt_check_compare. We know that this recursive call to
_bt_advance_array_keys cannot recurse again because we've already
identified which specific inequality scan key it is that's a
still-unsatisfied inequality, following an initial round of array key
advancement.
We're pretty much instructing _bt_advance_array_keys to perform
"beyond_end_advance" type advancement for the specific known
unsatisfied inequality scan key (which must be required in the current
scan direction, per the assertions for that) here. So clearly it cannot
end up recursing any further -- the recursive call is conditioned on
"all_arraylike_sk_satisfied && arrays_advanced", but that'll be unset
when "ikey == sktrig && !array" from inside the loop.

This is a narrow edge-case -- it is an awkward one. Recursion does seem
natural here, because we're essentially repeating the call to
_bt_advance_array_keys from inside _bt_advance_array_keys, rather than
leaving it up to the usual external caller to do later on. If you
ripped this code out it would barely be noticeable at all. But it seems
worth it so that we can make a uniform promise to always advance the
array keys to the maximum extent possible, based on what the caller's
tuple tells us about the progress of the scan.

Since all calls to _bt_advance_array_keys are guaranteed to advance the
array keys to the maximum extent that's safely possible (barring this
one edge-case with recursive calls), it almost follows that there can't
be another recursive call. This is the one edge case where the
implementation requires a second pass over the tuple -- we advanced the
array keys to all-matching values, but nevertheless couldn't finish
there due to the unsatisfied required inequality (which must have
become unsatisfied _at the same time_ as some earlier required equality
scan key for us to end up requiring a recursive call).

> > Because in one case we follow the convention for scan keys, whereas
> > so->orderProcs is accessed via an attribute number subscript.
>
> Okay, but how about this in _bt_merge_arrays?
>
> + Datum *elem = elems_orig + i;
>
> I'm not familiar with the scan key convention, as most other places
> use reference+subscripting.
I meant the convention used in code like _bt_check_compare (which is
what we call _bt_checkkeys on HEAD, basically). Note that the
_bt_merge_arrays code that you've highlighted isn't iterating through
so->keyData[] -- it is iterating through the function caller's elements
array, which actually comes from so->arrayKeys[].

Like every other Postgres contributor, I do my best to follow the
conventions established by existing code. Sometimes that leads to
pretty awkward results, where CamelCase and underscore styles are
closely mixed together, because it works out to be the most consistent
way of doing it overall.

--
Peter Geoghegan