Thread: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
I've been working on a variety of improvements to nbtree's native ScalarArrayOpExpr execution. This builds on Tom's work in commit 9e8da0f7.

Attached patch is still at the prototype stage. I'm posting it as v1 a little earlier than I usually would because there has been much back and forth about it on a couple of other threads involving Tomas Vondra and Jeff Davis -- seems like it would be easier to discuss with working code available.

The patch adds two closely related enhancements to ScalarArrayOp execution by nbtree:

1. Execution of quals with ScalarArrayOpExpr clauses during nbtree index scans (for equality-strategy SK_SEARCHARRAY scan keys) can now "advance the scan's array keys locally", which sometimes avoids significant amounts of unneeded pinning/locking of the same set of index pages.

SAOP index scans become capable of eliding primitive index scans for the next set of array keys in line in cases where it isn't truly necessary to descend the B-Tree again. Index scans are now capable of "sticking with the existing leaf page for now" when it is determined that the end of the current set of array keys is physically close to the start of the next set of array keys (the next set in line to be materialized by the _bt_advance_array_keys state machine). This is often possible.

Naturally, we still prefer to advance the array keys in the traditional way ("globally") much of the time. That means we'll perform another _bt_first/_bt_search descent of the index, starting a new primitive index scan. Whether we try to skip pages on the leaf level or stick with the current primitive index scan (by advancing array keys locally) is likely to vary a great deal. Even during the same index scan. Everything is decided dynamically, which is the only approach that really makes sense.

This optimization can significantly lower the number of buffers pinned and locked in cases with significant locality, and/or with many array keys with no matches. The savings (when measured in buffers pinned/locked) can be as high as 10x, 100x, or even more. Benchmarking has shown that variants of "pgbench -S" designed to stress the implementation (hundreds of array constants) can have up to 5.5x higher transaction throughput under concurrent load with the patch. Less extreme cases (10 array constants, spaced apart) see about a 20% improvement in throughput. There are similar improvements to latency for the patch, in each case.

2. The optimizer now produces index paths with multiple SAOP clauses (or other clauses we can safely treat as "equality constraints") on each of the leading columns from a composite index -- all while preserving index ordering/useful pathkeys in most cases.

The nbtree work from item 1 is useful even with the simplest IN() list query involving a scan of a single column index. Obviously, it's very inefficient for the nbtree code to use 100 primitive index scans when 1 is sufficient. But that's not really why I'm pursuing this project. My real goal is to implement (or to enable the implementation of) a whole family of useful techniques for multi-column indexes. I call these "MDAM techniques", after the 1995 paper "Efficient Search of Multidimensional B-Trees" [1][2] -- MDAM is short for "multidimensional access method". In the context of the paper, "dimension" refers to dimensions in a decision support system.
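To give a minimal sense of what "the next set of array keys in line" means in practice, here is a sketch (the index is hypothetical; the real test cases appear below):

-- assume a composite index on tenk1 (two, four)
select count(*) from tenk1 where two in (0, 1) and four in (1, 2);

-- The _bt_advance_array_keys state machine steps through the cartesian
-- product of the arrays: (two, four) = (0, 1), (0, 2), (1, 1), (1, 2).
-- On master, each of those four combinations gets its own primitive index
-- scan (another _bt_first/_bt_search descent). With the patch, the scan
-- can often advance the array keys locally instead, and simply continue
-- on the current leaf page when the next combination is physically close by.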
The most compelling cases for the patch all involve multiple index columns with multiple SAOP clauses (especially where each column represents a separate "dimension", in the DSS sense). It's important that index sort order be preserved whenever possible, too. Sometimes this is directly useful (e.g., because the query has an ORDER BY), but it's always indirectly needed, on the nbtree side (when the optimizations are applicable at all). The new nbtree code now has special requirements surrounding SAOP search type scan keys with composite indexes. These requirements make changes in the optimizer all but essential.

Index order
===========

As I said, there are cases where preserving index order is immediately and obviously useful, in and of itself. Let's start there. Here's a test case that you can run against the regression test database:

pg@regression:5432 =# create index order_by_saop on tenk1(two,four,twenty);
CREATE INDEX

pg@regression:5432 =# EXPLAIN (ANALYZE, BUFFERS)
select ctid, thousand from tenk1
where two in (0,1) and four in (1,2) and twenty in (1,2)
order by two, four, twenty limit 20;

With the patch, this query gets 13 buffer hits. On the master branch, it gets 1377 buffer hits -- which exceeds the number you'll get from a sequential scan by about 4x. No coaxing was required to get the planner to produce this plan on the master branch.

Almost all of the savings shown here are related to heap page buffer hits -- the nbtree changes don't directly help in this particular example (strictly speaking, you only need the optimizer changes to get this result). Obviously, the immediate reason why the patch wins by so much is because it produces a plan that allows the LIMIT to terminate the scan far sooner.

Benoit Tigeot (CC'd) happened to run into this issue organically -- that was also due to heap hits, a LIMIT, and so on. As luck would have it, I stumbled upon his problem report (in the Postgres slack channel) while I was working on this patch. He produced a fairly complete test case, which was helpful [3]. This example is more or less just a distillation of his test case, designed to be easy for a Postgres hacker to try out for themselves.

There are also variants of this query where a LIMIT isn't the crucial factor, and where index page hits are the problem. This query uses an index-only scan, both on master and with the patch (same index as before):

select count(*), two, four, twenty from tenk1
where two in (0, 1)
and four in (1, 2, 3, 4)
and twenty in (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
group by two, four, twenty
order by two, four, twenty;

The patch gets 18 buffer hits for this query. That outcome makes intuitive sense, since this query is highly unselective -- it's approaching the selectivity of the query "select count(*) from tenk1". The simple count(*) query gets 19 buffer hits for its own index-only scan, confirming that the patch managed to skip only one or two leaf pages in the complicated "group by" variant of the count(*) query. Overall, the GroupAggregate plan used by the patch is slower than the simple count(*) case (despite touching fewer pages). But both plans have *approximately* the same execution cost, which makes sense, since they both have very similar selectivities.

The master branch gets 245 buffer hits for the same group by query. This is almost as many hits as a sequential scan would require -- even though there are precisely zero heap accesses needed by the underlying index-only scan.
As with the first example, no planner coaxing was required to get this outcome on master. It is inherently very difficult to predict how selective a query like this will be using conventional statistics. But that's not actually the problem in this example -- the planner gets that part right, on this occasion. The real problem is that there is a multiplicative factor to worry about on master, when executing multiple SAOPs. That makes it almost impossible to predict the number of pages we'll pin. While with the patch, scans with multiple SAOPs are often fairly similar to scans that happen to just have one on the leading column.

With the patch, it is simply impossible for an SAOP index scan to visit any single leaf page more than once. Just like a conventional index scan. Whereas right now, on master, using more than one SAOP clause for a multi-column index seems to me to be a wildly risky proposition. You can easily have cases that work just fine on master, while only slight variations of the same query see costs explode (especially likely with a LIMIT). ISTM that there is significant value in having a pretty accurate idea of the worst case in the planner. Giving nbtree the ability to skip or not skip dynamically, based on actual conditions in the index (not on statistics), seems like it has a lot of potential as a way of improving performance *stability*. Personally I'm most interested in this aspect of the project.

Note: we can visit internal pages more than once, but that seems to make a negligible difference to the overall cost profile of scans. Our policy is to not charge an I/O cost for those pages. Plus, the number of internal page accesses is dramatically reduced (it's just not guaranteed that there won't be any repeat accesses for internal pages, is all).

Note also: there are hard-to-pin-down interactions between the immediate problem on the nbtree side, and the use of filter quals rather than true index quals, where the use of index quals is possible in principle. Some problematic cases see excessive amounts of heap page hits only (as with my first example query). Other problematic cases see excessive amounts of index page hits, with little to no impact on heap page hits at all (as with my second example query). Some combination of the two is also possible.

Safety
======

As mentioned already, the ability to "advance the current set of array keys locally" during a scan (the nbtree work in item 1) actually relies on the optimizer work in item 2 -- it's not just a question of unlocking the potential of the nbtree work. Now I'll discuss those aspects in a bit more detail.

Without the optimizer work, nbtree will produce wrong answers to queries, in a way that resembles the complaint addressed by historical bugfix commit 807a40c5. This incorrect behavior (if the optimizer were to permit it) would only be seen when there are multiple arrays/columns, and an inequality on a leading column -- just like with that historical bug. (It works both ways, though -- the nbtree changes also make the optimizer changes safe by limiting the worst case, which would otherwise be too much of a risk to countenance. You can't separate one from the other.)

The primary change on the optimizer side is the addition of logic to differentiate between the following two cases when building an index path in indxpath.c:

* Unsafe: Cases where it's fundamentally unsafe to treat multi-column-with-SAOP-clause index paths as returning tuples in a useful sort order. For example, the test case committed as part of that bugfix involves an inequality, so it continues to be treated as unsafe (the query appears just after this list).

* Safe: Cases where (at least in theory) bugfix commit 807a40c5 went further than it really had to. Those cases get to use the optimization, and usually get to have useful pathkeys.
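An example of the unsafe kind of case (this is the query from the regression test added by that bugfix):

explain (costs off) SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
ORDER BY thousand;

Each primitive index scan (one per "tenthous" array key) returns its own rows in "thousand" order, but the concatenation of the two scans generally doesn't, so the path can't be treated as having useful pathkeys here.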
My optimizer changes are very kludgey. I came up with various ad-hoc rules to distinguish between the safe and unsafe cases, without ever really placing those changes into some kind of larger framework. That was enough to validate the general approach in nbtree, but it certainly has problems -- glaring problems.

The biggest problem of all may be my whole "safe vs unsafe" framing itself. I know that many of the ostensibly unsafe cases are in fact safe (with the right infrastructure in place), because the MDAM paper says just that. The optimizer can't support inequalities right now, but the paper describes how to support "NOT IN ( )" lists -- clearly an inequality! The current ad-hoc rules are at best incomplete, and at worst are addressing the problem in fundamentally the wrong way.

CNF -> DNF conversion
=====================

Like many great papers, the MDAM paper takes one core idea, and finds ways to leverage it to the hilt. Here the core idea is to take predicates in conjunctive normal form (an "AND of ORs"), and convert them into disjunctive normal form (an "OR of ANDs"). DNF quals are logically equivalent to CNF quals, but ideally suited to SAOP-array style processing by an ordered B-Tree index scan -- they reduce everything to a series of non-overlapping primitive index scans that can be processed in keyspace order. We already do this today in the case of SAOPs, in effect. The nbtree "next array keys" state machine already materializes values that can be seen as MDAM style DNF single value predicates. The state machine works by outputting the cartesian product of the arrays as a multi-column index is scanned, but that could be taken a lot further in the future. We can use essentially the same kind of state machine to do everything described in the paper -- ultimately, it just needs to output a list of disjuncts, like the DNF clauses that the paper shows in "Table 3".

In theory, anything can be supported via a sufficiently complete CNF -> DNF conversion framework. There will likely always be the potential for unsafe/unsupported clauses and/or types in an extensible system like Postgres, though. So we will probably need to retain some notion of safety. It seems like more of a job for nbtree preprocessing (or some suitably index-AM-agnostic version of the same idea) than the optimizer, in any case. But that's not entirely true, either (that would be far too easy).

The optimizer still needs to optimize. It can't very well do that without having some kind of advance notice of what is and is not supported by the index AM. And, the index AM cannot just unilaterally decide that index quals actually should be treated as filter/qpquals, after all -- it doesn't get a veto. So there is a mutual dependency that needs to be resolved. I suspect that there needs to be a two-way conversation between the optimizer and nbtree code to break the dependency -- a callback that does some of the preprocessing work during planning. Tom said something along the same lines in passing, when discussing the MDAM paper last year [2]. Much work remains here.
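As a minimal illustration of the CNF -> DNF idea, using the same tenk1 columns as the examples above (this only shows the shape of the rewrite, not how any real preprocessing would be implemented):

-- CNF, an "AND of ORs", as the query is written:
where two in (0, 1) and four in (1, 2)

-- Logically equivalent DNF, an "OR of ANDs" -- one disjunct per distinct
-- combination of array keys, in keyspace order for an index on (two, four):
where (two = 0 and four = 1)
   or (two = 0 and four = 2)
   or (two = 1 and four = 1)
   or (two = 1 and four = 2)

Each disjunct is the kind of "single value predicate" that the state machine already materializes today.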
Skip Scan
=========

MDAM encompasses something that people tend to call "skip scan" -- terminology with a great deal of baggage. These days I prefer to call it "filling in missing key predicates", per the paper. That's much more descriptive, and makes it less likely that people will conflate the techniques with InnoDB style "loose index scans" -- the latter is a much more specialized/targeted optimization. (I now believe that these are very different things, though I was thrown off by the superficial similarities for a long time. It's pretty confusing.)

I see this work as a key enabler of "filling in missing key predicates". MDAM describes how to implement this technique by applying the same principles that it applies everywhere else: it proposes a scheme that converts predicates from CNF to DNF, with just a little extra logic required to do index probes to feed the DNF-generating state machine, on demand.

More concretely, in Postgres terms: skip scan can be implemented by inventing a new placeholder clause that can be composed alongside ScalarArrayOpExprs, in the same way that multiple ScalarArrayOpExprs can be composed together in the patch already. I'm thinking of a type of clause that makes the nbtree code materialize a set of "array keys" for a SK_SEARCHARRAY scan key dynamically, via ad-hoc index probes (perhaps static approaches would be better for types like boolean, which the paper contemplates). It should be possible to teach the _bt_advance_array_keys state machine to generate those values in approximately the same fashion as it already does for ScalarArrayOpExprs -- and, it shouldn't be too hard to do it in a localized fashion, allowing everything else to continue to work in the same way without any special concern. This separation of concerns is a nice consequence of the way that the MDAM design really leverages preprocessing/DNF for everything. Both types of clauses can be treated as part of a general class of ScalarArrayOpExpr-like clauses. Making the rules around "composability" simple will be important.

Although skip scan gets a lot of attention, it's not necessarily the most compelling MDAM technique. It's also not especially challenging to implement on top of everything else. It really isn't that special.

Right now I'm focussed on the big picture, in any case. I want to emphasize the very general nature of these techniques. Although I'm focussed on SAOPs in the short term, many queries that don't make use of SAOPs should ultimately see similar benefits. For example, the paper also describes transformations that apply to BETWEEN/range predicates. We might end up needing a third type of expression for those. They're all just DNF single value predicates, under the hood.

Thoughts?

[1] http://vldb.org/conf/1995/P710.PDF
[2] https://www.postgresql.org/message-id/2587523.1647982549@sss.pgh.pa.us
[3] https://gist.github.com/benoittgt/ab72dc4cfedea2a0c6a5ee809d16e04d

--
Peter Geoghegan
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Matthias van de Meent
Date:
On Tue, 25 Jul 2023 at 03:34, Peter Geoghegan <pg@bowt.ie> wrote:
>
> I've been working on a variety of improvements to nbtree's native ScalarArrayOpExpr execution. This builds on Tom's work in commit 9e8da0f7.

Cool.

> Attached patch is still at the prototype stage. I'm posting it as v1 a little earlier than I usually would because there has been much back and forth about it on a couple of other threads involving Tomas Vondra and Jeff Davis -- seems like it would be easier to discuss with working code available.
>
> The patch adds two closely related enhancements to ScalarArrayOp execution by nbtree:
>
> 1. Execution of quals with ScalarArrayOpExpr clauses during nbtree index scans (for equality-strategy SK_SEARCHARRAY scan keys) can now "advance the scan's array keys locally", which sometimes avoids significant amounts of unneeded pinning/locking of the same set of index pages.
>
> SAOP index scans become capable of eliding primitive index scans for the next set of array keys in line in cases where it isn't truly necessary to descend the B-Tree again. Index scans are now capable of "sticking with the existing leaf page for now" when it is determined that the end of the current set of array keys is physically close to the start of the next set of array keys (the next set in line to be materialized by the _bt_advance_array_keys state machine). This is often possible.
>
> Naturally, we still prefer to advance the array keys in the traditional way ("globally") much of the time. That means we'll perform another _bt_first/_bt_search descent of the index, starting a new primitive index scan. Whether we try to skip pages on the leaf level or stick with the current primitive index scan (by advancing array keys locally) is likely to vary a great deal. Even during the same index scan. Everything is decided dynamically, which is the only approach that really makes sense.
>
> This optimization can significantly lower the number of buffers pinned and locked in cases with significant locality, and/or with many array keys with no matches. The savings (when measured in buffers pinned/locked) can be as high as 10x, 100x, or even more. Benchmarking has shown that variants of "pgbench -S" designed to stress the implementation (hundreds of array constants) can have up to 5.5x higher transaction throughput under concurrent load with the patch. Less extreme cases (10 array constants, spaced apart) see about a 20% improvement in throughput. There are similar improvements to latency for the patch, in each case.

Considering that it caches/reuses the page across SAOP operations, can (or does) this also improve performance for index scans on the outer side of a join if the order of join columns matches the order of the index?

That is, I believe this caches (leaf) pages across scan keys, but can (or does) it also reuse these already-cached leaf pages across restarts of the index scan/across multiple index lookups in the same plan node, so that retrieval of nearby index values does not need to do an index traversal?

> [...]
> Skip Scan
> =========
>
> MDAM encompasses something that people tend to call "skip scan" -- terminology with a great deal of baggage. These days I prefer to call it "filling in missing key predicates", per the paper.
> That's much more descriptive, and makes it less likely that people will conflate the techniques with InnoDB style "loose index scans" -- the latter is a much more specialized/targeted optimization. (I now believe that these are very different things, though I was thrown off by the superficial similarities for a long time. It's pretty confusing.)

I'm not sure I understand. MDAM seems to work on an index level to return full ranges of values, while "skip scan" seems to try to allow systems to signal to the index to skip to some other index condition based on arbitrary cutoffs. This would usually be based on information that is not stored in the index, such as "SELECT user_id FROM orders GROUP BY user_id HAVING COUNT(*) > 10", where the scan would go through the user_id index and skip to the next user_id value when it gets enough rows of a matching result (where "enough" is determined above the index AM's plan node, or otherwise is impossible to determine with only the scan key info in the index AM). I'm not sure how this could work without specifically adding skip scan-related index AM functionality, and I don't see how it fits in with this MDAM/SAOP system.

> [...]
>
> Thoughts?

MDAM seems to require exponential storage for "scan key operations" for conditions on N columns (to be precise, the product of the number of distinct conditions on each column); e.g. an index on mytable (a,b,c,d,e,f,g,h) with conditions "a IN (1, 2) AND b IN (1, 2) AND ... AND h IN (1, 2)" would require 2^8 entries. If 4 conditions were used for each column, that'd be 4^8, etc... With an index column limit of 32, that's quite a lot of memory potentially needed to execute the statement.

So, this begs the question: does this patch have the same issue? Does it fail with OOM, does it gracefully fall back to the old behaviour when the clauses are too complex to linearize/compose/fold into the btree ordering clauses, or are scan keys dynamically constructed using just-in-time or generator patterns?

Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Wed, Jul 26, 2023 at 5:29 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> Considering that it caches/reuses the page across SAOP operations, can (or does) this also improve performance for index scans on the outer side of a join if the order of join columns matches the order of the index?

It doesn't really cache leaf pages at all. What it does is advance the array keys locally, while the original buffer lock is still held on that same page.

> That is, I believe this caches (leaf) pages across scan keys, but can (or does) it also reuse these already-cached leaf pages across restarts of the index scan/across multiple index lookups in the same plan node, so that retrieval of nearby index values does not need to do an index traversal?

I'm not sure what you mean. There is no reason why you need to do more than one single descent of an index to scan many leaf pages using many distinct sets of array keys. Obviously, this depends on being able to observe that we really don't need to redescend the index to advance the array keys, again and again. Note in particular that this usually works across leaf pages.

> I'm not sure I understand. MDAM seems to work on an index level to return full ranges of values, while "skip scan" seems to try to allow systems to signal to the index to skip to some other index condition based on arbitrary cutoffs. This would usually be based on information that is not stored in the index, such as "SELECT user_id FROM orders GROUP BY user_id HAVING COUNT(*) > 10", where the scan would go through the user_id index and skip to the next user_id value when it gets enough rows of a matching result (where "enough" is determined above the index AM's plan node, or otherwise is impossible to determine with only the scan key info in the index AM). I'm not sure how this could work without specifically adding skip scan-related index AM functionality, and I don't see how it fits in with this MDAM/SAOP system.

I think of that as being quite a different thing.

Basically, the patch that added that feature had to revise the index AM API, in order to support a mode of operation where scans return groupings rather than tuples. Whereas this patch requires none of that. It makes affected index scans as similar as possible to conventional index scans.

> > [...]
> >
> > Thoughts?
>
> MDAM seems to require exponential storage for "scan key operations" for conditions on N columns (to be precise, the product of the number of distinct conditions on each column); e.g. an index on mytable (a,b,c,d,e,f,g,h) with conditions "a IN (1, 2) AND b IN (1, 2) AND ... AND h IN (1, 2)" would require 2^8 entries.

Note that I haven't actually changed anything about the way that the state machine generates new sets of single value predicates -- it's still just cycling through each distinct set of array keys in the patch.

What you describe is a problem in theory, but I doubt that it's a problem in practice. You don't actually have to materialize the predicates up-front, or at all. Plus you can skip over them using the next index tuple. So skipping works both ways.

--
Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Matthias van de Meent
Date:
On Wed, 26 Jul 2023 at 15:42, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, Jul 26, 2023 at 5:29 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> > Considering that it caches/reuses the page across SAOP operations, can (or does) this also improve performance for index scans on the outer side of a join if the order of join columns matches the order of the index?
>
> It doesn't really cache leaf pages at all. What it does is advance the array keys locally, while the original buffer lock is still held on that same page.

Hmm, then I had a mistaken understanding of what we do in _bt_readpage with _bt_saveitem.

> > That is, I believe this caches (leaf) pages across scan keys, but can (or does) it also reuse these already-cached leaf pages across restarts of the index scan/across multiple index lookups in the same plan node, so that retrieval of nearby index values does not need to do an index traversal?
>
> I'm not sure what you mean. There is no reason why you need to do more than one single descent of an index to scan many leaf pages using many distinct sets of array keys. Obviously, this depends on being able to observe that we really don't need to redescend the index to advance the array keys, again and again. Note in particular that this usually works across leaf pages.

In a NestedLoop(inner=seqscan, outer=indexscan), the index gets repeatedly scanned from the root, right? It seems that right now, we copy matching index entries into a local cache (that is deleted on amrescan), then we drop our locks and pins on the buffer, and then start returning values from our local cache (in _bt_saveitem).

We could cache the last accessed leaf page across amrescan operations to reduce the number of index traversals needed when the join key of the left side is highly (but not necessarily strictly) correlated. The worst case overhead of this would be 2 _bt_compares (to check if the value is supposed to be fully located on the cached leaf page) plus one memcpy( , , BLCKSZ) in the previous loop. With some smart heuristics (e.g. page fill factor, number of distinct values, and whether we previously hit this same leaf page in the previous scan of this Node) we can probably also reduce this overhead to a minimum if the joined keys are not correlated, but accelerate the query significantly when we find out they are correlated.

Of course, in the cases where we'd expect very few distinct join keys the planner would likely put a Memoize node above the index scan, but for mostly unique join keys I think this could save significant amounts of time, if only on buffer pinning and locking.

I guess I'll try to code something up when I have the time, as it sounds not quite exactly related to your patch but an interesting improvement nonetheless.

Kind regards,

Matthias van de Meent
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Wed, Jul 26, 2023 at 12:07 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> We could cache the last accessed leaf page across amrescan operations to reduce the number of index traversals needed when the join key of the left side is highly (but not necessarily strictly) correlated.

That sounds like block nested loop join. It's possible that that could reuse some infrastructure from this patch, but I'm not sure.

In general, SAOP execution/MDAM performs "duplicate elimination before it reads the data" by sorting and deduplicating the arrays up front. While my patch sometimes elides a primitive index scan, primitive index scans are already disjuncts that are combined to create what can be considered one big index scan (that's how the planner and executor think of them). The patch takes that one step further by recognizing that it could quite literally be one big index scan in some cases (or fewer, larger scans, at least). It's a natural incremental improvement, as opposed to inventing a new kind of index scan. If anything the patch makes SAOP execution more similar to traditional index scans, especially when costing them.

Like InnoDB style loose index scan (for DISTINCT and GROUP BY optimization), block nested loop join would require inventing a new type of index scan. Both of these other two optimizations involve the use of semantic information that spans multiple levels of abstraction. Loose scan requires duplicate elimination (that's the whole point), while IIUC block nested loop join needs to "simulate multiple inner index scans" by deliberately returning duplicates for each would-be inner index scan. These are specialized things.

To be clear, I think that all of these ideas are reasonable. I just find it useful to classify these sorts of techniques according to whether or not the index AM API would have to change, and the general nature of any required changes. MDAM can do a lot of cool things without requiring any revisions to the index AM API, which should allow it to play nice with everything else (index path clause safety issues notwithstanding).

--
Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Matthias van de Meent
Date:
On Wed, 26 Jul 2023 at 15:42, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, Jul 26, 2023 at 5:29 AM Matthias van de Meent
> > I'm not sure I understand. MDAM seems to work on an index level to return full ranges of values, while "skip scan" seems to try to allow systems to signal to the index to skip to some other index condition based on arbitrary cutoffs. This would usually be based on information that is not stored in the index, such as "SELECT user_id FROM orders GROUP BY user_id HAVING COUNT(*) > 10", where the scan would go through the user_id index and skip to the next user_id value when it gets enough rows of a matching result (where "enough" is determined above the index AM's plan node, or otherwise is impossible to determine with only the scan key info in the index AM). I'm not sure how this could work without specifically adding skip scan-related index AM functionality, and I don't see how it fits in with this MDAM/SAOP system.
>
> I think of that as being quite a different thing.
>
> Basically, the patch that added that feature had to revise the index AM API, in order to support a mode of operation where scans return groupings rather than tuples. Whereas this patch requires none of that. It makes affected index scans as similar as possible to conventional index scans.

Hmm, yes. I see now where my confusion started. You called it out in your first paragraph of the original mail, too, but that didn't help me then: The wiki does not distinguish "Index Skip Scans" and "Loose Index Scans", but these are not the same.

In the one page on "Loose indexscan", it refers to MySQL's "loose index scan" documentation, which does handle groupings, and this was targeted with the previous, mislabeled, "Index skipscan" patchset. However, crucially, it also refers to other databases' Index Skip Scan documentation, which documents and implements this approach of 'skipping to the next potential key range to get efficient non-prefix qual results', giving me a false impression that those two features are one and the same when they are not.

It seems like I'll have to wait a bit longer for the functionality of Loose Index Scans.

> > > [...]
> > >
> > > Thoughts?
> >
> > MDAM seems to require exponential storage for "scan key operations" for conditions on N columns (to be precise, the product of the number of distinct conditions on each column); e.g. an index on mytable (a,b,c,d,e,f,g,h) with conditions "a IN (1, 2) AND b IN (1, 2) AND ... AND h IN (1, 2)" would require 2^8 entries.
>
> Note that I haven't actually changed anything about the way that the state machine generates new sets of single value predicates -- it's still just cycling through each distinct set of array keys in the patch.
>
> What you describe is a problem in theory, but I doubt that it's a problem in practice. You don't actually have to materialize the predicates up-front, or at all.

Yes, that's why I asked: The MDAM paper's examples seem to materialize the full predicate up-front, which would require a product of all indexed columns' quals in size, so that materialization has a good chance to get really, really large. But if we're not doing that materialization upfront, then there is no issue with resource consumption (except CPU time, which can likely be improved with other methods).

Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Matthias van de Meent
Date:
On Thu, 27 Jul 2023 at 06:14, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, Jul 26, 2023 at 12:07 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> > We could cache the last accessed leaf page across amrescan operations to reduce the number of index traversals needed when the join key of the left side is highly (but not necessarily strictly) correlated.
>
> That sounds like block nested loop join. It's possible that that could reuse some infrastructure from this patch, but I'm not sure.

My idea is not quite block nested loop join. It's more 'restart the index scan at the location the previous index scan ended, if heuristics say there's a good chance that might save us time'. I'd say it is comparable to the fast tree descent optimization that we have for endpoint queries, and comparable to this patch's scankey optimization, but across AM-level rescans instead of internal rescans. See also the attached prototype and loosely coded patch. It passes tests, but it might not be without bugs.

The basic design of that patch is this: We keep track of how many times we've rescanned, and the end location of the index scan. If a new index scan hits the same page after _bt_search as the previous scan ended, we register that. Those two values - num_rescans and num_samepage - are used as heuristics for the following: If 50% or more of rescans hit the same page as the end location of the previous scan, we start saving the scan's end location's buffer into the BTScanOpaque, so that the next _bt_first can check whether that page might be the right leaf page, and if so, immediately go to that buffer instead of descending the tree - saving one tree descent in the process.

Further optimizations of this mechanism could easily be implemented by e.g. only copying the min/max index tuples instead of the full index page, reducing the overhead at scan end.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Thu, Jul 27, 2023 at 7:59 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> > Basically, the patch that added that feature had to revise the index AM API, in order to support a mode of operation where scans return groupings rather than tuples. Whereas this patch requires none of that. It makes affected index scans as similar as possible to conventional index scans.
>
> Hmm, yes. I see now where my confusion started. You called it out in your first paragraph of the original mail, too, but that didn't help me then:
>
> The wiki does not distinguish "Index Skip Scans" and "Loose Index Scans", but these are not the same.

A lot of people (myself included) were confused on this point for quite a while. To make matters even more confusing, one of the really compelling cases for the MDAM design is scans that feed into GroupAggregates -- preserving index sort order for naturally big index scans will tend to enable it. One of my examples from the start of this thread showed just that. (It just so happened that that example was faster because of all the "skipping" that nbtree *wasn't* doing with the patch.)

> Yes, that's why I asked: The MDAM paper's examples seem to materialize the full predicate up-front, which would require a product of all indexed columns' quals in size, so that materialization has a good chance to get really, really large. But if we're not doing that materialization upfront, then there is no issue with resource consumption (except CPU time, which can likely be improved with other methods)

I get why you asked. I might have asked the same question.

As I said, the MDAM paper has *surprisingly* little to say about B-Tree executor stuff -- it's almost all just describing the preprocessing/transformation process. It seems as if optimizations like the one from my patch were considered too obvious to talk about and/or out of scope by the authors. Thinking about the MDAM paper like that was what made everything fall into place for me. Remember, "missing key predicates" isn't all that special.

--
Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Matthias van de Meent
Date:
On Thu, 27 Jul 2023 at 16:01, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Thu, Jul 27, 2023 at 7:59 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> > > Basically, the patch that added that feature had to revise the index AM API, in order to support a mode of operation where scans return groupings rather than tuples. Whereas this patch requires none of that. It makes affected index scans as similar as possible to conventional index scans.
> >
> > Hmm, yes. I see now where my confusion started. You called it out in your first paragraph of the original mail, too, but that didn't help me then:
> >
> > The wiki does not distinguish "Index Skip Scans" and "Loose Index Scans", but these are not the same.
>
> A lot of people (myself included) were confused on this point for quite a while.

I've taken the liberty of updating the "Loose indexscan" wiki page [0], adding detail that Loose indexscans are distinct from Skip scans, and showing some high-level distinguishing properties. I also split the TODO entry for `` "loose" or "skip" scan `` into two, and added links to the relevant recent threads so that it's clear these are different (and that some previous efforts may have had a confusing name).

I hope this will reduce the chance of future confusion between the two different approaches to improving index scan performance.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0]: https://wiki.postgresql.org/wiki/Loose_indexscan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Alena Rybakina
Date:
Hi, all!

> CNF -> DNF conversion
> =====================
>
> Like many great papers, the MDAM paper takes one core idea, and finds ways to leverage it to the hilt. Here the core idea is to take predicates in conjunctive normal form (an "AND of ORs"), and convert them into disjunctive normal form (an "OR of ANDs"). DNF quals are logically equivalent to CNF quals, but ideally suited to SAOP-array style processing by an ordered B-Tree index scan -- they reduce everything to a series of non-overlapping primitive index scans that can be processed in keyspace order. We already do this today in the case of SAOPs, in effect. The nbtree "next array keys" state machine already materializes values that can be seen as MDAM style DNF single value predicates. The state machine works by outputting the cartesian product of the arrays as a multi-column index is scanned, but that could be taken a lot further in the future. We can use essentially the same kind of state machine to do everything described in the paper -- ultimately, it just needs to output a list of disjuncts, like the DNF clauses that the paper shows in "Table 3".
>
> In theory, anything can be supported via a sufficiently complete CNF -> DNF conversion framework. There will likely always be the potential for unsafe/unsupported clauses and/or types in an extensible system like Postgres, though. So we will probably need to retain some notion of safety. It seems like more of a job for nbtree preprocessing (or some suitably index-AM-agnostic version of the same idea) than the optimizer, in any case. But that's not entirely true, either (that would be far too easy).
>
> The optimizer still needs to optimize. It can't very well do that without having some kind of advance notice of what is and is not supported by the index AM. And, the index AM cannot just unilaterally decide that index quals actually should be treated as filter/qpquals, after all -- it doesn't get a veto. So there is a mutual dependency that needs to be resolved. I suspect that there needs to be a two-way conversation between the optimizer and nbtree code to break the dependency -- a callback that does some of the preprocessing work during planning. Tom said something along the same lines in passing, when discussing the MDAM paper last year [2]. Much work remains here.

Honestly, I'm just reading and delving into this thread and other topics related to it, so excuse me if I ask you a few obvious questions.

I noticed that you are going to add the CNF->DNF transformation at the index path construction stage. If I understand correctly, you will rewrite the restrictinfo node, changing boolean "AND" expressions to "OR" expressions, but would it be possible to apply such a procedure earlier? Otherwise I suppose you could face the problem of an incorrect selectivity calculation and, consequently, an incorrect cardinality calculation? I can't clearly understand at what stage it becomes clear that such a transformation needs to be applied?

--
Regards,
Alena Rybakina
Postgres Professional
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Thu, Jul 27, 2023 at 10:00 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> My idea is not quite block nested loop join. It's more 'restart the index scan at the location the previous index scan ended, if heuristics say there's a good chance that might save us time'. I'd say it is comparable to the fast tree descent optimization that we have for endpoint queries, and comparable to this patch's scankey optimization, but across AM-level rescans instead of internal rescans.

Yeah, I see what you mean. Seems related, even though what you've shown in your prototype patch doesn't seem like it fits into my taxonomy very neatly.

(BTW, I was a little confused by the use of the term "endpoint" at first, since there is a function that uses that term to refer to a descent of the tree that happens without any insertion scan key. This path is used whenever the best we can do in _bt_first is to descend to the rightmost or leftmost page.)

> The basic design of that patch is this: We keep track of how many times we've rescanned, and the end location of the index scan. If a new index scan hits the same page after _bt_search as the previous scan ended, we register that.

I can see one advantage that block nested loop join would retain here: it does block-based accesses on both sides of the join. Since it "looks ahead" on both sides of the join, more repeat accesses are likely to be avoided. Not too sure how much that matters in practice, though.

--
Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Mon, Jul 31, 2023 at 12:24 PM Alena Rybakina <lena.ribackina@yandex.ru> wrote:
> I noticed that you are going to add the CNF->DNF transformation at the index path construction stage. If I understand correctly, you will rewrite the restrictinfo node, changing boolean "AND" expressions to "OR" expressions, but would it be possible to apply such a procedure earlier?

Sort of. I haven't really added any new CNF->DNF transformations. The code you're talking about is really just checking that every index path has clauses that we know nbtree can handle. That's a big, ugly modularity violation -- many of these details are quite specific to the nbtree index AM (in theory we could have other index AMs that are amsearcharray).

At most, v1 of the patch makes greater use of an existing transformation that takes place in the nbtree index AM, as it preprocesses scan keys for these types of queries (it's not inventing new transformations at all). This is a slightly creative interpretation, too. Tom's commit 9e8da0f7 didn't actually say anything about CNF/DNF.

> Otherwise I suppose you could face the problem of an incorrect selectivity calculation and, consequently, an incorrect cardinality calculation?

I can't think of any reason why that should happen as a direct result of what I have done here. Multi-column index paths + multiple SAOP clauses are not a new thing. The number of rows returned does not depend on whether we have some columns as filter quals or not.

Of course that doesn't mean that the costing has no problems. The costing definitely has several problems right now. It also isn't necessarily okay that it's "just as good as before" if it turns out that it needs to be better now. But I don't see why it would be. (Actually, my hope is that selectivity estimation might be *less* important as a practical matter with the patch.)

> I can't clearly understand at what stage it becomes clear that such a transformation needs to be applied?

I don't know either. I think that most of this work needs to take place in the nbtree code, during preprocessing. But it's not so simple. There is a mutual dependency between the code that generates index paths in the planner and nbtree scan key preprocessing. The planner needs to know what kinds of index paths are possible/safe up-front, so that it can choose the fastest plan (the fastest that the index AM knows how to execute correctly). But, there are lots of small annoying nbtree implementation details that might matter, and can change.

I think we need to have nbtree register a callback, so that the planner can initialize some preprocessing early. I think that we require a "two-way conversation" between the planner and the index AM.

--
Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Wed, Jul 26, 2023 at 6:41 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > MDAM seems to require exponential storage for "scan key operations" for conditions on N columns (to be precise, the product of the number of distinct conditions on each column); e.g. an index on mytable (a,b,c,d,e,f,g,h) with conditions "a IN (1, 2) AND b IN (1, 2) AND ... AND h IN (1, 2)" would require 2^8 entries.
>
> What you describe is a problem in theory, but I doubt that it's a problem in practice. You don't actually have to materialize the predicates up-front, or at all. Plus you can skip over them using the next index tuple. So skipping works both ways.

Attached is v2, which makes all array key advancement take place using the "next index tuple" approach (using binary searches to find array keys using index tuple values). This approach was necessary for fairly mundane reasons (it limits the amount of work required while holding a buffer lock), but it also solves quite a few other problems that I find far more interesting.

It's easy to imagine the state machine from v2 of the patch being extended for skip scan. My approach "abstracts away" the arrays. For skip scan, it would more or less behave as if the user had written a query "WHERE a in (<Every possible value for this column>) AND b = 5 ..." -- without actually knowing what the so-called array keys for the high-order skipped column are (not up front, at least). We'd only need to track the current "array key" for the scan key on the skipped column, "a". The state machine would notice when the scan had reached the next-greatest "a" value in the index (whatever that might be), and then make that the current value. Finally, the state machine would effectively instruct its caller to consider repositioning the scan via a new descent of the index. In other words, almost everything for skip scan would work just like regular SAOPs -- and any differences would be well encapsulated.
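Spelled out with the index from earlier in the thread (this is purely a hypothetical sketch of the behavior, not something the patch implements yet):

-- order_by_saop is on tenk1 (two, four, twenty); there is no qual on
-- the leading column "two" here:
select ctid from tenk1 where four = 1 and twenty = 1;

-- Under the scheme described above, the scan would behave roughly as if
-- the query had an "array" on the skipped column:
select ctid from tenk1
where two in (0, 1)   -- discovered on the fly, via index probes
and four = 1 and twenty = 1;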
But it's not just skip scan. This approach also enables thinking of SAOP index scans (using nbtree) as just another type of indexable clause, without any special restrictions (compared to true indexable operators such as "=", say). Particularly in the planner. That was always the general thrust of teaching nbtree about SAOPs, from the start. But it's something that should be totally embraced IMV. That's just what the patch proposes to do. In particular, the patch now:

1. Entirely removes the long-standing restriction on generating pathkeys for index paths with SAOPs, even when there are inequalities on a high-order column present. You can mix SAOPs together with other clause types, arbitrarily, and everything still works and works efficiently.

For example, the regression test expected output for this query/test (from bugfix commit 807a40c5) is updated by the patch, as shown here:

explain (costs off)
SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
ORDER BY thousand;
-                                       QUERY PLAN
----------------------------------------------------------------------------------------
- Sort
-   Sort Key: thousand
-   ->  Index Scan using tenk1_thous_tenthous on tenk1
-         Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
-(4 rows)
+                                    QUERY PLAN
+--------------------------------------------------------------------------------
+ Index Scan using tenk1_thous_tenthous on tenk1
+   Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(2 rows)

We don't need a sort node anymore -- even though the leading column here (thousand) uses an inequality, a particularly tricky case. Now it's an index scan, much like any other, with no particular restrictions caused by using a SAOP.

2. Adds an nbtree strategy for non-required equality array scan keys, which is built on the same state machine, with only minor differences to deal with column values "appearing out of keyspace order".

3. Simplifies the optimizer side of things by consistently avoiding filter quals (except when it's truly unavoidable). The optimizer doesn't even consider alternative index paths with filter quals for lower-order SAOP columns, because they have no possible advantage anymore. On the other hand, as we saw already, upthread, filter quals have huge disadvantages. By always using true index quals, we automatically avoid any question of getting excessive amounts of heap page accesses just to eliminate non-matching rows. AFAICT we don't need to make a trade-off here.

The first version of the patch added some crufty code to the optimizer, to account for various restrictions on sort order. This revised version actually removes existing cruft from the same place (indxpath.c) instead.

Items 1, 2, and 3 are all closely related. Take the query I've shown for item 1. Bugfix commit 807a40c5 (which added the test query in question) dealt with an oversight in the then-recent original nbtree SAOP patch (commit 9e8da0f7): when nbtree combines two primitive index scans with an inequality on their leading column, we cannot be sure that the output will appear in the same order as the order that one big continuous index scan returns rows in. We can only expect to maintain the illusion that we're doing one continuous index scan when individual primitive index scans access earlier columns via the equality strategy -- we need "equality constraints". In practice, the optimizer (indxpath.c) is very conservative (more conservative than it really needs to be) when it comes to trusting the index scan to output rows in index order, in the presence of SAOPs.

All of that now seems totally unnecessary. Again, I don't see a need to make a trade-off here. My observation about this query (and others like it) is: why not literally perform one continuous index scan instead (not multiple primitive index scans)? That is strictly better, given all the specifics here. Once we have a way to do that (which the nbtree executor work listed under item 2 provides), it becomes safe to assume that the tuples will be output in index order -- there is no illusion left to preserve. Who needs an illusion that isn't actually helping us? We actually do less I/O by using this strategy, for the usual reasons (we can avoid repeating index page accesses).
A more concrete benefit of the non-required-scankeys stuff can be seen by running Benoit Tigeot's test case [1] with v2. He had a query like this:

SELECT * FROM docs
WHERE status IN ('draft', 'sent')
AND sender_reference IN ('Custom/1175', 'Client/362', 'Custom/280')
ORDER BY sent_at DESC NULLS LAST LIMIT 20;

And, his test case had an index on "sent_at DESC NULLS LAST, sender_reference, status". This variant was a weak spot for v1. v2 of the patch is vastly more efficient here, since we don't have to go to the heap to eliminate non-matching tuples -- that can happen in the index AM instead. This can easily be 2x-3x faster on a warm cache, and have *hundreds* of times fewer buffer accesses (which Benoit verified with an early version of this v2). All because we now require vastly less heap access -- the quals are fairly selective here, and we have to scan hundreds of leaf pages before the scan can terminate. Avoiding filter quals is a huge win.

This particular improvement is hard to squarely attribute to any one of my 3 items. The immediate problem that the query presents us with on the master branch is the problem of filter quals that require heap accesses to do visibility checks (a problem that index quals can never have). That makes it tempting to credit my item 3. But you can't really have item 3 without also having items 1 and 2. Taken together, they eliminate all possible downsides from using index quals. That high level direction (try to have one good choice for the optimizer) seems important to me. Both for this project, and in general.

Other changes in v2:

* Improved costing, which takes advantage of the fact that nbtree now promises to not repeat any leaf page accesses (unless the scan is restarted or the direction of the scan changes). This makes the worst case far more predictable, and more related to selectivity estimation -- you can't scan more pages than you have in the whole index. Just like with every other sort of index scan.

* Support for parallel index scans. The existing approach to array keys for parallel index scans has been adapted to work with individual primitive index scans, not individual array keys. I haven't tested this very thoroughly just yet, but it seems to work well enough already. I think that it's important to not have very much variation between parallel and serial index scans, which I seem to have mostly avoided.

[1] https://gist.github.com/benoittgt/ab72dc4cfedea2a0c6a5ee809d16e04d?permalink_comment_id=4690491#gistcomment-4690491

--
Peter Geoghegan
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Sun, Sep 17, 2023 at 4:47 PM Peter Geoghegan <pg@bowt.ie> wrote: > Attached is v2, which makes all array key advancement take place using > the "next index tuple" approach (using binary searches to find array > keys using index tuple values). Attached is v3, which fixes bitrot caused by today's bugfix commit 714780dc. No notable changes here compared to v2. -- Peter Geoghegan
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Thu, Sep 28, 2023 at 5:32 PM Peter Geoghegan <pg@bowt.ie> wrote: > On Sun, Sep 17, 2023 at 4:47 PM Peter Geoghegan <pg@bowt.ie> wrote: > > Attached is v2, which makes all array key advancement take place using > > the "next index tuple" approach (using binary searches to find array > > keys using index tuple values). > > Attached is v3, which fixes bitrot caused by today's bugfix commit 714780dc. Attached is v4, which applies cleanly on top of HEAD. This was needed due to Alexandar Korotkov's commit e0b1ee17, "Skip checking of scan keys required for directional scan in B-tree". Unfortunately I have more or less dealt with the conflicts on HEAD by disabling the optimization from that commit, for the time being. The commit in question is rather poorly documented, and it's not immediately clear how to integrate it with my work. I just want to make sure that there's a testable patch available. -- Peter Geoghegan
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Sun, Oct 15, 2023 at 1:50 PM Peter Geoghegan <pg@bowt.ie> wrote: > Attached is v4, which applies cleanly on top of HEAD. This was needed > due to Alexandar Korotkov's commit e0b1ee17, "Skip checking of scan > keys required for directional scan in B-tree". > > Unfortunately I have more or less dealt with the conflicts on HEAD by > disabling the optimization from that commit, for the time being. Attached is v5, which deals with the conflict with the optimization added by Alexandar Korotkov's commit e0b1ee17 sensibly: the optimization is now only disabled in cases without array scan keys. (It'd be very hard to make it work with array scan keys, since an important principle for my patch is that we can change search-type scan keys right in the middle of any _bt_readpage() call). v5 also fixes a longstanding open item for the patch: we no longer call _bt_preprocess_keys() with a buffer lock held, which was a bad idea at best, and unsafe (due to the syscache lookups within _bt_preprocess_keys) at worst. A new, minimal version of the function (called _bt_preprocess_keys_leafbuf) is called at the same point instead. That change, combined with the array binary search stuff (which was added back in v2), makes the total amount of work performed with a buffer lock held totally reasonable in all cases. It's even okay in extreme or adversarial cases with many millions of array keys. Making this _bt_preprocess_keys_leafbuf approach work has a downside: it requires that _bt_preprocess_keys be a little less aggressive about removing redundant scan keys, in order to meet certain assumptions held by the new _bt_preprocess_keys_leafbuf function. Essentially, _bt_preprocess_keys must now worry about current and future array key values when determining redundancy among scan keys -- not just the current array key values. _bt_preprocess_keys knows nothing about SK_SEARCHARRAY scan keys on HEAD, because on HEAD there is a strict 1:1 correspondence between the number of primitive index scans and the number of array keys (actually, the number of distinct combinations of array keys). Obviously that's no longer the case with the patch (that's the whole point of the patch). It's easiest to understand how elimination of redundant quals needs to work in v5 by way of an example. Consider the following query: select count(*), two, four, twenty, hundred from tenk1 where two in (0, 1) and four in (1, 2, 3) and two < 1; Notice that "two" appears in the where clause twice. First it appears as an SAOP, and then as an inequality. Right now, on HEAD, the primitive index scan where the SAOP's scankey is "two = 0" renders "two < 1" redundant. However, the subsequent primitive index scan where "two = 1" does *not* render "two < 1" redundant. This has implications for the mechanism in the patch, since the patch will perform one big primitive index scan for all array constants, with only a single _bt_preprocess_keys call at the start of its one and only _bt_first call (but with multiple _bt_preprocess_keys_leafbuf calls once we reach the leaf level). The compromise that I've settled on in v5 is to teach _bt_preprocess_keys to *never* treat "two < 1" as redundant with such a query -- even though there is some squishy sense in which "two < 1" is indeed still redundant (for the first SAOP key of value 0). My approach is reasonably well targeted in that it mostly doesn't affect queries that don't need it. But it will add cycles to some badly written queries that wouldn't have had them in earlier Postgres versions. 
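To make the compromise concrete, here is a runnable variant of that query (a GROUP BY is added here so that the aggregate is valid SQL; everything else is as above, using the regression test schema):

select count(*), two, four, twenty, hundred
from tenk1
where two in (0, 1) and four in (1, 2, 3) and two < 1
group by two, four, twenty, hundred;
-- with v5, "two < 1" is kept as a scan key rather than being proven
-- redundant, because it is only redundant for the array element 0 (not
-- for 1); the possible cost is a few extra comparisons per tuple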
I'm not entirely sure how much this matters, but my current sense is that it doesn't matter all that much. This is the kind of thing that is hard to test and poorly tested, so simplicity is even more of a virtue than usual. Note that the changes to _bt_preprocess_keys in v5 *don't* affect how we determine if the scan has contradictory quals, which is generally more important. With contradictory quals, _bt_first can avoid reading any data from the index. OTOH eliminating redundant quals (i.e. the thing that v5 *does* change) merely makes evaluating index quals less expensive via preprocessing-away unneeded scan keys. In other words, while it's possible that the approach taken by v5 will add CPU cycles in a small number of cases, it should never result in more page accesses. -- Peter Geoghegan
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Matthias van de Meent
Date:
On Sat, 21 Oct 2023 at 00:40, Peter Geoghegan <pg@bowt.ie> wrote: > > On Sun, Oct 15, 2023 at 1:50 PM Peter Geoghegan <pg@bowt.ie> wrote: > > Attached is v4, which applies cleanly on top of HEAD. This was needed > > due to Alexandar Korotkov's commit e0b1ee17, "Skip checking of scan > > keys required for directional scan in B-tree". > > > > Unfortunately I have more or less dealt with the conflicts on HEAD by > > disabling the optimization from that commit, for the time being. > > Attached is v5, which deals with the conflict with the optimization > added by Alexandar Korotkov's commit e0b1ee17 sensibly: the > optimization is now only disabled in cases without array scan keys. > (It'd be very hard to make it work with array scan keys, since an > important principle for my patch is that we can change search-type > scan keys right in the middle of any _bt_readpage() call). I'm planning on reviewing this patch tomorrow, but in an initial scan through the patch I noticed there's little information about how the array keys state machine works in this new design. Do you have a more toplevel description of the full state machine used in the new design? If not, I'll probably be able to discover my own understanding of the mechanism used in the patch, but if there is a framework to build that understanding on (rather than having to build it from scratch) that'd be greatly appreciated. Kind regards, Matthias van de Meent Neon (https://neon.tech)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Mon, Nov 6, 2023 at 1:28 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > I'm planning on reviewing this patch tomorrow, but in an initial scan > through the patch I noticed there's little information about how the > array keys state machine works in this new design. Do you have a more > toplevel description of the full state machine used in the new design? This is an excellent question. You're entirely right: there isn't enough information about the design of the state machine. In v1 of the patch, from all the way back in July, the "state machine" advanced in the hackiest way possible: via repeated "incremental" advancement (using logic from the function that we call _bt_advance_array_keys() on HEAD) in a loop -- we just kept doing that until the function I'm now calling _bt_tuple_before_array_skeys() eventually reported that the array keys were now sufficiently advanced. v2 greatly improved matters by totally overhauling _bt_advance_array_keys(): it was taught to use binary searches to advance the array keys, with limited remaining use of "incremental" array key advancement. However, version 2 (and all later versions to date) have somewhat wonky state machine transitions, in one important respect: calls to the new _bt_advance_array_keys() won't always advance the array keys to the maximum extent possible (possible while still getting correct behavior, that is). There were still various complicated scenarios involving multiple "required" array keys (SK_BT_REQFWD + SK_BT_REQBKWD scan keys that use BTEqualStrategyNumber), where one single call to _bt_advance_array_keys() would advance the array keys to a point that was still < caller's tuple. AFAICT this didn't cause wrong answers to queries (that would require failing to find a set of exactly matching array keys where a matching set exists), but it was kludgey. It was sloppy in roughly the same way as the approach in my v1 prototype was sloppy (just to a lesser degree). I should be able to post v6 later this week. My current plan is to commit the other nbtree patch first (the backwards scan "boundary cases" one from the ongoing CF) -- since I saw your review earlier today. I think that you should probably wait for this v6 before starting your review. The upcoming version will have simple preconditions and postconditions for the function that advances the array key state machine (the new _bt_advance_array_keys). These are enforced by assertions at the start and end of the function. So the rules for the state machine become crystal clear and fairly easy to keep in your head (e.g., tuple must be >= required array keys on entry and <= required array keys on exit, the array keys must always either advance by one increment or be completely exhausted for the top-level scan in the current scan direction). Unsurprisingly, I found that adding and enforcing these invariants led to a simpler and more general design within _bt_advance_array_keys. That code is still the most complicated part of the patch, but it's much less of a bag of tricks. Another reason for you to hold off for a few more days. -- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Matthias van de Meent
Date:
On Tue, 7 Nov 2023 at 00:03, Peter Geoghegan <pg@bowt.ie> wrote: > > On Mon, Nov 6, 2023 at 1:28 PM Matthias van de Meent > <boekewurm+postgres@gmail.com> wrote: > > I'm planning on reviewing this patch tomorrow, but in an initial scan > > through the patch I noticed there's little information about how the > > array keys state machine works in this new design. Do you have a more > > toplevel description of the full state machine used in the new design? > > This is an excellent question. You're entirely right: there isn't > enough information about the design of the state machine. > > I should be able to post v6 later this week. My current plan is to > commit the other nbtree patch first (the backwards scan "boundary > cases" one from the ongoing CF) -- since I saw your review earlier > today. I think that you should probably wait for this v6 before > starting your review. Okay, thanks for the update, then I'll wait for v6 to be posted. Kind regards, Matthias van de Meent Neon (https://neon.tech)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Tue, Nov 7, 2023 at 4:20 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > On Tue, 7 Nov 2023 at 00:03, Peter Geoghegan <pg@bowt.ie> wrote: > > I should be able to post v6 later this week. My current plan is to > > commit the other nbtree patch first (the backwards scan "boundary > > cases" one from the ongoing CF) -- since I saw your review earlier > > today. I think that you should probably wait for this v6 before > > starting your review. > > Okay, thanks for the update, then I'll wait for v6 to be posted. On second thought, I'll just post v6 now (there won't be conflicts against the master branch once the other patch is committed anyway). Highlights: * Major simplifications to the array key state machine, already described by my recent email. * Added preprocessing of "redundant and contradictory" array elements to _bt_preprocess_array_keys(). This makes the special preprocessing pass just for array keys ("preprocessing preprocessing") within _bt_preprocess_array_keys() make this query into a no-op: select * from tab where a in (180, 345) and a in (230, 300); -- contradictory Similarly, it can make this query only attempt one single primitive index scan for "230": select * from tab where a in (180, 230) and a in (230, 300); -- has redundancies, plus some individual elements contradict each other This duplicates some of what _bt_preprocess_keys can do already. But _bt_preprocess_keys can only do this stuff at the level of individual array elements/primitive index scans. Whereas this works "one level up", allowing preprocessing to see the full picture rather than just seeing the start of one particular primitive index scan. It explicitly works across array keys, saving repeat work inside _bt_preprocess_keys. That could really add up with thousands of array keys and/or multiple SAOPs. (Note that _bt_preprocess_array_keys already does something like this, to deal with SAOP inequalities such as "WHERE my_col >= any (array[1, 2])" -- it's a little surprising that this obvious optimization wasn't part of the original nbtree SAOP patch.) This reminds me: you might want to try breaking the patch by coming up with adversarial cases, Matthias. The patch needs to be able to deal with absurdly large amounts of array keys reasonably well, because it proposes to normalize passing those to the nbtree code. It's especially important that the patch never takes too much time to do something (e.g., binary searching through array keys) while holding a buffer lock -- even with very silly adversarial queries. So, for example, queries like this one (specifically designed to stress the implementation) *need* to work reasonably well: with a as ( select i from generate_series(0, 500000) i ) select count(*), thousand, tenthous from tenk1 where thousand = any (array[(select array_agg(i) from a)]) and tenthous = any (array[(select array_agg(i) from a)]) group by thousand, tenthous order by thousand, tenthous; (You can run this yourself after the regression tests finish, of course.) This takes about 130ms on my machine, hardly any of which takes place in the nbtree code with the patch (think tens of microseconds per _bt_readpage call, at most) -- the plan is an index-only scan that gets only 30 buffer hits. On the master branch, it's vastly slower -- 1000025 buffer hits. The query as a whole takes about 3 seconds there. 
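One hedged way to reproduce those buffer numbers yourself (this is the same query as above, just wrapped in EXPLAIN; run it against the regression test database once the tests finish):

explain (analyze, buffers, costs off)
with a as (
  select i from generate_series(0, 500000) i
)
select count(*), thousand, tenthous
from tenk1
where thousand = any (array[(select array_agg(i) from a)])
  and tenthous = any (array[(select array_agg(i) from a)])
group by thousand, tenthous
order by thousand, tenthous;
-- expect "Buffers: shared hit" on the index-only scan node to be ~30
-- with the patch, versus ~1,000,025 on master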
If you have 3 or 4 SAOPs (with a composite index that has as many columns) you can quite easily DoS the master branch, since the planner makes a generic assumption that each of these SAOPs will have only 10 elements. The planner makes the same assumption with the patch applied, but with one important difference: it doesn't matter to nbtree. The cost of scanning each index page should be practically independent of the total size of each array, at least past a certain point. Similarly, the maximum cost of an index scan should be approximately fixed: it should be capped at the cost of a full index scan (with the added cost of these relatively expensive quals still capped, still essentially independent of array sizes past some point).

I notice that if I remove the "thousand = any (array[(select array_agg(i) from a)]) and" line from the adversarial query, executing the resulting query still gets 30 buffer hits with the patch -- though it only takes 90ms this time (it's faster for reasons that likely have less to do with nbtree overheads than you'd think). This is just another way of getting roughly the same full index scan. That's a completely new way of thinking about nbtree SAOPs from a planner perspective (also from a user's perspective, I suppose).

It's important that the planner's new optimistic assumptions about the cost profile of SAOPs (that it can expect reasonable performance/access patterns with wildly unreasonable/huge/arbitrarily complicated SAOPs) always be met by nbtree -- no repeat index page accesses, no holding a buffer lock for more than (say) a small fraction of 1 millisecond (no matter the complexity of the query), and possibly other things I haven't thought of yet. If you end up finding a bug in this v6, it'll most likely be a case where nbtree fails to live up to that. This project is as much about robust/predictable performance as anything else -- nbtree needs to be able to cope with practically anything. I suggest that your review start by trying to break the patch along these lines. -- Peter Geoghegan
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Tue, Nov 7, 2023 at 5:53 PM Peter Geoghegan <pg@bowt.ie> wrote: > If you end up finding a bug in this v6, it'll most likely be a case > where nbtree fails to live up to that. This project is as much about > robust/predictable performance as anything else -- nbtree needs to be > able to cope with practically anything. I suggest that your review > start by trying to break the patch along these lines. I spent some time on this myself today (which I'd already planned on). Attached is an adversarial stress-test, which shows something that must be approaching the worst case for the patch in terms of time spent with a buffer lock held, due to spending so much time evaluating unusually expensive SAOP index quals. The array binary searches that take place with a buffer lock held aren't quite like anything else that nbtree can do right now, so it's worthy of special attention. I thought of several factors that maximize both the number of binary searches within any given _bt_readpage, as well as the cost of each binary search -- the SQL file has full details. My test query is *extremely* unrealistic, since it combines multiple independent unrealistic factors, all of which aim to make life hard for the implementation in one way or another. I hesitate to say that it couldn't be much worse (I only spent a few hours on this), but I'm prepared to say that it seems very unlikely that any real world query could make the patch spend as many cycles in _bt_readpage/_bt_checkkeys as this one does. Perhaps you can think of some other factor that would make this test case even less sympathetic towards the patch, Matthias? The only thing I thought of that I've left out was the use of a custom btree opclass, "unrealistically_slow_ops". Something that calls pg_usleep in its order proc. (I left it out because it wouldn't prove anything.) On my machine, custom instrumentation shows that each call to _bt_readpage made while this query executes (on a patched server) takes just under 1.4 milliseconds. While that is far longer than it usually takes, it's basically acceptable IMV. It's not significantly longer than I'd expect heap_index_delete_tuples() to take on an average day with EBS (or other network-attached storage). But that's a process that happens all the time, with an exclusive buffer lock held on the leaf page throughout -- whereas this is only a shared buffer lock, and involves a query that's just absurd . Another factor that makes this seem acceptable is just how sensitive the test case is to everything going exactly and perfectly wrong, all at the same time, again and again. The test case uses a 32 column index (the INDEX_MAX_KEYS maximum), with a query that has 32 SAOP clauses (one per index column). If I reduce the number of SAOP clauses in the query to (say) 8, I still have a test case that's almost as silly as my original -- but now we only spend ~225 microseconds in each _bt_readpage call (i.e. we spend over 6x less time in each _bt_readpage call). (Admittedly if I also make the CREATE INDEX use only 8 columns, we can fit more index tuples on one page, leaving us at ~800 microseconds). I'm a little surprised that it isn't a lot worse than this, given how far I went. I was a little concerned that it would prove necessary to lock this kind of thing down at some higher level (e.g., in the planner), but that now looks unnecessary. There are much better ways to DOS the server than this. For example, you could run this same query while forcing a sequential scan! 
That appears to be quite a lot less responsive to interrupts (in addition to being hopelessly slow), probably because it uses parallel workers, each of which will use wildly expensive filter quals that just do a linear scan of the SAOP. -- Peter Geoghegan
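For anyone reading along without the attachment, a much-simplified, hedged sketch of the general shape of this kind of stress test (the real test uses a 32 column index -- the INDEX_MAX_KEYS maximum -- with 32 SAOP clauses and far nastier data; the table, columns, and constants below are invented):

set enable_seqscan = off;

create table wide (
  c1 int, c2 int, c3 int, c4 int,
  c5 int, c6 int, c7 int, c8 int
);
insert into wide
select i % 3, i % 5, i % 7, i % 11, i % 13, i % 17, i % 19, i % 23
from generate_series(1, 100000) i;
create index wide_idx on wide (c1, c2, c3, c4, c5, c6, c7, c8);
vacuum analyze wide;

-- one SAOP per index column, so each _bt_readpage call has to binary
-- search many arrays for many tuples
select count(*)
from wide
where c1 = any (array(select generate_series(0, 200)))
  and c2 = any (array(select generate_series(0, 200)))
  and c3 = any (array(select generate_series(0, 200)))
  and c4 = any (array(select generate_series(0, 200)))
  and c5 = any (array(select generate_series(0, 200)))
  and c6 = any (array(select generate_series(0, 200)))
  and c7 = any (array(select generate_series(0, 200)))
  and c8 = any (array(select generate_series(0, 200)));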
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Matthias van de Meent
Date:
On Fri, 10 Nov 2023 at 00:58, Peter Geoghegan <pg@bowt.ie> wrote: > On Tue, Nov 7, 2023 at 5:53 PM Peter Geoghegan <pg@bowt.ie> wrote: > > If you end up finding a bug in this v6, it'll most likely be a case > > where nbtree fails to live up to that. This project is as much about > > robust/predictable performance as anything else -- nbtree needs to be > > able to cope with practically anything. I suggest that your review > > start by trying to break the patch along these lines. > > I spent some time on this myself today (which I'd already planned on). > > Attached is an adversarial stress-test, which shows something that > must be approaching the worst case for the patch in terms of time > spent with a buffer lock held, due to spending so much time evaluating > unusually expensive SAOP index quals. The array binary searches that > take place with a buffer lock held aren't quite like anything else > that nbtree can do right now, so it's worthy of special attention. > > I thought of several factors that maximize both the number of binary > searches within any given _bt_readpage, as well as the cost of each > binary search -- the SQL file has full details. My test query is > *extremely* unrealistic, since it combines multiple independent > unrealistic factors, all of which aim to make life hard for the > implementation in one way or another. I hesitate to say that it > couldn't be much worse (I only spent a few hours on this), but I'm > prepared to say that it seems very unlikely that any real world query > could make the patch spend as many cycles in > _bt_readpage/_bt_checkkeys as this one does. > > Perhaps you can think of some other factor that would make this test > case even less sympathetic towards the patch, Matthias? The only thing > I thought of that I've left out was the use of a custom btree opclass, > "unrealistically_slow_ops". Something that calls pg_usleep in its > order proc. (I left it out because it wouldn't prove anything.) Have you tried using text index columns that are sorted with non-default locales? I've seen non-default locales use significantly more resources during compare operations than any other ordering operation I know of (which has mostly been in finding the locale), and use it extensively to test improvements for worst index shapes over at my btree patchsets because locales are dynamically loaded in text compare and nondefault locales are not cached at all. I suspect that this would be even worse if a somehow even worse locale path is available than what I'm using for test right now; this could be the case with complex custom ICU locales. > On my machine, custom instrumentation shows that each call to > _bt_readpage made while this query executes (on a patched server) > takes just under 1.4 milliseconds. While that is far longer than it > usually takes, it's basically acceptable IMV. It's not significantly > longer than I'd expect heap_index_delete_tuples() to take on an > average day with EBS (or other network-attached storage). But that's a > process that happens all the time, with an exclusive buffer lock held > on the leaf page throughout -- whereas this is only a shared buffer > lock, and involves a query that's just absurd . > > Another factor that makes this seem acceptable is just how sensitive > the test case is to everything going exactly and perfectly wrong, all > at the same time, again and again. The test case uses a 32 column > index (the INDEX_MAX_KEYS maximum), with a query that has 32 SAOP > clauses (one per index column). 
> If I reduce the number of SAOP clauses in the query to (say) 8, I
> still have a test case that's almost as silly as my original -- but
> now we only spend ~225 microseconds in each _bt_readpage call (i.e.
> we spend over 6x less time in each _bt_readpage call). (Admittedly if
> I also make the CREATE INDEX use only 8 columns, we can fit more
> index tuples on one page, leaving us at ~800 microseconds).

A quick update of the table definition -- using the various installed 'fr-%-x-icu' locales on text hash columns (a different collation for each column) instead of numeric -- gets me to EXPLAIN (ANALYZE) showing 2.07ms spent on every buffer hit inside the index scan node, as opposed to 1.76ms when using numeric. But, as you mention, the value of this metric is probably not very high.

As for the patch itself, I'm probably about 50% through the patch now. While reviewing, I noticed the following two user-visible items, related to SAOP but not broken by or touched upon in this patch:

1. We don't seem to plan `column opr ALL (...)` as index conditions, while this should be trivial to optimize for at least btree. Example:

SET enable_bitmapscan = OFF;
WITH a AS (select generate_series(1, 1000) a)
SELECT * FROM tenk1
WHERE thousand = ANY (array(table a))
  AND thousand < ALL (array(table a));

This will never return any rows, but it does hit 9990 buffers in the new btree code, while I expected that to be 0 buffers based on the query and index (that is, I expected to hit 0 buffers, until I realized that we don't push ALL into index filters). I shall assume ALL isn't used all that often (heh), but it sure feels like we're missing out on performance here.

2. We also don't seem to support array keys for row compares, which probably is an even more niche use case:

SELECT count(*) FROM tenk1
WHERE (thousand, tenthous) = ANY (ARRAY[(1, 1), (1, 2), (2, 1)]);

This is no different from master, but it'd be nice if there was support for arrays of row operations too, just so that composite primary keys can also be looked up with SAOPs.

Kind regards, Matthias van de Meent
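As a hedged sketch of the locale idea mentioned above (this assumes an ICU-enabled build where the fr-*-x-icu collations were imported at initdb time; the table and column names are invented):

create table loc_test (
  c1 text collate "fr-FR-x-icu",
  c2 text collate "fr-BE-x-icu"
);
insert into loc_test
select md5(i::text), md5((i + 1)::text)
from generate_series(1, 100000) i;
create index loc_test_idx on loc_test (c1, c2);
vacuum analyze loc_test;

-- SAOP quals on c1/c2 now pay the nondefault-locale comparison cost in
-- every binary search that _bt_readpage performs with the buffer lock held
select count(*)
from loc_test
where c1 = any (array(select md5(i::text) from generate_series(1, 500) i))
  and c2 = any (array(select md5(i::text) from generate_series(1, 500) i));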
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Matthias van de Meent
Date:
On Wed, 8 Nov 2023 at 02:53, Peter Geoghegan <pg@bowt.ie> wrote: > > On Tue, Nov 7, 2023 at 4:20 AM Matthias van de Meent > <boekewurm+postgres@gmail.com> wrote: > > On Tue, 7 Nov 2023 at 00:03, Peter Geoghegan <pg@bowt.ie> wrote: > > > I should be able to post v6 later this week. My current plan is to > > > commit the other nbtree patch first (the backwards scan "boundary > > > cases" one from the ongoing CF) -- since I saw your review earlier > > > today. I think that you should probably wait for this v6 before > > > starting your review. > > > > Okay, thanks for the update, then I'll wait for v6 to be posted. > > On second thought, I'll just post v6 now (there won't be conflicts > against the master branch once the other patch is committed anyway). Thanks. Here's my review of the btree-related code: > +++ b/src/backend/access/nbtree/nbtsearch.c > @@ -1625,8 +1633,9 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum) > * set flag to true if all required keys are satisfied and false > * otherwise. > */ > - (void) _bt_checkkeys(scan, itup, indnatts, dir, > - &requiredMatchedByPrecheck, false); > + _bt_checkkeys(scan, &pstate, itup, false, false); > + requiredMatchedByPrecheck = pstate.continuescan; > + pstate.continuescan = true; /* reset */ The comment above the updated section needs to be updated. > @@ -1625,8 +1633,9 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum) > * set flag to true if all required keys are satisfied and false > * otherwise. > */ > - (void) _bt_checkkeys(scan, itup, indnatts, dir, > - &requiredMatchedByPrecheck, false); > + _bt_checkkeys(scan, &pstate, itup, false, false); This 'false' finaltup argument is surely wrong for the rightmost page's rightmost tuple, no? > +++ b/src/backend/access/nbtree/nbtutils.c > @@ -357,6 +431,46 @@ _bt_preprocess_array_keys(IndexScanDesc scan) > + /* We could pfree(elem_values) after, but not worth the cycles */ > + num_elems = _bt_merge_arrays(scan, cur, > + (indoption[cur->sk_attno - 1] & INDOPTION_DESC) != 0, > + prev->elem_values, prev->num_elems, > + elem_values, num_elems); This code can get hit several times when there are multiple = ANY clauses, which may result in repeated leakage of these arrays during this scan. I think cleaning up may well be worth the cycles when the total size of the arrays is large enough. > @@ -496,6 +627,48 @@ _bt_sort_array_elements(IndexScanDesc scan, ScanKey skey, > _bt_compare_array_elements, &cxt); > +_bt_merge_arrays(IndexScanDesc scan, ScanKey skey, bool reverse, > + Datum *elems_orig, int nelems_orig, > + Datum *elems_next, int nelems_next) > [...] > + /* > + * Incrementally copy the original array into a temp buffer, skipping over > + * any items that are missing from the "next" array > + */ Given that we only keep the members that both arrays have in common, the result array will be a strict subset of the original array. So, I don't quite see why we need the temporary buffer here - we can reuse the entries of the elems_orig array that we've already compared against the elems_next array. We may want to optimize this further by iterating over only the smallest array: With the current code, [1, 2] + [1....1000] is faster to merge than [1..1000] + [1000, 1001], because 2 * log(1000) is much smaller than 1000*log(2). In practice this may matter very little, though. An even better optimized version would do a merge join on the two arrays, rather than loop + binary search. 
> @@ -515,6 +688,161 @@ _bt_compare_array_elements(const void *a, const void *b, void *arg) > [...] > +_bt_binsrch_array_skey(FmgrInfo *orderproc, Is there a reason for this complex initialization of high/low_elem, rather than the this easier to understand and more compact initialization?: + low_elem = 0; + high_elem = array->num_elems - 1; + if (cur_elem_start) + { + if (ScanDirectionIsForward(dir)) + low_elem = array->cur_elem; + else + high_elem = array->cur_elem; + } > @@ -661,20 +1008,691 @@ _bt_restore_array_keys(IndexScanDesc scan) > [...] > + _bt_array_keys_remain(IndexScanDesc scan, ScanDirection dir) > [...] > + if (scan->parallel_scan != NULL) > + _bt_parallel_done(scan); > + > + /* > + * No more primitive index scans. Terminate the top-level scan. > + */ > + return false; I think the conditional _bt_parallel_done(scan) feels misplaced here, as the comment immediately below indicates the scan is to be terminated after that comment. So, either move this _bt_parallel_done call outside the function (which by name would imply it is read-only, without side effects like this) or move it below the comment "terminate the top-level scan". > +_bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate, > [...] > + * Set up ORDER 3-way comparison function and array state > [...] > + * Optimization: Skip over non-required scan keys when we know that These two sections should probably be swapped, as the skip makes the setup useless. Also, the comment here is wrong; the scan keys that are skipped are 'required', not 'non-required'. > +++ b/src/test/regress/expected/join.out > @@ -8620,10 +8620,9 @@ where j1.id1 % 1000 = 1 and j2.id1 % 1000 = 1 and j2.id1 >= any (array[1,5]); > Merge Cond: (j1.id1 = j2.id1) > Join Filter: (j2.id2 = j1.id2) > -> Index Scan using j1_id1_idx on j1 > - -> Index Only Scan using j2_pkey on j2 > + -> Index Scan using j2_id1_idx on j2 > Index Cond: (id1 >= ANY ('{1,5}'::integer[])) > - Filter: ((id1 % 1000) = 1) > -(7 rows) > +(6 rows) I'm a bit surprised that we don't have the `id1 % 1000 = 1` filter anymore. The output otherwise matches (quite possibly because the other join conditions don't match) and I don't have time to investigate the intricacies between IOS vs normal IS, but this feels off. ---- As for the planner changes, I don't think I'm familiar enough with the planner to make any authorative comments on this. However, it does look like you've changed the meaning of 'amsearcharray', and I'm not sure it's OK to assume all indexes that support amsearcharray will also support for this new assumption of ordered retrieval of SAOPs. For one, the pgroonga extension [0] does mark amcanorder+amsearcharray. Kind regards, Matthias van de Meent Neon (https://neon.tech) [0] https://github.com/pgroonga/pgroonga/blob/115414723c7eb8ce9eb667da98e008bd10fbae0a/src/pgroonga.c#L8782-L8788
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Sat, Nov 11, 2023 at 1:08 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > Thanks. Here's my review of the btree-related code: Attached is v7. The main focus in v7 is making the handling of required inequality-strategy scan keys more explicit -- now there is an understanding of inequalities shared by _bt_check_compare (the function that becomes the guts of _bt_checkkeys) and the new _bt_advance_array_keys function/state machine. The big idea for v7 is to generalize how we handle required equality-strategy scan keys (always required in both scan directions), extending the same concept to deal with required inequality strategy scan keys (only ever required in one direction, which may or may not be the scan direction). This led to my discovering and fixing a couple of bugs related to inequality handling. These issues were of the same general character as many others I've dealt with before now: they involved subtle confusion about when and how to start another primitive index scan, leading to the scan reading many more pages than strictly necessary (potentially many more than master). In other words, cases where we didn't give up and start another primitive index scan, even though (with a repro of the issue) it's obviously not sensible. An accidental full index scan. While I'm still not completely happy with the way that inequalities are handled, things in this area are much improved in v7. It should be noted that the patch isn't strictly guaranteed to always read fewer index pages than master, for a given query plan and index. This is by design. Though the patch comes close, it's not quite a certainty. There are known cases where the patch reads the occasional extra page (relative to what master would have done under the same circumstances). These are cases where the implementation just cannot know for sure whether the next/sibling leaf page has key space covered by any of the scan's array keys (at least not in a way that seems practical). The implementation has simple heuristics that infer (a polite word for "make an educated guess") about what will be found on the next page. Theoretically we could be more conservative in how we go about this, but that seems like a bad idea to me. It's really easy to find cases where the maximally conservative approach loses by a lot, and really hard to show cases where it wins at all. These heuristics are more or less a limited form of the heuristics that skip scan would need. A *very* limited form. We're still conservative. Here's how it works, at a high level: if the scan can make it all the way to the end of the page without having to start a new primitive index scan (before reaching the end), and then finds that "finaltup" itself (which is usually the page high key) advances the array keys, we speculate: we move on to the sibling page. It's just about possible that we'll discover (once on the next page) that finaltup actually advanced the array keys by so much (in one single advancement step) that the current/new keys cover key space beyond the sibling page we just arrived at. The sibling page access will have been wasted (though I prefer to think of it as a cost of doing business). I go into a lot of detail on the trade-offs in this area in comments at the end of the new _bt_checkkeys(), just after it calls _bt_advance_array_keys(). Hopefully this is reasonably clear. It's always much easier to understand these things when you've written lots of test cases, though. 
So I wouldn't at all be surprised to hear that my explanation needs more work. I suspect that I'm spending more time on the topic than it actually warrants, but you have to spend a lot of time on it for yourself to be able to see why that is. > > +++ b/src/backend/access/nbtree/nbtsearch.c > > @@ -1625,8 +1633,9 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum) > > * set flag to true if all required keys are satisfied and false > > * otherwise. > > */ > > - (void) _bt_checkkeys(scan, itup, indnatts, dir, > > - &requiredMatchedByPrecheck, false); > > + _bt_checkkeys(scan, &pstate, itup, false, false); > > + requiredMatchedByPrecheck = pstate.continuescan; > > + pstate.continuescan = true; /* reset */ > > The comment above the updated section needs to be updated. Updated. > > @@ -1625,8 +1633,9 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum) > > * set flag to true if all required keys are satisfied and false > > * otherwise. > > */ > > - (void) _bt_checkkeys(scan, itup, indnatts, dir, > > - &requiredMatchedByPrecheck, false); > > + _bt_checkkeys(scan, &pstate, itup, false, false); > > This 'false' finaltup argument is surely wrong for the rightmost > page's rightmost tuple, no? Not in any practical sense. Since finaltup means "the tuple that you should use to decide whether to go to the next page or not", and a rightmost page doesn't have a next page. There are exactly two ways that the top-level scan can end (not to be confused with the primitive scan), at least in v7. They are: 1. The state machine can exhaust the scan's array keys, ending the top-level scan. 2. The scan can just run out of pages, without ever running out of array keys (some array keys can sort higher than any real value from the index). This is just like how an index scan ends when it lacks any required scan keys to terminate the scan, and eventually runs out of pages to scan (think of an index-only scan that performs a full scan of the index, feeding into a group aggregate). Note that it wouldn't be okay if the design relied on _bt_checkkeys advancing and exhausting the array keys -- we really do need both 1 and 2 to deal with various edge cases. For example, there is no way that we'll ever be able to call _bt_checkkeys with a completely empty index. It simply doesn't have any tuples at all. In fact, it doesn't even have any pages (apart from the metapage), so clearly we can't expect any calls to _bt_readpage (much less _bt_checkkeys). > > +++ b/src/backend/access/nbtree/nbtutils.c > > @@ -357,6 +431,46 @@ _bt_preprocess_array_keys(IndexScanDesc scan) > > + /* We could pfree(elem_values) after, but not worth the cycles */ > > + num_elems = _bt_merge_arrays(scan, cur, > > + (indoption[cur->sk_attno - 1] & INDOPTION_DESC) != 0, > > + prev->elem_values, prev->num_elems, > > + elem_values, num_elems); > > This code can get hit several times when there are multiple = ANY > clauses, which may result in repeated leakage of these arrays during > this scan. I think cleaning up may well be worth the cycles when the > total size of the arrays is large enough. They won't leak because the memory is allocated in the same dedicated memory context. That said, I added a pfree(). It couldn't hurt. > > @@ -496,6 +627,48 @@ _bt_sort_array_elements(IndexScanDesc scan, ScanKey skey, > > _bt_compare_array_elements, &cxt); > > +_bt_merge_arrays(IndexScanDesc scan, ScanKey skey, bool reverse, > > + Datum *elems_orig, int nelems_orig, > > + Datum *elems_next, int nelems_next) > > [...] 
> > + /* > > + * Incrementally copy the original array into a temp buffer, skipping over > > + * any items that are missing from the "next" array > > + */ > > Given that we only keep the members that both arrays have in common, > the result array will be a strict subset of the original array. So, I > don't quite see why we need the temporary buffer here - we can reuse > the entries of the elems_orig array that we've already compared > against the elems_next array. This code path is only hit when the query was written on autopilot, since it must have contained redundant SAOPs for the same index column -- a glaring inconsistency. Plus these arrays just aren't very big in practice (despite my concerns about huge arrays). Plus there is only one of these array-specific preprocessing steps per btrescan. So I don't think that it's worth going to too much trouble here. > We may want to optimize this further by iterating over only the > smallest array: With the current code, [1, 2] + [1....1000] is faster > to merge than [1..1000] + [1000, 1001], because 2 * log(1000) is much > smaller than 1000*log(2). In practice this may matter very little, > though. > An even better optimized version would do a merge join on the two > arrays, rather than loop + binary search. v7 allocates the temp buffer using the size of whatever array is the smaller of the two, just because it's an easy marginal improvement. > > @@ -515,6 +688,161 @@ _bt_compare_array_elements(const void *a, const void *b, void *arg) > > [...] > > +_bt_binsrch_array_skey(FmgrInfo *orderproc, > > Is there a reason for this complex initialization of high/low_elem, > rather than the this easier to understand and more compact > initialization?: > > + low_elem = 0; > + high_elem = array->num_elems - 1; > + if (cur_elem_start) > + { > + if (ScanDirectionIsForward(dir)) > + low_elem = array->cur_elem; > + else > + high_elem = array->cur_elem; > + } I agree that it's better your way. Done that way in v7. > I think the conditional _bt_parallel_done(scan) feels misplaced here, > as the comment immediately below indicates the scan is to be > terminated after that comment. So, either move this _bt_parallel_done > call outside the function (which by name would imply it is read-only, > without side effects like this) or move it below the comment > "terminate the top-level scan". v7 moves the comment up, so that it's just before the _bt_parallel_done() call. > > +_bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate, > > [...] > > + * Set up ORDER 3-way comparison function and array state > > [...] > > + * Optimization: Skip over non-required scan keys when we know that > > These two sections should probably be swapped, as the skip makes the > setup useless. Not quite: we need to increment arrayidx for later loop iterations/scan keys. > Also, the comment here is wrong; the scan keys that are skipped are > 'required', not 'non-required'. Agreed. Fixed. > > +++ b/src/test/regress/expected/join.out > > @@ -8620,10 +8620,9 @@ where j1.id1 % 1000 = 1 and j2.id1 % 1000 = 1 and j2.id1 >= any (array[1,5]); > > Merge Cond: (j1.id1 = j2.id1) > > Join Filter: (j2.id2 = j1.id2) > > -> Index Scan using j1_id1_idx on j1 > > - -> Index Only Scan using j2_pkey on j2 > > + -> Index Scan using j2_id1_idx on j2 > > Index Cond: (id1 >= ANY ('{1,5}'::integer[])) > > - Filter: ((id1 % 1000) = 1) > > -(7 rows) > > +(6 rows) > > I'm a bit surprised that we don't have the `id1 % 1000 = 1` filter > anymore. 
The output otherwise matches (quite possibly because the > other join conditions don't match) and I don't have time to > investigate the intricacies between IOS vs normal IS, but this feels > off. This happens because the new plan uses a completely different index -- which happens to be a partial index whose predicate exactly matches the old plan's filter quals. That factor makes the filter quals unnecessary. That's all this is. > As for the planner changes, I don't think I'm familiar enough with the > planner to make any authorative comments on this. However, it does > look like you've changed the meaning of 'amsearcharray', and I'm not > sure it's OK to assume all indexes that support amsearcharray will > also support for this new assumption of ordered retrieval of SAOPs. > For one, the pgroonga extension [0] does mark > amcanorder+amsearcharray. The changes that I've made to the planner are subtractive. We more or less go back to how things were just after the initial work on nbtree amsearcharray support. That work was (at least tacitly) assumed to have no impact on ordered scans. Because why should it? What other type of index clause has ever affected what seems like a rather unrelated thing (namely the sort order of the scan)? The oversight was understandable. The kinds of plans that master cannot produce output for in standard index order are really silly plans, independent of this issue; it makes zero sense to allow a non-required array scan key to affect how or when we skip. The code that I'm removing from the planner is code that quite obviously assumes nbtree-like behavior. So I'm taking away code like that, rather than adding new code like that. That said, I am really surprised that any extension creates an index AM amcanorder=true (not to be confused with amcanorderbyop=true, which is less surprising). That means that it promises the planner that it behaves just like nbtree. To quote the docs, it must have "btree-compatible strategy numbers for their [its] equality and ordering operators". Is that really something that pgroonga even attempts? And if so, why? I also find it bizarre that pgroonga's handler-stated capabilities include "amcanunique=true". So pgroonga is a full text search engine, but also supports unique indexes? I find that particularly hard to believe, and suspect that the way that they set things up in the AM handler just isn't very well thought out. -- Peter Geoghegan
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Heikki Linnakangas
Date:
On 21/11/2023 04:52, Peter Geoghegan wrote: > Attached is v7. First, some high-level reactions before looking at the patch very closely: - +1 on the general idea. Hard to see any downsides if implemented right. - This changes the meaning of amsearcharray==true to mean that the ordering is preserved with ScalarArrayOps, right? You change B-tree to make that true, but what about any out-of-tree index AM extensions? I don't know if any such extensions exist, and I don't think we should jump through any hoops to preserve backwards compatibility here, but probably deserves a notice in the release notes if nothing else. - You use the term "primitive index scan" a lot, but it's not clear to me what it means. Does one ScalarArrayOps turn into one "primitive index scan"? Or each element in the array turns into a separate primitive index scan? Or something in between? Maybe add a new section to the README explain how that works. - _bt_preprocess_array_keys() is called for each btrescan(). It performs a lot of work like cmp function lookups and desconstructing and merging the arrays, even if none of the SAOP keys change in the rescan. That could make queries with nested loop joins like this slower than before: "select * from generate_series(1, 50) g, tenk1 WHERE g = tenk1.unique1 and tenk1.two IN (1,2);". - nbtutils.c is pretty large now. Perhaps create a new file nbtpreprocesskeys.c or something? - You and Matthias talked about an implicit state machine. I wonder if this could be refactored to have a more explicit state machine. The state transitions and interactions between _bt_checkkeys(), _bt_advance_array_keys() and friends feel complicated. And then some details: > --- a/doc/src/sgml/monitoring.sgml > +++ b/doc/src/sgml/monitoring.sgml > @@ -4035,6 +4035,19 @@ description | Waiting for a newly initialized WAL file to reach durable storage > </para> > </note> > > + <note> > + <para> > + Every time an index is searched, the index's > + <structname>pg_stat_all_indexes</structname>.<structfield>idx_scan</structfield> > + field is incremented. This usually happens once per index scan node > + execution, but might take place several times during execution of a scan > + that searches for multiple values together. Only queries that use certain > + <acronym>SQL</acronym> constructs to search for rows matching any value > + out of a list (or an array) of multiple scalar values are affected. See > + <xref linkend="functions-comparisons"/> for details. > + </para> > + </note> > + Is this true even without this patch? Maybe commit this separately. The "Only queries ..." sentence feels difficult. Maybe something like "For example, queries using IN (...) or = ANY(...) constructs.". > * _bt_preprocess_keys treats each primitive scan as an independent piece of > * work. That structure pushes the responsibility for preprocessing that must > * work "across array keys" onto us. This division of labor makes sense once > * you consider that we're typically called no more than once per btrescan, > * whereas _bt_preprocess_keys is always called once per primitive index scan. "That structure ..." is a garden-path sentence. I kept parsing "that must work" as one unit, the same way as "that structure", and it didn't make sense. Took me many re-reads to parse it correctly. Now that I get it, it doesn't bother me anymore, but maybe it could be rephrased. Is there _any_ situation where _bt_preprocess_array_keys() is called more than once per btrescan? > /* > * Look up the appropriate comparison operator in the opfamily. 
> * > * Note: it's possible that this would fail, if the opfamily is > * incomplete, but it seems quite unlikely that an opfamily would omit > * non-cross-type comparison operators for any datatype that it supports > * at all. ... > */ I agree that's unlikely. I cannot come up with an example where you would have cross-type operators between A and B, but no same-type operators between B and B. For any real-world opfamily, that would be an omission you'd probably want to fix. Still I wonder if we could easily fall back if it doesn't exist? And maybe add a check in the 'opr_sanity' test for that. In _bt_readpage(): > /* > * Prechecking the page with scan keys required for direction scan. We > * check these keys with the last item on the page (according to our scan > * direction). If these keys are matched, we can skip checking them with > * every item on the page. Scan keys for our scan direction would > * necessarily match the previous items. Scan keys required for opposite > * direction scan are already matched by the _bt_first() call. > * > * With the forward scan, we do this check for the last item on the page > * instead of the high key. It's relatively likely that the most > * significant column in the high key will be different from the > * corresponding value from the last item on the page. So checking with > * the last item on the page would give a more precise answer. > * > * We skip this for the first page in the scan to evade the possible > * slowdown of point queries. Never apply the optimization with a scans > * that uses array keys, either, since that breaks certain assumptions. > * (Our search-type scan keys change whenever _bt_checkkeys advances the > * arrays, invalidating any precheck. Tracking all that would be tricky.) > */ > if (!so->firstPage && !numArrayKeys && minoff < maxoff) > { It's sad to disable this optimization completely for array keys. It's actually a regression from current master, isn't it? There's no fundamental reason we couldn't do it for array keys so I think we should do it. _bt_checkkeys() is called in an assertion in _bt_readpage, but it has the side-effect of advancing the array keys. Side-effects from an assertion seems problematic. Vague idea: refactor _bt_checkkeys() into something that doesn't have side-effects, and have a separate function or an argument to _bt_checkkeys() to advance to next array key. The prechecking optimization and the Assertion could both use the side-effect-free function. -- Heikki Linnakangas Neon (https://neon.tech)
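As a hedged sketch of the opr_sanity-style check suggested above (an illustration of the idea only, not a proposed test addition; it uses the btree comparison support function as a proxy for the comparison operators the patch looks up):

select opf.opfname,
       p.amproclefttype::regtype as lefttype,
       p.amprocrighttype::regtype as righttype
from pg_amproc p
join pg_opfamily opf on opf.oid = p.amprocfamily
join pg_am am on am.oid = opf.opfmethod
where am.amname = 'btree'
  and p.amprocnum = 1
  and p.amproclefttype <> p.amprocrighttype
  and not exists (
    select 1
    from pg_amproc p2
    where p2.amprocfamily = p.amprocfamily
      and p2.amprocnum = 1
      and p2.amproclefttype = p.amprocrighttype
      and p2.amprocrighttype = p.amprocrighttype
  );
-- an empty result means every btree opfamily that has cross-type
-- comparison support for some right-hand type also has the same-type
-- support function for that type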
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Tomas Vondra
Date:
On 11/21/23 03:52, Peter Geoghegan wrote:
> On Sat, Nov 11, 2023 at 1:08 PM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
>> Thanks. Here's my review of the btree-related code:
>
> Attached is v7.

I haven't looked at the code, but I decided to do a bit of blackbox perf and stress testing, to get some feeling of what to expect in terms of performance improvements, and see if there happen to be some unexpected regressions. Attached are a couple of simple bash scripts doing a brute-force test with tables of different sizes / data distributions, different numbers of values in the SAOP expression, etc., plus a PDF comparing the results between master and a build with the patch applied. The first group of columns is master, then patched, and then the (patched/master) comparison, with green=faster, red=slower. The columns are for different numbers of values in the SAOP condition.

Overall, the results look pretty good, with consistent speedups of up to ~30% for large numbers of values (SAOP with 1000 elements). There are a couple of blips where the performance regresses, also by up to ~30%. It's too regular to be a random variation (it affects different combinations of parameters and table sizes), but it seems to only affect one of the two machines I used for testing. Interestingly enough, it's the newer one.

I'm not convinced this is a problem we have to solve. It's possible it only affects cases that are implausible in practice (the script forces a particular scan type, and maybe it would not be picked in practice). But maybe it's fixable ...

regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
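The real scripts are in the attachment; purely as a hedged illustration of the kind of probe being timed (the table name, sizes, and settings here are invented, not taken from the scripts):

set enable_seqscan = off;
set enable_bitmapscan = off;

create table t (a int, b int);
insert into t select i % 10000, i from generate_series(1, 1000000) i;
create index t_a_idx on t (a);
vacuum analyze t;

-- one timed probe: an index scan with a 1000-element SAOP built from
-- random values
select count(*)
from t
where a = any (array(select (random() * 10000)::int
                     from generate_series(1, 1000)));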
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Tue, Nov 28, 2023 at 4:29 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> I haven't looked at the code, but I decided to do a bit of blackbox perf
> and stress testing, to get some feeling of what to expect in terms of
> performance improvements, and see if there happen to be some unexpected
> regressions. Attached are a couple of simple bash scripts doing a
> brute-force test with tables of different sizes / data distributions,
> different numbers of values in the SAOP expression, etc.

My own stress-testing has focussed on the two obvious extremes for this patch, using variants of pgbench with SAOP SELECTs on pgbench_accounts:

1. The case where there is almost no chance of finding any two index tuples together on the same page, because the constants are completely random. This workload makes the patch's attempts at "coalescing together" index page reads pure overhead, with no possible benefit. Obviously that's a cost that needs to be kept under control.

2. The case where there are 255 tuples with distinct values that are clustered together (both in the key space and in physical index pages). Usually they'll span two index pages, but they might all fit together on one index page, allowing us to descend to it directly and read it only once.

With 32 clients, I typically see a regression of about 1.5% for the first case relative to master, measured in throughput/TPS. The second case typically sees throughput that's ~4.8x master (i.e. a ~380% increase). I consider both of these extremes to be fairly unrealistic. With fewer array constants, the speed-ups you'll see in sympathetic cases are still very significant, but nothing like 4.8x. They're more like the 30% numbers that you saw.

As you know, I'm not actually all that excited about cases like 2 -- it's not where users are likely to benefit from the patch. The truly interesting cases are those cases where we can completely avoid heap accesses in the first place (not just *repeat* accesses to the same index pages), due to the patch's ability to consistently use index quals rather than filter quals. It's not that hard to show cases where there are 100x+ fewer pages accessed -- often in cases that have very few array constants. It's just that these cases aren't that interesting from a performance validation point of view -- it's obvious that filter quals are terrible.

Another thing that the patch does particularly well on is cases where the array keys don't have any matches at all, but there is significant clustering (relatively common when multiple SAOPs are used as index quals, which becomes far more likely due to the planner changes). We don't just skip over parts of the index that aren't relevant -- we also skip over parts of the arrays that aren't relevant. Some of my adversarial test cases that take ~1 millisecond for the patch to execute will practically take forever on master (I had to have my test suite not run those tests against master). You just need lots of array keys. What master does on those adversarial cases with billions of distinct combinations of array keys (not that unlikely if there are 4 or 5 SAOPs with mere hundreds or thousands of array keys each) is so inefficient that we might as well call it infinitely slower than the patch.

This is interesting to me from a performance robustness/stability point of view. The slowest kind of SAOP index scan with the patch becomes a full index scan -- just as it would be if we were using any type of non-SAOP qual before now. The worst case is a lot easier to reason about.
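To make those two extremes a bit more concrete, hedged sketches of the two query shapes (these are simplified stand-ins, not the exact pgbench scripts that were used):

-- extreme 1: constants chosen at random across the whole key space, so
-- practically no two matches ever land on the same leaf page
select abalance from pgbench_accounts
where aid = any (array[18291, 902144, 37, 466012, 733890]);

-- extreme 2: a dense run of 255 adjacent keys, which typically fits on
-- one or two leaf pages that the patch can read just once
select abalance from pgbench_accounts
where aid = any (array(select generate_series(500000, 500254)));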
> I'm not convinced this is a problem we have to solve. It's possible it > only affects cases that are implausible in practice (the script forces a > particular scan type, and maybe it would not be picked in practice). But > maybe it's fixable ... I would expect the patch to do quite well (relative to what is actually possible) on cases like the two extremes that I've focussed on so far. It seems possible that it will do less well on cases that are somewhere in the middle (that also have lots of distinct values on each page). We effectively do a linear search on a page that we know has at least one more match (following a precheck that uses the high key). We hope that the next match (for the next array value) closely follows an initial match. But what if there are only 2 or 3 matches on each leaf page, that are spaced relatively far apart? You're going to have to grovel through the whole page. It's not obvious that that's a problem to be fixed -- we're still only descending the index once and still only locking the leaf page once, so we'll probably still win relative to master. And it's not that easy to imagine beating a linear search -- it's not like there is just one "next value" to search for in these cases. But it's something that deserves further consideration. -- Peter Geoghegan
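PS: If anybody wants to construct the kind of "somewhere in the middle" case described above, something along these lines ought to do it. The table and the spacing are invented purely for illustration -- the point is just that each leaf page ends up with only a couple of matches, spaced far apart within the page:

-- With very roughly 350-400 index tuples per 8KB leaf page, array constants
-- spaced ~150 apart leave only 2 or 3 matches on any given leaf page
CREATE TABLE middling AS SELECT g AS k FROM generate_series(1, 1000000) g;
CREATE INDEX ON middling (k);
VACUUM ANALYZE middling;
SELECT count(*) FROM middling
WHERE k = ANY (ARRAY(SELECT g FROM generate_series(1, 1000000, 150) g));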
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Tue, Nov 28, 2023 at 9:19 AM Peter Geoghegan <pg@bowt.ie> wrote: > > I'm not convinced this is a problem we have to solve. It's possible it > > only affects cases that are implausible in practice (the script forces a > > particular scan type, and maybe it would not be picked in practice). But > > maybe it's fixable ... > > I would expect the patch to do quite well (relative to what is > actually possible) on cases like the two extremes that I've focussed > on so far. It seems possible that it will do less well on cases that > are somewhere in the middle (that also have lots of distinct values on > each page). Actually, I think that it's more likely that the problems that you saw are related to low cardinality data, which seems like it might not be a great fit for the heuristics that the patch uses to decide whether to continue the ongoing primitive index scan, or start a new one instead. I'm referring to the heuristics I describe here: https://postgr.es/m/CAH2-WzmTHoCsOmSgLg=yyft9LoERtuCKXyG2GZn+28PzonFA_g@mail.gmail.com The patch itself discusses these heuristics in a large comment block after the point that _bt_checkkeys() calls _bt_advance_array_keys(). I hardly paid any attention to low cardinality data in my performance validation work -- it was almost always indexes that had few or no duplicates (just pgbench_accounts if we're talking pure stress-tests), just because those are more complicated, and so seemed more important. I'm not quite prepared to say that there is definitely a problem here, right this minute, but if there was then it wouldn't be terribly surprising (the problems are usually wherever it is that I didn't look for them). Attached is a sample of my debug instrumentation for one such query, based on running the test script that Tomas posted -- thanks for writing this script, Tomas (I'll use it as the basis for some of my own performance validation work going forward). I don't mind sharing the patch that outputs this stuff if anybody is interested (it's kind of a monstrosity, so I'm disinclined to post it with the patch until I have a reason). Even without this instrumentation, you can get some idea of the kinds of issues I'm talking about just by viewing EXPLAIN ANALYZE output for a bitmap index scan -- that breaks out the index page accesses separately, which is a number that we should expect to remain lower than what the master branch shows in approximately all cases. While I still think that we need heuristics that apply speculative criteria to decide whether or not going to the next page directly (when we have a choice at all), that doesn't mean that the v7 heuristics can't be improved on, with a little more thought. It's a bit tricky, since we're probably also benefiting from the same heuristics all the time -- probably even for this same test case. We do lose against the master branch, on balance, and by enough to concern me, though. (I don't want to promise that it'll never happen at all, but it should be very limited, which this wasn't.) I didn't bother to ascertain how much longer it takes to execute the query, since that question seems rather beside the point. The important thing to me is whether or not this behavior actually makes sense, all things considered, and what exactly can be done about it if it doesn't make sense. I will need to think about this some more. This is just a status update. Thanks -- Peter Geoghegan
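PS: For anybody following along at home, something as simple as this is enough to see the index page accesses I'm talking about. The table here is the hypothetical one from the earlier "middle case" sketch (any indexed table will do), and forcing the bitmap plan is purely for illustration:

-- EXPLAIN (ANALYZE, BUFFERS) reports buffer hits/reads per plan node; with a
-- Bitmap Index Scan the index page accesses show up on their own node
SET enable_indexscan = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM middling WHERE k = ANY ('{1000, 250000, 500000, 750000}');
RESET enable_indexscan;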
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Tue, Nov 28, 2023 at 3:52 PM Peter Geoghegan <pg@bowt.ie> wrote: > While I still think that we need heuristics that apply speculative > criteria to decide whether or not going to the next page directly > (when we have a choice at all), that doesn't mean that the v7 > heuristics can't be improved on, with a little more thought. It's a > bit tricky, since we're probably also benefiting from the same > heuristics all the time -- probably even for this same test case. Correction: this particular test case happens to be one where the optimal strategy is to do *exactly* what the master branch does currently. The master branch is unbeatable, so the only reasonable goal for the patch is to not lose (or to lose by no more than a completely negligible amount). I'm now prepared to say that this behavior is not okay -- I definitely need to fix this. It's a bug. Because each distinct value never fits on one leaf page (it's more like 1.5 - 2 pages, even though we're deduplicating heavily), and because Postgres 12 optimizations are so effective with low cardinality/posting-list-heavy indexes such as this, we're bound to lose quite often. The only reason it doesn't happen _every single time_ we descend the index is because the test script uses CREATE INDEX, rather than using retail inserts (I tend to prefer the latter for this sort of analysis). Since nbtsort.c isn't as clever/aggressive about suffix truncation as the nbtsplitloc.c split strategies would have been, had we used them (had there been retail inserts), many individual leaf pages are left with high keys that aren't particularly good targets for the high key precheck optimization (see Postgres 12 commit 29b64d1d). If I wanted to produce a truly adversarial case for this issue (which this is already close to), I'd go with the following: 1. Retail inserts that leave each leaf page full of one single value, which will allow each high key to still make a "clean break" from the right sibling page -- it'll have the right sibling's value. Maybe insert 1200 - 1300 tuples per distinct index value for this. In other words, bulk loading that results in an index that never has to append a heap TID tiebreaker during suffix truncation, but comes very close to needing to. Bulk loading where nbtsplitloc.c needs to use SPLIT_MANY_DUPLICATES all the time, but never quite gets to the point of needing a SPLIT_SINGLE_VALUE split. 2. A SAOP query with an array with every second value in the index as an element. Something like "WHERE arr in (2, 4, 6, 8, ...)". The patch will read every single leaf page, whereas master will *reliably* only read every second leaf page. I didn't need to "trick" the patch in a contrived sort of way to get this bad outcome -- this scenario is fairly realistic. So this behavior is definitely not something that I'm prepared to defend. As I said, it's a bug. It'll be fixed in the next revision. -- Peter Geoghegan
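PS: Spelled out, the adversarial case looks roughly like this. The exact numbers are invented (anything in the 1200 - 1300 tuples-per-value range described above should behave similarly), and because the index already exists, the INSERT goes through ordinary retail index insertions rather than CREATE INDEX's bulk build. In reality the physical insert order matters, so treat this as a sketch rather than an exact reproducer:

-- Retail inserts, ~1250 tuples per distinct value, so that each value comes
-- close to filling a leaf page without ever forcing a SPLIT_SINGLE_VALUE split
CREATE TABLE adversarial (arr int4);
CREATE INDEX ON adversarial (arr);
INSERT INTO adversarial
SELECT val FROM generate_series(1, 1000) val, generate_series(1, 1250) dup;

-- An array with every second value in the index: master reliably reads only
-- every other leaf page, whereas the v7 patch reads every single leaf page
SELECT count(*) FROM adversarial
WHERE arr = ANY (ARRAY(SELECT generate_series(2, 1000, 2)));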
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Mon, Nov 27, 2023 at 5:39 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > - +1 on the general idea. Hard to see any downsides if implemented right. Glad you think so. The "no possible downside" perspective is one that the planner sort of relies on, so this isn't just a nice-to-have -- it might actually be a condition of committing the patch. It's important that the planner can be very aggressive about using SAOP index quals, without us suffering any real downside at execution time. > - This changes the meaning of amsearcharray==true to mean that the > ordering is preserved with ScalarArrayOps, right? You change B-tree to > make that true, but what about any out-of-tree index AM extensions? I > don't know if any such extensions exist, and I don't think we should > jump through any hoops to preserve backwards compatibility here, but > probably deserves a notice in the release notes if nothing else. My interpretation is that the planner changes affect amcanorder + amsearcharray index AMs, but have no impact on mere amsearcharray index AMs. If anything this is a step *away* from knowing about nbtree implementation details in the planner (though the planner's definition of amcanorder is very close to the behavior from nbtree, down to things like knowing all about nbtree strategy numbers). The planner changes from the patch are all subtractive -- I'm removing kludges that were added by bug fix commits. Things that weren't in the original feature commit at all. I used the term "my interpretation" here because it seems hard to think of this in abstract terms, and to write a compatibility note for this imaginary audience. I'm happy to go along with whatever you want, though. Perhaps you can suggest a wording for this? > - You use the term "primitive index scan" a lot, but it's not clear to > me what it means. Does one ScalarArrayOps turn into one "primitive index > scan"? Or each element in the array turns into a separate primitive > index scan? Or something in between? Maybe add a new section to the > README explain how that works. The term primitive index scan refers to the thing that happens each time _bt_first() is called -- with and without the patch. In other words, it's what happens when pg_stat_all_indexes.idx_scan is incremented. You could argue that that's not quite the right thing to be focussing on, with this new design. But it has precedent going for it. As I said, it's the thing that pg_stat_all_indexes.idx_scan counts, which is a pretty exact and tangible thing. So it's consistent with historical practice, but also with what other index AMs do when executing ScalarArrayOps non-natively. > - _bt_preprocess_array_keys() is called for each btrescan(). It performs > a lot of work like cmp function lookups and desconstructing and merging > the arrays, even if none of the SAOP keys change in the rescan. That > could make queries with nested loop joins like this slower than before: > "select * from generate_series(1, 50) g, tenk1 WHERE g = tenk1.unique1 > and tenk1.two IN (1,2);". But that's nothing new. _bt_preprocess_array_keys() isn't a new function, and the way that it's called isn't new in any way. That said, I certainly agree that we should be worried about any added overhead in _bt_first for nested loop joins with an inner index scan. In my experience that can be an important issue. I actually have a TODO item about this already. It needs to be included in my work on performance validation, on general principle. > - nbtutils.c is pretty large now. 
Perhaps create a new file > nbtpreprocesskeys.c or something? Let me get back to you on this. > - You and Matthias talked about an implicit state machine. I wonder if > this could be refactored to have a more explicit state machine. The > state transitions and interactions between _bt_checkkeys(), > _bt_advance_array_keys() and friends feel complicated. I agree that it's complicated. That's the main problem that the patch has, by far. It used to be even more complicated, but it's hard to see a way to make it a lot simpler at this point. If you can think of a way to simplify it then I'll definitely give it a go. Can you elaborate on "more explicit state machine"? That seems like it could have value by adding more invariants, and making things a bit more explicit in one or two areas. It could also help us to verify that they hold from assertions. But that isn't really the same thing as simplification. I wouldn't use that word, at least. > > + <note> > > + <para> > > + Every time an index is searched, the index's > > + <structname>pg_stat_all_indexes</structname>.<structfield>idx_scan</structfield> > > + field is incremented. This usually happens once per index scan node > > + execution, but might take place several times during execution of a scan > > + that searches for multiple values together. Only queries that use certain > > + <acronym>SQL</acronym> constructs to search for rows matching any value > > + out of a list (or an array) of multiple scalar values are affected. See > > + <xref linkend="functions-comparisons"/> for details. > > + </para> > > + </note> > > + > > Is this true even without this patch? Maybe commit this separately. Yes, it is. The patch doesn't actually change anything in this area. However, something in this area is new: it's a bit weird (but still perfectly consistent and logical) that the count shown by pg_stat_all_indexes.idx_scan will now show a value that is often influenced by low-level implementation details now. Things that are fairly far removed from the SQL query now affect that count -- that part is new. That's what I had in mind when I wrote this, FWIW. > The "Only queries ..." sentence feels difficult. Maybe something like > "For example, queries using IN (...) or = ANY(...) constructs.". I'll get back to you on this. > > * _bt_preprocess_keys treats each primitive scan as an independent piece of > > * work. That structure pushes the responsibility for preprocessing that must > > * work "across array keys" onto us. This division of labor makes sense once > > * you consider that we're typically called no more than once per btrescan, > > * whereas _bt_preprocess_keys is always called once per primitive index scan. > > "That structure ..." is a garden-path sentence. I kept parsing "that > must work" as one unit, the same way as "that structure", and it didn't > make sense. Took me many re-reads to parse it correctly. Now that I get > it, it doesn't bother me anymore, but maybe it could be rephrased. I'll do that ahead of the next revision. > Is there _any_ situation where _bt_preprocess_array_keys() is called > more than once per btrescan? No. Note that we don't know the scan direction within _bt_preprocess_array_keys(). We need a separate function to set up the array keys to their initial positions (this is nothing new). > > /* > > * Look up the appropriate comparison operator in the opfamily. 
> > * > > * Note: it's possible that this would fail, if the opfamily is > > * incomplete, but it seems quite unlikely that an opfamily would omit > > * non-cross-type comparison operators for any datatype that it supports > > * at all. ... > > */ > > I agree that's unlikely. I cannot come up with an example where you > would have cross-type operators between A and B, but no same-type > operators between B and B. For any real-world opfamily, that would be an > omission you'd probably want to fix. > > Still I wonder if we could easily fall back if it doesn't exist? And > maybe add a check in the 'opr_sanity' test for that. I'll see about an opr_sanity test. > In _bt_readpage(): > > if (!so->firstPage && !numArrayKeys && minoff < maxoff) > > { > > It's sad to disable this optimization completely for array keys. It's > actually a regression from current master, isn't it? There's no > fundamental reason we couldn't do it for array keys so I think we should > do it. I'd say whether or not there's any kind of regression in this area is quite ambiguous, though in a way that isn't really worth discussing now. If it makes sense to extend something like this optimization to array keys (or to add a roughly equivalent optimization), then we should do it. Otherwise we shouldn't. Note that the patch actually disables two distinct and independent optimizations when the scan has array keys. Both of these were added by recent commit e0b1ee17, but they are still independent things. They are: 1. This skipping thing inside _bt_readpage, which you've highlighted. This is only applied on the second or subsequent leaf page read by the scan. Right now, in the case of a scan with array keys, that means the second or subsequent page from the current primitive index scan -- which doesn't seem particularly principled to me. I'd need to invent a heuristic that works with my design to adapt the optimization. Plus I'd need to be able to invalidate the precheck whenever the array keys advanced. And I'd probably need a way of guessing whether or not it's likely that the arrays will advance, ahead of time, so that the precheck doesn't almost always go to waste, in a way that just doesn't make sense. Note that all required scan keys are relevant here. I like to think of plain required equality strategy scan keys without any array as "degenerate single value arrays". Something similar can be said of inequality strategy required scan keys (those required in the *current* scan direction), too. So it's not as if I can "just do the precheck stuff for the non-array scan keys". All required scan keys are virtually the same thing as required array-type scan keys -- they can trigger "roll over", affecting array key advancement for the scan keys that are associated with arrays. 2. The optimization that has _bt_checkkeys skip non-required scan keys that are *only* required in the direction *opposite* the current scan direction -- this can work even without any precheck from _bt_readpage. Note that this second optimization relies on various behaviors in _bt_first() that make it impossible for _bt_checkkeys() to ever see a tuple that could fail to satisfy such a scan key -- we must always have passed over non-matching tuples, thanks to _bt_first(). That presents my patch with a problem: the logic for determining whether or not we need a new primitive index scan only promises to never require the scan to grovel through many leaf pages that _bt_first() could and should just skip over instead.
This new logic makes no promises about skipping over small numbers of tuples. So it's possible that _bt_checkkeys() will see a handful of tuples "after the end of the _bt_first-wise primitive index scan", but "before the _bt_first-wise start of the next would-be primitive index scan". Note that this stuff matters even without bringing optimization 2 into it. There are similar considerations for required equality strategy scan keys, which (by definition) must be required in both scan directions. The new mechanism must never act as if it's past the end of matches in the current scan direction, when in fact it's really before the beginning of matches (that would lead to totally ignoring a group of equal matches). The existing _bt_checkkeys() logic can't really tell the difference on its own, since it only has an = operator to work with (well, I guess it knows about this context, since there is a comment about the dependency on _bt_first behaviors in _bt_checkkeys on HEAD -- very old comments). > _bt_checkkeys() is called in an assertion in _bt_readpage, but it has > the side-effect of advancing the array keys. Side-effects from an > assertion seems problematic. I agree that that's a concern, but just to be clear: there are no side-effects presently. You can't mix the array stuff with the optimization stuff. We won't actually call _bt_checkkeys() in an assertion when it can cause side-effects. Assuming that we ultimately conclude that the optimizations *aren't* worth preserving in any form, it might still be worth making it obvious that the assertions have no side-effects. But that question is unsettled right now. Thanks for the review! I'll try to get the next revision out soon. It'll also have bug fixes for mark + restore and for a similar issue seen when the scan changes direction in just the wrong way. (In short, the array key state machine can be confused about scan direction in certain corner cases.) -- Peter Geoghegan
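PS: The pg_stat_all_indexes.idx_scan accounting that came up in the docs discussion is easy enough to observe directly. A rough sketch, reusing the made-up adversarial table/index names from my earlier sketch (note that the statistics can lag slightly behind query execution):

-- idx_scan is incremented once per primitive index scan (once per _bt_first
-- descent), so a single SAOP query can add more than 1 to the counter
SELECT idx_scan FROM pg_stat_user_indexes
WHERE indexrelname = 'adversarial_arr_idx';
SELECT count(*) FROM adversarial WHERE arr = ANY ('{1, 2, 3}');
SELECT idx_scan FROM pg_stat_user_indexes
WHERE indexrelname = 'adversarial_arr_idx';
-- On master the counter typically advances by 3 here (one primitive scan per
-- array value); with the patch it can advance by just 1, since the matching
-- tuples for adjacent values are physically close together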
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Mon, Dec 4, 2023 at 7:25 PM Peter Geoghegan <pg@bowt.ie> wrote: > 2. The optimization that has _bt_checkkeys skip non-required scan keys > that are *only* required in the direction *opposite* the current scan > direction -- this can work even without any precheck from > _bt_readpage. > > Note that this second optimization relies on various behaviors in > _bt_first() that make it impossible for _bt_checkkeys() to ever see a > tuple that could fail to satisfy such a scan key -- we must always > have passed over non-matching tuples, thanks to _bt_first(). That > prevents my patch with a problem: the logic for determining whether or > not we need a new primitive index scan only promises to never require > the scan to grovel through many leaf pages that _bt_first() could and > should just skip over instead. This new logic makes no promises about > skipping over small numbers of tuples. So it's possible that > _bt_checkkeys() will see a handful of tuples "after the end of the > _bt_first-wise primitive index scan", but "before the _bt_first-wise > start of the next would-be primitive index scan". BTW, I have my doubts about this actually being correct without the patch. The following comment block appears above _bt_preprocess_keys: * Note that one reason we need direction-sensitive required-key flags is * precisely that we may not be able to eliminate redundant keys. Suppose * we have "x > 4::int AND x > 10::bigint", and we are unable to determine * which key is more restrictive for lack of a suitable cross-type operator. * _bt_first will arbitrarily pick one of the keys to do the initial * positioning with. If it picks x > 4, then the x > 10 condition will fail * until we reach index entries > 10; but we can't stop the scan just because * x > 10 is failing. On the other hand, if we are scanning backwards, then * failure of either key is indeed enough to stop the scan. (In general, when * inequality keys are present, the initial-positioning code only promises to * position before the first possible match, not exactly at the first match, * for a forward scan; or after the last match for a backward scan.) As I understand it, this might still be okay, because the optimization in question from Alexander's commit e0b1ee17 (what I've called optimization 2) is careful about NULLs, which were the one case that definitely had problems. Note that IS NOT NULL works kind of like WHERE foo < NULL here (see old bug fix commit 882368e8, "Fix btree stop-at-nulls logic properly", for more context on this NULLs behavior). In any case, my patch isn't compatible with "optimization 2" (as in my tests break in a rather obvious way) due to a behavior that these old comments claim is normal within any scan (or perhaps normal in any scan with scan keys that couldn't be deemed redundant due to a lack of cross-type support in the opfamily). Something has to be wrong here -- could just be the comment, I suppose. But I find it easy to believe that Alexander's commit e0b1ee17 might not have been properly tested with opfamilies that lack a suitable cross-type operator. That's a pretty niche thing. (My patch doesn't need that niche thing to be present to easily break when combined with "optimization 2", which could hint at an existing and far more subtle problem.) -- Peter Geoghegan
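PS: For the record, the scenario that comment describes looks like this when written as a query (t and x are hypothetical). Note that you won't actually hit it with the core integer_ops opfamily, which has complete cross-type support, so preprocessing can prove one of the keys redundant; you'd need an extension opfamily that's missing the relevant cross-type comparison:

-- Neither key can be proven redundant without a suitable cross-type
-- comparison, so _bt_first picks one arbitrarily and the other may keep
-- "failing" until the scan reaches entries > 10
SELECT * FROM t WHERE x > 4::int4 AND x > 10::int8;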
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Mon, Nov 20, 2023 at 6:52 PM Peter Geoghegan <pg@bowt.ie> wrote: > It should be noted that the patch isn't strictly guaranteed to always > read fewer index pages than master, for a given query plan and index. > This is by design. Though the patch comes close, it's not quite a > certainty. There are known cases where the patch reads the occasional > extra page (relative to what master would have done under the same > circumstances). These are cases where the implementation just cannot > know for sure whether the next/sibling leaf page has key space covered > by any of the scan's array keys (at least not in a way that seems > practical). The implementation has simple heuristics that infer (a > polite word for "make an educated guess") about what will be found on > the next page. Theoretically we could be more conservative in how we > go about this, but that seems like a bad idea to me. It's really easy > to find cases where the maximally conservative approach loses by a > lot, and really hard to show cases where it wins at all. Attached is v8, which pretty much rips all of this stuff out. I definitely had a point when I said that it made sense to be optimistic about finding matches on the next page in respect of any truncated -inf attributes in high keys, though. And so we still do that much in v8. But, there is no reason why we need to go any further than that -- there is no reason why we should *also* be optimistic about *untruncated* high key/finaltup attributes that *aren't* exact matches for any of the scan's required array keys finding exact matches once we move onto the next sibling page. I reached this conclusion when working on a fix for the low cardinality index regression that Tomas' tests brought to my attention [1]. I started out with the intention of just fixing that one case, in a very targeted way, but quickly realized that it made almost no sense to just limit myself to the low cardinality cases. Even with Tomas' problematic low cardinality test cases, I saw untruncated high key attributes that were "close by to matching tuples" -- they just weren't close enough (i.e. exactly on the next leaf page) for us to actually win (so v7 lost). Being almost correct again and again, but still losing again and again is a good sign that certain basic assumptions were faulty (at least if it's realistic, which it was in this instance). To be clear, and to repeat, even in v8 we'll still "make guesses" about -inf truncated attributes. But it's a much more limited form of guessing. If we didn't at least do the -inf thing, then backwards scans would weirdly work better than forward scans in some cases -- a particular concern with queries that have index quals for each of multiple columns. I don't think that this remaining "speculative" behavior needs to be discussed at very much length in code comments, though. That's why v8 is a great deal simpler than v7 was here. No more huge comment block at the end of the new _bt_checkkeys. Notable stuff that *hasn't* changed from v7: I'm posting this v8 having not yet worked through all of Heikki's feedback. In particular, v8 doesn't deal with the relatively hard question of what to do about the optimizations added by Alexander Korotkov's commit e0b1ee17 (should I keep them disabled, selectively re-enable one or both optimizations, or something else?). This is partly due to at least one of the optimizations having problems of their own that are still outstanding [2]. 
I also wanted to get a revision out before travelling to Prague for PGConf.EU, which will be followed by other Christmas travel. That's likely to keep me away from the patch for weeks (that, plus I'm moving to NYC in early January). So I just ran out of time to do absolutely everything. Other notable changes: * Bug fixes for cases where the array state machine gets confused by a change in the scan direction, plus similar cases involving mark and restore processing. I'm not entirely happy with my approach here (mostly referring to the new code in _bt_steppage here). Feels like it needs to be a little better integrated with mark/restore processing. No doubt Heikki will have his own doubts about this. I've included my test cases for the issues in this area. The problems are really hard to reproduce, and writing these tests took a surprisingly large amount of effort. The tests might not be suitable for commit, but you really need to see the test cases to be able to review the code efficiently. It's just fiddly. * I've managed to make the array state machine just a little more streamlined compared to v7. Minor code polishing, not really worth describing in detail. * Addressed concerns about incomplete opfamilies not working with the patch by updating the error message within _bt_preprocess_array_keys/_bt_sort_array_cmp_setup. It now exactly matches the similar one in _bt_first. I don't think that we need any new opr_sanity tests, since we already have one for this. Making the error message match the one in _bt_first ensures that anybody that runs into a problem here will see the same error message that they'd have seen on earlier versions, anyway. It's a more useful error compared to the one from v7 (in that it names the index and its attribute directly). Plus it's good to be consistent. I don't see any potential for the underlying _bt_sort_array_cmp_setup behavior to be seen as a regression, in terms of our ability to cope with incomplete opfamilies (compared to earlier Postgres versions). Opfamilies that lack a cross-type ORDER proc mixed with queries that use the corresponding cross-type = operator were always very dicey. That situation isn't meaningfully different with the patch. (Actually, this isn't 100% true in the case of queries + indexes with non-required arrays specifically -- they'll need a 3-way comparator in _bt_preprocess_array_keys/_bt_sort_array_cmp_setup, and yet *won't* need one moments later, in _bt_first. This is because non-required scan keys don't end up in _bt_first's insertion scan key at all, in general. This distinction just seems pedantic, though. We're talking about a case where things accidentally failed to fail in previous versions, for some queries but not most queries. Now it'll fail in exactly the same way in slightly more cases. In reality, affected opfamilies are practically non-existent, so this is a hypothetical upon a hypothetical.) [1] https://postgr.es/m/CAH2-WzmtV7XEWxf_rP1pw=vyDjGLi__zGOy6Me5MovR3e1kfdg@mail.gmail.com [2] https://postgr.es/m/CAH2-Wzn0LeLcb1PdBnK0xisz8NpHkxRrMr3NWJ+KOK-WZ+QtTQ@mail.gmail.com -- Peter Geoghegan
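PS: To give a flavor of the class of cases that the direction-change fixes and new tests target (an illustrative sketch only, with made-up names -- the actual regression tests have to be far more delicate to hit the exact window):

BEGIN;
DECLARE c SCROLL CURSOR FOR
  SELECT k FROM middling WHERE k = ANY ('{100, 200, 300}') ORDER BY k;
FETCH FORWARD 2 FROM c;    -- advances the array key state machine forwards
FETCH BACKWARD 1 FROM c;   -- scan direction flips; the array keys must back up
FETCH FORWARD 2 FROM c;    -- and then advance again, without skipping matches
COMMIT;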
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Sat, Dec 9, 2023 at 10:38 AM Peter Geoghegan <pg@bowt.ie> wrote: > Attached is v8, which pretty much rips all of this stuff out. Attached is v9, which I'm posting just to fix bitrot. The patch stopped cleanly applying against HEAD due to recent bugfix commit 7e6fb5da. No real changes here compared to v8. -- Peter Geoghegan
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Matthias van de Meent
Date:
On Thu, 28 Dec 2023 at 18:28, Peter Geoghegan <pg@bowt.ie> wrote: > > On Sat, Dec 9, 2023 at 10:38 AM Peter Geoghegan <pg@bowt.ie> wrote: > > Attached is v8, which pretty much rips all of this stuff out. > > Attached is v9, which I'm posting just to fix bitrot. The patch > stopped cleanly applying against HEAD due to recent bugfix commit > 7e6fb5da. No real changes here compared to v8. I found 2 major issues; one correctness issue in the arraykey processing code of _bt_preprocess_array_keys, and one assertion error in _bt_advance_array_keys; both discussed in the relevant sections below. > Subject: [PATCH v9] Enhance nbtree ScalarArrayOp execution. > [...] > Bugfix commit 807a40c5 taught the planner to avoid generating unsafe > path keys: path keys on a multicolumn index path, with a SAOP clause on > any attribute beyond the first/most significant attribute. These cases > are now all safe, so we go back to generating path keys without regard > for the presence of SAOP clauses (just like with any other clause type). > Also undo changes from follow-up bugfix commit a4523c5a, which taught > the planner to produce alternative index paths without any low-order > ScalarArrayOpExpr quals (making the SAOP quals into filter quals). > We'll no longer generate these alternative paths, which can no longer > offer any advantage over the index qual paths that we do still generate. Can you pull these planner changes into their own commit(s)? As mentioned upthread, it's a significant change in behavior that should have separate consideration and reference in the commit log. I really don't think it should be buried in the 5th paragraph of an "Enhance nbtree ScalarArrayOp execution" commit. Additionally, the changes of btree are arguably independent of the planner changes, as the btree changes improve performance even if we ignore that it implements strict result ordering. An independent thought when reviewing the finaltup / HIKEY scan optimization parts of this patch: The 'use highkey to check for next page's data range' optimization is useful, but can currently only be used for scans that go to the right. Could we use a similar optimization in the inverse direction if we marked the right page of a split with "the left page has a HIKEY based on this page's (un)truncated leftmost value" or "left page's HIKEY was truncated to N key attributes"? It'd give one bit of information, specifically that it does (or does not) share some (or all) key attributes with this page's minimum value, which allows for earlier scan termination in boundary cases. > +++ b/src/include/access/nbtree.h > +#define SK_BT_RDDNARRAY 0x00040000 /* redundant in array preprocessing */ I think "made redundant" is more appropriate than just "redundant"; the array key is not itself redundant in array preprocessing (indeed we actually work hard on that key during preprocessing to allow us to mark it redundant) > +++ b/src/backend/access/nbtree/nbtsearch.c > * We skip this for the first page in the scan to evade the possible > - * slowdown of the point queries. > + * slowdown of point queries. Never apply the optimization with a scan > + * that uses array keys, either, since that breaks certain assumptions. > + * (Our search-type scan keys change whenever _bt_checkkeys advances the > + * arrays, invalidating any precheck. Tracking all that would be tricky.) I think this was mentioned before, but can't we have an argument or version of _bt_checkkeys that makes it not advance the array keys, so that we can keep this optimization? 
Or, isn't that now _bt_check_compare? For an orderlines table with 1000s of lines per order, we would greatly benefit from a query `order_id = ANY ('{1, 3, 5}')` that handles scan keys more efficiently; the optimization which is being disabled here. > +++ b/src/backend/access/nbtree/nbtutils.c > _bt_preprocess_array_keys(IndexScanDesc scan) The _bt_preprocess_array_keys code is now broken due to type confusion, as it assumes there is only one array subtype being compared per attribute in so.orderProcs. Reproducer: CREATE TABLE test AS SELECT generate_series(1, 10000, 1::bigint) num; CREATE INDEX ON test (num); /* bigint typed */ SELECT num FROM test WHERE num = ANY ('{1}'::smallint[]) AND num = ANY ('{1}'::int[]) /* ensure matching lastEqualityArrayAtt, lastOrderProc for next qual AND num = ANY ('{65537}'::int[]); /* qual is broken due to application of smallint compare operator on int values that do equal mod 2^16, but do not equal in their own type */ num ----- 1 I'm also concerned about the implications of this in _bt_binsrch_array_skey, as this too assumes the same compare operator is always used for all array operations on each attribute. We may need one so->orderProcs entry for each array key, but at least one per sk_subtype of each array key. I also notice that the merging of values doesn't seem to be applied optimally with mixed typed array operations: num = int[] AND num = bigint[] AND num = int[] doesn't seem to merge the first and last array ops. I'm also concerned about being (un)able to merge > +/* > + * _bt_merge_arrays() -- merge together duplicate array keys > + * > + * Both scan keys have array elements that have already been sorted and > + * deduplicated. > + */ As I mentioned upthread, I find this function to be very wasteful, as it uses N binary searches to merge join two already sorted arrays, resulting in a O(n log(m)) complexity, whereas a normal merge join should be O(n + m) once the input datasets are sorted. Please fix this, as it shows up in profiling of large array merges. Additionally, as it merges two arrays of unique items into one, storing only matching entries, I feel that it is quite wasteful to do this additional allocation here. Why not reuse the original allocation immediately? > +_bt_tuple_before_array_skeys(IndexScanDesc scan, BTReadPageState *pstate, > + IndexTuple tuple, int sktrig, bool validtrig) I don't quite understand what the 'validtrig' argument is used for. There is an assertion that it is false under some conditions in this code, but it's not at all clear to me why that would have to be the case - it is called with `true` in one of the three call sites. Could the meaning of this be clarified? I also feel that this patch includes several optimizations such as this sktrig argument which aren't easy to understand. Could you pull that into a separately reviewable patch? Additionally, could you try to create a single point of entry for the array key stuff that covers the new systems? I've been trying to wrap my head around this, and it's taking a lot of effort. > _bt_advance_array_keys Thinking about the implementation here: We require transitivity for btree opclasses, where A < B implies NOT A = B, etc. Does this also carry over into cross-type operators? E.g. a type 'truncatedint4' that compares only the highest 16 bits of an integer would be strictly sorted, and could compare 0::truncatedint4 = 0::int4 as true, as well as 0::truncatedint4 = 2::int4, while 0::int4 = 2::int4 is false. 
Would it be valid to add support methods for truncatedint4 to an int4 btree opclass, or is transitivity also required for all operations? i.e. all values that one operator class considers unique within an opfamily must be considered unique for all additional operators in the opfamily, or is that not required? If not, then that would pose problems for this patch, as the ordering of A = ANY ('{1, 2, 3}'::int4[]) AND A = ANY ('{0,65536}'::truncatedint4[]) could potentially skip results. I'm also no fan of the (tail) recursion. I would agree that this is unlikely to consume a lot of stack, but it does consume stackspace nonetheless, and I'd prefer if it was not done this way. I notice an assertion error here: > + Assert(cur->sk_strategy != BTEqualStrategyNumber); > + Assert(all_required_sk_satisfied); > + Assert(!foundRequiredOppositeDirOnly); > + > + foundRequiredOppositeDirOnly = true; This assertion can be hit with the following test case: CREATE TABLE test AS SELECT i a, i b, i c FROM generate_series(1, 1000) i; CREATE INDEX ON test(a, b, c); ANALYZE; SELECT count(*) FROM test WHERE a = ANY ('{1,2,3}') AND b > 1 AND c > 1 AND b = ANY ('{1,2,3}'); > +_bt_update_keys_with_arraykeys(IndexScanDesc scan) I keep getting confused by the mixing of integer increments and pointer increments. Could you explain why in this code you chose to increment a pointer for "ScanKey cur", while using arrray indexing for other fields? It feels very arbitrary to me, and that makes the code difficult to follow. > +++ b/src/test/regress/sql/btree_index.sql > +-- Add tests to give coverage of various subtle issues. > +-- > +-- XXX This may not be suitable for commit, due to taking up too many cycles. > +-- > +-- Here we don't remember the scan's array keys before processing a page, only > +-- after processing a page (which is implicit, it's just the scan's current > +-- keys). So when we move the scan backwards we think that the top-level scan > +-- should terminate, when in reality it should jump backwards to the leaf page > +-- that we last visited. I notice this adds a complex test case that outputs many rows. Can we do with less rows if we build the index after data insertion, and with a lower (non-default) fillfactor? Note: I did not yet do any in-depth review of the planner changes in indxpath.c/selfuncs.c. Kind regards, Matthias van de Meent Neon (https://neon.tech)
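PS: To make the mixed-subtype merging point concrete, reusing the test table from the type-confusion reproducer above -- ideally the first and last arrays, which are both int4[] quals on num, would be recognized as duplicates of one another and merged (down to '{3}'::int[]) during preprocessing:

SELECT num FROM test
WHERE num = ANY ('{1,2,3}'::int[])
  AND num = ANY ('{2,3,4}'::bigint[])
  AND num = ANY ('{3,4,5}'::int[]);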
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Mon, Jan 15, 2024 at 2:32 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > Can you pull these planner changes into their own commit(s)? > As mentioned upthread, it's a significant change in behavior that > should have separate consideration and reference in the commit log. I > really don't think it should be buried in the 5th paragraph of an > "Enhance nbtree ScalarArrayOp execution" commit. Additionally, the > changes of btree are arguably independent of the planner changes, as > the btree changes improve performance even if we ignore that it > implements strict result ordering. I'm not going to break out the planner changes, because they're *not* independent in any way. You could say the same thing about practically any work that changes the planner. They're "buried" in the 5th paragraph of the commit message. If an interested party can't even read that far to gain some understanding of a legitimately complicated piece of work such as this, I'm okay with that. > An independent thought when reviewing the finaltup / HIKEY scan > optimization parts of this patch: > The 'use highkey to check for next page's data range' optimization is > useful, but can currently only be used for scans that go to the right. > Could we use a similar optimization in the inverse direction if we > marked the right page of a split with "the left page has a HIKEY based > on this page's (un)truncated leftmost value" or "left page's HIKEY was > truncated to N key attributes"? It'd give one bit of information, > specifically that it does (or does not) share some (or all) key > attributes with this page's minimum value, which allows for earlier > scan termination in boundary cases. That would have to be maintained in all sorts of different places in nbtree. And could be broken at any time by somebody inserting a non-pivot tuple before every existing one on the leaf page. Doesn't seem worth it to me. If we were to do something like this then it would be discussed as independent work. It's akin to adding a low key, which could be used in several different places. As you say, it's totally independent. > > +++ b/src/include/access/nbtree.h > > +#define SK_BT_RDDNARRAY 0x00040000 /* redundant in array preprocessing */ > > I think "made redundant" is more appropriate than just "redundant"; > the array key is not itself redundant in array preprocessing (indeed > we actually work hard on that key during preprocessing to allow us to > mark it redundant) Meh. I did it that way to fit under 78 chars while staying on the same line. I don't think that it matters. > I think this was mentioned before, but can't we have an argument or > version of _bt_checkkeys that makes it not advance the array keys, so > that we can keep this optimization? Or, isn't that now > _bt_check_compare? As I said to Heikki, thinking about this some more is on my todo list. I mean the way that this worked had substantial revisions on HEAD right before I posted v9. v9 was only to fix the bit rot that that caused. > For an orderlines table with 1000s of lines per order, we would > greatly benefit from a query `order_id = ANY ('{1, 3, 5}')` that > handles scan keys more efficiently; the optimization which is being > disabled here. The way that these optimizations might work with the mechanism from the patch isn't some kind of natural extension to what's there already. We'll need new heuristics to not waste cycles. Applying all of the optimizations together just isn't trivial, and it's not yet clear what really makes sense. 
Combining the two optimizations more or less adds another dimension of complexity. > > +++ b/src/backend/access/nbtree/nbtutils.c > > _bt_preprocess_array_keys(IndexScanDesc scan) > The _bt_preprocess_array_keys code is now broken due to type > confusion, as it assumes there is only one array subtype being > compared per attribute in so.orderProcs. I've been aware of this for some time, but didn't think that it was worth bringing up before I had a solution to present... > I'm also concerned about the implications of this in > _bt_binsrch_array_skey, as this too assumes the same compare operator > is always used for all array operations on each attribute. We may need > one so->orderProcs entry for each array key, but at least one per > sk_subtype of each array key. ...since right now it's convenient to make so->orderProcs have a 1:1 correspondence with index key columns.... > I also notice that the merging of values doesn't seem to be applied > optimally with mixed typed array operations: num = int[] AND num = > bigint[] AND num = int[] doesn't seem to merge the first and last > array ops. I'm also concerned about being (un)able to merge ...which ought to work and be robust, once the cross-type support is in place. That is, it should work once we really can assume that there really must be exactly one so->orderProcs entry per equality-strategy scan key in all cases -- including in highly obscure corner cases involving a mix of cross-type comparisons that are also redundant. (Though as I go into below, I can see at least 2 viable solutions to the problem.) > > +/* > > + * _bt_merge_arrays() -- merge together duplicate array keys > > + * > > + * Both scan keys have array elements that have already been sorted and > > + * deduplicated. > > + */ > > As I mentioned upthread, I find this function to be very wasteful, as > it uses N binary searches to merge join two already sorted arrays, > resulting in a O(n log(m)) complexity, whereas a normal merge join > should be O(n + m) once the input datasets are sorted. And as I mentioned upthread, I think that you're making a mountain out of a molehill here. This is not a merge join. Even single digit thousand of array elements should be considered huge. Plus this code path is only hit with poorly written queries. > Please fix this, as it shows up in profiling of large array merges. > Additionally, as it merges two arrays of unique items into one, > storing only matching entries, I feel that it is quite wasteful to do > this additional allocation here. Why not reuse the original allocation > immediately? Because that's adding more complexity for a code path that will hardly ever be used in practice. For something that happens once, during a preprocessing step. > > +_bt_tuple_before_array_skeys(IndexScanDesc scan, BTReadPageState *pstate, > > + IndexTuple tuple, int sktrig, bool validtrig) > > I don't quite understand what the 'validtrig' argument is used for. > There is an assertion that it is false under some conditions in this > code, but it's not at all clear to me why that would have to be the > case - it is called with `true` in one of the three call sites. Could > the meaning of this be clarified? Sure, I'll add a comment. > I also feel that this patch includes several optimizations such as > this sktrig argument which aren't easy to understand. Could you pull > that into a separately reviewable patch? 
It probably makes sense to add the extra preprocessing stuff out into its own commit, since I tend to agree that that's an optimization that can be treated as unrelated (and isn't essential to the main thrust of the patch). However, the sktrig thing isn't really like that. We need to do things that way for required inequality scan keys. It doesn't make sense to not just do it for all required scan keys (both equality and inequality strategy scan keys) right from the start. > Additionally, could you try to create a single point of entry for the > array key stuff that covers the new systems? I've been trying to wrap > my head around this, and it's taking a lot of effort. I don't understand what you mean here. > > _bt_advance_array_keys > > Thinking about the implementation here: > We require transitivity for btree opclasses, where A < B implies NOT A > = B, etc. Does this also carry over into cross-type operators? Yes, it carries like that. > Would it be valid to add support methods for truncatedint4 to an int4 > btree opclass, or is transitivity also required for all operations? > i.e. all values that one operator class considers unique within an > opfamily must be considered unique for all additional operators in the > opfamily, or is that not required? > If not, then that would pose problems for this patch, as the ordering > of A = ANY ('{1, 2, 3}'::int4[]) AND A = ANY > ('{0,65536}'::truncatedint4[]) could potentially skip results. There are roughly two ways to deal with this sort of thing (which sounds like a restatement of the issue shown by your test case?). They are: 1. Add support for detecting redundant scan keys, even when cross-type operators are involved. (Discussed earlier.) 2. Be prepared to have more than one scan key per index key column in rare edge-cases. This means that I can no longer subscript so->orderProcs using an attnum. I intend to use whichever approach works out to be the simplest and most maintainable. Note that there is usually nothing that stops me from having redundant scan keys if it can't be avoided via preprocessing -- just like today. Actually...I might have to use a hybrid of 1 and 2. Because we need to be prepared for the possibility that it just isn't possible to determine redundancy at all due to a lack of suitable cross-type operators -- I don't think that it would be okay to just throw an error there (that really would be a regression against Postgres 16). For that reason I will most likely need to find a way for so->orderProcs to be subscripted using something that maps to equality scan keys (rather than having it map to a key column). > I'm also no fan of the (tail) recursion. I would agree that this is > unlikely to consume a lot of stack, but it does consume stackspace > nonetheless, and I'd prefer if it was not done this way. I disagree. The amount of stack space it consumes in the worst case is fixed. > I notice an assertion error here: > > + Assert(cur->sk_strategy != BTEqualStrategyNumber); > > + Assert(all_required_sk_satisfied); > > + Assert(!foundRequiredOppositeDirOnly); > > + > > + foundRequiredOppositeDirOnly = true; > > This assertion can be hit with the following test case: > > CREATE TABLE test AS > SELECT i a, i b, i c FROM generate_series(1, 1000) i; > CREATE INDEX ON test(a, b, c); ANALYZE; > SELECT count(*) FROM test > WHERE a = ANY ('{1,2,3}') AND b > 1 AND c > 1 > AND b = ANY ('{1,2,3}'); Will fix. 
> > +_bt_update_keys_with_arraykeys(IndexScanDesc scan) > > I keep getting confused by the mixing of integer increments and > pointer increments. Could you explain why in this code you chose to > increment a pointer for "ScanKey cur", while using arrray indexing for > other fields? It feels very arbitrary to me, and that makes the code > difficult to follow. Because in one case we follow the convention for scan keys, whereas so->orderProcs is accessed via an attribute number subscript. > > +++ b/src/test/regress/sql/btree_index.sql > > +-- Add tests to give coverage of various subtle issues. > > +-- > > +-- XXX This may not be suitable for commit, due to taking up too many cycles. > > +-- > > +-- Here we don't remember the scan's array keys before processing a page, only > > +-- after processing a page (which is implicit, it's just the scan's current > > +-- keys). So when we move the scan backwards we think that the top-level scan > > +-- should terminate, when in reality it should jump backwards to the leaf page > > +-- that we last visited. > > I notice this adds a complex test case that outputs many rows. Can we > do with less rows if we build the index after data insertion, and with > a lower (non-default) fillfactor? Probably not. It was actually very hard to come up with these test cases, which tickle the implementation in just the right way to demonstrate that the code in places like _bt_steppage() is actually required. It took me a rather long time to just prove that much. Not sure that we really need this. But thought I'd include it for the time being, just so that reviewers could understand those changes. -- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Thu, Dec 28, 2023 at 12:28 PM Peter Geoghegan <pg@bowt.ie> wrote: > Attached is v9, which I'm posting just to fix bitrot. The patch > stopped cleanly applying against HEAD due to recent bugfix commit > 7e6fb5da. No real changes here compared to v8. Attached is v10, which is another revision mostly just intended to fix bitrot, this time from changes to selfuncs.c on HEAD. v10 also fixes the assertion failure reported by Matthias in passing, just because it was easy to include. No changes for the other open items, though. -- Peter Geoghegan
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Matthias van de Meent
Date:
On Tue, 16 Jan 2024 at 03:03, Peter Geoghegan <pg@bowt.ie> wrote: > > On Mon, Jan 15, 2024 at 2:32 PM Matthias van de Meent > <boekewurm+postgres@gmail.com> wrote: > > Can you pull these planner changes into their own commit(s)? > > As mentioned upthread, it's a significant change in behavior that > > should have separate consideration and reference in the commit log. I > > really don't think it should be buried in the 5th paragraph of an > > "Enhance nbtree ScalarArrayOp execution" commit. Additionally, the > > changes of btree are arguably independent of the planner changes, as > > the btree changes improve performance even if we ignore that it > > implements strict result ordering. > > I'm not going to break out the planner changes, because they're *not* > independent in any way. The planner changes depend on the btree changes, that I agree with. However, I don't think that the btree changes depend on the planner changes. > You could say the same thing about practically > any work that changes the planner. They're "buried" in the 5th > paragraph of the commit message. If an interested party can't even > read that far to gain some understanding of a legitimately complicated > piece of work such as this, I'm okay with that. I would agree with you if this was about new APIs and features, but here existing APIs are being repurposed without changing them. A maintainer of index AMs would not look at the commit title 'Enhance nbtree ScalarArrayOp execution' and think "oh, now I have to make sure my existing amsearcharray+amcanorder handling actually supports non-prefix arrays keys and return data in index order". There are also no changes in amapi.h that would signal any index AM author that expectations have changed. I really don't think you can just ignore all that, and I believe this to also be the basis of Heikki's first comment. > As I said to Heikki, thinking about this some more is on my todo list. > I mean the way that this worked had substantial revisions on HEAD > right before I posted v9. v9 was only to fix the bit rot that that > caused. Then I'll be waiting for that TODO item to be resolved. > > > +++ b/src/backend/access/nbtree/nbtutils.c > > > +/* > > > + * _bt_merge_arrays() -- merge together duplicate array keys > > > + * > > > + * Both scan keys have array elements that have already been sorted and > > > + * deduplicated. > > > + */ > > > > As I mentioned upthread, I find this function to be very wasteful, as > > it uses N binary searches to merge join two already sorted arrays, > > resulting in a O(n log(m)) complexity, whereas a normal merge join > > should be O(n + m) once the input datasets are sorted. > > And as I mentioned upthread, I think that you're making a mountain out > of a molehill here. This is not a merge join. We're merging two sorted sets and want to retain only the matching entries, while retaining the order of the data. AFAIK the best algorithm available for this is a sort-merge join. Sure, it isn't a MergeJoin plan node, but that's not what I was trying to argue. > Even single digit > thousand of array elements should be considered huge. Plus this code > path is only hit with poorly written queries. Would you accept suggestions? > > Please fix this, as it shows up in profiling of large array merges. > > Additionally, as it merges two arrays of unique items into one, > > storing only matching entries, I feel that it is quite wasteful to do > > this additional allocation here. Why not reuse the original allocation > > immediately? 
> > Because that's adding more complexity for a code path that will hardly > ever be used in practice. For something that happens once, during a > preprocessing step. Or many times, when we're in a parameterized loop, as was also pointed out by Heikki. While I do think it is rare, the existence of this path that merges these arrays implies the need for merging these arrays, which thus > > > +_bt_tuple_before_array_skeys(IndexScanDesc scan, BTReadPageState *pstate, > > > + IndexTuple tuple, int sktrig, bool validtrig) > > > > I don't quite understand what the 'validtrig' argument is used for. > > There is an assertion that it is false under some conditions in this > > code, but it's not at all clear to me why that would have to be the > > case - it is called with `true` in one of the three call sites. Could > > the meaning of this be clarified? > > Sure, I'll add a comment. Thanks! > > I also feel that this patch includes several optimizations such as > > this sktrig argument which aren't easy to understand. Could you pull > > that into a separately reviewable patch? > > It probably makes sense to add the extra preprocessing stuff out into > its own commit, since I tend to agree that that's an optimization that > can be treated as unrelated (and isn't essential to the main thrust of > the patch). > > However, the sktrig thing isn't really like that. We need to do things > that way for required inequality scan keys. It doesn't make sense to > not just do it for all required scan keys (both equality and > inequality strategy scan keys) right from the start. I understand that in some places the "triggered by scankey" concept is required, but in other places the use of it is explicitly flagged as an optimization (specifically, in _bt_advance_array_keys), which confused me. > > Additionally, could you try to create a single point of entry for the > > array key stuff that covers the new systems? I've been trying to wrap > > my head around this, and it's taking a lot of effort. > > I don't understand what you mean here. The documentation (currently mostly code comments) is extensive, but still spread around various inline comments and comments on functions; with no place where a clear top-level design is detailed. I'll agree that we don't have that for the general systems in _bt_next() either, but especially with this single large patch it's difficult to grasp the specific differences between the various functions. > > > _bt_advance_array_keys > > > > Thinking about the implementation here: > > We require transitivity for btree opclasses, where A < B implies NOT A > > = B, etc. Does this also carry over into cross-type operators? > > Yes, it carries like that. > > > Would it be valid to add support methods for truncatedint4 to an int4 > > btree opclass, or is transitivity also required for all operations? > > i.e. all values that one operator class considers unique within an > > opfamily must be considered unique for all additional operators in the > > opfamily, or is that not required? > > If not, then that would pose problems for this patch, as the ordering > > of A = ANY ('{1, 2, 3}'::int4[]) AND A = ANY > > ('{0,65536}'::truncatedint4[]) could potentially skip results. > > There are roughly two ways to deal with this sort of thing (which > sounds like a restatement of the issue shown by your test case?). It wasn't really meant as a restatement: It is always unsafe to ignore upper bits, as the index isn't organized by that. 
However, it *could* be safe to ignore the bits with lowest significance, as the index would still be organized correctly even in that case, for int4 at least. Similar to how you can have required scankeys for the prefix of an (int2, int2) index, but not the suffix (as of right now at least). The issue I was trying to show is that if you have a type that ignores some part of the key for comparison like truncatedint4 (which hypothetically would do ((a>>16) < (b>>16)) on int4 types), then this might cause issues if that key was advanced before more precise equality checks. This won't ever be an issue when there is a requirement that if a = b and b = c then a = c must hold for all triples of typed values and operations in a btree opclass, as you mention above. > They > are: > > 1. Add support for detecting redundant scan keys, even when cross-type > operators are involved. (Discussed earlier.) > > 2. Be prepared to have more than one scan key per index key column in > rare edge-cases. This means that I can no longer subscript > so->orderProcs using an attnum. > Actually...I might have to use a hybrid of 1 and 2. Because we need to > be prepared for the possibility that it just isn't possible to > determine redundancy at all due to a lack of suitable cross-type > operators -- I don't think that it would be okay to just throw an > error there (that really would be a regression against Postgres 16). Agreed. > > I'm also no fan of the (tail) recursion. I would agree that this is > > unlikely to consume a lot of stack, but it does consume stackspace > > nonetheless, and I'd prefer if it was not done this way. > > I disagree. The amount of stack space it consumes in the worst case is fixed. Is it fixed? Without digging very deep into it, it looks like it can iterate on the order of n_scankeys deep? The comment above does hint on 2 iterations, but is not very clear about the conditions and why. > > I notice an assertion error here: > > > + Assert(cur->sk_strategy != BTEqualStrategyNumber); > > > + Assert(all_required_sk_satisfied); > > > + Assert(!foundRequiredOppositeDirOnly); > > > + > > > + foundRequiredOppositeDirOnly = true; > > > > This assertion can be hit with the following test case: > > > > CREATE TABLE test AS > > SELECT i a, i b, i c FROM generate_series(1, 1000) i; > > CREATE INDEX ON test(a, b, c); ANALYZE; > > SELECT count(*) FROM test > > WHERE a = ANY ('{1,2,3}') AND b > 1 AND c > 1 > > AND b = ANY ('{1,2,3}'); > > Will fix. > > > > +_bt_update_keys_with_arraykeys(IndexScanDesc scan) > > > > I keep getting confused by the mixing of integer increments and > > pointer increments. Could you explain why in this code you chose to > > increment a pointer for "ScanKey cur", while using arrray indexing for > > other fields? It feels very arbitrary to me, and that makes the code > > difficult to follow. > > Because in one case we follow the convention for scan keys, whereas > so->orderProcs is accessed via an attribute number subscript. Okay, but how about this in _bt_merge_arrays? + Datum *elem = elems_orig + i; I'm not familiar with the scan key convention, as most other places use reference+subscripting. > > > +++ b/src/test/regress/sql/btree_index.sql > > > +-- Add tests to give coverage of various subtle issues. > > > +-- > > > +-- XXX This may not be suitable for commit, due to taking up too many cycles. 
> > > +-- > > > +-- Here we don't remember the scan's array keys before processing a page, only > > > +-- after processing a page (which is implicit, it's just the scan's current > > > +-- keys). So when we move the scan backwards we think that the top-level scan > > > +-- should terminate, when in reality it should jump backwards to the leaf page > > > +-- that we last visited. > > > > I notice this adds a complex test case that outputs many rows. Can we > > do with less rows if we build the index after data insertion, and with > > a lower (non-default) fillfactor? > > Probably not. It was actually very hard to come up with these test > cases, which tickle the implementation in just the right way to > demonstrate that the code in places like _bt_steppage() is actually > required. It took me a rather long time to just prove that much. Not > sure that we really need this. But thought I'd include it for the time > being, just so that reviewers could understand those changes. Makes sense, thanks for the explanation. Kind regards, Matthias van de Meent Neon (https://neon.tech)
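To make the sort-merge point above concrete, here is a minimal sketch of the O(n + m) intersection being suggested, with plain C ints standing in for Datums and the opclass comparator. This is only an illustration, not the patch's _bt_merge_arrays():

/*
 * Sketch of an O(n + m) merge of two sorted, deduplicated array-key lists,
 * keeping only the matching entries and writing them back into the first
 * array.  Illustration only -- the real _bt_merge_arrays() works on Datum
 * arrays and opclass comparison procs, not plain ints.
 */
#include <stdio.h>

/* Intersect sorted, duplicate-free a[] and b[]; result is written into a[] */
static int
merge_intersect(int *a, int na, const int *b, int nb)
{
    int     i = 0,
            j = 0,
            nout = 0;

    while (i < na && j < nb)
    {
        if (a[i] < b[j])
            i++;                /* a[i] has no match in b[] */
        else if (a[i] > b[j])
            j++;                /* b[j] has no match in a[] */
        else
        {
            a[nout++] = a[i];   /* match: keep it, reusing the original array */
            i++;
            j++;
        }
    }
    return nout;                /* number of surviving elements */
}

int
main(void)
{
    int     a[] = {1, 2, 3, 5, 8, 13};
    int     b[] = {2, 3, 4, 8, 21};
    int     n = merge_intersect(a, 6, b, 5);

    for (int k = 0; k < n; k++)
        printf("%d ", a[k]);    /* prints: 2 3 8 */
    printf("\n");
    return 0;
}

Each element of either input is visited at most once, which is where the O(n + m) bound comes from; doing one binary search per element of the first array costs O(n log(m)) comparisons instead.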
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Thu, Jan 18, 2024 at 11:39 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > > I'm not going to break out the planner changes, because they're *not* > > independent in any way. > > The planner changes depend on the btree changes, that I agree with. > However, I don't think that the btree changes depend on the planner > changes. Yeah, technically it would be possible to break out the indxpath.c changes -- that wouldn't leave the tree in a state with visibly-wrong behavior at any point. But that's far from the only thing that matters. If that was the standard that we applied, then I might have to split the patch into 10 or more patches. What it boils down to is this: it is totally natural for me to think of the planner changes as integral to the nbtree changes, so I'm going to include them together in one commit. That's just how the code was written -- I always thought about it as one single thing. The majority of the queries that I've used to promote the patch directly rely on the planner changes in one way or another (even back in v1, when the planner side of things had lots of kludges). I don't necessarily always follow that standard. Sometimes it is useful to further break things up. Like when it happens to make the high-level division of labor a little bit clearer. The example that comes to mind is commit dd299df8 and commit fab25024. These nbtree commits were essentially one piece of work that was broken into two, but committed within minutes of each other, and left the tree in a momentary state that was not-very-useful (though still correct). That made sense to me at the time because the code in question was mechanically distinct, while at the same time modifying some of the same nbtree files. Drawing a line under it seemed to make sense. I will admit that there is some amount of subjective judgement (gasp!) here. Plus I'll acknowledge that it's *slightly* odd that the most compelling cases for the SAOP patch almost all involve savings in heap page accesses, even though it is fundamentally a B-Tree patch. But that's just how it came out. Slightly odd things happen all the time. > I would agree with you if this was about new APIs and features, but > here existing APIs are being repurposed without changing them. A > maintainer of index AMs would not look at the commit title 'Enhance > nbtree ScalarArrayOp execution' and think "oh, now I have to make sure > my existing amsearcharray+amcanorder handling actually supports > non-prefix arrays keys and return data in index order". This is getting ridiculous. It is quite likely that there are exactly zero affected out-of-core index AMs. I don't count pgroonga as a counterexample (I don't think that it actually fulfills the definition of one). Basically, "amcanorder" index AMs more or less promise to be compatible with nbtree, down to having the same strategy numbers. So the idea that I'm going to upset amsearcharray+amcanorder index AM authors is a completely theoretical problem. The planner code evolved with nbtree, hand-in-glove. I'm more than happy to add a reminder to the commit message about adding a pro forma listing to the compatibility section of the Postgres 17 release notes. Can we actually agree on which theoretical index AM types are affected first, though? Isn't it actually amsearcharray+amcanorder+amcanmulticol external index AMs only? Do I have that right? BTW, how should we phrase this compatibility note, so that it can be understood? It's rather awkward. 
> > As I said to Heikki, thinking about this some more is on my todo list. > > I mean the way that this worked had substantial revisions on HEAD > > right before I posted v9. v9 was only to fix the bit rot that that > > caused. > > Then I'll be waiting for that TODO item to be resolved. My TODO item is to resolve the question of whether and to what extent these two optimizations should be combined. It's possible that I'll decide that it isn't worth it, for whatever reason. At this point I'm still totally neutral. > > Even single digit > > thousand of array elements should be considered huge. Plus this code > > path is only hit with poorly written queries. > > Would you accept suggestions? You mean that you want to have a go at it yourself? Sure, go ahead. I cannot promise that I'll accept your suggested revisions, but if they aren't too invasive/complicated compared to what I have now, then I will just accept them. I maintain that this isn't really necessary, but it might be the path of least resistance at this point. > > > Please fix this, as it shows up in profiling of large array merges. > > > Additionally, as it merges two arrays of unique items into one, > > > storing only matching entries, I feel that it is quite wasteful to do > > > this additional allocation here. Why not reuse the original allocation > > > immediately? > > > > Because that's adding more complexity for a code path that will hardly > > ever be used in practice. For something that happens once, during a > > preprocessing step. > > Or many times, when we're in a parameterized loop, as was also pointed > out by Heikki. That's not really what Heikki said. Heikki had a general concern about the startup costs with nestloop joins. > > > Additionally, could you try to create a single point of entry for the > > > array key stuff that covers the new systems? I've been trying to wrap > > > my head around this, and it's taking a lot of effort. > > > > I don't understand what you mean here. > > The documentation (currently mostly code comments) is extensive, but > still spread around various inline comments and comments on functions; > with no place where a clear top-level design is detailed. > I'll agree that we don't have that for the general systems in > _bt_next() either, but especially with this single large patch it's > difficult to grasp the specific differences between the various > functions. Between what functions? The guts of this work are in the new _bt_advance_array_keys and _bt_tuple_before_array_skeys (honorable mention for _bt_array_keys_remain). It seems pretty well localized to me. Granted, there are a few places where we rely on things being set up a certain way by other code that's quite some distance away (from the code doing the depending). For example, the new additions to _bt_preprocess_keys that we need later on, in _bt_advance_array_keys/_bt_update_keys_with_arraykeys. These bits and pieces of code are required, but are not particularly crucial to understanding the general design. For the most part the design is that we cede control of array key advancement to _bt_checkkeys() -- nothing much else changes. There are certainly some tricky details behind the scenes, which we should be verifying via testing and especially via robust invariants (verified with assertions). But this is almost completely hidden from the nbtsearch.c caller -- there are no real changes required there. > It wasn't really meant as a restatement: It is always unsafe to ignore > upper bits, as the index isn't organized by that. 
However, it *could* > be safe to ignore the bits with lowest significance, as the index > would still be organized correctly even in that case, for int4 at > least. Similar to how you can have required scankeys for the prefix of > an (int2, int2) index, but not the suffix (as of right now at least). It isn't safe to assume that types appearing together within an opfamily are binary coercible (or capable of being type-punned like this) in the general case. That often works, but it isn't reliable. Even with integer_ops, it breaks on big-endian machines. > This won't ever be an issue when there is a requirement that if a = b > and b = c then a = c must hold for all triples of typed values and > operations in a btree opclass, as you mention above. Right. It also doesn't matter if there are redundant equality conditions that we cannot formally prove are redundant during preprocessing, for want of appropriate cross-type comparison routines from the opfamily. The actual behavior of index scans (in terms of when and how we skip) won't be affected by that at all. The only problem that it creates is that we'll waste CPU cycles, relative to the case where we can somehow detect this redundancy. In case it wasn't clear before: the optimizations I've added to _bt_preprocess_array_keys are intended to compensate for the pessimizations added to _bt_preprocess_keys (both the optimizations and the pessimizations only affect equality type array scan keys). I don't think that this needs to be perfect; it just seems like a good idea. > > > I'm also no fan of the (tail) recursion. I would agree that this is > > > unlikely to consume a lot of stack, but it does consume stackspace > > > nonetheless, and I'd prefer if it was not done this way. > > > > I disagree. The amount of stack space it consumes in the worst case is fixed. > > Is it fixed? Without digging very deep into it, it looks like it can > iterate on the order of n_scankeys deep? The comment above does hint > on 2 iterations, but is not very clear about the conditions and why. The recursive call to _bt_advance_array_keys is needed to deal with unsatisfied-and-required inequalities that were not detected by an initial call to _bt_check_compare, following a second retry call to _bt_check_compare. We know that this recursive call to _bt_advance_array_keys cannot recurse again because we've already identified which specific inequality scan key is still unsatisfied, following an initial round of array key advancement. We're pretty much instructing _bt_advance_array_keys to perform "beyond_end_advance" type advancement for the specific known unsatisfied inequality scan key (which must be required in the current scan direction, per the assertions for that) here. So clearly it cannot end up recursing any further -- the recursive call is conditioned on "all_arraylike_sk_satisfied && arrays_advanced", but that'll be unset when "ikey == sktrig && !array" from inside the loop. This is a narrow edge-case -- it is an awkward one. Recursion does seem natural here, because we're essentially repeating the call to _bt_advance_array_keys from inside _bt_advance_array_keys, rather than leaving it up to the usual external caller to do later on. If you ripped this code out it would barely be noticeable at all. But it seems worth it so that we can make a uniform promise to always advance the array keys to the maximum extent possible, based on what the caller's tuple tells us about the progress of the scan. 
Since all calls to _bt_advance_array_keys are guaranteed to advance the array keys to the maximum extent that's safely possible (barring this one edge-case with recursive calls), it almost follows that there can't be another recursive call. This is the one edge case where the implementation requires a second pass over the tuple -- we advanced the array keys to all-matching values, but nevertheless couldn't finish there due to the unsatisfied required inequality (which must have become unsatisfied _at the same time_ as some earlier required equality scan key for us to end up requiring a recursive call). > > Because in one case we follow the convention for scan keys, whereas > > so->orderProcs is accessed via an attribute number subscript. > > Okay, but how about this in _bt_merge_arrays? > > + Datum *elem = elems_orig + i; > > I'm not familiar with the scan key convention, as most other places > use reference+subscripting. I meant the convention used in code like _bt_check_compare (which is what we call _bt_checkkeys on HEAD, basically). Note that the _bt_merge_arrays code that you've highlighted isn't iterating through so->keyData[] -- it is iterating through the function caller's elements array, which actually come from so->arrayKeys[]. Like every other Postgres contributor, I do my best to follow the conventions established by existing code. Sometimes that leads to pretty awkward results, where CamelCase and underscore styles are closely mixed together, because it works out to be the most consistent way of doing it overall. -- Peter Geoghegan
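For what it's worth, the hypothetical truncatedint4 operator discussed in this exchange (nothing like it exists in core) can be sketched in a few lines. The point it illustrates is only that such a coarse comparison disagrees with plain int4 equality, which is exactly the kind of disagreement the transitivity requirement for btree opfamilies is meant to rule out:

/*
 * Toy version of the hypothetical "truncatedint4" comparison, which compares
 * int4 values after discarding the low-order 16 bits.  Values that this
 * coarse operator reports as "equal" are not equal under the ordinary int4
 * operator, so mixing the two as if they were interchangeable equalities
 * would be unsafe.
 */
#include <stdio.h>
#include <stdint.h>

static int
truncated_cmp(int32_t a, int32_t b)
{
    int32_t ta = a >> 16,
            tb = b >> 16;

    return (ta > tb) - (ta < tb);
}

int
main(void)
{
    int32_t a = 65537;          /* 0x00010001 */
    int32_t b = 65536;          /* 0x00010000 */

    /* coarse operator says they are equal, int4 says they are not */
    printf("truncated a = b? %d\n", truncated_cmp(a, b) == 0);  /* 1 */
    printf("int4      a = b? %d\n", a == b);                    /* 0 */
    return 0;
}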
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Matthias van de Meent
Date:
On Fri, 19 Jan 2024 at 23:42, Peter Geoghegan <pg@bowt.ie> wrote: > Thank you for your replies so far. > On Thu, Jan 18, 2024 at 11:39 AM Matthias van de Meent > <boekewurm+postgres@gmail.com> wrote: > > I would agree with you if this was about new APIs and features, but > > here existing APIs are being repurposed without changing them. A > > maintainer of index AMs would not look at the commit title 'Enhance > > nbtree ScalarArrayOp execution' and think "oh, now I have to make sure > > my existing amsearcharray+amcanorder handling actually supports > > non-prefix arrays keys and return data in index order". > > This is getting ridiculous. > > It is quite likely that there are exactly zero affected out-of-core > index AMs. I don't count pgroonga as a counterexample (I don't think > that it actually fullfills the definition of a ). Basically, > "amcanorder" index AMs more or less promise to be compatible with > nbtree, down to having the same strategy numbers. So the idea that I'm > going to upset amsearcharray+amcanorder index AM authors is a > completely theoretical problem. The planner code evolved with nbtree, > hand-in-glove. And this is where I disagree with your (percieved) implicit argument that this should be and always stay this way. I don't mind changes in the planner for nbtree per se, but as I've mentioned before in other places, I really don't like how we handle amcanorder as if it were amisbtree. But it's not called that, so we shouldn't expect that to remain the case; and definitely not keep building on those expectations that it always is going to be the nbtree when amcanorder is set (or amsearcharray is set, or ..., or any combination of those that is currently used by btree). By keeping that expectation alive, this becomes a self-fulfilling prophecy, and I really don't like such restrictive self-fulfilling prophecies. It's nice that we have index AM feature flags, but if we only effectively allow one implementation for this set of flags by ignoring potential other users when changing behavior, then what is the API good for (apart from abstraction layering, which is nice)? /rant I'll see about the > I'm more than happy to add a reminder to the commit message about > adding a proforma listing to the compatibility section of the Postgres > 17 release notes. Can we actually agree on which theoretical index AM > types are affected first, though? Isn't it actually > amsearcharray+amcanorder+amcanmulticol external index AMs only? Do I > have that right? I think that may be right, but I could very well have built a btree-lite that doesn't have the historical baggage of having to deal with pages from before v12 (version 4) and some improvements that haven't made it to core yet. > BTW, how should we phrase this compatibility note, so that it can be > understood? It's rather awkward. Something like the following, maybe? """ Compatibility: The planner will now generate paths with array scan keys in any column for ordered indexes, rather than on only a prefix of the index columns. The planner still expects fully ordered data from those indexes. Historically, the btree AM couldn't output correctly ordered data for suffix array scan keys, which was "fixed" by prohibiting the planner from generating array scan keys without an equality prefix scan key up to that attribute. 
In this commit, the issue has been fixed, and the planner restriction has thus been removed as the only in-core IndexAM that reports amcanorder now supports array scan keys on any column regardless of what prefix scan keys it has. """ > > > As I said to Heikki, thinking about this some more is on my todo list. > > > I mean the way that this worked had substantial revisions on HEAD > > > right before I posted v9. v9 was only to fix the bit rot that that > > > caused. > > > > Then I'll be waiting for that TODO item to be resolved. > > My TODO item is to resolve the question of whether and to what extent > these two optimizations should be combined. It's possible that I'll > decide that it isn't worth it, for whatever reason. That's fine, this decision (and any work related to it) is exactly what I was referring to with that mention of the TODO. > > > Even single digit > > > thousand of array elements should be considered huge. Plus this code > > > path is only hit with poorly written queries. > > > > Would you accept suggestions? > > You mean that you want to have a go at it yourself? Sure, go ahead. > > I cannot promise that I'll accept your suggested revisions, but if > they aren't too invasive/complicated compared to what I have now, then > I will just accept them. I maintain that this isn't really necessary, > but it might be the path of least resistance at this point. Attached 2 patches: v11.patch-a and v11.patch-b. Both are incremental on top of your earlier set, and both don't allocate additional memory in the merge operation in non-assertion builds. patch-a is a trivial and clean implementation of mergesort, which tends to reduce the total number of compare operations if both arrays are of similar size and value range. It writes data directly back into the main array on non-assertion builds, and with assertions it reuses your binary search join on scratch space for algorithm validation. patch-b is an improved version of patch-a with reduced time complexity in some cases: It applies exponential search to reduce work done where there are large runs of unmatched values in either array, and does that while keeping O(n+m) worst-case complexity, now getting a best-case O(log(n)) complexity. > > > > I'm also no fan of the (tail) recursion. I would agree that this is > > > > unlikely to consume a lot of stack, but it does consume stackspace > > > > nonetheless, and I'd prefer if it was not done this way. > > > > > > I disagree. The amount of stack space it consumes in the worst case is fixed. > > > > Is it fixed? Without digging very deep into it, it looks like it can > > iterate on the order of n_scankeys deep? The comment above does hint > > on 2 iterations, but is not very clear about the conditions and why. > > The recursive call to _bt_advance_array_keys is needed to deal with > unsatisfied-and-required inequalities that were not detected by an > initial call to _bt_check_compare, following a second retry call to > _bt_check_compare. We know that this recursive call to > _bt_advance_array_keys won't cannot recurse again because we've > already identified which specific inequality scan key it is that's a > still-unsatisfied inequality, following an initial round of array key > advancement. So, it's a case of this? 
scankey: a > 1 AND a = ANY (1 (=current), 2, 3)) AND b < 4 AND b = ANY (2 (=current), 3, 4) tuple: a=2, b=4 We first match the 'a' inequality key, then find an array-key mismatch breaking out, then move the array keys forward to a=2 (of ANY (1,2,3)), b=4 (of ANY(2, 3, 4)); and now we have to recheck the later b < 4 inequality key, as that required inequality key wasn't checked because the earlier arraykey did not match in _bt_check_compare, causing it to stop at the a=1 condition as opposed to check the b < 4 condition? If so, then this visual explanation helped me understand the point and why it can't repeat more than once much better than all that text. Maybe this can be integrated? > This is a narrow edge-case -- it is an awkward one. Recursion does > seem natural here, because we're essentially repeating the call to > _bt_advance_array_keys from inside _bt_advance_array_keys, rather than > leaving it up to the usual external caller to do later on. If you > ripped this code out it would barely be noticeable at all. But it > seems worth it so that we can make a uniform promise to always advance > the array keys to the maximum extent possible, based on what the > caller's tuple tells us about the progress of the scan. Agreed on the awkwardness and recursion. > > > Because in one case we follow the convention for scan keys, whereas > > > so->orderProcs is accessed via an attribute number subscript. > > > > Okay, but how about this in _bt_merge_arrays? > > > > + Datum *elem = elems_orig + i; > > > > I'm not familiar with the scan key convention, as most other places > > use reference+subscripting. > > I meant the convention used in code like _bt_check_compare (which is > what we call _bt_checkkeys on HEAD, basically). > > Note that the _bt_merge_arrays code that you've highlighted isn't > iterating through so->keyData[] -- it is iterating through the > function caller's elements array, which actually come from > so->arrayKeys[]. It was exactly why I started asking about the use of pointer additions: _bt_merge_arrays is new code and I can't think of any other case of this style being used in new code, except maybe when surrounding code has this style. The reason I first highlighted _bt_update_keys_with_arraykeys is because it had a clear case of 2 different styles in the same function. > Like every other Postgres contributor, I do my best to follow the > conventions established by existing code. Sometimes that leads to > pretty awkward results, where CamelCase and underscore styles are > closely mixed together, because it works out to be the most consistent > way of doing it overall. I'm slightly surprised by that, as after this patch I can't find any code that wasn't touched by this patch in nbtutils that uses + for pointer offsets. Either way, that's the style for indexing into a ScanKey, apparently? Kind regards, Matthias van de Meent Neon (https://neon.tech)
Attachment
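A rough sketch of the exponential ("galloping") search idea described for v11.patch-b, again with plain ints rather than Datums and ORDER procs -- this is not the actual patch, just an illustration of the technique:

/*
 * When one input has a long run of values that cannot possibly match, gallop
 * ahead in doubling steps and then binary-search within the final bracket,
 * instead of stepping one element at a time.
 */
#include <stdio.h>

/* Return smallest index in [lo, n) with arr[idx] >= key, galloping from lo */
static int
gallop_lower_bound(const int *arr, int n, int lo, int key)
{
    int     step = 1,
            hi = lo;

    while (hi < n && arr[hi] < key)
    {
        lo = hi + 1;
        hi += step;
        step *= 2;              /* exponential probe: 1, 2, 4, 8, ... */
    }
    if (hi > n)
        hi = n;

    while (lo < hi)             /* binary search inside the bracket */
    {
        int     mid = lo + (hi - lo) / 2;

        if (arr[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Intersect sorted, duplicate-free a[] and b[], writing matches back into a[] */
static int
gallop_intersect(int *a, int na, const int *b, int nb)
{
    int     i = 0,
            j = 0,
            nout = 0;

    while (i < na && j < nb)
    {
        if (a[i] == b[j])
        {
            a[nout++] = a[i];
            i++;
            j++;
        }
        else if (a[i] < b[j])
            i = gallop_lower_bound(a, na, i + 1, b[j]);
        else
            j = gallop_lower_bound(b, nb, j + 1, a[i]);
    }
    return nout;
}

int
main(void)
{
    int     a[] = {1, 2, 3, 1000, 1001, 1002, 5000};
    int     b[] = {3, 4, 5, 1001, 5000, 9000};
    int     n = gallop_intersect(a, 7, b, 6);

    for (int k = 0; k < n; k++)
        printf("%d ", a[k]);    /* prints: 3 1001 5000 */
    printf("\n");
    return 0;
}

Long runs of unmatched values are skipped in a logarithmic number of comparisons, while fully interleaved inputs degrade gracefully to the plain O(n + m) merge.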
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Mon, Jan 22, 2024 at 4:13 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > On Fri, 19 Jan 2024 at 23:42, Peter Geoghegan <pg@bowt.ie> wrote: > And this is where I disagree with your (percieved) implicit argument > that this should be and always stay this way. I never said that, and didn't intend to imply it. As I outlined to you back in November, my general philosophy around APIs (such as the index AM API) is that ambiguity about the limits and extent of an abstract interface isn't necessarily a bad thing [1]. It can actually be a good thing! Ever hear of Hyrum's Law? Abstractions are very often quite leaky. APIs like the index AM API inevitably make trade-offs. Trade-offs are almost always political, in one way or another. Litigating every possible question up-front requires knowing ~everything before you really get started. This mostly ends up being a waste of time, since many theoretical contentious trade-offs just won't matter one bit in the long run, for one reason or another (not necessarily because there is only ever one consumer of an API, for all sorts of reasons). You don't have to agree with me. That's just what my experience indicates works best on average, and in the long run. I cannot justify it any further than that. > By keeping that expectation alive, > this becomes a self-fulfilling prophecy, and I really don't like such > restrictive self-fulfilling prophecies. It's nice that we have index > AM feature flags, but if we only effectively allow one implementation > for this set of flags by ignoring potential other users when changing > behavior, then what is the API good for (apart from abstraction > layering, which is nice)? I explicitly made it clear that I don't mean that. > I think that may be right, but I could very well have built a > btree-lite that doesn't have the historical baggage of having to deal > with pages from before v12 (version 4) and some improvements that > haven't made it to core yet. Let me know if you ever do that. Let me know what the problems you encounter are. I'm quite happy to revise my position on this in light of new information. I change my mind all the time! > Something like the following, maybe? > > """ > Compatibility: The planner will now generate paths with array scan > keys in any column for ordered indexes, rather than on only a prefix > of the index columns. The planner still expects fully ordered data > from those indexes. > Historically, the btree AM couldn't output correctly ordered data for > suffix array scan keys, which was "fixed" by prohibiting the planner > from generating array scan keys without an equality prefix scan key up > to that attribute. In this commit, the issue has been fixed, and the > planner restriction has thus been removed as the only in-core IndexAM > that reports amcanorder now supports array scan keys on any column > regardless of what prefix scan keys it has. > """ Even if I put something like this in the commit message, I doubt that Bruce would pick it up in anything like this form (I have my doubts about it happening no matter what wording is used, actually). I could include something less verbose, mentioning a theoretical risk to out-of-core amcanorder routines that coevolved with nbtree, inherited the same SAOP limitations, and then never got the same set of fixes. > patch-a is a trivial and clean implementation of mergesort, which > tends to reduce the total number of compare operations if both arrays > are of similar size and value range. 
It writes data directly back into > the main array on non-assertion builds, and with assertions it reuses > your binary search join on scratch space for algorithm validation. I'll think about this some more. > patch-b is an improved version of patch-a with reduced time complexity > in some cases: It applies exponential search to reduce work done where > there are large runs of unmatched values in either array, and does > that while keeping O(n+m) worst-case complexity, now getting a > best-case O(log(n)) complexity. This patch seems massively over-engineered, though. Over 200 new lines of code, all for a case that is only used when queries are written with obviously-contradictory quals? It's just not worth it. > > The recursive call to _bt_advance_array_keys is needed to deal with > > unsatisfied-and-required inequalities that were not detected by an > > initial call to _bt_check_compare, following a second retry call to > > _bt_check_compare. We know that this recursive call to > > _bt_advance_array_keys won't cannot recurse again because we've > > already identified which specific inequality scan key it is that's a > > still-unsatisfied inequality, following an initial round of array key > > advancement. > > So, it's a case of this? > > scankey: a > 1 AND a = ANY (1 (=current), 2, 3)) AND b < 4 AND b = ANY > (2 (=current), 3, 4) > tuple: a=2, b=4 I don't understand why your example starts with "scankey: a > 1" and uses redundant/contradictory scan keys (for both "a" and "b"). For a forward scan the > scan key won't be seen as required by _bt_advance_array_keys, which means that it cannot be relevant to the branch containing the recursive call to _bt_advance_array_keys. (The later branch that calls _bt_check_compare is another matter, but that doesn't call _bt_advance_array_keys recursively at all -- that's not what we're discussing.) I also don't get why you've added all of this tricky redundancy to the example you've proposed. That seems to make the example a lot more complicated, without any apparent advantage. This particular piece of code has nothing to do with redundant/contradictory scan keys. > We first match the 'a' inequality key, then find an array-key mismatch > breaking out, then move the array keys forward to a=2 (of ANY > (1,2,3)), b=4 (of ANY(2, 3, 4)); and now we have to recheck the later > b < 4 inequality key, as that required inequality key wasn't checked > because the earlier arraykey did not match in _bt_check_compare, > causing it to stop at the a=1 condition as opposed to check the b < 4 > condition? I think that that's pretty close, yes. Obviously, _bt_check_compare() is going to give up upon finding the most significant now-unsatisfied scan key (which must be a required scan key in cases relevant to the code in question) -- at that point we need to advance the array keys. If we go on to successfully advance all required arrays to values that all match the corresponding values from caller's tuple, we might still have to consider inequalities (that are required in the current scan direction). It might still turn out that the relevant second call to _bt_check_compare() (the first one inside _bt_advance_array_keys) still sets continuescan=false -- despite what we were able to do with the scan's array keys. At that point, the only possible explanation is that there is a required inequality that still isn't satisfied. 
We should use "beyond_end_advance" advancement to advance the array keys "incrementally" a second time (in a recursive call "against the same original tuple"). This can only happen when the value of each of two required scan keys (some equality scan key plus some later inequality scan key) both cease to be satisfied at the same time, *within the same tuple*. In practice this just doesn't happen very often. It could in theory be avoided altogether (without any behavioral change) by forcing _bt_check_compare to always assess every scan key, rather than giving up upon finding any unsatisfied scan key (though that would be very inefficient). It'll likely be much easier to see what I mean by considering a real example. See my test case for this, which I don't currently plan to commit: https://github.com/petergeoghegan/postgres/blob/saop-dynamic-skip-v10.0/src/test/regress/sql/dynamic_saop_advancement.sql#L3600 I think that only this test case is the only one that'll actually break when you just rip out the recursive _bt_advance_array_keys call. And it still won't give incorrect answers when it breaks -- it just accesses a single extra leaf page. As I went into a bit already, upthread, this recursive call business is a good example of making the implementation more complicated, in order to preserve interface simplicity and generality. I think that that's the right trade-off here, despite being kinda awkward. > If so, then this visual explanation helped me understand the point and > why it can't repeat more than once much better than all that text. > Maybe this can be integrated? An example seems like it'd be helpful as a code comment. Can you come up with a simplified one? > It was exactly why I started asking about the use of pointer > additions: _bt_merge_arrays is new code and I can't think of any other > case of this style being used in new code, except maybe when > surrounding code has this style. The reason I first highlighted > _bt_update_keys_with_arraykeys is because it had a clear case of 2 > different styles in the same function. Meh. [1] https://postgr.es/m/CAH2-WzmWn2_eS_4rWy90DRzC-NW-oponONR6PwMqy+OOuvVyFA@mail.gmail.com -- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Mon, Jan 22, 2024 at 4:13 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > Attached 2 patches: v11.patch-a and v11.patch-b. Both are incremental > on top of your earlier set, and both don't allocate additional memory > in the merge operation in non-assertion builds. > > patch-a is a trivial and clean implementation of mergesort, which > tends to reduce the total number of compare operations if both arrays > are of similar size and value range. It writes data directly back into > the main array on non-assertion builds, and with assertions it reuses > your binary search join on scratch space for algorithm validation. This patch fails some of my tests on non-assert builds only (assert builds pass all my tests, though). I'm using the first patch on its own here. While I tend to be relatively in favor of complicated assertions (I tend to think the risk of introducing side-effects is worth it), it looks like you're only performing certain steps in release builds. This is evident from just looking at the code (there is an #else block just for the release build in the loop). Note also that "Assert(_bt_compare_array_elements(&merged[merged_nelems++], orig, &cxt) == 0)" has side-effects in assert-enabled builds only (it increments merged_nelems). While it's true that you *also* increment merged_nelems *outside* of the assertion (or in an #else block used during non-assert builds), that is conditioned on some other thing (so it's in no way equivalent to the debug #ifdef USE_ASSERT_CHECKING case). It's also just really hard to understand what's going on here. If I was going to do this kind of thing, I'd use two completely separate loops, that were obviously completely separate (maybe even two functions). I'd then memcmp() each array at the end. -- Peter Geoghegan
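The underlying hazard is generic C rather than anything specific to this patch: any side effect written inside an assertion disappears in builds that compile assertions out (NDEBUG for <assert.h>; PostgreSQL's Assert() under USE_ASSERT_CHECKING behaves analogously). A minimal illustration:

/*
 * Demonstrates how a side effect inside an assertion makes assert-enabled
 * and assert-disabled builds diverge.
 */
#include <assert.h>
#include <stdio.h>

int
main(void)
{
    int     nelems = 0;

    /* BAD: the increment only happens in assert-enabled builds */
    assert(nelems++ == 0);

    /* GOOD: do the work unconditionally, assert on the result afterwards */
    int     prev = nelems;

    nelems++;
    assert(nelems == prev + 1);

    printf("nelems = %d\n", nelems);    /* 2 with assertions, 1 without */
    return 0;
}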
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Tue, Jan 23, 2024 at 3:22 PM Peter Geoghegan <pg@bowt.ie> wrote: > I could include something less verbose, mentioning a theoretical risk > to out-of-core amcanorder routines that coevolved with nbtree, > inherited the same SAOP limitations, and then never got the same set > of fixes. Attached is v11, which now says something like that in the commit message. Other changes: * Fixed buggy sorting of arrays using cross-type ORDER procs, by recognizing that we need to consistently use same-type ORDER procs for sorting and merging the arrays during array preprocessing. Obviously, when we sort, we compare array elements to other array elements (all of the same type). This is true independent of whether the query itself happens to use a cross type operator/ORDER proc, so we will need to do two separate ORDER proc lookups in cross-type scenarios. * No longer subscript the ORDER proc used for array binary searches using a scankey subscript. Now there is an additional indirection that works even in the presence of multiple redundant scan keys that cannot be detected as such due to a lack of appropriate cross-type support within an opfamily. This was subtly buggy before now. Requires a little more coordination between array preprocessing and standard/primitive index scan preprocessing, which isn't ideal but seems unavoidable. * Lots of code polishing, especially within _bt_advance_array_keys(). While _bt_advance_array_keys() still works in pretty much exactly the same way as it did back in v10, there are now better comments. Including something about why its recursive call to itself is guaranteed to use a low, fixed amount of stack space, verified using an assertion. That addresses a concern held by Matthias. Outlook ======= This patch is approaching being committable now. Current plan is to commit this within the next few weeks. All that really remains now is to research how we might integrate this work with the recently added continuescanPrechecked/haveFirstMatch stuff from Alexander Korotkov, if at all. I've put that off until now because it isn't particularly fundamental to what I'm doing here, and seems optional. I would also like to do more performance validation. Things like the parallel index scan code could stand to be revisited once again. Plus I should think about the overhead of array preprocessing when btrescan() is called many times, from a nested loop join -- I should have something to say about that concern (raised by Heikki at one point) before too long. -- Peter Geoghegan
Attachment
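As a hedged sketch of the ORDER proc distinction described above -- not the patch's actual code, and assuming the (indexed type, element type) argument order for the scan-side lookup -- the two lookups might look roughly like this inside the backend:

/*
 * Sorting/merging an array compares elements against other elements of the
 * same type, so it needs the same-type support function; comparisons against
 * index column values in a cross-type scan need the cross-type function.
 */
#include "postgres.h"
#include "access/nbtree.h"
#include "fmgr.h"
#include "utils/lsyscache.h"

static void
lookup_array_order_procs(Oid opfamily, Oid indexed_type, Oid elem_type,
                         FmgrInfo *sortproc, FmgrInfo *scanproc)
{
    Oid     sortfn;
    Oid     scanfn;

    /* same-type proc: used to sort, deduplicate, and merge the array itself */
    sortfn = get_opfamily_proc(opfamily, elem_type, elem_type, BTORDER_PROC);

    /* possibly cross-type proc: used to compare index values to elements */
    scanfn = get_opfamily_proc(opfamily, indexed_type, elem_type, BTORDER_PROC);

    if (!OidIsValid(sortfn) || !OidIsValid(scanfn))
        elog(ERROR, "missing support function in opfamily %u", opfamily);

    fmgr_info(sortfn, sortproc);
    fmgr_info(scanfn, scanproc);
}

In the common non-cross-type case the two lookups resolve to the same function, but they cannot be assumed to.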
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Thu, Feb 15, 2024 at 6:36 PM Peter Geoghegan <pg@bowt.ie> wrote: > Attached is v11, which now says something like that in the commit > message. Attached is v12. > All that really remains now is to research how we might integrate this > work with the recently added continuescanPrechecked/haveFirstMatch > stuff from Alexander Korotkov, if at all. The main change in v12 is that I've integrated both the continuescanPrechecked and the haveFirstMatch optimizations. Both of these fields are now page-level state, shared between the _bt_readpage caller and the _bt_checkkeys/_bt_advance_array_keys callees (so they appear next to the new home for _bt_checkkeys' continuescan field, in the new page state struct). The haveFirstMatch optimization (not a great name BTW) was the easier of the two to integrate. I just invalidate the flag (set it to false) when the array keys advance. It works in exactly the same way as in the simple no-arrays case, except that there can be more than one "first match" per page. (This approach was directly enabled by the new design that came out of Alexander's bugfix commit 7e6fb5da) The continuescanPrechecked optimization was trickier. The hard part was avoiding confusion when the start of matches for the current set of array keys starts somewhere in the middle of the page -- that needs to be a gating condition on applying the optimization, applied within _bt_readpage. continuescanPrechecked ====================== To recap, this precheck optimization is predicated on the idea that the precheck _bt_checkkeys call tells us all we need to know about required-in-same-direction scan keys for every other tuple on the page (assuming continuescan=true is set during the precheck). Critically, this assumes that there can be no confusion between "before the start of matching tuples" and "after the end of matching tuples" (in the presence of required equality strategy scan keys). In other words, it tacitly assumes that we can't break a very old rule that says that _bt_first must never call _bt_readpage with an offnum that's before the start of the first match (the leaf page position located by our insertion scan key). Although it isn't obvious that this is relevant at all right now (partly because the _bt_first call to _bt_readpage won't even use the optimization on the grounds that to do so would regress point queries), it becomes more obvious once my patch is added to the mix. To recap again, making the arrays advance dynamically necessarily requires that I teach _bt_checkkeys to avoid confusing "before the start of matching tuples" with "after the end of matching tuples" -- we must give _bt_checkkeys a set of 3-way ORDER procs (and the necessary context) to avoid that confusion -- even for any non-array required equality keys. Putting it all together (having laid the groundwork with my recaps): This means that it is only safe (for my patch) to even attempt the precheck optimization when we already know that the array keys cover keyspace at the start of the page (and if the attempt actually succeeds, it succeeds because the current array keys cover the entire page). And so in v12 I track that state in so->scanBehind. In practice, we only rarely need to avoid the continuescanPrechecked optimization as a result of this gating condition -- it hardly reduces the effectiveness of the optimization at all. In fact, it isn't even possible for so->scanBehind to ever be set during a backward scan. 
Saving cycles in _bt_checkkeys with so->scanBehind ------------------------------------------------- It turns out that this so->scanBehind business is generally useful. We now expect _bt_advance_array_keys to establish whether or not the _bt_readpage-wise scan will continue to the end of the current leaf page up-front, each time the arrays are advanced (advanced past the tuple being scanned by _bt_readpage). When _bt_advance_array_keys decides to move onto the next page, it might also set so->scanBehind. This structure allowed me to simplify the hot code paths within _bt_checkkeys. Now _bt_checkkeys doesn't need to check the page high key (or check the first non-pivot tuple in the case of backwards scans) at all in the very common case where so->scanBehind hasn't been set by _bt_advance_array_keys. I mentioned that only forward scans can ever set the so->scanBehind flag variable. That's because only forward scans can advance the array keys using the page high key, which (due to suffix truncation) creates the possibility that the next _bt_readpage will start out on a page whose earlier tuples are before the real start of matches (for the current array keys) -- see the comments in _bt_advance_array_keys for the full explanation. so->scanBehind summary ---------------------- In summary, so->scanBehind serves two quite distinct purposes: 1. Makes it safe to apply the continuescanPrechecked optimization during scans that happen to use array keys -- with hardly any changes required to _bt_readpage to make this work (just testing the so->scanBehind flag). 2. Gives _bt_checkkeys() a way to know when to check whether an earlier speculative choice (by _bt_advance_array_keys) -- the choice to move on to the current leaf page without it yet being 100% clear that its key space has exact matches for all the scan's equality/array keys -- actually worked out (such uncertainty is only possible when suffix truncation obscures the issue, hence the thing about so->scanBehind only ever being set during forward scans). This speculative choice can only be made during forward scans of a composite index, whose high key has one or more truncated suffix attributes that correspond to a required scan key. _bt_advance_array_keys deems that the truncated attribute "satisfies" the scan key, but remembers that it wasn't strictly speaking "satisfied", by setting so->scanBehind (so it is "satisfied" with the truncated attributes being a "match", but not so satisfied that it's willing to allow it without also giving _bt_checkkeys a way to back out if it turns out to have been the wrong choice once on the next page). _bt_preprocess_keys contradictory key array advancement ======================================================= The other big change in v12 concerns _bt_preprocess_keys (not to be confused with _bt_preprocess_array_keys). It has been taught to treat the scan's array keys as a distinct type of scan key for the purposes of detecting redundant and contradictory scan keys. In other words, it treats scan keys with arrays as their own special type of equality scan key -- it doesn't treat them as just another type of equality strategy key in v12. In other other words, _bt_preprocess_keys doesn't "work at the level of individual primitive index scans" in v12. Technically this is an optimization that targets cases with many distinct sets of array keys that have contradictory quals. 
But I mostly did this because it has what I now consider to be far better code structure than what we had in previous versions (I changed my mind, having previously considered the changes I'd made to _bt_preprocess_keys for arrays to be less than desirable). And because it defensively makes the B-Tree code tolerant of scans that have an absurdly large number of distinct sets of array keys (e.g., 3 arrays with 1,000 elements, which gives us a billion distinct sets of array keys in total). In v12 an index scan can use (say) a billion distinct sets of array keys, and still expect optimal performance -- even in the corner case where (for whatever reason) the scan's quals also happen to be contradictory. There is no longer any risk of calling _bt_preprocess_keys a billion times, making the query time explode, all because somebody added an "a = 5 AND a = 6" to the end of the query's qual. In fact, there isn't ever a need to call _bt_preprocess_keys() more than once, no matter the details of the scan -- not now that _bt_preprocess_keys literally operates on whole arrays as a separate thing to other equality keys. Once you start to think in terms of array scan keys (not primitive index scans), it becomes obvious that a single call to _bt_preprocess_keys is all you really need, (Note: While we actually do still always call _bt_preprocess_keys from _bt_first, every call to _bt_preprocess_keys after the first one can now simply reuse the existing scan keys that were output during the first call. So effectively there's only one call to _bt_preprocess_keys required per top-level scan.) Potential downsides ------------------- Admittedly, this new approach in _bt_preprocess_keys has some potential downsides. _bt_preprocess_keys is now largely incapable of treating quals like "WHERE a IN(1,2,3) AND a = 2" as "partially contradictory". What it'll actually do is allow the qual to contain duplicative scan keys (just like when we can't detect redundancy due to a lack of cross-type support), and leave it all up to _bt_checkkeys/_bt_advance_array_keys to figure it out later on. (At one point I actually taught _bt_preprocesses_keys to perform "incremental advancement" of the scan's arrays itself, which worked, but that still left the code vulnerable to having to call _bt_preprocesses_keys an enormous number of times in extreme cases. That was deemed unacceptable, because it still seemed like a POLA violation.) This business with just leaving in some extra scan keys might seem a little bit sloppy. I thought so myself, for a while, but now I think that it's actually fine. For the following reasons: * Even with arrays, contradictory quals like "WHERE a = 4 AND a = 5" are still detected as contradictory, while quals like "WHERE a = 5 and a = 5" are still detected as containing a redundancy. We'll only "fail to detect redundancy/contradictoriness" with cases such as "WHERE a IN (1, 2, 3) and a = 2" -- cases that (once you buy into this new way of thinking about arrays and preprocessing) aren't really contradictory/redundant at all. * We have _bt_preprocess_array_keys, which was first taught to merge together redundant array keys in an earlier version of this patch. So, for example, quals like "WHERE a IN (1, 2, 3) AND a IN (3,4,5)" can be preprocessed into "WHERE a in (3)". Plus quals like "WHERE a IN (1, 2, 3) AND a IN (18,72)" can be ruled contradictory there and then, in _bt_preprocess_array_keys , ending the scan immediately. These cases seem like the really important ones for array redundancy/contradictoriness. 
We're not going to miss these array redundancies. * The new approach to array key advancement is very good at quickly eliminating redundant subsets of the array keys that couldn't be detected as such by preprocessing, due to a lack of suitable infrastructure. While there is some downside from allowing preprocessing to fail to detect "partially contradictory" quals, the new code is so good at advancing the array keys efficiently at runtime the added overhead is hardly noticeable at all. We're talking maybe one or two extra descents of the index, for just a subset of all badly written queries that show some kind of redundancy/contradictoriness involving array scan keys. Even if somebody takes the position that this approach isn't acceptable (not that I expect that anybody will), then the problem won't be that I've gone too far. The problem will just be that I haven't gone far enough. If it somehow becomes a blocker to commit, I'd rather handle the problem by compensating for the minor downsides that the v12 _bt_preprocess_keys changes arguably create, by adding more types of array-specific preprocessing to _bt_preprocess_array_keys. You know, preprocessing that can actually eliminate a subset of array keys given a qual like "WHERE a IN (1, 2, 3) AND a < 2". To be clear, I really don't think it's worth adding new types of array preprocessing to _bt_preprocess_array_keys, just for these cases -- my current approach works just fine, by any relevant metric (even if you assume that the sorts of array redundancy/contradictoriness we can miss out on are common and important, it's small potatoes). Just bringing the idea of adding more stuff to _bt_preprocess_array_keys to the attention of the group, as an option (this idea might also provide useful context, that makes my high level thinking on this a bit easier to follow). -- Peter Geoghegan
Attachment
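To illustrate the kind of extra array preprocessing being floated at the end of that message (purely hypothetical -- nothing like it is claimed to exist in v12), filtering a sorted array against a scalar inequality is only a few lines, shown here with plain ints standing in for Datums:

/*
 * Toy filter that turns "a IN (1, 2, 3) AND a < 2" into "a IN (1)", and
 * detects the contradictory case where nothing survives.
 */
#include <stdio.h>

/* Keep only elements strictly below 'bound'; returns the new length */
static int
filter_array_below(int *elems, int nelems, int bound)
{
    int     nout = 0;

    for (int i = 0; i < nelems; i++)
    {
        if (elems[i] < bound)
            elems[nout++] = elems[i];
        else
            break;              /* sorted input: nothing later can qualify */
    }
    return nout;                /* 0 means the qual is contradictory */
}

int
main(void)
{
    int     arr[] = {1, 2, 3};
    int     n = filter_array_below(arr, 3, 2);

    if (n == 0)
        printf("contradictory qual -- scan can end immediately\n");
    else
    {
        for (int i = 0; i < n; i++)
            printf("%d ", arr[i]);      /* prints: 1 */
        printf("\n");
    }
    return 0;
}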
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Matthias van de Meent
Date:
On Sat, 2 Mar 2024 at 02:30, Peter Geoghegan <pg@bowt.ie> wrote: > > On Thu, Feb 15, 2024 at 6:36 PM Peter Geoghegan <pg@bowt.ie> wrote: > > Attached is v11, which now says something like that in the commit > > message. > > Attached is v12. Some initial comments on the documentation: > + that searches for multiple values together. Queries that use certain > + <acronym>SQL</acronym> constructs to search for rows matching any value > + out of a list (or an array) of multiple scalar values might perform > + multiple <quote>primitive</quote> index scans (up to one primitive scan > + per scalar value) at runtime. See <xref linkend="functions-comparisons"/> > + for details. I don't think the "see <functions-comparisons> for details" is correctly worded: The surrounding text implies that it would contain details about in which precise situations multiple primitive index scans would be consumed, while it only contains documentation about IN/NOT IN/ANY/ALL/SOME. Something like the following would fit better IMO: + that searches for multiple values together. Queries that use certain + <acronym>SQL</acronym> constructs to search for rows matching any value + out of a list or array of multiple scalar values (such as those described in + <functions-comparisons> might perform multiple <quote>primitive</quote> + index scans (up to one primitive scan per scalar value) at runtime. Then there is a second issue in the paragraph: Inverted indexes such as GIN might well decide to start counting more than one "primitive scan" per scalar value, because they may need to go through their internal structure more than once to produce results for a single scalar value; e.g. with queries WHERE textfield LIKE '%some%word%', a trigram index would likely use at least 4 descents here: one for each of "som", "ome", "wor", "ord". > > All that really remains now is to research how we might integrate this > > work with the recently added continuescanPrechecked/haveFirstMatch > > stuff from Alexander Korotkov, if at all. > > The main change in v12 is that I've integrated both the > continuescanPrechecked and the haveFirstMatch optimizations. Both of > these fields are now page-level state, shared between the _bt_readpage > caller and the _bt_checkkeys/_bt_advance_array_keys callees (so they > appear next to the new home for _bt_checkkeys' continuescan field, in > the new page state struct). Cool. I'm planning to review the rest of this patch this week/tomorrow, could you take some time to review some of my btree patches, too? Kind regards, Matthias van de Meent Neon (https://neon.tech)
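To make the trigram example above concrete, a toy sliding window is enough to show where the four descents come from (pg_trgm's real extraction also pads and case-folds, so this is only illustrative):

/*
 * The words "some" and "word" each yield two 3-character substrings, so a
 * trigram index could need four separate descents for
 * WHERE textfield LIKE '%some%word%'.
 */
#include <stdio.h>
#include <string.h>

static void
print_trigrams(const char *word)
{
    size_t  len = strlen(word);

    for (size_t i = 0; i + 3 <= len; i++)
        printf("\"%.3s\" ", word + i);
}

int
main(void)
{
    print_trigrams("some");     /* "som" "ome" */
    print_trigrams("word");     /* "wor" "ord" */
    printf("\n");
    return 0;
}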
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Mon, Mar 4, 2024 at 3:51 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > > + that searches for multiple values together. Queries that use certain > > + <acronym>SQL</acronym> constructs to search for rows matching any value > > + out of a list (or an array) of multiple scalar values might perform > > + multiple <quote>primitive</quote> index scans (up to one primitive scan > > + per scalar value) at runtime. See <xref linkend="functions-comparisons"/> > > + for details. > > I don't think the "see <functions-comparisons> for details" is > correctly worded: The surrounding text implies that it would contain > details about in which precise situations multiple primitive index > scans would be consumed, while it only contains documentation about > IN/NOT IN/ANY/ALL/SOME. > > Something like the following would fit better IMO: > > + that searches for multiple values together. Queries that use certain > + <acronym>SQL</acronym> constructs to search for rows matching any value > + out of a list or array of multiple scalar values (such as those > described in > + <functions-comparisons> might perform multiple <quote>primitive</quote> > + index scans (up to one primitive scan per scalar value) at runtime. I think that there is supposed to be a closing parenthesis here? So "... (such as those described in <functions-comparisons>") might perform...". FWM, if that's what you meant. > Then there is a second issue in the paragraph: Inverted indexes such > as GIN might well decide to start counting more than one "primitive > scan" per scalar value, because they may need to go through their > internal structure more than once to produce results for a single > scalar value; e.g. with queries WHERE textfield LIKE '%some%word%', a > trigram index would likely use at least 4 descents here: one for each > of "som", "ome", "wor", "ord". Calling anything other than an executor invocation of amrescan/index_rescan an "index scan" (or "primitive index scan") is a bit silly IMV, since, as you point out, whether the index AM manages to do things like descend the underlying physical index structure is really just an implementation detail. GIN has more than one B-Tree, so it's not like there's necessarily just one "index structure", anyway. It seems to me the only meaningful definition of index scan is something directly tied to how the index AM gets invoked. That's a logical, index-AM-agnostic concept. It has a fixed relationship to how EXPLAIN ANALYZE displays information. But we're not living in a perfect world. The fact is that the pg_stat_*_tables system views follow a definition that brings these implementation details into it. Its definition of index scan is probably a legacy of when SAOPs could only be executed in a way that was totally opaque to the index AM (which is how it still works for every non-nbtree index AM). Back then, "index scan" really was synonymous with an executor invocation of amrescan/index_rescan with SAOPs. I've described the issues in this area (in the docs) in a way that is most consistent with historical conventions. That seems to have the fewest problems, despite everything I've said about it. > > > All that really remains now is to research how we might integrate this > > > work with the recently added continuescanPrechecked/haveFirstMatch > > > stuff from Alexander Korotkov, if at all. > > > > The main change in v12 is that I've integrated both the > > continuescanPrechecked and the haveFirstMatch optimizations. 
Both of > > these fields are now page-level state, shared between the _bt_readpage > > caller and the _bt_checkkeys/_bt_advance_array_keys callees (so they > > appear next to the new home for _bt_checkkeys' continuescan field, in > > the new page state struct). > > Cool. I'm planning to review the rest of this patch this > week/tomorrow, could you take some time to review some of my btree > patches, too? Okay, I'll take a look again. Attached is v13. At one point Heikki suggested that I just get rid of BTScanOpaqueData.arrayKeyData (which has been there for as long as nbtree had native support for SAOPs), and use BTScanOpaqueData.keyData exclusively instead. I've finally got around to doing that now. These simplifications were enabled by my new approach within _bt_preprocess_keys, described when I posted v12. v13 goes even further than v12 did, by demoting _bt_preprocess_array_keys to a helper function for _bt_preprocess_keys. That means that we do all of our scan key preprocessing at the same time, at the start of _bt_first (though only during the first _bt_first, or to be more precise during the first per btrescan). If we need fancier preprocessing/normalization for arrays, then it ought to be a lot easier with this structure. Note that we no longer need to have an independent representation of so->qual_okay for array keys (the convention of setting so->numArrayKeys to -1 for unsatisfiable array keys is no longer required). There is no longer any need for a separate pass to carry over the contents of BTScanOpaqueData.arrayKeyData to BTScanOpaqueData.keyData, which was confusing. Are you still interested in working directly on the preprocessing stuff? I have a feeling that I was slightly too optimistic about how likely we were to be able to get away with not having certain kinds of array preprocessing, back when I posted v12. It's true that the propensity of the patch to not recognize "partial redundancies/contradictions" hardly matters with redundant equalities, but inequalities are another matter. I'm slightly worried about cases like this one: select * from multi_test where a in (1,99, 182, 183, 184) and a < 183; Maybe we need to do better with that. What do you think? -- Peter Geoghegan
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Matthias van de Meent
Date:
On Wed, 6 Mar 2024 at 01:50, Peter Geoghegan <pg@bowt.ie> wrote: > > On Mon, Mar 4, 2024 at 3:51 PM Matthias van de Meent > <boekewurm+postgres@gmail.com> wrote: > > > + that searches for multiple values together. Queries that use certain > > > + <acronym>SQL</acronym> constructs to search for rows matching any value > > > + out of a list (or an array) of multiple scalar values might perform > > > + multiple <quote>primitive</quote> index scans (up to one primitive scan > > > + per scalar value) at runtime. See <xref linkend="functions-comparisons"/> > > > + for details. > > > > I don't think the "see <functions-comparisons> for details" is > > correctly worded: The surrounding text implies that it would contain > > details about in which precise situations multiple primitive index > > scans would be consumed, while it only contains documentation about > > IN/NOT IN/ANY/ALL/SOME. > > > > Something like the following would fit better IMO: > > > > + that searches for multiple values together. Queries that use certain > > + <acronym>SQL</acronym> constructs to search for rows matching any value > > + out of a list or array of multiple scalar values (such as those > > described in > > + <functions-comparisons> might perform multiple <quote>primitive</quote> > > + index scans (up to one primitive scan per scalar value) at runtime. > > I think that there is supposed to be a closing parenthesis here? So > "... (such as those described in <functions-comparisons>") might > perform...". Correct. > FWM, if that's what you meant. WFM, yes? > > Then there is a second issue in the paragraph: Inverted indexes such > > as GIN might well decide to start counting more than one "primitive > > scan" per scalar value, [...] > I've described the issues in this area (in the docs) in a way that is > most consistent with historical conventions. That seems to have the > fewest problems, despite everything I've said about it. Clear enough, thank you for explaining your thoughts on this. > > > > All that really remains now is to research how we might integrate this > > > > work with the recently added continuescanPrechecked/haveFirstMatch > > > > stuff from Alexander Korotkov, if at all. > > > > > > The main change in v12 is that I've integrated both the > > > continuescanPrechecked and the haveFirstMatch optimizations. Both of > > > these fields are now page-level state, shared between the _bt_readpage > > > caller and the _bt_checkkeys/_bt_advance_array_keys callees (so they > > > appear next to the new home for _bt_checkkeys' continuescan field, in > > > the new page state struct). > > > > Cool. I'm planning to review the rest of this patch this > > week/tomorrow, could you take some time to review some of my btree > > patches, too? > > Okay, I'll take a look again. Thanks, greatly appreciated. > At one point Heikki suggested that I just get rid of > BTScanOpaqueData.arrayKeyData (which has been there for as long as > nbtree had native support for SAOPs), and use > BTScanOpaqueData.keyData exclusively instead. I've finally got around > to doing that now. I'm not sure if it was worth the reduced churn when the changes for the primary optimization are already 150+kB in size; every "small" addition increases the context needed to review the patch, and it's already quite complex. > These simplifications were enabled by my new approach within > _bt_preprocess_keys, described when I posted v12. v13 goes even > further than v12 did, by demoting _bt_preprocess_array_keys to a > helper function for _bt_preprocess_keys. 
That means that we do all of > our scan key preprocessing at the same time, at the start of _bt_first > (though only during the first _bt_first, or to be more precise during > the first per btrescan). If we need fancier > preprocessing/normalization for arrays, then it ought to be a lot > easier with this structure. Agreed. > Note that we no longer need to have an independent representation of > so->qual_okay for array keys (the convention of setting > so->numArrayKeys to -1 for unsatisfiable array keys is no longer > required). There is no longer any need for a separate pass to carry > over the contents of BTScanOpaqueData.arrayKeyData to > BTScanOpaqueData.keyData, which was confusing. I wasn't very confused by it, but sure. > Are you still interested in working directly on the preprocessing > stuff? If you mean my proposed change to merging two equality AOPs, then yes. I'll try to fit it in tomorrow with the rest of the review. > I have a feeling that I was slightly too optimistic about how > likely we were to be able to get away with not having certain kinds of > array preprocessing, back when I posted v12. It's true that the > propensity of the patch to not recognize "partial > redundancies/contradictions" hardly matters with redundant equalities, > but inequalities are another matter. I'm slightly worried about cases > like this one: > > select * from multi_test where a in (1,99, 182, 183, 184) and a < 183; > > Maybe we need to do better with that. What do you think? Let me come back to that when I'm done reviewing the final part of nbtutils. > Attached is v13. It looks like there are few issues remaining outside the changes in nbtutils. I've reviewed the changes to those files, and ~half of nbtutils (up to _bt_advance_array_keys_increment) now. I think I can get the remainder done tomorrow. > +++ b/src/backend/access/nbtree/nbtutils.c > /* > * _bt_start_array_keys() -- Initialize array keys at start of a scan > * > * Set up the cur_elem counters and fill in the first sk_argument value for > - * each array scankey. We can't do this until we know the scan direction. > + * each array scankey. > */ > void > _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir) > @@ -519,159 +888,1163 @@ _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir) > BTScanOpaque so = (BTScanOpaque) scan->opaque; > int i; > > + Assert(so->numArrayKeys); > + Assert(so->qual_ok); Has the requirement for a known scan direction been removed, or should this still have an Assert(dir != NoMovementScanDirection)? > +++ b/src/backend/access/nbtree/nbtsearch.c > @@ -1713,17 +1756,18 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum, > [...] > - _bt_checkkeys(scan, itup, truncatt, dir, &continuescan, false, false); > + pstate.prechecked = false; /* prechecked earlier tuple */ I'm not sure that comment is correct, at least it isn't as clear as can be. Maybe something more in the line of the following? + pstate.prechecked = false; /* prechecked didn't cover HIKEY */ +++ b/src/backend/access/nbtree/nbtutils.c > @@ -272,7 +319,32 @@ _bt_preprocess_array_keys(IndexScanDesc scan) > + elemtype = cur->sk_subtype; > + if (elemtype == InvalidOid) > + elemtype = rel->rd_opcintype[cur->sk_attno - 1]; Should we Assert() that this elemtype matches the array element type ARR_ELEMTYPE(arrayval) used to deconstruct the array? > + /* > + * If this scan key is semantically equivalent to a previous equality > + * operator array scan key, merge the two arrays together to eliminate > [...] 
> + if (prevArrayAtt == cur->sk_attno && prevElemtype == elemtype) This is not "a" previous equality key, but "the" previous equality operator array scan key. Do we want to expend some additional cycles for detecting duplicate equality array key types in interleaved order like =int[] =bigint[] =int[]? I don't think it would be very expensive considering the limited number of cross-type equality operators registered in default PostgreSQL, so a simple loop that checks matching element types starting at the first array key scankey for this attribute should suffice. We could even sort the keys by element type if we wanted to fix any O(n) issues for this type of behaviour (though this is _extremely_ unlikely to become a performance issue). > + * qual is unsatisfiable > + */ > + if (num_elems == 0) > + { > + so->qual_ok = false; > + return NULL; > + } I think there's a memory context switch back to oldContext missing here, as well as above at `if (num_nonnulls == 0)`. These were probably introduced by changing from break to return; and the paths aren't yet exercised in regression tests. > +++ b/src/backend/utils/adt/selfuncs.c has various changes in comments that result in spelling and style issues: > + * primitive index scans that will be performed for caller + * primitive index scans that will be performed for the caller. (missing "the", period) > - * share for each outer scan. (Don't pro-rate for ScalarArrayOpExpr, > - * since that's internal to the indexscan.) > + * share for each outer scan The trailing period was lost: + * share for each outer scan. Kind regards, Matthias van de Meent Neon (https://neon.tech)
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Matthias van de Meent
Date:
On Wed, 6 Mar 2024 at 22:46, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > > On Wed, 6 Mar 2024 at 01:50, Peter Geoghegan <pg@bowt.ie> wrote: > > At one point Heikki suggested that I just get rid of > > BTScanOpaqueData.arrayKeyData (which has been there for as long as > > nbtree had native support for SAOPs), and use > > BTScanOpaqueData.keyData exclusively instead. I've finally got around > > to doing that now. > > I'm not sure if it was worth the reduced churn when the changes for > the primary optimization are already 150+kB in size; every "small" > addition increases the context needed to review the patch, and it's > already quite complex. To clarify, what I mean here is that merging the changes of both the SAOPs changes and the removal of arrayKeyData seems to increase the patch size and increases the maximum complexity of each component patch's review. Separate patches may make this more reviewable, or not, but no comment was given on why it is better to merge the changes into a single patch. - Matthias
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Benoit Tigeot
Date:
Hello,
I am not up to date with the last version of the patch, but I did a regular benchmark with version 11 and the typical issue we have at the moment, and the results are still very very good. [1]
In terms of performance improvement, the last proposals could be a real game changer for 2 of our biggest databases. We hope that Postgres 17 will contain those improvements.
Kind regards,
Benoit
[1] - https://gist.github.com/benoittgt/ab72dc4cfedea2a0c6a5ee809d16e04d?permalink_comment_id=4972955#gistcomment-4972955
__________
Benoit Tigeot
Le jeu. 7 mars 2024 à 15:36, Peter Geoghegan <pg@bowt.ie> a écrit :
On Tue, Jan 23, 2024 at 3:22 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I could include something less verbose, mentioning a theoretical risk
> to out-of-core amcanorder routines that coevolved with nbtree,
> inherited the same SAOP limitations, and then never got the same set
> of fixes.
Attached is v11, which now says something like that in the commit
message. Other changes:
* Fixed buggy sorting of arrays using cross-type ORDER procs, by
recognizing that we need to consistently use same-type ORDER procs for
sorting and merging the arrays during array preprocessing.
Obviously, when we sort, we compare array elements to other array
elements (all of the same type). This is true independent of whether
the query itself happens to use a cross type operator/ORDER proc, so
we will need to do two separate ORDER proc lookups in cross-type
scenarios.
* No longer subscript the ORDER proc used for array binary searches
using a scankey subscript. Now there is an additional indirection that
works even in the presence of multiple redundant scan keys that cannot
be detected as such due to a lack of appropriate cross-type support
within an opfamily.
This was subtly buggy before now. Requires a little more coordination
between array preprocessing and standard/primitive index scan
preprocessing, which isn't ideal but seems unavoidable.
* Lots of code polishing, especially within _bt_advance_array_keys().
While _bt_advance_array_keys() still works in pretty much exactly the
same way as it did back in v10, there are now better comments.
Including something about why its recursive call to itself is
guaranteed to use a low, fixed amount of stack space, verified using
an assertion. That addresses a concern held by Matthias.
Outlook
=======
This patch is approaching being committable now. Current plan is to
commit this within the next few weeks.
All that really remains now is to research how we might integrate this
work with the recently added continuescanPrechecked/haveFirstMatch
stuff from Alexander Korotkov, if at all. I've put that off until now
because it isn't particularly fundamental to what I'm doing here, and
seems optional.
I would also like to do more performance validation. Things like the
parallel index scan code could stand to be revisited once again. Plus
I should think about the overhead of array preprocessing when
btrescan() is called many times, from a nested loop join -- I should
have something to say about that concern (raised by Heikki at one
point) before too long.
--
Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Wed, Mar 6, 2024 at 4:46 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > On Wed, 6 Mar 2024 at 01:50, Peter Geoghegan <pg@bowt.ie> wrote: > > I think that there is supposed to be a closing parenthesis here? So > > "... (such as those described in <functions-comparisons>") might > > perform...". > > Correct. > > > FWM, if that's what you meant. > > WFM, yes? Then we're in agreement on this. > > At one point Heikki suggested that I just get rid of > > BTScanOpaqueData.arrayKeyData (which has been there for as long as > > nbtree had native support for SAOPs), and use > > BTScanOpaqueData.keyData exclusively instead. I've finally got around > > to doing that now. > > I'm not sure if it was worth the reduced churn when the changes for > the primary optimization are already 150+kB in size; every "small" > addition increases the context needed to review the patch, and it's > already quite complex. I agree that the patch is quite complex, especially relative to its size. > > Note that we no longer need to have an independent representation of > > so->qual_okay for array keys (the convention of setting > > so->numArrayKeys to -1 for unsatisfiable array keys is no longer > > required). There is no longer any need for a separate pass to carry > > over the contents of BTScanOpaqueData.arrayKeyData to > > BTScanOpaqueData.keyData, which was confusing. > > I wasn't very confused by it, but sure. > > > Are you still interested in working directly on the preprocessing > > stuff? > > If you mean my proposed change to merging two equality AOPs, then yes. > I'll try to fit it in tomorrow with the rest of the review. I didn't just mean that stuff. I was also suggesting that you could join the project directly (not just as a reviewer). If you're interested, you could do general work on the preprocessing of arrays. Fancier array-specific preprocessing. For example, something that can transform this: select * from foo where a in (1,2,3) and a < 3; Into this: select * from foo where a in (1,2); Or, something that can just set qual_okay=false, given a query such as: select * from foo where a in (1,2,3) and a < 5; This is clearly doable by reusing the binary search code during preprocessing. The first example transformation could work via a binary search for the constant "3", followed by a memmove() to shrink the array in-place (plus the inequality itself would need to be fully eliminated). The second example would work a little like the array merging thing that the patch has already, except that there'd only be one array involved (there wouldn't be a pair of arrays). > > select * from multi_test where a in (1,99, 182, 183, 184) and a < 183; > > > > Maybe we need to do better with that. What do you think? > > Let me come back to that when I'm done reviewing the final part of nbtutils. Even if this case doesn't matter (which I now doubt), it's probably easier to just get it right than it would be to figure out if we can live without it. > > void > > _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir) > > @@ -519,159 +888,1163 @@ _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir) > > BTScanOpaque so = (BTScanOpaque) scan->opaque; > > int i; > > > > + Assert(so->numArrayKeys); > > + Assert(so->qual_ok); > > Has the requirement for a known scan direction been removed, or should > this still have an Assert(dir != NoMovementScanDirection)? I agree that such an assertion is worth having. Added that locally. > I'm not sure that comment is correct, at least it isn't as clear as > can be. 
Maybe something more in the line of the following? > + pstate.prechecked = false; /* prechecked didn't cover HIKEY */ I agree that that's a little better than what I had. > +++ b/src/backend/access/nbtree/nbtutils.c > > @@ -272,7 +319,32 @@ _bt_preprocess_array_keys(IndexScanDesc scan) > > + elemtype = cur->sk_subtype; > > + if (elemtype == InvalidOid) > > + elemtype = rel->rd_opcintype[cur->sk_attno - 1]; > > Should we Assert() that this elemtype matches the array element type > ARR_ELEMTYPE(arrayval) used to deconstruct the array? Yeah, good idea. > This is not "a" previous equality key, but "the" previous equality > operator array scan key. > Do we want to expend some additional cycles for detecting duplicate > equality array key types in interleaved order like =int[] =bigint[] > =int[]? I don't think it would be very expensive considering the > limited number of cross-type equality operators registered in default > PostgreSQL, so a simple loop that checks matching element types > starting at the first array key scankey for this attribute should > suffice. We could even sort the keys by element type if we wanted to > fix any O(n) issues for this type of behaviour (though this is > _extremely_ unlikely to become a performance issue). Yes, I think that we should probably have this (though likely wouldn't bother sorting the scan keys themselves). You'd need to look-up cross-type operators for this. They could possibly fail, even when nothing else fails, because in principle you could have 3 types involved for only 2 scan keys: the opclass/on-disk type, plus 2 separate types. We could fail to find a cross-type operator for our "2 separate types", while still succeeding in finding 2 cross type operators for each of the 2 separate scan keys (each paired up the indexed column). When somebody writes a silly query with such obvious redundancies, there needs to be a sense of proportion about it. Fixed preprocessing costs don't seem that important. What's important is that we make a reasonable effort to avoid horrible runtime performance when it can be avoided, such as scanning the whole index. (If a user has a complaint about added cycles during preprocessing, then the fix for that is to just not write silly queries. I care much less about added cycles if they have only a fixed, small-ish added cost, that isn't borne by sensibly-written queries.) > > + * qual is unsatisfiable > > + */ > > + if (num_elems == 0) > > + { > > + so->qual_ok = false; > > + return NULL; > > + } > > I think there's a memory context switch back to oldContext missing > here, as well as above at `if (num_nonnulls == 0)`. These were > probably introduced by changing from break to return; and the paths > aren't yet exercised in regression tests. I agree that this is buggy. But it doesn't seem to actually fail in practice (it "accidentally fails to fail"). In practice, the whole index scan is shut down before anything breaks anyway. I mention this only because it's worth understanding that I do have coverage for this case (at least in my own expansive test suite). It just wasn't enough to detect this particular oversight. > > +++ b/src/backend/utils/adt/selfuncs.c > has various changes in comments that result in spelling and style issues: > > > + * primitive index scans that will be performed for caller > + * primitive index scans that will be performed for the caller. > (missing "the", period) I'll just keep the previous wording and punctuation, with one difference: "index scans" will be changed to "primitive index scans". 
> > - * share for each outer scan. (Don't pro-rate for ScalarArrayOpExpr, > > - * since that's internal to the indexscan.) > > + * share for each outer scan > > The trailing period was lost: > + * share for each outer scan. I'm not in the habit of including a period/full stop after a standalone sentence (actually, I'm in the habit of *not* doing so, specifically). But in the interest of consistency with surrounding code (and because it doesn't really matter), I'll add one back here. -- Peter Geoghegan
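A self-contained sketch of the array-shrinking transformation described above (illustrative only; the patch's real code works on Datums, compares them via the opfamily's support routines, and shrinks the array in place): a binary search locates the first element that contradicts the inequality, the array is truncated there, and if nothing survives the whole qual is unsatisfiable.

#include <stdbool.h>
#include <stdio.h>

/*
 * Apply "a < bound" (or "a <= bound" when strict is false) to a sorted,
 * duplicate-free array of equality keys.  Returns the number of surviving
 * elements; 0 means the qual is provably unsatisfiable.
 */
static int
truncate_array_for_upper_bound(const int *elems, int nelems, int bound, bool strict)
{
    int lo = 0, hi = nelems;

    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;
        bool satisfies = strict ? (elems[mid] < bound) : (elems[mid] <= bound);

        if (satisfies)
            lo = mid + 1;
        else
            hi = mid;
    }

    return lo;      /* length of the surviving prefix */
}

int
main(void)
{
    int elems[] = {1, 99, 182, 183, 184};

    /* WHERE a IN (1, 99, 182, 183, 184) AND a < 183 keeps only {1, 99, 182} */
    printf("%d elements survive\n",
           truncate_array_for_upper_bound(elems, 5, 183, true));
    return 0;
}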
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Wed, Mar 6, 2024 at 4:51 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > To clarify, what I mean here is that merging the changes of both the > SAOPs changes and the removal of arrayKeyData seems to increase the > patch size and increases the maximum complexity of each component > patch's review. Removing arrayKeyData probably makes the patch very slightly smaller, actually. But even if it's really the other way around, I'd still like to get rid of it as part of the same commit as everything else. It just makes sense that way. > Separate patches may make this more reviewable, or not, but no comment > was given on why it is better to merge the changes into a single > patch. Fair enough. Here's why: The original SAOP design (commit 9e8da0f7, "Teach btree to handle ScalarArrayOpExpr quals natively") added a layer of indirection between scan->keyData (input scan keys) and so->keyData (output scan keys): it added another scan key array, so->arrayKeyData. There was array-specific preprocessing in _bt_preprocess_array_keys, that happened before the first primitive index scan even began -- that transformed our true input scan keys (scan->keyData) into a copy of the array with limited amounts of array-specific preprocessing already performed (so->arrayKeyData). This made a certain amount of sense at the time, because _bt_preprocess_keys was intended to be called once per primitive index scan. Kind of like the inner side of a nested loop join's inner index scan, where we call _bt_preprocess_keys once per inner-side scan/btrescan call. (Actually, Tom's design has us call _bt_preprocess once per primitive index scan per btrescan call, which might matter in those rare cases where the inner side of a nestloop join had SAOP quals.) What I now propose to do is to just call _bt_preprocess_keys once per btrescan (actually, it's still called once per primitive index scan, but all calls after the first are just no-ops after v12 of the patch). This makes sense because SAOP array constants aren't like nestloop joins with an inner index scans, in one important way: we really can see everything up-front. We can see all of the array elements, and operate on whole arrays as necessary during preprocessing (e.g., performing the array merging thing I added to _bt_preprocess_array_keys). It's not like the next array element is only visible to prepocessing only after the outer side of a nestloop join runs, and next calls btrescan -- so why treat it like that? Conceptually, "WHERE a = 1" is almost the same thing as "WHERE a = any('{1,2}')", so why not use essentially the same approach to preprocessing in both cases? We don't need to copy the input keys into so->arrayKeyData, because the indirection (which is a bit like a fake nested loop join) doesn't buy us anything. v13 of the patch doesn't quite 100% eliminate so->arrayKeyData. While it removes the arrayKeyData field from the BTScanOpaqueData struct, we still use a temporary array (accessed via a pointer that's just a local variable) that's also called arrayKeyData. And that also stores array-preprocessing-only copies of the input scan keys. That happens within _bt_preprocess_keys. So we do "still need arrayKeyData", but we only need it for a brief period at the start of the index scan. It just doesn't make any sense to keep it around for longer than that, in a world where _bt_preprocess_keys "operates directly on arrays". 
That only made sense (a bit of sense) back when _bt_preprocess_keys was subordinate to _bt_preprocess_array_keys, but it's kinda the other way around now. We could probably even get rid of this remaining limited form of arrayKeyData, but that doesn't seem like it would add much. -- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Matthias van de Meent
Date:
On Wed, 6 Mar 2024 at 22:46, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > > On Wed, 6 Mar 2024 at 01:50, Peter Geoghegan <pg@bowt.ie> wrote: > > > > Are you still interested in working directly on the preprocessing > > stuff? > > If you mean my proposed change to merging two equality AOPs, then yes. > I'll try to fit it in tomorrow with the rest of the review. I've attached v14, where 0001 is v13, 0002 is a patch with small changes + some large comment suggestions, and 0003 which contains sorted merge join code for _bt_merge_arrays. I'll try to work a bit on v13/14's _bt_preprocess_keys, and see what I can make of it. > > I have a feeling that I was slightly too optimistic about how > > likely we were to be able to get away with not having certain kinds of > > array preprocessing, back when I posted v12. It's true that the > > propensity of the patch to not recognize "partial > > redundancies/contradictions" hardly matters with redundant equalities, > > but inequalities are another matter. I'm slightly worried about cases > > like this one: > > > > select * from multi_test where a in (1,99, 182, 183, 184) and a < 183; > > > > Maybe we need to do better with that. What do you think? > > Let me come back to that when I'm done reviewing the final part of nbtutils. > > > Attached is v13. > > It looks like there are few issues remaining outside the changes in > nbtutils. I've reviewed the changes to those files, and ~half of > nbtutils (up to _bt_advance_array_keys_increment) now. I think I can > get the remainder done tomorrow. > +_bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir) [...] > + if (!(cur->sk_flags & SK_SEARCHARRAY) && > + cur->sk_strategy != BTEqualStrategyNumber) > + continue; This should use ||, not &&: if it's not an array, or not an equality array key, it's not using an array key slot and we're not interested. Note that _bt_verify_arrays_bt_first does have the right condition already. > + if (readpagetup || result != 0) > + { > + Assert(result != 0); > + return false; > + } I'm confused about this. By asserting !readpagetup after this exit, we could save a branch condition for the !readpagetup result != 0 path. Can't we better assert the inverse just below, or is this specifically for defense-in-depth against bug? E.g. + if (result != 0) + return false; + + Assert(!readpagetup); > + /* > + * By here we have established that the scan's required arrays were > + * advanced, and that they haven't become exhausted. > + */ > + Assert(arrays_advanced || !arrays_exhausted); Should use &&, based on the comment. > + * We generally permit primitive index scans to continue onto the next > + * sibling page when the page's finaltup satisfies all required scan keys > + * at the point where we're between pages. This should probably describe that we permit primitive scans with array keys to continue until we get to the sibling page, rather than this rather obvious and generic statement that would cover even the index scan for id > 0 ORDER BY id asc; or this paragraph can be removed. + if (!all_required_satisfied && pstate->finaltup && + _bt_tuple_before_array_skeys(scan, dir, pstate->finaltup, false, 0, + &so->scanBehind)) + goto new_prim_scan; > +_bt_verify_arrays_bt_first(IndexScanDesc scan, ScanDirection dir) [...] 
> + if (((cur->sk_flags & SK_BT_REQFWD) && ScanDirectionIsForward(dir)) || > + ((cur->sk_flags & SK_BT_REQBKWD) && ScanDirectionIsBackward(dir))) > + continue; I think a simple check if any SK_BT_REQ flag is set should be OK here: The key must be an equality key, and those must be required either in both directions, or in neither direction. ----- Further notes: I have yet to fully grasp what so->scanBehind is supposed to mean. "/* Scan might be behind arrays? */" doesn't give me enough hints here. I find it weird that we call _bt_advance_array_keys for non-required sktrig. Shouldn't that be as easy as doing a binary search through the array? Why does this need to hit the difficult path? Kind regards, Matthias van de Meent
Attachment
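For reference, the essence of the _bt_merge_arrays idea discussed here can be sketched as a single pass over two sorted inputs. This is only an illustration with plain ints; the real function compares Datums using the appropriate ORDER proc. Two equality arrays on the same column reduce to their intersection, and an empty intersection proves the qual unsatisfiable.

#include <stdio.h>

/*
 * Intersect two sorted, duplicate-free arrays.  Returns the number of
 * elements written to out; 0 means no value can satisfy both quals.
 */
static int
merge_arrays_sketch(const int *a, int na, const int *b, int nb, int *out)
{
    int i = 0, j = 0, n = 0;

    while (i < na && j < nb)
    {
        if (a[i] < b[j])
            i++;
        else if (a[i] > b[j])
            j++;
        else
        {
            out[n++] = a[i];    /* element present in both arrays */
            i++;
            j++;
        }
    }
    return n;
}

int
main(void)
{
    /* a = ANY('{1,2,3,4,5}') AND a = ANY('{2,3,4,5,6}') -> a = ANY('{2,3,4,5}') */
    int a[] = {1, 2, 3, 4, 5};
    int b[] = {2, 3, 4, 5, 6};
    int out[5];

    printf("%d common elements\n", merge_arrays_sketch(a, 5, b, 5, out));
    return 0;
}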
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Fri, Mar 8, 2024 at 9:00 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > I've attached v14, where 0001 is v13, 0002 is a patch with small > changes + some large comment suggestions, and 0003 which contains > sorted merge join code for _bt_merge_arrays. I have accepted your changes from 0003. Agree that it's better that way. It's at least a little faster, but not meaningfully more complicated. This is part of my next revision, v15, which I've attached (along with a test case that you might find useful, explained below). > I'll try to work a bit on v13/14's _bt_preprocess_keys, and see what I > can make of it. That's been the big focus of this new v15, which now goes all out with teaching _bt_preprocess_keys with how to deal with arrays. We now do comprehensive normalization of even very complicated combinations of arrays and non-array scan keys in this version. For example, consider this query: select * from multi_test where a = any('{1, 2, 3, 4, 5}'::int[]) and a > 2::bigint and a = any('{2, 3, 4, 5, 6}'::bigint[]) and a < 6::smallint and a = any('{2, 3, 4, 5, 6}'::smallint[]) and a < 4::int; This has a total of 6 input scankeys -- 3 of which are arrays. But by the time v15's _bt_preprocess_keys is done with it, it'll have only 1 scan key -- which doesn't even have an array (not anymore). And so we won't even need to advance the array keys one single time -- there'll simply be no array left to advance. In other words, it'll be just as if the query was written this way from the start: select * from multi_test where a = 3::int; (Though of course the original query will spend more cycles on preprocessing, compared to this manually simplified variant.) In general, preprocessing can now simplify queries like this to the maximum extent possible (without bringing the optimizer into it), no matter how much crud like this is added -- even including adversarial cases, with massive arrays that have some amount of redundancy/contradictoriness to them. It turned out to not be terribly difficult to teach _bt_preprocess_keys everything it could possibly need to know about arrays, so that it can operate on them directly, as a variant of the standard equality strategy (we do still need _bt_preprocess_array_keys for basic preprocessing of arrays, mostly just merging them). This is better overall (in that it gets every little subtlety right), but it also simplified a number of related issues. For example, there is no longer any need to maintain a mapping between so->keyData[]-wise scan keys (output scan keys), and scan->keyData[]-wise scan keys (input scan keys). We can just add a step to fix-up the references to the end of _bt_preprocess_keys, to make life easier within _bt_advance_array_keys. This preprocessing work should all be happening during planning, not during query execution -- that's the design that makes the most sense. This is something we've discussed in the past in the context of skip scan (see my original email to this thread for the reference). It would be especially useful for the very fancy kinds of preprocessing that are described by the MDAM paper, like using an index scan for a NOT IN() list/array (this can actually make sense with a low cardinality index). The structure for preprocessing that I'm working towards (especially in v15) sets the groundwork for making those shifts in the planner, because we'll no longer treat each array constant as its own primitive index scan during preprocessing. 
Right now, on HEAD, preprocessing with arrays kinda treats each array constant like the parameter of an imaginary inner index scan, from an imaginary nested loop join. But the planner really needs to operate on the whole qual, all at once, including any arrays. An actual nestloop join's inner index scan naturally cannot do that, and so might still require runtime/dynamic preprocessing in a world where that mostly happens in the planner -- but that clearly not appropriate for arrays ("WHERE a = 5" and "WHERE a in(4, 5)" are almost the same thing, and so should be handled in almost the same way by preprocessing). > > +_bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir) > [...] > > + if (!(cur->sk_flags & SK_SEARCHARRAY) && > > + cur->sk_strategy != BTEqualStrategyNumber) > > + continue; > > This should use ||, not && Fixed. Yeah, that was a bug. > > + if (readpagetup || result != 0) > > + { > > + Assert(result != 0); > > + return false; > > + } > > I'm confused about this. By asserting !readpagetup after this exit, we > could save a branch condition for the !readpagetup result != 0 path. > Can't we better assert the inverse just below, or is this specifically > for defense-in-depth against bug? E.g. > > + if (result != 0) > + return false; > + > + Assert(!readpagetup); Yeah, that's what it was -- defensively. It seems slightly better as-is, because you'll get an assertion failure if a "readpagetup" caller gets "result == 0". That's never supposed to happen (if it did happen then our ORDER proc won't be in agreement with what our = operator indicated about the same tuple attribute value moments earlier, inside _bt_check_compare). > > + /* > > + * By here we have established that the scan's required arrays were > > + * advanced, and that they haven't become exhausted. > > + */ > > + Assert(arrays_advanced || !arrays_exhausted); > > Should use &&, based on the comment. Fixed by getting rid of the arrays_exhausted variable, which wasn't adding much anyway. > > + * We generally permit primitive index scans to continue onto the next > > + * sibling page when the page's finaltup satisfies all required scan keys > > + * at the point where we're between pages. > > This should probably describe that we permit primitive scans with > array keys to continue until we get to the sibling page, rather than > this rather obvious and generic statement that would cover even the > index scan for id > 0 ORDER BY id asc; or this paragraph can be > removed. It's not quite obvious. The scan's array keys change as the scan makes progress, up to once per tuple read. But even if it really was obvious, it's really supposed to frame the later discussion -- discussion of those less common cases where this isn't what happens. The exceptions. These exceptions are: 1. When a required scan key is deemed "satisfied" only because its value was truncated in a high key finaltup. (Technically this is what _bt_checkkeys has always done, but we're much more sensitive to this stuff now, because we won't necessarily get to make another choice about starting a new primitive index scan for a long time.) 2. When we apply the has_required_opposite_direction_only stuff, and decide to start a new primitive index scan, even though technically all of our required-in-this-direction scan keys are still satisfied. Separately, there is also the potential for undesirable interactions between 1 and 2, which is why we don't let them mix. 
(We have the "if (so->scanBehind && has_required_opposite_direction_only) goto new_prim_scan" gating condition.) > Further notes: > > I have yet to fully grasp what so->scanBehind is supposed to mean. "/* > Scan might be behind arrays? */" doesn't give me enough hints here. Yes, it is complicated. The best explanation is the one I've added to _bt_readpage, next to the precheck. But that does need more work. Note that the so->scanBehind thing solves two distinct problems for the patch (related, but still clearly distinct). These problem are: 1. It makes the existing precheck/continuescan optimization in _bt_readpage safe -- we'd sometimes get wrong answers to queries if we didn't limit application of the optimization to !so->scanBehind cases. I have a test case that proves that this is true -- the one I mentioned in my introduction. I'm attaching that as precheck_testcase.sql now. It might help you to understand so->scanBehind, particularly this point 1 about the basic correctness of the precheck thing (possibly point 2 also). 2. It is used to speculatively visit the next leaf page in corner cases where truncated -inf attributes from the high key are deemed "satisfied". Generally speaking, we don't want to visit the next leaf page unless we're already 100% sure that it might have matches (if any page has matches for our current array keys at all, that is). But in a certain sense we're only guessing. It isn't guaranteed (and fundamentally can't be guaranteed) to work out once on the next sibling page. We've merely assumed that our array keys satisfy truncated -inf columns, without really knowing what it is that we'll find on the next page when we actually visit it at that point (we're not going to see -inf in the non-pivot tuples on the next page, we'll see some real/non-sentinel low-sorting value). We have good practical reasons to not want to treat them as non-matching (though that would be correct), and to take this limited gamble (we can discuss that some more if you'd like). Once you accept that we have to do this, it follows that we need to be prepared to: A. Notice that we're really just guessing in the sense I've described (before leaving for the next sibling leaf page), by setting so->scanBehind. We'll remember that it happened. and: B. Notice that that limited gamble didn't pay off once on the next/sibling leaf page, so that we can cut our losses and start a new primitive index scan at that point. We do this by checking so->scanBehind (along with the sibling page's high key), once on the sibling page. (We don't need explicit handling for the case when it works out, as it almost always will.) If you want to include discussion of problem 1 here too (not just problem 2), then I should add a third thing that we need to notice: C. Notice that we can't do the precheck thing once in _bt_readpage, because it'd be wrong to allow it. > I find it weird that we call _bt_advance_array_keys for non-required > sktrig. Shouldn't that be as easy as doing a binary search through the > array? Why does this need to hit the difficult path? What difficult path? Advancing non-required arrays isn't that different to advancing required ones. We will never advance required arrays when called just to advance a non-required one, obviously. But we can do the opposite. In fact, we absolutely have to advance non-required arrays (as best we can) when advancing a required one (or when the call was triggered by a non-array required scan key). That said, it could have been clearer than it was in earlier versions. 
v15 makes the difference between the non-required scan key trigger and required scan key trigger cases clearer within _bt_advance_array_keys. -- Peter Geoghegan
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Matthias van de Meent
Date:
On Sat, 16 Mar 2024 at 01:12, Peter Geoghegan <pg@bowt.ie> wrote: > > On Fri, Mar 8, 2024 at 9:00 AM Matthias van de Meent > <boekewurm+postgres@gmail.com> wrote: > > I've attached v14, where 0001 is v13, 0002 is a patch with small > > changes + some large comment suggestions, and 0003 which contains > > sorted merge join code for _bt_merge_arrays. > > I have accepted your changes from 0003. Agree that it's better that > way. It's at least a little faster, but not meaningfully more > complicated. Thanks. > > I'll try to work a bit on v13/14's _bt_preprocess_keys, and see what I > > can make of it. > > That's been the big focus of this new v15, which now goes all out with > teaching _bt_preprocess_keys with how to deal with arrays. We now do > comprehensive normalization of even very complicated combinations of > arrays and non-array scan keys in this version. I was thinking about a more unified processing model, where _bt_preprocess_keys would iterate over all keys, including processing of array keys (by merging and reduction to normal keys) if and when found. This would also reduce the effort expended when there are contradictory scan keys, as array preprocessing is relatively more expensive than other scankeys and contradictions are then found before processing of later keys. As I wasn't very far into the work yet it seems I can reuse a lot of your work here. > For example, consider this query: [...] > This has a total of 6 input scankeys -- 3 of which are arrays. But by > the time v15's _bt_preprocess_keys is done with it, it'll have only 1 > scan key -- which doesn't even have an array (not anymore). And so we > won't even need to advance the array keys one single time -- there'll > simply be no array left to advance. In other words, it'll be just as > if the query was written this way from the start: > > select * > from multi_test > where > a = 3::int; > > (Though of course the original query will spend more cycles on > preprocessing, compared to this manually simplified variant.) That's a good improvement, much closer to an optimal access pattern. > It turned out to not be terribly difficult to teach > _bt_preprocess_keys everything it could possibly need to know about > arrays, so that it can operate on them directly, as a variant of the > standard equality strategy (we do still need _bt_preprocess_array_keys > for basic preprocessing of arrays, mostly just merging them). This is > better overall (in that it gets every little subtlety right), but it > also simplified a number of related issues. For example, there is no > longer any need to maintain a mapping between so->keyData[]-wise scan > keys (output scan keys), and scan->keyData[]-wise scan keys (input > scan keys). We can just add a step to fix-up the references to the end > of _bt_preprocess_keys, to make life easier within > _bt_advance_array_keys. > > This preprocessing work should all be happening during planning, not > during query execution -- that's the design that makes the most sense. > This is something we've discussed in the past in the context of skip > scan (see my original email to this thread for the reference). Yes, but IIRC we also agreed that it's impossible to do this fully in planning, amongst others due to joins on array fields. > It > would be especially useful for the very fancy kinds of preprocessing > that are described by the MDAM paper, like using an index scan for a > NOT IN() list/array (this can actually make sense with a low > cardinality index). Yes, indexes such as those on enums. 
Though, in those cases the NOT IN () could be transformed into IN()-lists by the planner, but not the index. > The structure for preprocessing that I'm working towards (especially > in v15) sets the groundwork for making those shifts in the planner, > because we'll no longer treat each array constant as its own primitive > index scan during preprocessing. I hope that's going to be a fully separate patch. I don't think I can handle much more complexity in this one. > Right now, on HEAD, preprocessing > with arrays kinda treats each array constant like the parameter of an > imaginary inner index scan, from an imaginary nested loop join. But > the planner really needs to operate on the whole qual, all at once, > including any arrays. An actual nestloop join's inner index scan > naturally cannot do that, and so might still require runtime/dynamic > preprocessing in a world where that mostly happens in the planner -- > but that clearly not appropriate for arrays ("WHERE a = 5" and "WHERE > a in(4, 5)" are almost the same thing, and so should be handled in > almost the same way by preprocessing). Yeah, if the planner could handle some of this that'd be great. At the same time, I think that this might need to be gated behind a guc for more expensive planner-time deductions. > > > + * We generally permit primitive index scans to continue onto the next > > > + * sibling page when the page's finaltup satisfies all required scan keys > > > + * at the point where we're between pages. > > > > This should probably describe that we permit primitive scans with > > array keys to continue until we get to the sibling page, rather than > > this rather obvious and generic statement that would cover even the > > index scan for id > 0 ORDER BY id asc; or this paragraph can be > > removed. > > It's not quite obvious. [...] > Separately, there is also the potential for undesirable interactions > between 1 and 2, which is why we don't let them mix. (We have the "if > (so->scanBehind && has_required_opposite_direction_only) goto > new_prim_scan" gating condition.) I see. > > Further notes: > > > > I have yet to fully grasp what so->scanBehind is supposed to mean. "/* > > Scan might be behind arrays? */" doesn't give me enough hints here. > > Yes, it is complicated. The best explanation is the one I've added to > _bt_readpage, next to the precheck. But that does need more work. Yeah. The _bt_readpage comment doesn't actually contain the search term scanBehind, so I wasn't expecting that to be documented there. > > I find it weird that we call _bt_advance_array_keys for non-required > > sktrig. Shouldn't that be as easy as doing a binary search through the > > array? Why does this need to hit the difficult path? > > What difficult path? "Expensive" would probably have been a better wording: we do a comparative lot of processing in the !_bt_check_compare() + !continuescan path; much more than the binary searches you'd need for non-required array key checks. > Advancing non-required arrays isn't that different to advancing > required ones. We will never advance required arrays when called just > to advance a non-required one, obviously. But we can do the opposite. > In fact, we absolutely have to advance non-required arrays (as best we > can) when advancing a required one (or when the call was triggered by > a non-array required scan key). I think it's a lot more expensive to do the non-required array key increment for non-required triggers. 
What are we protecting against (or improving) by always doing advance_array_keys on non-required trigger keys? I mean that we should just do the non-required array key binary search inside _bt_check_compare for non-required array keys, as that would skip a lot of the rather expensive other array key infrastructure, and only if we're outside the minimum or maximum bounds of the non-required scankeys should we trigger advance_array_keys (unless scan direction changed). A full review of the updated patch will follow soon. Kind regards, Matthias van de Meent Neon (https://neon.tech)
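The binary-search-in-_bt_check_compare idea suggested above can be sketched as a plain membership filter (simplified and self-contained, not proposed code; it assumes a sorted, duplicate-free array with at least one element): the tuple value is first compared against the array's bounds, and only a value that falls outside them would need to involve the array advancement machinery at all; anything inside the bounds is settled by a binary search and, on a mismatch, simply filtered out.

#include <stdio.h>

typedef enum { FILTER_MATCH, FILTER_NO_MATCH, FILTER_OUT_OF_BOUNDS } filter_result;

/* Membership test against a sorted, duplicate-free non-required array. */
static filter_result
check_nonrequired_array(const int *keys, int nkeys, int tupval)
{
    int lo = 0, hi = nkeys - 1;

    if (tupval < keys[0] || tupval > keys[nkeys - 1])
        return FILTER_OUT_OF_BOUNDS;    /* only here consider advancing */

    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;

        if (keys[mid] == tupval)
            return FILTER_MATCH;
        if (keys[mid] < tupval)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return FILTER_NO_MATCH;             /* inside bounds, just filter it out */
}

int
main(void)
{
    int keys[] = {10, 20, 30};

    printf("%d %d %d\n",
           (int) check_nonrequired_array(keys, 3, 20),   /* FILTER_MATCH */
           (int) check_nonrequired_array(keys, 3, 25),   /* FILTER_NO_MATCH */
           (int) check_nonrequired_array(keys, 3, 40));  /* FILTER_OUT_OF_BOUNDS */
    return 0;
}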
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Matthias van de Meent
Date:
On Sat, 16 Mar 2024 at 01:12, Peter Geoghegan <pg@bowt.ie> wrote: > > On Fri, Mar 8, 2024 at 9:00 AM Matthias van de Meent > <boekewurm+postgres@gmail.com> wrote: > > I've attached v14, where 0001 is v13, 0002 is a patch with small > > changes + some large comment suggestions, and 0003 which contains > > sorted merge join code for _bt_merge_arrays. > > This is part of my next revision, v15, which I've attached (along with > a test case that you might find useful, explained below). > > v15 makes the difference between the non-required scan key trigger and > required scan key trigger cases clearer within _bt_advance_array_keys. OK, here's a small additional review, with a suggestion for additional changes to _bt_preprocess: > @@ -1117,6 +3160,46 @@ _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op, > /* > * The opfamily we need to worry about is identified by the index column. > */ > Assert(leftarg->sk_attno == rightarg->sk_attno); > > + /* > + * If either leftarg or rightarg are equality-type array scankeys, we need > + * specialized handling (since by now we know that IS NULL wasn't used) > + */ > [...] > + } > + > opcintype = rel->rd_opcintype[leftarg->sk_attno - 1]; Here, you insert your code between the comment about which opfamily to choose and the code assigning the opfamily. I think this needs some cleanup. > + * Don't call _bt_preprocess_array_keys_final in this fast path > + * (we'll miss out on the single value array transformation, but > + * that's not nearly as important when there's only one scan key) Why is it OK to ignore it? Or, why don't we apply it here? --- Attached 2 patches for further optimization of the _bt_preprocess_keys path (on top of your v15), according to the following idea: Right now, we do "expensive" processing with xform merging for all keys when we have more than 1 keys in the scan. However, we only do per-attribute merging of these keys, so if there is only one key for any attribute, the many cycles spent in that loop are mostly wasted. By checking for single-scankey attributes, we can short-track many multi-column index scans because they usually have only a single scan key per attribute. The first implements that idea, the second reduces the scope of various variables so as to improve compiler optimizability. I'll try to work a bit more on merging the _bt_preprocess steps into a single main iterator, but this is about as far as I got with clean patches. Merging the steps for array preprocessing with per-key processing and post-processing is proving a bit more complex than I'd anticipated, so I don't think I'll be able to finish that before the feature freeze, especially with other things that keep distracting me. Matthias van de Meent
Attachment
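The fast-path idea in the first attached patch can be sketched roughly as follows (the names and structure are illustrative, not the patch's code): because input scan keys arrive grouped by attribute number, a run of length one for an attribute means there is nothing to merge for that attribute, so the expensive xform machinery only needs to run for attributes constrained by two or more keys.

#include <stdio.h>

typedef struct SketchScanKey
{
    int sk_attno;       /* index attribute number; keys grouped by attno */
} SketchScanKey;

/* Length of the run of consecutive keys on the same attribute, starting at 'start'. */
static int
keys_for_attribute(const SketchScanKey *keys, int nkeys, int start)
{
    int end = start;

    while (end < nkeys && keys[end].sk_attno == keys[start].sk_attno)
        end++;
    return end - start;
}

int
main(void)
{
    /* a = 1 AND b > 2 AND b < 10 AND c = 3 */
    SketchScanKey keys[] = {{1}, {2}, {2}, {3}};
    int nkeys = 4;

    for (int i = 0; i < nkeys;)
    {
        int n = keys_for_attribute(keys, nkeys, i);

        if (n == 1)
            printf("attno %d: single key, copy it through unchanged\n",
                   keys[i].sk_attno);
        else
            printf("attno %d: %d keys, run the full merging logic\n",
                   keys[i].sk_attno, n);
        i += n;
    }
    return 0;
}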
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Thu, Mar 7, 2024 at 10:42 AM Benoit Tigeot <benoit.tigeot@gmail.com> wrote: > I am not up to date with the last version of the patch, but I did a regular benchmark with version 11 and the typical issue we have at the moment, and the results are still very very good. [1] Thanks for providing the test case. It was definitely important back when the ideas behind the patch had not yet fully developed. It helped me to realize that my thinking around non-required arrays (meaning arrays that cannot reposition the scan, and just filter out non-matching tuples) was still sloppy. > In terms of performance improvement, the last proposals could be a real game changer for 2 of our biggest databases. We hope that Postgres 17 will contain those improvements. Current plan is to commit this patch in the next couple of weeks, ahead of Postgres 17 feature freeze. -- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Mon, Mar 18, 2024 at 9:25 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> I was thinking about a more unified processing model, where _bt_preprocess_keys would iterate over all keys, including processing of array keys (by merging and reduction to normal keys) if and when found. This would also reduce the effort expended when there are contradictory scan keys, as array preprocessing is relatively more expensive than other scankeys and contradictions are then found before processing of later keys.

Does it really matter? I just can't get excited about maybe getting one less syscache lookup in queries that involve *obviously* contradictory keys at the SQL level. Especially because we're so much better off with the new design here anyway; calling _bt_preprocess_keys once rather than once per distinct set of array keys is an enormous improvement, all on its own.

My concern with preprocessing overhead is almost entirely limited to pathological performance issues involving extreme (even adversarial) input scan keys/arrays. I feel that if we're going to guarantee that we won't choke in _bt_checkkeys, even given billions of distinct sets of array keys, we should make the same guarantee in _bt_preprocess_keys -- just to avoid POLA violations. But that's the only thing that seems particularly important in the general area of preprocessing performance.

(Preprocessing performance probably does matter quite a bit in more routine cases, where </<= and >/>= are mixed together on the same attribute. But there'll be no new _bt_preprocess_keys operator/function lookups for stuff like that.)

The advantage of not completely merging _bt_preprocess_array_keys with _bt_preprocess_keys is that preserving this much of the existing code structure allows us to still decide how many array keys we'll need for the scan up front (if the scan doesn't end up having an unsatisfiable qual). _bt_preprocess_array_keys will eliminate redundant arrays early, in practically all cases (the exception is when it has to deal with an incomplete opfamily lacking the appropriate cross-type ORDER proc), so we don't have to think about merging arrays after that point. Rather like how we don't have to worry about "WHERE a < any('{1, 2, 3}')" type inequalities after _bt_preprocess_array_keys does an initial round of array-specific preprocessing (_bt_preprocess_keys can deal with those in the same way as it will with standard inequalities).

> > This preprocessing work should all be happening during planning, not during query execution -- that's the design that makes the most sense. This is something we've discussed in the past in the context of skip scan (see my original email to this thread for the reference).
>
> Yes, but IIRC we also agreed that it's impossible to do this fully in planning, amongst others due to joins on array fields.

Even with a nested loop join's inner index scan, where the constants used for each btrescan are not really predictable in the planner, we can still do most preprocessing in the planner, at least most of the time. We could still easily do analysis that is capable of ruling out redundant or contradictory scan keys for any possible parameter value -- seen or unseen. I'd expect this to be the common case -- most of the time these inner index scans need only one simple = operator (maybe 2 = operators).

Obviously, that approach still requires that btrescan at least accept a new set of constants for each new inner index scan invocation.
But that's still much cheaper than calling _bt_preprocess_keys from scratch every time btrescan is called.

> > It would be especially useful for the very fancy kinds of preprocessing that are described by the MDAM paper, like using an index scan for a NOT IN() list/array (this can actually make sense with a low cardinality index).
>
> Yes, indexes such as those on enums. Though, in those cases the NOT IN () could be transformed into IN()-lists by the planner, but not the index.

I think that it would probably be built as just another kind of index path, like the ones we build for SAOPs. Anti-SAOPs? Just as with SAOPs, the disjunctive accesses wouldn't really be something that the planner would need too much direct understanding of (except during costing). I'd only expect the plan to use such an index scan when it wasn't too far off needing a full index scan anyway. Just like with skip scan, the distinction between these NOT IN() index scan paths and a simple full index scan path is supposed to be fuzzy (maybe they'll actually turn out to be full scans at runtime, since the number of primitive index scans is determined dynamically, based on the structure of the index at runtime).

> > The structure for preprocessing that I'm working towards (especially in v15) sets the groundwork for making those shifts in the planner, because we'll no longer treat each array constant as its own primitive index scan during preprocessing.
>
> I hope that's going to be a fully separate patch. I don't think I can handle much more complexity in this one.

Allowing the planner to hook into _bt_preprocess_keys is absolutely not in scope here. I was just making the point that treating array scan keys like just another type of scan key during preprocessing is going to help with that too.

> Yeah. The _bt_readpage comment doesn't actually contain the search term scanBehind, so I wasn't expecting that to be documented there.

Can you think of a better way of explaining it?

> > > I find it weird that we call _bt_advance_array_keys for non-required sktrig. Shouldn't that be as easy as doing a binary search through the array? Why does this need to hit the difficult path?
> >
> > What difficult path?
>
> "Expensive" would probably have been a better wording: we do a comparative lot of processing in the !_bt_check_compare() + !continuescan path; much more than the binary searches you'd need for non-required array key checks.

I've come around to your point of view on this -- at least to a degree. I'm now calling _bt_advance_array_keys from within _bt_check_compare for non-required array keys only. If nothing else, it's clearer this way because it makes it obvious that non-required arrays cannot end the (primitive) scan. There's no further need for the wart in _bt_check_compare's contract about "continuescan=false" being associated with a non-required scan key in this one special case.

> > Advancing non-required arrays isn't that different to advancing required ones. We will never advance required arrays when called just to advance a non-required one, obviously. But we can do the opposite. In fact, we absolutely have to advance non-required arrays (as best we can) when advancing a required one (or when the call was triggered by a non-array required scan key).
>
> I think it's a lot more expensive to do the non-required array key increment for non-required triggers. What are we protecting against (or improving) by always doing advance_array_keys on non-required trigger keys?

We give up on the first non-required array that fails to find an exact match, though, regardless of whether the scan was triggered by a required or a non-required scan key (it's equally useful in both types of array advancement, because they're not all that different). There were and are steps around starting a new primitive index scan at the end of _bt_advance_array_keys, that aren't required when array advancement was triggered by an unsatisfied non-required array scan key. But those steps were skipped given a non-required trigger scan key, anyway.

> I mean that we should just do the non-required array key binary search inside _bt_check_compare for non-required array keys, as that would skip a lot of the rather expensive other array key infrastructure, and only if we're outside the minimum or maximum bounds of the non-required scankeys should we trigger advance_array_keys (unless scan direction changed).

I've thought about optimizing non-required arrays along those lines. But it doesn't really seem to be necessary at all. If we were going to do it, then it ought to be done in a way that works independently of the trigger condition for array advancement (that is, it'd work for non-required arrays that have a required scan key trigger, too).

-- Peter Geoghegan
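As a concrete (if toy) illustration of the simplest kind of parameter-independent analysis mentioned earlier in this message, detecting that a pair of constant range bounds on the same column can never be satisfied needs nothing beyond comparing the two bounds, much as redundant or contradictory scan keys are compared pairwise. This is not planner or nbtree code; the names and integer keys are invented for the example, and it deliberately doesn't exploit the granularity of the key type:

#include <stdbool.h>
#include <stdio.h>

/*
 * True if "col >/>= lower AND col </<= upper" is provably empty from the
 * bounds alone.
 */
static bool
range_is_contradictory(int lower, bool lower_inclusive,
                       int upper, bool upper_inclusive)
{
    if (lower > upper)
        return true;
    if (lower == upper && (!lower_inclusive || !upper_inclusive))
        return true;
    return false;
}

int
main(void)
{
    /* WHERE aid > 5 AND aid < 3: contradictory for every possible aid */
    printf("%d\n", range_is_contradictory(5, false, 3, false));    /* 1 */
    /* WHERE aid >= 3 AND aid <= 5: satisfiable */
    printf("%d\n", range_is_contradictory(3, true, 5, true));      /* 0 */
    return 0;
}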
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Wed, Mar 20, 2024 at 3:26 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> OK, here's a small additional review, with a suggestion for additional changes to _bt_preprocess:

Attached is v16. This has some of the changes that you asked for. But my main focus has been on fixing regressions that were discovered during recent performance validation.

Up until now, my stress testing of the patch (not quite the same thing as benchmarking) more or less consisted of using pgbench to look at two different types of cases. These are:

1. Extreme cases where the patch obviously helps the most, at least insofar as navigating the index structure itself is concerned (note that avoiding heap accesses isn't in scope for the pgbench stress testing). These are all variants of pgbench's SELECT workload, with a huge IN() list containing contiguous integers (we can get maybe a 5.3x increase in TPS here), or integers that are spaced apart by maybe 20 - 50 tuples (where we can get maybe a 20% - 30% improvement in TPS here).

2. The opposite extreme, where an IN() list has integers that are essentially random, and so are spaced very far apart on average. Another variant of pgbench's SELECT workload, with an IN() list. This is a case that the patch has hardly any chance of helping, so the goal here is to just avoid regressions. Since I've been paying attention to this for a long time, this wasn't really a problem for v15.

The regressions showed up in a third type of case, new to this round of testing:

3. Cases where matching tuples are spaced apart by 100 - 300 tuples/integer values. These cases are "somewhere between 1 and 2". Cases where I would expect to win by a smaller amount (say a 5% improvement in TPS), but v15 nevertheless lost by as much as 12% here. This was a problem that seemed to demand a fix.

The underlying reason why all earlier versions of the patch had these regressions came down to their use of a pure linear search (by which I mean having _bt_readpage call _bt_checkkeys again and again on the same page). That was just too slow here. v15 could roughly halve the number of index descents compared to master, as expected -- but that wasn't enough to make up for the overhead of having to plow through hundreds of non-matching tuples on every leaf page. v15 couldn't even break even.

v16 of the patch fixes the problem by adding a limited additional form of "skipping", used *within* a page, as it is scanned by _bt_readpage. (As opposed to skipping over the index by redescending anew, starting from the root page.) In other words, v16 teaches the linear search process to occasionally "look ahead" when it seems like the linear search process has required too many comparisons at that point (comparisons by _bt_tuple_before_array_skeys made from within _bt_checkkeys). You can think of this new "look ahead" mechanism as bridging the gap between case 1 and case 2 (fixing case 3, a hybrid of 1 and 2). We still mostly want to use a "linear search" from within _bt_readpage, but we do benefit from having this fallback when that just isn't working very well. Now even these new type 3 stress tests see an increase in TPS of perhaps 5% - 12%, depending on just how far apart you arrange for matching tuples to be spaced (the total size of the IN list is also relevant, but much less so).
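To make the shape of that "look ahead" idea concrete, here is a toy, self-contained sketch of the general technique. This is deliberately not the patch's actual look-ahead code: the sorted int array stands in for a leaf page, the needle for the scan's current array key, and the miss threshold of 3 and seed distance of 5 are invented for the example.

#include <stdio.h>

/*
 * Return the index of the first "page" item >= needle, occasionally jumping
 * ahead once the plain linear scan has wasted too many comparisons.
 */
static int
find_first_geq(const int *page, int nitems, int needle)
{
    int     pos = 0;
    int     misses = 0;         /* consecutive comparisons that failed */
    int     distance = 0;       /* current look-ahead distance */

    while (pos < nitems && page[pos] < needle)
    {
        if (++misses >= 3)      /* linear scan isn't working very well */
        {
            int     ahead;

            /* ramp the look-ahead distance up with each successful jump */
            distance = (distance == 0) ? 5 : distance * 2;
            ahead = pos + distance;

            if (ahead < nitems && page[ahead] < needle)
            {
                /* sorted order guarantees everything in between fails too */
                pos = ahead;
                continue;
            }

            /* overshot (or ran off the page): resume plain stepping */
            misses = 0;
            distance = 0;
        }
        pos++;
    }
    return pos;                 /* == nitems when nothing qualifies */
}

int
main(void)
{
    int     page[1000];

    for (int i = 0; i < 1000; i++)
        page[i] = i * 2;        /* 0, 2, 4, ..., 1998 */

    printf("%d\n", find_first_geq(page, 1000, 700));    /* prints 350 */
    printf("%d\n", find_first_geq(page, 1000, 1501));   /* prints 751 */
    return 0;
}

The items that the jump skips over never need to be compared individually, which is where the savings in case 3 come from.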
Separately, I added a new optimization to the binary search process that selects the next array element via a binary search of the array (actually, to the function _bt_binsrch_array_skey), which lowers the cost of advancing required arrays that trigger advancement: we only really need one comparison per _bt_binsrch_array_skey call in almost all cases. In practice we'll almost certainly find that the next array element in line is <= the unsatisfied tuple datum (we already know from context that the current/former array element must now be < that same tuple datum at that point, so the conditions under which this doesn't work out right away are narrow). This second optimization is much more general than the first one. It helps with pretty much any kind of adversarial pgbench stress test.

> Here, you insert your code between the comment about which opfamily to choose and the code assigning the opfamily. I think this needs some cleanup.

Fixed.

> > + * Don't call _bt_preprocess_array_keys_final in this fast path
> > + * (we'll miss out on the single value array transformation, but
> > + * that's not nearly as important when there's only one scan key)
>
> Why is it OK to ignore it? Or, why don't we apply it here?

It doesn't seem worth adding any cycles to that fast path, given that the array transformation in question can only help us to convert "= any('{1}')" into "= 1", because there are no other scan keys that can cut down the number of array constants first (through earlier array-specific preprocessing steps). Note that even this can't be said for converting "IN (1)" into "= 1", since the planner already does that for us, without nbtree ever seeing it directly.

> Attached 2 patches for further optimization of the _bt_preprocess_keys path (on top of your v15), according to the following idea:
>
> Right now, we do "expensive" processing with xform merging for all keys when we have more than 1 keys in the scan. However, we only do per-attribute merging of these keys, so if there is only one key for any attribute, the many cycles spent in that loop are mostly wasted.

Why do you say that? I agree that calling _bt_compare_scankey_args() more than necessary might matter, but that doesn't come up when there's only one scan key per attribute. We'll only copy each scan key into the strategy-specific slot for the attribute's temp xform[] array. Actually, it doesn't necessarily come up when there's more than one scan key per index attribute, depending on the strategies in use. That seems like an existing problem on HEAD; I think we should be doing this more. We could stand to do better at preprocessing when inequalities are mixed together. Right now, we can't detect contradictory quals, given a case such as "SELECT * FROM pgbench_accounts WHERE aid > 5 AND aid < 3". It'll only work for something like "WHERE aid = 5 AND aid < 3", so we'll still uselessly descend the index once (not a disaster, but not ideal either).

Honestly, I don't follow what the goal is here. Does it really help to restructure the loop like this? Feels like I'm missing something. We do this when there's only one scan key, sort of, but that's probably not really that important anyway. Maybe it helps a bit for nestloop joins with an inner index scan, where one array scan key is common.

-- Peter Geoghegan
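The "one comparison in almost all cases" point from the start of this message can be illustrated with a similarly self-contained sketch. This is not the real _bt_binsrch_array_skey, whose interface and comparison rules differ; the assumption here is simply that probing the next element in line usually settles the search before a full binary search is needed.

#include <stdio.h>

/*
 * Find the first array element past "cur" that is >= tupdatum, probing the
 * immediately next element before falling back to a binary search.
 */
static int
advance_array_key(const int *arr, int nelems, int cur, int tupdatum)
{
    int     lo = cur + 1;
    int     hi = nelems;        /* exclusive upper bound */

    if (lo < hi)
    {
        /* fast path: one comparison settles it in the common case */
        if (arr[lo] >= tupdatum)
            return lo;
        lo++;                   /* next element is also behind the tuple */
    }

    /* slow path: conventional binary search over whatever remains */
    while (lo < hi)
    {
        int     mid = lo + (hi - lo) / 2;

        if (arr[mid] < tupdatum)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;                  /* == nelems when the array is exhausted */
}

int
main(void)
{
    int     keys[] = {10, 20, 30, 40, 50};

    /* current element is keys[1] = 20; a tuple datum of 25 advances us to
     * keys[2] = 30 with a single comparison */
    printf("%d\n", advance_array_key(keys, 5, 1, 25));  /* prints 2 */
    printf("%d\n", advance_array_key(keys, 5, 1, 55));  /* prints 5 */
    return 0;
}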
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Thu, Mar 21, 2024 at 8:32 PM Peter Geoghegan <pg@bowt.ie> wrote: > Attached is v16. This has some of the changes that you asked for. But > my main focus has been on fixing regressions that were discovered > during recent performance validation. Attached is v17. This revision deals with problems with parallel index scans that use array keys. This patch is now *very* close to being committable. My only outstanding concern is the costing/selfuncs.c aspects of the patch, which I haven't thought about in many months. I'd estimate that I'm now less than a week away from pushing this patch. (Plus I need to decide which tests from my massive test suite should be included in the committed patch, as standard regression tests.) Back to v17. Earlier versions dealt with parallel index scans in a way that could lead to very unbalanced utilization of worker processes (at least with the wrong sort of query/index), which was clearly unacceptable. The underlying issue was the continued use of BTParallelScanDescData.btps_arrayKeyCount field in shared memory, which was still used to stop workers from moving on to the next primitive scan. That old approach made little sense with this new high level design. (I have been aware of the problem for months, but didn't get around to it until now.) I wasn't specifically trying to win any benchmarks here -- I really just wanted to make the design for parallel index scans with SAOPs into a coherent part of the wider design, without introducing any regressions. I applied my usual charter for the patch when working on v17, by more or less removing any nbtree.c parallel index scan code that had direct awareness of array keys as distinct primitive index scans. This fixed the problem of unbalanced worker utilization, as expected, but it also seems to have benefits particular to parallel index scans with array keys -- which was unexpected. We now use an approach that's almost identical to the approach taken by every other type of parallel scan. My new approach naturally reduces contention among backends that want to advance the scan. (Plus parallel index scans get all of the same benefits as serial index scans, of course.) v17 replaces _bt_parallel_advance_array_keys with a new function, called _bt_parallel_primscan_advance, which is now called from _bt_advance_array_keys. The new function doesn't need to increment btscan->btps_arrayKeyCount (nor does it increment a local copy in BTScanOpaqueData.arrayKeyCount). It is called at the specific point that array key advancement *tries* to start another primitive index scan. The difference (relative to the serial case) is that it might not succeed in doing so. It might not be possible to start another primitive index scan, since it might turn out that some other backend (one page ahead, or even one page behind) already did it for everybody. At that point it's easy for the backend that failed to start the next primitive scan to have _bt_advance_array_keys back out of everything. Then the backend just goes back to consuming pages by seizing the scan. This approach preserves the ability of parallel scans to have _bt_readpage release the next page before _bt_readpage even starts examining tuples from the current page. It's possible that 2 or 3 backends (each of which scans its own leaf pages from a group of adjacent pages) will "independently" decide that it's time to start another scan -- but at most one backend can be granted the right to do so. 
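The "at most one backend can be granted the right" rule can be pictured with a tiny stand-alone analogy. This is emphatically not the parallel btree scan code, which keeps its state in the parallel scan's shared memory and also has to handle backing out and later re-seizing the scan; a single atomic flag and C11 atomics are used here purely for illustration.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* stands in for state kept in the parallel scan's shared memory */
static atomic_bool next_primscan_scheduled;

/* returns true iff the calling "backend" won the right to schedule it */
static bool
try_schedule_next_primscan(void)
{
    bool    expected = false;

    /* only the first caller flips the flag; everyone else backs out */
    return atomic_compare_exchange_strong(&next_primscan_scheduled,
                                          &expected, true);
}

int
main(void)
{
    /* two "backends" independently decide it's time for the next scan */
    printf("backend 1: %s\n",
           try_schedule_next_primscan() ? "schedules it" : "backs out");
    printf("backend 2: %s\n",
           try_schedule_next_primscan() ? "schedules it" : "backs out");
    return 0;
}

A backend that loses this kind of race simply goes back to consuming pages, as described above.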
It's even possible that one such backend among several (all of which might be expected to try to start the next primitive scan in unison) will *not* want to do so -- it could know better than the rest. This can happen when the backend lands one page ahead of the rest, and it just so happens to see no reason to not just continue the scan on the leaf level for now. Such a backend can effectively veto the idea of starting another primitive index scan for the whole top-level parallel scan. This probably doesn't really matter much, in and of itself (skipping ahead by only one or two pages is suboptimal but not really a problem), but that's beside the point. The point is that having very flexible rules like this works out to be both simpler and better performing than a more rigid, pessimistic approach would be. In particular, keeping the time that the scan is seized by any one backend to an absolute minimum is important -- it may be better to detect and recover from contention than to try to prevent them altogether. Just to be clear, all of this only comes up during parts of the scan where primitive index scans actually look like a good idea in the first place. It's perfectly possible for the scan to process thousands of array keys, without any backend ever considering starting another primitive index scan. Much of the time, every worker process can independently advance their own private array keys, without that requiring any sort of special coordination (nothing beyond the coordination required by *all* parallel index scans, including those without any SAOP arrays). As I've said many times already, the high level charter of this project is to make scans that have arrays (whether serial or parallel) as similar as possible to any other type of scan. -- Peter Geoghegan
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Tue, Mar 26, 2024 at 2:01 PM Peter Geoghegan <pg@bowt.ie> wrote: > v17 replaces _bt_parallel_advance_array_keys with a new function, > called _bt_parallel_primscan_advance, which is now called from > _bt_advance_array_keys. The new function doesn't need to increment > btscan->btps_arrayKeyCount (nor does it increment a local copy in > BTScanOpaqueData.arrayKeyCount). It is called at the specific point > that array key advancement *tries* to start another primitive index > scan. The difference (relative to the serial case) is that it might > not succeed in doing so. It might not be possible to start another > primitive index scan, since it might turn out that some other backend > (one page ahead, or even one page behind) already did it for > everybody. There was a bug in v17 around how we deal with parallel index scans. Under the right circumstances, a parallel scan with array keys put all of its worker processes to sleep, without subsequently waking them up (so the scan would hang indefinitely). The underlying problem was that v17 assumed that the backend that scheduled the next primitive index scan would also be the one that actually performs that same primitive scan. While it is natural and desirable for the backend that schedules the primitive scan to be the one that actually calls _bt_search, v17 tacitly relied on that to successfully finish the scan. It's particularly important that the leader process always be able to stop participating as a worker immediately, whenever it needs to, and for as long as it needs to. Attached is v18, which adds an additional layer of indirection to fix the bug: worker processes now schedule primitive index scans explicitly. They store their current array keys in shared memory when primitive scans are scheduled (during a parallel scan), allowing other backends to pick things up where the scheduling backend left off. And so with v18, any other backend can be the one that performs the actual descent of the index within _bt_search. The design is essentially the same as before, though. We'd still prefer that it be the backend that scheduled the primitive index scan. Other backends might well be busy finishing off their scan of immediately preceding leaf pages. Separately, v18 also adds many new regression tests. This greatly improves the general situation around test coverage, which I'd put off before now. Now there is coverage for parallel scans with array keys, as well as coverage for the new array-specific preprocessing code. Note: v18 doesn't have any adjustments to the costing, as originally planned. I'll probably need to post a revised patch with improved (or at least polished) costing in the next few days, so that others will have the opportunity to comment before I commit the patch. -- Peter Geoghegan
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Mon, Apr 1, 2024 at 6:33 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Note: v18 doesn't have any adjustments to the costing, as originally planned. I'll probably need to post a revised patch with improved (or at least polished) costing in the next few days, so that others will have the opportunity to comment before I commit the patch.

Attached is v19, which dealt with remaining concerns I had about the costing in selfuncs.c. My current plan is to commit this on Saturday morning (US Eastern time).

I ultimately concluded that selfuncs.c shouldn't be doing anything to cap the estimated number of primitive index scans for an SAOP clause, unless doing so is all but certain to make the estimate more accurate. And so v19 makes the changes in selfuncs.c much closer to the approach used to cost SAOP clauses on master. In short, v19 is a lot more conservative in the changes made to selfuncs.c.

v19 does hold onto the idea of aggressively capping the estimate in extreme cases -- cases where the estimate begins to approach the total number of leaf pages in the index. This is clearly safe (it's literally impossible to have more index descents than there are leaf pages in the index), and probably necessary given that index paths with SAOP filter quals (on indexed attributes) will no longer be generated by the planner.

To recap, since nbtree more or less behaves as if scans with SAOPs are one continuous index scan under the new design from the patch, it was tempting to make code in genericcostestimate/btcostestimate treat it like that, too (the selfuncs.c changes from earlier versions tried to do that). As I said already, I've now given up on that approach in v19.

This does mean that genericcostestimate continues to apply the Mackert-Lohman formula for btree SAOP scans. This is a bit odd in a world where nbtree specifically promises to not repeat any leaf page accesses. I was a bit uneasy about that aspect for a while, but ultimately concluded that it wasn't very important in the grand scheme of things.

The problem with teaching code in genericcostestimate/btcostestimate to treat nbtree scans with SAOPs as one continuous index scan is that it hugely biases genericcostestimate's simplistic approach to estimating numIndexPages. genericcostestimate does this by simply prorating using numIndexTuples, index->pages, and index->tuples. That naive approach probably works alright with simple scalar operators, but it's not going to work with SAOPs directly. I can't think of a good way of estimating numIndexPages with SAOPs, and suspect that the required information just isn't available. It seems best to treat it as a possible area for improvement in a later release. The really important thing for Postgres 17 is that we never wildly overestimate the number of descents when multiple SAOPs are used, each with a medium or large array.

v19 also makes sure that genericcostestimate doesn't allow a non-boundary SAOP clause/non-required array scan key to affect num_sa_scans -- it'll just use the num_sa_scans values used by btcostestimate in v19. This makes sense because nbtree is guaranteed to not start new primitive scans for these sorts of scan keys (plus there's no need to calculate num_sa_scans twice, in two places, using two slightly different approaches).

-- Peter Geoghegan
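Just to make the shape of that cap explicit, here is a minimal sketch (not the committed selfuncs.c code; the names are made up) of the invariant being relied on, namely that the scan can never descend the tree more times than there are leaf pages:

/* Illustrative only: clamp an estimated number of primitive index scans */
static double
clamp_primscan_estimate(double est_descents, double est_leaf_pages)
{
    if (est_descents > est_leaf_pages)
        return est_leaf_pages;
    return est_descents;
}

Everything else about the num_sa_scans estimate stays close to the approach already used on master.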
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Alexander Lakhin
Date:
Hello Peter,

03.04.2024 22:53, Peter Geoghegan wrote:
> On Mon, Apr 1, 2024 at 6:33 PM Peter Geoghegan <pg@bowt.ie> wrote:
>> Note: v18 doesn't have any adjustments to the costing, as originally planned. I'll probably need to post a revised patch with improved (or at least polished) costing in the next few days, so that others will have the opportunity to comment before I commit the patch.
> Attached is v19, which dealt with remaining concerns I had about the costing in selfuncs.c. My current plan is to commit this on Saturday morning (US Eastern time).

Please look at an assertion failure (reproduced starting from 5bf748b86), triggered by the following query:
CREATE TABLE t (a int, b int);
CREATE INDEX t_idx ON t (a, b);
INSERT INTO t (a, b) SELECT g, g FROM generate_series(0, 999) g;
ANALYZE t;
SELECT * FROM t WHERE a < ANY (ARRAY[1]) AND b < ANY (ARRAY[1]);

TRAP: failed Assert("so->numArrayKeys"), File: "nbtutils.c", Line: 560, PID: 3251267

Best regards,
Alexander
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Sun, Apr 7, 2024 at 1:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> Please look at an assertion failure (reproduced starting from 5bf748b86), triggered by the following query:
> CREATE TABLE t (a int, b int);
> CREATE INDEX t_idx ON t (a, b);
> INSERT INTO t (a, b) SELECT g, g FROM generate_series(0, 999) g;
> ANALYZE t;
> SELECT * FROM t WHERE a < ANY (ARRAY[1]) AND b < ANY (ARRAY[1]);
>
> TRAP: failed Assert("so->numArrayKeys"), File: "nbtutils.c", Line: 560, PID: 3251267

I immediately see what's up here. Will fix this in the next short while. There is no bug here in builds without assertions, but it's probably best to keep the assertion, and to just make sure not to call _bt_preprocess_array_keys_final() unless it has real work to do.

The assertion failure demonstrates that _bt_preprocess_array_keys_final can currently be called when there can be no real work for it to do. The problem is that we condition the call to _bt_preprocess_array_keys_final() on whether or not we had to do "real work" back when _bt_preprocess_array_keys() was called, at the start of _bt_preprocess_keys(). That usually leaves us with "so->numArrayKeys > 0", because the "real work" typically includes equality type array keys. But not here -- here you have two SAOP inequalities, which become simple scalar scan keys through early array-specific preprocessing in _bt_preprocess_array_keys(). There are no arrays left at this point, so "so->numArrayKeys == 0".

FWIW I missed this because the tests only cover cases with one SAOP inequality, which will always return early from _bt_preprocess_keys (by taking its generic single scan key fast path). Your test case has 2 scan keys, avoiding the fast path, allowing us to reach _bt_preprocess_array_keys_final().

-- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Tom Lane
Date:
Coverity pointed out something that looks like a potentially live problem in 5bf748b86:

/srv/coverity/git/pgsql-git/postgresql/src/backend/access/nbtree/nbtutils.c: 2950 in _bt_preprocess_keys()
2944                  * need to make sure that we don't throw away an array
2945                  * scan key. _bt_compare_scankey_args expects us to
2946                  * always keep arrays (and discard non-arrays).
2947                  */
2948                 Assert(j == (BTEqualStrategyNumber - 1));
2949                 Assert(xform[j].skey->sk_flags & SK_SEARCHARRAY);
>>>     CID 1596256: Null pointer dereferences (FORWARD_NULL)
>>>     Dereferencing null pointer "array".
2950                 Assert(xform[j].ikey == array->scan_key);
2951                 Assert(!(cur->sk_flags & SK_SEARCHARRAY));
2952             }
2953         }
2954         else if (j == (BTEqualStrategyNumber - 1))

Above this there is an assertion

    Assert(!array || array->num_elems > 0);

which certainly makes it look like array->scan_key could be a null-pointer dereference.

			regards, tom lane
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Sun, Apr 7, 2024 at 8:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Coverity pointed out something that looks like a potentially live > problem in 5bf748b86: > > /srv/coverity/git/pgsql-git/postgresql/src/backend/access/nbtree/nbtutils.c: 2950 in _bt_preprocess_keys() > 2944 * need to make sure that we don't throw away an array > 2945 * scan key. _bt_compare_scankey_args expects us to > 2946 * always keep arrays (and discard non-arrays). > 2947 */ > 2948 Assert(j == (BTEqualStrategyNumber - 1)); > 2949 Assert(xform[j].skey->sk_flags & SK_SEARCHARRAY); > >>> CID 1596256: Null pointer dereferences (FORWARD_NULL) > >>> Dereferencing null pointer "array". > 2950 Assert(xform[j].ikey == array->scan_key); > 2951 Assert(!(cur->sk_flags & SK_SEARCHARRAY)); > 2952 } > 2953 } > 2954 else if (j == (BTEqualStrategyNumber - 1)) > > Above this there is an assertion > > Assert(!array || array->num_elems > 0); > > which certainly makes it look like array->scan_key could be > a null-pointer dereference. But the "Assert(xform[j].ikey == array->scan_key)" assertion is located in a block where it's been established that the scan key (the one stored in xform[j] at this point in execution) must have an array. It has been marked SK_SEARCHARRAY, and uses the equality strategy, so it had better have one or we're in big trouble either way. This is probably very hard for tools like Coverity to understand. We also rely on the fact that only one of the two scan keys (only one of the pair of scan keys that were passed to _bt_compare_scankey_args) can have an array at the point of the assertion that Coverity finds suspicious. It's possible that both of those scan keys actually did have arrays, but _bt_compare_scankey_args just treats that as a case of being unable to prove which scan key was redundant/contradictory due to a lack of suitable cross-type support -- so the assertion won't be reached. Would Coverity stop complaining if I just removed the assertion? I could just do that, I suppose, but that seems backwards to me. -- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> On Sun, Apr 7, 2024 at 8:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Coverity pointed out something that looks like a potentially live problem in 5bf748b86:
>> ... which certainly makes it look like array->scan_key could be a null-pointer dereference.

> But the "Assert(xform[j].ikey == array->scan_key)" assertion is located in a block where it's been established that the scan key (the one stored in xform[j] at this point in execution) must have an array.

> This is probably very hard for tools like Coverity to understand.

It's not obvious to human readers either ...

> Would Coverity stop complaining if I just removed the assertion? I could just do that, I suppose, but that seems backwards to me.

Perhaps this'd help:

-        Assert(xform[j].ikey == array->scan_key);
+        Assert(array && xform[j].ikey == array->scan_key);

If that doesn't silence it, I'd be prepared to just dismiss the warning.

Some work in the comment to explain why we must have an array here wouldn't be out of place either, perhaps.

			regards, tom lane
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Sun, Apr 7, 2024 at 9:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Perhaps this'd help: > > - Assert(xform[j].ikey == array->scan_key); > + Assert(array && xform[j].ikey == array->scan_key); > > If that doesn't silence it, I'd be prepared to just dismiss the > warning. The assertions in question are arguably redundant. There are very similar assertions just a little earlier on, as we initially set up the array stuff (right before _bt_compare_scankey_args is called). I'll just remove the "Assert(xform[j].ikey == array->scan_key)" assertion that Coverity doesn't like, in addition to the "Assert(!array || array->scan_key == i)" assertion, on the grounds that they're redundant. > Some work in the comment to explain why we must have an array here > wouldn't be out of place either, perhaps. There is a comment block about this right above the assertion in question: /* * Both scan keys might have arrays, in which case we'll * arbitrarily pass only one of the arrays. That won't * matter, since _bt_compare_scankey_args is aware that two * SEARCHARRAY scan keys mean that _bt_preprocess_array_keys * failed to eliminate redundant arrays through array merging. * _bt_compare_scankey_args just returns false when it sees * this; it won't even try to examine either array. */ Do you think it needs more work? -- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> The assertions in question are arguably redundant. There are very similar assertions just a little earlier on, as we initially set up the array stuff (right before _bt_compare_scankey_args is called). I'll just remove the "Assert(xform[j].ikey == array->scan_key)" assertion that Coverity doesn't like, in addition to the "Assert(!array || array->scan_key == i)" assertion, on the grounds that they're redundant.

If you're doing that, then surely

    if (j != (BTEqualStrategyNumber - 1) ||
        !(xform[j].skey->sk_flags & SK_SEARCHARRAY))
    {
        ...
    }
    else
    {
        Assert(j == (BTEqualStrategyNumber - 1));
        Assert(xform[j].skey->sk_flags & SK_SEARCHARRAY);
        Assert(xform[j].ikey == array->scan_key);
        Assert(!(cur->sk_flags & SK_SEARCHARRAY));
    }

those first two Asserts are redundant with the "if" as well.

			regards, tom lane
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Sun, Apr 7, 2024 at 9:50 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > If you're doing that, then surely > > if (j != (BTEqualStrategyNumber - 1) || > !(xform[j].skey->sk_flags & SK_SEARCHARRAY)) > { > ... > } > else > { > Assert(j == (BTEqualStrategyNumber - 1)); > Assert(xform[j].skey->sk_flags & SK_SEARCHARRAY); > Assert(xform[j].ikey == array->scan_key); > Assert(!(cur->sk_flags & SK_SEARCHARRAY)); > } > > those first two Asserts are redundant with the "if" as well. I'll get rid of those other two assertions as well, then. -- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Sun, Apr 7, 2024 at 9:57 PM Peter Geoghegan <pg@bowt.ie> wrote: > On Sun, Apr 7, 2024 at 9:50 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > those first two Asserts are redundant with the "if" as well. > > I'll get rid of those other two assertions as well, then. Done that way. -- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Donghang Lin
Date:
Hi Peter
There seems to be an assertion failure with this change in HEAD
TRAP: failed Assert("leftarg->sk_attno == rightarg->sk_attno"), File: "../../src/backend/access/nbtree/nbtutils.c", Line: 3246, PID: 1434532
It can be reproduced by:
create table t(a int);
insert into t select 1 from generate_series(1,10);
create index on t (a desc);
set enable_seqscan = false;
select * from t where a IN (1,2) and a IN (1,2,3);
It's triggered when a scankey's strategy is set to invalid. For a descending ordered column, the strategy needs to get fixed to its commuted strategy, and that doesn't work if the strategy is invalid.
Attached a demo fix.
Regards,
Donghang Lin
(ServiceNow)
Attachment
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Thu, Apr 18, 2024 at 2:13 AM Donghang Lin <donghanglin@gmail.com> wrote: > It's triggered when a scankey's strategy is set to invalid. While for a descending ordered column, > the strategy needs to get fixed to its commute strategy. That doesn't work if the strategy is invalid. The problem is that _bt_fix_scankey_strategy shouldn't have been doing anything with already-eliminated array scan keys in the first place (whether or not they're on a DESC index column). I just pushed a fix along those lines. Thanks for the report! -- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Alexander Lakhin
Date:
Hello Peter,

07.04.2024 20:18, Peter Geoghegan wrote:
> On Sun, Apr 7, 2024 at 1:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
>> SELECT * FROM t WHERE a < ANY (ARRAY[1]) AND b < ANY (ARRAY[1]);
>>
>> TRAP: failed Assert("so->numArrayKeys"), File: "nbtutils.c", Line: 560, PID: 3251267
> I immediately see what's up here. Will fix this in the next short while. There is no bug here in builds without assertions, but it's probably best to keep the assertion, and to just make sure not to call _bt_preprocess_array_keys_final() unless it has real work to do.

Please look at another case, where a similar Assert (but this time in _bt_preprocess_keys()) is triggered:
CREATE TABLE t (a text, b text);
INSERT INTO t (a, b) SELECT 'a', repeat('b', 100) FROM generate_series(1, 500) g;
CREATE INDEX t_idx ON t USING btree(a);
BEGIN;
DECLARE c CURSOR FOR SELECT a FROM t WHERE a = 'a';
FETCH FROM c;
FETCH RELATIVE 0 FROM c;

TRAP: failed Assert("so->numArrayKeys"), File: "nbtutils.c", Line: 2582, PID: 1130962

Best regards,
Alexander
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Mon, Apr 22, 2024 at 4:00 AM Alexander Lakhin <exclusion@gmail.com> wrote: > Please look at another case, where a similar Assert (but this time in > _bt_preprocess_keys()) is triggered: > CREATE TABLE t (a text, b text); > INSERT INTO t (a, b) SELECT 'a', repeat('b', 100) FROM generate_series(1, 500) g; > CREATE INDEX t_idx ON t USING btree(a); > BEGIN; > DECLARE c CURSOR FOR SELECT a FROM t WHERE a = 'a'; > FETCH FROM c; > FETCH RELATIVE 0 FROM c; > > TRAP: failed Assert("so->numArrayKeys"), File: "nbtutils.c", Line: 2582, PID: 1130962 I'm pretty sure that I could fix this by simply removing the assertion. But I need to think about it a bit more before I push a fix. The test case you've provided proves that _bt_preprocess_keys's new no-op path isn't just used during scans that have array keys (your test case doesn't have a SAOP at all). This was never intended. On the other hand, I think that it's still correct (or will be once the assertion is gone), and it seems like it would be simpler to allow this case (and document it) than to not allow it at all. The general idea that we only need one "real" _bt_preprocess_keys call per btrescan (independent of the presence of array keys) still seems sound. -- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Mon, Apr 22, 2024 at 11:13 AM Peter Geoghegan <pg@bowt.ie> wrote: > I'm pretty sure that I could fix this by simply removing the > assertion. But I need to think about it a bit more before I push a > fix. > > The test case you've provided proves that _bt_preprocess_keys's > new no-op path isn't just used during scans that have array keys (your > test case doesn't have a SAOP at all). This was never intended. On the > other hand, I think that it's still correct (or will be once the assertion is > gone), and it seems like it would be simpler to allow this case (and > document it) than to not allow it at all. Pushed a fix like that just now. Thanks for the report, Alexander. -- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
"Anton A. Melnikov"
Date:
Hi, Peter!

On 20.01.2024 01:41, Peter Geoghegan wrote:
> It is quite likely that there are exactly zero affected out-of-core index AMs. I don't count pgroonga as a counterexample (I don't think that it actually fullfills the definition of a ). Basically, "amcanorder" index AMs more or less promise to be compatible with nbtree, down to having the same strategy numbers. So the idea that I'm going to upset amsearcharray+amcanorder index AM authors is a completely theoretical problem. The planner code evolved with nbtree, hand-in-glove.

From the 5bf748b86bc commit message:
> There is a theoretical risk that removing restrictions on SAOP index paths from the planner will break compatibility with amcanorder-based index AMs maintained as extensions. Such an index AM could have the same limitations around ordered SAOP scans as nbtree had up until now. Adding a pro forma incompatibility item about the issue to the Postgres 17 release notes seems like a good idea.

It seems this commit broke our posted knn_btree patch. [1] If the point from which ORDER BY goes by distance is greater than the elements of the ScalarArrayOp, then the knn_btree algorithm will give only the first tuple. It sorts the elements of the ScalarArrayOp in descending order and starts searching from smaller to larger, and always expects that for each element of the ScalarArrayOp there will be a separate scan. That no longer works. Reproduction is described in [2].

It seems impossible to solve this problem from the knn_btree patch side alone. Could you advise how to deal with this? I would be very grateful.

With the best wishes,

-- Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

[1] https://commitfest.postgresql.org/48/4871/
[2] https://www.postgresql.org/message-id/47adb0b0-6e65-4b40-8d93-20dcecc21395%40postgrespro.ru
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Wed, Jul 31, 2024 at 12:47 AM Anton A. Melnikov <a.melnikov@postgrespro.ru> wrote:
> From the 5bf748b86bc commit message:
>
> > There is a theoretical risk that removing restrictions on SAOP index paths from the planner will break compatibility with amcanorder-based index AMs maintained as extensions. Such an index AM could have the same limitations around ordered SAOP scans as nbtree had up until now. Adding a pro forma incompatibility item about the issue to the Postgres 17 release notes seems like a good idea.

Here you've quoted the commit message's description of the risks around breaking third party index AMs maintained as extensions. FWIW that seems like a rather different problem to the one that you ran into with your KNN nbtree patch.

> It seems this commit broke our posted knn_btree patch. [1] If the point from which ORDER BY goes by distance is greater than the elements of the ScalarArrayOp, then the knn_btree algorithm will give only the first tuple. It sorts the elements of the ScalarArrayOp in descending order and starts searching from smaller to larger, and always expects that for each element of the ScalarArrayOp there will be a separate scan. That no longer works. Reproduction is described in [2].
>
> It seems impossible to solve this problem from the knn_btree patch side alone. Could you advise how to deal with this? I would be very grateful.

Well, you're actually patching nbtree. Your patch isn't upstream of the nbtree code. You're going to have to reconcile your design with the design for advancing arrays that is primarily implemented by _bt_advance_array_keys. I'm not entirely sure what the best way to go about doing that is -- that would require a lot of context about the KNN patch.

It wouldn't be particularly hard to selectively reenable the old Postgres 16 SAOP behavior, though. You could conditionally force "goto new_prim_scan" whenever the KNN patch's mechanism was in use.

That kind of approach might have unintended consequences elsewhere, though. For example it could break things like mark/restore processing by merge joins, which assumes that the scan's current array keys can always be reconstructed after restoring a marked position by resetting the arrays to their first elements (or final elements if it's a backwards scan). On the other hand, I think that you already shouldn't expect anything like that to work with your patch. After all, it changes the order of the tuples returned by the scan, which is already bound to break certain assumptions made by "regular" index scans.

-- Peter Geoghegan
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Alexander Lakhin
Date:
Hello Peter,

22.04.2024 20:59, Peter Geoghegan wrote:
>
> Pushed a fix like that just now.

I'm sorry to bother you again, but I've come across another assertion failure. Please try the following query (I use a clean "postgres" database, just after initdb):
EXPLAIN SELECT conname
FROM pg_constraint WHERE conname IN ('pkey', 'ID')
ORDER BY conname DESC;

SELECT conname
FROM pg_constraint WHERE conname IN ('pkey', 'ID')
ORDER BY conname DESC;

It fails for me as below:
                                                      QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Index Only Scan Backward using pg_constraint_conname_nsp_index on pg_constraint  (cost=0.14..4.18 rows=2 width=64)
   Index Cond: (conname = ANY ('{pkey,ID}'::name[]))
(2 rows)

server closed the connection unexpectedly
...
with the stack trace:
...
#5  0x000055a49f81148d in ExceptionalCondition (conditionName=0x55a49f8bb540 "ItemIdHasStorage(itemId)", fileName=0x55a49f8bb4a8 "../../../../src/include/storage/bufpage.h", lineNumber=355) at assert.c:66
#6  0x000055a49f0f2ddd in PageGetItem (page=0x7f97cbf17000 "", itemId=0x7f97cbf2f064) at ../../../../src/include/storage/bufpage.h:355
#7  0x000055a49f0f9367 in _bt_checkkeys_look_ahead (scan=0x55a4a0ac4548, pstate=0x7ffd1a103670, tupnatts=2, tupdesc=0x7f97cb5d7be8) at nbtutils.c:4105
#8  0x000055a49f0f8ac3 in _bt_checkkeys (scan=0x55a4a0ac4548, pstate=0x7ffd1a103670, arrayKeys=true, tuple=0x7f97cbf18890, tupnatts=2) at nbtutils.c:3612
#9  0x000055a49f0ebb4b in _bt_readpage (scan=0x55a4a0ac4548, dir=BackwardScanDirection, offnum=20, firstPage=true) at nbtsearch.c:1863
...
(gdb) f 7
#7  0x000055a49f0f9367 in _bt_checkkeys_look_ahead (scan=0x55a4a0ac4548, pstate=0x7ffd1a103670, tupnatts=2, tupdesc=0x7f97cb5d7be8) at nbtutils.c:4105
4105                    ahead = (IndexTuple) PageGetItem(pstate->page,
(gdb) p aheadoffnum
$1 = 24596
(gdb) p pstate->offnum
$2 = 20
(gdb) p pstate->targetdistance
$3 = -24576

Best regards,
Alexander
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
From
Peter Geoghegan
Date:
On Mon, Aug 26, 2024 at 10:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> I'm sorry to bother you again, but I've come across another assertion failure.

You've found a real bug. I should be the one apologizing - not you.

> Please try the following query (I use a clean "postgres" database, just after initdb):
> EXPLAIN SELECT conname
> FROM pg_constraint WHERE conname IN ('pkey', 'ID')
> ORDER BY conname DESC;
>
> SELECT conname
> FROM pg_constraint WHERE conname IN ('pkey', 'ID')
> ORDER BY conname DESC;

The problem is that _bt_checkkeys_look_ahead didn't quite get everything right with sanitizing the page offset number it uses to check if a later tuple is before the recently advanced array scan keys. The page offset itself was checked, but in a way that was faulty in cases where the int16 we use could overflow.

I can fix the bug by making sure that pstate->targetdistance (an int16 variable) can only be doubled when it actually makes sense. That way there can never be an int16 overflow, and so the final offnum cannot wrap around to a value that's much higher than the page's max offset number. This approach works:

    /*
     * The look ahead distance starts small, and ramps up as each call here
     * allows _bt_readpage to skip over more tuples
     */
    if (!pstate->targetdistance)
        pstate->targetdistance = LOOK_AHEAD_DEFAULT_DISTANCE;
-   else
+   else if (pstate->targetdistance < MaxIndexTuplesPerPage / 2)
        pstate->targetdistance *= 2;

I'll push a fix along these lines shortly. Thanks for the report!

-- Peter Geoghegan
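For anyone trying to follow along at home, the failure mode is easy to reproduce in isolation. The following stand-alone program uses assumed values (a seed distance of 5, which is at least consistent with the targetdistance of -24576 from the report, plus the offnum of 20 from the backtrace) to show how unbounded doubling of a 16-bit distance wraps negative, which then turns a backwards-scan offset adjustment into a garbage offset far beyond the page's maximum:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    int16_t     distance = 0;
    uint16_t    offnum = 20;    /* offset from the reported backtrace */

    /* simulate 14 look-ahead calls, doubling without the new cap */
    for (int call = 1; call <= 14; call++)
    {
        if (distance == 0)
            distance = 5;
        else
            distance *= 2;      /* wraps (implementation-defined) past 32767 */
    }

    printf("distance = %d\n", distance);                /* -24576 */
    printf("look-ahead offset = %u\n",
           (uint16_t) (offnum - distance));             /* 24596  */
    return 0;
}

Capping the doubling, as in the fix above, keeps the distance well inside the int16 range, so the computed offset can never escape the page.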