Thread: Re: Making Row Comparison NULL row member handling more robust during skip scans

On Wed, Jun 18, 2025 at 8:41 PM Peter Geoghegan <pg@bowt.ie> wrote:
> In general, when we end a primitive index scan, the code that sets
> continuescan=false (any such code, not just _bt_check_rowcompare NULL
> row member code) has to make sure that starting the next primitive
> index scan will actually allow the top-level scan to make forward
> progress -- the new primscan needs to land on some later leaf page.
> Right now, _bt_check_rowcompare doesn't guarantee it in a way that I'd
> consider truly robust. We need that guarantee to avoid these
> cycles/infinite looping. AFAIK there are no bugs of that general
> nature on Postgres 18

Correction: there *is* a live bug like this on Postgres 18/HEAD, which
involves simple scalar inequalities with an incomplete opfamily.
Attached v3 fixes the bug in its new 0001-* patch. (Note that v3's
0002-* patch previously appeared as 0001-* in v2 and v1).

The bug in question involves redundant keys that could not be
eliminated by preprocessing due to a lack of available cross-type
support. Quals with certain combinations of redundant scalar keys can
cycle through the index being scanned ceaselessly. The scan fails to
make any real progress, again and again. Unlike the row compare case,
we lack even kludgey handling that avoids the problem within
_bt_set_startikey.

There's a live bug here, so I have to act. It seems to make sense to
apply everything on HEAD/Postgres 18, and be done with any question of
getting stuck like this (infinite cycling caused by resetting the
scan's arrays to get a clean slate). But I would certainly welcome other
opinions on this.

Problem statement
=================

I have confirmed the existence of the live bug using fuzz-testing,
combined with hard-coding that makes _bt_compare_scankey_args return
false unconditionally at the point where we're supposed to compare
each sk_argument from a pair of scan keys against the same attribute
(that's the easiest way to test problems in this area). With the right
combination of contradictory runtime keys, and with the right index
scan/index, we can get stuck.
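
(For anyone who wants to reproduce something similar: the hard-coding
in question is nothing more elaborate than an early bail-out along
these general lines, placed at the point within
_bt_compare_scankey_args where the two sk_argument values would
otherwise be compared. Purely a testing hack, of course, not something
that appears in either patch.)

    /*
     * Testing hack: behave as though the opfamily lacks the cross-type
     * support needed to compare this pair of scan keys, which forces
     * preprocessing to keep both "redundant" keys around
     */
    return false;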

When we get stuck, it's because we apply forcenonrequired mode in
_bt_readpage, reset the arrays at the end of the same _bt_readpage,
start another primitive index scan when pstate.finaltup is considered,
and then arrive at the same leaf page we ended on last time. We'll
then start the cycle anew, without any hope of ever making useful
progress. Not good.

Fundamentally, the problem here is that _bt_advance_array_keys expects
certain things from _bt_preprocess_keys and _bt_first that they won't
always honor. My approach in 0001-* more or less makes things always
work in the way that _bt_advance_array_keys already expects, rather
than changing _bt_advance_array_keys (or related code such as
_bt_set_startikey). More on that below, under "bugfix".

Background info on _bt_advance_array_keys' expectations
=======================================================

Notably, _bt_advance_array_keys would like to be able to determine if
_bt_first will be able to perform another primitive index scan based
on lower-order inequalities that are marked required in the
opposite-to-scan direction only (i.e. a > or >= key when scanning
forwards, a < or <= key when scanning backwards). That detail is
important to my repro of the bug. _bt_advance_array_keys expects to be
able to apply required-in-opposite-direction inequalities to reason
about how _bt_first will behave the next time it is called.

During a forwards scan with a qual "WHERE a IN(1, 2, 3) AND b > 5",
_bt_advance_array_keys needs to know how "b > 5" will affect _bt_first
should it choose to schedule another primscan. It needs to decide
whether or not to actually schedule such a primscan. It might make
sense for _bt_advance_array_keys to stick to the leaf level for the
time being upon reaching the threshold between "a = 1" and "a = 2"
tuples -- the first "a = 2 AND b > 5" tuple might be on the same leaf page.
OTOH, scheduling a new primscan is far preferable when it'll allow
_bt_first to land the scan on a much later leaf page (we'll skip over
many irrelevant leaf pages).

In general, the first leaf page/position that the first tuple "a = 2
AND b > 5" appears on might be many leaf pages after the first leaf
page that the first tuple "a = 2" appears on. It all depends.

This scheme works well if there are no redundant > keys (which is very
much the common case, since incomplete opfamilies are rare). It
tacitly assumes that _bt_first shares _bt_advance_array_keys'
definition of a "required in the opposite-to-scan direction" key. But
that isn't quite true right now.

Bugfix
======

My 0001-* bugfix makes nbtree preprocessing deal with a qual like
"WHERE a IN(1, 2, 3) AND b > 5 AND b > 10_0000::bigint" with an
incomplete opfamily on "b" by leaving so->keyData[] in a state that
reflects the redundancy. _bt_preprocess_keys will now make sure that
keys left in so->keyData[] still appear in the most useful order. This
avoids creating confusion elsewhere.

Preprocessing will arbitrarily decide that only the first "b >"
condition gets to be marked required, while *unmarking* the
requiredness marking on the second "b >" condition. That way _bt_first
doesn't get to make its own independent choice about initial
positioning keys, based on ever-so-slightly different rules. Rather,
_bt_first has to agree with everybody else about which > or >= key
should be used, and about which < or <= key should be used (and even
about which "=" key should be used).

_bt_first does what preprocessing tells it to do. Just like
_bt_advance_array_keys will. No ifs, no buts.
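
To make that a bit more concrete, the unmarking itself is conceptually
nothing more than clearing the requiredness bits on whichever
same-attribute inequality preprocessing didn't pick. A sketch of the
idea (not code lifted from the patch):

    /*
     * "key" is the redundant same-attribute inequality that preprocessing
     * didn't pick.  It keeps its operator and sk_argument, and can still
     * be used to filter individual tuples, but without its requiredness
     * markings _bt_first and _bt_advance_array_keys will never base
     * positioning decisions on it
     */
    Assert(key->sk_strategy != BTEqualStrategyNumber);
    key->sk_flags &= ~(SK_BT_REQFWD | SK_BT_REQBKWD);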

Placing nonrequired keys at the end of so->keyData[]
----------------------------------------------------

Preprocessing also reorders so->keyData[] such that the "unlucky" keys
that _bt_preprocess_keys *doesn't* pick go to the end of the array --
we want to keep them out of the way, at least until they're needed.
Note that this makes it possible for keys to appear out of index
attribute order, but that's okay. Nonrequired keys don't need to be in
any particular order.

Reordering so->keyData[] in this way eliminates any risk that some
early nonrequired key on "b" will stop us from getting to some later
required key of interest on "c". Not in _bt_first, not in
_bt_checkkeys or _bt_advance_array_keys. Nowhere.

In general, the rule going forward is that all nbtree code (barring
preprocessing code) can assume that redundant keys don't really exist,
as long as they're marked required (which virtually all keys are now,
thanks to the skip scan/skip array work). There can be at most one
inequality key marked SK_BT_REQFWD and one inequality key marked
SK_BT_REQBKWD per index attribute -- no matter what.
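
Stated as an assertion-style check (illustrative only, not taken
verbatim from the patch), the invariant that scan-time consumers of
so->keyData[] can now rely on looks like this:

    /* No attr may have two required inequalities in the same direction */
    for (int i = 0; i < so->numberOfKeys; i++)
    {
        ScanKey     key = &so->keyData[i];

        if (key->sk_strategy == BTEqualStrategyNumber)
            continue;           /* only inequality keys are of interest */

        for (int j = i + 1; j < so->numberOfKeys; j++)
        {
            ScanKey     other = &so->keyData[j];

            if (other->sk_attno != key->sk_attno ||
                other->sk_strategy == BTEqualStrategyNumber)
                continue;

            Assert(!(key->sk_flags & other->sk_flags & SK_BT_REQFWD));
            Assert(!(key->sk_flags & other->sk_flags & SK_BT_REQBKWD));
        }
    }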

nbtree is blasé about keeping around redundant scan keys right now.
That seems like it has the potential to cause more bugs in the future.
It's easy to forget about redundant required keys, and they're rare
(except perhaps when row compares are used, since preprocessing has
hardly any smarts about redundancy that happens to involve row
compares right now). For a recent example of a bug caused by such an
oversight, see bugfix commit 7e6fb5da.

New _bt_first behavior with row compares
----------------------------------------

Currently, on HEAD, _bt_first is prepared to use a seemingly random
mix of the first row compare member followed by some later
redundant-ish scalar key on some later column (not the key on the same
column that appears in the row compare). v3's row compare patch avoids
that behavior. That happens automatically in v3, by virtue of the fact
that there can't be a row compare that's marked required in the
opposite-to-scan direction followed by some other key that's also
marked required (skip arrays don't accept row compare inequalities, so
it's just not possible).

In short, if _bt_first gets to use a row compare to build its
insertion scan key at all, then it *must* use all of the row members;
it cannot use any additional lower-order keys. (Actually, _bt_first
can only do that with individual row members marked required).

Again, there needs to be symmetry here (perhaps it isn't strictly
necessary to go exactly this far, but the row compare change to
_bt_first *is* necessary). Again, the rules that decide when we'll
start an index scan in one particular scan direction need to be
symmetric with the rules that decide when we'll end an index scan with
the same qual moving in the opposite scan direction.

Impact on Postgres 17
=====================

On Postgres 17, even a qual with redundancy such as "WHERE a IN(1, 2,
3) AND b > 5 AND b > 10_0000::bigint" shouldn't get infinite cycling
with an incomplete opfamily on "b". That can only happen on HEAD
because recent bugfix commit 5f4d98d4 added a _bt_start_array_keys
call to _bt_readpage, to reset the scan's arrays after we used
forcenonrequired mode to read all the tuples on the page.

Postgres 17 *does* reset the array keys through a call to
_bt_start_array_keys, to get a clean slate. But that only happens in
the context of mark/restore for merge joins, which seems robust
against infinite cycling of the kind I'm seeing on HEAD.

--
Peter Geoghegan

On Wed, Jun 25, 2025 at 5:44 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Correction: there *is* a live bug like this on Postgres 18/HEAD, which
> involves simple scalar inequalities with an incomplete opfamily.
> Attached v3 fixes the bug in its new 0001-* patch.

Attached is v4, which is largely just a polished version of v3. It has
improved comments and more worked out commit messages. Plus the second
patch (the row compare patch) now teaches _bt_first to fully rely on
scan key requiredness markings, just like with other scan keys.

Current plan is to commit these two some time next week. It'd be nice
to get some code review before then, or any kind of input on my
general direction.

Note again that the approach I've taken to fixing the bug *adds a
performance optimization*. It will fix the row compare performance
problem described on a thread from February of this year:


https://www.postgresql.org/message-id/flat/BLAPR09MB64993673A034EE93C6C6F4B6CCF62%40BLAPR09MB6499.namprd09.prod.outlook.com

It will *also* fix the row compare performance problem that Tom
complained about last year, which included its own test case:

https://postgr.es/m/66001.1715455158@sss.pgh.pa.us

I am aware that this all sounds quite odd. Yes, I'm proposing a bug
fix that improves performance in what must seem like fairly tangential
cases. As I went into already, it really did just work out that way.

I had forgotten about the two threads I'm referencing here until I saw
references to each of them in my skip scan project notes. I'm
mentioning them on the thread now in case they're useful breadcrumbs
for somebody who revisits this thread in the future.

--
Peter Geoghegan

On Fri, Jun 27, 2025 at 5:35 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached is v4, which is largely just a polished version of v3. It has
> improved comments and more worked out commit messages. Plus the second
> patch (the row compare patch) now teaches _bt_first to fully rely on
> scan key requiredness markings, just like with other scan keys.

Heikki said he'd be able to give this patch set at least a quick
review, so here's a new revision, v5.

This isn't really different from v4. It has more comment cleanup, and
better commit messages.

--
Peter Geoghegan

On 01/07/2025 19:37, Peter Geoghegan wrote:
> On Fri, Jun 27, 2025 at 5:35 PM Peter Geoghegan <pg@bowt.ie> wrote:
>> Attached is v4, which is largely just a polished version of v3. It has
>> improved comments and more worked out commit messages. Plus the second
>> patch (the row compare patch) now teaches _bt_first to fully rely on
>> scan key requiredness markings, just like with other scan keys.
>
> Heikki said he'd be able to give this patch set at least a quick
> review, so here's a new revision, v5.
>
> This isn't really different from v4. It has more comment cleanup, and
> better commit messages.

I like how this makes row comparisons less special. To be honest I don't
quite understand what the bug was, I'd have to dig much deeper into this
to swap enough context into my memory for that. But just reading the
comments in these patches, I get the impression that they make this all
less fragile.

On bt_unmark_extra_keys() function: The function comment explains well
what it does; that's good. A few very minor stylistic remarks on its
implementation:

 > +    /* Set things up for first key's attr */
 > +    origkey = so->keyData;
 > +    curattr = origkey->sk_attno;
 > +    firsti = 0;
 > +    haveReqEquals = false;
 > +    haveReqForward = false;
 > +    haveReqBackward = false;
 > +    for (int i = 0; i < so->numberOfKeys; origkey++, i++)
 > +    {

It took me a moment to understand that origkey is always pointing to
&so->keyData[i]. I'd suggest making it more clear with:

    /* Set things up for first key's attr */
    curattr = so->keyData[0].sk_attno;
    firsti = 0;
    haveReqEquals = false;
    haveReqForward = false;
    haveReqBackward = false;
    for (int i = 0; i < so->numberOfKeys; i++)
    {
        ScanKey        origkey = &so->keyData[i];


In the second loop in the function you're actually doing this:

 > +        origkey = so->keyData + i;

which is yet another way to say the same thing. IMHO "&so->keyData[i]"
is more readable.

Would it be worth adding a comment that we assume keyData to be in
sk_attno order? Or is that a widely-known assumption? I don't remember.

 > +    /*
 > +     * Next, allocate temp arrays: one set for unchanged keys, another for
 > +     * keys that will be unmarked/made non-required
 > +     */
 > +    unmarkKeys = palloc(so->numberOfKeys * sizeof(ScanKeyData));
 > +    keepKeys = palloc(so->numberOfKeys * sizeof(ScanKeyData));

Doesn't matter much, but I think this could be:

    unmarkKeys = palloc(nunmark * sizeof(ScanKeyData));
    keepKeys = palloc((so->numberOfKeys - nunmark) * sizeof(ScanKeyData));

Or use a single array, putting the unmarked entries to the end of the array.
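
Something along these lines, perhaps (completely untested, and
keep_markings() is just a stand-in for whatever test you already use
to decide which keys get to keep their requiredness markings):

    ScanKeyData *tempKeys = palloc(so->numberOfKeys * sizeof(ScanKeyData));
    int         keepi = 0,
                unmarki = so->numberOfKeys;

    for (int i = 0; i < so->numberOfKeys; i++)
    {
        ScanKey     origkey = &so->keyData[i];

        if (keep_markings(origkey))     /* stand-in for the real test */
            tempKeys[keepi++] = *origkey;
        else
            tempKeys[--unmarki] = *origkey;
    }
    memcpy(so->keyData, tempKeys, so->numberOfKeys * sizeof(ScanKeyData));
    pfree(tempKeys);

The entries that end up at the tail come out in reverse order, but
since nonrequired keys don't need to be in any particular order, that
shouldn't matter.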

PS. Not a new thing, but the "Add coverage..." comments in the tests
seem redundant. All tests add coverage for something.

- Heikki




On Tue, Jul 1, 2025 at 2:03 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I like how this makes row comparisons less special. To be honest I don't
> quite understand what the bug was, I'd have to dig much deeper into this
> to swap enough context into my memory for that.

If you're interested, I can provide you with all you'll need to
reproduce the infinite cycling behavior, where _bt_first is called
again and again, without the scan ever making useful progress. The
test data is fairly small, but it's a bit too big to just post to the
list.

My testing *simulates* cases where preprocessing cannot eliminate
redundant keys, without actually creating an incomplete opfamily, and
without really using cross-type operators in the queries. This
shouldn't matter -- it's just much easier to work with.

> But just reading the
> comments in these patches, I get the impression that they make this all
> less fragile.

Technically the second patch (the row compare patch) isn't strictly
necessary to make my tests stop showing infinite cycling behavior. But
it's very easy to show infinite cycling behavior if I remove the
"refuse to forcenonrequired on row compare" kludge that I added to
_bt_set_startikey in bugfix commit 5f4d98d4 -- that's why I added it
in the first place. As you said, the goal is to make row compare quals
less special. And, in particular, to make it safe to remove that hack.

IMO the second patch is essential, since the hack I have in place
right now seems very fragile. The first patch (which adds the new
preprocessing step) is unambiguously needed, to fix a live bug
(without it, the infinite cycling behavior *is* still possible with
redundant keys that preprocessing can't eliminate).

> On bt_unmark_extra_keys() function: The function comment explains well
> what it does; that's good. A few very minor stylistic remarks on its
> implementation:

I agree with all of your feedback; will make all those changes.

> Would it be worth adding a comment that we assume keyData to be in
> sk_attno order? Or is that a widely-known assumption? I don't remember.

I'll add a comment like that, just to be on the safe side.

There is an old comment at the top of _bt_preprocess_keys that says
that scan->keyData[] is in attribute order. And I add a new one in the
patch, which points out that that *isn't* always true anymore when the
so->keyData[] output array has nonrequired keys. There's no comment
that says what the new preprocessing function/pass expects about the
order of so->keyData[] (which is both its input array and its output
array).

> PS. Not a new thing, but the "Add coverage..." comments in the tests
> seem redundant. All tests add coverage for something.

I'll adjust them along those lines.

Unless there are any objections, and assuming you're done giving
feedback, I'll commit both patches tomorrow.

Thanks!
--
Peter Geoghegan