Thread: Re: Making Row Comparison NULL row member handling more robust during skip scans
Re: Making Row Comparison NULL row member handling more robust during skip scans
From
Peter Geoghegan
Date:
On Wed, Jun 18, 2025 at 8:41 PM Peter Geoghegan <pg@bowt.ie> wrote:
> In general, when we end a primitive index scan, the code that sets
> continuescan=false (any such code, not just _bt_check_rowcompare NULL
> row member code) has to make sure that starting the next primitive
> index scan will actually allow the top-level scan to make forward
> progress -- the new primscan needs to land on some later leaf page.
> Right now, _bt_check_rowcompare doesn't guarantee it in a way that I'd
> consider truly robust. We need that guarantee to avoid these
> cycles/infinite looping. AFAIK there are no bugs of that general
> nature on Postgres 18

Correction: there *is* a live bug like this on Postgres 18/HEAD, which involves simple scalar inequalities with an incomplete opfamily. Attached v3 fixes the bug in its new 0001-* patch. (Note that v3's 0002-* patch previously appeared as 0001-* in v2 and v1.)

The bug in question involves redundant keys that could not be eliminated by preprocessing due to a lack of available cross-type support. Quals with certain combinations of redundant scalar keys can cycle through the index being scanned ceaselessly. The scan fails to make any real progress, again and again. Unlike the row compare case, we lack even kludgey handling that avoids the problem within _bt_set_startikey.

There's a live bug here, so I have to act. It seems to make sense to apply everything on HEAD/Postgres 18, and be done with any question of getting stuck like this/infinite cycling from resetting the scan's arrays to get a clean slate. But I would certainly welcome other opinions on this.

Problem statement
=================

I have confirmed the existence of the live bug using fuzz-testing, combined with hard-coding that makes _bt_compare_scankey_args return false unconditionally at the point where we're supposed to compare each sk_argument from a pair of scan keys against the same attribute (that's the easiest way to test problems in this area).
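To illustrate the setup being simulated (a toy model, not PostgreSQL code -- ToyKey, toy_compare_args, and toy_preprocess are made-up names): preprocessing can only prove one ">" key redundant against another when the opfamily supplies the cross-type comparison needed to compare their arguments. When it can't, both keys survive, which is the same state the hard-coded "always return false" testing hack produces:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of redundant-key elimination.  Two ">" keys on the same
 * attribute can only be merged into one when their arguments can be
 * compared; differing argument types require cross-type opfamily support.
 */
typedef struct ToyKey
{
	int		attno;			/* index attribute the key applies to */
	long	arg;			/* sk_argument, e.g. "b > 5" has arg 5 */
	int		argtype;		/* 0 = int4, 1 = int8 */
} ToyKey;

/*
 * Stand-in for _bt_compare_scankey_args: reports failure (returns false)
 * when the simulated opfamily lacks the cross-type comparator.
 */
static bool
toy_compare_args(const ToyKey *a, const ToyKey *b,
				 bool have_cross_type, bool *a_redundant)
{
	if (a->argtype != b->argtype && !have_cross_type)
		return false;		/* can't compare the arguments at all */
	/* for ">" keys, the key with the smaller bound is the redundant one */
	*a_redundant = (a->arg < b->arg);
	return true;
}

/* Returns how many ">" keys survive this toy "preprocessing" pass */
static int
toy_preprocess(const ToyKey *keys, int nkeys, bool have_cross_type)
{
	int		nkept = 0;

	for (int i = 0; i < nkeys; i++)
	{
		bool	redundant = false;

		for (int j = 0; j < nkeys; j++)
		{
			bool	i_redundant;

			if (i != j &&
				toy_compare_args(&keys[i], &keys[j], have_cross_type,
								 &i_redundant) && i_redundant)
				redundant = true;
		}
		if (!redundant)
			nkept++;
	}
	return nkept;
}
```

With the simulated incomplete opfamily, both ">" keys stay in the qual, and the rest of nbtree has to cope with the redundancy -- which is exactly the situation the 0001-* patch needs to make safe.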
With the right combination of contradictory runtime keys, and with the right index scan/index, we can get stuck. When we get stuck it's because we apply forcenonrequired mode in _bt_readpage, reset the arrays at the end of the same _bt_readpage, start another primitive index scan when pstate.finaltup is considered, and then arrive at the same leaf page we ended on last time. We'll then start the cycle anew, without any hope of ever making useful progress. Not good.

Fundamentally, the problem here is that _bt_advance_array_keys expects certain things from _bt_preprocess_keys and _bt_first that they won't always honor. My approach in 0001-* more or less makes things always work in the way that _bt_advance_array_keys already expects, rather than changing _bt_advance_array_keys (or related code such as _bt_set_startikey). More on that below, under "Bugfix".

Background info on _bt_advance_array_keys' expectations
=======================================================

Notably, _bt_advance_array_keys would like to be able to determine whether _bt_first will be able to perform another primitive index scan based on lower-order inequalities marked required in the opposite-to-scan direction only (i.e. a > or >= key when scanning forwards, a < or <= key when scanning backwards). That detail is important to my repro of the bug.

_bt_advance_array_keys expects to be able to apply required-in-opposite-direction inequalities to reason about how _bt_first will behave the next time it is called. During a forwards scan with a qual "WHERE a IN(1, 2, 3) AND b > 5", _bt_advance_array_keys needs to know how "b > 5" will affect _bt_first should it choose to schedule another primscan. It needs to decide whether or not to actually schedule such a primscan. It might make sense for _bt_advance_array_keys to stick to the leaf level for the time being upon reaching the threshold between "a = 1" and "a = 2" tuples -- the first "a = 2 AND b > 5" tuple might be on the same leaf page.
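As a rough illustration of that decision (a deliberately simplified sketch -- the real _bt_advance_array_keys heuristics weigh more factors, and should_start_primscan and the two-column tuple encoding are made up): a new primitive scan is only clearly worthwhile when the page's final tuple shows that the next matching tuple must lie beyond the current leaf page:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified sketch.  Forward scan, qual "WHERE a IN (1, 2, 3) AND b > 5".
 * Once the arrays advance to a = 2, the next _bt_first call would
 * reposition the scan to the first "a = 2 AND b > 5" tuple.  Scheduling
 * a new primscan only pays off when that position lies beyond the
 * current page's final tuple (pstate.finaltup).
 */
static bool
should_start_primscan(int final_a, int final_b,	/* finaltup's columns */
					  int next_a, int low_b)	/* next array value, "b >" bound */
{
	if (final_a < next_a)
		return true;		/* all matching tuples lie beyond this page */
	if (final_a == next_a && final_b <= low_b)
		return true;		/* a = 2 tuples here, but none with b > 5 */
	return false;			/* first match may be on this page: stay */
}
```

When the sketch returns false, sticking to the leaf level avoids a pointless descent from the root; when it returns true, repositioning via _bt_first can skip over arbitrarily many irrelevant leaf pages.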
OTOH, scheduling a new primscan is far preferable when it'll allow _bt_first to land the scan on a much later leaf page (we'll skip over many irrelevant leaf pages). In general, the leaf page/position that the first "a = 2 AND b > 5" tuple appears on might be many leaf pages after the leaf page that the first "a = 2" tuple appears on. It all depends.

This scheme works well if there are no redundant > keys (which is very much the common case, since incomplete opfamilies are rare). It tacitly assumes that _bt_first shares the same definition of "required in the opposite-to-scan direction key". But that isn't quite true right now.

Bugfix
======

My 0001-* bugfix makes nbtree preprocessing deal with a qual like "WHERE a IN(1, 2, 3) AND b > 5 AND c > 10_0000::bigint" with an incomplete opfamily on "b" by leaving so->keyData[] in a state that reflects the redundancy. _bt_preprocess_keys will now make sure that keys left in so->keyData[] still appear in the most useful order. This avoids creating confusion elsewhere.

Preprocessing will arbitrarily decide that only the first "b >" condition gets to be marked required, while *unmarking* the requiredness marking on the second "b >" condition. That way _bt_first doesn't get to make its own independent choice about initial positioning keys, based on ever-so-slightly different rules. Rather, _bt_first has to agree with everybody else about which > or >= key should be used, and about which < or <= key should be used (and even about which "=" key should be used). _bt_first does what preprocessing tells it to do. Just like _bt_advance_array_keys will. No ifs, no buts.

Placing nonrequired keys at the end of so->keyData[]
----------------------------------------------------

Preprocessing also reorders so->keyData[] such that the "unlucky" keys that _bt_preprocess_keys *doesn't* pick go to the end of the array -- we want to keep them out of the way, at least until they're needed.
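A toy sketch of the two steps just described (hypothetical names throughout -- SketchKey, TOY_REQFWD, and both functions are stand-ins for the patch's new preprocessing pass, not its actual code): first unmark all but the first same-direction required inequality on each attribute, then stable-partition the array so that nonrequired keys sink to the end:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define TOY_REQFWD	0x1		/* stand-in for SK_BT_REQFWD */

typedef struct SketchKey
{
	int		attno;
	int		flags;
} SketchKey;

/*
 * Step 1: leave at most one forward-required inequality marked per
 * attribute; later duplicates have their requiredness marking cleared.
 * (Assumes keys arrive grouped by attribute.)
 */
static void
sketch_unmark_extra(SketchKey *keys, int nkeys)
{
	int		lastattno = -1;
	bool	seen_req = false;

	for (int i = 0; i < nkeys; i++)
	{
		if (keys[i].attno != lastattno)
		{
			lastattno = keys[i].attno;
			seen_req = false;
		}
		if (keys[i].flags & TOY_REQFWD)
		{
			if (seen_req)
				keys[i].flags &= ~TOY_REQFWD;	/* "unlucky" key loses */
			seen_req = true;
		}
	}
}

/*
 * Step 2: stable partition -- required keys keep their relative order at
 * the front, nonrequired keys go to the end.  Returns the required count.
 */
static int
sketch_reorder(SketchKey *keys, int nkeys)
{
	SketchKey  *tmp = malloc(nkeys * sizeof(SketchKey));
	int			nreq = 0,
				pos;

	for (int i = 0; i < nkeys; i++)
		if (keys[i].flags & TOY_REQFWD)
			tmp[nreq++] = keys[i];
	pos = nreq;
	for (int i = 0; i < nkeys; i++)
		if (!(keys[i].flags & TOY_REQFWD))
			tmp[pos++] = keys[i];
	memcpy(keys, tmp, nkeys * sizeof(SketchKey));
	free(tmp);
	return nreq;
}
```

After the reorder, a required key on a later attribute can precede a nonrequired key on an earlier one -- keys can end up out of index attribute order, which is harmless for nonrequired keys.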
Note that this makes it possible for keys to appear out of index attribute order, but that's okay. Nonrequired keys don't need to be in any particular order. Reordering so->keyData[] in this way eliminates any risk that some early nonrequired key on "b" will stop us from getting to some later required key of interest on "c". Not in _bt_first, not in _bt_checkkeys or _bt_advance_array_keys. Nowhere.

In general, the rule going forward is that all nbtree code (barring preprocessing code) can assume that redundant keys don't really exist, as long as they're marked required (which virtually all keys are now, thanks to the skip scan/skip array work). There can be at most one inequality key marked SK_BT_REQFWD and one inequality key marked SK_BT_REQBKWD per index attribute -- no matter what.

nbtree is blasé about keeping around redundant scan keys right now. That seems like it has the potential to cause more bugs in the future. It's easy to forget about redundant required keys, and they're rare (except perhaps when row compares are used, since preprocessing has hardly any smarts about redundancy that happens to involve row compares right now). For a recent example of a bug caused by such an oversight, see bugfix commit 7e6fb5da.

New _bt_first behavior with row compares
----------------------------------------

Currently, on HEAD, _bt_first is prepared to use a seemingly random mix of the first row compare member followed by some later redundant-ish scalar key on some later column (not the key on the same column that appears in the row compare). v3's row compare patch avoids that behavior. That happens automatically in v3, by virtue of the fact that there can't be a row compare that's marked required in the opposite-to-scan direction followed by some other key that's also marked required (skip arrays don't accept row compare inequalities, so it's just not possible).
In short, if _bt_first gets to use a row compare to build its insertion scan key at all, then it *must* use all of the row members; it cannot use any additional lower-order keys. (Actually, _bt_first can only do that with individual row members marked required.)

Again, there needs to be symmetry here (perhaps it isn't strictly necessary to go exactly this far, but the row compare change to _bt_first *is* necessary). Again, the rules that decide when we'll start an index scan when scanning in one particular scan direction need to be symmetric with the rules that decide when we'll end an index scan with the same qual moving in the opposite scan direction.

Impact on Postgres 17
=====================

On Postgres 17, even a qual with redundancy such as "WHERE a IN(1, 2, 3) AND b > 5 AND c > 10_0000::bigint" shouldn't get infinite cycling with an incomplete opfamily on "b". That can only happen on HEAD because recent bugfix commit 5f4d98d4 added a _bt_start_array_keys call to _bt_readpage, to reset the scan's arrays after we used forcenonrequired mode to read all the tuples on the page.

Postgres 17 *does* reset the array keys through a call to _bt_start_array_keys, to get a clean slate. But that only happens in the context of mark/restore for merge joins, which seems robust against infinite cycling of the kind I'm seeing on HEAD.

--
Peter Geoghegan
Attachment
Re: Making Row Comparison NULL row member handling more robust during skip scans
From
Peter Geoghegan
Date:
On Wed, Jun 25, 2025 at 5:44 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Correction: there *is* a live bug like this on Postgres 18/HEAD, which
> involves simple scalar inequalities with an incomplete opfamily.
> Attached v3 fixes the bug in its new 0001-* patch.

Attached is v4, which is largely just a polished version of v3. It has improved comments and more worked-out commit messages. Plus the second patch (the row compare patch) now teaches _bt_first to fully rely on scan key requiredness markings, just like with other scan keys.

Current plan is to commit these two some time next week. It'd be nice to get some code review before then, or any kind of input on my general direction.

Note again that the approach I've taken to fixing the bug *adds a performance optimization*. It will fix the row compare performance problem described on a thread from February of this year:

https://www.postgresql.org/message-id/flat/BLAPR09MB64993673A034EE93C6C6F4B6CCF62%40BLAPR09MB6499.namprd09.prod.outlook.com

It will *also* fix the row compare performance problem that Tom complained about last year, which included its own test case:

https://postgr.es/m/66001.1715455158@sss.pgh.pa.us

I am aware that this all sounds quite odd. Yes, I'm proposing a bug fix that improves performance in what must seem like fairly tangential cases. As I went into already, it really did just work out that way. I had forgotten about the two threads I'm referencing here until I saw references to each of them in my skip scan project notes. I'm mentioning them on the thread now in case they're useful breadcrumbs for somebody who revisits this thread in the future.

--
Peter Geoghegan
Attachment
Re: Making Row Comparison NULL row member handling more robust during skip scans
From
Peter Geoghegan
Date:
On Fri, Jun 27, 2025 at 5:35 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached is v4, which is largely just a polished version of v3. It has
> improved comments and more worked out commit messages. Plus the second
> patch (the row compare patch) now teaches _bt_first to fully rely on
> scan key requiredness markings, just like with other scan keys.

Heikki said he'd be able to give this patch set at least a quick review, so here's a new revision, v5. It isn't really different from v4. It has more comment cleanup, and better commit messages.

--
Peter Geoghegan
Attachment
Re: Making Row Comparison NULL row member handling more robust during skip scans
From
Heikki Linnakangas
Date:
On 01/07/2025 19:37, Peter Geoghegan wrote:
> On Fri, Jun 27, 2025 at 5:35 PM Peter Geoghegan <pg@bowt.ie> wrote:
>> Attached is v4, which is largely just a polished version of v3. It has
>> improved comments and more worked out commit messages. Plus the second
>> patch (the row compare patch) now teaches _bt_first to fully rely on
>> scan key requiredness markings, just like with other scan keys.
>
> Heikki said he'd be able to give this patch set at least a quick
> review, so here's a new revision. It has more comment cleanup, and
> better commit messages.

I like how this makes row comparisons less special. To be honest I don't quite understand what the bug was; I'd have to dig much deeper into this to swap enough context into my memory for that. But just reading the comments in these patches, I get the impression that they make this all less fragile.

On the _bt_unmark_extra_keys() function: the function comment explains well what it does; that's good. A few very minor stylistic remarks on its implementation:

> +	/* Set things up for first key's attr */
> +	origkey = so->keyData;
> +	curattr = origkey->sk_attno;
> +	firsti = 0;
> +	haveReqEquals = false;
> +	haveReqForward = false;
> +	haveReqBackward = false;
> +	for (int i = 0; i < so->numberOfKeys; origkey++, i++)
> +	{

It took me a moment to understand that origkey is always pointing to &so->keyData[i]. I'd suggest making it more clear with:

	/* Set things up for first key's attr */
	curattr = so->keyData[0].sk_attno;
	firsti = 0;
	haveReqEquals = false;
	haveReqForward = false;
	haveReqBackward = false;
	for (int i = 0; i < so->numberOfKeys; i++)
	{
		ScanKey		origkey = &so->keyData[i];

In the second loop in the function you're actually doing this:

> +		origkey = so->keyData + i;

which is yet another way to say the same thing. IMHO "&so->keyData[i]" is more readable.

Would it be worth adding a comment that we assume keyData to be in sk_attno order? Or is that a widely-known assumption? I don't remember.

> +	/*
> +	 * Next, allocate temp arrays: one set for unchanged keys, another for
> +	 * keys that will be unmarked/made non-required
> +	 */
> +	unmarkKeys = palloc(so->numberOfKeys * sizeof(ScanKeyData));
> +	keepKeys = palloc(so->numberOfKeys * sizeof(ScanKeyData));

Doesn't matter much, but I think this could be:

	unmarkKeys = palloc(nunmark * sizeof(ScanKeyData));
	keepKeys = palloc((so->numberOfKeys - nunmark) * sizeof(ScanKeyData));

Or use a single array, putting the unmarked entries to the end of the array.

PS. Not a new thing, but the "Add coverage..." comments in the tests seem redundant. All tests add coverage for something.

- Heikki
Re: Making Row Comparison NULL row member handling more robust during skip scans
From
Peter Geoghegan
Date:
On Tue, Jul 1, 2025 at 2:03 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I like how this makes row comparisons less special. To be honest I don't
> quite understand what the bug was, I'd have to dig much deeper into this
> to swap enough context into my memory for that.

If you're interested, I can provide you with all you'll need to reproduce the infinite cycling behavior, where _bt_first is called again and again, without the scan ever making useful progress. The test data is fairly small, but it's a bit too big to just post to the list.

My testing *simulates* cases where preprocessing cannot eliminate redundant keys, without actually creating an incomplete opfamily, and without really using cross-type operators in the queries. This shouldn't matter -- it's just much easier to work with.

> But just reading the
> comments in these patches, I get the impression that they make this all
> less fragile.

Technically the second patch (the row compare patch) isn't strictly necessary to make my tests stop showing infinite cycling behavior. But it's very easy to show infinite cycling behavior if I remove the "refuse to forcenonrequired on row compare" kludge that I added to _bt_set_startikey in bugfix commit 5f4d98d4 -- that's why I added it in the first place. As you said, the goal is to make row compare quals less special. And, in particular, to make it safe to remove that hack.

IMO the second patch is essential, since the hack I have in place right now seems very fragile. The first patch (which adds the new preprocessing step) is unambiguously needed, to fix a live bug (without it, the infinite cycling behavior *is* still possible with redundant keys that preprocessing can't eliminate).

> On the _bt_unmark_extra_keys() function: the function comment explains
> well what it does; that's good. A few very minor stylistic remarks on
> its implementation:

I agree with all of your feedback; will make all those changes.
> Would it be worth adding a comment that we assume keyData to be in
> sk_attno order? Or is that a widely-known assumption? I don't remember.

I'll add a comment like that, just to be on the safe side.

There is an old comment at the top of _bt_preprocess_keys that says that scan->keyData[] is in attribute order. And I add a new one in the patch, which points out that that *isn't* always true anymore when the so->keyData[] output array has nonrequired keys. There's no comment that says what the new preprocessing function/pass expects about the order of so->keyData[] (which is both its input array and its output array).

> PS. Not a new thing, but the "Add coverage..." comments in the tests
> seem redundant. All tests add coverage for something.

I'll adjust them along those lines.

Unless there are any objections, and assuming you're done giving feedback, I'll commit both patches tomorrow. Thanks!

--
Peter Geoghegan