Re: BUG #18205: Performance regression with NOT NULL checks. - Mailing list pgsql-bugs
From: Andres Freund
Subject: Re: BUG #18205: Performance regression with NOT NULL checks.
Msg-id: 20231119211203.o5qczelolmbczawt@awork3.anarazel.de
In response to: Re: BUG #18205: Performance regression with NOT NULL checks. (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-bugs
Hi,

On 2023-11-19 14:08:05 -0500, Tom Lane wrote:
> So that results in not having to deconstruct most of the tuple,
> whereas in the new code we do have to, thanks to b8d7f053c's
> decision to batch all the variable-value-extraction work.

Yea, I think we were aware at the time that this does have downsides - it's just that the worst-case behaviour of *not* batching is much bigger than the worst-case downside of batching.

> This is a pretty narrow corner case: it would only matter if the
> column you're testing for null-ness is far past any other column
> the query needs to fetch. So I'm not sure that it's worth doing
> anything about. You could imagine special processing for
> NullTest-on-a-simple-Var: exclude the Var from the ones that we
> extract normally and instead compile it into some sort of "is column k
> NULL" test on the HeapTuple.

We actually did add fastpaths for a few similar cases: ExecJustInnerVar() etc. will just use slot_getattr(). These can be used when the result is just a single variable. However, the goal there was more to avoid "interpreter startup" overhead than evaluation overhead.

Those fastpaths don't match in this case though; the generated "program" is:

    EEOP_SCAN_FETCHSOME
    EEOP_SCAN_VAR
    EEOP_NULLTEST_ISNULL
    EEOP_QUAL
    EEOP_DONE

which matches none of the fastpaths we recognize. In fact, I don't think any of the fastpaths can currently be applied to quals, which is somewhat sad :(

Right now the "pattern matching" happens in ExecReadyInterpretedExpr(). For portions of it that makes sense (as they're interpretation specific), but the transformation from EEOP_SCAN_FETCHSOME to just fetching a single column imo should be moved into execExpr.c, by making expr_setup_walker() a bit smarter. I think that might help in a fair number of cases - quals on a single column aren't exactly rare.

In this specific case we could also elide the EEOP_QUAL: the NULLTEST cannot return a NULL, so we don't need any qual-specific behaviour.
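Tom's "is column k NULL" idea amounts to probing the heap tuple's null bitmap directly, without deforming any of the leading columns. A minimal sketch of such a probe, assuming heapam's bitmap convention (a *set* bit means not-null); the function name and signature are hypothetical, not existing PostgreSQL code:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sketch: answer "is column attnum NULL?" straight from a heap
 * tuple's null bitmap.  In heapam's convention a set bit means the column is
 * not null; when the tuple has no nulls at all there is no bitmap to consult.
 */
static bool
column_is_null(const uint8_t *nullbitmap, bool hasnulls, int attnum)
{
    if (!hasnulls)
        return false;
    return (nullbitmap[attnum >> 3] & (1 << (attnum & 0x07))) == 0;
}
```

This touches exactly one byte of the tuple header, regardless of how far into the row the tested column sits.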
For this case alone it'd probably not be worth tracking whether something can be NULL. But it could be worthwhile if we made it more generic, e.g. eliding strictness checks before function calls could be a nice win.

> But that seems messy, and it could be a significant pessimization for
> storage methods that aren't like heapam.

My concern is more that they're pretty narrow - they apply only if a single column is referenced. If we just need attributes 100 and 101, it's still expensive to deform the leading columns.

To address that, I think we could benefit from a tuple deforming routine that's more aware of which columns are needed. Even just omitting stores into tts_values/nulls for columns that aren't needed yielded decent wins when I experimented in the past. The hard part of course is figuring out when to use a "selective" deforming routine, which will have *worse* performance if most columns are used, and when to just deform everything up to natts. I think the current deforming logic is actually optimized towards heapam to a problematic degree - deforming leading columns in a columnar database hurts really badly.

> On the whole I'm inclined to say "sorry, that's the price of
> progress".

Agreed, we can't really go back to deforming individual columns more widely - that can get *really* expensive. But I think we can do plenty to improve the performance of this specific case, both by improving the generated expression program and:

> But it is a bit sad that a patch intended to be a
> performance win shows a loss this big on a (presumably) real-world
> case.

I think we might be able to micro-optimize that code a bit to reduce the decrease in performance.
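To make "selective deforming" concrete: the idea is to still walk past earlier variable-width columns (their lengths gate the offsets of later ones), but only store the attributes a "needed" set asks for, and stop once nothing further is needed. A toy sketch over a made-up length-prefixed row format - not PostgreSQL's on-disk layout, and all names hypothetical:

```c
#include <stdint.h>

/*
 * Toy selective deform: each column is encoded as a 1-byte length followed by
 * its payload.  We must walk past earlier columns to find later offsets, but
 * we only store pointers/lengths for attributes whose bit is set in "needed",
 * and we stop as soon as no higher-numbered attribute is requested.
 */
static void
deform_selective(const uint8_t *row, int natts, uint64_t needed,
                 const uint8_t **values, uint8_t *lens)
{
    const uint8_t *p = row;

    for (int att = 0; att < natts; att++)
    {
        uint8_t len = *p++;

        if (needed & ((uint64_t) 1 << att))
        {
            values[att] = p;
            lens[att] = len;
        }
        p += len;

        /* nothing past this attribute is needed - stop walking */
        if ((needed >> (att + 1)) == 0)
            break;
    }
}
```

The win comes from the skipped stores (and the early exit); the cost is that when most columns *are* needed, the extra mask tests make this slower than just deforming everything up to natts, which is exactly the trade-off described above.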
Neither gcc nor clang is able to optimize the cost of the null-check meaningfully, and there are a number of dependent instructions:

    if (hasnulls && att_isnull(attnum, bp))
    {
        values[attnum] = (Datum) 0;
        isnull[attnum] = true;
        slow = true;        /* can't use attcacheoff anymore */
        continue;
    }

    static inline bool
    att_isnull(int ATT, const bits8 *BITS)
    {
        return !(BITS[ATT >> 3] & (1 << (ATT & 0x07)));
    }

What if we instead load 8 bytes of the bitmap into a uint64 before entering the loop, and shift an "index" mask into the bitmap by one each iteration through the loop? Then we'd only need to access the bitmap every 64 attributes.

I think we could also get rid of the hasnulls check in each iteration that way, by checking hasnulls only whenever we load portions of the bitmap into a local variable, and loading "no-nulls" into it when !hasnulls.

Greetings,

Andres Freund
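A rough sketch of the scheme proposed above, over a toy bitmap walk: keep a 64-bit window of the null bitmap in a local, test one bit per attribute by shifting, reload only every 64 attributes, and load an all-ones word when !hasnulls so the per-iteration hasnulls branch disappears. All names are hypothetical; the window is assembled byte-by-byte to avoid endianness assumptions:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sketch: count NULL attributes using a 64-bit window over the
 * null bitmap.  The bitmap follows heapam's convention (clear bit == NULL).
 * The bitmap is touched only once per 64 attributes; !hasnulls loads a
 * "no-nulls" all-ones word instead, removing the per-iteration branch.
 */
static int
count_nulls(const uint8_t *bits, bool hasnulls, int natts)
{
    uint64_t window = 0;
    int nulls = 0;

    for (int attnum = 0; attnum < natts; attnum++)
    {
        if ((attnum & 63) == 0)
        {
            if (hasnulls)
            {
                /* load up to 8 bytes of the bitmap into the window */
                int base = attnum >> 3;
                int nbytes = (natts + 7) / 8 - base;

                if (nbytes > 8)
                    nbytes = 8;
                window = 0;
                for (int j = 0; j < nbytes; j++)
                    window |= (uint64_t) bits[base + j] << (8 * j);
            }
            else
                window = ~(uint64_t) 0;    /* "no nulls" */
        }

        if ((window & 1) == 0)
            nulls++;
        window >>= 1;
    }
    return nulls;
}
```

The inner loop is now a single-bit test plus a shift, with no memory access and no hasnulls branch.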