Re: Improving executor performance - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: Improving executor performance |
Date | |
Msg-id | 20160714011850.bd5zhu35szle3n3c@alap3.anarazel.de Whole thread Raw |
In response to | Improving executor performance (Andres Freund <andres@anarazel.de>) |
Responses |
Re: Improving executor performance
Re: Improving executor performance Re: Improving executor performance Re: Improving executor performance |
List | pgsql-hackers |
Hi, On 2016-06-24 16:29:53 -0700, Andres Freund wrote: > 2) Baring 1) tuple deforming is the biggest bottleneck in nearly all > queries I looked at. There's various places we trigger deforming, > most ending in either slot_deform_tuple(), heap_getattr(), > heap_deform_tuple(). > > This can be optimized significantly, but even after that it's still a > major issue. > > Part of the problem is that we sometimes don't know how many elements > to access (especially in index and hashing related code), the other > part that we're deforming a lot of columns that we'll never need, > just because we need a subsequent one. > > The other part is that our tuple format requires a number of > relatively expensive operations to access data (e.g. alignment > computations, checking the null bitmap). > 6) The use of linked lists adds noticeable #instructions overhead and > branch misses. Just replacing {FuncExprState,BoolExprState}->args > with an array gives a ~10%ish boost for queries with a bunch of > quals. Another example that appears to hurts noticeably, but which > I've not experimentally fixed, is AggState->hash_needed. > > To a large degree these seem fairly easily fixable; it's kinda boring > work, but not really hard. As previously discussed many of the "architectural" changes show limited success until a few other bottlenecks are fixed. Most prominently slot deforming and, what I'm planning to primarily discuss in this email, expression evaluation. Even in the current state, profiles of queries evaluating a large number of tuples very commonly show expression evaluation to be one of the major costs. Due to the number of recursive calls that's not always easy to pinpoint though, the easiest thing to spot is usually MakeFunctionResultNoSet showing up prominently. While, as in 6) above, removing linked lists from the relevant structures helps, it's not that much. Staring at this for a long while made me realize that, somewhat surprisingly to me, is that one of the major problems is that we are bottlenecked on stack usage. Constantly entering and leaving this many functions for trivial expressions evaluations hurts considerably. Some of that is the additional numbers of instructions, some of that is managing return jump addresses, and some of that the actual bus traffic. It's also rather expensive to check for stack limits at a very high rate. Attached (in patch 0003) is a proof-of-concept implementing an expression evalution framework that doesn't use recursion. Instead ExecInitExpr2 computes a number of 'steps' necessary to compute an expression. These steps are stored in a linear array, and executed one after another (save boolean expressions, which can jump to later steps). E.g. to compute colname = 1 the steps are 1) fetch var, 2) evaluate const, 3) call function. Expression evaluation then is a switch statement over the opcodes of each of these steps. Using the machinery of those 'steps' to compute projections, quals, and constraint evaluation then allows to reduce the number of indirect function calls/jumps further. My preliminary implementation so far implements only function calls, boolean expression, constant evaluation and variable evaluation. For everything else I'm falling back to the current expression machinery. By combining expression, qual and target list processing, we can also always generate the appropriate slot_getsomeattrs() calls. Note that the patch currently does *NOT* support targetlist SRFs, and instead just errors out. This is not a fundamental issue. I just didn't want to invest time in supporting something we want to reimplement anyway. Similarly subplans currently don't work because of: /* * Evaluate lefthand expressions and form a projection tuple. First we * have to set the econtext to use (hack alert!). which doesn't work quite like that atm. I've used (0004) a similar method to reduce the number of branches and pipeline stalls in slot_deform_tuple considerably. For each attribute a 'step' is generated, which contains exactly the computations required to deform that individual datum. That e.g. allows to separate cases for different alignments and null-checks. Having expression evaluation and slot deforming as a series of simple sequential steps, instead of complex recursive calls, would also make it fairly straightforward to optionally just-in-time compile those. For motivation, here's some random performance differences: SELECT SUM(l_quantity * l_extendedprice) FROM lineitem; master: 5038.382 4965.310 4983.146 patches: 4194.593 4153.759 4168.986 tpch-q1 master: 21274.896 dev: 17952.678 For queries involving more complex expressions, the difference can be a larger. Both, expression processing and tuple deforming, can use considerable additional improvements. But I think the approaches presented here are the biggest step that I can see. The reason that I'm bringing this up before submitting actual 'batch operation' patches is that the architectural improvements are quickly hidden behind these bottlenecks. Before spending time polishing up these approaches, I'd like if anybody fundamentally disagrees with either, or has a better proposal. If not I'm hoping to first "properly" submit the slot deforming for review, and then the expression evaluation. The latter would obviously need to be a lot more complete than now; and we'd likely want to the targetlist SRF rearchitecting beforehand. Comments? Regards, Andres
Attachment
pgsql-hackers by date: