Redesigning the executor (async, JIT, memory efficiency) - Mailing list pgsql-hackers
From: Andres Freund
Subject: Redesigning the executor (async, JIT, memory efficiency)
Msg-id: 20180525033538.6ypfwcqcxce6zkjj@alap3.anarazel.de
List: pgsql-hackers
Responses:
  Re: Redesigning the executor (async, JIT, memory efficiency)
  Re: Redesigning the executor (async, JIT, memory efficiency)
Hi,

The current executor structure limits us in a number of ways that are becoming problematic. I'll outline in this email how I think we should rework it, including a prototype for the first, most drastic, steps.

The primary problems with the current design that I want to address are:

1) The executor structure makes it hard to do asynchronous execution. That comes into play in several cases. For example, we want to be able to continue execution if no tuples are ready in one FDW below an append node, and instead switch to another one. Similarly, that could be relevant for parallel queries with multiple gather nodes. A bit further out, that'd also be very useful for continuing execution in a different part of the tree once blocked on IO; that'd e.g. be useful to build several hashtables for hashjoins at once. There are some limitations in how we do IO at the moment that get in the way; it'd be easier to implement this efficiently if we had AIO support.

With the current, tree-walking based architecture, it's hard to implement something like that, because we basically have to re-enter the subtrees where we'd stopped. That'd require adding new state-machine logic in every node, including additional branches. What we basically need here is the ability to stop execution inside a specific executor node and have a multiplexer node (which can use WaitEventSets or such) decide which part of the execution tree to continue executing instead.

2) The tree-walking based approach makes it very hard to JIT compile the main portion of query execution without manually re-emitting all of executor/, which is obviously entirely unattractive. For the expression evaluation JIT, the biggest benefit comes from removing the indirect branches between the expression steps and then, a close second, from having more efficient implementations of hot-path expression steps. That approach currently isn't feasible for the whole executor.
The main sources of indirect branches / calls are the ExecProcNode() and ExecEvalExpr() calls inside the Exec*() nodes. To be able to make those constant we'd basically have to manually emit code for all of these (or attempt to rely on the inliner, but that doesn't work across deep, potentially self-invoking, trees). The expression code, which also used to be tree walking, avoids that now by having the indirect function calls nearly exclusively at the top level.

Secondary issues that I also want to address, or at least get ready to address, are:

3) Reduce the overhead of executor startup. In a number of workloads involving simple queries we currently spend most of the time within ExecInitNode(). There are two major problems here: For one, a lot of work is done at executor initialization that should be part of planning (especially creating tupledescs and nodeAgg initialization). The other problem is that initialization does a lot of tiny allocations.

4) Make the executor faster, leaving JIT aside. Moving to a "push based" executor can yield noticeable speedups: Right now returning a single tuple will often pass through a lot of indirect function calls (via ExecProcNode()). By moving to a push based executor (where the first executed piece is reachable without any recursion), this can be significantly cheaper. We also exercise the stack *a lot* right now, constantly pushing/popping stuff onto it that's not going to be used anytime soon. Profiling shows that stack usage is quite a bottleneck right now.

5) Reduce the memory overhead of query execution. Right now executing a prepared statement has quite a massive memory overhead. We copy around a lot of memory, initialize a lot of nodes from scratch, etc. It'd be far better if we could share a lot of state between successive executions of the same prepared statement.

6) Plan ahead for vectorized / batched execution. Dealing with tuples one-by-one is inefficient from a cache locality, code locality and CPU pipelining perspective.
Instead of fully processing a single tuple from a page (evaluating the qual, then joining it to other things, etc.), it's a lot more efficient to evaluate the qual for all the tuples on a page, and then pass all the matching tuples to the next execution step.

After playing with / pondering various approaches, I think the right way to approach this is to replace the tree-walk based execution with a 'program' / bytecode interpreter based approach, quite similar to what we've done to expression evaluation in v10 (which was subsequently used to implement JIT compilation of expression evaluation in v11).

To address the memory density / startup overhead issues I think we need to work towards two things: First, most of the memory needed for a query should come from a single, accurately sized, allocation. Secondly, the writable memory for query execution should be referenced using memory offsets, rather than pointers. That way we can clone the writable memory for query execution.

That's obviously quite a bit of a project. Or rather, projects.

I've started prototyping a solution for this. The code for the prototype is at https://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=tree;h=refs/heads/executor-rewrite;hb=refs/heads/executor-rewrite (click on the repo to see the git url) and I plan to continue working on the project in that git branch.

In my first iteration I'd tried to basically address 1-4 in a single prototype, using a single pass over the plan tree, without using the current executor initialization. Not impossible, but very, very unwieldy. In the second version of the prototype I'm instead only tackling 1-2, and reuse the current executor node initialization functions.
This prototype adds support for generating and executing an opcode based execution plan for a subset of query nodes (seqscan, index [only] scan, hash / sort based aggregation without grouping sets, limit, inner nestloop / hashjoin without batches) when use_linearized_plan is set and all nodes in the query are supported. For example, this converts TPCH's Q01:

tpch_1[26142][1]=# SET client_min_messages = 'log';
tpch_1[26142][1]=# SELECT
    l_returnflag,
    l_linestatus,
    sum(l_quantity) AS sum_qty,
    sum(l_extendedprice) AS sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
    avg(l_quantity) AS avg_qty,
    avg(l_extendedprice) AS avg_price,
    avg(l_discount) AS avg_disc,
    count(*) AS count_order
FROM lineitem
WHERE l_shipdate <= date '1998-12-01' - interval '74 days'
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;

into:

LOG: 00000: pp:
 0: init_sort
 1: seqscan_first
 2: seqscan [j empty 5] > s0
 3: qual [j fail 2] < scan s0
 4: hashagg_tuple [j 2] < s0
 5: drain_hashagg [j empty 7] > s1
 6: sort_tuple [j 5] < s1
 7: sort
 8: drain_sort [j empty 10] > s2
 9: return < s2 [next 8]
10: done

where the s's are numbered slots and the j's are, potentially conditional, jumps. This works for a number of plans, but there are several where we currently bail out.

The basic idea is to convert the current, largely unmodified, Plan tree into a "linear" representation of steps that can be executed by an interpreter similar to ExecInterpExpr(). Individual parts of a query tree form "instructions" which together form a small program to execute the query. Obviously such a program will contain (conditional) backward and forward jumps (which differentiates it from the expression case, which knows only forward jumps). This means that some of the current executor nodes get split into multiple parts, because it's not permitted to recurse from within an executor step.
A tuple can be returned to the user / caller by a special 'return tuple' step. This just sets the "virtual" instruction pointer to the point where execution should resume, allowing execution to continue at the right spot when the next tuple is supposed to be returned (e.g. for cursors). Instead of, as currently in the prototype, returning control flow back to ExecutePlan(), we could - and quite possibly should - instead just process the tuple by having a 'send_slot' step or such. But right now it's a good example of how control flow can be interrupted at any boundary.

For the async FDW (and gather, and IO, ...) cases, you'd have a 'scan_fdw' step that'd normally continue execution, but if there's no tuple, it'd instead jump to a multiplexer step that'd use a WaitEventSet to watch which of the to-be-waited-for FDWs is ready.

Despite being entirely unoptimized, this already yields a measurable speedup over the current executor for the TPCH queries it supports.

I think the big argument against this kind of project is that it's going to be quite invasive. Not everything inside the executor will have to be touched, but it's going to be a significant portion. But I don't think there's really a way around that for the medium to long term future, and I don't see how we'd gain anything by waiting further.

My basic plan to attack this (although I certainly hope/plan to not work on it alone) is:

1) Expand the prototype to cover all SELECT queries. That's mostly "just work". This'll reuse the current executor initialization with minimal modifications.

2) Reimplement EXPLAIN ANALYZE support. I'm not yet 100% sure what the best way to do so is - potentially it's just generating slightly different programs, where the steps contain additional code to perform the instrumentation.

3) Reimplement the EvalPlanQual() infrastructure by generating a separate execution plan for EPQ, which replaces the relevant scan nodes with an execution step that returns tuples from estate->es_epqTuple.
The current indirection / branches due to ExecScan() are fairly expensive.

4) Remove the old executor *Exec() functions, and subsidiary code.

5) Clean up the executor nodes; they should now require far less information, because a lot of the state will be in executor steps.

6) Introduce a new pass, run before ExecInitNode(), that accounts for all the necessary memory for, at least, nodes, slots and expression contexts. Then convert ExecInitNode() to allocate from that. I'd assume that we'd leave expression evaluation out of that, because such work would be a largely independent project.

7) Split the state in the executor nodes / executor program steps into modifiable and non-modifiable parts, and store as much as reasonably possible using relative addresses rather than absolute pointers. That can then be used to make executor programs reusable and reentrant. That's going to be useful for at least prepared statements, plpgsql and better JIT code generation.

8) Write a JIT implementation.

Thoughts?

Regards,

Andres