Improving executor performance - Mailing list pgsql-hackers
From | Andres Freund
---|---
Subject | Improving executor performance
Date |
Msg-id | 20160624232953.beub22r6yqux4gcp@alap3.anarazel.de
Responses | Re: Improving executor performance
 | Re: Improving executor performance
 | Re: Improving executor performance
 | Re: Improving executor performance
 | Re: Improving executor performance - tidbitmap
List | pgsql-hackers
Hi,

At pgcon, in [1], and in various other threads and conferences we talked about our executor performance needing improvements to perform well in !OLTP workloads, and how we can get there. I've played with various techniques, on and off, over the last few weeks, including trying to make slot_deform_tuple et al. faster, batched execution for a few node types (SeqScan, hashed Agg), avoiding linked lists in executor datastructures, and smaller micro optimizations.

As a motivation, here's a somewhat juicy example of the benefits that can be gained (parallelism disabled, results vary too much):

SELECT l_linestatus, SUM(l_quantity) FROM lineitem GROUP BY 1 ORDER BY 2 DESC LIMIT 10;

non-optimized: 9075.835 ms
optimized: 5194.024 ms

and that's far from doing all the possible work for that query; especially the batching is far from optimal.

So far I've mostly looked at performance for TPC-H, not because it's a particularly good workload, but because it's available. If somebody has good suggestions for an analytics-heavy workload...

My observations about the performance bottlenecks I found, and partially addressed, are, in rough order of importance (there's interdependence between most of them):

1) Cache misses cost us a lot; doing more predictable accesses can make a huge difference. Without addressing that, many other bottlenecks don't matter all that much. I see *significant* performance improvements for large seqscans (when the results are used) simply by memcpy()'ing the current target block. This is partially an intrinsic problem of analyzing a lot of data, and partially because our access patterns are bad.

2) Barring 1), tuple deforming is the biggest bottleneck in nearly all queries I looked at. There are various places we trigger deforming, most ending in either slot_deform_tuple(), heap_getattr(), or heap_deform_tuple(). This can be optimized significantly, but even after that it's still a major issue. Part of the problem is that we sometimes don't know how many elements to access (especially in index and hashing related code), another that we're deforming a lot of columns we'll never need, just because we need a subsequent one. A third is that our tuple format requires a number of relatively expensive operations to access data (e.g. alignment computations, checking the null bitmap).

3) Our 1-by-1 tuple flow in the executor has two major issues:

The biggest is that we perform a *lot* of repetitive work for each processed tuple. E.g. looking at nodeAgg.c's agg_fill_hash_table(), for each single tuple we execute ExecProcNode(), ExecSeqScan(), heap_getnext(), lookup_hash_entry(), LookupTupleHashEntry(), ... and many more. Each of these has a noticeable per-call state setup. When executing on batches of tuples, a lot of that setup work can be done once per batch, instead of once per tuple. Part of that the compiler can do automatically, part of that has to be done by hand.

Also, very significantly, due to the amount of code executed there's barely any working branch prediction, leading to massive amounts of pipeline stalls and instruction cache misses.

There's also the fact that some future optimizations around using SIMD are easier when looking at batches of tuples, but I'm not planning to work on that.

4) Various missing micro optimizations have to be performed, for more architectural issues to become visible. E.g. [2] causes such bad slowdowns in hash-agg workloads that other bottlenecks are hidden.

5) Per-tuple heap_getnext() makes it hard to fix 3), and has similar issues. Using a per-block heap_getmany() that fills a batch at once is noticeably faster (and more convenient in a batched world anyway); a rough sketch of what I mean follows after this list.

6) The use of linked lists adds noticeable #instructions overhead and branch misses. Just replacing {FuncExprState,BoolExprState}->args with an array gives a ~10%ish boost for queries with a bunch of quals (also sketched below). Another example that appears to hurt noticeably, but which I've not experimentally fixed, is AggState->hash_needed.
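To make the batching in 3) and 5) a bit more concrete, here's a minimal sketch of the control flow I have in mind. None of this is real code: the heap_getmany() signature, BATCH_SIZE and lookup_or_build_hash_entry() are just for illustration, and the sketch ignores slots, visibility handling, projection and memory contexts entirely.

#define BATCH_SIZE 64

/*
 * Sketch of a batched agg_fill_hash_table(): pull a whole batch of tuples
 * from the scan in one call, then process them in a tight inner loop.
 */
static void
agg_fill_hash_table_batched(AggState *aggstate, HeapScanDesc scan)
{
    HeapTuple   tuples[BATCH_SIZE];
    int         ntuples;

    /*
     * Per-call setup work (scan repositioning, slot bookkeeping, ...) is
     * paid once per batch here, instead of once per tuple as with
     * heap_getnext().
     */
    while ((ntuples = heap_getmany(scan, ForwardScanDirection,
                                   tuples, BATCH_SIZE)) > 0)
    {
        int         i;

        /*
         * The inner loop is small enough that branch prediction actually
         * works, and deforming/hashing for a batch's worth of tuples
         * happens back to back, with much friendlier cache behaviour.
         */
        for (i = 0; i < ntuples; i++)
            lookup_or_build_hash_entry(aggstate, tuples[i]);
    }
}

And a similarly rough sketch for the List-vs-array point in 6): FuncExprState today only has the args List; the nargs/argstates fields used below are hypothetical.

/*
 * Argument evaluation over a flat array: a counted loop instead of chasing
 * ListCell->next pointers, which costs instructions and branch misses.
 */
static void
eval_args_from_array(FuncExprState *fcache, FunctionCallInfo fcinfo,
                     ExprContext *econtext)
{
    int         i;

    for (i = 0; i < fcache->nargs; i++)
        fcinfo->arg[i] = ExecEvalExpr(fcache->argstates[i], econtext,
                                      &fcinfo->argnull[i], NULL);
}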
To a large degree these issues seem fairly easily fixable; it's kinda boring work, but not really hard.

I'm planning to start individual subthreads for some of these, with in-detail discussions and/or patches. But I wanted to present this on a higher level once, before spending even more time.

Have I missed concrete issues others noticed? Missed significant improvements (please, please, only realistic stuff)?

Unfortunately these items are heavily interdependent. For some queries you'll need issue n) addressed before n+2) shows a benefit, for some it's the other way round, and such. Given the size of the overall task, the only way I can see this being reasonably addressable is trying to get smaller steps in individually, even if they're not a huge win on their own, and then focus on the larger architectural changes (primarily 3)).

Comments so far? Different suggestions on how to get improvements around this merged?

Greetings,

Andres Freund

[1] http://archives.postgresql.org/message-id/CA%2BTgmobx8su_bYtAa3DgrqB%2BR7xZG6kHRj0ccMUUshKAQVftww%40mail.gmail.com
[2] https://www.postgresql.org/message-id/20160622002607.mbsfsm42cxqomi4d%40alap3.anarazel.de