Re: Treating work_mem as a shared resource (Was: Parallel Hash take II) - Mailing list pgsql-hackers

From:            Tom Lane
Subject:         Re: Treating work_mem as a shared resource (Was: Parallel Hash take II)
Date:
Msg-id:          19181.1510951685@sss.pgh.pa.us
In response to:  Re: Treating work_mem as a shared resource (Was: Parallel Hash take II)  (Peter Geoghegan <pg@bowt.ie>)
Responses:       Re: Treating work_mem as a shared resource (Was: Parallel Hash take II)
List:            pgsql-hackers
Peter Geoghegan <pg@bowt.ie> writes:
> On Fri, Nov 17, 2017 at 7:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I think this is basically a planning problem.  For example, say we
>> wanted to have work_mem_per_query instead of work_mem_per_node.  There
>> is an obvious design: consider memory use as an independent dimension
>> of merit during path generation and comparison (less is better).
>> Discard candidate paths whose memory use exceeds the
>> work_mem_per_query budget unless there are no other alternatives.  At
>> the end of planning, pick the cheapest path that survived the
>> memory-budget filter.  Now, this has the problem that it would make
>> planning more expensive (because we'd hang on to more paths for
>> longer) but it solves a lot of other problems.  If there's no memory
>> pressure, we can use memory like mad even when it doesn't save much,
>> but when we have to pick between using more memory for one part of
>> the plan and using more memory for another part of the plan, the
>> choice that does the best job reducing overall execution time will
>> win.  Awesome.

> I'd like to hear some opinions on the feasibility of this approach.

There is indeed a big planning problem here, but Robert's sketch is
missing an important component of it: work_mem is not an output of cost
estimates, it is an *input*.  For example, we can sort or hash-join in
however much memory you want, but it's going to cost different amounts.

I think what we're actually looking for is to find the breakpoints in
the cost curve where it thinks it can switch to a different sorting or
hashing model, and then to emit paths that assume work_mem just above
each of those breakpoints.  But the code isn't set up like that now,
not as to either planning or execution.

Another problem with formulating it that way is that it suddenly puts
a much higher premium on the planner's space estimates being right,
which is something I don't have much faith in.  For instance, if the
planner thinks that 1000kB is just enough to hold a hash table, and
then when we run it we find out that we need a bit more space than
that, do we really want the executor to switch to a batched join?
Probably not, especially not if having set the node's work_mem to
1010kB instead would've let it run to completion without batching.

Addressing that discrepancy might be where we need the dynamic
"emergency memory request" mechanism that Peter was postulating.
But I'm not sure exactly how that works, because at the point where
the executor realizes it's about to exceed the original space budget,
it generally has little idea how much more it would need in order to
avoid spilling the sort to disk or adding another round of batching.

So it's really unclear to me what either the planner or executor API
contracts for memory consumption ought to be if we're going to try to
do this differently.  I agree there's a lot of potential for
improvement if we can find a better solution, but we're going to need
to put some serious thought into it.

			regards, tom lane
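
To make Robert's "independent dimension of merit" sketch a bit more
concrete, here is a minimal standalone C illustration.  It is not
PostgreSQL code and every name in it is invented: a candidate path is
kept unless an existing candidate is at least as good on both cost and
memory, and the final choice is the cheapest survivor that fits the
per-query budget, falling back to the smallest footprint if nothing
fits.

/*
 * Minimal standalone sketch (not PostgreSQL code; every name here is
 * invented for illustration) of treating memory use as an independent
 * dimension of merit alongside cost.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct Candidate
{
    const char *desc;           /* e.g. "hashjoin, 4MB hash table" */
    double      total_cost;     /* estimated execution cost */
    size_t      mem_bytes;      /* estimated peak work memory */
} Candidate;

/* does 'a' make 'b' redundant? (no worse on either dimension) */
static bool
dominates(const Candidate *a, const Candidate *b)
{
    return a->total_cost <= b->total_cost && a->mem_bytes <= b->mem_bytes;
}

/* keep the newcomer only if nothing already kept dominates it */
static int
add_candidate(Candidate *kept, int nkept, Candidate newc)
{
    for (int i = 0; i < nkept; i++)
    {
        if (dominates(&kept[i], &newc))
            return nkept;       /* discard the newcomer */
    }

    /*
     * A fuller version would also drop kept[] entries that the newcomer
     * dominates; omitted to keep the sketch short.  Caller guarantees room.
     */
    kept[nkept] = newc;
    return nkept + 1;
}

/* final filter: cheapest candidate within budget, else smallest footprint */
static const Candidate *
choose_candidate(const Candidate *kept, int nkept, size_t budget_bytes)
{
    const Candidate *best = NULL;
    const Candidate *smallest = NULL;

    for (int i = 0; i < nkept; i++)
    {
        if (smallest == NULL || kept[i].mem_bytes < smallest->mem_bytes)
            smallest = &kept[i];
        if (kept[i].mem_bytes <= budget_bytes &&
            (best == NULL || kept[i].total_cost < best->total_cost))
            best = &kept[i];
    }
    return best != NULL ? best : smallest;
}

int
main(void)
{
    Candidate   kept[8];
    int         nkept = 0;

    nkept = add_candidate(kept, nkept,
                          (Candidate) {"hashjoin, unbatched", 100.0, 64u << 20});
    nkept = add_candidate(kept, nkept,
                          (Candidate) {"hashjoin, 4 batches", 140.0, 16u << 20});
    nkept = add_candidate(kept, nkept,
                          (Candidate) {"mergejoin, external sort", 180.0, 4u << 20});

    /* with a 32MB per-query budget, the 4-batch hash join wins */
    printf("picked: %s\n", choose_candidate(kept, nkept, 32u << 20)->desc);
    return 0;
}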
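
A similarly hypothetical sketch of the breakpoint idea: since work_mem
is an input to the cost model, one way to turn it into something the
planner can choose is to enumerate the memory levels at which the
strategy would change -- for a hash join, roughly the points where the
number of batches doubles -- and emit one candidate sized just above
each breakpoint, each then costed with its own assumed budget.  The
names and numbers below are made up.

/*
 * Invented sketch: enumerate candidate work_mem budgets at the points
 * where a hash join's batching strategy would change.
 */
#include <stddef.h>
#include <stdio.h>

#define MAX_CANDIDATES 16

/*
 * Fill work_mems[] with candidate budgets: just enough for an unbatched
 * hash table, then just enough for 2, 4, 8, ... batches, stopping at a
 * floor below which more batching is assumed not to help.
 */
static int
hashjoin_breakpoints(size_t inner_bytes, size_t floor_bytes,
                     size_t *work_mems, int max)
{
    int         n = 0;
    size_t      need = inner_bytes;

    while (n < max && need >= floor_bytes)
    {
        /*
         * Add a little headroom so a slightly-low size estimate doesn't
         * immediately push execution over the edge into extra batching.
         */
        work_mems[n++] = need + need / 16;
        need /= 2;              /* next breakpoint: twice as many batches */
    }
    return n;
}

int
main(void)
{
    size_t      cands[MAX_CANDIDATES];
    int         n = hashjoin_breakpoints(40u << 20, 1u << 20,
                                         cands, MAX_CANDIDATES);

    for (int i = 0; i < n; i++)
        printf("candidate work_mem: %zu bytes (%d batch%s)\n",
               cands[i], 1 << i, i ? "es" : "");
    return 0;
}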
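
Finally, a purely speculative sketch of the dynamic "emergency memory
request" mechanism Peter postulated: a node that is about to exceed its
budget draws on a shared pool before resorting to spilling or
re-batching.  None of these names exist in PostgreSQL, and the
difficulty Tom describes shows up directly in the code: the amount
requested is only a guess.

/*
 * Purely speculative sketch of an "emergency memory request" pool; not
 * PostgreSQL code.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* a cluster-wide pool of spare work memory, say 256MB */
static _Atomic size_t emergency_pool_bytes = 256u << 20;

/* try to reserve 'want' extra bytes from the shared pool */
static bool
emergency_grant(size_t want)
{
    size_t      avail = atomic_load(&emergency_pool_bytes);

    while (avail >= want)
    {
        if (atomic_compare_exchange_weak(&emergency_pool_bytes,
                                         &avail, avail - want))
            return true;        /* reservation succeeded */
        /* 'avail' was refreshed by the failed exchange; retry */
    }
    return false;               /* pool exhausted */
}

/*
 * Called when a node notices it is about to exceed its budget.  'guess'
 * is exactly that: at this point the executor generally has little idea
 * how much more memory would actually avoid spilling or re-batching.
 */
static bool
try_to_avoid_spilling(size_t budget, size_t used, size_t guess)
{
    if (used <= budget)
        return true;            /* still within budget, nothing to do */
    return emergency_grant(guess);
}

int
main(void)
{
    /* a node budgeted at 4MB finds itself at 5MB and asks for 2MB more */
    bool        ok = try_to_avoid_spilling(4u << 20, 5u << 20, 2u << 20);

    puts(ok ? "granted extra memory" : "must spill or add batches");
    return 0;
}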