Thread: Treating work_mem as a shared resource (Was: Parallel Hash take II)
On Wed, Nov 15, 2017 at 1:06 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> In the old days, Oracle had only simple per-operation memory limits
> too, and that applied to every operation in every thread just like our
> work_mem. It's interesting that they had separate knobs for sort and
> hash though, and defaulted to giving hash twice as much.

It makes plenty of sense to give hash twice as much IMV.

I'm interested in the *economics* of how we use memory -- I think we could do a lot better there. Memory capacities have certainly increased dramatically over the years since sort_mem became work_mem, but I suspect that users are (quite reasonably) still giving most of that over to shared_buffers/FS cache. And, storage/network bandwidth also increased dramatically during that time, so single-pass external sorts will continue to be a sensible thing to do frequently. Hashing is a different story, though -- you really do want to make sure that hash-based operations have access to more memory, where it can really be put to use.

Though I am primarily concerned about the fact that we don't give any weight to how sensitive hash-based operations are to having less memory, I don't really want to talk about band-aid measures like a hash_mem GUC (though that does have a certain appeal). I want to talk about moving past work_mem, towards a model where work_mem-like memory is treated as a shared resource, under a regime that intelligently weighs the effects of making more memory available to one plan based on system-wide accounting, and on the sensitivity of each memory-consuming operation to the availability of memory. This thread is intended to get the ball rolling on that. It seems like we need something like this to get the full benefit of our numerous enhancements to sorting and hashing.

> With a whole-plan memory target, our planner would probably begin to
> plan join order differently to minimise the number of hash tables in
> memory at once, like other RDBMSs. Not sure how the plan-wide target
> should work though -- try many plans, giving different portions of
> budget to different subplans? That should work fine if you like
> O(please-melt-my-computer), especially if combined with a similar
> approach to choosing worker numbers. Some kind of feedback system?
> Seems like a different kind of planner, but I have no clue. If you
> have ideas/papers/references, it'd be great to see a new thread on
> that subject.

My first thought is that we might implement a model where little changes about work_mem itself -- it becomes a minimum for each work_mem-consuming operation. There could be an additional "emergency memory fund" that individual plan nodes can avail themselves of during execution, if and when it looks like they'll fall below an allocation that allows the work_mem-consuming operation to perform "optimally" or "reasonably". This would happen at certain "penny-wise, pound-foolish" points. There'd be big differences in how we do this for sorts as compared to hash joins, because the underlying cost functions look totally different.

There'd probably be a maximum amount of memory that each executor node could request from the emergency fund, such as a smallish multiple of work_mem (work_mem being the minimum budget it can *reliably* claim). A single hash join asking for the entire emergency fund for itself all at once seems excessive, and likely to create problems in other areas, so that should be impossible.
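As a rough standalone sketch of the bookkeeping such an emergency fund might involve (this is not PostgreSQL code; the names, the 4x cap, and the pthread-based locking are hypothetical placeholders for shared-memory state), a node would be granted extra memory only up to its per-node ceiling, and only to the extent that the shared fund still has it:

/*
 * Hypothetical sketch of an "emergency memory fund".  Each node is
 * guaranteed work_mem; beyond that it may ask the shared fund for more,
 * up to a smallish multiple of work_mem, and only gets whatever is left.
 */
#include <pthread.h>
#include <stddef.h>

#define WORK_MEM        (4UL * 1024 * 1024)     /* per-node minimum, 4MB  */
#define EMERGENCY_CAP   (4 * WORK_MEM)          /* per-node ceiling       */

static size_t emergency_fund = 64UL * 1024 * 1024;   /* shared pool, 64MB */
static pthread_mutex_t fund_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns how many extra bytes were actually granted (possibly zero). */
size_t
request_emergency_mem(size_t already_granted, size_t wanted)
{
    size_t granted = 0;

    pthread_mutex_lock(&fund_lock);
    if (already_granted < EMERGENCY_CAP)
    {
        size_t headroom = EMERGENCY_CAP - already_granted;

        granted = wanted < headroom ? wanted : headroom;
        if (granted > emergency_fund)
            granted = emergency_fund;   /* never promise what we lack */
        emergency_fund -= granted;
    }
    pthread_mutex_unlock(&fund_lock);
    return granted;
}

void
return_emergency_mem(size_t bytes)
{
    pthread_mutex_lock(&fund_lock);
    emergency_fund += bytes;
    pthread_mutex_unlock(&fund_lock);
}

Whether a node draws the fund down optimistically like this, or negotiates through something more centralized, is exactly the policy question at issue in this thread.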
And it should be "hard" for a node to ask for and/or receive the absolute maximum it can get, because we want to keep that in reserve for cases that would otherwise truly be much slower. All in all, this approach shouldn't be too radical a departure from the work_mem model.

I admit that there are significant risks with this approach as a project. It seems like there is a chance that it won't be ambitious enough in the end, because there are so many competing considerations. At the same time, I cannot help but be concerned about how naive we are about how sorting and hashing respond to work_mem.

Anyway, here is how I see the extra/emergency requests working for specific operations:

* For sorts, a non-optimal sort (a sort that asks for more memory) would ask for memory when it first looked like multiple passes would be required. As I pointed out in my Sort vs. Hash talk, that's actually pretty rare these days, because as work_mem doubles, the capacity to do everything in one pass quadruples. You should really never get multiple passes for an external sort -- the economics of doing it that way with any frequency are not likely to be sensible on modern machines.

* For hash join, the "emergency request" logic could be much more sophisticated, and would be much more likely to be used in practice. I think that we'd probably want to worry about the difference between performing a hash join entirely in memory versus having to do some amount of batching (unlike sorting). This would generally be more "eager" than the requests that tuplesort makes, because smaller differences matter much more, much sooner.

* For hash aggregates, we might have "overage requests" from the emergency fund, or something along those lines. The emergency fund might therefore be in deficit (negative) when hash aggregates misbehave, since it cannot "say no" to these requests (hash aggregate will not currently take no for an answer, since it has no emergency spill mechanism). This could limit the overall impact of that happening, and might also provide a useful choke point for new alerting and monitoring systems that can hook into the "central memory management" logic. Hash aggregates might go over work_mem without it really mattering much of the time.

ISTM that there is a similar dilemma to this "sort versus hash" dilemma for maintenance_work_mem tasks: the "CREATE INDEX versus VACUUM" dilemma. We should try to address that as part of this effort. (This dilemma is one reason why I wrote the autovacuum_work_mem patch -- that's not a million miles from the idea of a hash_mem GUC.)

To come up with a real proposal for treating local memory as a shared resource, I think that we need:

* To hear more ideas about keeping things in balance here. How clever should we be?

* To experimentally verify that the cost functions (latency as a function of memory) for things like sorting, merge join, hash join, and hash aggregate are what we think they are.

* To understand how this relates to admission control. The only obvious difference that I can think of is that admission control probably involves queuing when very memory constrained, and isn't limited to caring about memory. I'm not trying to make swapping/OOM impossible here; I'm trying to make it easier to be a Postgres DBA sizing work_mem, and to make it so that DBAs don't have to be stingy with work_mem. The work_mem sizing formulas we sometimes promote (based on max_connections) are probably very limiting in the real world.

* To better understand the role of the optimizer here. Can we get more hash aggregate plans in the first place without risking swapping/OOM? Is it okay that what I propose kind of works against that goal?

* To better understand the role of parallel query here.

* To try to find a way to assess what "better" looks like here, overall. What benchmark might we devise to assess what a good, robust strategy looks like?

I freely admit that my proposal is pretty hand-wavy at this point, but, as I said, I want to at least get the ball rolling.

--
Peter Geoghegan
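To make the point above about one-pass sort capacity concrete, here is a rough back-of-the-envelope model (ignoring constant factors, and assuming quicksorted runs of roughly work_mem bytes and a fixed-size read buffer per input run during the merge):

    run length        ~= M                    (M = work_mem)
    merge fan-in      ~= M / B                (B = per-run merge buffer)
    one-pass capacity ~= (M / B) * M = M^2 / B

Doubling M doubles both the run length and the merge fan-in, so the amount of data sortable with a single merge pass grows roughly quadratically in work_mem; a second pass is only needed once the input exceeds about M^2 / B.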
On 16 November 2017 at 16:38, Peter Geoghegan <pg@bowt.ie> wrote:
> * To understand how this relates to admission control. The only
> obvious difference that I can think of is that admission control
> probably involves queuing when very memory constrained, and isn't
> limited to caring about memory. I'm not trying to make swapping/OOM
> impossible here; I'm trying to make it easier to be a Postgres DBA
> sizing work_mem, and make it so that DBAs don't have to be stingy with
> work_mem. The work_mem sizing formulas we sometimes promote (based on
> max_connections) are probably very limiting in the real world.

I had always imagined that this should be some sort of work_mem_pool. Each plan would declare how much it expects to consume, which I'd thought would be N * work_mem, where N is the number of nodes in the plan that require work_mem. At the start of execution, we'd atomically increment a variable in shared memory that tracks work_mem_pool usage, then check whether that variable is <= work_mem_pool; if so, we start execution, and if not, we add ourselves to some waiters queue and go to sleep, only to be signaled when another plan's execution completes and releases memory back into the pool. We'd then re-check and just go back to sleep if there's still not enough space. Simple plans with no work_mem requirement can probably skip all these checks, which may well keep concurrency up.

I'm just not all that clear on how to handle the case where the plan's memory estimate exceeds work_mem_pool -- it would never get to run. Perhaps everything that requires any memory must wait in that case so that this query can run alone, i.e. special-case this to require the work_mem_pool usage to be 0 before we run. Or maybe it should just be an ERROR? The whole feature could probably be disabled when work_mem_pool is -1, which might be a better option for users who find there's some kind of contention around the memory pool checks.

> I freely admit that my proposal is pretty hand-wavy at this point,
> but, as I said, I want to at least get the ball rolling.

Me too. I might have overlooked some giant roadblock.

I think it's important that the work_mem_pool tracker is consumed at the start of the query rather than when the work_mem-consuming node first runs, as there'd likely be some deadlock-style waiting issue if plans that are part-way through execution start waiting for other plans to complete. That might not be ideal, as we'd be assuming that a plan always consumes all of its work_mems at once, but it seems better than what we have today. Maybe we can improve on it later.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
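A minimal standalone sketch of that scheme (not PostgreSQL code; the names, pool size, and pthread condition variable are stand-ins for a shared-memory counter and a proper waiters queue), including the special case where an oversized plan waits for the pool to drain and then runs alone:

/*
 * Hypothetical sketch of work_mem_pool admission control.  A plan reserves
 * its estimated work_mem need up front; if the pool would overflow, it
 * sleeps until another plan releases memory.
 */
#include <pthread.h>
#include <stddef.h>

static size_t pool_limit = 256UL * 1024 * 1024;   /* work_mem_pool, 256MB */
static size_t pool_used  = 0;
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  pool_cv   = PTHREAD_COND_INITIALIZER;

void
acquire_work_mem(size_t estimate)
{
    pthread_mutex_lock(&pool_lock);
    if (estimate >= pool_limit)
    {
        /* Oversized plan: wait for the pool to drain, then run alone. */
        while (pool_used > 0)
            pthread_cond_wait(&pool_cv, &pool_lock);
    }
    else
    {
        while (pool_used + estimate > pool_limit)
            pthread_cond_wait(&pool_cv, &pool_lock);
    }
    pool_used += estimate;
    pthread_mutex_unlock(&pool_lock);
}

void
release_work_mem(size_t estimate)
{
    pthread_mutex_lock(&pool_lock);
    pool_used -= estimate;
    pthread_cond_broadcast(&pool_cv);   /* wake any queued plans */
    pthread_mutex_unlock(&pool_lock);
}

A real implementation would also need to release the reservation in error paths, so that an aborted query can't leak pool space.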

I have been pondering how to deal with work_mem for a couple of months myself and had very similar thoughts.
As far as I can tell the problem goes beyond work_mem though:
1. There are several hash operations, such as set-ops, hashed subplans, and hash aggregates, which today do not spill at all.
We have partially solved these so far and, once complete, we think the fixes can be pushed into community PG if there is desire for it.
2. We also worry about large objects, which can bloat a backend.
3. Other random allocations I fear I just don’t know about.
4. OSes are chronically poor at trading memory between processes even after the memory is freed, unless it’s returned to the OS in big contiguous chunks.
Just as you have, we have also considered holistic provisioning of work_mem across all consumers, but we find that to be too complex.
Having an “emergency fund” in shared memory is also an option, but I find it too limiting.
Also, this approach is what was done at DB2 when I was there, and it proved cumbersome.
So I’m currently pressing forward with a much more fundamental approach:
Pushing Top Transaction Context and its children into shared memory.
To avoid fragmentation and serialization on latches I have defined the concept of “a context cluster”. The root of the cluster is the sole true allocator of memory. Child contexts allocate blocks as pallocs from the cluster root. Basically memory management goes recursive and children live within the root.
The root (TopTransactionContext) allocates big blocks. e.g. 8MB at a time.
Within a transaction, PG operates as usual, with freelists and all, turning over these same 8MB or allocating more if needed.
But at the end of every transaction big chunks of memory become available to share with other transactions again.
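A stripped-down standalone sketch of that allocation pattern (purely illustrative: malloc() stands in for the shared-memory pool, and none of the real MemoryContext machinery, freelists, or locking is shown). Child contexts carve small allocations out of big blocks owned by the cluster root, so the whole transaction's memory can be handed back in block-sized chunks at the end:

#include <stdlib.h>
#include <stddef.h>

#define CLUSTER_BLOCK_SIZE (8UL * 1024 * 1024)   /* 8MB, as in the proposal */

typedef struct ClusterBlock
{
    struct ClusterBlock *next;
    size_t  used;
    char    data[];                 /* CLUSTER_BLOCK_SIZE bytes follow */
} ClusterBlock;

typedef struct ClusterRoot
{
    ClusterBlock *blocks;           /* all blocks owned by this transaction */
} ClusterRoot;

/* Child contexts allocate through the cluster root rather than the OS. */
void *
cluster_alloc(ClusterRoot *root, size_t size)
{
    ClusterBlock *blk = root->blocks;

    size = (size + 7) & ~(size_t) 7;            /* 8-byte alignment */
    if (size > CLUSTER_BLOCK_SIZE)
        return NULL;                /* oversized requests not handled here */
    if (blk == NULL || blk->used + size > CLUSTER_BLOCK_SIZE)
    {
        blk = malloc(sizeof(ClusterBlock) + CLUSTER_BLOCK_SIZE);
        if (blk == NULL)
            return NULL;
        blk->used = 0;
        blk->next = root->blocks;
        root->blocks = blk;
    }
    void *result = blk->data + blk->used;
    blk->used += size;
    return result;
}

/* At end of transaction, big contiguous chunks go back in one sweep. */
void
cluster_reset(ClusterRoot *root)
{
    ClusterBlock *blk = root->blocks;

    while (blk != NULL)
    {
        ClusterBlock *next = blk->next;
        free(blk);
        blk = next;
    }
    root->blocks = NULL;
}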
A few places where we reparent contexts need to detect that this can’t be done between clusters or into/out of a cluster, and do deep copies if needed, but there are few of those. Come to think of it, all the cases I encountered so far were SFDC-specific…
I’m also moving the e-state from the Portal Heap to the Top Transaction Context.
At the end of the day the assumption is that most transactions only need one block from shared memory, and I can probably pin it to the backend, further reducing contention.
If there is an out-of-memory situation (which should be very rare), there are multiple ways to deal with it. If there is no deadlock, we can simply wait. If there is one, rolling back the transaction that encountered the OOM is the obvious, if not optimal, solution.
Finding the biggest consumer and sending it a signal to back off would be another way to do it.
My goal is to run a backend with 50-100MB with all local caches controlled for size.
Transaction memory, with e-states included, should be sized at 8MB/backend plus a fixed “spill” of some GB.
Yes, this is invasive, and I’m sure I’ll be debugging this for a while given my limited knowledge of the engine. I may yet fail spectacularly.
On the other hand it’s conceptually pretty straight forward.
Cheers
Serge Rielau
SFDC
On Thu, Nov 16, 2017 at 11:50 AM, Serge Rielau <serge@rielau.com> wrote:
> Just as you have, we have also considered holistic provisioning of work_mem across all consumers, but we find that to be too complex.
> Having an “emergency fund” in shared memory is also an option, but I find it too limiting.
I agree.
I think this is basically a planning problem. For example, say we wanted to have work_mem_per_query instead of work_mem_per_node. There is an obvious design: consider memory use as an independent dimension of merit during path generation and comparison (less is better). Discard candidate paths whose memory use exceeds the work_mem_per_query budget unless there are no other alternatives. At the end of planning, pick the cheapest path that survived the memory-budget filter. Now, this has the problem that it would make planning more expensive (because we'd hang on to more paths for longer) but it solves a lot of other problems. If there's no memory pressure, we can use memory like mad even when it doesn't save much, but when we have to pick between using more memory for one part of the plan and using more memory for another part of the plan, the choice that does the best job reducing overall execution time will win. Awesome.
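A hypothetical sketch of what "memory as an independent dimension of merit" could look like, reduced to its bare bones (none of this is the real add_path()/pathnode code; the struct and functions are invented for illustration): during path generation a path is only discarded if some rival is at least as cheap *and* at least as lean, and at the end the cheapest survivor that fits the per-query budget wins, with the leanest path as the fallback when nothing fits.

#include <stdbool.h>

typedef struct SketchPath
{
    double      cost;       /* estimated execution time */
    double      mem_bytes;  /* estimated peak work_mem-style memory */
} SketchPath;

/* The add_path()-time filter: may 'a' make 'b' redundant? */
bool
dominates(const SketchPath *a, const SketchPath *b)
{
    return a->cost <= b->cost && a->mem_bytes <= b->mem_bytes;
}

/* At the end of planning: cheapest path within budget, else leanest path. */
const SketchPath *
choose_final_path(const SketchPath *paths, int npaths, double budget)
{
    const SketchPath *best = NULL;
    const SketchPath *leanest = NULL;

    for (int i = 0; i < npaths; i++)
    {
        const SketchPath *p = &paths[i];

        if (leanest == NULL || p->mem_bytes < leanest->mem_bytes)
            leanest = p;
        if (p->mem_bytes <= budget && (best == NULL || p->cost < best->cost))
            best = p;
    }
    return best ? best : leanest;       /* never fail outright */
}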
We could also do more localized variants of this that don't provide hard guarantees but do tend to avoid squandering resources. I don't think that we can directly incorporate memory use into cost because that will distort the costs of higher-level nodes in the plan tree; cost needs to mean execution time. However, what we could do is refuse to replace a more expensive path in a relation's path list with a cheaper one when the savings are small and the cheaper path uses a lot more memory. That way, you wouldn't replace a nested loop that costs a million units with a hash join that costs 999,999 units but uses a GB of RAM; you'd save the hash join for cases where we think it will help significantly.
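The localized variant could amount to a fuzzy replacement test along these lines (a hypothetical illustration; the 25% memory slack and 10% cost-savings thresholds are made-up numbers):

#include <stdbool.h>

/* Keep the existing path unless the challenger's savings are substantial
 * or its memory appetite is not much worse. */
bool
worth_switching(double old_cost, double old_mem,
                double new_cost, double new_mem)
{
    if (new_cost >= old_cost)
        return false;                   /* challenger isn't cheaper at all */
    if (new_mem - old_mem <= 0.25 * old_mem)
        return true;                    /* cheaper, and not much hungrier */
    /* much hungrier: only switch for substantial savings */
    return (old_cost - new_cost) / old_cost > 0.10;
}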
Yet another thing we could do is to try to get nodes to voluntarily use less than work_mem when possible. This is particularly an issue for sorts. A 2-batch hash join is so much more expensive than a single-batch hash join that it's almost never going to make sense unless we have no realistic alternative, although I suppose a 64-batch hash join might be not that different from a 32-batch hash join. But for sorts, given all Peter's work in this area, I bet there are a lot of sorts that could budget a quarter or less of work_mem and really not be hurt very much. It depends somewhat on how fast and how contended your I/O is, though, which we don't have an especially good way to model. I'm starting to wonder if that sort_mem GUC might be a good idea... use that for sorts, and keep work_mem for everything else.
If we really want to be able to dynamically react to changing memory conditions, what we need is a forest of plans for a given query rather than just one. Pick plan A if memory is limited, otherwise pick B. Or use admission control.
On Fri, Nov 17, 2017 at 3:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Nov 16, 2017 at 11:50 AM, Serge Rielau <serge@rielau.com> wrote:
>> Just as you have, we have also considered holistic provisioning of work_mem across all consumers, but we find that to be too complex.
>> Having an “emergency fund” in shared memory is also an option, but I find it too limiting.
>
> I agree.
>
> I think this is basically a planning problem. For example, say we wanted to have work_mem_per_query instead of work_mem_per_node. There is an obvious design: consider memory use as an independent dimension of merit during path generation and comparison (less is better). Discard candidate paths whose memory use exceeds the work_mem_per_query budget unless there are no other alternatives. At the end of planning, pick the cheapest path that survived the memory-budget filter. Now, this has the problem that it would make planning more expensive (because we'd hang on to more paths for longer) but it solves a lot of other problems. If there's no memory pressure, we can use memory like mad even when it doesn't save much, but when we have to pick between using more memory for one part of the plan and using more memory for another part of the plan, the choice that does the best job reducing overall execution time will win. Awesome.
>
> We could also do more localized variants of this that don't provide hard guarantees but do tend to avoid squandering resources. I don't think that we can directly incorporate memory use into cost because that will distort the costs of higher-level nodes in the plan tree; cost needs to mean execution time. However, what we could do is refuse to replace a more expensive path in a relation's path list with a cheaper one when the savings are small and the cheaper path uses a lot more memory. That way, you wouldn't replace a nested loop that costs a million units with a hash join that costs 999,999 units but uses a GB of RAM; you'd save the hash join for cases where we think it will help significantly.
>
> Yet another thing we could do is to try to get nodes to voluntarily use less than work_mem when possible. This is particularly an issue for sorts. A 2-batch hash join is so much more expensive than a single-batch hash join that it's almost never going to make sense unless we have no realistic alternative, although I suppose a 64-batch hash join might be not that different from a 32-batch hash join. But for sorts, given all Peter's work in this area, I bet there are a lot of sorts that could budget a quarter or less of work_mem and really not be hurt very much. It depends somewhat on how fast and how contended your I/O is, though, which we don't have an especially good way to model. I'm starting to wonder if that sort_mem GUC might be a good idea... use that for sorts, and keep work_mem for everything else.
>
> If we really want to be able to dynamically react to changing memory conditions, what we need is a forest of plans for a given query rather than just one. Pick plan A if memory is limited, otherwise pick B. Or use admission control.
FWIW, the lack of a per-connection and/or global memory limit for work_mem is a major PITA when running a shared and/or large-scale setup.
Currently we are doing a poor job with the work_mem parameter because we don't have a good way to let our customers increase it without also giving them the ability to shoot themselves in the foot.
Even a simple parameter limiting the global total number of work_mem buffers would help here.
--
Vladimir Rusinov
PostgreSQL SRE, Google Ireland
Google Ireland Ltd., Gordon House, Barrow Street, Dublin 4, Ireland
Registered in Dublin, Ireland
Registration Number: 368047
On Fri, Nov 17, 2017 at 7:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Nov 16, 2017 at 11:50 AM, Serge Rielau <serge@rielau.com> wrote:
>> Just as you have, we have also considered holistic provisioning of work_mem across all consumers, but we find that to be too complex.
>> Having an “emergency fund” in shared memory is also an option, but I find it too limiting.
>
> I agree.

Yeah. I suspect that that idea is not ambitious enough to do a lot of what we want, and yet is too ambitious to justify working on given its limited shelf life.

> I think this is basically a planning problem. For example, say we wanted to have work_mem_per_query instead of work_mem_per_node. There is an obvious design: consider memory use as an independent dimension of merit during path generation and comparison (less is better). Discard candidate paths whose memory use exceeds the work_mem_per_query budget unless there are no other alternatives. At the end of planning, pick the cheapest path that survived the memory-budget filter. Now, this has the problem that it would make planning more expensive (because we'd hang on to more paths for longer) but it solves a lot of other problems. If there's no memory pressure, we can use memory like mad even when it doesn't save much, but when we have to pick between using more memory for one part of the plan and using more memory for another part of the plan, the choice that does the best job reducing overall execution time will win. Awesome.

I'd like to hear some opinions on the feasibility of this approach. Does David have anything to say about it, for example?

> We could also do more localized variants of this that don't provide hard guarantees but do tend to avoid squandering resources.

That sounds like independent work, though it could be very useful.

> Yet another thing we could do is to try to get nodes to voluntarily use less than work_mem when possible. This is particularly an issue for sorts. A 2-batch hash join is so much more expensive than a single-batch hash join that it's almost never going to make sense unless we have no realistic alternative, although I suppose a 64-batch hash join might be not that different from a 32-batch hash join. But for sorts, given all Peter's work in this area, I bet there are a lot of sorts that could budget a quarter or less of work_mem and really not be hurt very much. It depends somewhat on how fast and how contended your I/O is, though, which we don't have an especially good way to model. I'm starting to wonder if that sort_mem GUC might be a good idea... use that for sorts, and keep work_mem for everything else.

Right. The ability for sorts to do well with less memory is really striking these days. And though I didn't mean to seriously suggest it, a hash_mem GUC does seem like it solves some significant problems without much risk. I think it should be hash_mem, not sort_mem, because hashing seems more like the special case among operations that consume work_mem, and because sort_mem is already the old name for work_mem that is still accepted as a work_mem alias, and because hash_mem avoids any confusion about whether or not CREATE INDEX uses the new GUC (it clearly does not).

Since I am primarily concerned about the difference in sensitivity to the availability of memory that exists when comparing sorting and hashing, and since a new GUC seems like it would noticeably improve matters, I am beginning to take the idea of writing a hash_mem patch for v11 seriously.

--
Peter Geoghegan
On Fri, Nov 17, 2017 at 8:09 AM, Vladimir Rusinov <vrusinov@google.com> wrote:
> FWIW, the lack of a per-connection and/or global memory limit for work_mem is a major PITA when running a shared and/or large-scale setup.
>
> Currently we are doing a poor job with the work_mem parameter because we don't have a good way to let our customers increase it without also giving them the ability to shoot themselves in the foot.
> Even a simple parameter limiting the global total number of work_mem buffers would help here.

I suspect that we can do better here just by allocating memory more sensibly in a very simple way (something like my hash_mem proposal). The relationship between aggregate memory usage and aggregate throughput is very non-linear. One can imagine giving more memory to hash joins, making each hash join much faster, having the overall effect of *reducing* aggregate memory usage. The DBA can be more generous with memory while actually decreasing aggregate memory usage. This is at least possible with work_mem-consuming operations that involve hashing, like hash join and hash aggregate.

Simple benchmarking tools like pgbench reinforce the idea that meeting throughput requirements is the most important thing, but in reality workloads are usually very bursty. It is often more important to be able to stay on a smaller instance size while maintaining less than excellent (but still acceptable) performance. Again, it's about the economics.

--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
> On Fri, Nov 17, 2017 at 7:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I think this is basically a planning problem. For example, say we wanted to have work_mem_per_query instead of work_mem_per_node. There is an obvious design: consider memory use as an independent dimension of merit during path generation and comparison (less is better). Discard candidate paths whose memory use exceeds the work_mem_per_query budget unless there are no other alternatives. At the end of planning, pick the cheapest path that survived the memory-budget filter. Now, this has the problem that it would make planning more expensive (because we'd hang on to more paths for longer) but it solves a lot of other problems. If there's no memory pressure, we can use memory like mad even when it doesn't save much, but when we have to pick between using more memory for one part of the plan and using more memory for another part of the plan, the choice that does the best job reducing overall execution time will win. Awesome.
>
> I'd like to hear some opinions on the feasibility of this approach.

There is indeed a big planning problem here, but Robert's sketch is missing an important component of it: work_mem is not an output of cost estimates, it is an *input*. For example, we can sort or hash-join in however much memory you want, but it's going to cost different amounts.

I think what we're actually looking for is to find the breakpoints in the cost curve where it thinks it can switch to a different sorting or hashing model, and then to emit paths that assume work_mem just above each of those breakpoints. But the code isn't set up like that now, not as to either planning or execution.

Another problem with formulating it that way is that it suddenly puts a much higher premium on the planner's space estimates being right, which is something I don't have much faith in. For instance, if the planner thinks that 1000kB is just enough to hold a hash table, and then when we run it we find out that we need a bit more space than that, do we really want the executor to switch to a batched join? Probably not, especially not if having set the node's work_mem to 1010kB instead would've let it run to completion without batching. Addressing that discrepancy might be where we need the dynamic "emergency memory request" mechanism that Peter was postulating. But I'm not sure exactly how that works, because at the point where the executor realizes it's about to exceed the original space budget, it generally has little idea how much more it would need in order to avoid spilling the sort to disk or adding another round of batching.

So it's really unclear to me what either the planner or executor API contracts for memory consumption ought to be if we're going to try to do this differently. I agree there's a lot of potential for improvement if we can find a better solution, but we're going to need to put some serious thought into it.

regards, tom lane
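As one illustration of what finding those breakpoints could mean for a hash join (a hypothetical sketch that ignores bucket overhead and skew; nothing like this exists in the planner today): with an estimated inner relation of S bytes, behaviour changes roughly wherever the budget crosses S, S/2, S/4, ..., i.e. wherever the number of batches would double, so those are the budgets a breakpoint-aware planner might emit paths for.

/* Fill 'budgets' with up to 'max_budgets' candidate budgets, largest first. */
int
hash_join_breakpoints(double inner_bytes, double min_budget,
                      double *budgets, int max_budgets)
{
    int     n = 0;
    double  b = inner_bytes;

    while (n < max_budgets && b >= min_budget)
    {
        budgets[n++] = b * 1.05;        /* "just above" the breakpoint */
        b /= 2.0;                       /* one more doubling of nbatch */
    }
    return n;
}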
On Fri, Nov 17, 2017 at 1:22 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> Right. The ability for sorts to do well with less memory is really
> striking these days. And though I didn't mean to seriously suggest it,
> a hash_mem GUC does seem like it solves some significant problems
> without much risk. I think it should be hash_mem, not sort_mem,
> because hashing seems more like the special case among operations that
> consume work_mem, and because sort_mem is already the old name for
> work_mem that is still accepted as a work_mem alias, and because
> hash_mem avoids any confusion about whether or not CREATE INDEX uses
> the new GUC (it clearly does not).

Hmm. I wonder if you are correct that hashing is the special case. Hashing and sorting are of course the two main operations -- but there's materialize and anything else that uses a CTE, and maybe other stuff I'm not thinking about right now. You might be right that hash is the one where it really matters, but it's probably worth a bit more reflection on where it matters most and for what reasons.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Nov 17, 2017 at 3:23 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Hmm. I wonder if you are correct that hashing is the special case.
> Hashing and sorting are of course the two main operations -- but
> there's materialize and anything else that uses a CTE, and maybe other
> stuff I'm not thinking about right now. You might be right that hash
> is the one where it really matters, but it's probably worth a bit more
> reflection on where it matters most and for what reasons.

I'd rather be approximately correct than precisely wrong. I think that the current work_mem model is precisely wrong.

I'm conscious of the fact that we are loath to create new GUCs (I sometimes think that we're a bit too averse to doing so), but maybe there is room for adding a second work_mem-alike GUC. For now, I admit that I am applying fuzzy criteria, and that I could easily have missed an important subtlety. Creating hash_mem instead of sort_mem is a direction that is entirely debatable, and should actually be debated. OTOH, it seems like a real problem that we don't allow hashing to take full advantage of available main memory, and *some* interim solution seems preferable to what we have now.

--
Peter Geoghegan
On Fri, Nov 17, 2017 at 12:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'd like to hear some opinions on the feasibility of this approach.
>
> There is indeed a big planning problem here, but Robert's sketch is
> missing an important component of it: work_mem is not an output of cost
> estimates, it is an *input*. For example, we can sort or hash-join in
> however much memory you want, but it's going to cost different amounts.
>
> I think what we're actually looking for is to find the breakpoints in
> the cost curve where it thinks it can switch to a different sorting
> or hashing model, and then to emit paths that assume work_mem just
> above each of those breakpoints. But the code isn't set up like that
> now, not as to either planning or execution.

It might not be that hard to come up with a model for determining which points on the curve are of interest. It seems easy to do this for sorting, because it's actually a very simple curve. Once you're in single pass territory, and provided you're still using at least a few tens of megabytes of work_mem, the availability of work_mem seems to only make a very small difference (temp file I/O is still essentially sequential, and the logarithmic growth in comparisons as more runs must be merged doesn't really bite you). Perhaps this also means that you can expect to get away with moderately bad estimates there.

> Another problem with formulating it that way is that it suddenly puts
> a much higher premium on the planner's space estimates being right,
> which is something I don't have much faith in. For instance, if the
> planner thinks that 1000kB is just enough to hold a hash table, and
> then when we run it we find out that we need a bit more space than that,
> do we really want the executor to switch to a batched join? Probably not,
> especially not if having set the node's work_mem to 1010kB instead
> would've let it run to completion without batching. Addressing that
> discrepancy might be where we need the dynamic "emergency memory request"
> mechanism that Peter was postulating. But I'm not sure exactly how that
> works, because at the point where the executor realizes it's about to
> exceed the original space budget, it generally has little idea how much
> more it would need in order to avoid spilling the sort to disk or
> adding another round of batching.

I don't know either. I think that it's reasonable for us to make it a goal of the executor to have operations that have a smooth cost function, in order to manage the risk of misestimation well, and to make it a goal to have operations that are otherwise adaptive to misestimation. To a large degree, my abandoned "quicksort with spillover" design from a couple of years ago was written with this in mind (it avoided a sudden discontinuity in the cost function of sort nodes, at the point where you must spill for the first time). Another example of an adaptive operation is "role reversal" for hash joins, where the executor flips the inner and outer side during execution, at a point where it becomes clear that the optimizer had it backwards, estimation-wise. There are probably numerous other things like this that are possible... and maybe even worthwhile.

In summary, I agree that we're going to have big problems if the planner needs to have very accurate estimates to see a real benefit. It seems possible that most of the benefit of "fixing work_mem" comes from avoiding using a woefully inadequate amount of memory where batching was clearly always going to be necessary. There may be limited benefit to preventing batching in the first place. So while there could also be room for an "emergency memory request" mechanism, it's more of a nice-to-have.

> So it's really unclear to me what either the planner or executor API
> contracts for memory consumption ought to be if we're going to try to do
> this differently. I agree there's a lot of potential for improvement if
> we can find a better solution, but we're going to need to put some serious
> thought into it.

The devil is in the details, of course. Vladimir said something about customer issues with sizing work_mem on Google's cloud database service, and it reminded me of my experiences with this while working at Heroku. I tended to hear few complaints about it, but then there'd sometimes be serious customer issues quite suddenly. My theory is that there can be a turning point where demands on work_mem increase, and there are suddenly more group aggregates than hash aggregates (to a lesser extent, there may be fewer hash joins). Now the database is using group aggregates that are quite a bit slower than hash aggregates, while still using approximately the same amount of memory as before. This creates significantly more pressure quite suddenly, because the group aggregates are quite a bit slower, and it takes that much longer for the memory to be released.

I'm mostly concerned about avoiding instability like this. Users greatly value predictable performance.

--
Peter Geoghegan
On Fri, Nov 17, 2017 at 9:22 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> I think that it's reasonable for us to make it a goal of the executor
> to have operations that have a smooth cost function, in order to
> manage the risk of misestimation well, and to make it a goal to have
> operations that are otherwise adaptive to misestimation.

Hash joins are a place where we could have a smoother cost function than we do. When we run out of memory, instead of switching from (say) a single batch to two batches, switch to 64 batches, but initially keep 63 of them in memory and only write the very last one to disk. Every time we again run out of memory, dump another batch to disk. If we end up dumping more than half or so of the batches to disk, switch to an even larger number of batches to make it even more fine-grained. The current system is bad because you jump from spooling NO tuples to a tuplestore to spooling HALF of the inner AND outer tuples to a tuplestore. If the hash table is just a little too big to fit, we could write 1/64 or 2/64 or 3/64 of the inner and outer tuples to a tuplestore instead of HALF of them, which would be a huge win.

That having been said, I think the place where our plans most commonly go wrong is where we incorrectly estimate the number of tuples by multiple orders of magnitude - 100x is common, 1000x is common, a million x is not uncommon, even a billion x is not unheard-of. And I don't think there's any way to make a hash join happy if it thinks it's going to need 1 batch and it ends up needing a million batches. At that point, even if the cost function is very smooth, you've moved so far along the curve that you're probably not in a good place. So, while I think that smoothing out the cost functions is a good idea, I think we also need to consider what more can be done to improve the estimates - and especially to avoid estimates that are off by huge multiples.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
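A standalone sketch of that incremental spilling policy (not the real nodeHash.c; repartitioning of already-hashed tuples when the batch count doubles is elided, and spill_batch_to_disk() is an imaginary placeholder): only the highest-numbered in-memory batch is pushed to disk on each overflow, and the batch count is widened again once at least half the batches have been spilled.

#include <stdbool.h>

#define INITIAL_FANOUT 64

typedef struct BatchState
{
    int     nbatch;         /* current number of batches */
    int     first_on_disk;  /* batches >= this index live on disk */
} BatchState;

void
batches_init(BatchState *bs)
{
    bs->nbatch = 1;
    bs->first_on_disk = 1;  /* nothing on disk yet */
}

/* Called whenever the in-memory hash table exceeds its budget. */
void
batches_handle_overflow(BatchState *bs)
{
    if (bs->nbatch == 1)
    {
        /* First overflow: go straight to a fine-grained split ... */
        bs->nbatch = INITIAL_FANOUT;
        bs->first_on_disk = INITIAL_FANOUT;
    }
    else if (bs->first_on_disk <= bs->nbatch / 2)
    {
        /* At least half already spilled: split more finely. */
        bs->nbatch *= 2;
        bs->first_on_disk *= 2;
    }

    /* ... and evict just one more batch's tuples to a tape. */
    bs->first_on_disk--;
    /* spill_batch_to_disk(bs->first_on_disk) would go here */
}

/* Tuples hashing to batches >= first_on_disk get written out, not kept. */
bool
batch_is_in_memory(const BatchState *bs, int batchno)
{
    return batchno < bs->first_on_disk;
}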
On Tue, Nov 21, 2017 at 7:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Hash joins are a place where we could have a smoother cost function
> than we do.

Yes, it definitely is.

> When we run out of memory, instead of switching from
> (say) a single batch to two batches, switch to 64 batches, but
> initially keep 63 of them in memory and only write the very last one
> to disk. Every time we again run out of memory, dump another batch to
> disk. If we end up dumping more than half or so of the batches to
> disk, switch to an even larger number of batches to make it even more
> fine-grained.

That could work.

> That having been said, I think the place where our plans most commonly
> go wrong is where we incorrectly estimate the number of tuples by
> multiple orders of magnitude - 100x is common, 1000x is common, a
> million x is not uncommon, even a billion x is not unheard-of. And I
> don't think there's any way to make a hash join happy if it thinks
> it's going to need 1 batch and it ends up needing a million batches.

What about dynamic role reversal? That could make a big difference.

> At that point, even if the cost function is very smooth, you've moved so far
> along the curve that you're probably not in a good place. So, while I
> think that smoothing out the cost functions is a good idea, I think we
> also need to consider what more can be done to improve the estimates -
> and especially to avoid estimates that are off by huge multiples.

I agree that it would be enormously valuable if we could make estimates much better, so I think that I understand why you emphasize it. But I don't think that there are any good ideas for improving join selectivity that don't involve expert DBA knowledge, or novel/risky techniques for feedback to the system about column redundancy/correlation, etc. These do not seem like scalable approaches, and so they don't particularly appeal to me as projects. I'd be happy to be shown to be wrong about this.

OTOH, techniques like dynamic role reversal, for when there are many batches and it's faster to flip the outer and inner side, do seem promising. It's probably possible to come up with a more or less unambiguous improvement, without layering on complexity. I suspect that this technique is widely implemented, and will cut down on cases leading to terrible performance to a significant degree. I should try to talk Thomas into working on it.

--
Peter Geoghegan
On Tue, Nov 21, 2017 at 5:38 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> That having been said, I think the place where our plans most commonly
>> go wrong is where we incorrectly estimate the number of tuples by
>> multiple orders of magnitude - 100x is common, 1000x is common, a
>> million x is not uncommon, even a billion x is not unheard-of. And I
>> don't think there's any way to make a hash join happy if it thinks
>> it's going to need 1 batch and it ends up needing a million batches.
>
> What about dynamic role reversal? That could make a big difference.

In the best case it's great, but it looks to me like there are a lot of thorny problems. For example, imagine

    giant_table INNER JOIN bigger_than_we_thought

The latter table will be chosen as the inner table and that won't work out very well, but there's no way to know whether switching the sides will be any better except to try reading a bunch of rows from giant_table and seeing whether it turns out to be a lot smaller than we thought. To do that, we'll need to dump the hash table we started to build on the original inner side out to disk so that we can free up enough work_mem to try building a hash table on the other side. When the giant table turns out to actually be giant, we'll need to go back to the original plan, which means dumping out the tuples from the second hash table and reloading the tuples from the first one. So we end up just doing a bunch of extra work for nothing. I think that this scenario - wasting effort trying to switch the sides only to give up - will happen frequently.

In the multi-batch case, there seems to be a little more hope of doing something clever. We're anyway writing most of both inputs out to tapes. If we were willing to write ALL of both inputs out to tapes, then we could decide - perhaps even separately for each batch - which side to load into the hash table. Of course, that adds a lot of incremental I/O unless the number of batches is large (e.g. if we had only 4 batches, writing 4/4 of the data instead of 3/4 is a 33% increase, but if we had 64 batches, writing 64/64 of the data instead of 63/64 doesn't matter a lot, probably). And it leaves out a few important details, like the fact that what fits in the hash table is used to choose the number of batches in the first place, and that we write the whole of one side to tapes before starting on the other side. I don't know how to handle those problems but it seems like it might be possible to come up with something clever, at least for certain cases.

> I agree that it would be enormously valuable if we could make
> estimates much better, so I think that I understand why you emphasize
> it. But I don't think that there are any good ideas for improving
> join selectivity that don't involve expert DBA knowledge, or
> novel/risky techniques for feedback to the system about column
> redundancy/correlation, etc. These do not seem like scalable
> approaches, and so they don't particularly appeal to me as projects.
> I'd be happy to be shown to be wrong about this.

Yeah, I agree that it's a hard problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Nov 26, 2017 at 3:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Nov 21, 2017 at 5:38 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>>> That having been said, I think the place where our plans most commonly
>>> go wrong is where we incorrectly estimate the number of tuples by
>>> multiple orders of magnitude - 100x is common, 1000x is common, a
>>> million x is not uncommon, even a billion x is not unheard-of. And I
>>> don't think there's any way to make a hash join happy if it thinks
>>> it's going to need 1 batch and it ends up needing a million batches.
>>
>> What about dynamic role reversal? That could make a big difference.
>
> In the best case it's great, but it looks to me like there are a lot
> of thorny problems.

There are loads of inter-related topics discussed in this thread, including some operator-specific stuff like the above, and some more general stuff, all requiring more research. In the meantime, I wonder if there are some simpler incremental improvements we could consider.

Since work_mem currently acts as a kind of per executor node instance limit, the system-wide peak memory usage could be described as number of concurrent queries * number of executor nodes * number of parallel participants * work_mem. In the past I think the number of executor nodes was practically anchored to the ground by the number of relations in the query (not necessarily linearly, but not far off it), and the number of parallel participants was one. With the advent of parallel query we have this new multiplying term, and with the advent of partitions and partition-wise join we have exciting new ways to explode the number of executor nodes when the user only explicitly named a few relations.

We could imagine various levels of memory budgeting:

1. work_mem_per_system (global budget).
2. work_mem_per_query (work_mem somehow shared out between executor nodes).
3. Per planned executor node budget (workers get a fraction of work_mem for each node).
4. What we have today: per executor node instance budget (workers get to use work_mem for each node).

1 and 2 seem like they might be boil-the-ocean problems. But as far as I know moving from 4 to 3 would merely require warming up a minor lake. That would take out one of the multipliers, and would remove a perverse incentive from any potential cost-based parallel degree choosing algorithms (you can print free memory by adding new workers).

Parallel Hash either combines the memory budgets of all participants to make one large no-partition hash table, or partitions the inner relation into work_mem sized batches and loads several of them into memory at the same time (at most one per participant). Either way the total memory usage is participants * work_mem, consistent with policy 4 and consistent with the total budget given to an equivalent parallel-oblivious hash join, sort-merge join or any other node.

If we switched to policy 3 and (say) work_mem were somehow automagically adjusted to be divided by the number of participants at planning and execution time, then Parallel Hash wouldn't have to change at all to conform to the policy. It would use at most work_mem per Parallel Hash node, no matter how many workers and no matter which of its strategies it picked (either it receives a budget of work_mem / participants, and then multiplies it by participants to create a no-partition hash table combining the participants' budgets, or it lets each participant chew on smaller hash tables adding up to at most work_mem). Just the same total per-node budget as any other executor node gets.

--
Thomas Munro
http://www.enterprisedb.com
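The difference between policies 3 and 4 reduces to a one-line budget calculation per node (a hypothetical sketch; "participants" means leader plus workers, assumed to be at least 1), with Parallel Hash simply recombining the participants' shares under policy 3:

#include <stddef.h>

/* Policy 4: every participant gets the full work_mem for this node. */
size_t
node_budget_policy4(size_t work_mem, int participants)
{
    (void) participants;
    return work_mem;                /* total usage: work_mem * participants */
}

/* Policy 3: the node as a whole gets work_mem, split across participants. */
size_t
node_budget_policy3(size_t work_mem, int participants)
{
    return work_mem / participants; /* total usage: at most work_mem */
}

/* Parallel Hash under policy 3: recombine the shares into one big table. */
size_t
parallel_hash_table_budget_policy3(size_t work_mem, int participants)
{
    return node_budget_policy3(work_mem, participants) * participants;
}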
On Fri, Feb 23, 2018 at 6:06 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> If we switched to policy 3 and (say) work_mem were somehow
> automagically adjusted to be divided by the number of participants at
> planning and execution time, then Parallel Hash wouldn't have to
> change at all to conform to the policy. It would use at most work_mem
> per Parallel Hash node, no matter how many workers and no matter which
> of its strategies it picked (either it receives a budget of work_mem /
> participants, and then multiplies it by participants to create a
> no-partition hash table combining the participants' budgets, or it
> lets each participant chew on smaller hash tables adding up to at most
> work_mem). Just the same total per-node budget as any other executor
> node gets.

That's true, but what you'd have instead is a whole lot of additional planning overhead. Right now, if we choose to do a merge join or a parallel-oblivious hash join or a nested loop with a materialize node on the inner side, we can join the parallel-aware path on the outer side to the same parallel-oblivious path on the inner side that we would use if we decided against parallel query altogether. If you wanted all of the copies of a node across all parallel participants to stick to work_mem as a budget, then you'd need one set of paths for each rel planned with the default work_mem setting and a second set planned with less work_mem. And if you imagine a future where we create various paths for the same relation with various different numbers of workers, then you'd need to have even more different sets of paths for each relation.

If we're OK with making planning more expensive to solve this problem, then I think we should forget about #3 and go straight to #2. What we would do is just teach add_path() that "amount of memory used" is another independent dimension of merit, so that a more expensive plan might be kept if it uses less memory. Then if at the end of planning you want to pick the fastest plan that uses less than X amount of memory, or if you want to pick the plan for which weight * cost + weight * memory usage is minimal, or whatever it is you want, you can.

I think the only one from your list that's really boil-the-ocean is #1. For that one, you presumably need to create multiple plans and switch between them based on how much memory is available right now and maybe how much you think will be available in the near future, and I guess impose some kind of admission control when system memory gets too low...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company