Re: [HACKERS] Parallel Hash take II - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: [HACKERS] Parallel Hash take II
Msg-id: CA+TgmoY7pqs6KfArrH3WdidPr9KP43cyaseMt9kaNAYfhE3xgQ@mail.gmail.com
In response to: Re: [HACKERS] Parallel Hash take II (Thomas Munro <thomas.munro@enterprisedb.com>)
List: pgsql-hackers
On Thu, Sep 14, 2017 at 10:01 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> 3. Gather Merge and Parallel Hash Join may have a deadlock problem.
> Since Gather Merge needs to block waiting for tuples, but workers wait
> for all participants (including the leader) to reach barriers. TPCH
> Q18 (with a certain set of indexes and settings, YMMV) has Gather
> Merge over Sort over Parallel Hash Join, and although it usually runs
> successfully I have observed one deadlock. Ouch. This seems to be a
> more fundamental problem than the blocked TupleQueue scenario. Not
> sure what to do about that.

Thomas and I spent about an hour and a half brainstorming about this just now.

Parallel query doesn't really have a documented deadlock-avoidance strategy, yet all committed and proposed patches other than this one manage to avoid deadlock. This one has had a number of problems crop up in this area, so it struck me that it might be violating a rule which every other patch was following. I struggled for a bit and finally managed to articulate what I think the deadlock-avoidance rule is that the other committed and proposed patches generally follow:

<rule>
Once you enter a state in which other participants might wait for you, you must exit that state before doing anything that might wait for another participant.
</rule>

From this, it's easy to see that the waits-for graph can't contain any cycles: if every parallel query node obeys the above rule, then at any given moment a given node can have in-arcs or out-arcs, but not both.

I also believe that every existing node follows this rule. For instance, Gather and Gather Merge wait for workers, but they aren't at that point doing anything that can make the workers wait for them. Parallel Bitmap Heap Scan waits for the leader to finish building the bitmap, but the leader never waits for anyone else while building the bitmap. Parallel Index(-Only) Scan waits for the process advancing the scan to reach the next page, but that process never waits for any other process while doing so. Other types of parallel nodes -- including the proposed Parallel Append node, which is an interesting case because, like Parallel Hash, it appears in the "middle" of the parallel portion of the plan tree rather than at the root like Gather or at the leaves like a parallel scan -- don't wait at all, except for short spinlock-protected or LWLock-protected critical sections, during which they surely don't go into any sort of long-term wait (which would be unacceptable for other reasons anyway).

Parallel Hash violates this rule only in the case of a multi-batch hash join, and for only one reason: to avoid blowing out work_mem. Since, consistent with resource management decisions elsewhere, each participant is entitled to an amount of memory equal to work_mem, the shared hash table can and does use up to (participants * work_mem), which means that we must wait for everybody to be done with the hash table for batch N before building the hash table for batch N+1. More precisely, if the hash table for the current batch happens to be smaller than the absolute maximum amount of memory we can use, we can build the hash table for the next batch up to the point where all the memory is used, but must then pause and wait for the old hash table to go away before continuing.
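To make that budget concrete, here is a minimal standalone sketch in C. It is not code from the patch -- the names HashMemBudget, max_budget, and can_grow_next_batch are invented for illustration -- but it models the arithmetic described above: the ceiling is participants * work_mem, and the next batch's table may grow only while the old and new tables together fit under that ceiling; otherwise the loading process must wait for the old table to be freed.

#include <stdbool.h>
#include <stddef.h>

/*
 * Illustrative only: a rough model of the memory budget described
 * above, with invented names; not taken from any PostgreSQL code.
 */
typedef struct HashMemBudget
{
    size_t work_mem;       /* per-participant budget, in bytes */
    int    participants;   /* number of cooperating processes */
    size_t old_batch_size; /* bytes still held by the current batch */
    size_t new_batch_size; /* bytes loaded so far for the next batch */
} HashMemBudget;

/* Absolute ceiling: every participant contributes its work_mem. */
static size_t
max_budget(const HashMemBudget *b)
{
    return b->work_mem * (size_t) b->participants;
}

/*
 * Can we add another tuple_size bytes to the next batch's hash table
 * right now?  If not, the caller must wait until the current batch's
 * table has been discarded by everyone -- and that wait is precisely
 * the one that can deadlock if the process we are waiting for is
 * itself blocked on us.
 */
static bool
can_grow_next_batch(const HashMemBudget *b, size_t tuple_size)
{
    return b->old_batch_size + b->new_batch_size + tuple_size
           <= max_budget(b);
}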
But that means that the process for which we are waiting has violated the rule mentioned above: by not being done with the memory, it's making other processes wait, and by returning a tuple, it's allowing other parts of the executor to do arbitrary computations which can themselves wait. So, kaboom.

One simple and stupid way to avoid this deadlock is to reduce the memory budget for the shared hash table to work_mem and remove the barriers that prevent more than one such hash table from existing at a time. In the worst case, we still use (participants * work_mem), and frequently we'll use less, but there are no longer any waits for processes that might not even be running the Parallel Hash node (ignoring for the moment the problem of right and full parallel hash joins, which might need more thought). So no deadlock.

We can do better. First, as long as nbatches == 1, we can use a hash table of up to size (participants * work_mem); if we have to switch to multiple batches, then just increase the number of batches enough that the current memory usage drops below work_mem. Second, following an idea originally due to Ashutosh Bapat, whose relevance to this issue Thomas Munro realized during our discussion, we can make all the batches small enough to fit in work_mem (rather than participants * work_mem, as the current patch does) and spread them across the workers (in the style of Parallel Append, including potentially deploying multiple workers against the same batch if there are fewer batches than workers). Then single-batch parallel hash joins always use the maximum allowable memory, and multi-batch parallel hash joins use the maximum allowable memory after the first batch. Not perfect, but not bad, and definitely better than deadlocking. Further refinements might be possible; a rough sketch of the batch-spreading idea appears below.

If we don't adopt some approach along these lines, then I think we've got to articulate some alternative deadlock-avoidance rule and make sure every parallel query facility follows it. I welcome ideas on that front, but I don't think the rule mentioned above is a bad one, and I'd obviously like to minimize the amount of rework that we need to do.

Assuming we do settle on the above rule, it clearly needs to be documented someplace -- not sure of the place. I don't think it belongs in README.parallel, because it's an executor-specific rule, not necessarily a general rule to which other users of parallelism must adhere; they can choose their own strategies.
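To illustrate the batch-spreading idea, here is a rough standalone sketch in C. It is not code from any patch -- SharedBatchState and choose_batch are invented names -- and it ignores the locking a real shared-memory implementation would need, but it shows the intended shape: each batch is sized to fit in work_mem, participants claim batches independently, and when there are fewer batches than workers, extra workers gang up on the least-loaded batch in the style of Parallel Append, so nobody ever has to wait for another participant before making progress.

#include <stdbool.h>

#define MAX_BATCHES 64          /* arbitrary bound for the sketch */

/*
 * Illustrative only: invented names; the locking a real shared-memory
 * implementation would need is omitted for brevity.
 */
typedef struct SharedBatchState
{
    int  nbatches;                 /* total number of batches */
    bool claimed[MAX_BATCHES];     /* has some worker taken this batch? */
    int  nworkers_on[MAX_BATCHES]; /* workers currently on each batch */
} SharedBatchState;

/*
 * Pick the next batch for a participant: prefer an unclaimed batch;
 * if every batch is already claimed (fewer batches than workers),
 * join the least-loaded one instead of waiting.
 */
static int
choose_batch(SharedBatchState *s)
{
    int best = 0;

    for (int i = 0; i < s->nbatches; i++)
    {
        if (!s->claimed[i])
        {
            s->claimed[i] = true;
            s->nworkers_on[i]++;
            return i;
        }
    }

    for (int i = 1; i < s->nbatches; i++)
    {
        if (s->nworkers_on[i] < s->nworkers_on[best])
            best = i;
    }
    s->nworkers_on[best]++;
    return best;
}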
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company