Re: asynchronous and vectorized execution - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: asynchronous and vectorized execution
Msg-id: CA+TgmoacgPkQv9qE1EYpXFwt+JMMxbwqM9_Vax_8N3anBMbRbg@mail.gmail.com
In response to: Re: asynchronous and vectorized execution (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Responses: Re: asynchronous and vectorized execution
List: pgsql-hackers
On Mon, Aug 29, 2016 at 4:08 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> [ new patches ]

+        /*
+         * We assume that few nodes are async-aware and async-unaware
+         * nodes cannot be revserse-dispatched from lower nodes that is
+         * async-aware. Firing of an async node that is not a descendant
+         * of the planstate will cause such reverse-diaptching to
+         * async-aware nodes, which is unexpected behavior for them.
+         *
+         * For instance, consider an async-unaware Hashjoin(OUTER, INNER)
+         * where the OUTER is running asynchronously but the Hashjoin is
+         * waiting on the async INNER during inner-hash creation. If the
+         * OUTER fires for the case, since anyone is waiting on it,
+         * ExecAsyncWaitForNode finally dispatches to the Hashjoin which
+         * is now in the middle of thing its work.
+         */
+        if (!IsParent(planstate, node))
+            continue;

I'm not entirely sure that I understand this comment, but I don't think it's going in the right direction. Let's start with the example in the second paragraph. If the hash join is async-unaware, then it isn't possible for the hash join to be both running the outer side of the join asynchronously and at the same time waiting on the inner side. Once it tries to pull the first tuple from the outer side, it's waiting for that to finish and can't do anything else. So the inner side can't possibly get touched in any way until the outer side finishes. For anything else to happen, the hash join would have to be async-aware. Even if we did that, I don't think it would be right to kick off both sides of the hash join at the same time: right now, if the outer side turns out to be empty, we never need to build the hash table at all, and that's good.

I don't think it's a good idea to wait only for nodes that are in the current subtree. For example, consider a plan like this:

Append
-> Foreign Scan on a
-> Hash Join
   -> Foreign Scan on b
   -> Hash
      -> Seq Scan on x

Suppose Append and Foreign Scan are async-aware but the other nodes are not.
Append kicks off the Foreign Scan on a and then waits for the hash join to produce a tuple; the hash join kicks off the Foreign Scan on b and waits for it to return a tuple. If, while we're waiting for the foreign scan on b, the foreign scan on a needs some attention - either to produce tuples, or maybe just to call PQconsumeInput() so that more data can be sent from the other side - I think we need to be able to do that. There's no real problem here: even if the Append becomes result-ready before the hash join returns, that is fine. We will not actually be able to return from the Append until the hash join returns, because of what's on the call stack, but that doesn't mean the Append can't be marked result-ready sooner than that. The situation can be improved by making the hash join node async-aware, but even if we don't do that, nothing is broken.

I think the reason that you originally got backed into this design was because of problems with reentrancy. I don't think I really understand in any detail exactly what problem you hit there, but it seems to me that the following problem could occur: ExecAsyncWaitForNode finds two events and schedules two callbacks. It calls the first of those two callbacks. Before that callback returns, it again calls ExecAsyncWaitForNode. But the new invocation of ExecAsyncWaitForNode doesn't know that there is a second callback pending, so it somehow gets confused.

However, I think this problem can be fixed using a different method. The occurred_event and callbacks arrays defined by ExecAsyncWaitForNode can be made part of the EState rather than being local variables. When ExecAsyncWaitForNode is called, it checks whether there are any pending callbacks; if so, it removes and calls the first one. Only if there are no pending callbacks does it actually wait; when a wait event fires, one or more new callbacks are generated.
This is very similar to the reason why ReceiveSharedInvalidMessages uses a static messages array rather than a normal local variable; that function is solving a problem which I suspect is very similar to the one we have here. However, it would be helpful if you could provide some more details on what you think the reentrancy problems are, because I'm not able to figure them out from your messages so far.

The most mysterious part of this hunk to me is the comment that "Firing of an async node that is not a descendant of the planstate will cause such reverse-diaptching to async-aware nodes, which is unexpected behavior for them." It is the async-unaware nodes which might have a problem; the nodes that have been taught about the new system should know what to expect.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company