Re: Asynchronous Append on postgres_fdw nodes. - Mailing list pgsql-hackers
From | Kyotaro Horiguchi |
---|---|
Subject | Re: Asynchronous Append on postgres_fdw nodes. |
Date | |
Msg-id | 20210115.165433.1389870179026901348.horikyota.ntt@gmail.com Whole thread Raw |
In response to | Re: Asynchronous Append on postgres_fdw nodes. (Etsuro Fujita <etsuro.fujita@gmail.com>) |
Responses |
Re: Asynchronous Append on postgres_fdw nodes.
|
List | pgsql-hackers |
At Sat, 19 Dec 2020 17:55:22 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in > On Mon, Dec 14, 2020 at 4:01 PM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > At Sat, 12 Dec 2020 18:25:57 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in > > > On Fri, Nov 20, 2020 at 3:51 PM Kyotaro Horiguchi > > > <horikyota.ntt@gmail.com> wrote: > > > > The reason for > > > > the early fetching is letting fdw send the next request as early as > > > > possible. (However, I didn't measure the effect of the > > > > nodeAppend-level prefetching.) > > > > > > I agree that that would lead to an improved efficiency in some cases, > > > but I still think that that would be useless in some other cases like > > > SELECT * FROM sharded_table LIMIT 1. Also, I think the situation > > > would get worse if we support Append on top of joins or aggregates > > > over ForeignScans, which would be more expensive to perform than these > > > ForeignScans. > > > > I'm not sure which gain we weigh on, but if doing "LIMIT 1" on Append > > for many times is more common than fetching all or "LIMIT <many > > multiples of fetch_size>", that discussion would be convincing... Is > > it really the case? > > I don't have a clear answer for that... Performance in the case you > mentioned would be improved by async execution without prefetching by > Append, so it seemed reasonable to me to remove that prefetching to > avoid unnecessary overheads in the case I mentioned. BUT: I started > to think my proposal, which needs an additional FDW callback routine > (ie, ForeignAsyncBegin()), might be a bad idea, because it would > increase the burden on FDW authors. I agree on the point of developers' burden. > > > If we do prefetching, I think it would be better that it’s the > > > responsibility of the FDW to do prefetching, and I think that that > > > could be done by letting the FDW to start another data fetch, > > > independently of the core, in the ForeignAsyncNotify callback routine, > > > > FDW does prefetching (if it means sending request to remote) in my > > patch, so I agree to that. It suspect that you were intended to say > > the opposite. The core (ExecAppendAsyncGetNext()) controls > > prefetching in your patch. > > No. That function just tries to retrieve a tuple from any of the > ready subplans (ie, subplans marked as as_needrequest). Mmm. I meant that the function explicitly calls ExecAppendAsyncRequest(), which finally calls fetch_more_data_begin() (if needed). Conversely if the function dosn't call ExecAppendAsyncRequsest, the next request to remote doesn't happen. That is, after the tuple buffer of FDW-side is exhausted, the next request doesn't happen until executor requests for the next tuple. You seem to be saying that "postgresForeignAsyncRequest() calls fetch_more_data_begin() following its own decision." but this doesn't seem to be "prefetching". > > > which I revived from Robert's original patch. I think that that would > > > be more efficient, because the FDW would no longer need to wait until > > > all buffered tuples are returned to the core. In the WIP patch, I > > > > I don't understand. My patch sends a prefetch-query as soon as all the > > tuples of the last remote-request is stored into FDW storage. The > > reason for removing ExecAsyncNotify() was it is just redundant as far > > as concerning Append asynchrony. But I particulary oppose to revive > > the function. > > Sorry, my explanation was not good, but what I'm saying here is about > my patch, not your patch. I think this FDW callback routine would be > useful; it allows an FDW to perform another asynchronous data fetch > before delivering a tuple to the core as discussed in [1]. Also, it > would be useful when extending to the case where we have intermediate > nodes between an Append and a ForeignScan such as joins or aggregates, > which I'll explain below. Yeah. If a not-immediate parent of an async-capable node works as async-aware, the notify API would have the power. So I don't object to the API. > > > only allowed the callback routine to put the corresponding ForeignScan > > > node into a state where it’s either ready for a new request or needing > > > a callback for another data fetch, but I think we could probably relax > > > the restriction so that the ForeignScan node can be put into another > > > state where it’s ready for a new request while needing a callback for > > > the prefetch. > > > > I don't understand this, too. ExecAsyncNotify() doesn't touch any of > > the bitmaps, as_needrequest, callback_pending nor as_asyncpending in > > your patch. Am I looking into something wrong? I'm looking > > async-wip-2020-11-17.patch. > > In the WIP patch I post, these bitmaps are modified in the core side > based on the callback_pending and request_complete flags in > AsyncRequests returned from FDWs (See ExecAppendAsyncEventWait()). Sorry. I think I misread you here. I agree that, the notify API is not so useful now but would be useful if we allow notify descendents other than immediate children. However, I stumbled on the fact that some kinds of node doesn't return a result when all the underlying nodes returned *a* tuple. Concretely count(*) doesn't return after *all* tuple of the counted relation has been returned. I remember that the fact might be the reason why I removed the API. After all the topmost async-aware node must ask every immediate child if it can return a tuple. > > (By the way, it is one of those that make the code hard to read to me > > that the "callback" means "calling an API function". I think none of > > them (ExecAsyncBegin, ExecAsyncRequest, ExecAsyncNotify) are a > > "callback".) > > I thought the word “callback” was OK, because these functions would > call the corresponding FDW callback routines, but I’ll revise the > wording. I'm not confident on the usage of "callback", though:p (Sorry.) I believe that "callback" is a function a caller tells a callee to call it. In broader meaning, all FDW APIs are a function that an FDW extention tells the core to call it (yeah, that's inversed.). However, we don't call fread a callback of libc. They work based on slightly different mechanism but substantially the same, I think. > > > The reason why I disabled async execution when executing EPQ is to > > > avoid sending asynchronous queries to the remote sides, which would be > > > useless, because scan tuples for an EPQ recheck are obtained in a > > > dedicated way. > > > > If EPQ is performed onto Append, I think it should gain from > > asynchronous execution since it is going to fetch *a* tuple from > > several partitions or children. I believe EPQ doesn't contain Append > > in major cases, though. (Or I didn't come up with the steps for the > > case to happen...) > > Sorry, I don’t understand this part. Could you elaborate a bit more on it? EPQ retrieves a specific tuple from a node. If we perform EPQ on an Append, only one of the children should offer a result tuple. Since Append has no idea of which of its children will offer a result, it has no way other than asking all children until it receives a result. If we do that, asynchronously sending a query to all nodes would win. > > > What do you mean by "push-up style executor"? > > > > The reverse of the volcano-style executor, which enters from the > > topmost node and down to the bottom. In the "push-up stule executor", > > the bottom-most nodes fires by a certain trigger then every > > intermediate nodes throws up the result to the parent until reaching > > the topmost node. > > That is what I'm thinking to be able to support the case I mentioned > above. I think that that would allow us to find ready subplans > efficiently from occurred wait events in ExecAppendAsyncEventWait(). > Consider a plan like this: > > Append > -> Nested Loop > -> Foreign Scan on a > -> Foreign Scan on b > -> ... > > I assume here that Foreign Scan on a, Foreign Scan on b, and Nested > Loop are all async-capable and that we have somewhere in the executor > an AsyncRequest with requestor="Nested Loop" and requestee="Foreign > Scan on a", an AsyncRequest with requestor="Nested Loop" and > requestee="Foreign Scan on b", and an AsyncRequest with > requestor="Append" and requestee="Nested Loop". In > ExecAppendAsyncEventWait(), if a file descriptor for foreign table a > becomes ready, we would call ForeignAsyncNotify() for a, and if it > returns a tuple back to the requestor node (ie, Nested Loop) (using > ExecAsyncResponse()), then *ForeignAsyncNotify() would be called for > Nested Loop*. Nested Loop would then call ExecAsyncRequest() for the > inner requestee node (ie, Foreign Scan on b; I assume here that it is > a foreign scan parameterized by a). If Foreign Scan on b returns a > tuple back to the requestor node (ie, Nested Loop) (using > ExecAsyncResponse()), then Nested Loop would match the tuples from the > outer and inner sides. If they match, the join result would be > returned back to the requestor node (ie, Append) (using > ExecAsyncResponse()), marking the Nested Loop subplan as > as_needrequest. Otherwise, Nested Loop would call ExecAsyncRequest() > for the inner requestee node for the next tuple, and so on. If > ExecAsyncRequest() can't return a tuple immediately, we would wait > until a file descriptor for foreign table b becomes ready; we would > start from calling ForeignAsyncNotify() for b when the file descriptor > becomes ready. In this way we could find ready subplans efficiently > from occurred wait events in ExecAppendAsyncEventWait() when extending > to the case where subplans are joins or aggregates over Foreign Scans, > I think. Maybe I’m missing something, though. Maybe so. As I mentioned above, in the follwoing case.. Join -1 Join -2 ForegnScan -A ForegnScan -B ForegnScan -C Where the Join-1 is the leader of asynchronous fetching. Even if both of the FS-A,B have returned one tuple each, it's unsure that Join-2 returns a tuple. I'm not sure how to resolve the situation with the current infrastructure as-is. So I tried a structure where when a node gets a new tuple, the node asks the parent whether it is satisfied or not. In that trial I needed to make every execnodes a state machine and that was pretty messy.. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
pgsql-hackers by date: