Thread: [HACKERS] parallel.c oblivion of worker-startup failures
Some time back, Tom Lane reported [1] the $Subject problem. I have looked into the issue and found that it affects not only parallel workers but the general background worker machinery as well, in situations where fork() or some similar failure occurs.

The first problem is that after we register a dynamic worker, the mechanisms for learning whether the worker has started (WaitForBackgroundWorkerStartup/GetBackgroundWorkerPid) won't give the right answer if a fork failure happens. Also, in cases where the worker is marked not to restart after a crash, the postmaster doesn't notify the backend when it fails to start the worker, which can make the backend wait forever because it is oblivious to the fact that the worker never started.

Apart from these general problems in the background worker machinery, parallel.c assumes that once it has registered the dynamic workers, they will start and perform their work. We need to ensure that if the postmaster is unable to start parallel workers due to a fork failure or some similar situation, the backend doesn't keep waiting forever. To fix it, before waiting for workers to finish, we can check whether the worker exists at all. The attached patch fixes these problems.

Another way to fix the parallel-query part of the problem would be for the master backend to wait for the workers to start, after registering them, before setting up the various queues used to communicate with them. I think that could be quite expensive.

Thoughts?

[1] - https://www.postgresql.org/message-id/4905.1492813727@sss.pgh.pa.us

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment
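For context, here is a minimal sketch (not code from the attached patch; the extension and entry-point names are hypothetical) of the register-and-wait pattern being described, and of where the hang can occur:

#include "postgres.h"
#include "miscadmin.h"
#include "postmaster/bgworker.h"

/* Hypothetical illustration: launch one dynamic worker and wait for it. */
static void
launch_one_worker(void)
{
    BackgroundWorker worker;
    BackgroundWorkerHandle *handle;
    BgwHandleStatus status;
    pid_t       pid;

    memset(&worker, 0, sizeof(worker));
    worker.bgw_flags = BGWORKER_SHMEM_ACCESS;
    worker.bgw_start_time = BgWorkerStart_ConsistentState;
    worker.bgw_restart_time = BGW_NEVER_RESTART;
    snprintf(worker.bgw_name, BGW_MAXLEN, "demo worker");
    snprintf(worker.bgw_library_name, BGW_MAXLEN, "my_extension");     /* hypothetical */
    snprintf(worker.bgw_function_name, BGW_MAXLEN, "my_worker_main");  /* hypothetical */
    worker.bgw_notify_pid = MyProcPid;  /* ask the postmaster to signal us */

    if (!RegisterDynamicBackgroundWorker(&worker, &handle))
        ereport(ERROR, (errmsg("could not register background worker")));

    /*
     * If the postmaster's fork() fails, the slot is forgotten without the
     * registrant being signalled, so (before the fix discussed here) this
     * call could block indefinitely instead of reporting BGWH_STOPPED.
     */
    status = WaitForBackgroundWorkerStartup(handle, &pid);
    if (status != BGWH_STARTED)
        ereport(ERROR, (errmsg("background worker failed to start")));
}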
Amit Kapila <amit.kapila16@gmail.com> writes:
> Attached patch fixes these problems.

Hmm, this patch adds a kill(notify_pid) after one call to ForgetBackgroundWorker, but the postmaster has several more such calls. Shouldn't they all notify the notify_pid? Should we move that functionality into ForgetBackgroundWorker itself, so we can't forget it again?

			regards, tom lane
On Mon, Sep 18, 2017 at 10:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Kapila <amit.kapila16@gmail.com> writes:
>> Attached patch fixes these problems.
>
> Hmm, this patch adds a kill(notify_pid) after one call to ForgetBackgroundWorker, but the postmaster has several more such calls. Shouldn't they all notify the notify_pid? Should we move that functionality into ForgetBackgroundWorker itself, so we can't forget it again?
>

Among other places, we already notify during ReportBackgroundWorkerExit(). However, it seems to me that all the other places, except the one where this patch has added a call to notify, don't need such a call. The other calls, such as the ones in DetermineSleepTime and ResetBackgroundWorkerCrashTimes, are made for a crashed worker, which I don't think requires a notification to the backend because that backend would itself have been restarted. The remaining place where we call ForgetBackgroundWorker is in maybe_start_bgworkers when rw_terminate is set to true, which again is either the case of a worker crash or of someone having explicitly asked to terminate the worker, in which case we already send a notification.

I think we need to notify the backend on start, stop, and failure to start a worker. OTOH, if it is harmless to send a notification even after the worker has crashed, then we can just move that functionality into ForgetBackgroundWorker itself, which will simplify the code; in fact that is the first thing that occurred to me as well, but I didn't do it that way because I wasn't sure we want to send a notification in all kinds of cases.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
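For illustration, the notification being discussed boils down to something like the following sketch (the helper name is made up; this is not the committed code):

#include "postgres.h"
#include <signal.h>
#include "postmaster/bgworker_internals.h"

/*
 * Sketch only: wake the backend that registered a worker whenever its slot
 * is forgotten, so that WaitForBackgroundWorkerStartup() or
 * GetBackgroundWorkerPid() can observe the failure instead of waiting
 * forever.
 */
static void
notify_registrant(RegisteredBgWorker *rw)
{
    if (rw->rw_worker.bgw_notify_pid != 0)
        kill(rw->rw_worker.bgw_notify_pid, SIGUSR1);
}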
On Tue, Sep 19, 2017 at 8:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Mon, Sep 18, 2017 at 10:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Amit Kapila <amit.kapila16@gmail.com> writes: >>> Attached patch fixes these problems. >> >> Hmm, this patch adds a kill(notify_pid) after one call to >> ForgetBackgroundWorker, but the postmaster has several more such calls. >> Shouldn't they all notify the notify_pid? Should we move that >> functionality into ForgetBackgroundWorker itself, so we can't forget >> it again? >> > > Among other places, we already notify during > ReportBackgroundWorkerExit(). However, it seems to me that all other > places except where this patch has added a call to notify doesn't need > such a call. The other places like in DetermineSleepTime and > ResetBackgroundWorkerCrashTimes is called for a crashed worker which I > don't think requires notification to the backend as that backend > itself would have restarted. The other place where we call > ForgetBackgroundWorker is in maybe_start_bgworkers when rw_terminate > is set to true which again seems to be either the case of worker crash > or when someone has explicitly asked to terminate the worker in which > case we already send a notification. > > I think we need to notify the backend on start, stop and failure to > start a worker. OTOH, if it is harmless to send a notification even > after the worker is crashed, then we can just move that functionality > into ForgetBackgroundWorker itself as that will simplify the code and > infact that is the first thing that occurred to me as well, but I > haven't done that way as I was not sure if we want to send > notification in all kind of cases. > The patch still applies (with some hunks). I have added it in CF [1] to avoid losing track. [1] - https://commitfest.postgresql.org/15/1341/ -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Oct 27, 2017 at 9:54 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > The patch still applies (with some hunks). I have added it in CF [1] > to avoid losing track. > > [1] - https://commitfest.postgresql.org/15/1341/ This did not get reviews and the patch still applies. I am moving it to next CF. -- Michael
On Wed, Nov 29, 2017 at 12:11 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Fri, Oct 27, 2017 at 9:54 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> The patch still applies (with some hunks). I have added it in CF [1] to avoid losing track.
>>
>> [1] - https://commitfest.postgresql.org/15/1341/
>
> This did not get reviews and the patch still applies. I am moving it to next CF.

I'd like to break this into two patches, one to fix the general background worker problems and another to fix the problems specific to parallel query.

The attached version extracts the two fixes for general background worker problems from Amit's patch and adjusts the comments a bit more extensively than he did. Barring objections, I'll commit and back-patch this; then we can deal with the other part of this afterward.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On Tue, Dec 5, 2017 at 11:34 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Nov 29, 2017 at 12:11 AM, Michael Paquier > <michael.paquier@gmail.com> wrote: >> On Fri, Oct 27, 2017 at 9:54 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> The patch still applies (with some hunks). I have added it in CF [1] >>> to avoid losing track. >>> >>> [1] - https://commitfest.postgresql.org/15/1341/ >> >> This did not get reviews and the patch still applies. I am moving it to next CF. > > I'd like to break this into two patches, one to fix the general > background worker problems and another to fix the problems specific to > parallel query. > > The attached version extracts the two fixes for general background > worker problems from Amit's patch and adjust the comments a bit more > extensively than he did. > Looks good to me. > Barring objections, I'll commit and > back-patch this; then we can deal with the other part of this > afterward. > Sure, I will rebase on top of the commit unless you have some comments on the remaining part. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Dec 6, 2017 at 1:43 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Looks good to me.

Committed and back-patched to 9.4 where dynamic background workers were introduced.

>> Barring objections, I'll commit and back-patch this; then we can deal with the other part of this afterward.
>
> Sure, I will rebase on top of the commit unless you have some comments on the remaining part.

I'm not in love with that part of the fix; the other parts of that if statement are just testing variables, and a function call that takes and releases an LWLock is a lot more expensive. Furthermore, that test will often be hit in practice, because we'll often arrive at this function before workers have actually finished. On top of that, we'll typically arrive here having already communicated with the worker in some way, such as by receiving tuples, which means that in most cases we already know that the worker was alive at least at some point, and therefore the extra test isn't necessary. We only need that test, if I understand correctly, to cover the failure-to-launch case, which is presumably very rare.

All that having been said, I don't quite know how to do it better. It would be easy enough to modify this function so that if it ever received a result other than BGWH_NOT_YET_STARTED for a worker, it wouldn't call this function again for that worker. However, that's only mitigating the problem, not really fixing it. We'll still end up frequently inquiring - admittedly only once - about the status of a worker with which we've previously communicated via the tuple queue. Maybe we just have to live with that problem, but do you (or does anyone else) have an idea?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > I'm not in love with that part of the fix; the other parts of that if > statement are just testing variables, and a function call that takes > and releases an LWLock is a lot more expensive. Furthermore, that > test will often be hit in practice, because we'll often arrive at this > function before workers have actually finished. On top of that, we'll > typically arrive here having already communicated with the worker in > some way, such as by receiving tuples, which means that in most cases > we already know that the worker was alive at least at some point, and > therefore the extra test isn't necessary. We only need that test, if > I understand correctly, to cover the failure-to-launch case, which is > presumably very rare. Maybe track "worker is known to have launched" in the leader's state someplace? regards, tom lane
On Wed, Dec 6, 2017 at 9:54 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Maybe track "worker is known to have launched" in the leader's state > someplace? That's probably one piece of it, yes. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Dec 6, 2017 at 8:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Dec 6, 2017 at 1:43 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Looks good to me.
>
> Committed and back-patched to 9.4 where dynamic background workers were introduced.
>

Thanks.

>>> Barring objections, I'll commit and back-patch this; then we can deal with the other part of this afterward.
>>
>> Sure, I will rebase on top of the commit unless you have some comments on the remaining part.
>
> I'm not in love with that part of the fix; the other parts of that if statement are just testing variables, and a function call that takes and releases an LWLock is a lot more expensive. Furthermore, that test will often be hit in practice, because we'll often arrive at this function before workers have actually finished. On top of that, we'll typically arrive here having already communicated with the worker in some way, such as by receiving tuples, which means that in most cases we already know that the worker was alive at least at some point, and therefore the extra test isn't necessary. We only need that test, if I understand correctly, to cover the failure-to-launch case, which is presumably very rare.
>
> All that having been said, I don't quite know how to do it better. It would be easy enough to modify this function so that if it ever received a result other than BGWH_NOT_YET_STARTED for a worker, it wouldn't call this function again for that worker. However, that's only mitigating the problem, not really fixing it. We'll still end up frequently inquiring - admittedly only once - about the status of a worker with which we've previously communicated via the tuple queue. Maybe we just have to live with that problem, but do you (or does anyone else) have an idea?
>

I think the optimization you are suggesting has merit over what I have done in the patch. However, one point to note is that we are anyway going to call that function further down the path (WaitForParallelWorkersToExit -> WaitForBackgroundWorkerShutdown -> GetBackgroundWorkerPid), so we might want to take advantage of calling it the first time as well. We can actually cache the status of workers that have returned BGWH_STOPPED and use it later, so that we don't need to make this function call again for workers that are already stopped. We could even do what Tom is suggesting and avoid calling it for workers that are known to be launched, but I am really not sure that the function call is costly enough to justify maintaining one extra piece of state to avoid it.

While looking into this code path, I wonder how much we are gaining by having two separate calls (WaitForParallelWorkersToFinish and WaitForParallelWorkersToExit) to check the status of workers after finishing the parallel execution. They are used together in the rescan code path, so apparently there is no benefit in calling them separately there. OTOH, during shutdown of nodes, it will often be the case that they are called within a short duration of each other, except where we need to shut down the node for an early exit, such as when there is a LIMIT clause.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
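A rough sketch of the caching idea described above (the helper name and the known_stopped array are invented for illustration; this is not the attached patch):

#include "postgres.h"
#include "access/parallel.h"
#include "postmaster/bgworker.h"

/*
 * Sketch only: remember which workers have already been seen in
 * BGWH_STOPPED state, so that later passes don't have to re-check them.
 */
static void
update_stopped_cache(ParallelContext *pcxt, bool *known_stopped)
{
    int         i;

    for (i = 0; i < pcxt->nworkers_launched; ++i)
    {
        pid_t       pid;

        if (known_stopped[i])
            continue;           /* cached result from an earlier call */
        if (GetBackgroundWorkerPid(pcxt->worker[i].bgwhandle, &pid) == BGWH_STOPPED)
            known_stopped[i] = true;
    }
}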
On Fri, Dec 8, 2017 at 4:56 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think the optimization you are suggesting has merit over what I have done in the patch. However, one point to note is that we are anyway going to call that function further down the path (WaitForParallelWorkersToExit -> WaitForBackgroundWorkerShutdown -> GetBackgroundWorkerPid), so we might want to take advantage of calling it the first time as well. We can actually cache the status of workers that have returned BGWH_STOPPED and use it later, so that we don't need to make this function call again for workers that are already stopped. We could even do what Tom is suggesting and avoid calling it for workers that are known to be launched, but I am really not sure that the function call is costly enough to justify maintaining one extra piece of state to avoid it.

Well, let's do what optimization we can without making it too complicated.

> While looking into this code path, I wonder how much we are gaining by having two separate calls (WaitForParallelWorkersToFinish and WaitForParallelWorkersToExit) to check the status of workers after finishing the parallel execution. They are used together in the rescan code path, so apparently there is no benefit in calling them separately there. OTOH, during shutdown of nodes, it will often be the case that they are called within a short duration of each other, except where we need to shut down the node for an early exit, such as when there is a LIMIT clause.

I'm not 100% sure either, but if we're going to do anything about that, it seems like a topic for another patch. I don't think it's completely without value, because there is some time after we WaitForParallelWorkersToFinish and before we WaitForParallelWorkersToExit during which we can do things like retrieve instrumentation data and shut down other nodes, but whether it pulls its weight in code I don't know. This kind of grew up gradually: originally I/we didn't think of all of the cases where we needed the workers to actually exit rather than just indicating that they were done generating tuples, and the current situation is the result of a series of bug-fixes related to that oversight, so it's quite possible that a redesign would make sense.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Dec 8, 2017 at 9:45 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Dec 8, 2017 at 4:56 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> I think the optimization you are suggesting has a merit over what I >> have done in the patch. However, one point to note is that we are >> anyway going to call that function down in the path( >> WaitForParallelWorkersToExit->WaitForBackgroundWorkerShutdown->GetBackgroundWorkerPid), >> so we might want to take the advantage of calling it first time as >> well. We can actually cache the status of workers that have returned >> BGWH_STOPPED and use it later so that we don't need to make this >> function call again for workers which are already stopped. We can >> even do what Tom is suggesting to avoid calling it for workers which >> are known to be launched, but I am really not sure if that function >> call is costly enough that we need to maintain one extra state to >> avoid that. > > Well, let's do what optimization we can without making it too complicated. > Okay, see the attached and let me know if that suffices the need? >> While looking into this code path, I wonder how much we are gaining by >> having two separate calls (WaitForParallelWorkersToFinish and >> WaitForParallelWorkersToExit) to check the status of workers after >> finishing the parallel execution? They are used together in rescan >> code path, so apparently there is no benefit calling them separately >> there. OTOH, during shutdown of nodes, it will often be the case that >> they will be called in a short duration after each other except for >> the case where we need to shut down the node for the early exit like >> when there is a limit clause or such. > > I'm not 100% sure either, but if we're going to do anything about > that, it seems like a topic for another patch. > makes sense. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Sun, Dec 10, 2017 at 11:07 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Okay, see the attached and let me know if that suffices the need?

+ * Check for unexpected worker death. This will ensure that if
+ * the postmaster failed to start the worker, then we don't wait
+ * for it indefinitely. For workers that are known to be
+ * launched, we can rely on their error queue being freed once
+ * they exit.

Hmm. Is this really true? What if the worker starts up but then crashes before attaching to the error queue?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Dec 11, 2017 at 11:27 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Sun, Dec 10, 2017 at 11:07 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> Okay, see the attached and let me know if that suffices the need? > > + * Check for unexpected worker death. This will ensure that if > + * the postmaster failed to start the worker, then we don't wait > + * for it indefinitely. For workers that are known to be > + * launched, we can rely on their error queue being freed once > + * they exit. > > Hmm. Is this really true? What if the worker starts up but then > crashes before attaching to the error queue? > If the worker errored out before attaching to the error queue, then we can't rely on error queue being freed. However, in that case, the worker status will be BGWH_STOPPED. I have adjusted the comment accordingly. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Tue, Dec 12, 2017 at 9:00 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Dec 11, 2017 at 11:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sun, Dec 10, 2017 at 11:07 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> Okay, see the attached and let me know if that suffices the need?
>>
>> + * Check for unexpected worker death. This will ensure that if
>> + * the postmaster failed to start the worker, then we don't wait
>> + * for it indefinitely. For workers that are known to be
>> + * launched, we can rely on their error queue being freed once
>> + * they exit.
>>
>> Hmm. Is this really true? What if the worker starts up but then crashes before attaching to the error queue?
>>
>
> If the worker errored out before attaching to the error queue, then we can't rely on the error queue being freed. However, in that case, the worker status will be BGWH_STOPPED. I have adjusted the comment accordingly.
>

In particular, if the worker crashed (say somebody forcefully killed the worker), then I think it will lead to a restart of the server. So I am not sure that adding anything about that to the comment will make much sense.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Dec 11, 2017 at 10:41 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> In particular, if the worker crashed (say somebody forcefully killed the worker), then I think it will lead to a restart of the server. So I am not sure that adding anything about that to the comment will make much sense.

I think it's dangerous to rely on that assumption. If somebody stuck proc_exit(1) into the very beginning of the worker startup sequence, the worker would exit but the server would not restart.

As I started experimenting with this, I discovered that your patch is actually unsafe. Here's a test case that shows the breakage:

set force_parallel_mode = on;
select 1;

The expected outcome is:

 ?column?
----------
        1
(1 row)

If I hack do_start_bgworker() to make worker startup fail, like this:

--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -5701,7 +5701,7 @@ do_start_bgworker(RegisteredBgWorker *rw)
 #ifdef EXEC_BACKEND
     switch ((worker_pid = bgworker_forkexec(rw->rw_shmem_slot)))
 #else
-    switch ((worker_pid = fork_process()))
+    switch ((worker_pid = -1))
 #endif
     {
         case -1:

...then it hangs forever. With your patch it no longer hangs forever, which is good, but it also returns the wrong answer, which is bad:

 ?column?
----------
(0 rows)

The problem is that Gather sees nworkers_launched > 0 and assumes that it is therefore entitled not to perform a local scan.

I'm pretty sure that there could also be cases where this patch makes us miss an error reported by a worker; suppose the worker error is reported and the worker exits after WaitForParallelWorkersToFinish does CHECK_FOR_INTERRUPTS() but before we reach GetBackgroundWorkerPid(). The code will simply conclude that the worker is dead and stop waiting, never mind the fact that there's a pending error in the queue. This is exactly why I made parallel workers send a Terminate message to the leader instead of just disappearing -- the leader needs to see that message to know that it's reached the end of what the worker is going to send.

So, I think we need to regard fork failure and related failure modes as ERROR conditions. It's not equivalent to failing to register the background worker. When we fail to register the background worker, the parallel.c machinery knows at LaunchParallelWorkers() time that the worker will never show up and lets the calling code know by setting nworkers_launched, and the calling code can then choose to proceed with that number of workers or ERROR out as it chooses. But once we decide on the value of nworkers_launched to report to callers, it's really not OK for workers to just silently not ever start, as the test case above shows. We could make it the job of code that uses the ParallelContext machinery to cope with that situation, but I think that's a bad plan, because every user would have to worry about this obscure case. Right now, the only in-core use of the ParallelContext machinery is Gather/Gather Merge, but there are pending patches for parallel index scan and parallel VACUUM, so we'll just end up adding this handling to more and more places. And for what? If you've got max_connections and/or max_parallel_workers set so high that your system can't fork(), you have a big problem, and forcing things that would have run using parallel query to run serially instead of erroring out is not going to fix it. I think that deciding we're going to handle fork() failure by reporting an ERROR is just fine.

So, here's a patch. This patch keeps track of whether we've ever received a message from each worker via the error queue. If there are no remaining workers which have neither exited cleanly nor sent us at least one message, then it calls GetBackgroundWorkerPid() for each one. If any worker reports that it is BGWH_STOPPED, then it double-checks whether the worker has attached to the error queue. If it did, then it assumes that we'll detect clean shutdown or an error on the next iteration and ignores the problem. If not, then the worker died without ever attaching to the error queue, so it throws an ERROR. I tested that this causes the test case above to throw a proper ERROR when fork_process() returns -1. Please review...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
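In outline, the check the patch adds looks roughly like the following sketch (simplified, with an invented helper name; the attached patch is authoritative):

#include "postgres.h"
#include "access/parallel.h"
#include "postmaster/bgworker.h"
#include "storage/shm_mq.h"

/*
 * Sketch only: a worker that is BGWH_STOPPED but never attached to its
 * error queue must have died before finishing startup (e.g. fork failure),
 * so report an error rather than waiting for it forever.
 */
static void
check_for_lost_workers(ParallelContext *pcxt, bool *any_message_received)
{
    int         i;

    for (i = 0; i < pcxt->nworkers_launched; ++i)
    {
        pid_t       pid;
        shm_mq     *mq;

        if (pcxt->worker[i].error_mqh == NULL)
            continue;           /* already shut down cleanly */
        if (any_message_received[i])
            continue;           /* known to have attached and run */
        if (GetBackgroundWorkerPid(pcxt->worker[i].bgwhandle, &pid) != BGWH_STOPPED)
            continue;           /* still running, or not started yet */

        /* Stopped: did it ever attach to its error queue? */
        mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
        if (shm_mq_get_sender(mq) == NULL)
            ereport(ERROR,
                    (errcode(ERRCODE_INTERNAL_ERROR),
                     errmsg("parallel worker failed to initialize")));
    }
}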
On Wed, Dec 13, 2017 at 12:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Dec 11, 2017 at 10:41 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> In particular, if the worker crashed (say somebody forcefully killed the worker), then I think it will lead to a restart of the server. So I am not sure that adding anything about that to the comment will make much sense.
>
> The problem is that Gather sees nworkers_launched > 0 and assumes that it is therefore entitled not to perform a local scan.
>

Yeah, I think that assumption is not right. I think that, ideally, it should wait for the workers to start up before making that assumption.

> I'm pretty sure that there could also be cases where this patch makes us miss an error reported by a worker; suppose the worker error is reported and the worker exits after WaitForParallelWorkersToFinish does CHECK_FOR_INTERRUPTS() but before we reach GetBackgroundWorkerPid(). The code will simply conclude that the worker is dead and stop waiting, never mind the fact that there's a pending error in the queue.

I have tried to reproduce this situation by adding an error case in the worker (just before the worker returns success (the 'X' message)) and by setting a breakpoint in WaitForParallelWorkersToFinish. What I noticed is that the worker is not allowed to exit until it receives some ack from the master (basically the worker waits at SendProcSignal after sending the message), and the error is reported properly.

> This is exactly why I made parallel workers send a Terminate message to the leader instead of just disappearing -- the leader needs to see that message to know that it's reached the end of what the worker is going to send.
>
> So, I think we need to regard fork failure and related failure modes as ERROR conditions. It's not equivalent to failing to register the background worker. When we fail to register the background worker, the parallel.c machinery knows at LaunchParallelWorkers() time that the worker will never show up and lets the calling code know by setting nworkers_launched, and the calling code can then choose to proceed with that number of workers or ERROR out as it chooses. But once we decide on the value of nworkers_launched to report to callers, it's really not OK for workers to just silently not ever start, as the test case above shows. We could make it the job of code that uses the ParallelContext machinery to cope with that situation, but I think that's a bad plan, because every user would have to worry about this obscure case. Right now, the only in-core use of the ParallelContext machinery is Gather/Gather Merge, but there are pending patches for parallel index scan and parallel VACUUM, so we'll just end up adding this handling to more and more places. And for what? If you've got max_connections and/or max_parallel_workers set so high that your system can't fork(), you have a big problem, and forcing things that would have run using parallel query to run serially instead of erroring out is not going to fix it. I think that deciding we're going to handle fork() failure by reporting an ERROR is just fine.
>

So, if I understand correctly, this means that it will return an error if there is a problem in even one worker (exited or not started), even though the other workers could easily have finished the work. Won't such an error lead to situations where, after running the query for one hour, the user sees a failure just because one of the many workers did not start (or some similar failure), even though the query would have completed without that worker? If so, that doesn't sound like sensible behavior.

> So, here's a patch. This patch keeps track of whether we've ever received a message from each worker via the error queue. If there are no remaining workers which have neither exited cleanly nor sent us at least one message, then it calls GetBackgroundWorkerPid() for each one. If any worker reports that it is BGWH_STOPPED, then it double-checks whether the worker has attached to the error queue. If it did, then it assumes that we'll detect clean shutdown or an error on the next iteration and ignores the problem. If not, then the worker died without ever attaching to the error queue, so it throws an ERROR. I tested that this causes the test case above to throw a proper ERROR when fork_process() returns -1. Please review...
>

This also doesn't appear to be completely safe. If we add proc_exit(1) after attaching to the error queue (say after pq_set_parallel_master) in the worker, then it will lead to a *hang*, because anyone_alive will still be false and, since it will find that the sender is set for the error queue, it won't return any failure. Now, I think that even if we check the worker status (which will be stopped) and break after the new error condition, it won't work, because it will still return zero rows in the case you reported above.

I think that if, before deciding not to scan locally, we check whether the workers have actually started, then my patch should work fine and we don't need to add any additional checks. Also, it won't create any performance issue, because we will perform such a check only when force_parallel_mode is not off or parallel leader participation is off, which I think are not common cases.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Dec 13, 2017 at 1:41 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > I have tried to reproduce this situation by adding error case in > worker (just before worker returns success ('X' message)) and by > having breakpoint in WaitForParallelWorkersToFinish. What, I noticed > is that worker is not allowed to exit till it receives some ack from > master (basically worker waits at SendProcSignal after sending > message) and error is reported properly. SendProcSignal doesn't look to me like it can do anything that can block, so I don't understand this. > So, if I understand correctly, this means that it will return an error > even if there is a problem in one worker (exited or not started) even > though other workers would have easily finished the work. Won't such > an error can lead to situations where after running the query for one > hour the user might see some failure just because one of the many > workers is not started (or some similar failure) even though the query > would have been completed without that? If so, that doesn't sound > like a sensible behavior. I think it's the *only* sensible behavior. parallel.c does not know that the query would have completed successfully without that worker, or at least it doesn't know that it would have returned the right answer. It *might* be the case that the caller wasn't depending on that worker to do anything, but parallel.c doesn't know that. > This also doesn't appear to be completely safe. If we add > proc_exit(1) after attaching to error queue (say after > pq_set_parallel_master) in the worker, then it will lead to *hang* as > anyone_alive will still be false and as it will find that the sender > is set for the error queue, it won't return any failure. Now, I think > even if we check worker status (which will be stopped) and break after > the new error condition, it won't work as it will still return zero > rows in the case reported by you above. Hmm, there might still be a problem there. I was thinking that once the leader attaches to the queue, we can rely on the leader reaching "ERROR: lost connection to parallel worker" in HandleParallelMessages. However, that might not work because nothing sets ParallelMessagePending in that case. The worker will have detached the queue via shm_mq_detach_callback, but the leader will only discover the detach if it actually tries to read from that queue. > I think if before making an assumption not to scan locally if we have > a check to see whether workers are started, then my patch should work > fine and we don't need to add any additional checks. Also, it won't > create any performance issue as we will perform such a check only when > force_parallel_mode is not off or parallel leader participation is off > which I think are not common cases. My instinct is that we're going to have a lot of subtle bugs if clients of parallel.c can't count on nworkers_launched to be accurate. If clients didn't need that value, we could just have nworkers and let the callers figure it out, but nobody wants that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
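To illustrate the last point: the leader only learns about a detach when it actually reads from the queue, along these lines (a sketch, not code from any of the patches here):

#include "postgres.h"
#include "storage/shm_mq.h"

/* Illustration only: a detach is noticed only when we try to read. */
static void
poll_queue_once(shm_mq_handle *mqh)
{
    shm_mq_result res;
    Size        nbytes;
    void       *data;

    res = shm_mq_receive(mqh, &nbytes, &data, true);    /* nowait */
    if (res == SHM_MQ_DETACHED)
        ereport(ERROR,
                (errmsg("lost connection to parallel worker")));
}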
On Thu, Dec 14, 2017 at 3:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Dec 13, 2017 at 1:41 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>> This also doesn't appear to be completely safe. If we add proc_exit(1) after attaching to the error queue (say after pq_set_parallel_master) in the worker, then it will lead to a *hang*, because anyone_alive will still be false and, since it will find that the sender is set for the error queue, it won't return any failure. Now, I think that even if we check the worker status (which will be stopped) and break after the new error condition, it won't work, because it will still return zero rows in the case you reported above.
>
> Hmm, there might still be a problem there. I was thinking that once the leader attaches to the queue, we can rely on the leader reaching "ERROR: lost connection to parallel worker" in HandleParallelMessages. However, that might not work because nothing sets ParallelMessagePending in that case. The worker will have detached the queue via shm_mq_detach_callback, but the leader will only discover the detach if it actually tries to read from that queue.
>

I think it would have been much easier to fix this problem if we had some way to differentiate whether the worker has stopped gracefully or not. Do you think it makes sense to introduce such a state in the background worker machinery?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Dec 19, 2017 at 5:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think it would have been much easier to fix this problem if we had some way to differentiate whether the worker has stopped gracefully or not. Do you think it makes sense to introduce such a state in the background worker machinery?

I think it makes a lot more sense to just throw an ERROR when the worker doesn't shut down cleanly, which is currently what happens in nearly all cases. It only fails to happen for fork() failure and other errors that happen very early in startup. I don't think there's any point in trying to make this code more complicated to cater to such cases.

If fork() is failing, the fact that parallel query is erroring out rather than running with fewer workers is likely to be a good thing. Your principal concern in that situation is probably whether your attempt to log into the machine and kill some processes is also going to die with 'fork failure', and having PostgreSQL consume every available process slot is not going to make that easier. On the other hand, if workers are failing so early in startup that they never attach to the error queue, then they're probably all failing the same way, and trying to cope with that problem in any way other than throwing an error is going to result in parallelism being silently disabled with no notification to the user, which doesn't seem good to me either. So basically I think it's right to treat these as error conditions, not try to continue the work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Dec 19, 2017 at 9:01 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Dec 19, 2017 at 5:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> I think it would have been much easier to fix this problem if we would >> have some way to differentiate whether the worker has stopped >> gracefully or not. Do you think it makes sense to introduce such a >> state in the background worker machinery? > > I think it makes a lot more sense to just throw an ERROR when the > worker doesn't shut down cleanly, which is currently what happens in > nearly all cases. It only fails to happen for fork() failure and > other errors that happen very early in startup. I don't think there's > any point in trying to make this code more complicated to cater to > such cases. If fork() is failing, the fact that parallel query is > erroring out rather than running with fewer workers is likely to be a > good thing. Your principle concern in that situation is probably > whether your attempt to log into the machine and kill some processes > is also going to die with 'fork failure', and having PostgreSQL > consume every available process slot is not going to make that easier. > On the other hand, if workers are failing so early in startup that > they never attach to the error queue, then they're probably all > failing the same way and trying to cope with that problem in any way > other than throwing an error is going to result in parallelism being > silently disabled with no notification to the user, which doesn't seem > good to me either. > That is not the main point I am bothered about. I am concerned that the patch proposed by you can lead to hang if there is a crash (abrupt failure like proc_exit(1)) after attaching to the error queue. This is explained in my email up thread [1]. I think we can fix it by trying to read the queues as mentioned by you, but was not sure if that is the best way to deal with it, so proposed an alternate solution. [1] - https://www.postgresql.org/message-id/CAA4eK1LkHBbafduA8em1fb-in4kDBgmQf6qO5TV8CfyuTzEFaQ%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Tue, Dec 19, 2017 at 11:28 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > That is not the main point I am bothered about. I am concerned that > the patch proposed by you can lead to hang if there is a crash (abrupt > failure like proc_exit(1)) after attaching to the error queue. This is > explained in my email up thread [1]. I think we can fix it by trying > to read the queues as mentioned by you, but was not sure if that is > the best way to deal with it, so proposed an alternate solution. OK. My core point is that I really, really don't want Gather or Gather Merge or parallel CREATE INDEX to have to worry about these cases; I believe strongly that it is best if these cases cause an error at the ParallelContext layer. I read your email as suggesting the opposite, but maybe I misunderstood you. One thing I thought about is introducing a new state BGWH_STARTUP_FAILED, as distinguished from BGWH_STOPPED. However, it seems problematic. Once the worker is unregistered we wouldn't know which one to return, especially if the slot has been reused. Furthermore, it doesn't help in the case where the worker starts and immediately exits without attaching to the DSM. So it doesn't seem to me that we can fix the problem this way. It seems to me that we instead have to solve the problem in the ParallelContext layer, which can understand the state of the worker by comparing the state of the background worker to what we can find out via the DSM. The background worker layer knows whether or not the worker is running and can signal us when it changes state, but it can't know whether the worker exited cleanly or uncleanly. To know that, we have to look at things like whether it sent us a Terminate message before it stopped. While I think that the clean-vs-unclean distinction has to be made at the ParallelContext layer, it doesn't necessarily have to happen by inserting an error queue. I think that may be a convenient way; for example, we could plug the hole you mentioned before by attaching an on_dsm_detach callback that signals the leader with SendProcSignal(..., PROCSIG_PARALLEL_MESSAGE, ...). At that point, all of the information required for the leader to know that the failure has occurred is present in shared memory; it is only that the leader may not have anything to trigger it to look at the relevant bits of information, and this would fix that. But there are probably other ways the information could also be communicated as well. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
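A sketch of that idea (illustrative only, not committed code):

#include "postgres.h"
#include "storage/backendid.h"
#include "storage/dsm.h"
#include "storage/procsignal.h"

/*
 * Sketch only: a worker-side on_dsm_detach callback that pokes the leader,
 * so the leader re-examines its error queues even if this worker never
 * managed to send a message.
 */
static void
notify_leader_on_detach(dsm_segment *seg, Datum arg)
{
    pid_t       leader_pid = (pid_t) DatumGetInt32(arg);

    SendProcSignal(leader_pid, PROCSIG_PARALLEL_MESSAGE, InvalidBackendId);
}

The worker would register this right after attaching to the segment, passing the leader's PID (which parallel workers already have available) as the Datum argument, e.g. on_dsm_detach(seg, notify_leader_on_detach, Int32GetDatum(leader_pid)).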
On Thu, Dec 21, 2017 at 2:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Dec 19, 2017 at 11:28 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> That is not the main point I am bothered about. I am concerned that the patch proposed by you can lead to hang if there is a crash (abrupt failure like proc_exit(1)) after attaching to the error queue. This is explained in my email up thread [1]. I think we can fix it by trying to read the queues as mentioned by you, but was not sure if that is the best way to deal with it, so proposed an alternate solution.
>
> OK. My core point is that I really, really don't want Gather or Gather Merge or parallel CREATE INDEX to have to worry about these cases; I believe strongly that it is best if these cases cause an error at the ParallelContext layer. I read your email as suggesting the opposite, but maybe I misunderstood you.
>
> One thing I thought about is introducing a new state BGWH_STARTUP_FAILED, as distinguished from BGWH_STOPPED. However, it seems problematic. Once the worker is unregistered we wouldn't know which one to return, especially if the slot has been reused.
>

What if we don't allow such slots to be reused until the backend/session that registered the worker performs an unregister? Currently, we don't seem to have an API corresponding to Register*BackgroundWorker() which can be used to unregister, but maybe we can provide such an API.

> Furthermore, it doesn't help in the case where the worker starts and immediately exits without attaching to the DSM.
>

Yeah, but can't we detect that case? After the worker exits, we know its exit status, as it is passed to CleanupBackgroundWorker; we could use that to mark the worker state as BGWH_ERROR_STOPPED (or something like BGWH_IMMEDIATE_STOPPED).

I think the above way sounds invasive, but it seems to me that it could be used by other users of background workers as well.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Dec 21, 2017 at 6:46 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > What if we don't allow to reuse such slots till the backend/session > that has registered it performs unregister? Currently, we don't seem > to have an API corresponding to Register*BackgroundWorker() which can > be used to unregister, but maybe we can provide such an API. Well, then we could have slots pinned down for a long time, if the backend never gets around to calling unregister. Furthermore, that's absolutely not back-patchable, because we can't put a requirement like that on code running in the back branches. Also, what if the code path that would have done the unregister eventually errors out? We'd need TRY/CATCH blocks everywhere that registers the worker. In short, this seems terrible for multiple reasons. >> Furthermore, it doesn't help in the case where the worker starts and >> immediately exits without attaching to the DSM. > > Yeah, but can't we detect that case? After the worker exits, we can > know its exit status as is passed to CleanupBackgroundWorker, we can > use that to mark the worker state as BGWH_ERROR_STOPPED (or something > like BGWH_IMMEDIATE_STOPPED). > > I think above way sounds invasive, but it seems to me that it can be > used by other users of background workers as well. The exit status doesn't tell us whether the worker attached to the DSM. I'm relatively puzzled as to why you're rejecting a relatively low-impact way of handling a corner case that was missed in the original design in favor of major architectural changes. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Dec 21, 2017 at 6:26 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Dec 21, 2017 at 6:46 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> What if we don't allow to reuse such slots till the backend/session >> that has registered it performs unregister? Currently, we don't seem >> to have an API corresponding to Register*BackgroundWorker() which can >> be used to unregister, but maybe we can provide such an API. > > Well, then we could have slots pinned down for a long time, if the > backend never gets around to calling unregister. Furthermore, that's > absolutely not back-patchable, because we can't put a requirement like > that on code running in the back branches. Also, what if the code > path that would have done the unregister eventually errors out? We'd > need TRY/CATCH blocks everywhere that registers the worker. In short, > this seems terrible for multiple reasons. > >>> Furthermore, it doesn't help in the case where the worker starts and >>> immediately exits without attaching to the DSM. >> >> Yeah, but can't we detect that case? After the worker exits, we can >> know its exit status as is passed to CleanupBackgroundWorker, we can >> use that to mark the worker state as BGWH_ERROR_STOPPED (or something >> like BGWH_IMMEDIATE_STOPPED). >> >> I think above way sounds invasive, but it seems to me that it can be >> used by other users of background workers as well. > > The exit status doesn't tell us whether the worker attached to the DSM. > > I'm relatively puzzled as to why you're rejecting a relatively > low-impact way of handling a corner case that was missed in the > original design in favor of major architectural changes. > I am not against using the way specific to parallel context layer as described by you above. However, I was trying to see if there is some general purpose solution as the low-impact way is not very straightforward. I think you can go ahead with the way you have described to fix the hole I was pointing to and I can review it or I can also give it a try if you want to. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Dec 21, 2017 at 9:13 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > I am not against using the way specific to parallel context layer as > described by you above. However, I was trying to see if there is > some general purpose solution as the low-impact way is not very > straightforward. I think you can go ahead with the way you have > described to fix the hole I was pointing to and I can review it or I > can also give it a try if you want to. See attached. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Thu, Jan 18, 2018 at 8:53 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Dec 21, 2017 at 9:13 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> I am not against using the way specific to parallel context layer as >> described by you above. However, I was trying to see if there is >> some general purpose solution as the low-impact way is not very >> straightforward. I think you can go ahead with the way you have >> described to fix the hole I was pointing to and I can review it or I >> can also give it a try if you want to. > > See attached. > The patch doesn't apply cleanly on the head, but after rebasing it, I have reviewed and tested it and it seems to be working fine. Apart from this specific issue, I think we should consider making nworkers_launched reliable (at the very least for cases where it matters). You seem to be of opinion that can be a source of subtle bugs which I don't deny but now I think we are starting to see some use cases of such a mechanism as indicated by Peter G. in parallel create index thread. Even, if we find some way for parallel create index to not rely on that assumption, I won't be surprised if some future parallelism work would have such a requirement. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Jan 19, 2018 at 10:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > The patch doesn't apply cleanly on the head, but after rebasing it, I > have reviewed and tested it and it seems to be working fine. Apart > from this specific issue, I think we should consider making > nworkers_launched reliable (at the very least for cases where it > matters). You seem to be of opinion that can be a source of subtle > bugs which I don't deny but now I think we are starting to see some > use cases of such a mechanism as indicated by Peter G. in parallel > create index thread. Even, if we find some way for parallel create > index to not rely on that assumption, I won't be surprised if some > future parallelism work would have such a requirement. Isn't making nworkers_launched reliable exactly what this patch does? It converts the rare cases in which nworkers_launched would have been unreliable into errors, precisely so that code like parallel CREATE INDEX doesn't need to worry about the case where it's inaccurate. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jan 22, 2018 at 7:16 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Jan 19, 2018 at 10:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> The patch doesn't apply cleanly on the head, but after rebasing it, I >> have reviewed and tested it and it seems to be working fine. Apart >> from this specific issue, I think we should consider making >> nworkers_launched reliable (at the very least for cases where it >> matters). You seem to be of opinion that can be a source of subtle >> bugs which I don't deny but now I think we are starting to see some >> use cases of such a mechanism as indicated by Peter G. in parallel >> create index thread. Even, if we find some way for parallel create >> index to not rely on that assumption, I won't be surprised if some >> future parallelism work would have such a requirement. > > Isn't making nworkers_launched reliable exactly what this patch does? > It converts the rare cases in which nworkers_launched would have been > unreliable into errors, precisely so that code like parallel CREATE > INDEX doesn't need to worry about the case where it's inaccurate. > It does turn such cases into error but only at the end when someone calls WaitForParallelWorkersToFinish. If one tries to rely on it at any other time, it won't work as I think is the case for Parallel Create Index where Peter G. is trying to use it differently. I am not 100% sure whether Parallel Create index has this symptom as I have not tried it with that patch, but I and Peter are trying to figure that out. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Jan 22, 2018 at 10:56 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> It does turn such cases into error but only at the end when someone calls WaitForParallelWorkersToFinish. If one tries to rely on it at any other time, it won't work as I think is the case for Parallel Create Index where Peter G. is trying to use it differently. I am not 100% sure whether Parallel Create index has this symptom as I have not tried it with that patch, but I and Peter are trying to figure that out.

OK. I've committed this patch and back-patched it to 9.6. Back-patching to 9.5 didn't look simple because nworkers_launched doesn't exist on that branch, and it doesn't matter very much anyway since the infrastructure has no in-core users in 9.5.

I'll respond to the other thread about the issues specific to parallel CREATE INDEX.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jan 23, 2018 at 11:15 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Jan 22, 2018 at 10:56 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> It does turn such cases into error but only at the end when someone >> calls WaitForParallelWorkersToFinish. If one tries to rely on it at >> any other time, it won't work as I think is the case for Parallel >> Create Index where Peter G. is trying to use it differently. I am not >> 100% sure whether Parallel Create index has this symptom as I have not >> tried it with that patch, but I and Peter are trying to figure that >> out. > > OK. I've committed this patch and back-patched it to 9.6. > Back-patching to 9.5 didn't looks simple because nworkers_launched > doesn't exist on that branch, and it doesn't matter very much anyway > since the infrastructure has no in-core users in 9.5. I was just talking to Thomas, and one or the other of us realized that this doesn't completely fix the problem. I think that if a worker fails, even by some unorthodox mechanism like an abrupt proc_exit(1), after the point where it attached to the error queue, this patch is sufficient to make that reliably turn into an error. But what if it fails before that - e.g. fork() failure? In that case, this process will catch the problem if the calling process calls WaitForParallelWorkersToFinish, but does that actually happen in any interesting cases? Maybe not. In the case of Gather, for example, we won't WaitForParallelWorkersToFinish() until we've read all the tuples. If there's a tuple queue that does not yet have a sender, then we'll just wait for it to get one, even if the sender fell victim to a fork failure and is never going to show up. We could *almost* fix this by having execParallel.c include a Barrier in the DSM, similar to what I proposed for parallel CREATE INDEX. When a worker starts, it attaches to the barrier; when it exits, it detaches. Instead of looping until the leader is done and all the tuple queues are detached, Gather or Gather Merge could loop until the leader is done and everyone detaches from the barrier. But that would require changes to the Barrier API, which isn't prepared to have the caller alternate waiting with other activity like reading the tuple queues, which would clearly be necessary in this case. Moreover, it doesn't work at all when parallel_leader_participation=off, because in that case we'll again just wait forever for some worker to show up, and if none of them do, then we're hosed. One idea to fix this is to add a new function in parallel.c with a name like CheckForFailedWorkers() that Gather and Gather Merge call just before they WaitLatch(). If a worker fails to start, the postmaster will send SIGUSR1 to the leader, which will set the latch, so we can count on the function being run after every worker death. The function would go through and check each worker for which pcxt->worker[i].error_mqh != NULL && !pcxt->any_message_received[i] to see whether GetBackgroundWorkerPid(pcxt->worker[i].bgwhandle, &pid) == BGWH_STOPPED. If so, ERROR. This lets us *run* for arbitrary periods of time without detecting a fork failure, but before we *wait* we will notice. Getting more aggressive than that sounds expensive, but I'm open to ideas. If we adopt this approach, then Peter's patch could probably make use of it, too. It's a little tricky because ConditionVariableSleep() tries not to return spuriously, so we'd either have to change that or stick CheckForFailedWorkers() into that function's loop. 
I suspect the former is a better approach. Or maybe that patch should still use the Barrier approach and avoid all of this annoyance. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
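For concreteness, a rough and untested sketch of the CheckForFailedWorkers() idea described above might look like the following. The ParallelContext fields it touches (worker[i].error_mqh, worker[i].bgwhandle, any_message_received[i]) are taken from the description in the message; their exact layout is an assumption here, not settled code.

#include "postgres.h"

#include "access/parallel.h"
#include "postmaster/bgworker.h"

/*
 * Untested sketch only: error out if any launched worker stopped before
 * ever sending a message on its error queue, i.e. before it attached.
 */
void
CheckForFailedWorkers(ParallelContext *pcxt)
{
	int			i;

	for (i = 0; i < pcxt->nworkers_launched; ++i)
	{
		pid_t		pid;

		/* Ignore workers we've already heard from or already released. */
		if (pcxt->worker[i].error_mqh == NULL ||
			pcxt->any_message_received[i])
			continue;

		/* Silent and stopped: it must have died before attaching. */
		if (GetBackgroundWorkerPid(pcxt->worker[i].bgwhandle, &pid) ==
			BGWH_STOPPED)
			ereport(ERROR,
					(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
					 errmsg("parallel worker failed to initialize"),
					 errhint("More details may be available in the server log.")));
	}
}

Since the postmaster's SIGUSR1 sets the leader's latch whenever a worker starts or stops, calling this just before each WaitLatch() would, as described above, notice a fork failure before any indefinite wait begins.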
On Tue, Jan 23, 2018 at 1:05 PM, Robert Haas <robertmhaas@gmail.com> wrote: > I was just talking to Thomas, and one or the other of us realized that > this doesn't completely fix the problem. I think that if a worker > fails, even by some unorthodox mechanism like an abrupt proc_exit(1), > after the point where it attached to the error queue, this patch is > sufficient to make that reliably turn into an error. But what if it > fails before that - e.g. fork() failure? In that case, this process > will catch the problem if the calling process calls > WaitForParallelWorkersToFinish, but does that actually happen in any > interesting cases? Maybe not. > > In the case of Gather, for example, we won't > WaitForParallelWorkersToFinish() until we've read all the tuples. If > there's a tuple queue that does not yet have a sender, then we'll just > wait for it to get one, even if the sender fell victim to a fork > failure and is never going to show up. > > We could *almost* fix this by having execParallel.c include a Barrier > in the DSM, similar to what I proposed for parallel CREATE INDEX. > When a worker starts, it attaches to the barrier; when it exits, it > detaches. Instead of looping until the leader is done and all the > tuple queues are detached, Gather or Gather Merge could loop until the > leader is done and everyone detaches from the barrier. But that would > require changes to the Barrier API, which isn't prepared to have the > caller alternate waiting with other activity like reading the tuple > queues, which would clearly be necessary in this case. Moreover, it > doesn't work at all when parallel_leader_participation=off, because in > that case we'll again just wait forever for some worker to show up, > and if none of them do, then we're hosed. I'm relieved that we all finally seem to be on the same page on this issue. > One idea to fix this is to add a new function in parallel.c with a > name like CheckForFailedWorkers() that Gather and Gather Merge call > just before they WaitLatch(). If a worker fails to start, the > postmaster will send SIGUSR1 to the leader, which will set the latch, > so we can count on the function being run after every worker death. > The function would go through and check each worker for which > pcxt->worker[i].error_mqh != NULL && !pcxt->any_message_received[i] to > see whether GetBackgroundWorkerPid(pcxt->worker[i].bgwhandle, &pid) == > BGWH_STOPPED. If so, ERROR. Sounds workable to me. > This lets us *run* for arbitrary periods of time without detecting a > fork failure, but before we *wait* we will notice. Getting more > aggressive than that sounds expensive, but I'm open to ideas. Is that really that different to any other case? I imagine that in practice most interrupt checks happen within the leader in a WaitLatch() loop (or similar). Besides, the kinds of failures we're talking about are incredibly rare in the real world. I think it's fine if the leader is not very promptly aware of when this happens, or if execution becomes slow in some other way. > If we adopt this approach, then Peter's patch could probably make use > of it, too. It's a little tricky because ConditionVariableSleep() > tries not to return spuriously, so we'd either have to change that or > stick CheckForFailedWorkers() into that function's loop. I suspect > the former is a better approach. Or maybe that patch should still use > the Barrier approach and avoid all of this annoyance. 
I have the same misgivings about using a barrier to solve this problem for parallel CREATE INDEX as before. Specifically: why should nbtsort.c solve this problem for itself? I don't think that it's in any way special. One can imagine exactly the same problem with other parallel DDL implementations, fixable using exactly the same solution. I'm not saying that nbtsort.c shouldn't have to do anything differently; just that it should buy into some general solution. That said, it would be ideal if the current approach I take in nbtsort.c didn't have to be changed *at all*, because if it was 100% correct already then I think we'd have a more robust parallel.h interface (it is certainly a "nice to have" for me). The fact that multiple hackers at least tacitly assumed that nworkers_launched is a reliable indicator of how many workers will show up tells us something. Moreover, parallel_leader_participation=off is always going to be wonky under a regime based on "see who shows up", because failure becomes indistinguishable from high latency -- I'd rather avoid having to think about nbtsort.c in distributed systems terms. -- Peter Geoghegan
On Wed, Jan 24, 2018 at 2:35 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Jan 23, 2018 at 11:15 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Mon, Jan 22, 2018 at 10:56 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> OK. I've committed this patch and back-patched it to 9.6. >> Back-patching to 9.5 didn't looks simple because nworkers_launched >> doesn't exist on that branch, and it doesn't matter very much anyway >> since the infrastructure has no in-core users in 9.5. > > I was just talking to Thomas, and one or the other of us realized that > this doesn't completely fix the problem. I think that if a worker > fails, even by some unorthodox mechanism like an abrupt proc_exit(1), > after the point where it attached to the error queue, this patch is > sufficient to make that reliably turn into an error. But what if it > fails before that - e.g. fork() failure? In that case, this process > will catch the problem if the calling process calls > WaitForParallelWorkersToFinish, but does that actually happen in any > interesting cases? Maybe not. > > In the case of Gather, for example, we won't > WaitForParallelWorkersToFinish() until we've read all the tuples. If > there's a tuple queue that does not yet have a sender, then we'll just > wait for it to get one, even if the sender fell victim to a fork > failure and is never going to show up. > Hmm, I think that case will be addressed because tuple queues can detect if the leader is not attached. It does in code path shm_mq_receive->shm_mq_counterparty_gone. In shm_mq_counterparty_gone, it can detect if the worker is gone by using GetBackgroundWorkerPid. Moreover, I have manually tested this particular case before saying your patch is fine. Do you have some other case in mind which I am missing? > We could *almost* fix this by having execParallel.c include a Barrier > in the DSM, similar to what I proposed for parallel CREATE INDEX. > When a worker starts, it attaches to the barrier; when it exits, it > detaches. Instead of looping until the leader is done and all the > tuple queues are detached, Gather or Gather Merge could loop until the > leader is done and everyone detaches from the barrier. But that would > require changes to the Barrier API, which isn't prepared to have the > caller alternate waiting with other activity like reading the tuple > queues, which would clearly be necessary in this case. Moreover, it > doesn't work at all when parallel_leader_participation=off, because in > that case we'll again just wait forever for some worker to show up, > and if none of them do, then we're hosed. > > One idea to fix this is to add a new function in parallel.c with a > name like CheckForFailedWorkers() that Gather and Gather Merge call > just before they WaitLatch(). If a worker fails to start, the > postmaster will send SIGUSR1 to the leader, which will set the latch, > so we can count on the function being run after every worker death. > The function would go through and check each worker for which > pcxt->worker[i].error_mqh != NULL && !pcxt->any_message_received[i] to > see whether GetBackgroundWorkerPid(pcxt->worker[i].bgwhandle, &pid) == > BGWH_STOPPED. If so, ERROR. > > This lets us *run* for arbitrary periods of time without detecting a > fork failure, but before we *wait* we will notice. Getting more > aggressive than that sounds expensive, but I'm open to ideas. > > If we adopt this approach, then Peter's patch could probably make use > of it, too. 
It's a little tricky because ConditionVariableSleep() > tries not to return spuriously, so we'd either have to change that or > stick CheckForFailedWorkers() into that function's loop. I suspect > the former is a better approach. Or maybe that patch should still use > the Barrier approach and avoid all of this annoyance. > I think we can discuss your proposed solution once the problem is clear. As of now, it is not very clear to me whether there is any pending problem. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Jan 24, 2018 at 3:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Wed, Jan 24, 2018 at 2:35 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> In the case of Gather, for example, we won't >> WaitForParallelWorkersToFinish() until we've read all the tuples. If >> there's a tuple queue that does not yet have a sender, then we'll just >> wait for it to get one, even if the sender fell victim to a fork >> failure and is never going to show up. >> > > Hmm, I think that case will be addressed because tuple queues can > detect if the leader is not attached. It does in code path > shm_mq_receive->shm_mq_counterparty_gone. In > shm_mq_counterparty_gone, it can detect if the worker is gone by using > GetBackgroundWorkerPid. Moreover, I have manually tested this > particular case before saying your patch is fine. Do you have some > other case in mind which I am missing? Hmm. Yeah. I can't seem to reach a stuck case and was probably just confused and managed to confuse Robert too. If you make fork_process() fail randomly (see attached), I see that there are a couple of easily reachable failure modes (example session at bottom of message): 1. HandleParallelMessages() is reached and raises a "lost connection to parallel worker" error because shm_mq_receive() returns SHM_MQ_DETACHED, I think because shm_mq_counterparty_gone() checked GetBackgroundWorkerPid() just as you said. I guess that's happening because some other process is (coincidentally) sending PROCSIG_PARALLEL_MESSAGE at shutdown, causing us to notice that a process is unexpectedly stopped. 2. WaitForParallelWorkersToFinish() is reached and raises a "parallel worker failed to initialize" error. TupleQueueReaderNext() set done to true, because shm_mq_receive() returned SHM_MQ_DETACHED. Once again, that is because shm_mq_counterparty_gone() returned true. This is the bit Robert and I missed in our off-list discussion. As long as we always get our latch set by the postmaster after a fork failure (ie kill SIGUSR1) and after GetBackgroundWorkerPid() is guaranteed to return BGWH_STOPPED after that, and as long as we only ever use latch/CFI loops to wait, and as long as we try to read from a shm_mq, then I don't see a failure mode that hangs. This doesn't help a parallel.c client that doesn't try to read from a shm_mq though, like parallel CREATE INDEX (using the proposed design without dynamic Barrier). We can't rely on a coincidental PROCSIG_PARALLEL_MESSAGE like in (1), and it's not going to try to read from a queue like in (2). So wouldn't it need a way to perform its own similar check for unexpectedly BGWH_STOPPED processes whenever its latch is set? If there were some way for the postmaster to cause reason PROCSIG_PARALLEL_MESSAGE to be set in the leader process instead of just notification via kill(SIGUSR1) when it fails to fork a parallel worker, we'd get (1) for free in any latch/CFI loop code. But I understand that we can't do that by project edict. === postgres=# create table foox as select generate_series(1, 10000) as a; SELECT 10000 postgres=# alter table foox set (parallel_workers = 4); ALTER TABLE postgres=# set max_parallel_workers_per_gather = 4; SET postgres=# select count(*) from foox; ERROR: lost connection to parallel worker postgres=# select count(*) from foox; ERROR: parallel worker failed to initialize HINT: More details may be available in the server log. 
postgres=# select count(*) from foox;
 count
-------
 10000
(1 row)

postgres=# select count(*) from foox;
ERROR: lost connection to parallel worker
postgres=# select count(*) from foox;
ERROR: lost connection to parallel worker
postgres=# select count(*) from foox;
ERROR: lost connection to parallel worker
postgres=# select count(*) from foox;
ERROR: lost connection to parallel worker
-- Thomas Munro http://www.enterprisedb.com
Attachment
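The attached patch itself is not reproduced here, but the kind of fault injection being described is simple. Purely as a hypothetical illustration (not the actual attachment), a hack near the top of fork_process() in fork_process.c along these lines would make a random fraction of worker launches fail:

	/*
	 * Hypothetical fault-injection hack, for testing only: pretend fork()
	 * failed for roughly a quarter of all attempts, so that the
	 * fork-failure handling paths can be exercised.
	 */
	if (random() % 4 == 0)
	{
		errno = EAGAIN;			/* simulate "out of processes" */
		return (pid_t) -1;
	}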
On Tue, Jan 23, 2018 at 8:25 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: >> Hmm, I think that case will be addressed because tuple queues can >> detect if the leader is not attached. It does in code path >> shm_mq_receive->shm_mq_counterparty_gone. In >> shm_mq_counterparty_gone, it can detect if the worker is gone by using >> GetBackgroundWorkerPid. Moreover, I have manually tested this >> particular case before saying your patch is fine. Do you have some >> other case in mind which I am missing? > > Hmm. Yeah. I can't seem to reach a stuck case and was probably just > confused and managed to confuse Robert too. If you make > fork_process() fail randomly (see attached), I see that there are a > couple of easily reachable failure modes (example session at bottom of > message): > > 1. HandleParallelMessages() is reached and raises a "lost connection > to parallel worker" error because shm_mq_receive() returns > SHM_MQ_DETACHED, I think because shm_mq_counterparty_gone() checked > GetBackgroundWorkerPid() just as you said. I guess that's happening > because some other process is (coincidentally) sending > PROCSIG_PARALLEL_MESSAGE at shutdown, causing us to notice that a > process is unexpectedly stopped. > > 2. WaitForParallelWorkersToFinish() is reached and raises a "parallel > worker failed to initialize" error. TupleQueueReaderNext() set done > to true, because shm_mq_receive() returned SHM_MQ_DETACHED. Once > again, that is because shm_mq_counterparty_gone() returned true. This > is the bit Robert and I missed in our off-list discussion. > > As long as we always get our latch set by the postmaster after a fork > failure (ie kill SIGUSR1) and after GetBackgroundWorkerPid() is > guaranteed to return BGWH_STOPPED after that, and as long as we only > ever use latch/CFI loops to wait, and as long as we try to read from a > shm_mq, then I don't see a failure mode that hangs. What about the parallel_leader_participation=off case? -- Peter Geoghegan
On Wed, Jan 24, 2018 at 10:03 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Tue, Jan 23, 2018 at 8:25 PM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >>> Hmm, I think that case will be addressed because tuple queues can >>> detect if the leader is not attached. It does in code path >>> shm_mq_receive->shm_mq_counterparty_gone. In >>> shm_mq_counterparty_gone, it can detect if the worker is gone by using >>> GetBackgroundWorkerPid. Moreover, I have manually tested this >>> particular case before saying your patch is fine. Do you have some >>> other case in mind which I am missing? >> >> Hmm. Yeah. I can't seem to reach a stuck case and was probably just >> confused and managed to confuse Robert too. If you make >> fork_process() fail randomly (see attached), I see that there are a >> couple of easily reachable failure modes (example session at bottom of >> message): >> >> 1. HandleParallelMessages() is reached and raises a "lost connection >> to parallel worker" error because shm_mq_receive() returns >> SHM_MQ_DETACHED, I think because shm_mq_counterparty_gone() checked >> GetBackgroundWorkerPid() just as you said. I guess that's happening >> because some other process is (coincidentally) sending >> PROCSIG_PARALLEL_MESSAGE at shutdown, causing us to notice that a >> process is unexpectedly stopped. >> >> 2. WaitForParallelWorkersToFinish() is reached and raises a "parallel >> worker failed to initialize" error. TupleQueueReaderNext() set done >> to true, because shm_mq_receive() returned SHM_MQ_DETACHED. Once >> again, that is because shm_mq_counterparty_gone() returned true. This >> is the bit Robert and I missed in our off-list discussion. >> >> As long as we always get our latch set by the postmaster after a fork >> failure (ie kill SIGUSR1) and after GetBackgroundWorkerPid() is >> guaranteed to return BGWH_STOPPED after that, and as long as we only >> ever use latch/CFI loops to wait, and as long as we try to read from a >> shm_mq, then I don't see a failure mode that hangs. > > What about the parallel_leader_participation=off case? > There is nothing special about that case, there shouldn't be any problem till we can detect the worker failures appropriately. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Jan 24, 2018 at 9:55 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Wed, Jan 24, 2018 at 3:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> On Wed, Jan 24, 2018 at 2:35 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> In the case of Gather, for example, we won't >>> WaitForParallelWorkersToFinish() until we've read all the tuples. If >>> there's a tuple queue that does not yet have a sender, then we'll just >>> wait for it to get one, even if the sender fell victim to a fork >>> failure and is never going to show up. >>> >> >> Hmm, I think that case will be addressed because tuple queues can >> detect if the leader is not attached. It does in code path >> shm_mq_receive->shm_mq_counterparty_gone. In >> shm_mq_counterparty_gone, it can detect if the worker is gone by using >> GetBackgroundWorkerPid. Moreover, I have manually tested this >> particular case before saying your patch is fine. Do you have some >> other case in mind which I am missing? > > Hmm. Yeah. I can't seem to reach a stuck case and was probably just > confused and managed to confuse Robert too. If you make > fork_process() fail randomly (see attached), I see that there are a > couple of easily reachable failure modes (example session at bottom of > message): > In short, we are good with committed code. Right? > 1. HandleParallelMessages() is reached and raises a "lost connection > to parallel worker" error because shm_mq_receive() returns > SHM_MQ_DETACHED, I think because shm_mq_counterparty_gone() checked > GetBackgroundWorkerPid() just as you said. I guess that's happening > because some other process is (coincidentally) sending > PROCSIG_PARALLEL_MESSAGE at shutdown, causing us to notice that a > process is unexpectedly stopped. > > 2. WaitForParallelWorkersToFinish() is reached and raises a "parallel > worker failed to initialize" error. TupleQueueReaderNext() set done > to true, because shm_mq_receive() returned SHM_MQ_DETACHED. Once > again, that is because shm_mq_counterparty_gone() returned true. This > is the bit Robert and I missed in our off-list discussion. > > As long as we always get our latch set by the postmaster after a fork > failure (ie kill SIGUSR1) and after GetBackgroundWorkerPid() is > guaranteed to return BGWH_STOPPED after that, and as long as we only > ever use latch/CFI loops to wait, and as long as we try to read from a > shm_mq, then I don't see a failure mode that hangs. > > This doesn't help a parallel.c client that doesn't try to read from a > shm_mq though, like parallel CREATE INDEX (using the proposed design > without dynamic Barrier). > Yes, this is what I am trying to explain on parallel create index thread. I think there we need to either use WaitForParallelWorkersToFinish or WaitForParallelWorkersToAttach (a new API as proposed in that thread) if we don't want to use barriers. I see a minor disadvantage in using WaitForParallelWorkersToFinish which I will say on that thread. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Jan 24, 2018 at 5:43 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> Hmm. Yeah. I can't seem to reach a stuck case and was probably just >> confused and managed to confuse Robert too. If you make >> fork_process() fail randomly (see attached), I see that there are a >> couple of easily reachable failure modes (example session at bottom of >> message): >> > > In short, we are good with committed code. Right? Yep. Sorry for the noise. > Yes, this is what I am trying to explain on parallel create index > thread. I think there we need to either use > WaitForParallelWorkersToFinish or WaitForParallelWorkersToAttach (a > new API as proposed in that thread) if we don't want to use barriers. > I see a minor disadvantage in using WaitForParallelWorkersToFinish > which I will say on that thread. Ah, I see. So if you wait for them to attach you can detect unexpected dead workers (via shm_mq_receive), at the cost of having the leader wasting time waiting around for forked processes to say hello when it could instead be sorting tuples. -- Thomas Munro http://www.enterprisedb.com
On Tue, Jan 23, 2018 at 9:02 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: >> Yes, this is what I am trying to explain on parallel create index >> thread. I think there we need to either use >> WaitForParallelWorkersToFinish or WaitForParallelWorkersToAttach (a >> new API as proposed in that thread) if we don't want to use barriers. >> I see a minor disadvantage in using WaitForParallelWorkersToFinish >> which I will say on that thread. > > Ah, I see. So if you wait for them to attach you can detect > unexpected dead workers (via shm_mq_receive), at the cost of having > the leader wasting time waiting around for forked processes to say > hello when it could instead be sorting tuples. The leader can go ahead and sort before calling something like a new WaitForParallelWorkersToAttach() function (or even WaitForParallelWorkersToFinish()). If we did add a WaitForParallelWorkersToAttach() function, then the performance hit would probably not be noticeable in practice. The parallel_leader_participates case would still work without special care (that's the main hazard that makes using a barrier seem unappealing). -- Peter Geoghegan
On Wed, Jan 24, 2018 at 10:38 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Tue, Jan 23, 2018 at 9:02 PM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >>> Yes, this is what I am trying to explain on parallel create index >>> thread. I think there we need to either use >>> WaitForParallelWorkersToFinish or WaitForParallelWorkersToAttach (a >>> new API as proposed in that thread) if we don't want to use barriers. >>> I see a minor disadvantage in using WaitForParallelWorkersToFinish >>> which I will say on that thread. >> >> Ah, I see. So if you wait for them to attach you can detect >> unexpected dead workers (via shm_mq_receive), at the cost of having >> the leader wasting time waiting around for forked processes to say >> hello when it could instead be sorting tuples. > > The leader can go ahead and sort before calling something like a new > WaitForParallelWorkersToAttach() function (or even > WaitForParallelWorkersToFinish()). If we did add a > WaitForParallelWorkersToAttach() function, then the performance hit > would probably not be noticeable in practice. The > parallel_leader_participates case would still work without special > care (that's the main hazard that makes using a barrier seem > unappealing). > +1. I also think so. Another advantage is that even if one of the workers is not able to start, we don't need to return an error at the end of the query, rather we can detect that situation early and can execute the query successfully. OTOH, if we are not convinced about performance implications, then WaitForParallelWorkersToFinish should serve the purpose which can be called at later point of time. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Tue, Jan 23, 2018 at 9:38 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Wed, Jan 24, 2018 at 10:38 AM, Peter Geoghegan <pg@bowt.ie> wrote: >> The leader can go ahead and sort before calling something like a new >> WaitForParallelWorkersToAttach() function (or even >> WaitForParallelWorkersToFinish()). If we did add a >> WaitForParallelWorkersToAttach() function, then the performance hit >> would probably not be noticeable in practice. The >> parallel_leader_participates case would still work without special >> care (that's the main hazard that makes using a barrier seem >> unappealing). >> > > +1. I also think so. Another advantage is that even if one of the > workers is not able to start, we don't need to return an error at the > end of the query, rather we can detect that situation early and can > execute the query successfully. OTOH, if we are not convinced about > performance implications, then WaitForParallelWorkersToFinish should > serve the purpose which can be called at later point of time. There is another disadvantage to just using WaitForParallelWorkersToFinish() to wait for nbtsort.c workers to finish (and not inventing a WaitForParallelWorkersToAttach() function, and calling that instead), which is: there can be no custom wait event that specifically mentions parallel CREATE INDEX. -- Peter Geoghegan
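To make the proposed API concrete, here is a hedged sketch of what a WaitForParallelWorkersToAttach() could look like; this is not a committed function at the time of writing, and the field names and wait-event choice are assumptions based on the discussion above. The idea is to sleep on the leader's latch until every launched worker has either attached to its error queue or been reported as stopped by the postmaster:

/*
 * Sketch only: wait until every launched worker has attached to its error
 * queue or is known to have failed to start.
 */
void
WaitForParallelWorkersToAttach(ParallelContext *pcxt)
{
	int			i;

	for (i = 0; i < pcxt->nworkers_launched; ++i)
	{
		for (;;)
		{
			pid_t		pid;
			shm_mq	   *mq;

			CHECK_FOR_INTERRUPTS();

			/* Already gave up on this worker, or already heard from it? */
			if (pcxt->worker[i].error_mqh == NULL ||
				pcxt->any_message_received[i])
				break;

			/* Attached to its error queue?  Then it started successfully. */
			mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
			if (shm_mq_get_sender(mq) != NULL)
				break;

			/* Stopped without ever attaching?  Then it never started. */
			if (GetBackgroundWorkerPid(pcxt->worker[i].bgwhandle, &pid) ==
				BGWH_STOPPED)
				ereport(ERROR,
						(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
						 errmsg("parallel worker failed to initialize")));

			/*
			 * Otherwise sleep; the postmaster's notification sets our latch
			 * when the worker starts or stops.
			 */
			WaitLatch(MyLatch, WL_LATCH_SET, -1, WAIT_EVENT_BGWORKER_STARTUP);
			ResetLatch(MyLatch);
		}
	}
}

Once something like this returns without error, nworkers_launched can be trusted as the number of workers that will actually participate.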
On Wed, Jan 24, 2018 at 5:25 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > If there were some way for the postmaster to cause reason > PROCSIG_PARALLEL_MESSAGE to be set in the leader process instead of > just notification via kill(SIGUSR1) when it fails to fork a parallel > worker, we'd get (1) for free in any latch/CFI loop code. But I > understand that we can't do that by project edict. Based on the above observation, here is a terrible idea you'll all hate. It is pessimistic and expensive: it thinks that every latch wake might be the postmaster telling us it's failed to fork() a parallel worker, until we've seen a sign of life on every worker's error queue. Untested illustration code only. This is the only way I've come up with to discover fork failure in any latch/CFI loop (ie without requiring client code to explicitly try to read either error or tuple queues). -- Thomas Munro http://www.enterprisedb.com
Attachment
On Wed, Jan 24, 2018 at 1:57 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Wed, Jan 24, 2018 at 5:25 PM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> If there were some way for the postmaster to cause reason >> PROCSIG_PARALLEL_MESSAGE to be set in the leader process instead of >> just notification via kill(SIGUSR1) when it fails to fork a parallel >> worker, we'd get (1) for free in any latch/CFI loop code. But I >> understand that we can't do that by project edict. > > Based on the above observation, here is a terrible idea you'll all > hate. It is pessimistic and expensive: it thinks that every latch > wake might be the postmaster telling us it's failed to fork() a > parallel worker, until we've seen a sign of life on every worker's > error queue. Untested illustration code only. This is the only way > I've come up with to discover fork failure in any latch/CFI loop (ie > without requiring client code to explicitly try to read either error > or tuple queues). The question, I suppose, is how expensive this is in the real world. If it's actually not a cost that anybody is likely to notice, then I think we should pursue this approach. I wouldn't put too much weight on keeping this simple for users of the parallel infrastructure, though, because something like Amit's WaitForParallelWorkersToAttach() idea still seems acceptable. "Call this function before trusting the finality of nworkers_launched" isn't too onerous a rule to have to follow. -- Peter Geoghegan
On Tue, Jan 23, 2018 at 9:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > Hmm, I think that case will be addressed because tuple queues can > detect if the leader is not attached. It does in code path > shm_mq_receive->shm_mq_counterparty_gone. In > shm_mq_counterparty_gone, it can detect if the worker is gone by using > GetBackgroundWorkerPid. Moreover, I have manually tested this > particular case before saying your patch is fine. Do you have some > other case in mind which I am missing? Oh, right. I had forgotten that I taught shm_mq_attach() to take an optional BackgroundWorkerHandle precisely to solve this kind of problem. I can't decide whether to feel smart because I did that or dumb because I forgot that I did it. But I think there's still a problem: given recent commits, that will cause us to ERROR out if we tried to read from a tuple queue for a worker which failed to start, or that starts and then dies before attaching to the queue. But there's no guarantee we'll try to read from that queue in the first place. Normally, that gets triggered when we get PROCSIG_PARALLEL_MESSAGE, but if the worker never started up in the first place, it can't send that, and the postmaster won't. In Thomas's test case, he's using 4 workers, and if even one of those manages to start, then it'll probably do so after any fork failures have already occurred, and the error will be noticed when that process sends its first message to the leader through the error queue, because it'll send PROCSIG_PARALLEL_MESSAGE via all the queues. If all of the workers fail to start, then that doesn't help. But it still manages to detect the failure in that case because it reaches WaitForParallelWorkersToFinish, which we just patched. But how does that happen, anyway? Shouldn't it get stuck waiting for the tuple queues to which nobody's attached yet? The reason it doesn't is because ExecParallelCreateReaders() calls shm_mq_set_handle() for each queue, which causes the tuple queues to be regarded as detached if the worker fails to start. A detached tuple queue, in general, is not an error condition: it just means that worker has no more tuples. So I think what happens is as follows. If parallel_leader_participation = on, then the leader runs the whole plan, producing however many tuples the plan should have generated, and then afterwards waits for the workers to finish, tripping an error. If parallel_leader_participation = off, the leader doesn't try to run the plan and is perfectly happy with the fact that every worker has produced no tuples; since, as far as it knows, the plan is now fully executed, it waits for them to shut down, tripping an error. I guess that works, but it seems more like blind luck than good design. Parallel CREATE INDEX fails to be as "lucky" as Gather because there's nothing in parallel CREATE INDEX that lets it skip waiting for a worker which doesn't start -- and in general it seems unacceptable to impose a coding requirement that all future parallel operations must fail to notice the difference between a worker that completed successfully and one that never ran at all. If we made the Gather node wait only for workers that actually reached the Gather -- either by using a Barrier or by some other technique -- then this would be a lot less fragile. And the same kind of technique would work for parallel CREATE INDEX. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
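The mechanism Robert describes is the shm_mq_set_handle() call made for each tuple queue. Roughly (paraphrased rather than quoted, and details vary by version), ExecParallelCreateReaders() in execParallel.c does the following, which is what lets shm_mq treat a never-started worker's tuple queue as detached rather than waiting for it forever:

/*
 * Paraphrased sketch of ExecParallelCreateReaders(): associate each tuple
 * queue with its worker's BackgroundWorkerHandle before creating readers.
 */
void
ExecParallelCreateReaders(ParallelExecutorInfo *pei)
{
	int			nworkers = pei->pcxt->nworkers_launched;
	int			i;

	Assert(pei->reader == NULL);

	if (nworkers > 0)
	{
		pei->reader = (TupleQueueReader **)
			palloc(nworkers * sizeof(TupleQueueReader *));

		for (i = 0; i < nworkers; i++)
		{
			/* A queue with a handle reads as detached if the worker dies. */
			shm_mq_set_handle(pei->tqueue[i],
							  pei->pcxt->worker[i].bgwhandle);
			pei->reader[i] = CreateTupleQueueReader(pei->tqueue[i]);
		}
	}
}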
On Wed, Jan 24, 2018 at 12:05 PM, Robert Haas <robertmhaas@gmail.com> wrote: > In Thomas's test case, he's using 4 workers, and if even one of those > manages to start, then it'll probably do so after any fork failures > have already occurred, and the error will be noticed when that process > sends its first message to the leader through the error queue, because > it'll send PROCSIG_PARALLEL_MESSAGE via all the queues. If all of the > workers fail to start, then that doesn't help. But it still manages > to detect the failure in that case because it reaches > WaitForParallelWorkersToFinish, which we just patched. > > But how does that happen, anyway? Shouldn't it get stuck waiting for > the tuple queues to which nobody's attached yet? The reason it > doesn't is because > ExecParallelCreateReaders() calls shm_mq_set_handle() for each queue, > which causes the tuple queues to be regarded as detached if the worker > fails to start. A detached tuple queue, in general, is not an error > condition: it just means that worker has no more tuples. This explains all the confusion. Amit told me that using a tuple queue made all the difference here. Even till, it seemed surprising that we'd rely on that from such a long distance from within nodeGather.c. > I guess that works, but it seems more like blind luck than good > design. Parallel CREATE INDEX fails to be as "lucky" as Gather > because there's nothing in parallel CREATE INDEX that lets it skip > waiting for a worker which doesn't start -- and in general it seems > unacceptable to impose a coding requirement that all future parallel > operations must fail to notice the difference between a worker that > completed successfully and one that never ran at all. +1. > If we made the Gather node wait only for workers that actually reached > the Gather -- either by using a Barrier or by some other technique -- > then this would be a lot less fragile. And the same kind of technique > would work for parallel CREATE INDEX. The use of a barrier has problems of its own [1], though, of which one is unique to the parallel_leader_participation=off case. I thought that you yourself agreed with this [2] -- do you? Another argument against using a barrier in this way is that it seems like way too much mechanism to solve a simple problem. Why should a client of parallel.h not be able to rely on nworkers_launched (perhaps only after "verifying it can be trusted")? That seem like a pretty reasonable requirement for clients to have for any framework for parallel imperative programming. I think that we should implement "some other technique", instead of using a barrier. As I've said, Amit's WaitForParallelWorkersToAttach() idea seems like a promising "other technique". [1] https://www.postgresql.org/message-id/CAA4eK1+a0OF4M231vBgPr_0Ygg_BNmRGZLiB7WQDE-FYBSyrGg@mail.gmail.com [2] https://www.postgresql.org/message-id/CA+TgmoaF8UA8v8hP=CcoqUc50pucPC8ABj-_yyC++yGggjWFsw@mail.gmail.com -- Peter Geoghegan
On Wed, Jan 24, 2018 at 5:52 PM, Peter Geoghegan <pg@bowt.ie> wrote: >> If we made the Gather node wait only for workers that actually reached >> the Gather -- either by using a Barrier or by some other technique -- >> then this would be a lot less fragile. And the same kind of technique >> would work for parallel CREATE INDEX. > > The use of a barrier has problems of its own [1], though, of which one > is unique to the parallel_leader_participation=off case. I thought > that you yourself agreed with this [2] -- do you? > > Another argument against using a barrier in this way is that it seems > like way too much mechanism to solve a simple problem. Why should a > client of parallel.h not be able to rely on nworkers_launched (perhaps > only after "verifying it can be trusted")? That seem like a pretty > reasonable requirement for clients to have for any framework for > parallel imperative programming. Well, I've been resisting that approach from the very beginning of parallel query. Eventually, I hope that we're going to go in the direction of changing our mind about how many workers parallel operations use "on the fly". For example, if there are 8 parallel workers available and 4 of them are in use, and you start a query (or index build) that wants 6 but only gets 4, it would be nice if the other 2 could join later after the other operation finishes and frees some up. That, of course, won't work very well if parallel operations are coded in such a way that the number of workers must be nailed down at the very beginning. Now maybe all that seems like pie in the sky, and perhaps it is, but I hold out hope. For queries, there is another consideration, which is that some queries may run with parallelism but actually finish quite quickly - it's not desirable to make the leader wait for workers to start when it could be busy computing. That's a lesser consideration for bulk operations like parallel CREATE INDEX, but even there I don't think it's totally negligible. For both reasons, it's much better, or so it seems to me, if parallel operations are coded to work with the number of workers that show up, rather than being inflexibly tied to a particular worker count. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jan 24, 2018 at 3:37 PM, Robert Haas <robertmhaas@gmail.com> wrote: > Well, I've been resisting that approach from the very beginning of > parallel query. Eventually, I hope that we're going to go in the > direction of changing our mind about how many workers parallel > operations use "on the fly". For example, if there are 8 parallel > workers available and 4 of them are in use, and you start a query (or > index build) that wants 6 but only gets 4, it would be nice if the > other 2 could join later after the other operation finishes and frees > some up. That seems like a worthwhile high-level goal. I remember looking into Intel Threading Building Blocks many years ago, and seeing some interesting ideas there. According to Wikipedia, "TBB implements work stealing to balance a parallel workload across available processing cores in order to increase core utilization and therefore scaling". The programmer does not operate in terms of an explicit number of threads, and there are probably certain types of problems that this has an advantage with. That model also has its costs, though, and I don't think it's every going to supplant a lower level approach. In an ideal world, you have both things, because TBB's approach apparently has high coordination overhead on many core systems. > That, of course, won't work very well if parallel operations > are coded in such a way that the number of workers must be nailed down > at the very beginning. But my whole approach to sorting is based on the idea that each worker produces a roughly even amount of output to merge. I don't see any scope to do better for parallel CREATE INDEX. (Other uses for parallel sort are another matter, though.) > Now maybe all that seems like pie in the sky, and perhaps it is, but I > hold out hope. For queries, there is another consideration, which is > that some queries may run with parallelism but actually finish quite > quickly - it's not desirable to make the leader wait for workers to > start when it could be busy computing. That's a lesser consideration > for bulk operations like parallel CREATE INDEX, but even there I don't > think it's totally negligible. Since I don't have to start this until the leader stops participating as a worker, there is no wait in the leader. In the vast majority of cases, a call to something like WaitForParallelWorkersToAttach() ends up looking at state in shared memory, immediately determining that every launched process initialized successfully. The overhead should be negligible in the real world. > For both reasons, it's much better, or so it seems to me, if parallel > operations are coded to work with the number of workers that show up, > rather than being inflexibly tied to a particular worker count. I've been clear from day one that my approach to parallel tuplesort isn't going to be that useful to parallel query in its first version. You need some kind of partitioning (a distribution sort of some kind) for that, and probably plenty of cooperation from within the executor. I've also said that I don't think we can do much better for parallel CREATE INDEX even *with* support for partitioning, which is something borne out by comparisons with other systems. My patch was always presented as an 80/20 solution. I have given you specific technical reasons why I think that using a barrier is at least a bad idea for nbtsort.c, and probably for nodeGather.c, too. Those problems will need to be worked through if you're not going to concede the point on using a barrier. 
Your aspirations around not assuming that workers cannot join later seem like good ones, broadly speaking, but they are not particularly applicable to how *anything* happens to work now. Besides all this, I'm not even suggesting that I need to know the number of workers up front for parallel CREATE INDEX. Perhaps nworkers_launched can be incremented after the fact following some later enhancement to the parallel infrastructure, in which case parallel CREATE INDEX will theoretically be prepared to take advantage right away (though other parallel sort operations seem more likely to *actually* benefit). That will be a job for the parallel infrastructure, though, not for each and every parallel operation -- how else could we possibly hope to add more workers that become available half way through, as part of a future enhancement to the parallel infrastructure? Surely every caller to CreateParallelContext() should not need to invent their own way of doing this. All I want is to be able to rely on nworkers_launched. That's not in tension with this other goal/aspiration, and actually seems to complement it. -- Peter Geoghegan