Thread: [HACKERS] parallel.c oblivion of worker-startup failures
Some time back, Tom Lane reported [1] the $Subject problem. I have looked into the issue and found that it affects not only parallel workers but the general background worker machinery as well, in situations where fork() or some similar failure occurs.

The first problem is that after we register a dynamic worker, the mechanisms for learning whether the worker has started (WaitForBackgroundWorkerStartup/GetBackgroundWorkerPid) won't give the right answer if a fork failure happens. Also, in cases where the worker is marked not to restart after a crash, the postmaster doesn't notify the backend when it fails to start the worker, which can make the backend wait forever because it is oblivious to the fact that the worker never started.

Apart from these general problems in the background worker machinery, parallel.c assumes that once it has registered the dynamic workers, they will start and perform their work. We need to ensure that if the postmaster is unable to start parallel workers due to a fork failure or some similar situation, the backend doesn't keep waiting forever. To fix it, before waiting for workers to finish, we can check whether the worker exists at all. The attached patch fixes these problems.

Another way to fix the parallel-query part of the problem would be for the master backend to wait for the workers to start, after registering them, before setting up the various queues used to communicate with them. I think that could be quite expensive.

Thoughts?

[1] - https://www.postgresql.org/message-id/4905.1492813727@sss.pgh.pa.us

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment
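For context, here is a minimal sketch (not code from the attached patch; the extension and entry-point names are hypothetical) of the register-and-wait pattern being described, and of where the hang can occur:

#include "postgres.h"
#include "miscadmin.h"
#include "postmaster/bgworker.h"

/* Hypothetical illustration: launch one dynamic worker and wait for it. */
static void
launch_one_worker(void)
{
    BackgroundWorker worker;
    BackgroundWorkerHandle *handle;
    BgwHandleStatus status;
    pid_t       pid;

    memset(&worker, 0, sizeof(worker));
    worker.bgw_flags = BGWORKER_SHMEM_ACCESS;
    worker.bgw_start_time = BgWorkerStart_ConsistentState;
    worker.bgw_restart_time = BGW_NEVER_RESTART;
    snprintf(worker.bgw_name, BGW_MAXLEN, "demo worker");
    snprintf(worker.bgw_library_name, BGW_MAXLEN, "my_extension");     /* hypothetical */
    snprintf(worker.bgw_function_name, BGW_MAXLEN, "my_worker_main");  /* hypothetical */
    worker.bgw_notify_pid = MyProcPid;  /* ask the postmaster to signal us */

    if (!RegisterDynamicBackgroundWorker(&worker, &handle))
        ereport(ERROR, (errmsg("could not register background worker")));

    /*
     * If the postmaster's fork() fails, the slot is forgotten without the
     * registrant being signalled, so (before the fix discussed here) this
     * call could block indefinitely instead of reporting BGWH_STOPPED.
     */
    status = WaitForBackgroundWorkerStartup(handle, &pid);
    if (status != BGWH_STARTED)
        ereport(ERROR, (errmsg("background worker failed to start")));
}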
Amit Kapila <amit.kapila16@gmail.com> writes:
> Attached patch fixes these problems.

Hmm, this patch adds a kill(notify_pid) after one call to ForgetBackgroundWorker, but the postmaster has several more such calls. Shouldn't they all notify the notify_pid? Should we move that functionality into ForgetBackgroundWorker itself, so we can't forget it again?

			regards, tom lane
On Mon, Sep 18, 2017 at 10:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Kapila <amit.kapila16@gmail.com> writes:
>> Attached patch fixes these problems.
>
> Hmm, this patch adds a kill(notify_pid) after one call to ForgetBackgroundWorker, but the postmaster has several more such calls. Shouldn't they all notify the notify_pid? Should we move that functionality into ForgetBackgroundWorker itself, so we can't forget it again?
>

Among other places, we already notify during ReportBackgroundWorkerExit(). However, it seems to me that all the other places, except the one where this patch has added a call to notify, don't need such a call. The other calls, such as the ones in DetermineSleepTime and ResetBackgroundWorkerCrashTimes, are made for a crashed worker, which I don't think requires a notification to the backend because that backend would itself have been restarted. The remaining place where we call ForgetBackgroundWorker is in maybe_start_bgworkers when rw_terminate is set to true, which again is either the case of a worker crash or of someone having explicitly asked to terminate the worker, in which case we already send a notification.

I think we need to notify the backend on start, stop, and failure to start a worker. OTOH, if it is harmless to send a notification even after the worker has crashed, then we can just move that functionality into ForgetBackgroundWorker itself, which will simplify the code; in fact that is the first thing that occurred to me as well, but I didn't do it that way because I wasn't sure we want to send a notification in all kinds of cases.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
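For illustration, the notification being discussed boils down to something like the following sketch (the helper name is made up; this is not the committed code):

#include "postgres.h"
#include <signal.h>
#include "postmaster/bgworker_internals.h"

/*
 * Sketch only: wake the backend that registered a worker whenever its slot
 * is forgotten, so that WaitForBackgroundWorkerStartup() or
 * GetBackgroundWorkerPid() can observe the failure instead of waiting
 * forever.
 */
static void
notify_registrant(RegisteredBgWorker *rw)
{
    if (rw->rw_worker.bgw_notify_pid != 0)
        kill(rw->rw_worker.bgw_notify_pid, SIGUSR1);
}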
On Tue, Sep 19, 2017 at 8:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Mon, Sep 18, 2017 at 10:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Amit Kapila <amit.kapila16@gmail.com> writes: >>> Attached patch fixes these problems. >> >> Hmm, this patch adds a kill(notify_pid) after one call to >> ForgetBackgroundWorker, but the postmaster has several more such calls. >> Shouldn't they all notify the notify_pid? Should we move that >> functionality into ForgetBackgroundWorker itself, so we can't forget >> it again? >> > > Among other places, we already notify during > ReportBackgroundWorkerExit(). However, it seems to me that all other > places except where this patch has added a call to notify doesn't need > such a call. The other places like in DetermineSleepTime and > ResetBackgroundWorkerCrashTimes is called for a crashed worker which I > don't think requires notification to the backend as that backend > itself would have restarted. The other place where we call > ForgetBackgroundWorker is in maybe_start_bgworkers when rw_terminate > is set to true which again seems to be either the case of worker crash > or when someone has explicitly asked to terminate the worker in which > case we already send a notification. > > I think we need to notify the backend on start, stop and failure to > start a worker. OTOH, if it is harmless to send a notification even > after the worker is crashed, then we can just move that functionality > into ForgetBackgroundWorker itself as that will simplify the code and > infact that is the first thing that occurred to me as well, but I > haven't done that way as I was not sure if we want to send > notification in all kind of cases. > The patch still applies (with some hunks). I have added it in CF [1] to avoid losing track. [1] - https://commitfest.postgresql.org/15/1341/ -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Oct 27, 2017 at 9:54 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > The patch still applies (with some hunks). I have added it in CF [1] > to avoid losing track. > > [1] - https://commitfest.postgresql.org/15/1341/ This did not get reviews and the patch still applies. I am moving it to next CF. -- Michael
On Wed, Nov 29, 2017 at 12:11 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Fri, Oct 27, 2017 at 9:54 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> The patch still applies (with some hunks). I have added it in CF [1] to avoid losing track.
>>
>> [1] - https://commitfest.postgresql.org/15/1341/
>
> This did not get reviews and the patch still applies. I am moving it to next CF.

I'd like to break this into two patches, one to fix the general background worker problems and another to fix the problems specific to parallel query.

The attached version extracts the two fixes for general background worker problems from Amit's patch and adjusts the comments a bit more extensively than he did. Barring objections, I'll commit and back-patch this; then we can deal with the other part of this afterward.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On Tue, Dec 5, 2017 at 11:34 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Nov 29, 2017 at 12:11 AM, Michael Paquier > <michael.paquier@gmail.com> wrote: >> On Fri, Oct 27, 2017 at 9:54 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> The patch still applies (with some hunks). I have added it in CF [1] >>> to avoid losing track. >>> >>> [1] - https://commitfest.postgresql.org/15/1341/ >> >> This did not get reviews and the patch still applies. I am moving it to next CF. > > I'd like to break this into two patches, one to fix the general > background worker problems and another to fix the problems specific to > parallel query. > > The attached version extracts the two fixes for general background > worker problems from Amit's patch and adjust the comments a bit more > extensively than he did. > Looks good to me. > Barring objections, I'll commit and > back-patch this; then we can deal with the other part of this > afterward. > Sure, I will rebase on top of the commit unless you have some comments on the remaining part. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Dec 6, 2017 at 1:43 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Looks good to me.

Committed and back-patched to 9.4 where dynamic background workers were introduced.

>> Barring objections, I'll commit and back-patch this; then we can deal with the other part of this afterward.
>
> Sure, I will rebase on top of the commit unless you have some comments on the remaining part.

I'm not in love with that part of the fix; the other parts of that if statement are just testing variables, and a function call that takes and releases an LWLock is a lot more expensive. Furthermore, that test will often be hit in practice, because we'll often arrive at this function before workers have actually finished. On top of that, we'll typically arrive here having already communicated with the worker in some way, such as by receiving tuples, which means that in most cases we already know that the worker was alive at least at some point, and therefore the extra test isn't necessary. We only need that test, if I understand correctly, to cover the failure-to-launch case, which is presumably very rare.

All that having been said, I don't quite know how to do it better. It would be easy enough to modify this function so that if it ever received a result other than BGWH_NOT_YET_STARTED for a worker, it wouldn't call this function again for that worker. However, that's only mitigating the problem, not really fixing it. We'll still end up frequently inquiring - admittedly only once - about the status of a worker with which we've previously communicated via the tuple queue. Maybe we just have to live with that problem, but do you (or does anyone else) have an idea?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > I'm not in love with that part of the fix; the other parts of that if > statement are just testing variables, and a function call that takes > and releases an LWLock is a lot more expensive. Furthermore, that > test will often be hit in practice, because we'll often arrive at this > function before workers have actually finished. On top of that, we'll > typically arrive here having already communicated with the worker in > some way, such as by receiving tuples, which means that in most cases > we already know that the worker was alive at least at some point, and > therefore the extra test isn't necessary. We only need that test, if > I understand correctly, to cover the failure-to-launch case, which is > presumably very rare. Maybe track "worker is known to have launched" in the leader's state someplace? regards, tom lane
On Wed, Dec 6, 2017 at 9:54 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Maybe track "worker is known to have launched" in the leader's state > someplace? That's probably one piece of it, yes. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Dec 6, 2017 at 8:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Dec 6, 2017 at 1:43 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Looks good to me.
>
> Committed and back-patched to 9.4 where dynamic background workers were introduced.
>

Thanks.

>>> Barring objections, I'll commit and back-patch this; then we can deal with the other part of this afterward.
>>
>> Sure, I will rebase on top of the commit unless you have some comments on the remaining part.
>
> I'm not in love with that part of the fix; the other parts of that if statement are just testing variables, and a function call that takes and releases an LWLock is a lot more expensive. Furthermore, that test will often be hit in practice, because we'll often arrive at this function before workers have actually finished. On top of that, we'll typically arrive here having already communicated with the worker in some way, such as by receiving tuples, which means that in most cases we already know that the worker was alive at least at some point, and therefore the extra test isn't necessary. We only need that test, if I understand correctly, to cover the failure-to-launch case, which is presumably very rare.
>
> All that having been said, I don't quite know how to do it better. It would be easy enough to modify this function so that if it ever received a result other than BGWH_NOT_YET_STARTED for a worker, it wouldn't call this function again for that worker. However, that's only mitigating the problem, not really fixing it. We'll still end up frequently inquiring - admittedly only once - about the status of a worker with which we've previously communicated via the tuple queue. Maybe we just have to live with that problem, but do you (or does anyone else) have an idea?
>

I think the optimization you are suggesting has merit over what I have done in the patch. However, one point to note is that we are anyway going to call that function further down the path (WaitForParallelWorkersToExit -> WaitForBackgroundWorkerShutdown -> GetBackgroundWorkerPid), so we might want to take advantage of calling it the first time as well. We can actually cache the status of workers that have returned BGWH_STOPPED and use it later, so that we don't need to make this function call again for workers that are already stopped. We could even do what Tom is suggesting and avoid calling it for workers that are known to be launched, but I am really not sure that the function call is costly enough to justify maintaining one extra piece of state to avoid it.

While looking into this code path, I wonder how much we are gaining by having two separate calls (WaitForParallelWorkersToFinish and WaitForParallelWorkersToExit) to check the status of workers after finishing the parallel execution. They are used together in the rescan code path, so apparently there is no benefit in calling them separately there. OTOH, during shutdown of nodes, it will often be the case that they are called within a short duration of each other, except where we need to shut down the node for an early exit, such as when there is a LIMIT clause.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
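A rough sketch of the caching idea described above (the helper name and the known_stopped array are invented for illustration; this is not the attached patch):

#include "postgres.h"
#include "access/parallel.h"
#include "postmaster/bgworker.h"

/*
 * Sketch only: remember which workers have already been seen in
 * BGWH_STOPPED state, so that later passes don't have to re-check them.
 */
static void
update_stopped_cache(ParallelContext *pcxt, bool *known_stopped)
{
    int         i;

    for (i = 0; i < pcxt->nworkers_launched; ++i)
    {
        pid_t       pid;

        if (known_stopped[i])
            continue;           /* cached result from an earlier call */
        if (GetBackgroundWorkerPid(pcxt->worker[i].bgwhandle, &pid) == BGWH_STOPPED)
            known_stopped[i] = true;
    }
}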
On Fri, Dec 8, 2017 at 4:56 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think the optimization you are suggesting has merit over what I have done in the patch. However, one point to note is that we are anyway going to call that function further down the path (WaitForParallelWorkersToExit -> WaitForBackgroundWorkerShutdown -> GetBackgroundWorkerPid), so we might want to take advantage of calling it the first time as well. We can actually cache the status of workers that have returned BGWH_STOPPED and use it later, so that we don't need to make this function call again for workers that are already stopped. We could even do what Tom is suggesting and avoid calling it for workers that are known to be launched, but I am really not sure that the function call is costly enough to justify maintaining one extra piece of state to avoid it.

Well, let's do what optimization we can without making it too complicated.

> While looking into this code path, I wonder how much we are gaining by having two separate calls (WaitForParallelWorkersToFinish and WaitForParallelWorkersToExit) to check the status of workers after finishing the parallel execution. They are used together in the rescan code path, so apparently there is no benefit in calling them separately there. OTOH, during shutdown of nodes, it will often be the case that they are called within a short duration of each other, except where we need to shut down the node for an early exit, such as when there is a LIMIT clause.

I'm not 100% sure either, but if we're going to do anything about that, it seems like a topic for another patch. I don't think it's completely without value, because there is some time after we WaitForParallelWorkersToFinish and before we WaitForParallelWorkersToExit during which we can do things like retrieve instrumentation data and shut down other nodes, but whether it pulls its weight in code I don't know. This kind of grew up gradually: originally I/we didn't think of all of the cases where we needed the workers to actually exit rather than just indicating that they were done generating tuples, and the current situation is the result of a series of bug-fixes related to that oversight, so it's quite possible that a redesign would make sense.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Dec 8, 2017 at 9:45 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Dec 8, 2017 at 4:56 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> I think the optimization you are suggesting has a merit over what I >> have done in the patch. However, one point to note is that we are >> anyway going to call that function down in the path( >> WaitForParallelWorkersToExit->WaitForBackgroundWorkerShutdown->GetBackgroundWorkerPid), >> so we might want to take the advantage of calling it first time as >> well. We can actually cache the status of workers that have returned >> BGWH_STOPPED and use it later so that we don't need to make this >> function call again for workers which are already stopped. We can >> even do what Tom is suggesting to avoid calling it for workers which >> are known to be launched, but I am really not sure if that function >> call is costly enough that we need to maintain one extra state to >> avoid that. > > Well, let's do what optimization we can without making it too complicated. > Okay, see the attached and let me know if that suffices the need? >> While looking into this code path, I wonder how much we are gaining by >> having two separate calls (WaitForParallelWorkersToFinish and >> WaitForParallelWorkersToExit) to check the status of workers after >> finishing the parallel execution? They are used together in rescan >> code path, so apparently there is no benefit calling them separately >> there. OTOH, during shutdown of nodes, it will often be the case that >> they will be called in a short duration after each other except for >> the case where we need to shut down the node for the early exit like >> when there is a limit clause or such. > > I'm not 100% sure either, but if we're going to do anything about > that, it seems like a topic for another patch. > makes sense. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Sun, Dec 10, 2017 at 11:07 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Okay, see the attached and let me know if that suffices the need?

+ * Check for unexpected worker death. This will ensure that if
+ * the postmaster failed to start the worker, then we don't wait
+ * for it indefinitely. For workers that are known to be
+ * launched, we can rely on their error queue being freed once
+ * they exit.

Hmm. Is this really true? What if the worker starts up but then crashes before attaching to the error queue?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Dec 11, 2017 at 11:27 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Sun, Dec 10, 2017 at 11:07 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> Okay, see the attached and let me know if that suffices the need? > > + * Check for unexpected worker death. This will ensure that if > + * the postmaster failed to start the worker, then we don't wait > + * for it indefinitely. For workers that are known to be > + * launched, we can rely on their error queue being freed once > + * they exit. > > Hmm. Is this really true? What if the worker starts up but then > crashes before attaching to the error queue? > If the worker errored out before attaching to the error queue, then we can't rely on error queue being freed. However, in that case, the worker status will be BGWH_STOPPED. I have adjusted the comment accordingly. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Tue, Dec 12, 2017 at 9:00 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Dec 11, 2017 at 11:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sun, Dec 10, 2017 at 11:07 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> Okay, see the attached and let me know if that suffices the need?
>>
>> + * Check for unexpected worker death. This will ensure that if
>> + * the postmaster failed to start the worker, then we don't wait
>> + * for it indefinitely. For workers that are known to be
>> + * launched, we can rely on their error queue being freed once
>> + * they exit.
>>
>> Hmm. Is this really true? What if the worker starts up but then crashes before attaching to the error queue?
>>
>
> If the worker errored out before attaching to the error queue, then we can't rely on the error queue being freed. However, in that case, the worker status will be BGWH_STOPPED. I have adjusted the comment accordingly.
>

In particular, if the worker crashed (say somebody forcefully killed the worker), then I think it will lead to a restart of the server. So I am not sure that adding anything about that to the comment will make much sense.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Dec 11, 2017 at 10:41 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> In particular, if the worker crashed (say somebody forcefully killed the worker), then I think it will lead to a restart of the server. So I am not sure that adding anything about that to the comment will make much sense.

I think it's dangerous to rely on that assumption. If somebody stuck proc_exit(1) into the very beginning of the worker startup sequence, the worker would exit but the server would not restart.

As I started experimenting with this, I discovered that your patch is actually unsafe. Here's a test case that shows the breakage:

set force_parallel_mode = on;
select 1;

The expected outcome is:

 ?column?
----------
        1
(1 row)

If I hack do_start_bgworker() to make worker startup fail, like this:

--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -5701,7 +5701,7 @@ do_start_bgworker(RegisteredBgWorker *rw)
 #ifdef EXEC_BACKEND
     switch ((worker_pid = bgworker_forkexec(rw->rw_shmem_slot)))
 #else
-    switch ((worker_pid = fork_process()))
+    switch ((worker_pid = -1))
 #endif
     {
         case -1:

...then it hangs forever. With your patch it no longer hangs forever, which is good, but it also returns the wrong answer, which is bad:

 ?column?
----------
(0 rows)

The problem is that Gather sees nworkers_launched > 0 and assumes that it is therefore entitled not to perform a local scan.

I'm pretty sure that there could also be cases where this patch makes us miss an error reported by a worker; suppose the worker error is reported and the worker exits after WaitForParallelWorkersToFinish does CHECK_FOR_INTERRUPTS() but before we reach GetBackgroundWorkerPid(). The code will simply conclude that the worker is dead and stop waiting, never mind the fact that there's a pending error in the queue. This is exactly why I made parallel workers send a Terminate message to the leader instead of just disappearing -- the leader needs to see that message to know that it's reached the end of what the worker is going to send.

So, I think we need to regard fork failure and related failure modes as ERROR conditions. It's not equivalent to failing to register the background worker. When we fail to register the background worker, the parallel.c machinery knows at LaunchParallelWorkers() time that the worker will never show up and lets the calling code know by setting nworkers_launched, and the calling code can then choose to proceed with that number of workers or ERROR out as it chooses. But once we decide on the value of nworkers_launched to report to callers, it's really not OK for workers to just silently not ever start, as the test case above shows. We could make it the job of code that uses the ParallelContext machinery to cope with that situation, but I think that's a bad plan, because every user would have to worry about this obscure case. Right now, the only in-core use of the ParallelContext machinery is Gather/Gather Merge, but there are pending patches for parallel index scan and parallel VACUUM, so we'll just end up adding this handling to more and more places. And for what? If you've got max_connections and/or max_parallel_workers set so high that your system can't fork(), you have a big problem, and forcing things that would have run using parallel query to run serially instead of erroring out is not going to fix it. I think that deciding we're going to handle fork() failure by reporting an ERROR is just fine.

So, here's a patch. This patch keeps track of whether we've ever received a message from each worker via the error queue. If there are no remaining workers which have neither exited cleanly nor sent us at least one message, then it calls GetBackgroundWorkerPid() for each one. If any worker reports that it is BGWH_STOPPED, then it double-checks whether the worker has attached to the error queue. If it did, then it assumes that we'll detect clean shutdown or an error on the next iteration and ignores the problem. If not, then the worker died without ever attaching to the error queue, so it throws an ERROR. I tested that this causes the test case above to throw a proper ERROR when fork_process() returns -1. Please review...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
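In outline, the check the patch adds looks roughly like the following sketch (simplified, with an invented helper name; the attached patch is authoritative):

#include "postgres.h"
#include "access/parallel.h"
#include "postmaster/bgworker.h"
#include "storage/shm_mq.h"

/*
 * Sketch only: a worker that is BGWH_STOPPED but never attached to its
 * error queue must have died before finishing startup (e.g. fork failure),
 * so report an error rather than waiting for it forever.
 */
static void
check_for_lost_workers(ParallelContext *pcxt, bool *any_message_received)
{
    int         i;

    for (i = 0; i < pcxt->nworkers_launched; ++i)
    {
        pid_t       pid;
        shm_mq     *mq;

        if (pcxt->worker[i].error_mqh == NULL)
            continue;           /* already shut down cleanly */
        if (any_message_received[i])
            continue;           /* known to have attached and run */
        if (GetBackgroundWorkerPid(pcxt->worker[i].bgwhandle, &pid) != BGWH_STOPPED)
            continue;           /* still running, or not started yet */

        /* Stopped: did it ever attach to its error queue? */
        mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
        if (shm_mq_get_sender(mq) == NULL)
            ereport(ERROR,
                    (errcode(ERRCODE_INTERNAL_ERROR),
                     errmsg("parallel worker failed to initialize")));
    }
}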
On Wed, Dec 13, 2017 at 12:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Dec 11, 2017 at 10:41 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> In particular, if the worker crashed (say somebody forcefully killed the worker), then I think it will lead to a restart of the server. So I am not sure that adding anything about that to the comment will make much sense.
>
> The problem is that Gather sees nworkers_launched > 0 and assumes that it is therefore entitled not to perform a local scan.
>

Yeah, I think that assumption is not right. I think that, ideally, it should wait for the workers to start up before making that assumption.

> I'm pretty sure that there could also be cases where this patch makes us miss an error reported by a worker; suppose the worker error is reported and the worker exits after WaitForParallelWorkersToFinish does CHECK_FOR_INTERRUPTS() but before we reach GetBackgroundWorkerPid(). The code will simply conclude that the worker is dead and stop waiting, never mind the fact that there's a pending error in the queue.

I have tried to reproduce this situation by adding an error case in the worker (just before the worker returns success (the 'X' message)) and by setting a breakpoint in WaitForParallelWorkersToFinish. What I noticed is that the worker is not allowed to exit until it receives some ack from the master (basically the worker waits at SendProcSignal after sending the message), and the error is reported properly.

> This is exactly why I made parallel workers send a Terminate message to the leader instead of just disappearing -- the leader needs to see that message to know that it's reached the end of what the worker is going to send.
>
> So, I think we need to regard fork failure and related failure modes as ERROR conditions. It's not equivalent to failing to register the background worker. When we fail to register the background worker, the parallel.c machinery knows at LaunchParallelWorkers() time that the worker will never show up and lets the calling code know by setting nworkers_launched, and the calling code can then choose to proceed with that number of workers or ERROR out as it chooses. But once we decide on the value of nworkers_launched to report to callers, it's really not OK for workers to just silently not ever start, as the test case above shows. We could make it the job of code that uses the ParallelContext machinery to cope with that situation, but I think that's a bad plan, because every user would have to worry about this obscure case. Right now, the only in-core use of the ParallelContext machinery is Gather/Gather Merge, but there are pending patches for parallel index scan and parallel VACUUM, so we'll just end up adding this handling to more and more places. And for what? If you've got max_connections and/or max_parallel_workers set so high that your system can't fork(), you have a big problem, and forcing things that would have run using parallel query to run serially instead of erroring out is not going to fix it. I think that deciding we're going to handle fork() failure by reporting an ERROR is just fine.
>

So, if I understand correctly, this means that it will return an error if there is a problem in even one worker (exited or not started), even though the other workers could easily have finished the work. Won't such an error lead to situations where, after running the query for one hour, the user sees a failure just because one of the many workers did not start (or some similar failure), even though the query would have completed without that worker? If so, that doesn't sound like sensible behavior.

> So, here's a patch. This patch keeps track of whether we've ever received a message from each worker via the error queue. If there are no remaining workers which have neither exited cleanly nor sent us at least one message, then it calls GetBackgroundWorkerPid() for each one. If any worker reports that it is BGWH_STOPPED, then it double-checks whether the worker has attached to the error queue. If it did, then it assumes that we'll detect clean shutdown or an error on the next iteration and ignores the problem. If not, then the worker died without ever attaching to the error queue, so it throws an ERROR. I tested that this causes the test case above to throw a proper ERROR when fork_process() returns -1. Please review...
>

This also doesn't appear to be completely safe. If we add proc_exit(1) after attaching to the error queue (say after pq_set_parallel_master) in the worker, then it will lead to a *hang*, because anyone_alive will still be false and, since it will find that the sender is set for the error queue, it won't return any failure. Now, I think that even if we check the worker status (which will be stopped) and break after the new error condition, it won't work, because it will still return zero rows in the case you reported above.

I think that if, before deciding not to scan locally, we check whether the workers have actually started, then my patch should work fine and we don't need to add any additional checks. Also, it won't create any performance issue, because we will perform such a check only when force_parallel_mode is not off or parallel leader participation is off, which I think are not common cases.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Dec 13, 2017 at 1:41 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > I have tried to reproduce this situation by adding error case in > worker (just before worker returns success ('X' message)) and by > having breakpoint in WaitForParallelWorkersToFinish. What, I noticed > is that worker is not allowed to exit till it receives some ack from > master (basically worker waits at SendProcSignal after sending > message) and error is reported properly. SendProcSignal doesn't look to me like it can do anything that can block, so I don't understand this. > So, if I understand correctly, this means that it will return an error > even if there is a problem in one worker (exited or not started) even > though other workers would have easily finished the work. Won't such > an error can lead to situations where after running the query for one > hour the user might see some failure just because one of the many > workers is not started (or some similar failure) even though the query > would have been completed without that? If so, that doesn't sound > like a sensible behavior. I think it's the *only* sensible behavior. parallel.c does not know that the query would have completed successfully without that worker, or at least it doesn't know that it would have returned the right answer. It *might* be the case that the caller wasn't depending on that worker to do anything, but parallel.c doesn't know that. > This also doesn't appear to be completely safe. If we add > proc_exit(1) after attaching to error queue (say after > pq_set_parallel_master) in the worker, then it will lead to *hang* as > anyone_alive will still be false and as it will find that the sender > is set for the error queue, it won't return any failure. Now, I think > even if we check worker status (which will be stopped) and break after > the new error condition, it won't work as it will still return zero > rows in the case reported by you above. Hmm, there might still be a problem there. I was thinking that once the leader attaches to the queue, we can rely on the leader reaching "ERROR: lost connection to parallel worker" in HandleParallelMessages. However, that might not work because nothing sets ParallelMessagePending in that case. The worker will have detached the queue via shm_mq_detach_callback, but the leader will only discover the detach if it actually tries to read from that queue. > I think if before making an assumption not to scan locally if we have > a check to see whether workers are started, then my patch should work > fine and we don't need to add any additional checks. Also, it won't > create any performance issue as we will perform such a check only when > force_parallel_mode is not off or parallel leader participation is off > which I think are not common cases. My instinct is that we're going to have a lot of subtle bugs if clients of parallel.c can't count on nworkers_launched to be accurate. If clients didn't need that value, we could just have nworkers and let the callers figure it out, but nobody wants that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
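To illustrate the last point: the leader only learns about a detach when it actually reads from the queue, along these lines (a sketch, not code from any of the patches here):

#include "postgres.h"
#include "storage/shm_mq.h"

/* Illustration only: a detach is noticed only when we try to read. */
static void
poll_queue_once(shm_mq_handle *mqh)
{
    shm_mq_result res;
    Size        nbytes;
    void       *data;

    res = shm_mq_receive(mqh, &nbytes, &data, true);    /* nowait */
    if (res == SHM_MQ_DETACHED)
        ereport(ERROR,
                (errmsg("lost connection to parallel worker")));
}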
On Thu, Dec 14, 2017 at 3:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Dec 13, 2017 at 1:41 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>> This also doesn't appear to be completely safe. If we add proc_exit(1) after attaching to the error queue (say after pq_set_parallel_master) in the worker, then it will lead to a *hang*, because anyone_alive will still be false and, since it will find that the sender is set for the error queue, it won't return any failure. Now, I think that even if we check the worker status (which will be stopped) and break after the new error condition, it won't work, because it will still return zero rows in the case you reported above.
>
> Hmm, there might still be a problem there. I was thinking that once the leader attaches to the queue, we can rely on the leader reaching "ERROR: lost connection to parallel worker" in HandleParallelMessages. However, that might not work because nothing sets ParallelMessagePending in that case. The worker will have detached the queue via shm_mq_detach_callback, but the leader will only discover the detach if it actually tries to read from that queue.
>

I think it would have been much easier to fix this problem if we had some way to differentiate whether the worker has stopped gracefully or not. Do you think it makes sense to introduce such a state in the background worker machinery?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Dec 19, 2017 at 5:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think it would have been much easier to fix this problem if we had some way to differentiate whether the worker has stopped gracefully or not. Do you think it makes sense to introduce such a state in the background worker machinery?

I think it makes a lot more sense to just throw an ERROR when the worker doesn't shut down cleanly, which is currently what happens in nearly all cases. It only fails to happen for fork() failure and other errors that happen very early in startup. I don't think there's any point in trying to make this code more complicated to cater to such cases.

If fork() is failing, the fact that parallel query is erroring out rather than running with fewer workers is likely to be a good thing. Your principal concern in that situation is probably whether your attempt to log into the machine and kill some processes is also going to die with 'fork failure', and having PostgreSQL consume every available process slot is not going to make that easier. On the other hand, if workers are failing so early in startup that they never attach to the error queue, then they're probably all failing the same way, and trying to cope with that problem in any way other than throwing an error is going to result in parallelism being silently disabled with no notification to the user, which doesn't seem good to me either. So basically I think it's right to treat these as error conditions, not try to continue the work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Dec 19, 2017 at 9:01 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Dec 19, 2017 at 5:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> I think it would have been much easier to fix this problem if we would >> have some way to differentiate whether the worker has stopped >> gracefully or not. Do you think it makes sense to introduce such a >> state in the background worker machinery? > > I think it makes a lot more sense to just throw an ERROR when the > worker doesn't shut down cleanly, which is currently what happens in > nearly all cases. It only fails to happen for fork() failure and > other errors that happen very early in startup. I don't think there's > any point in trying to make this code more complicated to cater to > such cases. If fork() is failing, the fact that parallel query is > erroring out rather than running with fewer workers is likely to be a > good thing. Your principle concern in that situation is probably > whether your attempt to log into the machine and kill some processes > is also going to die with 'fork failure', and having PostgreSQL > consume every available process slot is not going to make that easier. > On the other hand, if workers are failing so early in startup that > they never attach to the error queue, then they're probably all > failing the same way and trying to cope with that problem in any way > other than throwing an error is going to result in parallelism being > silently disabled with no notification to the user, which doesn't seem > good to me either. > That is not the main point I am bothered about. I am concerned that the patch proposed by you can lead to hang if there is a crash (abrupt failure like proc_exit(1)) after attaching to the error queue. This is explained in my email up thread [1]. I think we can fix it by trying to read the queues as mentioned by you, but was not sure if that is the best way to deal with it, so proposed an alternate solution. [1] - https://www.postgresql.org/message-id/CAA4eK1LkHBbafduA8em1fb-in4kDBgmQf6qO5TV8CfyuTzEFaQ%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Tue, Dec 19, 2017 at 11:28 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > That is not the main point I am bothered about. I am concerned that > the patch proposed by you can lead to hang if there is a crash (abrupt > failure like proc_exit(1)) after attaching to the error queue. This is > explained in my email up thread [1]. I think we can fix it by trying > to read the queues as mentioned by you, but was not sure if that is > the best way to deal with it, so proposed an alternate solution. OK. My core point is that I really, really don't want Gather or Gather Merge or parallel CREATE INDEX to have to worry about these cases; I believe strongly that it is best if these cases cause an error at the ParallelContext layer. I read your email as suggesting the opposite, but maybe I misunderstood you. One thing I thought about is introducing a new state BGWH_STARTUP_FAILED, as distinguished from BGWH_STOPPED. However, it seems problematic. Once the worker is unregistered we wouldn't know which one to return, especially if the slot has been reused. Furthermore, it doesn't help in the case where the worker starts and immediately exits without attaching to the DSM. So it doesn't seem to me that we can fix the problem this way. It seems to me that we instead have to solve the problem in the ParallelContext layer, which can understand the state of the worker by comparing the state of the background worker to what we can find out via the DSM. The background worker layer knows whether or not the worker is running and can signal us when it changes state, but it can't know whether the worker exited cleanly or uncleanly. To know that, we have to look at things like whether it sent us a Terminate message before it stopped. While I think that the clean-vs-unclean distinction has to be made at the ParallelContext layer, it doesn't necessarily have to happen by inserting an error queue. I think that may be a convenient way; for example, we could plug the hole you mentioned before by attaching an on_dsm_detach callback that signals the leader with SendProcSignal(..., PROCSIG_PARALLEL_MESSAGE, ...). At that point, all of the information required for the leader to know that the failure has occurred is present in shared memory; it is only that the leader may not have anything to trigger it to look at the relevant bits of information, and this would fix that. But there are probably other ways the information could also be communicated as well. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
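A sketch of that idea (illustrative only, not committed code):

#include "postgres.h"
#include "storage/backendid.h"
#include "storage/dsm.h"
#include "storage/procsignal.h"

/*
 * Sketch only: a worker-side on_dsm_detach callback that pokes the leader,
 * so the leader re-examines its error queues even if this worker never
 * managed to send a message.
 */
static void
notify_leader_on_detach(dsm_segment *seg, Datum arg)
{
    pid_t       leader_pid = (pid_t) DatumGetInt32(arg);

    SendProcSignal(leader_pid, PROCSIG_PARALLEL_MESSAGE, InvalidBackendId);
}

The worker would register this right after attaching to the segment, passing the leader's PID (which parallel workers already have available) as the Datum argument, e.g. on_dsm_detach(seg, notify_leader_on_detach, Int32GetDatum(leader_pid)).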
On Thu, Dec 21, 2017 at 2:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Dec 19, 2017 at 11:28 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> That is not the main point I am bothered about. I am concerned that the patch proposed by you can lead to hang if there is a crash (abrupt failure like proc_exit(1)) after attaching to the error queue. This is explained in my email up thread [1]. I think we can fix it by trying to read the queues as mentioned by you, but was not sure if that is the best way to deal with it, so proposed an alternate solution.
>
> OK. My core point is that I really, really don't want Gather or Gather Merge or parallel CREATE INDEX to have to worry about these cases; I believe strongly that it is best if these cases cause an error at the ParallelContext layer. I read your email as suggesting the opposite, but maybe I misunderstood you.
>
> One thing I thought about is introducing a new state BGWH_STARTUP_FAILED, as distinguished from BGWH_STOPPED. However, it seems problematic. Once the worker is unregistered we wouldn't know which one to return, especially if the slot has been reused.
>

What if we don't allow such slots to be reused until the backend/session that registered the worker performs an unregister? Currently, we don't seem to have an API corresponding to Register*BackgroundWorker() which can be used to unregister, but maybe we can provide such an API.

> Furthermore, it doesn't help in the case where the worker starts and immediately exits without attaching to the DSM.
>

Yeah, but can't we detect that case? After the worker exits, we know its exit status, as it is passed to CleanupBackgroundWorker; we could use that to mark the worker state as BGWH_ERROR_STOPPED (or something like BGWH_IMMEDIATE_STOPPED).

I think the above way sounds invasive, but it seems to me that it could be used by other users of background workers as well.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Dec 21, 2017 at 6:46 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > What if we don't allow to reuse such slots till the backend/session > that has registered it performs unregister? Currently, we don't seem > to have an API corresponding to Register*BackgroundWorker() which can > be used to unregister, but maybe we can provide such an API. Well, then we could have slots pinned down for a long time, if the backend never gets around to calling unregister. Furthermore, that's absolutely not back-patchable, because we can't put a requirement like that on code running in the back branches. Also, what if the code path that would have done the unregister eventually errors out? We'd need TRY/CATCH blocks everywhere that registers the worker. In short, this seems terrible for multiple reasons. >> Furthermore, it doesn't help in the case where the worker starts and >> immediately exits without attaching to the DSM. > > Yeah, but can't we detect that case? After the worker exits, we can > know its exit status as is passed to CleanupBackgroundWorker, we can > use that to mark the worker state as BGWH_ERROR_STOPPED (or something > like BGWH_IMMEDIATE_STOPPED). > > I think above way sounds invasive, but it seems to me that it can be > used by other users of background workers as well. The exit status doesn't tell us whether the worker attached to the DSM. I'm relatively puzzled as to why you're rejecting a relatively low-impact way of handling a corner case that was missed in the original design in favor of major architectural changes. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Dec 21, 2017 at 6:26 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Dec 21, 2017 at 6:46 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> What if we don't allow to reuse such slots till the backend/session >> that has registered it performs unregister? Currently, we don't seem >> to have an API corresponding to Register*BackgroundWorker() which can >> be used to unregister, but maybe we can provide such an API. > > Well, then we could have slots pinned down for a long time, if the > backend never gets around to calling unregister. Furthermore, that's > absolutely not back-patchable, because we can't put a requirement like > that on code running in the back branches. Also, what if the code > path that would have done the unregister eventually errors out? We'd > need TRY/CATCH blocks everywhere that registers the worker. In short, > this seems terrible for multiple reasons. > >>> Furthermore, it doesn't help in the case where the worker starts and >>> immediately exits without attaching to the DSM. >> >> Yeah, but can't we detect that case? After the worker exits, we can >> know its exit status as is passed to CleanupBackgroundWorker, we can >> use that to mark the worker state as BGWH_ERROR_STOPPED (or something >> like BGWH_IMMEDIATE_STOPPED). >> >> I think above way sounds invasive, but it seems to me that it can be >> used by other users of background workers as well. > > The exit status doesn't tell us whether the worker attached to the DSM. > > I'm relatively puzzled as to why you're rejecting a relatively > low-impact way of handling a corner case that was missed in the > original design in favor of major architectural changes. > I am not against using the way specific to parallel context layer as described by you above. However, I was trying to see if there is some general purpose solution as the low-impact way is not very straightforward. I think you can go ahead with the way you have described to fix the hole I was pointing to and I can review it or I can also give it a try if you want to. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Dec 21, 2017 at 9:13 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > I am not against using the way specific to parallel context layer as > described by you above. However, I was trying to see if there is > some general purpose solution as the low-impact way is not very > straightforward. I think you can go ahead with the way you have > described to fix the hole I was pointing to and I can review it or I > can also give it a try if you want to. See attached. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Thu, Jan 18, 2018 at 8:53 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Dec 21, 2017 at 9:13 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> I am not against using the way specific to parallel context layer as >> described by you above. However, I was trying to see if there is >> some general purpose solution as the low-impact way is not very >> straightforward. I think you can go ahead with the way you have >> described to fix the hole I was pointing to and I can review it or I >> can also give it a try if you want to. > > See attached. > The patch doesn't apply cleanly on the head, but after rebasing it, I have reviewed and tested it and it seems to be working fine. Apart from this specific issue, I think we should consider making nworkers_launched reliable (at the very least for cases where it matters). You seem to be of opinion that can be a source of subtle bugs which I don't deny but now I think we are starting to see some use cases of such a mechanism as indicated by Peter G. in parallel create index thread. Even, if we find some way for parallel create index to not rely on that assumption, I won't be surprised if some future parallelism work would have such a requirement. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Jan 19, 2018 at 10:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > The patch doesn't apply cleanly on the head, but after rebasing it, I > have reviewed and tested it and it seems to be working fine. Apart > from this specific issue, I think we should consider making > nworkers_launched reliable (at the very least for cases where it > matters). You seem to be of opinion that can be a source of subtle > bugs which I don't deny but now I think we are starting to see some > use cases of such a mechanism as indicated by Peter G. in parallel > create index thread. Even, if we find some way for parallel create > index to not rely on that assumption, I won't be surprised if some > future parallelism work would have such a requirement. Isn't making nworkers_launched reliable exactly what this patch does? It converts the rare cases in which nworkers_launched would have been unreliable into errors, precisely so that code like parallel CREATE INDEX doesn't need to worry about the case where it's inaccurate. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jan 22, 2018 at 7:16 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Jan 19, 2018 at 10:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> The patch doesn't apply cleanly on the head, but after rebasing it, I >> have reviewed and tested it and it seems to be working fine. Apart >> from this specific issue, I think we should consider making >> nworkers_launched reliable (at the very least for cases where it >> matters). You seem to be of opinion that can be a source of subtle >> bugs which I don't deny but now I think we are starting to see some >> use cases of such a mechanism as indicated by Peter G. in parallel >> create index thread. Even, if we find some way for parallel create >> index to not rely on that assumption, I won't be surprised if some >> future parallelism work would have such a requirement. > > Isn't making nworkers_launched reliable exactly what this patch does? > It converts the rare cases in which nworkers_launched would have been > unreliable into errors, precisely so that code like parallel CREATE > INDEX doesn't need to worry about the case where it's inaccurate. > It does turn such cases into error but only at the end when someone calls WaitForParallelWorkersToFinish. If one tries to rely on it at any other time, it won't work as I think is the case for Parallel Create Index where Peter G. is trying to use it differently. I am not 100% sure whether Parallel Create index has this symptom as I have not tried it with that patch, but I and Peter are trying to figure that out. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Jan 22, 2018 at 10:56 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> It does turn such cases into error but only at the end when someone calls WaitForParallelWorkersToFinish. If one tries to rely on it at any other time, it won't work as I think is the case for Parallel Create Index where Peter G. is trying to use it differently. I am not 100% sure whether Parallel Create index has this symptom as I have not tried it with that patch, but I and Peter are trying to figure that out.

OK. I've committed this patch and back-patched it to 9.6. Back-patching to 9.5 didn't look simple because nworkers_launched doesn't exist on that branch, and it doesn't matter very much anyway since the infrastructure has no in-core users in 9.5.

I'll respond to the other thread about the issues specific to parallel CREATE INDEX.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jan 23, 2018 at 11:15 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Jan 22, 2018 at 10:56 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> It does turn such cases into error but only at the end when someone >> calls WaitForParallelWorkersToFinish. If one tries to rely on it at >> any other time, it won't work as I think is the case for Parallel >> Create Index where Peter G. is trying to use it differently. I am not >> 100% sure whether Parallel Create index has this symptom as I have not >> tried it with that patch, but I and Peter are trying to figure that >> out. > > OK. I've committed this patch and back-patched it to 9.6. > Back-patching to 9.5 didn't looks simple because nworkers_launched > doesn't exist on that branch, and it doesn't matter very much anyway > since the infrastructure has no in-core users in 9.5. I was just talking to Thomas, and one or the other of us realized that this doesn't completely fix the problem. I think that if a worker fails, even by some unorthodox mechanism like an abrupt proc_exit(1), after the point where it attached to the error queue, this patch is sufficient to make that reliably turn into an error. But what if it fails before that - e.g. fork() failure? In that case, this process will catch the problem if the calling process calls WaitForParallelWorkersToFinish, but does that actually happen in any interesting cases? Maybe not. In the case of Gather, for example, we won't WaitForParallelWorkersToFinish() until we've read all the tuples. If there's a tuple queue that does not yet have a sender, then we'll just wait for it to get one, even if the sender fell victim to a fork failure and is never going to show up. We could *almost* fix this by having execParallel.c include a Barrier in the DSM, similar to what I proposed for parallel CREATE INDEX. When a worker starts, it attaches to the barrier; when it exits, it detaches. Instead of looping until the leader is done and all the tuple queues are detached, Gather or Gather Merge could loop until the leader is done and everyone detaches from the barrier. But that would require changes to the Barrier API, which isn't prepared to have the caller alternate waiting with other activity like reading the tuple queues, which would clearly be necessary in this case. Moreover, it doesn't work at all when parallel_leader_participation=off, because in that case we'll again just wait forever for some worker to show up, and if none of them do, then we're hosed. One idea to fix this is to add a new function in parallel.c with a name like CheckForFailedWorkers() that Gather and Gather Merge call just before they WaitLatch(). If a worker fails to start, the postmaster will send SIGUSR1 to the leader, which will set the latch, so we can count on the function being run after every worker death. The function would go through and check each worker for which pcxt->worker[i].error_mqh != NULL && !pcxt->any_message_received[i] to see whether GetBackgroundWorkerPid(pcxt->worker[i].bgwhandle, &pid) == BGWH_STOPPED. If so, ERROR. This lets us *run* for arbitrary periods of time without detecting a fork failure, but before we *wait* we will notice. Getting more aggressive than that sounds expensive, but I'm open to ideas. If we adopt this approach, then Peter's patch could probably make use of it, too. It's a little tricky because ConditionVariableSleep() tries not to return spuriously, so we'd either have to change that or stick CheckForFailedWorkers() into that function's loop. 
I suspect the former is a better approach. Or maybe that patch should still use the Barrier approach and avoid all of this annoyance. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
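For concreteness, a rough and untested sketch of the CheckForFailedWorkers() idea described above might look like the following. The ParallelContext fields it touches (worker[i].error_mqh, worker[i].bgwhandle, any_message_received[i]) are taken from the description in the message; their exact layout is an assumption here, not settled code.

#include "postgres.h"

#include "access/parallel.h"
#include "postmaster/bgworker.h"

/*
 * Untested sketch only: error out if any launched worker stopped before
 * ever sending a message on its error queue, i.e. before it attached.
 */
void
CheckForFailedWorkers(ParallelContext *pcxt)
{
	int			i;

	for (i = 0; i < pcxt->nworkers_launched; ++i)
	{
		pid_t		pid;

		/* Ignore workers we've already heard from or already released. */
		if (pcxt->worker[i].error_mqh == NULL ||
			pcxt->any_message_received[i])
			continue;

		/* Silent and stopped: it must have died before attaching. */
		if (GetBackgroundWorkerPid(pcxt->worker[i].bgwhandle, &pid) ==
			BGWH_STOPPED)
			ereport(ERROR,
					(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
					 errmsg("parallel worker failed to initialize"),
					 errhint("More details may be available in the server log.")));
	}
}

Since the postmaster's SIGUSR1 sets the leader's latch whenever a worker starts or stops, calling this just before each WaitLatch() would, as described above, notice a fork failure before any indefinite wait begins.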
On Tue, Jan 23, 2018 at 1:05 PM, Robert Haas <robertmhaas@gmail.com> wrote: > I was just talking to Thomas, and one or the other of us realized that > this doesn't completely fix the problem. I think that if a worker > fails, even by some unorthodox mechanism like an abrupt proc_exit(1), > after the point where it attached to the error queue, this patch is > sufficient to make that reliably turn into an error. But what if it > fails before that - e.g. fork() failure? In that case, this process > will catch the problem if the calling process calls > WaitForParallelWorkersToFinish, but does that actually happen in any > interesting cases? Maybe not. > > In the case of Gather, for example, we won't > WaitForParallelWorkersToFinish() until we've read all the tuples. If > there's a tuple queue that does not yet have a sender, then we'll just > wait for it to get one, even if the sender fell victim to a fork > failure and is never going to show up. > > We could *almost* fix this by having execParallel.c include a Barrier > in the DSM, similar to what I proposed for parallel CREATE INDEX. > When a worker starts, it attaches to the barrier; when it exits, it > detaches. Instead of looping until the leader is done and all the > tuple queues are detached, Gather or Gather Merge could loop until the > leader is done and everyone detaches from the barrier. But that would > require changes to the Barrier API, which isn't prepared to have the > caller alternate waiting with other activity like reading the tuple > queues, which would clearly be necessary in this case. Moreover, it > doesn't work at all when parallel_leader_participation=off, because in > that case we'll again just wait forever for some worker to show up, > and if none of them do, then we're hosed. I'm relieved that we all finally seem to be on the same page on this issue. > One idea to fix this is to add a new function in parallel.c with a > name like CheckForFailedWorkers() that Gather and Gather Merge call > just before they WaitLatch(). If a worker fails to start, the > postmaster will send SIGUSR1 to the leader, which will set the latch, > so we can count on the function being run after every worker death. > The function would go through and check each worker for which > pcxt->worker[i].error_mqh != NULL && !pcxt->any_message_received[i] to > see whether GetBackgroundWorkerPid(pcxt->worker[i].bgwhandle, &pid) == > BGWH_STOPPED. If so, ERROR. Sounds workable to me. > This lets us *run* for arbitrary periods of time without detecting a > fork failure, but before we *wait* we will notice. Getting more > aggressive than that sounds expensive, but I'm open to ideas. Is that really that different to any other case? I imagine that in practice most interrupt checks happen within the leader in a WaitLatch() loop (or similar). Besides, the kinds of failures we're talking about are incredibly rare in the real world. I think it's fine if the leader is not very promptly aware of when this happens, or if execution becomes slow in some other way. > If we adopt this approach, then Peter's patch could probably make use > of it, too. It's a little tricky because ConditionVariableSleep() > tries not to return spuriously, so we'd either have to change that or > stick CheckForFailedWorkers() into that function's loop. I suspect > the former is a better approach. Or maybe that patch should still use > the Barrier approach and avoid all of this annoyance. 
I have the same misgivings about using a barrier to solve this problem for parallel CREATE INDEX as before. Specifically: why should nbtsort.c solve this problem for itself? I don't think that it's in any way special. One can imagine exactly the same problem with other parallel DDL implementations, fixable using exactly the same solution. I'm not saying that nbtsort.c shouldn't have to do anything differently; just that it should buy into some general solution. That said, it would be ideal if the current approach I take in nbtsort.c didn't have to be changed *at all*, because if it was 100% correct already then I think we'd have a more robust parallel.h interface (it is certainly a "nice to have" for me). The fact that multiple hackers at least tacitly assumed that nworkers_launched is a reliable indicator of how many workers will show up tells us something. Moreover, parallel_leader_participation=off is always going to be wonky under a regime based on "see who shows up", because failure becomes indistinguishable from high latency -- I'd rather avoid having to think about nbtsort.c in distributed systems terms. -- Peter Geoghegan
On Wed, Jan 24, 2018 at 2:35 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Jan 23, 2018 at 11:15 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Mon, Jan 22, 2018 at 10:56 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> OK. I've committed this patch and back-patched it to 9.6. >> Back-patching to 9.5 didn't looks simple because nworkers_launched >> doesn't exist on that branch, and it doesn't matter very much anyway >> since the infrastructure has no in-core users in 9.5. > > I was just talking to Thomas, and one or the other of us realized that > this doesn't completely fix the problem. I think that if a worker > fails, even by some unorthodox mechanism like an abrupt proc_exit(1), > after the point where it attached to the error queue, this patch is > sufficient to make that reliably turn into an error. But what if it > fails before that - e.g. fork() failure? In that case, this process > will catch the problem if the calling process calls > WaitForParallelWorkersToFinish, but does that actually happen in any > interesting cases? Maybe not. > > In the case of Gather, for example, we won't > WaitForParallelWorkersToFinish() until we've read all the tuples. If > there's a tuple queue that does not yet have a sender, then we'll just > wait for it to get one, even if the sender fell victim to a fork > failure and is never going to show up. > Hmm, I think that case will be addressed because tuple queues can detect if the leader is not attached. It does in code path shm_mq_receive->shm_mq_counterparty_gone. In shm_mq_counterparty_gone, it can detect if the worker is gone by using GetBackgroundWorkerPid. Moreover, I have manually tested this particular case before saying your patch is fine. Do you have some other case in mind which I am missing? > We could *almost* fix this by having execParallel.c include a Barrier > in the DSM, similar to what I proposed for parallel CREATE INDEX. > When a worker starts, it attaches to the barrier; when it exits, it > detaches. Instead of looping until the leader is done and all the > tuple queues are detached, Gather or Gather Merge could loop until the > leader is done and everyone detaches from the barrier. But that would > require changes to the Barrier API, which isn't prepared to have the > caller alternate waiting with other activity like reading the tuple > queues, which would clearly be necessary in this case. Moreover, it > doesn't work at all when parallel_leader_participation=off, because in > that case we'll again just wait forever for some worker to show up, > and if none of them do, then we're hosed. > > One idea to fix this is to add a new function in parallel.c with a > name like CheckForFailedWorkers() that Gather and Gather Merge call > just before they WaitLatch(). If a worker fails to start, the > postmaster will send SIGUSR1 to the leader, which will set the latch, > so we can count on the function being run after every worker death. > The function would go through and check each worker for which > pcxt->worker[i].error_mqh != NULL && !pcxt->any_message_received[i] to > see whether GetBackgroundWorkerPid(pcxt->worker[i].bgwhandle, &pid) == > BGWH_STOPPED. If so, ERROR. > > This lets us *run* for arbitrary periods of time without detecting a > fork failure, but before we *wait* we will notice. Getting more > aggressive than that sounds expensive, but I'm open to ideas. > > If we adopt this approach, then Peter's patch could probably make use > of it, too. 
It's a little tricky because ConditionVariableSleep() > tries not to return spuriously, so we'd either have to change that or > stick CheckForFailedWorkers() into that function's loop. I suspect > the former is a better approach. Or maybe that patch should still use > the Barrier approach and avoid all of this annoyance. > I think we can discuss your proposed solution once the problem is clear. As of now, it is not very clear to me whether there is any pending problem. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Jan 24, 2018 at 3:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Wed, Jan 24, 2018 at 2:35 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> In the case of Gather, for example, we won't >> WaitForParallelWorkersToFinish() until we've read all the tuples. If >> there's a tuple queue that does not yet have a sender, then we'll just >> wait for it to get one, even if the sender fell victim to a fork >> failure and is never going to show up. >> > > Hmm, I think that case will be addressed because tuple queues can > detect if the leader is not attached. It does in code path > shm_mq_receive->shm_mq_counterparty_gone. In > shm_mq_counterparty_gone, it can detect if the worker is gone by using > GetBackgroundWorkerPid. Moreover, I have manually tested this > particular case before saying your patch is fine. Do you have some > other case in mind which I am missing? Hmm. Yeah. I can't seem to reach a stuck case and was probably just confused and managed to confuse Robert too. If you make fork_process() fail randomly (see attached), I see that there are a couple of easily reachable failure modes (example session at bottom of message): 1. HandleParallelMessages() is reached and raises a "lost connection to parallel worker" error because shm_mq_receive() returns SHM_MQ_DETACHED, I think because shm_mq_counterparty_gone() checked GetBackgroundWorkerPid() just as you said. I guess that's happening because some other process is (coincidentally) sending PROCSIG_PARALLEL_MESSAGE at shutdown, causing us to notice that a process is unexpectedly stopped. 2. WaitForParallelWorkersToFinish() is reached and raises a "parallel worker failed to initialize" error. TupleQueueReaderNext() set done to true, because shm_mq_receive() returned SHM_MQ_DETACHED. Once again, that is because shm_mq_counterparty_gone() returned true. This is the bit Robert and I missed in our off-list discussion. As long as we always get our latch set by the postmaster after a fork failure (ie kill SIGUSR1) and after GetBackgroundWorkerPid() is guaranteed to return BGWH_STOPPED after that, and as long as we only ever use latch/CFI loops to wait, and as long as we try to read from a shm_mq, then I don't see a failure mode that hangs. This doesn't help a parallel.c client that doesn't try to read from a shm_mq though, like parallel CREATE INDEX (using the proposed design without dynamic Barrier). We can't rely on a coincidental PROCSIG_PARALLEL_MESSAGE like in (1), and it's not going to try to read from a queue like in (2). So wouldn't it need a way to perform its own similar check for unexpectedly BGWH_STOPPED processes whenever its latch is set? If there were some way for the postmaster to cause reason PROCSIG_PARALLEL_MESSAGE to be set in the leader process instead of just notification via kill(SIGUSR1) when it fails to fork a parallel worker, we'd get (1) for free in any latch/CFI loop code. But I understand that we can't do that by project edict. === postgres=# create table foox as select generate_series(1, 10000) as a; SELECT 10000 postgres=# alter table foox set (parallel_workers = 4); ALTER TABLE postgres=# set max_parallel_workers_per_gather = 4; SET postgres=# select count(*) from foox; ERROR: lost connection to parallel worker postgres=# select count(*) from foox; ERROR: parallel worker failed to initialize HINT: More details may be available in the server log. 
postgres=# select count(*) from foox;
 count
-------
 10000
(1 row)

postgres=# select count(*) from foox;
ERROR: lost connection to parallel worker
postgres=# select count(*) from foox;
ERROR: lost connection to parallel worker
postgres=# select count(*) from foox;
ERROR: lost connection to parallel worker
postgres=# select count(*) from foox;
ERROR: lost connection to parallel worker
-- Thomas Munro http://www.enterprisedb.com
Attachment
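The attached patch itself is not reproduced here, but the kind of fault injection being described is simple. Purely as a hypothetical illustration (not the actual attachment), a hack near the top of fork_process() in fork_process.c along these lines would make a random fraction of worker launches fail:

	/*
	 * Hypothetical fault-injection hack, for testing only: pretend fork()
	 * failed for roughly a quarter of all attempts, so that the
	 * fork-failure handling paths can be exercised.
	 */
	if (random() % 4 == 0)
	{
		errno = EAGAIN;			/* simulate "out of processes" */
		return (pid_t) -1;
	}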
On Tue, Jan 23, 2018 at 8:25 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: >> Hmm, I think that case will be addressed because tuple queues can >> detect if the leader is not attached. It does in code path >> shm_mq_receive->shm_mq_counterparty_gone. In >> shm_mq_counterparty_gone, it can detect if the worker is gone by using >> GetBackgroundWorkerPid. Moreover, I have manually tested this >> particular case before saying your patch is fine. Do you have some >> other case in mind which I am missing? > > Hmm. Yeah. I can't seem to reach a stuck case and was probably just > confused and managed to confuse Robert too. If you make > fork_process() fail randomly (see attached), I see that there are a > couple of easily reachable failure modes (example session at bottom of > message): > > 1. HandleParallelMessages() is reached and raises a "lost connection > to parallel worker" error because shm_mq_receive() returns > SHM_MQ_DETACHED, I think because shm_mq_counterparty_gone() checked > GetBackgroundWorkerPid() just as you said. I guess that's happening > because some other process is (coincidentally) sending > PROCSIG_PARALLEL_MESSAGE at shutdown, causing us to notice that a > process is unexpectedly stopped. > > 2. WaitForParallelWorkersToFinish() is reached and raises a "parallel > worker failed to initialize" error. TupleQueueReaderNext() set done > to true, because shm_mq_receive() returned SHM_MQ_DETACHED. Once > again, that is because shm_mq_counterparty_gone() returned true. This > is the bit Robert and I missed in our off-list discussion. > > As long as we always get our latch set by the postmaster after a fork > failure (ie kill SIGUSR1) and after GetBackgroundWorkerPid() is > guaranteed to return BGWH_STOPPED after that, and as long as we only > ever use latch/CFI loops to wait, and as long as we try to read from a > shm_mq, then I don't see a failure mode that hangs. What about the parallel_leader_participation=off case? -- Peter Geoghegan
On Wed, Jan 24, 2018 at 10:03 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Tue, Jan 23, 2018 at 8:25 PM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >>> Hmm, I think that case will be addressed because tuple queues can >>> detect if the leader is not attached. It does in code path >>> shm_mq_receive->shm_mq_counterparty_gone. In >>> shm_mq_counterparty_gone, it can detect if the worker is gone by using >>> GetBackgroundWorkerPid. Moreover, I have manually tested this >>> particular case before saying your patch is fine. Do you have some >>> other case in mind which I am missing? >> >> Hmm. Yeah. I can't seem to reach a stuck case and was probably just >> confused and managed to confuse Robert too. If you make >> fork_process() fail randomly (see attached), I see that there are a >> couple of easily reachable failure modes (example session at bottom of >> message): >> >> 1. HandleParallelMessages() is reached and raises a "lost connection >> to parallel worker" error because shm_mq_receive() returns >> SHM_MQ_DETACHED, I think because shm_mq_counterparty_gone() checked >> GetBackgroundWorkerPid() just as you said. I guess that's happening >> because some other process is (coincidentally) sending >> PROCSIG_PARALLEL_MESSAGE at shutdown, causing us to notice that a >> process is unexpectedly stopped. >> >> 2. WaitForParallelWorkersToFinish() is reached and raises a "parallel >> worker failed to initialize" error. TupleQueueReaderNext() set done >> to true, because shm_mq_receive() returned SHM_MQ_DETACHED. Once >> again, that is because shm_mq_counterparty_gone() returned true. This >> is the bit Robert and I missed in our off-list discussion. >> >> As long as we always get our latch set by the postmaster after a fork >> failure (ie kill SIGUSR1) and after GetBackgroundWorkerPid() is >> guaranteed to return BGWH_STOPPED after that, and as long as we only >> ever use latch/CFI loops to wait, and as long as we try to read from a >> shm_mq, then I don't see a failure mode that hangs. > > What about the parallel_leader_participation=off case? > There is nothing special about that case, there shouldn't be any problem till we can detect the worker failures appropriately. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Jan 24, 2018 at 9:55 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Wed, Jan 24, 2018 at 3:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> On Wed, Jan 24, 2018 at 2:35 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> In the case of Gather, for example, we won't >>> WaitForParallelWorkersToFinish() until we've read all the tuples. If >>> there's a tuple queue that does not yet have a sender, then we'll just >>> wait for it to get one, even if the sender fell victim to a fork >>> failure and is never going to show up. >>> >> >> Hmm, I think that case will be addressed because tuple queues can >> detect if the leader is not attached. It does in code path >> shm_mq_receive->shm_mq_counterparty_gone. In >> shm_mq_counterparty_gone, it can detect if the worker is gone by using >> GetBackgroundWorkerPid. Moreover, I have manually tested this >> particular case before saying your patch is fine. Do you have some >> other case in mind which I am missing? > > Hmm. Yeah. I can't seem to reach a stuck case and was probably just > confused and managed to confuse Robert too. If you make > fork_process() fail randomly (see attached), I see that there are a > couple of easily reachable failure modes (example session at bottom of > message): > In short, we are good with committed code. Right? > 1. HandleParallelMessages() is reached and raises a "lost connection > to parallel worker" error because shm_mq_receive() returns > SHM_MQ_DETACHED, I think because shm_mq_counterparty_gone() checked > GetBackgroundWorkerPid() just as you said. I guess that's happening > because some other process is (coincidentally) sending > PROCSIG_PARALLEL_MESSAGE at shutdown, causing us to notice that a > process is unexpectedly stopped. > > 2. WaitForParallelWorkersToFinish() is reached and raises a "parallel > worker failed to initialize" error. TupleQueueReaderNext() set done > to true, because shm_mq_receive() returned SHM_MQ_DETACHED. Once > again, that is because shm_mq_counterparty_gone() returned true. This > is the bit Robert and I missed in our off-list discussion. > > As long as we always get our latch set by the postmaster after a fork > failure (ie kill SIGUSR1) and after GetBackgroundWorkerPid() is > guaranteed to return BGWH_STOPPED after that, and as long as we only > ever use latch/CFI loops to wait, and as long as we try to read from a > shm_mq, then I don't see a failure mode that hangs. > > This doesn't help a parallel.c client that doesn't try to read from a > shm_mq though, like parallel CREATE INDEX (using the proposed design > without dynamic Barrier). > Yes, this is what I am trying to explain on parallel create index thread. I think there we need to either use WaitForParallelWorkersToFinish or WaitForParallelWorkersToAttach (a new API as proposed in that thread) if we don't want to use barriers. I see a minor disadvantage in using WaitForParallelWorkersToFinish which I will say on that thread. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Jan 24, 2018 at 5:43 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> Hmm. Yeah. I can't seem to reach a stuck case and was probably just >> confused and managed to confuse Robert too. If you make >> fork_process() fail randomly (see attached), I see that there are a >> couple of easily reachable failure modes (example session at bottom of >> message): >> > > In short, we are good with committed code. Right? Yep. Sorry for the noise. > Yes, this is what I am trying to explain on parallel create index > thread. I think there we need to either use > WaitForParallelWorkersToFinish or WaitForParallelWorkersToAttach (a > new API as proposed in that thread) if we don't want to use barriers. > I see a minor disadvantage in using WaitForParallelWorkersToFinish > which I will say on that thread. Ah, I see. So if you wait for them to attach you can detect unexpected dead workers (via shm_mq_receive), at the cost of having the leader wasting time waiting around for forked processes to say hello when it could instead be sorting tuples. -- Thomas Munro http://www.enterprisedb.com
On Tue, Jan 23, 2018 at 9:02 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: >> Yes, this is what I am trying to explain on parallel create index >> thread. I think there we need to either use >> WaitForParallelWorkersToFinish or WaitForParallelWorkersToAttach (a >> new API as proposed in that thread) if we don't want to use barriers. >> I see a minor disadvantage in using WaitForParallelWorkersToFinish >> which I will say on that thread. > > Ah, I see. So if you wait for them to attach you can detect > unexpected dead workers (via shm_mq_receive), at the cost of having > the leader wasting time waiting around for forked processes to say > hello when it could instead be sorting tuples. The leader can go ahead and sort before calling something like a new WaitForParallelWorkersToAttach() function (or even WaitForParallelWorkersToFinish()). If we did add a WaitForParallelWorkersToAttach() function, then the performance hit would probably not be noticeable in practice. The parallel_leader_participates case would still work without special care (that's the main hazard that makes using a barrier seem unappealing). -- Peter Geoghegan
On Wed, Jan 24, 2018 at 10:38 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Tue, Jan 23, 2018 at 9:02 PM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >>> Yes, this is what I am trying to explain on parallel create index >>> thread. I think there we need to either use >>> WaitForParallelWorkersToFinish or WaitForParallelWorkersToAttach (a >>> new API as proposed in that thread) if we don't want to use barriers. >>> I see a minor disadvantage in using WaitForParallelWorkersToFinish >>> which I will say on that thread. >> >> Ah, I see. So if you wait for them to attach you can detect >> unexpected dead workers (via shm_mq_receive), at the cost of having >> the leader wasting time waiting around for forked processes to say >> hello when it could instead be sorting tuples. > > The leader can go ahead and sort before calling something like a new > WaitForParallelWorkersToAttach() function (or even > WaitForParallelWorkersToFinish()). If we did add a > WaitForParallelWorkersToAttach() function, then the performance hit > would probably not be noticeable in practice. The > parallel_leader_participates case would still work without special > care (that's the main hazard that makes using a barrier seem > unappealing). > +1. I also think so. Another advantage is that even if one of the workers is not able to start, we don't need to return an error at the end of the query, rather we can detect that situation early and can execute the query successfully. OTOH, if we are not convinced about performance implications, then WaitForParallelWorkersToFinish should serve the purpose which can be called at later point of time. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Tue, Jan 23, 2018 at 9:38 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Wed, Jan 24, 2018 at 10:38 AM, Peter Geoghegan <pg@bowt.ie> wrote: >> The leader can go ahead and sort before calling something like a new >> WaitForParallelWorkersToAttach() function (or even >> WaitForParallelWorkersToFinish()). If we did add a >> WaitForParallelWorkersToAttach() function, then the performance hit >> would probably not be noticeable in practice. The >> parallel_leader_participates case would still work without special >> care (that's the main hazard that makes using a barrier seem >> unappealing). >> > > +1. I also think so. Another advantage is that even if one of the > workers is not able to start, we don't need to return an error at the > end of the query, rather we can detect that situation early and can > execute the query successfully. OTOH, if we are not convinced about > performance implications, then WaitForParallelWorkersToFinish should > serve the purpose which can be called at later point of time. There is another disadvantage to just using WaitForParallelWorkersToFinish() to wait for nbtsort.c workers to finish (and not inventing a WaitForParallelWorkersToAttach() function, and calling that instead), which is: there can be no custom wait event that specifically mentions parallel CREATE INDEX. -- Peter Geoghegan
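To make the proposed API concrete, here is a hedged sketch of what a WaitForParallelWorkersToAttach() could look like; this is not a committed function at the time of writing, and the field names and wait-event choice are assumptions based on the discussion above. The idea is to sleep on the leader's latch until every launched worker has either attached to its error queue or been reported as stopped by the postmaster:

/*
 * Sketch only: wait until every launched worker has attached to its error
 * queue or is known to have failed to start.
 */
void
WaitForParallelWorkersToAttach(ParallelContext *pcxt)
{
	int			i;

	for (i = 0; i < pcxt->nworkers_launched; ++i)
	{
		for (;;)
		{
			pid_t		pid;
			shm_mq	   *mq;

			CHECK_FOR_INTERRUPTS();

			/* Already gave up on this worker, or already heard from it? */
			if (pcxt->worker[i].error_mqh == NULL ||
				pcxt->any_message_received[i])
				break;

			/* Attached to its error queue?  Then it started successfully. */
			mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
			if (shm_mq_get_sender(mq) != NULL)
				break;

			/* Stopped without ever attaching?  Then it never started. */
			if (GetBackgroundWorkerPid(pcxt->worker[i].bgwhandle, &pid) ==
				BGWH_STOPPED)
				ereport(ERROR,
						(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
						 errmsg("parallel worker failed to initialize")));

			/*
			 * Otherwise sleep; the postmaster's notification sets our latch
			 * when the worker starts or stops.
			 */
			WaitLatch(MyLatch, WL_LATCH_SET, -1, WAIT_EVENT_BGWORKER_STARTUP);
			ResetLatch(MyLatch);
		}
	}
}

Once something like this returns without error, nworkers_launched can be trusted as the number of workers that will actually participate.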
On Wed, Jan 24, 2018 at 5:25 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > If there were some way for the postmaster to cause reason > PROCSIG_PARALLEL_MESSAGE to be set in the leader process instead of > just notification via kill(SIGUSR1) when it fails to fork a parallel > worker, we'd get (1) for free in any latch/CFI loop code. But I > understand that we can't do that by project edict. Based on the above observation, here is a terrible idea you'll all hate. It is pessimistic and expensive: it thinks that every latch wake might be the postmaster telling us it's failed to fork() a parallel worker, until we've seen a sign of life on every worker's error queue. Untested illustration code only. This is the only way I've come up with to discover fork failure in any latch/CFI loop (ie without requiring client code to explicitly try to read either error or tuple queues). -- Thomas Munro http://www.enterprisedb.com
Attachment
On Wed, Jan 24, 2018 at 1:57 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Wed, Jan 24, 2018 at 5:25 PM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> If there were some way for the postmaster to cause reason >> PROCSIG_PARALLEL_MESSAGE to be set in the leader process instead of >> just notification via kill(SIGUSR1) when it fails to fork a parallel >> worker, we'd get (1) for free in any latch/CFI loop code. But I >> understand that we can't do that by project edict. > > Based on the above observation, here is a terrible idea you'll all > hate. It is pessimistic and expensive: it thinks that every latch > wake might be the postmaster telling us it's failed to fork() a > parallel worker, until we've seen a sign of life on every worker's > error queue. Untested illustration code only. This is the only way > I've come up with to discover fork failure in any latch/CFI loop (ie > without requiring client code to explicitly try to read either error > or tuple queues). The question, I suppose, is how expensive this is in the real world. If it's actually not a cost that anybody is likely to notice, then I think we should pursue this approach. I wouldn't put too much weight on keeping this simple for users of the parallel infrastructure, though, because something like Amit's WaitForParallelWorkersToAttach() idea still seems acceptable. "Call this function before trusting the finality of nworkers_launched" isn't too onerous a rule to have to follow. -- Peter Geoghegan
On Tue, Jan 23, 2018 at 9:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > Hmm, I think that case will be addressed because tuple queues can > detect if the leader is not attached. It does in code path > shm_mq_receive->shm_mq_counterparty_gone. In > shm_mq_counterparty_gone, it can detect if the worker is gone by using > GetBackgroundWorkerPid. Moreover, I have manually tested this > particular case before saying your patch is fine. Do you have some > other case in mind which I am missing? Oh, right. I had forgotten that I taught shm_mq_attach() to take an optional BackgroundWorkerHandle precisely to solve this kind of problem. I can't decide whether to feel smart because I did that or dumb because I forgot that I did it. But I think there's still a problem: given recent commits, that will cause us to ERROR out if we tried to read from a tuple queue for a worker which failed to start, or that starts and then dies before attaching to the queue. But there's no guarantee we'll try to read from that queue in the first place. Normally, that gets triggered when we get PROCSIG_PARALLEL_MESSAGE, but if the worker never started up in the first place, it can't send that, and the postmaster won't. In Thomas's test case, he's using 4 workers, and if even one of those manages to start, then it'll probably do so after any fork failures have already occurred, and the error will be noticed when that process sends its first message to the leader through the error queue, because it'll send PROCSIG_PARALLEL_MESSAGE via all the queues. If all of the workers fail to start, then that doesn't help. But it still manages to detect the failure in that case because it reaches WaitForParallelWorkersToFinish, which we just patched. But how does that happen, anyway? Shouldn't it get stuck waiting for the tuple queues to which nobody's attached yet? The reason it doesn't is because ExecParallelCreateReaders() calls shm_mq_set_handle() for each queue, which causes the tuple queues to be regarded as detached if the worker fails to start. A detached tuple queue, in general, is not an error condition: it just means that worker has no more tuples. So I think what happens is as follows. If parallel_leader_participation = on, then the leader runs the whole plan, producing however many tuples the plan should have generated, and then afterwards waits for the workers to finish, tripping an error. If parallel_leader_participation = off, the leader doesn't try to run the plan and is perfectly happy with the fact that every worker has produced no tuples; since, as far as it knows, the plan is now fully executed, it waits for them to shut down, tripping an error. I guess that works, but it seems more like blind luck than good design. Parallel CREATE INDEX fails to be as "lucky" as Gather because there's nothing in parallel CREATE INDEX that lets it skip waiting for a worker which doesn't start -- and in general it seems unacceptable to impose a coding requirement that all future parallel operations must fail to notice the difference between a worker that completed successfully and one that never ran at all. If we made the Gather node wait only for workers that actually reached the Gather -- either by using a Barrier or by some other technique -- then this would be a lot less fragile. And the same kind of technique would work for parallel CREATE INDEX. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
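The mechanism Robert describes is the shm_mq_set_handle() call made for each tuple queue. Roughly (paraphrased rather than quoted, and details vary by version), ExecParallelCreateReaders() in execParallel.c does the following, which is what lets shm_mq treat a never-started worker's tuple queue as detached rather than waiting for it forever:

/*
 * Paraphrased sketch of ExecParallelCreateReaders(): associate each tuple
 * queue with its worker's BackgroundWorkerHandle before creating readers.
 */
void
ExecParallelCreateReaders(ParallelExecutorInfo *pei)
{
	int			nworkers = pei->pcxt->nworkers_launched;
	int			i;

	Assert(pei->reader == NULL);

	if (nworkers > 0)
	{
		pei->reader = (TupleQueueReader **)
			palloc(nworkers * sizeof(TupleQueueReader *));

		for (i = 0; i < nworkers; i++)
		{
			/* A queue with a handle reads as detached if the worker dies. */
			shm_mq_set_handle(pei->tqueue[i],
							  pei->pcxt->worker[i].bgwhandle);
			pei->reader[i] = CreateTupleQueueReader(pei->tqueue[i]);
		}
	}
}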
On Wed, Jan 24, 2018 at 12:05 PM, Robert Haas <robertmhaas@gmail.com> wrote: > In Thomas's test case, he's using 4 workers, and if even one of those > manages to start, then it'll probably do so after any fork failures > have already occurred, and the error will be noticed when that process > sends its first message to the leader through the error queue, because > it'll send PROCSIG_PARALLEL_MESSAGE via all the queues. If all of the > workers fail to start, then that doesn't help. But it still manages > to detect the failure in that case because it reaches > WaitForParallelWorkersToFinish, which we just patched. > > But how does that happen, anyway? Shouldn't it get stuck waiting for > the tuple queues to which nobody's attached yet? The reason it > doesn't is because > ExecParallelCreateReaders() calls shm_mq_set_handle() for each queue, > which causes the tuple queues to be regarded as detached if the worker > fails to start. A detached tuple queue, in general, is not an error > condition: it just means that worker has no more tuples. This explains all the confusion. Amit told me that using a tuple queue made all the difference here. Even till, it seemed surprising that we'd rely on that from such a long distance from within nodeGather.c. > I guess that works, but it seems more like blind luck than good > design. Parallel CREATE INDEX fails to be as "lucky" as Gather > because there's nothing in parallel CREATE INDEX that lets it skip > waiting for a worker which doesn't start -- and in general it seems > unacceptable to impose a coding requirement that all future parallel > operations must fail to notice the difference between a worker that > completed successfully and one that never ran at all. +1. > If we made the Gather node wait only for workers that actually reached > the Gather -- either by using a Barrier or by some other technique -- > then this would be a lot less fragile. And the same kind of technique > would work for parallel CREATE INDEX. The use of a barrier has problems of its own [1], though, of which one is unique to the parallel_leader_participation=off case. I thought that you yourself agreed with this [2] -- do you? Another argument against using a barrier in this way is that it seems like way too much mechanism to solve a simple problem. Why should a client of parallel.h not be able to rely on nworkers_launched (perhaps only after "verifying it can be trusted")? That seem like a pretty reasonable requirement for clients to have for any framework for parallel imperative programming. I think that we should implement "some other technique", instead of using a barrier. As I've said, Amit's WaitForParallelWorkersToAttach() idea seems like a promising "other technique". [1] https://www.postgresql.org/message-id/CAA4eK1+a0OF4M231vBgPr_0Ygg_BNmRGZLiB7WQDE-FYBSyrGg@mail.gmail.com [2] https://www.postgresql.org/message-id/CA+TgmoaF8UA8v8hP=CcoqUc50pucPC8ABj-_yyC++yGggjWFsw@mail.gmail.com -- Peter Geoghegan
On Wed, Jan 24, 2018 at 5:52 PM, Peter Geoghegan <pg@bowt.ie> wrote: >> If we made the Gather node wait only for workers that actually reached >> the Gather -- either by using a Barrier or by some other technique -- >> then this would be a lot less fragile. And the same kind of technique >> would work for parallel CREATE INDEX. > > The use of a barrier has problems of its own [1], though, of which one > is unique to the parallel_leader_participation=off case. I thought > that you yourself agreed with this [2] -- do you? > > Another argument against using a barrier in this way is that it seems > like way too much mechanism to solve a simple problem. Why should a > client of parallel.h not be able to rely on nworkers_launched (perhaps > only after "verifying it can be trusted")? That seem like a pretty > reasonable requirement for clients to have for any framework for > parallel imperative programming. Well, I've been resisting that approach from the very beginning of parallel query. Eventually, I hope that we're going to go in the direction of changing our mind about how many workers parallel operations use "on the fly". For example, if there are 8 parallel workers available and 4 of them are in use, and you start a query (or index build) that wants 6 but only gets 4, it would be nice if the other 2 could join later after the other operation finishes and frees some up. That, of course, won't work very well if parallel operations are coded in such a way that the number of workers must be nailed down at the very beginning. Now maybe all that seems like pie in the sky, and perhaps it is, but I hold out hope. For queries, there is another consideration, which is that some queries may run with parallelism but actually finish quite quickly - it's not desirable to make the leader wait for workers to start when it could be busy computing. That's a lesser consideration for bulk operations like parallel CREATE INDEX, but even there I don't think it's totally negligible. For both reasons, it's much better, or so it seems to me, if parallel operations are coded to work with the number of workers that show up, rather than being inflexibly tied to a particular worker count. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jan 24, 2018 at 3:37 PM, Robert Haas <robertmhaas@gmail.com> wrote: > Well, I've been resisting that approach from the very beginning of > parallel query. Eventually, I hope that we're going to go in the > direction of changing our mind about how many workers parallel > operations use "on the fly". For example, if there are 8 parallel > workers available and 4 of them are in use, and you start a query (or > index build) that wants 6 but only gets 4, it would be nice if the > other 2 could join later after the other operation finishes and frees > some up. That seems like a worthwhile high-level goal. I remember looking into Intel Threading Building Blocks many years ago, and seeing some interesting ideas there. According to Wikipedia, "TBB implements work stealing to balance a parallel workload across available processing cores in order to increase core utilization and therefore scaling". The programmer does not operate in terms of an explicit number of threads, and there are probably certain types of problems that this has an advantage with. That model also has its costs, though, and I don't think it's every going to supplant a lower level approach. In an ideal world, you have both things, because TBB's approach apparently has high coordination overhead on many core systems. > That, of course, won't work very well if parallel operations > are coded in such a way that the number of workers must be nailed down > at the very beginning. But my whole approach to sorting is based on the idea that each worker produces a roughly even amount of output to merge. I don't see any scope to do better for parallel CREATE INDEX. (Other uses for parallel sort are another matter, though.) > Now maybe all that seems like pie in the sky, and perhaps it is, but I > hold out hope. For queries, there is another consideration, which is > that some queries may run with parallelism but actually finish quite > quickly - it's not desirable to make the leader wait for workers to > start when it could be busy computing. That's a lesser consideration > for bulk operations like parallel CREATE INDEX, but even there I don't > think it's totally negligible. Since I don't have to start this until the leader stops participating as a worker, there is no wait in the leader. In the vast majority of cases, a call to something like WaitForParallelWorkersToAttach() ends up looking at state in shared memory, immediately determining that every launched process initialized successfully. The overhead should be negligible in the real world. > For both reasons, it's much better, or so it seems to me, if parallel > operations are coded to work with the number of workers that show up, > rather than being inflexibly tied to a particular worker count. I've been clear from day one that my approach to parallel tuplesort isn't going to be that useful to parallel query in its first version. You need some kind of partitioning (a distribution sort of some kind) for that, and probably plenty of cooperation from within the executor. I've also said that I don't think we can do much better for parallel CREATE INDEX even *with* support for partitioning, which is something borne out by comparisons with other systems. My patch was always presented as an 80/20 solution. I have given you specific technical reasons why I think that using a barrier is at least a bad idea for nbtsort.c, and probably for nodeGather.c, too. Those problems will need to be worked through if you're not going to concede the point on using a barrier. 
Your aspirations around not assuming that workers cannot join later seem like good ones, broadly speaking, but they are not particularly applicable to how *anything* happens to work now. Besides all this, I'm not even suggesting that I need to know the number of workers up front for parallel CREATE INDEX. Perhaps nworkers_launched can be incremented after the fact following some later enhancement to the parallel infrastructure, in which case parallel CREATE INDEX will theoretically be prepared to take advantage right away (though other parallel sort operations seem more likely to *actually* benefit). That will be a job for the parallel infrastructure, though, not for each and every parallel operation -- how else could we possibly hope to add more workers that become available half way through, as part of a future enhancement to the parallel infrastructure? Surely every caller to CreateParallelContext() should not need to invent their own way of doing this. All I want is to be able to rely on nworkers_launched. That's not in tension with this other goal/aspiration, and actually seems to complement it. -- Peter Geoghegan