Thread: crashes due to setting max_parallel_workers=0
Hi, while working on a patch I ran into some crashes that seem to be caused by inconsistent handling of max_parallel_workers - queries still seem to be planned with parallel plans enabled, but then crash at execution. The attached script reproduces the issue on a simple query, causing crashes in GatherMerge, but I assume the issue is more general. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Sat, Mar 25, 2017 at 7:40 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > Hi, > > while working on a patch I ran into some crashes that seem to be caused by > inconsistent handling of max_parallel_workers - queries still seem to be > planned with parallel plans enabled, but then crash at execution. > > The attached script reproduces the issue on a simple query, causing crashed > in GatherMerge, but I assume the issue is more general. > I could see other parallel plans like Gather also picked at max_parallel_workers=0. I suspect multiple issues here and this needs some investigation. I have added this to open items list. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 25 March 2017 at 13:10, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > while working on a patch I ran into some crashes that seem to be caused by > inconsistent handling of max_parallel_workers - queries still seem to be > planned with parallel plans enabled, but then crash at execution. > > The attached script reproduces the issue on a simple query, causing crashed > in GatherMerge, but I assume the issue is more general. I had a look at this and found a bunch of off by 1 errors in the code. I've attached a couple of patches, one is the minimal fix, and one more complete one. In the more complete one I ended up ditching the GatherMergeState->nreaders altogether. It was always the same as nworkers_launched anyway, so I saw no point in it. Here I've attempted to make the code a bit more understandable, to prevent further confusion about how many elements are in each array. Probably there's more that can be done here. I see GatherState has nreaders too, but I've not looked into the detail of if it suffers from the same weirdness of nreaders always matching nworkers_launched. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Thanks Tomas for reporting the issue and thanks David for working
on this.
I can see the problem in GatherMerge, especially when nworkers_launched
is zero. I will look into this issue and will post a fix for the same.
Another point which I think we should fix: when someone sets
max_parallel_workers = 0, we should also set max_parallel_workers_per_gather
to zero. That way we can avoid generating a gather path with
max_parallel_workers = 0.
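For illustration (a rough sketch only - the table name "t" is a placeholder and
not taken from the attached reproduction script), the planner can still choose
a parallel plan even though no workers can ever be granted:

create table t as select g as a from generate_series(1, 1000000) g;
analyze t;
set parallel_tuple_cost = 0;
set parallel_setup_cost = 0;
set min_parallel_table_scan_size = 0;
set max_parallel_workers_per_gather = 2;
set max_parallel_workers = 0;
explain (costs off) select * from t order by a;
-- the plan can still contain a Gather Merge node, which then has to run
-- with zero workers launched at execution time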
Thanks,
On Sat, Mar 25, 2017 at 2:21 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
On 25 March 2017 at 13:10, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> while working on a patch I ran into some crashes that seem to be caused by
> inconsistent handling of max_parallel_workers - queries still seem to be
> planned with parallel plans enabled, but then crash at execution.
>
> The attached script reproduces the issue on a simple query, causing crashed
> in GatherMerge, but I assume the issue is more general.
I had a look at this and found a bunch of off by 1 errors in the code.
I've attached a couple of patches, one is the minimal fix, and one
more complete one.
In the more complete one I ended up ditching the
GatherMergeState->nreaders altogether. It was always the same as
nworkers_launched anyway, so I saw no point in it.
Here I've attempted to make the code a bit more understandable, to
prevent further confusion about how many elements are in each array.
Probably there's more that can be done here. I see GatherState has
nreaders too, but I've not looked into the detail of if it suffers
from the same weirdness of nreaders always matching nworkers_launched.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Rushabh Lathia
On 25 March 2017 at 23:09, Rushabh Lathia <rushabh.lathia@gmail.com> wrote: > Also another point which I think we should fix is, when someone set > max_parallel_workers = 0, we should also set the > max_parallel_workers_per_gather > to zero. So that way it we can avoid generating the gather path with > max_parallel_worker = 0. I see that it was actually quite useful that it works the way it does. If it had worked the same as max_parallel_workers_per_gather, then likely Tomas would never have found this bug. I wondered if there's anything we can do here to better test cases when no workers are available, to try to ensure the parallel nodes work correctly, but the more I think about it, the more I don't see anything wrong with just using SET max_parallel_workers = 0; My vote would be to leave the GUC behaviour as is and add some tests to select_parallel.sql which exploit setting max_parallel_workers to 0 for running some tests. If that's not going to fly, then unless we add something else to allow us to reliably not get any workers, we're close to closing the door on being able to write automatic tests to capture this sort of bug. ... thinks for a bit.... perhaps some magic value like -1 could be used for this... unsure of how that would be documented though... -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
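For example, a test along these lines could exercise the zero-worker code paths
(only a sketch, not taken from select_parallel.sql; tenk1 is assumed to be one
of the standard regression-test tables, and the cost settings are there just to
make a parallel plan likely):

set parallel_setup_cost = 0;
set parallel_tuple_cost = 0;
set min_parallel_table_scan_size = 0;
set max_parallel_workers_per_gather = 2;
set max_parallel_workers = 0;
explain (costs off) select count(*) from tenk1;  -- still expected to plan a Gather
select count(*) from tenk1;                      -- must return the correct result
                                                 -- with zero workers launched
reset max_parallel_workers;
reset max_parallel_workers_per_gather;
reset min_parallel_table_scan_size;
reset parallel_tuple_cost;
reset parallel_setup_cost;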
On Sat, Mar 25, 2017 at 6:31 PM, David Rowley <david.rowley@2ndquadrant.com> wrote: > On 25 March 2017 at 23:09, Rushabh Lathia <rushabh.lathia@gmail.com> wrote: >> Also another point which I think we should fix is, when someone set >> max_parallel_workers = 0, we should also set the >> max_parallel_workers_per_gather >> to zero. So that way it we can avoid generating the gather path with >> max_parallel_worker = 0. > > I see that it was actually quite useful that it works the way it does. > If it had worked the same as max_parallel_workers_per_gather, then > likely Tomas would never have found this bug. > > I wondered if there's anything we can do here to better test cases > when no workers are able to try to ensure the parallel nodes work > correctly, but the more I think about it, the more I don't see wrong > with just using SET max_parallel_workers = 0; > > My vote would be to leave the GUC behaviour as is and add some tests > to select_parallel.sql which exploit setting max_parallel_workers to 0 > for running some tests. > I think force_parallel_mode=regress should test the same thing. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 26 March 2017 at 00:17, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Sat, Mar 25, 2017 at 6:31 PM, David Rowley > <david.rowley@2ndquadrant.com> wrote: >> On 25 March 2017 at 23:09, Rushabh Lathia <rushabh.lathia@gmail.com> wrote: >>> Also another point which I think we should fix is, when someone set >>> max_parallel_workers = 0, we should also set the >>> max_parallel_workers_per_gather >>> to zero. So that way it we can avoid generating the gather path with >>> max_parallel_worker = 0. >> >> I see that it was actually quite useful that it works the way it does. >> If it had worked the same as max_parallel_workers_per_gather, then >> likely Tomas would never have found this bug. >> >> I wondered if there's anything we can do here to better test cases >> when no workers are able to try to ensure the parallel nodes work >> correctly, but the more I think about it, the more I don't see wrong >> with just using SET max_parallel_workers = 0; >> >> My vote would be to leave the GUC behaviour as is and add some tests >> to select_parallel.sql which exploit setting max_parallel_workers to 0 >> for running some tests. >> > > I think force_parallel_mode=regress should test the same thing. Not really. That flag's days are surely numbered. It creates a Gather node, the problem was with GatherMerge. -- David Rowley http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 3/25/17 09:01, David Rowley wrote: > On 25 March 2017 at 23:09, Rushabh Lathia <rushabh.lathia@gmail.com> wrote: >> Also another point which I think we should fix is, when someone set >> max_parallel_workers = 0, we should also set the >> max_parallel_workers_per_gather >> to zero. So that way it we can avoid generating the gather path with >> max_parallel_worker = 0. > I see that it was actually quite useful that it works the way it does. > If it had worked the same as max_parallel_workers_per_gather, then > likely Tomas would never have found this bug. Another problem is that the GUC system doesn't really support cases where the validity of one setting depends on the current value of another setting. So each individual setting needs to be robust against cases of related settings being nonsensical. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Mar 25, 2017 at 7:01 PM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
On 3/25/17 09:01, David Rowley wrote:
> On 25 March 2017 at 23:09, Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
>> Also another point which I think we should fix is, when someone set
>> max_parallel_workers = 0, we should also set the
>> max_parallel_workers_per_gather
>> to zero. So that way it we can avoid generating the gather path with
>> max_parallel_worker = 0.
> I see that it was actually quite useful that it works the way it does.
> If it had worked the same as max_parallel_workers_per_gather, then
> likely Tomas would never have found this bug.
Another problem is that the GUC system doesn't really support cases
where the validity of one setting depends on the current value of
another setting. So each individual setting needs to be robust against
cases of related settings being nonsensical.
Okay.
About the original issue reported by Tomas, I did more debugging and
found that the problem was that gather_merge_clear_slots() was not returning
the cleared slot when nreaders is zero (i.e. nworkers_launched = 0).
Because of that, the scan continued even after all the tuples were exhausted,
and then ended up with a server crash in gather_merge_getnext(). In the patch
I also added an Assert in gather_merge_getnext() that the index
should be less than nreaders + 1 (the leader).
PFA simple patch to fix the problem.
Thanks,
Rushabh Lathia
Attachment
On 03/25/2017 05:18 PM, Rushabh Lathia wrote: > > > On Sat, Mar 25, 2017 at 7:01 PM, Peter Eisentraut > <peter.eisentraut@2ndquadrant.com > <mailto:peter.eisentraut@2ndquadrant.com>> wrote: > > On 3/25/17 09:01, David Rowley wrote: > > On 25 March 2017 at 23:09, Rushabh Lathia <rushabh.lathia@gmail.com <mailto:rushabh.lathia@gmail.com>> wrote: > >> Also another point which I think we should fix is, when someone set > >> max_parallel_workers = 0, we should also set the > >> max_parallel_workers_per_gather > >> to zero. So that way it we can avoid generating the gather path with > >> max_parallel_worker = 0. > > I see that it was actually quite useful that it works the way it does. > > If it had worked the same as max_parallel_workers_per_gather, then > > likely Tomas would never have found this bug. > > Another problem is that the GUC system doesn't really support cases > where the validity of one setting depends on the current value of > another setting. So each individual setting needs to be robust against > cases of related settings being nonsensical. > > > Okay. > > About the original issue reported by Tomas, I did more debugging and > found that - problem was gather_merge_clear_slots() was not returning > the clear slot when nreader is zero (means nworkers_launched = 0). > Due to the same scan was continue even all the tuple are exhausted, > and then end up with server crash at gather_merge_getnext(). In the patch > I also added the Assert into gather_merge_getnext(), about the index > should be less then the nreaders + 1 (leader). > > PFA simple patch to fix the problem. > I think there are two issues at play, here - the first one is that we still produce parallel plans even with max_parallel_workers=0, and the second one is the crash in GatherMerge when nworkers=0. Your patch fixes the latter (thanks for looking into it), which is obviously a good thing - getting 0 workers on a busy system is quite possible, because all the parallel workers can be already chewing on some other query. But it seems a bit futile to produce the parallel plan in the first place, because with max_parallel_workers=0 we can't possibly get any parallel workers ever. I wonder why compute_parallel_worker() only looks at max_parallel_workers_per_gather, i.e. why shouldn't it do: parallel_workers = Min(parallel_workers, max_parallel_workers); Perhaps this was discussed and is actually intentional, though. Regarding handling this at the GUC level - I agree with Peter that that's not a good idea. I suppose we could deal with checking the values in the GUC check/assign hooks, but what we don't have is a way to undo the changes in all the GUCs. That is, if I do SET max_parallel_workers = 0; SET max_parallel_workers = 16; I expect to end up with just max_parallel_workers GUC changed and nothing else. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
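To make the planned-versus-launched distinction concrete (illustrative only;
"t" is a placeholder table and the comments show approximate output), EXPLAIN
ANALYZE reports both counts for the Gather Merge node:

set max_parallel_workers = 0;
explain (analyze, costs off) select * from t order by a;
-- Gather Merge
--   Workers Planned: 2
--   Workers Launched: 0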
On 03/25/2017 02:01 PM, David Rowley wrote: > I wondered if there's anything we can do here to better test cases > when no workers are able to try to ensure the parallel nodes work > correctly, but the more I think about it, the more I don't see wrong > with just using SET max_parallel_workers = 0; > It's demonstrably a valid way to disable parallel queries (i.e. there's nothing wrong with it), because the docs say this: Setting this value to 0 disables parallel query execution. > > My vote would be to leave the GUC behaviour as is and add some tests > to select_parallel.sql which exploit setting max_parallel_workers to 0 > for running some tests. > > If that's not going to fly, then unless we add something else to allow > us to reliably not get any workers, then we're closing to close the > door on being able to write automatic tests to capture this sort of > bug. > > ... thinks for a bit.... > > perhaps some magic value like -1 could be used for this... unsure of > how that would be documented though... > I agree it'd be very useful to have a mode where we generate parallel plans but then prohibit starting any workers. That would detect this and similar issues, I think. I'm not sure we need to invent a new magic value, though. Can we simply look at force_parallel_mode, and if it's 'regress' then treat 0 differently? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 27 March 2017 at 10:23, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > I'm not sure we need to invent a new magic value, though. Can we simply look > at force_parallel_mode, and if it's 'regress' then tread 0 differently? see standard_planner() if (force_parallel_mode != FORCE_PARALLEL_OFF && best_path->parallel_safe) { Gather *gather = makeNode(Gather); Probably force_parallel_mode is good for testing the tuple queue code, and some code in Gather, but I'm not quite sure what else it's good for. Certainly not GatherMerge. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Mar 27, 2017 at 3:43 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On 03/25/2017 05:18 PM, Rushabh Lathia wrote:
On Sat, Mar 25, 2017 at 7:01 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com
<mailto:peter.eisentraut@2ndquadrant.com >> wrote:
On 3/25/17 09:01, David Rowley wrote:
> On 25 March 2017 at 23:09, Rushabh Lathia <rushabh.lathia@gmail.com <mailto:rushabh.lathia@gmail.com >> wrote:
>> Also another point which I think we should fix is, when someone set
>> max_parallel_workers = 0, we should also set the
>> max_parallel_workers_per_gather
>> to zero. So that way it we can avoid generating the gather path with
>> max_parallel_worker = 0.
> I see that it was actually quite useful that it works the way it does.
> If it had worked the same as max_parallel_workers_per_gather, then
> likely Tomas would never have found this bug.
Another problem is that the GUC system doesn't really support cases
where the validity of one setting depends on the current value of
another setting. So each individual setting needs to be robust against
cases of related settings being nonsensical.
Okay.
About the original issue reported by Tomas, I did more debugging and
found that - problem was gather_merge_clear_slots() was not returning
the clear slot when nreader is zero (means nworkers_launched = 0).
Due to the same scan was continue even all the tuple are exhausted,
and then end up with server crash at gather_merge_getnext(). In the patch
I also added the Assert into gather_merge_getnext(), about the index
should be less then the nreaders + 1 (leader).
PFA simple patch to fix the problem.
I think there are two issues at play, here - the first one is that we still produce parallel plans even with max_parallel_workers=0, and the second one is the crash in GatherMerge when nworkers=0.
Your patch fixes the latter (thanks for looking into it), which is obviously a good thing - getting 0 workers on a busy system is quite possible, because all the parallel workers can be already chewing on some other query.
Thanks.
But it seems a bit futile to produce the parallel plan in the first place, because with max_parallel_workers=0 we can't possibly get any parallel workers ever. I wonder why compute_parallel_worker() only looks at max_parallel_workers_per_gather, i.e. why shouldn't it do:
parallel_workers = Min(parallel_workers, max_parallel_workers);
I agree with you here. Producing the parallel plan when max_parallel_workers = 0 is wrong. But rather than your suggested fix, I think that we should do something like:
/*
* In no case use more than max_parallel_workers_per_gather or
* max_parallel_workers.
*/
parallel_workers = Min(parallel_workers, Min(max_parallel_workers, max_parallel_workers_per_gather));
Perhaps this was discussed and is actually intentional, though.
Yes, I am not quite sure about this.
Regarding handling this at the GUC level - I agree with Peter that that's not a good idea. I suppose we could deal with checking the values in the GUC check/assign hooks, but what we don't have is a way to undo the changes in all the GUCs. That is, if I do
SET max_parallel_workers = 0;
SET max_parallel_workers = 16;
I expect to end up with just max_parallel_workers GUC changed and nothing else.
regards
--
Tomas Vondra http://www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Rushabh Lathia
On Mon, Mar 27, 2017 at 10:59 AM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
On Mon, Mar 27, 2017 at 3:43 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On 03/25/2017 05:18 PM, Rushabh Lathia wrote:
On Sat, Mar 25, 2017 at 7:01 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com
<mailto:peter.eisentraut@2ndquadrant.com >> wrote:
On 3/25/17 09:01, David Rowley wrote:
> On 25 March 2017 at 23:09, Rushabh Lathia <rushabh.lathia@gmail.com <mailto:rushabh.lathia@gmail.com >> wrote:
>> Also another point which I think we should fix is, when someone set
>> max_parallel_workers = 0, we should also set the
>> max_parallel_workers_per_gather
>> to zero. So that way it we can avoid generating the gather path with
>> max_parallel_worker = 0.
> I see that it was actually quite useful that it works the way it does.
> If it had worked the same as max_parallel_workers_per_gather, then
> likely Tomas would never have found this bug.
Another problem is that the GUC system doesn't really support cases
where the validity of one setting depends on the current value of
another setting. So each individual setting needs to be robust against
cases of related settings being nonsensical.
Okay.
About the original issue reported by Tomas, I did more debugging and
found that - problem was gather_merge_clear_slots() was not returning
the clear slot when nreader is zero (means nworkers_launched = 0).
Due to the same scan was continue even all the tuple are exhausted,
and then end up with server crash at gather_merge_getnext(). In the patch
I also added the Assert into gather_merge_getnext(), about the index
should be less then the nreaders + 1 (leader).
PFA simple patch to fix the problem.
I think there are two issues at play, here - the first one is that we still produce parallel plans even with max_parallel_workers=0, and the second one is the crash in GatherMerge when nworkers=0.
Your patch fixes the latter (thanks for looking into it), which is obviously a good thing - getting 0 workers on a busy system is quite possible, because all the parallel workers can be already chewing on some other query.
Thanks.
I was doing more testing with the patch and I found one more server
crash with the patch around the same area, when we force the gather
merge for a scan having zero rows.
create table dept ( deptno numeric, dname varchar(20));
set parallel_tuple_cost =0;
set parallel_setup_cost =0;
set min_parallel_table_scan_size =0;
set min_parallel_index_scan_size =0;
set force_parallel_mode=regress;
explain analyze select * from dept order by deptno;
This is because for the leader we don't initialize the slot in gm_slots. So
in the case where the number of launched workers is zero and the table has
zero rows, we end up with a NULL slot in the gm_slots array.
Currently gather_merge_clear_slots() clears out the tuple table slots for each
gather merge input and returns a cleared slot. In the patch I modified
gather_merge_clear_slots() to just clear out the tuple table slots and
always return NULL when all the queues and the heap are exhausted.
But it seems a bit futile to produce the parallel plan in the first place, because with max_parallel_workers=0 we can't possibly get any parallel workers ever. I wonder why compute_parallel_worker() only looks at max_parallel_workers_per_gather, i.e. why shouldn't it do:
parallel_workers = Min(parallel_workers, max_parallel_workers);I agree with you here. Producing the parallel plan when max_parallel_workers = 0 is wrong. But rather then your suggested fix, I think that we should do something like:
/*
* In no case use more than max_parallel_workers_per_gather or
* max_parallel_workers.
*/
parallel_workers = Min(parallel_workers, Min(max_parallel_workers, max_parallel_workers_per_gather));
Perhaps this was discussed and is actually intentional, though.Yes, I am not quite sure about this.Regarding handling this at the GUC level - I agree with Peter that that's not a good idea. I suppose we could deal with checking the values in the GUC check/assign hooks, but what we don't have is a way to undo the changes in all the GUCs. That is, if I do
SET max_parallel_workers = 0;
SET max_parallel_workers = 16;
I expect to end up with just max_parallel_workers GUC changed and nothing else.
regards
--
Tomas Vondra http://www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--Rushabh Lathia
Regards,
Rushabh Lathia
Attachment
On Mon, Mar 27, 2017 at 1:29 AM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote: > >> But it seems a bit futile to produce the parallel plan in the first place, >> because with max_parallel_workers=0 we can't possibly get any parallel >> workers ever. I wonder why compute_parallel_worker() only looks at >> max_parallel_workers_per_gather, i.e. why shouldn't it do: >> >> parallel_workers = Min(parallel_workers, max_parallel_workers); >> >> Perhaps this was discussed and is actually intentional, though. >> > > Yes, I am not quite sure about this. It was intentional. See the last paragraph of https://www.postgresql.org/message-id/CA%2BTgmoaMSn6a1780VutfsarCu0LCr%3DCO2yi4vLUo-JQbn4YuRA@mail.gmail.com -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Mar 27, 2017 at 1:29 AM, Rushabh Lathia > <rushabh.lathia@gmail.com> wrote: >> But it seems a bit futile to produce the parallel plan in the first place, >> because with max_parallel_workers=0 we can't possibly get any parallel >> workers ever. I wonder why compute_parallel_worker() only looks at >> max_parallel_workers_per_gather, i.e. why shouldn't it do: >> parallel_workers = Min(parallel_workers, max_parallel_workers); >> Perhaps this was discussed and is actually intentional, though. > It was intentional. See the last paragraph of > https://www.postgresql.org/message-id/CA%2BTgmoaMSn6a1780VutfsarCu0LCr%3DCO2yi4vLUo-JQbn4YuRA@mail.gmail.com Since this has now come up twice, I suggest adding a comment there that explains why we're intentionally ignoring max_parallel_workers. regards, tom lane
On Mon, Mar 27, 2017 at 9:54 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Mon, Mar 27, 2017 at 1:29 AM, Rushabh Lathia >> <rushabh.lathia@gmail.com> wrote: >>> But it seems a bit futile to produce the parallel plan in the first place, >>> because with max_parallel_workers=0 we can't possibly get any parallel >>> workers ever. I wonder why compute_parallel_worker() only looks at >>> max_parallel_workers_per_gather, i.e. why shouldn't it do: >>> parallel_workers = Min(parallel_workers, max_parallel_workers); >>> Perhaps this was discussed and is actually intentional, though. > >> It was intentional. See the last paragraph of >> https://www.postgresql.org/message-id/CA%2BTgmoaMSn6a1780VutfsarCu0LCr%3DCO2yi4vLUo-JQbn4YuRA@mail.gmail.com > > Since this has now come up twice, I suggest adding a comment there > that explains why we're intentionally ignoring max_parallel_workers. Hey, imagine if the comments explained the logic behind the code! Good idea. How about the attached? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Sat, Mar 25, 2017 at 12:18 PM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote: > About the original issue reported by Tomas, I did more debugging and > found that - problem was gather_merge_clear_slots() was not returning > the clear slot when nreader is zero (means nworkers_launched = 0). > Due to the same scan was continue even all the tuple are exhausted, > and then end up with server crash at gather_merge_getnext(). In the patch > I also added the Assert into gather_merge_getnext(), about the index > should be less then the nreaders + 1 (leader). Well, you and David Rowley seem to disagree on what the fix is here. His patches posted upthread do A, and yours do B, and from a quick look those things are not just different ways of spelling the same underlying fix, but actually directly conflicting ideas about what the fix should be. Any chance you can review his patches, and maybe he can review yours, and we could try to agree on a consensus position? :-) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 28 March 2017 at 04:57, Robert Haas <robertmhaas@gmail.com> wrote: > On Sat, Mar 25, 2017 at 12:18 PM, Rushabh Lathia > <rushabh.lathia@gmail.com> wrote: >> About the original issue reported by Tomas, I did more debugging and >> found that - problem was gather_merge_clear_slots() was not returning >> the clear slot when nreader is zero (means nworkers_launched = 0). >> Due to the same scan was continue even all the tuple are exhausted, >> and then end up with server crash at gather_merge_getnext(). In the patch >> I also added the Assert into gather_merge_getnext(), about the index >> should be less then the nreaders + 1 (leader). > > Well, you and David Rowley seem to disagree on what the fix is here. > His patches posted upthread do A, and yours do B, and from a quick > look those things are not just different ways of spelling the same > underlying fix, but actually directly conflicting ideas about what the > fix should be. Any chance you can review his patches, and maybe he > can review yours, and we could try to agree on a consensus position? > :-) When comparing Rushabh's to my minimal patch, both change gather_merge_clear_slots() to clear the leader's slot. My fix mistakenly changes things so it does ExecInitExtraTupleSlot() on the leader's slot, but seems that's not required since gather_merge_readnext() sets the leader's slot to the output of ExecProcNode(outerPlan). I'd ignore my minimal fix because of that mistake. Rushabh's patch sidesteps this by adding a conditional pfree() to not free slot that we didn't allocate in the first place. I do think the code could be improved a bit. I don't like the way GatherMergeState's nreaders and nworkers_launched are always the same. I think this all threw me off a bit and may have been the reason for the bug in the first place. -- David Rowley http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Mar 27, 2017 at 12:11 PM, David Rowley <david.rowley@2ndquadrant.com> wrote: > On 28 March 2017 at 04:57, Robert Haas <robertmhaas@gmail.com> wrote: >> On Sat, Mar 25, 2017 at 12:18 PM, Rushabh Lathia >> <rushabh.lathia@gmail.com> wrote: >>> About the original issue reported by Tomas, I did more debugging and >>> found that - problem was gather_merge_clear_slots() was not returning >>> the clear slot when nreader is zero (means nworkers_launched = 0). >>> Due to the same scan was continue even all the tuple are exhausted, >>> and then end up with server crash at gather_merge_getnext(). In the patch >>> I also added the Assert into gather_merge_getnext(), about the index >>> should be less then the nreaders + 1 (leader). >> >> Well, you and David Rowley seem to disagree on what the fix is here. >> His patches posted upthread do A, and yours do B, and from a quick >> look those things are not just different ways of spelling the same >> underlying fix, but actually directly conflicting ideas about what the >> fix should be. Any chance you can review his patches, and maybe he >> can review yours, and we could try to agree on a consensus position? >> :-) > > When comparing Rushabh's to my minimal patch, both change > gather_merge_clear_slots() to clear the leader's slot. My fix > mistakenly changes things so it does ExecInitExtraTupleSlot() on the > leader's slot, but seems that's not required since > gather_merge_readnext() sets the leader's slot to the output of > ExecProcNode(outerPlan). I'd ignore my minimal fix because of that > mistake. Rushabh's patch sidesteps this by adding a conditional > pfree() to not free slot that we didn't allocate in the first place. > > I do think the code could be improved a bit. I don't like the way > GatherMergeState's nreaders and nworkers_launched are always the same. > I think this all threw me off a bit and may have been the reason for > the bug in the first place. Yeah, if those fields are always the same, then I think that they should be merged. That seems undeniably a good idea. Possibly we should fix the crash bug first, though, and then do that afterwards. What bugs me a little about Rushabh's fix is that it looks like magic. You have to know that we're looping over two things and freeing them up, but there's one more of one thing than the other thing. I think that at least needs some comments or something. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Mar 27, 2017 at 9:54 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Since this has now come up twice, I suggest adding a comment there >> that explains why we're intentionally ignoring max_parallel_workers. > Good idea. How about the attached? WFM ... but seems like there should be some flavor of this statement in the user-facing docs too (ie, "max_parallel_workers_per_gather > max_parallel_workers is a bad idea unless you're trying to test what happens when a plan can't get all the workers it planned for"). The existing text makes some vague allusions suggesting that the two GUCs might be interrelated, but I think it could be improved. regards, tom lane
On Mon, Mar 27, 2017 at 12:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Mon, Mar 27, 2017 at 9:54 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Since this has now come up twice, I suggest adding a comment there >>> that explains why we're intentionally ignoring max_parallel_workers. > >> Good idea. How about the attached? > > WFM ... but seems like there should be some flavor of this statement > in the user-facing docs too (ie, "max_parallel_workers_per_gather > > max_parallel_workers is a bad idea unless you're trying to test what > happens when a plan can't get all the workers it planned for"). The > existing text makes some vague allusions suggesting that the two > GUCs might be interrelated, but I think it could be improved. Do you have a more specific idea? I mean, this seems like a degenerate case of what the documentation for max_parallel_workers_per_gather says already. Even if max_parallel_workers_per_gather <= Min(max_worker_processes, max_parallel_workers), it's quite possible that you'll regularly be generating plans that can't obtain the budgeted number of workers. The only thing that is really special about the case where max_parallel_workers_per_gather > Min(max_worker_processes, max_parallel_workers) is that this can happen even on an otherwise-idle system. I'm not quite sure how to emphasize that without seeming to state the obvious. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
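Concretely, the degenerate configuration being described is just (settings
shown for illustration only):

set max_parallel_workers = 2;
set max_parallel_workers_per_gather = 4;
-- every Gather / Gather Merge may now be planned with up to 4 workers,
-- but can never launch more than 2, even on an otherwise idle system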
On Mon, Mar 27, 2017 at 9:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Mar 27, 2017 at 12:11 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 28 March 2017 at 04:57, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sat, Mar 25, 2017 at 12:18 PM, Rushabh Lathia
>> <rushabh.lathia@gmail.com> wrote:
>>> About the original issue reported by Tomas, I did more debugging and
>>> found that - problem was gather_merge_clear_slots() was not returning
>>> the clear slot when nreader is zero (means nworkers_launched = 0).
>>> Due to the same scan was continue even all the tuple are exhausted,
>>> and then end up with server crash at gather_merge_getnext(). In the patch
>>> I also added the Assert into gather_merge_getnext(), about the index
>>> should be less then the nreaders + 1 (leader).
>>
>> Well, you and David Rowley seem to disagree on what the fix is here.
>> His patches posted upthread do A, and yours do B, and from a quick
>> look those things are not just different ways of spelling the same
>> underlying fix, but actually directly conflicting ideas about what the
>> fix should be. Any chance you can review his patches, and maybe he
>> can review yours, and we could try to agree on a consensus position?
>> :-)
>
> When comparing Rushabh's to my minimal patch, both change
> gather_merge_clear_slots() to clear the leader's slot. My fix
> mistakenly changes things so it does ExecInitExtraTupleSlot() on the
> leader's slot, but seems that's not required since
> gather_merge_readnext() sets the leader's slot to the output of
> ExecProcNode(outerPlan). I'd ignore my minimal fix because of that
> mistake. Rushabh's patch sidesteps this by adding a conditional
> pfree() to not free slot that we didn't allocate in the first place.
>
> I do think the code could be improved a bit. I don't like the way
> GatherMergeState's nreaders and nworkers_launched are always the same.
> I think this all threw me off a bit and may have been the reason for
> the bug in the first place.
Yeah, if those fields are always the same, then I think that they
should be merged. That seems undeniably a good idea.
Hmm, I agree that it's a good idea, and I will work on that as a separate patch.
Possibly we
should fix the crash bug first, though, and then do that afterwards.
What bugs me a little about Rushabh's fix is that it looks like magic.
You have to know that we're looping over two things and freeing them
up, but there's one more of one thing than the other thing. I think
that at least needs some comments or something.
So in my second version of the patch I changed gather_merge_clear_slots() to
just clear the slots for the workers, plus some other cleanup. It also returns
NULL from gather_merge_getnext() when all the queues and the heap are
exhausted - whereas earlier gather_merge_clear_slots() was returning a cleared
slot. This way we make sure that we don't run over freeing the slot for
the leader, and gather_merge_getnext() doesn't need to depend on that
cleared slot.
Thanks,
Rushabh Lathia
On Mon, Mar 27, 2017 at 12:36 PM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote: > Hmm I agree that it's good idea, and I will work on that as separate patch. Maybe you want to start with what David already posted? >> Possibly we >> should fix the crash bug first, though, and then do that afterwards. >> What bugs me a little about Rushabh's fix is that it looks like magic. >> You have to know that we're looping over two things and freeing them >> up, but there's one more of one thing than the other thing. I think >> that at least needs some comments or something. >> > So in my second version of patch I change gather_merge_clear_slots() to > just clear the slot for the worker and some other clean up. Also throwing > NULL from gather_merge_getnext() when all the queues and heap are > exhausted - which earlier gather_merge_clear_slots() was returning clear > slot. This way we make sure that we don't run over freeing the slot for > the leader and gather_merge_getnext() don't need to depend on that > clear slot. Ah, I missed that. That does seem cleaner. Anybody see a problem with that approach? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 03/27/2017 05:51 PM, Robert Haas wrote: > On Mon, Mar 27, 2017 at 9:54 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> On Mon, Mar 27, 2017 at 1:29 AM, Rushabh Lathia >>> <rushabh.lathia@gmail.com> wrote: >>>> But it seems a bit futile to produce the parallel plan in the first place, >>>> because with max_parallel_workers=0 we can't possibly get any parallel >>>> workers ever. I wonder why compute_parallel_worker() only looks at >>>> max_parallel_workers_per_gather, i.e. why shouldn't it do: >>>> parallel_workers = Min(parallel_workers, max_parallel_workers); >>>> Perhaps this was discussed and is actually intentional, though. >> >>> It was intentional. See the last paragraph of >>> https://www.postgresql.org/message-id/CA%2BTgmoaMSn6a1780VutfsarCu0LCr%3DCO2yi4vLUo-JQbn4YuRA@mail.gmail.com >> >> Since this has now come up twice, I suggest adding a comment there >> that explains why we're intentionally ignoring max_parallel_workers. > > Hey, imagine if the comments explained the logic behind the code! > > Good idea. How about the attached? > Certainly an improvement. But perhaps we should also mention this at compute_parallel_worker, i.e. that not looking at max_parallel_workers is intentional. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 03/27/2017 01:40 PM, Rushabh Lathia wrote: > > ... > I was doing more testing with the patch and I found one more server > crash with the patch around same area, when we forced the gather > merge for the scan having zero rows. > > create table dept ( deptno numeric, dname varchar(20); > set parallel_tuple_cost =0; > set parallel_setup_cost =0; > set min_parallel_table_scan_size =0; > set min_parallel_index_scan_size =0; > set force_parallel_mode=regress; > explain analyze select * from dept order by deptno; > > This is because for leader we don't initialize the slot into gm_slots. So > in case where launched worker is zero and table having zero rows, we > end up having NULL slot into gm_slots array. > > Currently gather_merge_clear_slots() clear out the tuple table slots for > each > gather merge input and returns clear slot. In the patch I modified function > gather_merge_clear_slots() to just clear out the tuple table slots and > always return NULL when All the queues and heap us exhausted. > Isn't that just another sign the code might be a bit too confusing? I see two main issues in the code: 1) allocating 'slots' as 'nreaders+1' elements, which seems like a good way to cause off-by-one errors 2) mixing objects with different life spans (slots for readers vs. slot for the leader) seems like a bad idea too I wonder how much we gain by reusing the slot from the leader (I'd be surprised if it was at all measurable). David posted a patch reworking this, and significantly simplifying the GatherMerge node. Why not to accept that? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Mar 28, 2017 at 10:00 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
--
On 03/27/2017 01:40 PM, Rushabh Lathia wrote:
...
I was doing more testing with the patch and I found one more server
crash with the patch around same area, when we forced the gather
merge for the scan having zero rows.
create table dept ( deptno numeric, dname varchar(20);
set parallel_tuple_cost =0;
set parallel_setup_cost =0;
set min_parallel_table_scan_size =0;
set min_parallel_index_scan_size =0;
set force_parallel_mode=regress;
explain analyze select * from dept order by deptno;
This is because for leader we don't initialize the slot into gm_slots. So
in case where launched worker is zero and table having zero rows, we
end up having NULL slot into gm_slots array.
Currently gather_merge_clear_slots() clear out the tuple table slots for each
gather merge input and returns clear slot. In the patch I modified function
gather_merge_clear_slots() to just clear out the tuple table slots and
always return NULL when All the queues and heap us exhausted.
Isn't that just another sign the code might be a bit too confusing? I see two main issues in the code:
1) allocating 'slots' as 'nreaders+1' elements, which seems like a good way to cause off-by-one errors
2) mixing objects with different life spans (slots for readers vs. slot for the leader) seems like a bad idea too
I wonder how much we gain by reusing the slot from the leader (I'd be surprised if it was at all measurable). David posted a patch reworking this, and significantly simplifying the GatherMerge node. Why not to accept that?
I think we all agree that we should get rid of nreaders from GatherMergeState
and need to do some code refactoring. But if I understood correctly, Robert's
concern was that the refactoring should be a separate commit.
I picked up David's patch and started reviewing the changes. I applied that patch
on top of my v2 patch (which does the refactoring of gather_merge_clear_slots()).
In David's patch, in gather_merge_init(), the loop where the tuple array is
allocated needs to run only up to nworkers_launched, because we don't
hold a tuple array for the leader. I changed that and made some other simple
changes based on my v2 patch. I also performed manual testing with
the changes.
Please find attached the refactoring patch, along with the v2 patch submitted
for the server crash fix. (Attaching both patches here again, for ease of access.)
Thanks,
Rushabh Lathia
Attachment
Hi, On 03/28/2017 11:07 AM, Rushabh Lathia wrote: > ... > > I think we all agree that we should get rid of nreaders from the > GatherMergeState and need to do some code re-factor. But if I > understood correctly that Robert's concern was to do that re-factor > as separate commit. > Maybe. It depends on how valuable it is to keep Gather and GatherMerge similar - having nreaders in one and not the other seems a bit weird. But maybe the refactoring would remove it from both nodes? Also, it does not really solve the issue that we're using 'nreaders' or 'nworkers_launched' to access an array with one extra element. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Mar 29, 2017 at 8:38 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Hi,
On 03/28/2017 11:07 AM, Rushabh Lathia wrote:...
I think we all agree that we should get rid of nreaders from the GatherMergeState and need to do some code re-factor. But if I
understood correctly that Robert's concern was to do that re-factor
as separate commit.
Maybe. It depends on how valuable it's to keep Gather and GatherMerge similar - having nreaders in one and not the other seems a bit weird. But maybe the refactoring would remove it from both nodes?
Also, it does not really solve the issue that we're using 'nreaders' or 'nworkers_launched' to access array with one extra element.
With the latest patches, in gather_merge_init() there is one loop where we
initialize the tuple array (which holds the tuples coming from each worker)
and the slot (which holds the current tuple).
We hold the tuple array for each worker because when we read tuples from workers,
it's a good idea to read several at once for efficiency when possible: this minimizes
context-switching overhead.
For the leader we don't initialize the slot, because for the leader we directly call ExecProcNode(),
which returns the slot. Also, there is no point in holding a tuple array for the leader.
With the latest patches, gather_merge_clear_slots() will only clean the tuple table
slots and the tuple array buffers for the workers (nworkers_launched).
The only time we loop through "nworkers_launched + 1" is while reading a tuple
from each gather merge input. Personally I don't see any problem with that,
but I am very much open to suggestions.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Rushabh Lathia
On Tue, Mar 28, 2017 at 11:08 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > Maybe. It depends on how valuable it's to keep Gather and GatherMerge > similar - having nreaders in one and not the other seems a bit weird. But > maybe the refactoring would remove it from both nodes? Yeah, it appears to be the case that both Gather and GatherMerge inevitably end up with nworkers_launched == nreaders, which seems dumb. If we're going to clean this up, I think we should make them both match. > Also, it does not really solve the issue that we're using 'nreaders' or > 'nworkers_launched' to access array with one extra element. I'm not clear that there's any fundamental solution to that problem other than trying to clarify it through comments. Meantime, I think it's not good to leave this crashing, so I pushed Rushabh's v2 patch for the actual crash for now. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company