Thread: Disable parallel query by default

Disable parallel query by default

From
"Scott Mead"
Date:
Hello Hackers, 

Over the last 24 months, I've noticed a pattern amongst users with unexpected plan flips landing on parallel plans.

77cd477 (9.6 beta) defaulted parallel query on (max_parallel_degree = 2), it's been nine years and I'd like to open the
discussionto see what our thoughts are.  Especially since it seems that the decision was made for 9.6 beta testing, and
neverreally revisited. 
 

I'll open by proposing that we prevent the planner from automatically selecting parallel plans by default, opting
insteadto allow users to set their max_parallel_workers_per_gather as needed.  IOW: lets make the default
max_parallel_workers_per_gather=0for V18 forward.
 

Just to be clear, my concern isn't with parallel query in general, the issue we see is when high-frequency, low-latency
queriesstart executing with parallelism on their own (i.e. the traditional plan flip with a twist).  Given that
max_parallel_workers_per_gatheris dynamic and easily configured per session (or even per query with something like the
pg_hint_planextension), dissuading the planner from opting in to parallelism by default will contain the fallout that
wesee when plans flip to parallel execution.
 

What is the fallout?  When a high-volume, low-latency query flips to parallel execution on a busy system, we end up in
asituation where the database is effectively DDOSing itself with a very high rate of connection establish and tear-down
requests. Even if the query ends up being faster (it generally does not), the CPU requirements for the same workload
rapidlydouble or worse, with most of it being spent in the OS (context switch, fork(), destroy()).  When looking at the
database,you'll see a high load average, and high wait for CPU with very little actual work being done within the
database. 
 

For an example of scale, we have seen users with low connection rates (<= 5 / minute) suddenly spike to between 2000
and3000 connect requests per minute until the system grinds to a halt.  
 

I'm looking forward to the upcoming monitoring in e7a9496 (Add two attributes to pg_stat_database for parallel workers
activity),it will be easier to empirically prove that parallel query is being used.  I don't think the patch goes far
enoughthough, we really need the ability to pinpoint the query and the specific variables used that triggered the
parallelplan.  When we tell a user that parallel query is in-use and suspected, it is almost always met with "no, we
don'tuse that feature".  Users do not even realize that it's happening and quickly ask for a list of all queries that
haveever undergone parallel execution.  It's pretty much impossible to give an accurate list of these because there is
noinstrumentation available (even in the new patch) to get to the per-query level.
 

When a user says "I'm not using parallel query" we have to walk through circumstantial evidence of its use.  I
typicallycombine IPC:BgWorkerShutDown, IPC:ParallelFinish, IO:DataFileRead (this helps nail it for sequential scans)
witha high rate of connection establishment.  When you look at all of these together, it still hard to see that
parallelismis the cause, but when we disable automated plan selection, system stability returns.
 

The recommendation that I give to users is pretty straightforward: "Disable automatic parallel query, enable it for
querieswhere you find substantial savings and can control the rate of execution."  I always tell users that if they're
usingparallel query for anything that should execute in less than 5 minutes, they're probably pushing on the wrong
tuningstrategy as the load induced by the parallel query infrastructure is likely going to negate the savings that
they'regetting.  
 

I'm curious to hear what others think of this proposal.  I've dealt with so many of these over the last 24 months, most
ofthem causing strife along the way, that I'm interested in what others think.  
 

--
Scott Mead
Amazon Web Services
scott@meads.us

Note: When testing the attached patch, there are failures in misc_sanity.out and misc_functions.out (replication origin
nameis too long).  I assume these are unrelated to my attached patch. 
Attachment

Re: Disable parallel query by default

From
Greg Sabino Mullane
Date:
On Tue, May 13, 2025 at 4:37 PM Scott Mead <scott@meads.us> wrote:
I'll open by proposing that we prevent the planner from automatically selecting parallel plans by default

That seems a pretty heavy hammer, when we have things like parallel_setup_cost that should be tweaked first.

The recommendation that I give to users is pretty straightforward: "Disable automatic parallel query, enable it for queries where you find substantial savings and can control the rate of execution."  I always tell users that if they're using parallel query for anything that should execute in less than 5 minutes, they're probably pushing on the wrong tuning strategy as the load induced by the parallel query infrastructure is likely going to negate the savings that they're getting.

Five minutes?! That's not been my experience. Not claiming parallelism is perfect yet, but there are plenty of parallel performance savings under the five minute mark.

Cheers,
Greg

--
Enterprise Postgres Software Products & Tech Support

Re: Disable parallel query by default

From
Laurenz Albe
Date:
On Tue, 2025-05-13 at 17:53 -0400, Scott Mead wrote:
> On Tue, May 13, 2025, at 5:07 PM, Greg Sabino Mullane wrote:
> > On Tue, May 13, 2025 at 4:37 PM Scott Mead <scott@meads.us> wrote:
> > > I'll open by proposing that we prevent the planner from automatically
> > > selecting parallel plans by default
> >
> > That seems a pretty heavy hammer, when we have things like
> > parallel_setup_cost that should be tweaked first.
>
> I agree it's a big hammer and I thought through parallel_setup_cost
> quite a bit myself.  The problem with parallel_setup_cost is that it
> doesn't actually represent the overhead of a setting up parallel
> query for a busy system.  It does define the cost of setup for a
> *single* parallel session, but it cannot accurately express the
> cost of CPU and other overhead associated with the second, third,
> fourth, etc... query that is executed as parallel.  The expense to
> the operating system is a function of the _rate_ of parallel query
> executions being issued.  Without new infrastructure, there's no way
> to define something that will give me a true representation of the
> cost of issuing a query with parallelism.

There is no way for the optimizer to represent that your system is
under CPU overload currently.  But I agree with Greg that
parallel_setup_cost is the setting that should be adjusted.
If PostgreSQL is more reluctant to even start considering a parallel plan,
that would be a move in the right direction in a case like this:

> > > What is the fallout?  When a high-volume, low-latency query flips to
> > > parallel execution on a busy system, we end up in a situation where
> > > the database is effectively DDOSing itself with a very high rate of
> > > connection establish and tear-down requests.  Even if the query ends
> > > up being faster (it generally does not), the CPU requirements for the
> > > same workload rapidly double or worse, with most of it being spent
> > > in the OS (context switch, fork(), destroy()).  When looking at the
> > > database, you'll see a high load average, and high wait for CPU with
> > > very little actual work being done within the database.

You are painting a bleak picture indeed.  I get to see PostgreSQL databases
in trouble regularly, but I have not seen anything like what you describe.
If a rather cheap, very frequent query is suddenly estimated to be
expensive enough to warrant a parallel plan, I'd suspect that the estimates
must be seriously off.

With an argument like that, you may as well disable nested loop joins.
I have seen enough cases where disabling nested loop joins, without any
deeper analysis, made very slow queries reasonably fast.

Sure enough, I often see systems where I recommend disabling parallel
query - in fact, whenever throughput is more important than response time.
But I also see many cases where parallel query works just like it should
and leads to a better user experience.

I have come to disable JIT by default, but not parallel query.

The primary problem that I encounter with parallel query is that dynamic
shared memory segments grow to a size where they cause OOM errors.
That's the most frequent reason for me to recommend disabling parallel query.

Yours,
Laurenz Albe