Thread: Parallel tuplesort, partitioning, merging, and the future

Parallel tuplesort, partitioning, merging, and the future

From

Peter Geoghegan

Date:

08 August 2016, 19:44:07

Over on the "Parallel tuplesort (for parallel B-Tree index creation)"
thread [1], there has been some discussion of merging vs.
partitioning. There is a concern about the fact the merge of the
tuplesort used to build a B-Tree is not itself parallelized. There is
a weak consensus that we'd be better off with partitioning rather than
merging anyway.

I've started this new thread to discuss where we want to be with
partitioning (and further parallel merging). In short: How can we
maximize the benefit of parallel tuplesort?

I've said or at least implied that I favor keeping the CREATE INDEX
merge serial, while pursuing partitioning (for other uses) at a later
date. Further parallelization is useful for parallel query much more
so than parallel CREATE INDEX.

To summarize, I favor not parallelizing merging because:

* The scalability of parallel CREATE INDEX as implemented is
apparently comparable to the scalability of similar features in other
systems [2] (systems with very mature implementations). ISTM that we
cannot really expect to do too much better than that.

* We already do some merging in parallel to get runs for the leader to
merge in serial, if and only if workers each produce more than one
initial run (not unlikely). That's still pretty effective, and ensures
that the final leader merge is cache efficient. I think that I'll
probably end up changing the patch to share maintenance_work_mem among
workers (not have it be per-worker), making it common for workers to
merge, while the leader merge ends up needed fairly few rounds of
preloading to merge the final output.

* Even if it helps a bit, parallel merging is not the future;
partitioning is (more on this later).

I don't think partitioning is urgent for CREATE INDEX, and may be
inappropriate for CREATE INDEX under any circumstances, because:

* Possible problems with parallel infrastructure and writes.

Can the parallel infrastructure be even made to write and WAL log an
index in parallel, too? Partitioning more or less forces this, at
least if you expect any benefit at all -- reading the concatenated
output files in a serial pass where the index is written seems likely
to hurt more than it helps. This is not much of an advantage when
you're likely to be I/O bound anyway, and when the tuple startup cost
*per worker* is not of particular concern.

* Unbalanced B-Trees (or the risk thereof).

This is the really big one for me. We can only write leaf pages out in
parallel; we have to "merge together sub B-Trees" to produce internal
pages that make up a finished B-Tree. That risks creating a set of
internal pages that reflect a skew that any partition-to-sort approach
happened to have, due to whatever problem might emerge in the choice
of separators in tuplesort.c (this is a documented issue with SQL
Server Enterprise Edition, in fact). Can we tolerate the risk of
ending up with an unbalanced final B-Tree in the event of a
misestimation? Seems like that's quite a lot worse than poor query
performance (due to poor load balancing among parallel workers). DBAs
are more or less used to the latter variety of problems. They are not
used to the former, especially because the problem might go unnoticed
for a long time.

* What I've come up with is minimally divergent from the existing
approach to tuplesorting.

This is useful because of the code complexity of having multiple
workers consuming final output (writing an index) in parallel looks to
be fairly high. Consider B-Tree related features that tuplesort must
care about today, which have further special considerations in a world
where this simple "consume all output at once in one process" model is
replaced for parallel CREATE INDEX. You'd get all kinds of subtle
problems at various boundaries in the final keyspace for things like
CREATE UNIQUE INDEX CONCURRENTLY. That seems like something that I'd
be quite happy to avoid worrying about (while still allowing parallel
CREATE UNIQUE INDEX CONCURRENTLY). Note that some other systems don't
allow the equivalent of parallel CREATE INDEX CONCURRENTLY at all,
presumably because of similar concerns. My patch supports CIC, and
does so almost automatically (although the second pass isn't performed
in parallel).

I bet that CREATE INDEX won't be the last case where that simplicity
and generality clearly matters.

Suggested partitioning algorithm
================================

I think a hybrid partitioning + merging approach would work well for
us. The paper "Parallel Sorting on a Shared-Nothing Architecture using
Probabilistic Splitting" [3] has influenced my thinking here (this was
written by prominent researchers from the influential UW-Madison
Wisconsin database group). Currently, I have in mind something that is
closer to what they call exact splitting to what they call
probabilistic splitting, because I don't think it's going to be
generally possible to have good statistics on partition boundaries
immediately available (e.g., through something like their
probabilistic splitting sampling the relation ahead of time).

The basic idea I have in mind is that we create runs in workers in the
same way that the parallel CREATE INDEX patch does (one output run per
worker). However, rather than merging in the leader, we use a
splitting algorithm to determine partition boundaries on-the-fly. The
logical tape stuff then does a series of binary searches to find those
exact split points within each worker's "final" tape. Each worker
reports the boundary points of its original materialized output run in
shared memory. Then, the leader instructs workers to "redistribute"
slices of their final runs among each other, by changing the tapeset
metadata to reflect that each worker has nworker input tapes with
redrawn offsets into a unified BufFile. Workers immediately begin
their own private on-the-fly merges.

What's really nice about this is that not long after the initial
generation of nworker runs, which has already been shown to be very
scalable, workers can each *individually* start returning output
tuples immediately (the first tuple can be returned by each on-the-fly
merge just after the drawing of boundaries). There is almost no serial
bottleneck that they block on. I think that this can be more important
than improving sort throughput per se.

There'd probably also be an interface for callers to tell tuplesorts
what their splitter boundaries are directly, and an interface to
retrieve details of what splitter boundaries a tuplesort has
determined for itself (in order to have two sort nodes with matching
boundaries, as with a parallel merge join).

Clearly it's really hard to be sure that this is the right thing at
this point, but my intuition is that this is the way to go (while
avoiding anything like this for CREATE INDEX). I'd like to know how
others feel about it.

[1]
https://www.postgresql.org/message-id/flat/CAM3SWZQKM=Pzc=CAHzRixKjp2eO5Q0Jg1SoFQqeXFQ647JiwqQ@mail.gmail.com#CAM3SWZQKM=Pzc=CAHzRixKjp2eO5Q0Jg1SoFQqeXFQ647JiwqQ@mail.gmail.com
[2] https://www.postgresql.org/message-id/CA%2BTgmoY5JYs4R1g_ZJ-P6SkULSb19xx4zUh7S8LJiXonCgVTuQ%40mail.gmail.com
[3] http://pages.cs.wisc.edu/~dewitt/includes/paralleldb/parsort.pdf
--
Peter Geoghegan

Re: Parallel tuplesort, partitioning, merging, and the future

From

Robert Haas

Date:

10 August 2016, 18:59:09

On Mon, Aug 8, 2016 at 3:44 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I don't think partitioning is urgent for CREATE INDEX, and may be
> inappropriate for CREATE INDEX under any circumstances, because:
>
> * Possible problems with parallel infrastructure and writes.
> * Unbalanced B-Trees (or the risk thereof).
> * What I've come up with is minimally divergent from the existing
> approach to tuplesorting.

My view on this - currently anyway - is that we shouldn't conflate the
tuplesort with the subsequent index generation, but that we should try
to use parallelism within the tuplesort itself to the greatest extent
possible.  If there is a single output stream that the leader uses to
generate the final index, then none of the above problems arise.  They
only arise if you've got multiple processes actually writing to the
index.

> Suggested partitioning algorithm
> ================================
>
> I think a hybrid partitioning + merging approach would work well for
> us. The paper "Parallel Sorting on a Shared-Nothing Architecture using
> Probabilistic Splitting" [3] has influenced my thinking here (this was
> written by prominent researchers from the influential UW-Madison
> Wisconsin database group). Currently, I have in mind something that is
> closer to what they call exact splitting to what they call
> probabilistic splitting, because I don't think it's going to be
> generally possible to have good statistics on partition boundaries
> immediately available (e.g., through something like their
> probabilistic splitting sampling the relation ahead of time).
>
> The basic idea I have in mind is that we create runs in workers in the
> same way that the parallel CREATE INDEX patch does (one output run per
> worker). However, rather than merging in the leader, we use a
> splitting algorithm to determine partition boundaries on-the-fly. The
> logical tape stuff then does a series of binary searches to find those
> exact split points within each worker's "final" tape. Each worker
> reports the boundary points of its original materialized output run in
> shared memory. Then, the leader instructs workers to "redistribute"
> slices of their final runs among each other, by changing the tapeset
> metadata to reflect that each worker has nworker input tapes with
> redrawn offsets into a unified BufFile. Workers immediately begin
> their own private on-the-fly merges.

Yeah, this is pretty cool.  You end up with the final merge segmented
into N submerges producing nonoverlapping ranges.  So you could have
the leader perform submerge 0 itself, and while it's doing that the
other workers can perform submerges 1..N.  By the time  the leader
finishes submerge 0, the remaining submerges will likely be complete
and after that the leader can just read the outputs of those submerges
one after another and it has basically no additional work to do.

It might be a good idea to divide the work into a number of submerges
substantially greater than the number of workers.  For example,
suppose we expect between 1 and 4 workers, but we partition the work
into 64 submerges.  The leader claims submerge 0, which is only 1/64
of the total.  By the time it finishes consuming those tuples,
submerge 1 will likely be done.  Hopefully, even if there are only 1
or 2 workers, they can keep ahead of the leader so that very little of
the merging happens in the leader.  Also, if some submerges go faster
than others, the distribution of work among workers remains even,
because the ones that go quicker will handle more of the submerges and
the ones that go slower will handle fewer.

I think that last part is a very important property; my intuition is
that dividing up the work between cooperating processes in a way that
should come out equal will often fail to do so, either due to the
operating system scheduler or due to some data being cached and other
data not being cached or due to the comparator running faster on some
data than other data or due to NUMA effects that make some processes
run faster than others or due to any number of other causes.  So I
think that algorithms that allocate the work dynamically are going to
greatly outperform those that use a division of labor which is fixed
at the beginning of a computation phase.

> Clearly it's really hard to be sure that this is the right thing at
> this point, but my intuition is that this is the way to go (while
> avoiding anything like this for CREATE INDEX). I'd like to know how
> others feel about it.

The number of others weighing in on these topics is surely less than
either of us would like, but hopefully we can find a way to make
progress anyhow.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Parallel tuplesort, partitioning, merging, and the future

From

Claudio Freire

Date:

10 August 2016, 19:08:13

On Mon, Aug 8, 2016 at 4:44 PM, Peter Geoghegan <pg@heroku.com> wrote:
> The basic idea I have in mind is that we create runs in workers in the
> same way that the parallel CREATE INDEX patch does (one output run per
> worker). However, rather than merging in the leader, we use a
> splitting algorithm to determine partition boundaries on-the-fly. The
> logical tape stuff then does a series of binary searches to find those
> exact split points within each worker's "final" tape. Each worker
> reports the boundary points of its original materialized output run in
> shared memory. Then, the leader instructs workers to "redistribute"
> slices of their final runs among each other, by changing the tapeset
> metadata to reflect that each worker has nworker input tapes with
> redrawn offsets into a unified BufFile. Workers immediately begin
> their own private on-the-fly merges.

I think it's a great design, but for that, per-worker final tapes have
to always be random-access.

I'm not hugely familiar with the code, but IIUC there's some penalty
to making them random-access right?

Re: Parallel tuplesort, partitioning, merging, and the future

From

Peter Geoghegan

Date:

10 August 2016, 20:54:43

On Wed, Aug 10, 2016 at 11:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> My view on this - currently anyway - is that we shouldn't conflate the
> tuplesort with the subsequent index generation, but that we should try
> to use parallelism within the tuplesort itself to the greatest extent
> possible.  If there is a single output stream that the leader uses to
> generate the final index, then none of the above problems arise.  They
> only arise if you've got multiple processes actually writing to the
> index.

I'm not sure if you're agreeing with my contention about parallel
CREATE INDEX not being a good target for partitioning here. Are you?

You can get some idea of how much a separate pass over the
concatenated outputs would hurt by using the test GUC with my patch
applied (the one that artificially forces randomAccess by B-Tree
tuplesort callers).

>> Suggested partitioning algorithm
>> ================================

>> The basic idea I have in mind is that we create runs in workers in the
>> same way that the parallel CREATE INDEX patch does (one output run per
>> worker). However, rather than merging in the leader, we use a
>> splitting algorithm to determine partition boundaries on-the-fly. The
>> logical tape stuff then does a series of binary searches to find those
>> exact split points within each worker's "final" tape. Each worker
>> reports the boundary points of its original materialized output run in
>> shared memory. Then, the leader instructs workers to "redistribute"
>> slices of their final runs among each other, by changing the tapeset
>> metadata to reflect that each worker has nworker input tapes with
>> redrawn offsets into a unified BufFile. Workers immediately begin
>> their own private on-the-fly merges.
>
> Yeah, this is pretty cool.  You end up with the final merge segmented
> into N submerges producing nonoverlapping ranges.  So you could have
> the leader perform submerge 0 itself, and while it's doing that the
> other workers can perform submerges 1..N.  By the time  the leader
> finishes submerge 0, the remaining submerges will likely be complete
> and after that the leader can just read the outputs of those submerges
> one after another and it has basically no additional work to do.

Again, I'm a little puzzled by your remarks here. Surely the really
great case for parallel sort with partitioning is the case where there
remains minimal further IPC between workers? So, while "the leader can
just read the outputs of those submerges", ideally it will be reading
as little as possible from workers. For example, it's ideal when the
workers were able to determine that their particular range in the
parallel merge join has very few tuples to return, having also
"synchronized" their range within two underlying relations
(importantly, the merge join "synchronization" can begin per worker
when the tuplesort.c on-the-fly merge begins and returns its first
tuple -- that is, it can begin very soon).

In short, partitioning when sorting is as much about avoiding a serial
dependency for the entire query tree as it is about taking advantage
of available CPU cores and disk spindles. That is my understanding, at
any rate.

While all this speculation about choice of algorithm is fun,
realistically I'm not gong to write the patch for a rainy day (nor for
parallel CREATE INDEX, at least until we become very comfortable with
all the issues I raise, which could never happen). I'd be happy to
consider helping you improve parallel query by providing
infrastructure like this, but I need someone else to write the client
of the infrastructure (e.g. a parallel merge join patch), or to at
least agree to meet me half way with an interdependent prototype of
their own. It's going to be messy, and we'll have to do a bit of
stumbling to get to a good place. I can sign up to that if I'm not the
only one that has to stumble.

Remember how I said we should work on the merging bottleneck
indirectly? I'm currently experimenting with having merging use
sifting down to replace the root in the heap. This is very loosely
based on the Jeremy Harris patch from 2014, I suppose. Anyway, this
can be far, far faster, with perhaps none of the downsides that we saw
in the context of building an initial replacement selection heap,
because we have more control of the distribution of input (tapes have
sorted tuples), and because this heap is so tiny and cache efficient
to begin with. This does really well in the event of clustering of
values, which is a common case, but also helps with totally random
initially input.

I need to do some more research before posting a patch, but right now
I can see that it makes merging presorted numeric values more than 2x
faster. And that's with 8 tapes, on my very I/O bound laptop. I bet
that the benefits would also be large for text (temporal locality is
improved, and so strcoll() comparison caching is more effective).
Serial merging still needs work, it seems.

-- 
Peter Geoghegan

Re: Parallel tuplesort, partitioning, merging, and the future

From

Peter Geoghegan

Date:

10 August 2016, 21:14:05

On Wed, Aug 10, 2016 at 12:08 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> I think it's a great design, but for that, per-worker final tapes have
> to always be random-access.

Thanks. I don't think I need to live with the randomAccess
restriction, because I can be clever about reading only the first
tuple on each logtape.c block initially. Much later, when the binary
search gets down to seeking within a single block, everything in the
block can be read at once into memory, and we can take the binary
search to that other representation. This latter part only needs to
happen once or twice per partition boundary per worker.

> I'm not hugely familiar with the code, but IIUC there's some penalty
> to making them random-access right?

Yeah, there is. For one thing, you have to store the length of the
tuple twice, to support incremental seeking in both directions. For
another, you cannot perform the final merge on-the-fly; you must
produce a serialized tape as output, which is used subsequently to
support random seeks. There is no penalty when you manage to do the
sort in memory, though (not that that has anything to do with parallel
sort).

-- 
Peter Geoghegan

Re: Parallel tuplesort, partitioning, merging, and the future

From

Peter Geoghegan

Date:

10 August 2016, 21:49:58

On Wed, Aug 10, 2016 at 11:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I think that last part is a very important property; my intuition is
> that dividing up the work between cooperating processes in a way that
> should come out equal will often fail to do so, either due to the
> operating system scheduler or due to some data being cached and other
> data not being cached or due to the comparator running faster on some
> data than other data or due to NUMA effects that make some processes
> run faster than others or due to any number of other causes.  So I
> think that algorithms that allocate the work dynamically are going to
> greatly outperform those that use a division of labor which is fixed
> at the beginning of a computation phase.

I agree that dynamic sampling has big advantages. Our Quicksort
implementation does dynamic sampling, of course.

You need to be strict about partition boundaries: they may not be
drawn at a point in the key space that is not precisely defined, and
in general there can be no ambiguity about what bucket a tuple can end
up in ahead of time. In other words, you cannot carelessly allow equal
tuples to go on either side of an equal boundary key.

The reason for this restriction is that otherwise, stuff breaks when
you later attempt to "align" boundaries across sort operations that
are performed in parallel. I don't think you can introduce an
artificial B-Tree style tie-breaker condition to avoid the problem,
because that will slow things right down (B&M Quicksort does really
well with many equal keys).

When you have one really common value, load balancing for partitioning
just isn't going to do very well. My point is that there will be a
somewhat unpleasant worst case that will need to be accepted. It's not
practical to go to the trouble of preventing it entirely. So, the
comparison with quicksort works on a couple of levels.

-- 
Peter Geoghegan

Re: Parallel tuplesort, partitioning, merging, and the future

From

Robert Haas

Date:

12 August 2016, 19:22:29

On Wed, Aug 10, 2016 at 4:54 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Aug 10, 2016 at 11:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> My view on this - currently anyway - is that we shouldn't conflate the
>> tuplesort with the subsequent index generation, but that we should try
>> to use parallelism within the tuplesort itself to the greatest extent
>> possible.  If there is a single output stream that the leader uses to
>> generate the final index, then none of the above problems arise.  They
>> only arise if you've got multiple processes actually writing to the
>> index.
>
> I'm not sure if you're agreeing with my contention about parallel
> CREATE INDEX not being a good target for partitioning here. Are you?

No.  I agree that writing to the index in parallel is bad, but I think
it's entirely reasonable to try to set things up so that the leader
does as little of the final merge work itself as possible, instead
offloading that to workers.  Unless, of course, we can prove that the
overhead of the final merge pass is so low that it doesn't matter
whether we offload it.

> While all this speculation about choice of algorithm is fun,
> realistically I'm not gong to write the patch for a rainy day (nor for
> parallel CREATE INDEX, at least until we become very comfortable with
> all the issues I raise, which could never happen). I'd be happy to
> consider helping you improve parallel query by providing
> infrastructure like this, but I need someone else to write the client
> of the infrastructure (e.g. a parallel merge join patch), or to at
> least agree to meet me half way with an interdependent prototype of
> their own. It's going to be messy, and we'll have to do a bit of
> stumbling to get to a good place. I can sign up to that if I'm not the
> only one that has to stumble.

Fair enough.

> Serial merging still needs work, it seems.

At the risk of stating the obvious, improving serial execution
performance is always superior to comparable gains originating from
parallelism, so no complaints here about work in that area.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company