Re: Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: Parallel tuplesort (for parallel B-Tree index creation) |
Date | |
Msg-id | CAM3SWZT54zyKE+JSch9RN6u_0BrBoTKk3Vr8FKpvFM+GyLK2jA@mail.gmail.com Whole thread Raw |
In response to | Re: Parallel tuplesort (for parallel B-Tree index creation) (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: Parallel tuplesort (for parallel B-Tree index creation)
Re: Parallel tuplesort (for parallel B-Tree index creation) |
List | pgsql-hackers |
On Wed, Oct 19, 2016 at 7:39 AM, Robert Haas <robertmhaas@gmail.com> wrote: > Gather Merge can't emit a tuple unless it has buffered at least one > tuple from every producer; otherwise, the next tuple it receives from > one of those producers might proceed whichever tuple it chooses to > emit. However, it doesn't need to wait until all of the workers are > completely done. The leader only needs to be at least slightly ahead > of the slowest worker. I'm not sure how that compares to Peter's > approach. I don't think that eager merging will prove all that effective, however it's implemented. I see a very I/O bound system when parallel CREATE INDEX merges serially. There is no obvious reason why you'd have a straggler worker process with CREATE INDEX, really. > What I'm worried about is that we're implementing two separate systems > to do the same thing, and that the parallel sort approach is actually > a lot less general. I think it's possible to imagine a Parallel Sort > implementation which does things Gather Merge can't. If all of the > workers collaborate to sort all of the data rather than each worker > sorting its own data, then you've got something which Gather Merge > can't match. But this is not that. It's not that yet, certainly. I think I've sketched a path forward for making partitioning a part of logtape.c that is promising. The sharing of ranges within tapes and so on will probably have a significant amount in common with what I've come up with. I don't think that any executor infrastructure is a particularly good model when *batch output* is needed -- the tuple queue mechanism will be a significant bottleneck, particularly because it does not integrate read-ahead, etc. The best case that I saw advertised for Gather Merge was TPC-H query 9 [1]. That doesn't look like a good proxy for how Gather Merge adapted to parallel CREATE INDEX would do, since it benefits from the GroupAggregate merge having many equal values, possibly with a clustering in the original tables that can naturally be exploited (no TID tiebreaker needed, since IndexTuples are not being merged). Also, it looks like Gather Merge may do that well by enabling things, rather than parallelizing the sort effectively per se. Besides, the query 9 case is significantly less scalable than good cases for this parallel CREATE INDEX patch have already been shown to be. I think I've been pretty modest about what this parallel CREATE INDEX patch gets us from the beginning. It is a generalization of tuplesort.c to work in parallel; we need a lot more for that to make things like GroupAggregate as scalable as possible, and I don't pretend that this helps much with that. There are actually more changes to nbtsort.c to coordinate all of this than there are to tuplesort.c in the latest version, so I think that this simpler approach for parallel CREATE INDEX and CLUSTER is worthwhile. The bottom line is that it's inherently difficult for me to refute the idea that Gather Merge could do just as well as what I have here, because proving that involves adding a significant amount of new infrastructure (e.g., to teach the executor about IndexTuples). I think that the argument for this basic approach is sound (it appears to offer comparable scalability to the parallel CREATE INDEX implementations of other systems), but it's simply impractical for me to offer much reassurance beyond that. [1] https://github.com/tvondra/pg_tpch/blob/master/dss/templates/9.sql -- Peter Geoghegan
pgsql-hackers by date: