Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers
From: Thomas Munro
Subject: Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
Msg-id: CAEepm=1SzJ-wWgPMQuDN2tLA_c74FM5GCy3_OWpq_XjQqN35Tg@mail.gmail.com
In response to: Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) (Michael Paquier <michael.paquier@gmail.com>)
List: pgsql-hackers
On Wed, Feb 1, 2017 at 5:37 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Tue, Jan 31, 2017 at 2:15 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> On Mon, Jan 30, 2017 at 8:46 PM, Thomas Munro
>> <thomas.munro@enterprisedb.com> wrote:
>>> On Wed, Jan 4, 2017 at 12:53 PM, Peter Geoghegan <pg@heroku.com> wrote:
>>>> Attached is V7 of the patch.
>>>
>>> I am doing some testing. First, some superficial things from first pass:
>>>
>>> [Various minor cosmetic issues]
>>
>> Oops.
>
> As this review is very recent, I have moved the patch to CF 2017-03.

 ParallelContext *
-CreateParallelContext(parallel_worker_main_type entrypoint, int nworkers)
+CreateParallelContext(parallel_worker_main_type entrypoint, int nworkers,
+                      bool serializable_okay)
 {
     MemoryContext oldcontext;
     ParallelContext *pcxt;
@@ -143,7 +144,7 @@ CreateParallelContext(parallel_worker_main_type entrypoint, int nworkers)
      * workers, at least not until somebody enhances that mechanism to be
      * parallel-aware.
      */
-    if (IsolationIsSerializable())
+    if (IsolationIsSerializable() && !serializable_okay)
         nworkers = 0;

That's a bit weird but I can't think of a problem with it.  Workers run with MySerializableXact == InvalidSerializableXact, even though they may have the snapshot of a SERIALIZABLE leader.  Hopefully soon the restriction on SERIALIZABLE in parallel queries can be lifted anyway, and then this could be removed.

Here are some thoughts on the overall approach.  Disclaimer: I haven't researched the state of the art in parallel sort or btree builds, but I gather from general reading that there are a couple of well known approaches, and I'm sure you'll correct me if I'm off base here.

1.  All participants: parallel sequential scan, repartition on the fly so each worker has tuples in a non-overlapping range, sort, build disjoint btrees; barrier; leader: merge disjoint btrees into one.

2.  All participants: parallel sequential scan, sort, spool to disk; barrier; leader: merge spooled tuples and build btree.
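[For readers following along, a toy sketch of approach 2's shape, in plain Python rather than PostgreSQL's C; the function and the round-robin "scan" are invented for illustration only.  The real patch spools sorted runs to disk via tuplesort tapes; here each "worker" just produces an in-memory sorted run and the leader does the k-way merge.]

```python
import heapq

# Hypothetical miniature of approach 2: each worker sorts its share of the
# scanned tuples into a run; after a barrier (implicit here, since all runs
# exist before we merge), the leader k-way-merges the runs into final order.
def leader_merge_sort(tuples, nworkers):
    # Stand-in for the parallel sequential scan: deal tuples out round-robin.
    runs = [sorted(tuples[i::nworkers]) for i in range(nworkers)]
    # Leader: heapq.merge is a lazy k-way merge of already-sorted inputs,
    # analogous to merging spooled runs while building the btree.
    return list(heapq.merge(*runs))

print(leader_merge_sort([5, 3, 8, 1, 9, 2, 7], 3))  # [1, 2, 3, 5, 7, 8, 9]
```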
This patch is doing the 2nd thing.  My understanding is that some systems might choose to do that if they don't have or don't like the table's statistics, since repartitioning for balanced load requires carefully chosen ranges and is highly sensitive to distribution problems.

It's pretty clear that approach 1 is a difficult project.  From my research into dynamic repartitioning in the context of hash joins, I can see that that infrastructure is a significant project in its own right: subproblems include super efficient tuple exchange, buffering, statistics/planning and dealing with/adapting to bad outcomes.  I also suspect that repartitioning operators might need to be specialised for different purposes like sorting vs hash joins, which may have differing goals.  I think it's probably easy to build a slow dynamic repartitioning mechanism that frequently results in terrible worst case scenarios where you paid a fortune in IPC overheads and still finished up with one worker pulling most of the whole load.

Without range partitioning, I don't believe you can merge the resulting non-disjoint btrees efficiently, so you'd probably finish up writing a complete new btree to mash them together.  As for merging disjoint btrees, I assume there are ways to do a structure-preserving merge that just rebuilds some internal pages and incorporates the existing leaf pages directly, a bit like tree manipulation in functional programming languages; that'll take some doing.

So I'm in favour of this patch, which is relatively simple and gives us faster index builds soon.  Eventually we might also be able to have approach 1.  From what I gather, it's entirely possible that we might still need 2 to fall back on in some cases.

Will you move the BufFile changes to a separate patch in the next revision?

Still testing and reviewing, more soon.

--
Thomas Munro
http://www.enterprisedb.com
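[To make the splitter-sensitivity point above concrete, here is a hypothetical sketch, again in plain Python with invented names, of the repartitioning step approach 1 would need.  Disjoint ranges mean per-worker sorted partitions concatenate directly into global order, which is what makes the resulting btrees cheap to merge; but splitters chosen without regard to the data's distribution can leave most of the load on one worker.]

```python
import bisect
import random

# Route each tuple to the worker owning its key range, given sorted splitters.
# Hypothetical helper, not from the patch.
def range_partition(tuples, splitters):
    parts = [[] for _ in range(len(splitters) + 1)]
    for t in tuples:
        parts[bisect.bisect_right(splitters, t)].append(t)
    return parts

random.seed(42)
# A skewed distribution: keys cluster near 0 and near 1000, nothing between.
skewed = [random.choice((0, 1000)) + random.random() for _ in range(10000)]

# Evenly spaced splitters ignore the distribution: two partitions get
# (almost) everything, two get (almost) nothing.
naive = range_partition(skewed, [250.0, 500.0, 750.0])
# Splitters sampled from the data itself track its distribution better.
sampled = range_partition(skewed, sorted(random.sample(skewed, 3)))
print([len(p) for p in naive])
print([len(p) for p in sampled])

# Because the ranges are disjoint, sorting each partition independently and
# concatenating yields global order -- the property that lets disjoint
# per-worker btrees be merged without a full rebuild.
merged = [t for part in naive for t in sorted(part)]
assert merged == sorted(skewed)
```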