Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) |
Date | |
Msg-id | CAH2-Wz=rRMRGTwNZ0qAn7jwZ5pp0HDFsJ5WdKpr+5gS=0FY03w@mail.gmail.com |
In response to | Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) (Robert Haas <robertmhaas@gmail.com>) |
Responses | Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) |
List | pgsql-hackers |
On Fri, Jan 12, 2018 at 10:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> More comments:

Attached patch has all open issues worked through, including those that I respond to or comment on below, as well as the other feedback from your previous e-mails. Note also that I fixed the issue that Amit raised, as well as the GetOldestXmin()-argument bug that I noticed in passing when responding to Amit. I also worked on the attribution in the commit message.

Before getting to my responses to your most recent round of feedback, I want to first talk about some refactoring that I decided to do. As you can see from the master branch, tuplesort_performsort() isn't necessarily reached for spool2, even when we start out with a spool2 (that is, for many unique index builds, spool2 never even does a tuplesort_performsort()). We may instead decide to shut down spool2 when it has no (dead) tuples. I made this work just as well for the parallel case in this latest revision. I had to teach tuplesort.c to accept an early tuplesort_end() for LEADER() -- it had to be prepared to release still-waiting workers in some cases, rather than depending on nbtsort.c having called tuplesort_performsort() already. Several routines within nbtsort.c that previously knew something about parallelism now know nothing about it. This seems like a nice win.

Separately, I took advantage of the fact that within the leader, its *worker* Tuplesortstate can safely call tuplesort_end() before the leader state's tuplesort_performsort() call. The overall effect of these two changes is that there is now a _bt_leader_heapscan() call for the parallel case that nicely mirrors the serial case's IndexBuildHeapScan() call, and once we're done with populating spools, no subsequent code needs to know a single thing about parallelism as a special case. You may notice some small changes to the tuplesort.h overview, which now advertises that callers can take advantage of this leeway.

Now on to my responses to your most recent round of feedback...

> BufFileView() looks fairly pointless. It basically creates a copy of
> the input and, in so doing, destroys the input, which is a lot like
> returning the input parameter except that it uses more cycles. It
> does do a few things.

While it certainly did occur to me that that was kind of weird, and I struggled with it on my own for a little while, I ultimately agreed with Thomas that it added something to have ltsConcatWorkerTapes() call some buffile function in every iteration of its loop. (BufFileView() + BufFileViewAppend() are code that Thomas actually wrote, though I added the asserts and comments myself.)

If you think about this in terms of the interface rather than the implementation, then it may make more sense. The encapsulation adds something which might pay off later, such as when extendBufFile() needs to work with a concatenated set of BufFiles. And even right now, I cannot simply reuse the BufFile without then losing the assert that is currently in BufFileViewAppend() (the must-not-have-associated-shared-fileset assert). So I'd end up asserting less (rather than more) there if BufFileView() was removed.

It wastes some cycles to not simply use the BufFile directly, but not terribly many in the grand scheme of things. This happens once per external sort operation.
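To illustrate the shape of that interface, here is a toy standalone model only -- the ToyBufFile type and the toy_view()/toy_view_append() names are made up for the example and are not the buffile.c API, which operates on temp file segments and fileset state rather than a list of names:

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a BufFile: just an ordered list of segment names. */
typedef struct ToyBufFile
{
    int     nsegments;
    char  **segments;
    bool    readonly;       /* a "view" is read-only */
} ToyBufFile;

/* Make a read-only view of src; src must not be used directly afterwards. */
static ToyBufFile *
toy_view(ToyBufFile *src)
{
    ToyBufFile *view = malloc(sizeof(ToyBufFile));

    view->nsegments = src->nsegments;
    view->segments = malloc(src->nsegments * sizeof(char *));
    memcpy(view->segments, src->segments, src->nsegments * sizeof(char *));
    view->readonly = true;
    src->nsegments = 0;
    src->segments = NULL;
    return view;
}

/* Append src's segments onto the end of target, consuming src. */
static ToyBufFile *
toy_view_append(ToyBufFile *target, ToyBufFile *src)
{
    assert(target->readonly);   /* only views get concatenated here */
    target->segments = realloc(target->segments,
                               (target->nsegments + src->nsegments) * sizeof(char *));
    memcpy(target->segments + target->nsegments, src->segments,
           src->nsegments * sizeof(char *));
    target->nsegments += src->nsegments;
    free(src->segments);
    free(src);
    return target;
}

int
main(void)
{
    /* Pretend these are the per-worker temp files handed to the leader. */
    char       *w0[] = {"worker0.seg0"};
    char       *w1[] = {"worker1.seg0", "worker1.seg1"};
    ToyBufFile  worker[2] = {{1, w0, false}, {2, w1, false}};
    ToyBufFile *concat = NULL;

    /* One view/append call per worker, mirroring the loop discussed above. */
    for (int i = 0; i < 2; i++)
    {
        ToyBufFile *view = toy_view(&worker[i]);

        concat = (concat == NULL) ? view : toy_view_append(concat, view);
    }

    for (int i = 0; i < concat->nsegments; i++)
        printf("segment %d: %s\n", i, concat->segments[i]);
    return 0;
}

The point is just that the leader builds its concatenated input by going through the same narrow view/append interface for every worker file, which is where the extra asserts get to live.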
> In miscadmin.h, I'd put the prototype for the new GUC next to
> max_worker_processes, not maintenance_work_mem.

But then I'd really have to put it next to max_worker_processes in globals.c, too. That would mean that it would go under "Primary determinants of sizes of shared-memory structures" within globals.c, which seems wrong to me. What do you think?

> The ereport() in index_build will, I think, confuse people when it
> says that there are 0 parallel workers. I suggest splitting this into
> two cases: if (indexInfo->ii_ParallelWorkers == 0) ereport(...
> "building index \"%s\" on table \"%s\" serially" ...) else ereport(...
> "building index \"%s\" on table \"%s\" in parallel with request for %d
> parallel workers" ...).

WFM. I've simply dropped any reference to leader participation in the messages here, to keep things simple. This seemed okay because the only thing that affects leader participation is the parallel_leader_participation GUC, which is under the user's direct control at all times, and is unlikely to be changed. Those that really want further detail have trace_sort for that.

> The logic in IndexBuildHeapRangeScan() around need_register_snapshot
> and OldestXmin seems convoluted and not very well-edited to me.

Having revisited it, I now agree that the code added to IndexBuildHeapRangeScan() was unclear, primarily in that the need_unregister_snapshot local variable was overloaded in a weird way.

> I suggest that you go back to the original code organization
> and then just insert an additional case for a caller-supplied scan, so
> that the overall flow looks like this:
>
> if (scan != NULL)
> {
> ...
> }
> else if (IsBootstrapProcessingMode() || indexInfo->ii_Concurrent)
> {
> ...
> }
> else
> {
> ...
> }

The problem that I see with this alternative flow is that the "if (scan != NULL)" and the "else if (IsBootstrapProcessingMode() || indexInfo->ii_Concurrent)" blocks clearly must contain code for two distinct, non-overlapping cases, despite the fact that those two cases actually do overlap somewhat. That is, a call to IndexBuildHeapRangeScan() may have a (parallel) heap scan argument (control reaches your first code block), or may not (control reaches your second or third code block). At the same time, a call to IndexBuildHeapRangeScan() may use SnapshotAny (ordinary CREATE INDEX), or may need an MVCC snapshot (either by registering its own, or using the parallel one). These two things are orthogonal.

I think I still get the gist of what you're saying, though. I've come up with a new structure that is a noticeable improvement on what I had. Importantly, the new structure let me add a number of parallelism-agnostic asserts that make sure that every ambuild routine that supports parallelism gets the details right.

> Along with that, I'd change the name of need_register_snapshot to
> need_unregister_snapshot (it's doing both jobs right now) and
> initialize it to false.

Done.

> + * This support code isn't reliable when called from within a parallel
> + * worker process due to the fact that our state isn't propagated. This is
> + * why parallel index builds are disallowed on catalogs. It is possible
> + * that we'll fail to catch an attempted use of a user index undergoing
> + * reindexing due the non-propagation of this state to workers, which is not
> + * ideal, but the problem is not particularly likely to go undetected due to
> + * our not doing better there.
>
> I understand the first two sentences, but I have no idea what the
> third one means, especially the part that says "not particularly
> likely to go undetected due to our not doing better there". It sounds
> scary that something bad is only "not particularly likely to go
> undetected"; don't we need to detect bad things reliably?
The primary point here, that you said you understood, is that we definitely need to detect when we're reindexing a catalog index within the backend, so that systable_beginscan() can do the right thing and not use the index (we also must avoid assertion failures). My solution to that problem is, of course, to not allow the use of parallel create index when REINDEXing a system catalog. That seems 100% fine to me.

There is a little bit of ambiguity about other cases, though -- that's the secondary point I tried to make within that comment block, and the part that you took issue with. To put this secondary point another way: it's possible that we'd fail to detect it if someone's comparator went bananas and decided it was okay to do SQL access (that resulted in an index scan of the index undergoing reindex). That does seem rather unlikely, but I felt it necessary to say something like this because ReindexIsProcessingIndex() isn't something that only deals with catalog indexes -- it works with all indexes.

Anyway, I reworded this. I hope that what I came up with is clearer than before.

> But also,
> you used the word "not" three times and also the prefix "un-", meaning
> "not", once. Four negations in 13 words! Perhaps I'm not entirely in
> a position to cast aspersions on overly-complex phraseology -- the pot
> calling the kettle black and all that -- but I bet that will be a lot
> clearer if you reduce the number of negations to either 0 or 1.

You're not wrong. Simplified.

> The comment change in standard_planner() doesn't look helpful to me;
> I'd leave it out.

Okay.

> + * tableOid is the table that index is to be built on. indexOid is the OID
> + * of a index to be created or reindexed (which must be a btree index).
>
> I'd rewrite that first sentence to end "the table on which the index
> is to be built". The second sentence should say "an index" rather
> than "a index".

Okay.

> But, actually, I think we would be better off just ripping
> leaderWorker/leaderParticipates out of this function altogether.
> compute_parallel_worker() is not really under any illusion that it's
> computing a number of *participants*; it's just computing a number of
> *workers*.

That distinction does seem to cause plenty of confusion. While I accept what you say about compute_parallel_worker(), I still haven't gone as far as removing the leaderParticipates argument altogether, because compute_parallel_worker() isn't the only thing that matters here. (More on that below.)

> I think it's fine for
> parallel_leader_participation=off to simply mean that you get one
> fewer participants. That's actually what would happen with parallel
> query, too. Parallel query would consider
> parallel_leader_participation later, in get_parallel_divisor(), when
> working out the cost of one path vs. another, but it doesn't use it to
> choose the number of workers. So it seems to me that getting rid of
> all of the workerLeader considerations will make it both simpler and
> more consistent with what we do for queries.
I was aware of those details, and figured that parallel query fudges the compute_parallel_worker() figure's leader participation in some sense, and that that was what I needed to compensate for. After all, when parallel_leader_participation=off, having compute_parallel_worker() return 1 means rather a different thing from what it means with parallel_leader_participation=on, even though in general we seem to assume that parallel_leader_participation can only make a small difference overall.

Here's what I've done based on your feedback: I've changed the header comments, but stopped leaderParticipates from affecting the compute_parallel_worker() calculation (so, as I said, leaderParticipates stays). The leaderParticipates argument continues to affect these two aspects of plan_create_index_workers()'s return value:

1. It continues to be used so we have a total number of participants (not workers) to apply our must-have-32MB-workMem limit on participants. Parallel query has no equivalent of this, and it seems warranted. Note that this limit is no longer applied when the parallel_workers storage param was set, as discussed.

2. I continue to use the leaderParticipates argument to disallow the case where there is only one CREATE INDEX participant but parallelism is in use, because, of course, that clearly makes no sense -- we should just use a serial sort instead. (It might make sense to allow this if parallel_leader_participation were *purely* a testing GUC, only for use by backend hackers, but AFAICT it isn't.)

The planner can allow a single-participant parallel sequential scan path to be created without worrying about the fact that that doesn't make much sense, because a plan with only one parallel participant is always going to cost more than some serial plan (you will only get a 1-participant parallel sequential scan when force_parallel_mode is on). Obviously plan_create_index_workers() doesn't generate (partial) paths at all, so I simply have to get the same outcome (avoiding a senseless 1-participant parallel operation) some other way here.
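To spell out how those two checks interact with the defaults, here is a rough standalone model of the clamping described above. The function name, the flat 32MB-per-participant floor, and the serial fallback rule are written out from the description in this mail; it is an illustration, not the patch's plan_create_index_workers() code:

#include <stdbool.h>
#include <stdio.h>

/*
 * Toy model only: clamp the number of parallel workers for an index build.
 * Returns the number of workers to launch, or 0 to fall back to a serial
 * sort.  "Participants" counts the workers plus the leader, when the leader
 * participates.
 */
static int
clamp_index_workers(int requested_workers,
                    bool leader_participates,
                    int maintenance_work_mem_kb,
                    bool storage_param_set)    /* parallel_workers reloption given? */
{
    int         workers = requested_workers;

    /* Insist on at least 32MB of the budget per participant... */
    if (!storage_param_set)
    {
        while (workers > 0)
        {
            int         participants = workers + (leader_participates ? 1 : 0);

            if (maintenance_work_mem_kb / participants >= 32 * 1024)
                break;
            workers--;
        }
    }

    /* ...and treat a "parallel" sort with a single participant as serial. */
    if (workers + (leader_participates ? 1 : 0) <= 1)
        workers = 0;

    return workers;
}

int
main(void)
{
    /* The defaults discussed here: 2 workers requested, 64MB budget. */
    printf("leader participates:         %d worker(s)\n",
           clamp_index_workers(2, true, 64 * 1024, false));
    printf("leader does not participate: %d worker(s)\n",
           clamp_index_workers(2, false, 64 * 1024, false));
    return 0;
}

With maintenance_work_mem = 64MB and 2 workers requested, this model launches 1 worker when the leader participates and 2 when it doesn't; any floor higher than 32MB would rule out parallelism entirely at the default maintenance_work_mem, which is the "pointless" outcome mentioned below.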
> If you have an idea how to make a better
> cost model than this for CREATE INDEX, I'm willing to consider other
> options. If you don't, or want to propose that as a follow-up patch,
> then I think it's OK to use what you've got here for starters. I just
> don't want it to be more baroque than necessary.

I suspect that the parameters of any cost model for parallel CREATE INDEX that we're prepared to consider for v11 are: "Use a number of parallel workers that is one below the number at which the total duration of the CREATE INDEX either stays the same or goes up". It's hard to do much better than this within those parameters. I can see a fairly noticeable benefit to parallelism with 4 parallel workers and a measly 1MB of maintenance_work_mem (when parallelism is forced) relative to the serial case with the same amount of memory. At least on my laptop, it seems to be rather hard to lose relative to a serial sort when using parallel CREATE INDEX (to be fair, I'm probably actually using way more memory than 1MB to do this due to FS cache usage).

I can think of a cleverer approach to costing parallel CREATE INDEX, but it's only cleverer by weighing distributed costs. Not very relevant, for the time being.

BTW, the 32MB per participant limit within plan_create_index_workers() was chosen based on the realization that any higher value would make having a default setting of 2 for max_parallel_maintenance_workers (to match the max_parallel_workers_per_gather default) pointless when the default maintenance_work_mem value of 64MB is in use. That's not terribly scientific, though it at least doesn't come at the expense of a more scientific idea for a limit like that (I don't actually have one, you see). I am merely trying to avoid being *gratuitously* wasteful of shared resources that are difficult to accurately cost in (e.g., the distributed cost of random I/O to the system as a whole when we do a parallel index build while ridiculously low on maintenance_work_mem).

> I think that the naming of the wait events could be improved. Right
> now, they are named by which kind of process does the waiting, but it
> really should be named based on what the thing for which we're
> waiting. I also suggest that we could just write Sort instead of
> Tuplesort. In short, I suggest ParallelTuplesortLeader ->
> ParallelSortWorkersDone and ParallelTuplesortLeader ->
> ParallelSortTapeHandover.

WFM. Also added documentation for the wait events to monitoring.sgml, which I somehow missed the first time around.

> Not for this patch, but I wonder if it might be a worthwhile future
> optimization to allow workers to return multiple tapes to the leader.
> One doesn't want to go crazy with this, of course. If the worker
> returns 100 tapes, then the leader might get stuck doing multiple
> merge passes, which would be a foolish way to divide up the labor, and
> even if that doesn't happen, Amdahl's law argues for minimizing the
> amount of work that is not done in parallel. Still, what if a worker
> (perhaps after merging) ends up with 2 or 3 tapes? Is it really worth
> merging them so that the leader can do a 5-way merge instead of a
> 15-way merge?

I did think about this myself, or rather I thought specifically about building a serial/bigserial PK during pg_restore, a case that must be very common. The worker merges for such an index build will typically be *completely pointless* when all input runs are in sorted order, because the merge heap will only need to consult the root of the heap and its two immediate children throughout (commit 24598337c helped cases like this enormously). You might as well merge hundreds of runs in the leader, provided you still have enough memory per tape that you can get the full benefit of OS readahead (this is not that hard when you're only going to repeatedly read from the same tape anyway).
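To make the "root and its two immediate children" point concrete, here is a small self-contained toy -- not tuplesort.c code; the types, names, and the depth metric are made up for the illustration -- that merges eight runs through a binary min-heap and measures how far each sift-down has to descend:

#include <stdio.h>

#define NRUNS   8
#define RUNLEN  1000

typedef struct RunHead
{
    int         value;          /* current head value of this run */
    int         run;            /* which run the value came from */
    int         pos;            /* next position to read within that run */
} RunHead;

static int      runs[NRUNS][RUNLEN];
static RunHead  heap[NRUNS];
static int      heap_size;

/* Sift heap[i] down to its place; return how many levels we descended. */
static int
sift_down(int i)
{
    int         depth = 0;

    for (;;)
    {
        int         smallest = i;
        int         l = 2 * i + 1;
        int         r = 2 * i + 2;

        if (l < heap_size && heap[l].value < heap[smallest].value)
            smallest = l;
        if (r < heap_size && heap[r].value < heap[smallest].value)
            smallest = r;
        if (smallest == i)
            break;

        RunHead     tmp = heap[i];

        heap[i] = heap[smallest];
        heap[smallest] = tmp;
        i = smallest;
        depth++;
    }
    return depth;
}

/* Merge all runs; return the average sift depth per element output. */
static double
merge_and_measure(void)
{
    long        total_depth = 0;
    long        outputs = 0;

    heap_size = NRUNS;
    for (int r = 0; r < NRUNS; r++)
        heap[r] = (RunHead) {runs[r][0], r, 1};
    for (int i = NRUNS / 2 - 1; i >= 0; i--)
        sift_down(i);

    while (heap_size > 0)
    {
        RunHead    *top = &heap[0];

        /* "Output" top->value, then refill the root from the same run. */
        if (top->pos < RUNLEN)
            top->value = runs[top->run][top->pos++];
        else
            heap[0] = heap[--heap_size];    /* that run is exhausted */
        total_depth += sift_down(0);
        outputs++;
    }
    return (double) total_depth / outputs;
}

int
main(void)
{
    /* Non-overlapping runs: what presorted input gives each worker. */
    for (int r = 0; r < NRUNS; r++)
        for (int i = 0; i < RUNLEN; i++)
            runs[r][i] = r * RUNLEN + i;
    printf("presorted runs:   avg sift depth %.3f\n", merge_and_measure());

    /* Interleaved runs: keys from every run are mixed together. */
    for (int r = 0; r < NRUNS; r++)
        for (int i = 0; i < RUNLEN; i++)
            runs[r][i] = i * NRUNS + r;
    printf("interleaved runs: avg sift depth %.3f\n", merge_and_measure());

    return 0;
}

With non-overlapping runs, which is what presorted input produces, the average depth is essentially zero -- nearly every replacement key is only compared against the root's two children -- while with interleaved runs it approaches the height of the heap.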
I'm not too worried about it, though. The overall picture is still very positive even in this case. The "extra worker merging" isn't generally a big proportion of the overall cost, especially there. More importantly, if I tried to do better, it would be the "quicksort with spillover" cost model story all over again (remember how tedious that was?). How hard are we prepared to work to ensure that we get it right when it comes to skipping worker merging, given that users always pay some overhead, even when that doesn't happen?

Note also that parallel index builds manage to unfairly *gain* advantage over serial cases (they have the good variety of dumb luck, rather than the bad variety) in certain other common cases. This happens with an *inverse* physical/logical correlation (e.g. a DESC index build on a date field). They manage to artificially do better than theory would predict, simply because a greater number of smaller quicksorts are much faster during initial run generation, without also taking a concomitant performance hit at merge time. Thomas showed this at one point. Note that even that's only true because of the qsort precheck (what I like to call the "banana skin prone" precheck, that we added to our qsort implementation in 2006) -- it would be true for *all* correlations, but that one precheck thing complicates matters. All of this is a tricky business, and that isn't going to get any easier IMV.

> + * Make sure that the temp file(s) underlying the tape set are created in
> + * suitable temp tablespaces. This is only really needed for serial
> + * sorts.
>
> This comment makes me wonder whether it is "sorta" needed for parallel sorts.

I removed "really". The point of the comment is that we've already set up temp tablespaces for the shared fileset in the parallel case. Shared filesets figure out which tablespaces will be used up-front -- see SharedFileSetInit().

> - if (trace_sort)
> + if (trace_sort && !WORKER(state))
>
> I have a feeling we still want to get this output even from workers,
> but maybe I'm missing something.

I updated tuplesort_end() so that trace_sort reports on the end of the sort, even for worker processes. (We still don't show the generic tuplesort_begin* message for workers, though.)

> + arg5 indicates serial, parallel worker, or parallel leader sort.</entry>
>
> I think it should say what values are used for each case.

I based this on "arg0 indicates heap, index or datum sort", where it's implied that the values are respective to the order that they appear in the sentence (starting from 0). But okay, I'll do it that way all the same.

> + /* Release worker tuplesorts within leader process as soon as possible */
>
> IIUC, the worker tuplesorts aren't really holding onto much of
> anything in terms of resources. I think it might be better to phrase
> this as /* The sort we just did absorbed the final tapes produced by
> these tuplesorts, which are of no further use. */ or words to that
> effect.

Okay. Done that way.

> Instead of making a special case in CreateParallelContext for
> serializable_okay, maybe index_build should just use SetConfigOption()
> to force the isolation level to READ COMMITTED right after it does
> NewGUCNestLevel(). The change would only be temporary because the
> subsequent call to AtEOXact_GUC() will revert it.

I tried doing it that way, but it doesn't seem workable:

postgres=# begin transaction isolation level serializable ;
BEGIN
postgres=*# reindex index test_unique;
ERROR:  25001: SET TRANSACTION ISOLATION LEVEL must be called before any query
LOCATION:  call_string_check_hook, guc.c:9953

Note that AutoVacLauncherMain() uses SetConfigOption() to set/modify default_transaction_isolation -- not transaction_isolation.

Instead, I added a bit more to the comments within CreateParallelContext(), to justify what I've done along the lines you went into. Hopefully this works better for you.

> There is *still* more to review here, but my concentration is fading.
> If you could post an updated patch after adjusting for the comments
> above, I think that would be helpful. I'm not totally out of things
> to review that I haven't already looked over once, but I think I'm
> close.

I'm impressed with how quickly you're getting through review of the patch. Hopefully we can keep that momentum up.

Thanks
-- 
Peter Geoghegan