Re: Parallel Full Hash Join - Mailing list pgsql-hackers
From:           Thomas Munro
Subject:        Re: Parallel Full Hash Join
Date:
Msg-id:         CA+hUKGJStxwSeyDOFcqXMQ3T5EB1Qro4paTwUZvH6V+vxkD8xQ@mail.gmail.com
In response to: Re: Parallel Full Hash Join (Melanie Plageman <melanieplageman@gmail.com>)
Responses:      Re: Parallel Full Hash Join
List:           pgsql-hackers
On Sun, Mar 26, 2023 at 9:52 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> I have some very minor pieces of feedback, mainly about extraneous
> commas that made me uncomfortable ;)

Offensive punctuation removed.

> > discussion). Therefore FULL JOIN inhibited page-based parallelism,
> > as the other join strategies can't do it either.
>
> I actually don't quite understand what this means? It's been awhile for
> me, so perhaps I'm being dense, but what is page-based parallelism?

Reworded.  I just meant our usual kind of "partial path" parallelism
(the kind where you don't know anything at all about the values of the
tuples that each process sees, and the work is typically chopped up by
storage pages at the scan level).

> > That unfairness is considered acceptable for now, because it's better
> > than no parallelism at all.  The build and probe phases are run in
> > parallel, and the new scan-for-unmatched phase, while serial, is
> > usually applied to the smaller of the two relations and is either
> > limited by some multiple of work_mem, or it's too big and is
> > partitioned into batches and then the situation is improved by
> > batch-level parallelism.  In future work on deadlock avoidance
> > strategies, we may find a way to parallelize the new phase safely.
>
> Is it worth mentioning something about parallel-oblivious parallel hash
> join not being able to do this still? Or is that obvious?

That's kind of what I meant above.

> > @@ -3116,18 +3256,31 @@ ExecHashTableDetachBatch(HashJoinTable hashtable)
>
> full/right joins should never fall into this code path, right?

Yeah, this is the normal way we detach from a batch.  It is reached
when shutting down the executor early, when moving to the next batch,
and so on.

***

I found another problem.  I realised that ... FULL JOIN ... LIMIT n
might be able to give wrong answers with unlucky scheduling.
Unfortunately I have not yet been able to reproduce the phenomenon I am
imagining, but I can't think of any mechanism that prevents the
following sequence of events:

P0 probes, pulls n tuples from the outer relation, and then the
executor starts to shut down (see commit 3452dc52, which pushed down
LIMIT), but just then P1 attaches, right before P0 detaches.  P1
continues, finds < n outer tuples while probing and then runs out, so
it enters the unmatched scan phase and starts emitting bogusly
unmatched tuples.  Some outer tuples that we needed in order to get the
complete set of match bits, and thus the right answer, were buffered
inside P0's subplan and abandoned.

I've attached a simple fixup for this problem.  Short version: if
you're abandoning your PHJ_BATCH_PROBE phase without reaching the end,
you must be shutting down, so the executor must think it's OK to
abandon tuples this process has buffered, so it must also be OK to
throw all unmatched tuples out the window too, as if this process were
about to emit them.  Right?

***

With all the long and abstract discussion of hard-to-explain problems
in this thread and related threads, I thought I should take a step back
and figure out a way to demonstrate what this thing really does
visually.  I wanted to show that this is a very useful feature that
unlocks previously unobtainable parallelism, and to show the compromise
we've had to make so far in an intuitive way.  With some extra
instrumentation hacked up locally, I produced the attached "progress"
graphs for a very simple query: SELECT COUNT(*) FROM r FULL JOIN s
USING (i).
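For anyone who wants to recreate something in the same spirit, here is
a rough sketch of the kind of setup involved -- the table sizes,
work_mem setting and worker count are invented for illustration, not
the exact ones behind the attached graphs.  The two relations share no
keys at all, which is the worst case described below, and a small
work_mem is one way to push it into the multi-batch territory of
figure 4:

  -- two relations with completely disjoint key ranges, so every inner
  -- tuple is unmatched and the serial unmatched scan is maximised
  CREATE TABLE r AS SELECT generate_series(1, 10000000) AS i;
  CREATE TABLE s AS SELECT generate_series(20000001, 30000000) AS i;
  ANALYZE r;
  ANALYZE s;

  SET max_parallel_workers_per_gather = 4;  -- allow some helper processes
  SET work_mem = '4MB';                     -- small enough to force multiple batches
  SELECT COUNT(*) FROM r FULL JOIN s USING (i);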
Imagine a time axis along the bottom; I didn't bother to add numbers
because I'm just trying to convey the 'shape' of execution with
relative times and synchronisation points.  Figures 1-3 show that
phases 'h' (hash) and 'p' (probe) are parallelised and finish sooner as
we add more processes to help out, but 's' (= the unmatched inner tuple
scan) is not.  Note that if all inner tuples are matched, 's' becomes
extremely small and the parallelism is almost as fair as a plain old
inner join, but here I've maximised it: all inner tuples were
unmatched, because the two relations have no matches at all.  Even if
we achieve perfect linear scalability for the other phases, the speedup
will be governed by https://en.wikipedia.org/wiki/Amdahl%27s_law (for
illustration, if the serial scan took a quarter of the single-process
runtime, no number of extra processes could get you past about a 4x
speedup), and the only thing that can mitigate that is if there is more
useful work those early-quitting processes could do somewhere else in
your query plan.

Figure 4 shows that it gets a lot fairer in a multi-batch join, because
there is usually useful work to do on other batches of the same join.
Notice how processes initially work on loading, probing and scanning
different batches to reduce contention, but they are capable of ganging
up to load and/or probe the same batch if there is nothing else left to
do (for example P2 and P3 both work on p5 near the end).  For now, they
can't do that for the s phases.  (BTW the little gaps before loading
are the allocation phase, which I didn't label because a label wouldn't
fit; this visualisation technique is a WIP.)

With the "opportunistic" change we are discussing for later work,
figure 4 would become completely fair (P0 and P2 would be able to join
in and help out with s6 and s7), but the single-batch figures 1-3 would
not (that would require a different executor design AFAICT, or a eureka
insight we haven't had yet; see long-winded discussion).

The last things I'm thinking about now:  Are the planner changes right?
Are the tests enough?  I suspect we'll finish up changing that
chunk-based approach yet again in future work on memory efficiency, but
I'm OK with that; this change suits the current problem, and we don't
know what we'll eventually settle on with more research.
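On the tests question, the LIMIT scenario above is the kind of shape
I'd want covered somewhere.  A rough sketch of what such a query might
look like -- the table names, constants and GUC settings here are only
illustrative, just the usual knobs for coaxing out a parallel plan:

  -- coax the planner into a Parallel Hash Full Join
  SET enable_parallel_hash = on;
  SET parallel_setup_cost = 0;
  SET parallel_tuple_cost = 0;
  SET min_parallel_table_scan_size = 0;
  SET max_parallel_workers_per_gather = 2;

  -- the failure mode described above would show up as spurious
  -- NULL-extended rows, emitted by a process that reaches the unmatched
  -- scan after another process has abandoned buffered outer tuples
  SELECT * FROM r FULL JOIN s USING (i) LIMIT 10;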