Re: Parallel Full Hash Join - Mailing list pgsql-hackers
From:           Thomas Munro
Subject:        Re: Parallel Full Hash Join
Date:
Msg-id:         CA+hUKGJStxwSeyDOFcqXMQ3T5EB1Qro4paTwUZvH6V+vxkD8xQ@mail.gmail.com
In response to: Re: Parallel Full Hash Join (Melanie Plageman <melanieplageman@gmail.com>)
Responses:      Re: Parallel Full Hash Join
List:           pgsql-hackers
On Sun, Mar 26, 2023 at 9:52 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> I have some very minor pieces of feedback, mainly about extraneous
> commas that made me uncomfortable ;)

Offensive punctuation removed.

> > discussion). Therefore FULL JOIN inhibited page-based parallelism,
> > as the other join strategies can't do it either.
>
> I actually don't quite understand what this means? It's been awhile for
> me, so perhaps I'm being dense, but what is page-based parallelism?

Reworded.  I just meant our usual kind of "partial path" parallelism
(the kind where you don't know anything at all about the values of the
tuples that each process sees, and the work is typically chopped up by
storage pages at the scan level).

> > That unfairness is considered acceptable for now, because it's better
> > than no parallelism at all.  The build and probe phases are run in
> > parallel, and the new scan-for-unmatched phase, while serial, is
> > usually applied to the smaller of the two relations and is either
> > limited by some multiple of work_mem, or it's too big and is
> > partitioned into batches and then the situation is improved by
> > batch-level parallelism.  In future work on deadlock avoidance
> > strategies, we may find a way to parallelize the new phase safely.
>
> Is it worth mentioning something about parallel-oblivious parallel hash
> join not being able to do this still? Or is that obvious?

That's kind of what I meant above.

> > @@ -3116,18 +3256,31 @@ ExecHashTableDetachBatch(HashJoinTable hashtable)
>
> full/right joins should never fall into this code path, right?

Yeah, this is the normal way we detach from a batch.  It is reached
when shutting down the executor early, when moving to the next batch,
and so on.

***

I found another problem.  I realised that ... FULL JOIN ... LIMIT n
might be able to give wrong answers with unlucky scheduling.
Unfortunately I have not yet been able to reproduce the phenomenon I am
imagining, but I can't think of any mechanism that prevents the
following sequence of events:

P0 probes, pulls n tuples from the outer relation, and then the
executor starts to shut down (see commit 3452dc52, which pushed down
LIMIT), but just then P1 attaches, right before P0 detaches.  P1
continues, finds < n outer tuples while probing and then runs out, so
it enters the unmatched scan phase and starts emitting bogusly
unmatched tuples.  Some outer tuples that we needed in order to get the
complete set of match bits, and thus the right answer, were buffered
inside P0's subplan and abandoned.

I've attached a simple fixup for this problem.  Short version: if
you're abandoning your PHJ_BATCH_PROBE phase without reaching the end,
you must be shutting down, so the executor must think it's OK to
abandon tuples this process has buffered, so it must also be OK to
throw all unmatched tuples out the window too, as if this process were
about to emit them.  Right?

***

With all the long and abstract discussion of hard-to-explain problems
in this thread and related threads, I thought I should take a step back
and figure out a way to demonstrate what this thing really does
visually.  I wanted to show that this is a very useful feature that
unlocks previously unobtainable parallelism, and to show the compromise
we've had to make so far in an intuitive way.  With some extra
instrumentation hacked up locally, I produced the attached "progress"
graphs for a very simple query: SELECT COUNT(*) FROM r FULL JOIN s
USING (i).
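For anyone who wants to recreate something in the same spirit, here is
a rough sketch of the kind of setup involved -- the table sizes,
work_mem setting and worker count are invented for illustration, not
the exact ones behind the attached graphs.  The two relations share no
keys at all, which is the worst case described below, and a small
work_mem is one way to push it into the multi-batch territory of
figure 4:

  -- two relations with completely disjoint key ranges, so every inner
  -- tuple is unmatched and the serial unmatched scan is maximised
  CREATE TABLE r AS SELECT generate_series(1, 10000000) AS i;
  CREATE TABLE s AS SELECT generate_series(20000001, 30000000) AS i;
  ANALYZE r;
  ANALYZE s;

  SET max_parallel_workers_per_gather = 4;  -- allow some helper processes
  SET work_mem = '4MB';                     -- small enough to force multiple batches
  SELECT COUNT(*) FROM r FULL JOIN s USING (i);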
Imagine a time axis along the bottom; I didn't bother to add numbers
because I'm just trying to convey the 'shape' of execution with
relative times and synchronisation points.  Figures 1-3 show that
phases 'h' (hash) and 'p' (probe) are parallelised and finish sooner as
we add more processes to help out, but 's' (= the unmatched inner tuple
scan) is not.  Note that if all inner tuples are matched, 's' becomes
extremely small and the parallelism is almost as fair as a plain old
inner join, but here I've maximised it: all inner tuples were
unmatched, because the two relations have no matches at all.  Even if
we achieve perfect linear scalability for the other phases, the speedup
will be governed by https://en.wikipedia.org/wiki/Amdahl%27s_law (for
illustration, if the serial scan took a quarter of the single-process
runtime, no number of extra processes could get you past about a 4x
speedup), and the only thing that can mitigate that is if there is more
useful work those early-quitting processes could do somewhere else in
your query plan.

Figure 4 shows that it gets a lot fairer in a multi-batch join, because
there is usually useful work to do on other batches of the same join.
Notice how processes initially work on loading, probing and scanning
different batches to reduce contention, but they are capable of ganging
up to load and/or probe the same batch if there is nothing else left to
do (for example P2 and P3 both work on p5 near the end).  For now, they
can't do that for the s phases.  (BTW the little gaps before loading
are the allocation phase, which I didn't label because a label wouldn't
fit; this visualisation technique is a WIP.)

With the "opportunistic" change we are discussing for later work,
figure 4 would become completely fair (P0 and P2 would be able to join
in and help out with s6 and s7), but the single-batch figures 1-3 would
not (that would require a different executor design AFAICT, or a eureka
insight we haven't had yet; see long-winded discussion).

The last things I'm thinking about now:  Are the planner changes right?
Are the tests enough?  I suspect we'll finish up changing that
chunk-based approach yet again in future work on memory efficiency, but
I'm OK with that; this change suits the current problem, and we don't
know what we'll eventually settle on with more research.
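On the tests question, the LIMIT scenario above is the kind of shape
I'd want covered somewhere.  A rough sketch of what such a query might
look like -- the table names, constants and GUC settings here are only
illustrative, just the usual knobs for coaxing out a parallel plan:

  -- coax the planner into a Parallel Hash Full Join
  SET enable_parallel_hash = on;
  SET parallel_setup_cost = 0;
  SET parallel_tuple_cost = 0;
  SET min_parallel_table_scan_size = 0;
  SET max_parallel_workers_per_gather = 2;

  -- the failure mode described above would show up as spurious
  -- NULL-extended rows, emitted by a process that reaches the unmatched
  -- scan after another process has abandoned buffered outer tuples
  SELECT * FROM r FULL JOIN s USING (i) LIMIT 10;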