Re: [HACKERS] WIP: [[Parallel] Shared] Hash - Mailing list pgsql-hackers
From | Thomas Munro
---|---
Subject | Re: [HACKERS] WIP: [[Parallel] Shared] Hash
Date |
Msg-id | CAEepm=1vGcv6LBrxZeqPb_rPxfraidWAF_8_4z2ZMQ+7DOjj9w@mail.gmail.com
In response to | Re: [HACKERS] WIP: [[Parallel] Shared] Hash (Thomas Munro <thomas.munro@enterprisedb.com>)
Responses | Re: [HACKERS] WIP: [[Parallel] Shared] Hash
List | pgsql-hackers
On Tue, Jan 3, 2017 at 10:53 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> I will post a new rebased version soon with that and
> some other nearby problems fixed.

Here is a new WIP patch.  I have plenty of things to tidy up (see note at end), but the main ideas are now pretty clear and I'd appreciate some feedback.  The main changes since the last patch, other than debugging, are:

* the number of batches now increases if work_mem would be exceeded; the work of 'shrinking' the hash table in memory in that case is done in parallel

* work_mem accounting is done at chunk level, instead of per tuple

* interlocking has been rethought

Previously, I had some ideas about using lock-free tricks for managing chunks of memory, but you may be relieved to hear that I abandoned those plans.  Now, atomic ops are used for only one thing: pushing tuples into the shared hash table buckets.  An LWLock called 'chunk_lock' protects various linked lists of chunks of memory, and also the shared work_mem accounting.  The idea is that backends can work independently on HASH_CHUNK_SIZE blocks of tuples at a time in between needing to acquire that lock briefly.
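To make that concrete, here is a rough standalone sketch of the scheme, just to illustrate the idea rather than code from the patch: C11 atomics and a pthread mutex stand in for the pg_atomic_* primitives and the chunk_lock LWLock, and all of the names below are invented.

/*
 * Standalone model of the interlocking idea (not the patch's code):
 * tuples are allocated from per-backend chunks, the shared accounting
 * and chunk lists are protected by a lock taken once per chunk, and
 * bucket insertion is a lock-free compare-and-swap loop.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define HASH_CHUNK_SIZE (32 * 1024)

typedef struct Tuple
{
	struct Tuple *next;			/* next tuple in the same bucket */
	uint32_t	hashvalue;
	/* tuple payload would follow */
} Tuple;

typedef struct Chunk
{
	struct Chunk *next;			/* all chunks are linked for later scans */
	size_t		used;
	char		data[HASH_CHUNK_SIZE];
} Chunk;

typedef struct SharedHashTable
{
	_Atomic(Tuple *) *buckets;	/* array of lock-free bucket heads */
	size_t		nbuckets;
	pthread_mutex_t chunk_lock; /* protects chunk list + accounting */
	Chunk	   *chunks;			/* shared list of all chunks */
	size_t		bytes_used;		/* shared work_mem accounting */
	size_t		work_mem;		/* threshold that triggers shrinking */
} SharedHashTable;

typedef struct Participant
{
	Chunk	   *current_chunk;	/* chunk this backend is currently filling */
} Participant;

/*
 * Allocate tuple space, taking chunk_lock only once per HASH_CHUNK_SIZE.
 * Error handling and oversized tuples are ignored for brevity.
 */
void *
chunk_alloc(SharedHashTable *ht, Participant *me, size_t size, bool *must_shrink)
{
	Chunk	   *chunk = me->current_chunk;

	*must_shrink = false;
	if (chunk == NULL || chunk->used + size > HASH_CHUNK_SIZE)
	{
		chunk = malloc(sizeof(Chunk));
		chunk->used = 0;

		pthread_mutex_lock(&ht->chunk_lock);
		chunk->next = ht->chunks;
		ht->chunks = chunk;
		ht->bytes_used += sizeof(Chunk);
		/* caller would coordinate shrinking with the other participants */
		*must_shrink = ht->bytes_used > ht->work_mem;
		pthread_mutex_unlock(&ht->chunk_lock);

		me->current_chunk = chunk;
	}

	void	   *result = chunk->data + chunk->used;

	chunk->used += size;
	return result;
}

/* Push a tuple onto its bucket with a compare-and-swap loop; no lock. */
void
bucket_push(SharedHashTable *ht, Tuple *tuple)
{
	_Atomic(Tuple *) *bucket = &ht->buckets[tuple->hashvalue % ht->nbuckets];
	Tuple	   *head = atomic_load(bucket);

	do
		tuple->next = head;
	while (!atomic_compare_exchange_weak(bucket, &head, tuple));
}

The point of the chunk granularity is that the lock is taken only once per HASH_CHUNK_SIZE worth of tuples, while the bucket pushes themselves need no lock at all.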
Also, there is now a second barrier, used to coordinate hash table shrinking.  This can happen any number of times during the PHJ_PHASE_HASHING and PHJ_PHASE_LOADING_BATCH(n) phases, as required to stay under work_mem, so it needed to be a separate barrier.

The communication in this patch is a bit more complicated than in other nearby parallel query projects I've looked at; probably the worst bit is the leader deadlock avoidance stuff (see ExecHashCheckForEarlyExit), and the second worst bit is probably the switch statements for allowing participants to show up late and get in sync, which make that other problem even more annoying.  Without those problems, and with just the right kind of reusable shared tuplestore, this would be a vastly simpler patch.  Those are not really fundamental problems of parallel joins using shared hash tables, but they are problems I don't have a better solution to right now.

Stepping back a bit, I am aware of the following approaches to hash join parallelism:

1.  Run the inner plan and build a private hash table in each participant, and then scatter the outer plan arbitrarily across participants.  This is what 9.6 does, and it's a good plan for small hash tables with fast inner plans, but a terrible plan for expensive or large inner plans.  Communication overhead: zero; CPU overhead: runs the inner plan in k workers simultaneously; memory overhead: builds k copies of the hash table; disk overhead: may need to spill k copies of all batches to disk if work_mem is exceeded; restrictions: can't do right/full joins because there are no shared 'matched' flags.

2.  Run a partition-wise hash join[1].  Communication overhead: zero; CPU overhead: zero; memory overhead: zero; disk overhead: zero; restrictions: the schema must include compatible partition keys, and potential parallelism is limited by the number of partitions.

3.  Repartition the data on the fly, and then run a partition-wise hash join.  Communication overhead: every tuple on at least one and possibly both sides must be rerouted to the correct participant; CPU overhead: zero, once repartitioning is done; memory overhead: none; disk overhead: may need to spill partitions to disk if work_mem is exceeded.

4.  Scatter both the inner and outer plans arbitrarily across participants (ie uncorrelated partitioning), and build a shared hash table.  Communication overhead: synchronisation of the build/probe phases, but no tuple rerouting; CPU overhead: none; memory overhead: none; disk overhead: may need to spill batches to disk; restrictions: none in general, but currently we have to drop the leader after the first batch of a multi-batch join due to our consumer/producer leader problem mentioned in earlier messages.

We have 1.  This proposal aims to provide 4.  It seems we have 2 on the way (that technique works for all three join algorithms without any changes to the join operators and looks best by any measure, but is limited by the user's schema, ie it takes careful planning on the user's part instead of potentially helping any join).  Other databases, including SQL Server, offer 3.  I suspect that 4 is probably a better fit than 3 for Postgres today, because the communication overhead of shovelling nearly all tuples through extra tuple queues to route them to the right hash table would surely be very high, though I can see that it's very attractive to have a reusable tuple repartitioning operator and then run k disjoint communication-free joins (again, without code changes to the join operator, and to the benefit of all join operators).

About the shared batch reading code: this patch modifies BufFile so that a temporary file can be shared read-only with other participants, and then introduces a mechanism for coordinating shared reads.  Each worker starts out reading all the tuples from the file that it wrote, before attempting to steal tuples from the files written by other participants, until there are none left anywhere.  In the best case they all write out and then read back in just their own files with minimal contention; contention rises as tuples are less evenly distributed among participants, but we never quite get the best case because the leader always leaves behind a bunch of batches for the others to deal with when it quits early.
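To illustrate the stealing order, here is a simplified standalone sketch (again invented names, a pthread mutex instead of the real interlocking, and fixed-size tuples for brevity), not code from the patch:

/*
 * Rough model of shared batch reading: each participant writes its own
 * batch file, then at read time drains its own file first and finally
 * steals from whichever other files still have tuples left.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_PARTICIPANTS 64

typedef struct SharedBatchFile
{
	pthread_mutex_t lock;		/* serialises reads from this file */
	FILE	   *file;			/* stand-in for a shared, read-only BufFile */
	bool		exhausted;		/* no tuples left to hand out */
} SharedBatchFile;

typedef struct SharedBatch
{
	int			nparticipants;
	SharedBatchFile files[MAX_PARTICIPANTS];
} SharedBatch;

/*
 * Read the next tuple for participant 'me'.  Returns false when every
 * participant's file for this batch has been drained.
 */
bool
shared_batch_next(SharedBatch *batch, int me, void *tuple, size_t tuple_size)
{
	/* Visit our own file first, then everyone else's. */
	for (int i = 0; i < batch->nparticipants; i++)
	{
		SharedBatchFile *f = &batch->files[(me + i) % batch->nparticipants];
		bool		got = false;

		if (f->exhausted)
			continue;

		pthread_mutex_lock(&f->lock);
		if (!f->exhausted)
		{
			if (fread(tuple, tuple_size, 1, f->file) == 1)
				got = true;
			else
				f->exhausted = true;	/* remember EOF for later callers */
		}
		pthread_mutex_unlock(&f->lock);

		if (got)
			return true;
	}
	return false;
}

Each participant begins at its own file, matching the "read back what you wrote" behaviour described above, and only starts taking other participants' locks once its own file is drained.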
Maybe I should separate all the batch reader stuff into another patch so it doesn't clutter the hash join code up so much?  I will start reviewing Parallel Tuplesort shortly, which includes some related ideas.

Some assorted notes on the status: I need to do some thinking about the file cleanup logic, covering both explicit deletes at the earliest possible time and failure/error paths.  Currently the creator of each file is responsible for cleaning it up, but I guess if the creator aborts early the file disappears underneath the others' feet, and then I guess they might raise a confusing error report that races against the root cause error report; I'm looking into that.  Rescans and skew buckets are not finished yet.  The new chunk-queue based ExecScanHashTableForUnmatched isn't tested yet (it replaces an earlier version that was doing a bucket-by-bucket parallel scan).  There are several places where I haven't changed the private hash table code to match the shared version because I'm not sure about that, in particular the idea of chunk-based accounting (which happens to be convenient for this code, but I also believe it to be more correct).  I'm still trying to decide how to report the hash table tuple count and size: possibly the grand totals.  Generally I need to do some tidying and provide a suite of queries that hits interesting cases.  I hope to move on these things fairly quickly now that I've got the hash table resizing and batch sharing stuff working (a puzzle that kept me very busy for a while), though I'm taking a break for a bit to do some reviewing.

The test query I've been looking at recently is TPCH Q9.  With scale 1GB and work_mem = 64KB, I get a query plan that includes three different variants of Hash node: Hash (run in every backend, duplicate hash tables), Shared Hash (run in just one backend, but allowed to use the sum of work_mem of all the backends, so it usually wins by avoiding batching), and Parallel Shared Hash (run in parallel and using the sum of work_mem).  As an anecdatum, I see around a 2.5x speedup against master, using only 2 workers in both cases, though it seems to be bimodal, either 2x or 2.8x, which I expect has something to do with that leader exit stuff; I'm looking into that.  More on performance soon.

Thanks for reading!

[1] https://www.postgresql.org/message-id/flat/CAFjFpRfQ8GrQvzp3jA2wnLqrHmaXna-urjm_UY9BqXj%3DEaDTSA%40mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers