Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
Msg-id: CA+TgmoZU67xAT0nny+7DEWetXGdxcZ3OnnOhqeeAZe1ksVMcHA@mail.gmail.com
In response to: Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) (Thomas Munro <thomas.munro@enterprisedb.com>)
List: pgsql-hackers
On Thu, Feb 9, 2017 at 6:38 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Here's my thought process... please tell me where I'm going wrong:
>
> I have been assuming that it's not enough to just deal with this when
> the leader detaches on the theory that other participants will always
> detach first: that probably isn't true in some error cases, and could
> contribute to spurious racy errors where other workers complain about
> disappearing files if the leader somehow shuts down and cleans up
> while a worker is still running.  Therefore we need *some* kind of
> refcounting, whether it's a new kind or a new mechanism based on the
> existing kind.

+1.

> I have also been assuming that we don't want to teach dsm.c directly
> about this stuff; it shouldn't need to know about other modules, so we
> don't want it talking to buffile.c directly and managing a special
> table of files; instead we want a system of callbacks.  Therefore
> client code needs to do something after attaching to the segment in
> each backend.

+1.

> It doesn't matter whether we use an on_dsm_detach() callback and
> manage our own refcount to infer that destruction is imminent, or a
> new on_dsm_destroy() callback which tells us so explicitly: both ways
> we'll need to make sure that anyone who attaches to the segment also
> "attaches" to this shared BufFile manager object inside it, because
> any backend might turn out to be the one that is last to detach.

Not entirely.  In the first case, you don't need the requirement that
everyone who attaches the segment must attach to the shared BufFile
manager.  In the second case, you do.

> That bring us to the race you mentioned.  Isn't it sufficient to say
> that you aren't allowed to do anything that might throw in between
> attaching to the segment and attaching to the SharedBufFileManager
> that it contains?

That would be sufficient, but I think it's not a very good design.
It means, for example, that nothing between the time you attach to
the segment and the time you attach to this manager can palloc()
anything.  So, for example, it would have to happen before
ParallelWorkerMain reaches the call to shm_mq_attach, which kinda
sucks because we want to do that as soon as possible after attaching
to the DSM segment so that errors are reported properly thereafter.
Note that's the very first thing we do now, except for working out
what the arguments to that call need to be.

Also, while it's currently safe to assume that shm_toc_attach() and
shm_toc_lookup() don't throw errors, I've thought about the
possibility of installing some sort of cache in shm_toc_lookup() to
amortize the cost of lookups, if the number of keys ever got too
large.  And that would then require a palloc().

Generally, backend code should be free to throw errors.  When it's
absolutely necessary for a short segment of code to avoid that, then
we do, but you can't really rely on any substantial amount of code to
be that way, or stay that way.

And in this case, even if we didn't mind those problems or had some
solution to them, I think that the shared buffer manager shouldn't
have to be something that is whacked directly into parallel.c all the
way at the beginning of the initialization sequence so that nothing
can fail before it happens.  I think it should be an optional data
structure that clients of the parallel infrastructure can decide to
use, or to not use.
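
Coming back to the on_dsm_detach()-plus-refcount flavor for a second,
here is very roughly the shape I have in mind.  SharedTempFileState
and these function names are invented purely for illustration, not
taken from any posted patch, and the sketch deliberately ignores the
attach-versus-final-detach race we're discussing:

#include "postgres.h"

#include "storage/dsm.h"
#include "storage/spin.h"

typedef struct SharedTempFileState
{
    slock_t     mutex;          /* protects refcnt */
    int         refcnt;         /* number of backends using the files */
    /* ... enough information to identify the temp files ... */
} SharedTempFileState;

/* on_dsm_detach callback: last backend out deletes the files. */
static void
shared_tempfile_on_detach(dsm_segment *seg, Datum arg)
{
    SharedTempFileState *state = (SharedTempFileState *) DatumGetPointer(arg);
    bool        last;

    SpinLockAcquire(&state->mutex);
    last = (--state->refcnt == 0);
    SpinLockRelease(&state->mutex);

    if (last)
    {
        /* ... unlink the temp files identified by "state" ... */
    }
}

/*
 * Each backend that wants to use the shared files calls this after finding
 * the struct in the segment.  The creator is assumed to have SpinLockInit'd
 * the mutex and set refcnt = 0.
 */
static void
shared_tempfile_attach(SharedTempFileState *state, dsm_segment *seg)
{
    SpinLockAcquire(&state->mutex);
    state->refcnt++;
    SpinLockRelease(&state->mutex);

    on_dsm_detach(seg, shared_tempfile_on_detach, PointerGetDatum(state));
}

Note that only backends that actually use the files need to call
shared_tempfile_attach(), which is exactly why this flavor doesn't
need the everyone-who-attaches-the-segment requirement; with the
on_dsm_destroy() flavor, everyone would have to register.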
The shared BufFile manager should be at arm's length from the core
code, just like the way ParallelQueryMain() is distinct from
ParallelWorkerMain() and sets up its own set of data structures with
their own set of keys.  All that stuff is happy to happen after
whatever ParallelWorkerMain() feels that it needs to do, even if
ParallelWorkerMain might throw errors for any number of unknown
reasons.  Similarly, I think this new thing should be something that
an executor node can decide to create inside its own per-node space
-- reserved via ExecParallelEstimate, initialized via
ExecParallelInitializeDSM, etc. (there's a rough sketch of that
pattern a little further down).  There's no need for it to be deeply
coupled to parallel.c itself unless we force that choice by sticking
a no-fail requirement in there.

> Up until two minutes ago I assumed that policy would leave only two
> possibilities: you attach to the DSM segment and attach to the
> SharedBufFileManager successfully or you attach to the DSM segment
> and then die horribly (but not throw) and the postmaster restarts
> the whole cluster and blows all temp files away with
> RemovePgTempFiles().  But I see now in the comment of that function
> that crash-induced restarts don't call that because "someone might
> want to examine the temp files for debugging purposes".  Given that
> policy for regular private BufFiles, I don't see why that shouldn't
> apply equally to shared files: after a crash restart, you may have
> some junk files that won't be cleaned up until your next clean
> restart, whether they were private or shared BufFiles.

I think most people (other than Tom) would agree that that policy
isn't really sensible any more; it probably made sense when the
PostgreSQL user community was much smaller and consisted mostly of
the people developing PostgreSQL, but these days it's much more
likely to cause operational headaches than to help a developer debug.

Regardless, I think the primary danger isn't failure to remove a file
(although that is best avoided) but removing one too soon (causing
someone else to error out when opening it, or on Windows causing the
delete itself to error out).  It's not really OK for random stuff to
throw errors in corner cases because we were too lazy to ensure that
cleanup operations happen in the right order.

>> I thought the idea was that the structure we're talking about here
>> owns all the files, up to 2 from a leader that wandered off plus up
>> to 2 for each worker.  Last process standing removes them.  Or are
>> you saying each worker only needs 2 files but the leader needs a
>> potentially unbounded number?
>
> Yes, potentially unbounded in rare case.  If we plan for N batches,
> and then run out of work_mem because our estimates were just wrong
> or the distributions of keys is sufficiently skewed, we'll run
> HashIncreaseNumBatches, and that could happen more than once.  I
> have a suite of contrived test queries that hits all the various
> modes and code paths of hash join, and it includes a query that
> plans for one batch but finishes up creating many, and then the
> leader exits.  I'll post that to the other thread along with my
> latest patch series soon.

Hmm, OK.  So that's going to probably require something where a fixed
amount of DSM can describe an arbitrary number of temp file series.
But that also means this is an even-more-special-purpose tool that
shouldn't be deeply tied into parallel.c so that it can run before
any errors happen.
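
Here's the per-node pattern I was referring to above.  Foo, FooState,
FooShared, ExecFooEstimate and ExecFooInitializeDSM are made-up names,
just to show the shape of it, not anything from an actual patch:

#include "postgres.h"

#include "access/parallel.h"
#include "nodes/execnodes.h"
#include "storage/shm_toc.h"

typedef struct FooShared
{
    int         nparticipants;
    /* ... whatever the node needs to share, e.g. its file-set state ... */
} FooShared;

typedef struct FooState
{
    PlanState   ps;             /* standard executor-node first member */
    Size        shared_len;     /* size of our chunk of the DSM segment */
    FooShared  *shared;         /* pointer into the DSM segment */
} FooState;

/* Called from ExecParallelEstimate(): reserve space and a toc key. */
void
ExecFooEstimate(FooState *node, ParallelContext *pcxt)
{
    node->shared_len = sizeof(FooShared);
    shm_toc_estimate_chunk(&pcxt->estimator, node->shared_len);
    shm_toc_estimate_keys(&pcxt->estimator, 1);
}

/* Called from ExecParallelInitializeDSM(): carve it out and publish it. */
void
ExecFooInitializeDSM(FooState *node, ParallelContext *pcxt)
{
    FooShared  *shared;

    shared = shm_toc_allocate(pcxt->toc, node->shared_len);
    shared->nparticipants = pcxt->nworkers + 1;

    /* keyed by plan node id, like the other executor nodes */
    shm_toc_insert(pcxt->toc, node->ps.plan->plan_node_id, shared);
    node->shared = shared;
}

The worker side would look the same chunk up again by plan node id
with shm_toc_lookup(), and that's the natural place for it to register
whatever cleanup callback it wants -- well after ParallelWorkerMain()
has set up error reporting.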
Basically, I think the "let's write the code between here and here so
it throws no errors" technique is, for 99% of PostgreSQL programming,
difficult and fragile.  We shouldn't rely on it if there is some other
reasonable option.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company