Re: parallel mode and parallel contexts - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: parallel mode and parallel contexts
Msg-id: CA+TgmoY0Nj_eNx3tGQKsi3M8wf7hq0p6jXSUpsdPLPRn5O1vxQ@mail.gmail.com
In response to: Re: parallel mode and parallel contexts (Andres Freund <andres@2ndquadrant.com>)
Responses: Re: parallel mode and parallel contexts
List: pgsql-hackers
On Sat, Feb 7, 2015 at 7:20 PM, Andres Freund <andres@2ndquadrant.com> wrote:

> Observations:
> * Some trailing whitespace in the readme. Very nice otherwise.

Fixed. Thanks.

> * Don't like CreateParallelContextForExtension much as a function name -
> I don't think we normally equate the fact that code is located in a
> loadable library with the extension mechanism.

Suggestions for a better name? CreateParallelContextForLoadableFunction?

> * StaticAssertStmt(BUFFERALIGN(PARALLEL_ERROR_QUEUE_SIZE) ...) what's
> that about?

Gee, maybe I should have added a comment so I'd remember. If the buffer size isn't MAXALIGN'd, that would be really bad, because shm_mq_create() assumes that it's being given an aligned address. Maybe I should add an Assert() there. If it is MAXALIGN'd but not BUFFERALIGN'd, we might waste a few bytes of space, since shm_toc_allocate() always allocates in BUFFERALIGN'd chunks, but I don't think anything will actually break. Not sure if that's worth an assert or not.

> * the plain shm calculations intentionally use mul_size/add_size to deal
> with overflows. On 32bit it doesn't seem impossible, but unlikely to
> overflow size_t.

Yes, I do that here too, though as with the plain shm calculations, only in the estimate functions. The functions that actually serialize stuff don't have to worry about overflow because it's already been checked.

> * I'd s/very // in "We might be running in a very short-lived memory
> context.". Or replace it with "too".

Removed "very".

> * In +LaunchParallelWorkers, does it make sense trying to start workers
> if one failed? ISTM that's likely to not be helpful. I.e. it should
> just break; after the first failure.

It can't just break, because clearing pcxt->worker[i].error_mqh is an essential step. I could add a flag variable that tracks whether any registrations have failed and change the if statement to if (!any_registrations_failed && RegisterDynamicBackgroundWorker(&worker, &pcxt->worker[i].bgwhandle)). I thought about doing that but decided it was very unlikely to affect the real-world performance of anything, so it seemed easier just to keep the code simple. But I'll change it if you want.
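Concretely, I think the flagged version of the loop would come out roughly like this (untested sketch; the flag name and the failure-path details are mine, the field names are from the patch):

    bool        any_registrations_failed = false;
    int         i;

    for (i = 0; i < pcxt->nworkers; ++i)
    {
        /*
         * Try to register another worker only if no previous registration
         * has failed; 'worker' is the BackgroundWorker struct filled in
         * earlier, as in the existing code.
         */
        if (!any_registrations_failed &&
            RegisterDynamicBackgroundWorker(&worker,
                                            &pcxt->worker[i].bgwhandle))
            continue;           /* this one is on its way */

        /*
         * Either this registration failed or an earlier one did: stop
         * trying to start workers, but still clear the per-worker state,
         * since clearing error_mqh is the essential step mentioned above.
         */
        any_registrations_failed = true;
        pcxt->worker[i].bgwhandle = NULL;
        pcxt->worker[i].error_mqh = NULL;
    }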
> * +WaitForParallelWorkersToFinish says that it waits for workers to exit
> cleanly. To me that's ambiguous. How about "fully"?

I've removed the word "cleanly" and added a comment to more fully explain the danger:

+ * Even if the parallel operation seems to have completed successfully, it's
+ * important to call this function afterwards. We must not miss any errors
+ * the workers may have thrown during the parallel operation, or any that they
+ * may yet throw while shutting down.

> * ParallelMain restores libraries before GUC state. Given that
> libraries can, and actually somewhat frequently do, inspect GUCs on
> load, it seems better to do it the other way round? You argue "We want
> to do this before restoring GUCs, because the libraries might define
> custom variables.", but I don't buy that. It's completely normal for
> namespaced GUCs to be present before a library is loaded? Especially
> as you then go and process_session_preload_libraries() after setting
> the GUCs.

This is a good question, but the answer is not entirely clear to me. I'm thinking I should probably just remove process_session_preload_libraries() altogether at this point. That was added at a time when RestoreLibraryState() didn't exist yet, and I think RestoreLibraryState() is a far superior way of handling this, because surely the list of libraries that the parallel leader *actually had loaded* is more definitive than any GUC.

Now, that doesn't answer the question about whether we should load libraries first or GUCs. I agree that the comment's reasoning is bogus, but I'm not sure I understand why you think the other order is better. It *is* normal for namespaced GUCs to be present before a library is loaded, but it's equally normal (and, I think, often more desirable in practice) to load the library first and then set the GUCs. Generally, I think that libraries ought to be loaded as early as possible, because they may install hooks that change the way other stuff works, and should therefore be loaded before that other stuff happens.

> * Should ParallelMain maybe enter the parallel context before loading
> user defined libraries? It's far from absurd to execute code touching
> the database on library initialization...

It's hard to judge without specific examples. What kinds of things do they do? Are they counting on a transaction being active? I would have thought that was a no-no, since there are many instances in which it won't be true. Also, you might have just gotten loaded because a function stored in your library was called, so you could be in a transaction that's busy doing something else, or deep in a subtransaction stack, etc. It seems unsafe to do very much more than a few syscache lookups here, even if there does happen to be a transaction active.

> * rename ParallelMain to ParallelWorkerMain?

Sounds good. Done.

> * I think restoring snapshots needs to fudge the worker's PGXACT->xmin
> to be the minimum of the top transaction id and the
> snapshots. Otherwise if the initiating process dies badly
> (e.g. because postmaster died) the workers will continue to work,
> while other processes may remove things.

RestoreTransactionSnapshot() works the same way as the existing import/export snapshot stuff, so that ought to be no less safe than what we're doing already. Any other snapshots that we're restoring had better not have an xmin lower than that one; if they do, the master messed up. Possibly it would be a good idea to have additional safeguards there; not sure exactly what. Perhaps RestoreSnapshot() could assert that the xmin of the restored snapshot follows-or-is-equal-to PGXACT->xmin? That would be safe for the first snapshot we restore, because our xmin will be InvalidTransactionId, and after that it would check the condition you're worried about. Thoughts?
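If we want that safeguard, I imagine it would boil down to something like this in RestoreSnapshot() (just a sketch; 'snapshot' here stands for the snapshot being restored):

    /*
     * Possible extra cross-check: the snapshot being restored must not have
     * an xmin that precedes whatever xmin this backend has already
     * advertised.  The first snapshot we restore is always OK, because
     * MyPgXact->xmin starts out invalid.
     */
    Assert(!TransactionIdIsValid(MyPgXact->xmin) ||
           TransactionIdFollowsOrEquals(snapshot->xmin, MyPgXact->xmin));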
> Also, you don't seem to
> prohibit popping the active snapshot (should that be prohibited
> maybe?) which might bump the initiator's xmin horizon.

I think as long as our transaction snapshot is installed correctly our xmin horizon can't advance; am I missing something?

It's generally OK to pop the active snapshot, as long as you only pop what you pushed. In a worker, you can push additional snapshots on the active snapshot stack and then pop them, but you can't pop the one ParallelWorkerMain installed for you. You'll probably notice if you do, because then the snapshot stack will be empty when you get back to ParallelWorkerMain() and you'll fail an assertion. Similarly, the master shouldn't pop the snapshot that was active at the start of parallelism until parallelism is done, but again, if you did, you'd probably fail an assertion later on. Generally, we haven't had much of a problem with PushActiveSnapshot() and PopActiveSnapshot() calls being unbalanced, so I don't really think this is an area that needs especially strong cross-checks. At least not unless we get some evidence that this is a more common mistake in code that touches parallelism than it is in general.

> * Comment about 'signal_handler' in +HandleParallelMessageInterrupt
> is outdated.

Removed.

> * Is it really a good idea to overlay Z/ReadyForQuery with 'this worker
> exited cleanly'? Shouldn't it at least be T/Terminate? I'm generally
> not sure it's wise to use this faux FE/BE protocol here...

Well, I'm not sure about that either and never have been, but I was even less sure that inventing a new one was any better. We might need a few new protocol messages (or to reuse a few existing ones for other things), but being able to reuse the existing format for ErrorResponse, NoticeResponse, etc. seems like a pretty solid win. Those are reasonably complex message formats, and reimplementing them for no reason seems like a bad idea.

Terminate is 'X', not 'T', and it's a frontend-only message. The worker is speaking the backend half of the protocol. We could use it anyway; that wouldn't be silly. I picked ReadyForQuery because that's what the backend sends when it is completely done processing everything that the user most recently requested, which seems defensible here.

> * HandlingParallelMessages looks dead.

Good catch. Removed.

> * ParallelErrorContext has the wrong comment.

Doh. Fixed.

> * ParallelErrorContext() provides the worker's pid in the context
> message. I guess that's so it's easier to interpret when sent to the
> initiator? It'll look odd when logged by the failing process.

Yes, that's why. Regarding logging, true. I guess the master could add the context instead, although making sure the PID is available looks pretty annoying. At the time we establish the queue, the PID isn't known yet, and by the time we read the error from it, the worker might be gone, such that we can't read its PID. To fix that, we'd have to invent a new protocol message that means "here's my PID". Another alternative is to just say that the error came from a parallel worker (e.g. "while executing in parallel worker") and not mention the PID, but that seems like it'd be losing possibly-useful information.

> * We now have pretty much the same handle_sigterm in a bunch of
> places. Can't we get rid of those? You actually probably can just use
> die().

Good idea. Done.

> * The comments in xact.c above XactTopTransactionId pretty much assume
> that the reader knows that that is about parallel stuff.

What would you suggest? The comment begins "Only a single TransactionStateData is placed on the parallel worker's state stack", which seems like a pretty clear way of giving the reader a hint that we are talking about parallel stuff. An older version of the patch was much less clear, but I thought I'd remedied that.

> * I'm a bit confused by the fact that Start/CommitTransactionCommand()
> emit a generic elog when encountering a PARALLEL_INPROGRESS, whereas
> ReleaseSavepoint()/RollbackTo has a special message. Shouldn't it be
> pretty much impossible to hit a ReleaseSavepoint() with a parallel
> transaction in progress? We'd have had to end the previous transaction
> command while parallel stuff was in progress - right?
> (Internal/ReleaseCurrentSubTransaction is different, that's used in code)

It's pretty simple to hit ReleaseSavepoint() while a parallel transaction is in progress. It's pretty much directly SQL-callable, so a PL function run in a parallel worker could easily hit it, or anything that uses SPI. There are similar checks in e.g. EndTransaction(), but you can't invoke CommitTransactionCommand() directly.

> * Why are you deviating from the sorting order used in other places for
> ParallelCurrentXids? That seems really weird, especially as we use
> something else a couple lines down. Especially as you actually seem to
> send stuff in xidComparator order?

The transaction IDs have to be sorted into some order so that they can be binary-searched, and this seemed simplest. xidComparator sorts in numerical order, not transaction-ID order, so that's how we send them. That turns out to be convenient anyway, because binary search on a sorted array of integers is really simple. If we sent them sorted in transaction-ID order, we'd have to make sure that the reference transaction ID was the same for both backends, which might not be hard, but I don't see how it would be better than this.
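For illustration, the worker-side lookup amounts to no more than this (a sketch only; the helper name is made up):

    /*
     * ParallelCurrentXids arrives from the initiator already sorted with
     * xidComparator, i.e. in plain numerical order, so the worker can use
     * an ordinary binary search with no reference XID needed.
     */
    static bool
    XidIsInParallelCurrentXids(TransactionId xid,
                               int nxids, const TransactionId *xids)
    {
        int         low = 0;
        int         high = nxids - 1;

        while (low <= high)
        {
            int         middle = low + (high - low) / 2;
            TransactionId probe = xids[middle];

            if (probe == xid)
                return true;
            else if (probe < xid)
                low = middle + 1;
            else
                high = middle - 1;
        }

        return false;
    }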
> * I don't think skipping AtEOXact_Namespace() entirely if parallel is a
> good idea. That function already does some other stuff than cleaning
> up temp tables. I think you should pass in parallel and do the skip in
> there.

That's a very good point. Fixed.

> * Start/EndParallelWorkerTransaction assert the current state, whereas
> the rest of the file FATALs in that case. I think it'd actually be
> good to be conservative and do the same in this case.

Well, actually, StartTransaction() does this:

    if (s->state != TRANS_DEFAULT)
        elog(WARNING, "StartTransaction while in %s state",
             TransStateAsString(s->state));

I could copy and paste that code into StartParallelWorkerTransaction() and change WARNING to FATAL, but the first thing StartParallelWorkerTransaction() does is call StartTransaction(). It seems pretty stupid to have two identical tests that differ only in their log level. The same considerations apply to EndParallelWorkerTransaction() and CommitTransaction().

A worthwhile question is why somebody thought that it was a good idea for the log level there to be WARNING rather than FATAL. But I don't think it's this patch's job to second-guess that decision.

> * You're manually passing function names to
> PreventCommandIfParallelMode() in a fair number of cases. I'd either
> try and keep the names consistent with what the functions are actually
> called at the sql level (adding the types in the parens) or just use
> PG_FUNCNAME_MACRO to keep them correct.

I think putting the type names in is too chatty; I'm not aware that we use that style in other error messages. We don't want to lead people to believe that only the form with the particular argument types they used is not OK. PG_FUNCNAME_MACRO will give us the C name, not the SQL name.
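To make the difference concrete, suppose the call site in question is nextval (illustrative only; the exact strings in the patch may differ):

    /* Passing the name the user knows from the SQL level: */
    PreventCommandIfParallelMode("nextval()");

    /*
     * What PG_FUNCNAME_MACRO would produce instead is the C-level symbol,
     * e.g. "nextval_oid", which is not what anyone typed in their query.
     */
    PreventCommandIfParallelMode(PG_FUNCNAME_MACRO);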
> * Wait. You now copy all held relation locks "as is" to the standby? I
> quite doubt that's a good idea, and it's not my reading of the
> conclusions made in the group locking thread. At the very least this
> part needs to be extensively documented. And while
> LockAcquireExtended() refers to
> src/backend/access/transam/README.parallel for details I don't see
> anything pertinent in there. And the function header sounds like the
> only difference is the HS logging - not mentioning that it essentially
> disables lock queuing entirely.
>
> This seems unsafe (e.g. consider if the initiating backend died and
> somebody else acquired the lock, possible e.g. if postmaster died) and
> not even remotely enough discussed. I think this should be removed
> from the patch for now.

If it's broken, we need to identify what's wrong and fix it, not just rip it out. It's possible that something is broken with that code, but it's dead certain that something is broken without it:

rhaas=# select parallel_count('pgbench_accounts', 1);
NOTICE: PID 57956 counted 2434815 tuples
NOTICE: PID 57957 counted 1565185 tuples
CONTEXT: parallel worker, pid 57957
 parallel_count
----------------
        4000000
(1 row)

rhaas=# begin;
BEGIN
rhaas=# lock pgbench_accounts;
LOCK TABLE
rhaas=# select parallel_count('pgbench_accounts', 1);
NOTICE: PID 57956 counted 4000000 tuples

...and then it hangs forever.

On the specific issues:

1. I agree that it's very dangerous for the parallel backend to acquire the lock this way if the master no longer holds it. Originally, I was imagining that there would be no interlock between the master shutting down and the worker starting up, but you and others convinced me that was a bad idea. So now transaction commit or abort waits for all workers to be gone, which I think reduces the scope of possible problems here pretty significantly. However, it's quite possible that it isn't airtight. One thing we could maybe do to make it safer is pass a pointer to the initiator's PGPROC. If we get the lock via the fast-path we are safe anyway, but if we have to acquire the partition lock, then we can cross-check that the initiator's lock is still there. I think that would button this up pretty tight.

2. My reading of the group locking discussions was that everybody agreed that the originally-proposed group locking approach, which involved considering locks from the same locking group as mutually non-conflicting, was not OK. Several specific examples were offered - e.g. it's clearly not OK for two backends to extend a relation at the same time just because they are in the same locking group. So I abandoned that approach. When I instead proposed the approach of copying only the locks that the master *already* had at the beginning of parallelism and considering *only those* as mutually conflicting, I believe I got several comments to the effect that this was "less scary". Considering the topic area, I'm not sure I'm going to do any better than that.

3. I welcome proposals for other ways of handling this problem, even if they restrict the functionality that can be offered. For example, a proposal that would make parallel_count revert to single-threaded mode but terminate without an indefinite wait would be acceptable to me, provided that it offers some advantage in safety and security over what I've already got. A proposal to make parallel_count error out in the above case would not be acceptable to me; the planner must not generate parallel plans that will sometimes fail unexpectedly at execution time.

I generally believe that we will be much happier if application programmers need not worry about the failure of parallel workers to obtain locks already held by the master; some such failure modes may be very complex and hard to predict. The fact that the current approach handles the problem entirely within the lock manager, combined with the fact that it is extremely simple, is therefore very appealing to me. Nonetheless, a test case that demonstrates this approach falling down badly would force a rethink; do you have one? Or an idea about what it might look like?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company