Re: [HACKERS] Transactions involving multiple postgres foreign servers - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: [HACKERS] Transactions involving multiple postgres foreign servers |
Date | |
Msg-id | CA+TgmoYkBfRb3awg2Gpz6FWiVf5POp2Jw3QsUKR-psfvbshYxw@mail.gmail.com Whole thread Raw |
In response to | Re: [HACKERS] Transactions involving multiple postgres foreign servers (Masahiko Sawada <sawada.mshk@gmail.com>) |
Responses |
Re: [HACKERS] Transactions involving multiple postgres foreign servers
|
List | pgsql-hackers |
On Thu, Feb 8, 2018 at 3:58 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> Overall, what's the status of this patch? Are we hung up on this >> issue only, or are there other things? > > AFAIK there is no more technical issue in this patch so far other than > this issue. The patch has tests and docs, and includes all stuff to > support atomic commit to distributed transactions: the introducing > both the atomic commit ability to distributed transactions and some > corresponding FDW APIs, and having postgres_fdw support 2pc. I think > this patch needs to be reviewed, especially the functionality of > foreign transaction resolution which is re-designed before. OK. I'm going to give 0002 a read-through now, but I think it would be a good thing if others also contributed to the review effort. There is a lot of code here, and there are a lot of other patches competing for attention. That said, here we go: In the documentation for pg_prepared_fdw_xacts, the first two columns have descriptions ending in a preposition. That's typically to be avoided in formal writing. The first one can be fixed by moving "in" before "which". The second can be fixed by changing "that" to "with which" and then dropping the trailing "with". The first three columns have descriptions ending in a period; the latter two do not. Make it consistent with whatever the surrounding style is, or at least internally consistent if the surrounding style varies. Also, some of the descriptions begin with "the" and others do not; again, seems better to be consistent and adhere to surrounding style. The documentation of the serverid column seems to make no sense. Possibly you mean "OID of the foreign server on which this foreign transaction is prepared"? As it is, you use "foreign server" twice, which is why I'm confused. The documentation of max_prepared_foreign_transactions seems a bit brief. I think that if I want to be able to use 2PC for N transactions each across K servers, this variable needs to be set to N*K, not just N. That's not the right way to word it here, but somehow you probably want to explain that a single local transaction can give rise to multiple foreign transactions and that this should be taken into consideration when setting a value for this variable. Maybe also include a link to where the user can find more information, which brings me to another point: there doesn't seem to be any general, user-facing explanation of this system. You explain the catalogs, the GUCs, the interface, etc. but there's nothing anywhere that explains the overall point of the whole thing, which seems pretty important. The closest thing you've got is a description for people writing FDWs, but we really need a place to explain the concept to *users*. One idea is to add a new chapter to the "Server Administration" section, maybe just after "Logical Replication" and before "Regression Tests". But I'm open to other ideas. It's important that the documentation of the various GUCs provide users with some clue as to how to set them. I notice this particularly for foreign_transaction_resolution_interval; off the top of my head, without having read the rest of this patch, I don't know why I shouldn't want this to be zero. But the others could use more explanation as well. It is unclear from reading the documentation for GetPreparedId why this should be the responsibility of the FDW, or how the FDW is supposed to guarantee uniqueness. PrepareForignTransaction is spelled wrong. Nearby typos: prepareing, tranasaction. A bit further down, "can _prepare" has an unwanted space in the middle. Various places in this section capitalize certain words in the middle of sentences which is not standard English style. For example, in "Every Foreign Data Wrapper is required..." the word "Every" is appropriately capitalized because it begins a sentence, but there's no reason to capitalize the others. Likewise for "...employs Two-phase commit protocol..." and other similar cases. EndForeignTransaction doesn't explain what sorts of things the FDW can legally do when called, or how this method is intended to be used. Those seem like important details. Especially, an FDW that tries to do arbitrary stuff during abort cleanup will probably cause big problems. The fdw-transactions section of the documentation seems to imply that henceforth every FDW must call FdwXactRegisterForeignServer, which I think is an unacceptable API break. It doesn't seem advisable to make this behavior depend on max_prepared_foreign_transactions. I think that it should be an server-level option: use 2PC for this server, or not? FDWs that don't have 2PC default to "no"; but others can be set to "yes" if the user wishes. But we shouldn't force that decision to be on a cluster-wide basis. + <xref linkend="functions-fdw-transaction-table"/> shows the functions + available for foreign transaction managements. management + Resolve a foreign transaction. This function search foreign transaction searches for + matching the criteria and resolve then. This function doesn't resolve critera->arguments, resolve->resolves, doesn't->won't + an entry of which transaction is in-progress, or that is locked by some a foreign transaction which is in progress, or one that is locked by some This doesn't seem like a great API contract. It would be nice to have the guarantee that, if the function returns without error, all transactions that were prepared before this function was run and which match the given arguments are now resolved. Skipping locked transactions removes that guarantee. + This function works the same as <function>pg_resolve_fdw_xact</function> + except it remove foreign transaction entry without resolving. Explain why that's useful. + <entry>OID of the database that the foreign transaction resolver process connects to</entry> to which the ... is connected + <entry>Time of last resolved a foreign transaction</entry> Time at which the process last resolved a foreign transaction + of foreign trasactions. The new wait events aren't documented. Spelling error. + * This module manages the transactions involving foreign servers. Remove this. Doesn't add any information. + * This comment summarises how the transaction manager handles transactions + * involving one or more foreign servers. This too. + * connection is identified by oid fo foreign server and user. fo -> of + * first phase doesn not succeed for whatever reason, the foreign servers doesn -> does But more generally: + * The commit is executed in two phases. In the first phase executed during + * pre-commit phase, transactions are prepared on all the foreign servers, + * which can participate in two-phase commit protocol. Transaction on other + * foreign servers are committed in the same phase. In the second phase, if + * first phase doesn not succeed for whatever reason, the foreign servers + * are asked to rollback respective prepared transactions or abort the + * transactions if they are not prepared. This process is executed by backend + * process that executed the first phase. If the first phase succeeds, the + * backend process registers ourselves to the queue in the shared memory and then + * ask the foreign transaction resolver process to resolve foreign transactions + * that are associated with the its transaction. After resolved all foreign + * transactions by foreign transaction resolve process the backend wakes up + * and resume to process. The only way this can be reliable, I think, is if we prepare all of the remote transactions before committing locally and commit them after committing locally. Otherwise, if we crash or fail before committing locally, our atomic commit is no longer atomic. I think the way this should work is: during pre-commit, we prepare the transaction everywhere. After committing or rolling back, we notify the resolver process and tell it to try to commit or roll back those transactions. If we ask it to commit, we also tell it to notify us when it's done, so that we can wait (interruptibly) for it to finish, and so that we're not trying to locally do work that might fail with an ERROR after already committed (that would confuse the client). If the user hits ^C, then we handle it like synchronous replication: we emit a WARNING saying that the transaction might not be remotely committed yet, and return success. I see you've got that logic in FdwXactWaitToBeResolved() so maybe this comment isn't completely in sync with the latest version of the patch, but I think there are some remaining ordering problems as well; see below. I think it is, generally, confusing to describe this process as having two phases. For one thing, two-phase commit has two phases, and they're not these two phases, but we're talking about them in a patch about two-phase commit. But also, they really aren't two phases. Saying something has two phases means that A happens and then B happens. But, at least as this is presently described here, B might not need to happen at all. So that's not really a phase. I think you need a different word here, like maybe "ways", unless I'm just misunderstanding what this whole thing is saying. + * * RecoverPreparedTrasactions() and StandbyRecoverPreparedTransactions() + * have been modified to go through fdw_xact->inredo entries that have + * not made to disk yet. This doesn't seem to be true. I see no reference to these functions being modified elsewhere in the patch. Nor is it clear why they would need to be modified. For local 2PC, prepared transactions need to be included in all snapshots that are taken. Otherwise, if a local 2PC transaction were committed, concurrent transactions would see the effects of that transaction appear all at once, even though they hadn't gotten a new snapshot. That is the reason why we need StandbyRecoverPreparedTransactions() before opening for hot standby. But for FDW 2PC, even if we knew which foreign transactions were prepared but not yet committed, we have no mechanism for preventing those changes from being visible on the remote servers, nor do they have any effect on local visibility. So there's no need for this AFAICS. Similarly, this appears to me to be incorrect: + RecoveryRequiresIntParameter("max_prepared_foreign_transactions", + max_prepared_foreign_xacts, + ControlFile->max_prepared_foreign_xacts); I might be confused here, but it seems to me that the value of max_prepared_foreign_transactions is irrelevant to whether we can initialize Hot Standby, because, again, those remote xacts have no effect on our local snapshot. Rather, the problem is that we are *altogether unable to proceed with recovery* if this value is too low, regardless of whether we are doing Hot Standby or not. twophase.c handles that by just erroring out in PrepareRedoAdd() if we run out of slots, and insert_fdw_xact does the same thing (although that message doesn't follow style guidelines -- no space before a colon, please!). So it seems to me that you can just delete this code and the associated documentation mention; these concerns are irrelevant here and the actual failure case is otherwise handled. + * We save them to disk and alos set fdw_xact->ondisk to true. + * * RecoverPreparedTrasactions() and StandbyRecoverPreparedTransactions() + errmsg("prepread foreign transactions are disabled"), + errmsg("out of foreign trasanction resolver slots"), More typos: alos, RecoverPreparedTrasactions, prepread, trasanction +#include <sys/types.h> +#include <sys/stat.h> +#include <unistd.h> + +#include "postgres.h" Thou shalt not have any #includes before "postgres.h" +#include "miscadmin.h" +#include "funcapi.h" These should be in the main alphabetized list. If that doesn't work, then some header is broken. + if (fdw_conn->serverid == serverid && fdw_conn->userid == userid) + { + fdw_conn->modified |= modify; I suggest avoiding |= with Boolean. It might be harmless, but it would be just as easy to write this as if (existing conditions && !fdw_conn->modified) fdw_conn->modified = true, which avoids any assumption about the bit patterns. + max_hash_size = max_prepared_foreign_xacts; + init_hash_size = max_hash_size / 2; I think we decided that hash tables other than the lock manager should initialize with maximum size = initial size. See 7c797e7194d969f974abf579cacf30ffdccdbb95. + if (list_length(MyFdwConnections) < 1) How about if (MyFdwConnections) == NIL? This occurs multiple times in the patch. + if (list_length(MyFdwConnections) == 0) How about if (MyFdwConnections) == NIL? + if ((list_length(MyFdwConnections) > 1) || + (list_length(MyFdwConnections) == 1 && (MyXactFlags & XACT_FLAGS_WROTENONTEMPREL))) + return true; I think this would be clearer written this way: int nserverswritten = list_length(MyFdwConnections); if (MyXactFlags & XACT_FLAGS_WROTENONTEMPREL != 0) ++nserverswritten; return nserverswritten > 1; But that brings up another issue: why is MyFdwConnections named that way and why does it have those semantics? That is, why do we really need a list of every FDW connection? I think we really only need the ones that are 2PC-capable writes. If a non-2PC-capable foreign server is involved in the transaction, then we don't really to keep it in a list. We just need to flag the transaction as not locally prepareable i.e. clear TwoPhaseReady. I think that this could all be figured out in a much less onerous way: if we ever perform a modification of a foreign table, have nodeModifyTable.c either mark the transaction non-preparable by setting XACT_FLAGS_FDWNOPREPARE if the foreign server is not 2PC capable, or otherwise add the appropriate information to MyFdwConnections, which can then be renamed to indicate that it contains only information about preparable stuff. Then you don't need each individual FDW to be responsible about calling some new function; the core code just does the right thing automatically. + if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid, + fdw_conn->umid, true)) + elog(WARNING, "could not commit transaction on server %s", + fdw_conn->servername); First, as I noted above, this assumes that the local transaction can't fail after this point, which is certainly false -- if nothing else, think about a plug pull. Second, it assumes that the right thing to do would be to throw a WARNING, which I think is false: if this is running in the pre-commit phase, it's not too late to switch to the abort path, and certainly if we make changes on only 1 server and the commit on that server fails, we should be rolling back, not committing with a warning. Third, if we did need to restrict ourselves to warnings here, that's probably impractical. This function needs to do network I/O, which is not a no-fail operation, and even if it works, the remote side can fail in any arbitrary way. +FdwXactRegisterFdwXact(Oid dbid, TransactionId xid, Oid serverid, Oid userid, This is not a very clear function name. Apparently, the thing that is doing the registration is an FdwXact, and the thing being registered is also an FdwXact. + /* + * Between FdwXactRegisterFdwXact call above till this backend hears back + * from foreign server, the backend may abort the local transaction + * (say, because of a signal). During abort processing, it will send + * an ABORT message to the foreign server. If the foreign server has + * not prepared the transaction, the message will succeed. If the + * foreign server has prepared transaction, it will throw an error, + * which we will ignore and the prepared foreign transaction will be + * resolved by a foreign transaction resolver. + */ + if (!fdw_conn->prepare_foreign_xact(fdw_conn->serverid, fdw_conn->userid, + fdw_conn->umid, fdw_xact_id)) + { Again, I think this is an impractical API contract. It assumes that prepare_foreign_xact can't throw errors, which is likely to make for a lot of implementation problems on the FDW side -- you can't even palloc. You can't call any existing code you've already got that might throw an error for any reason. You definitely can't do a syscache lookup. You can't accept interrupts, even though you're doing network I/O that could hang. The only reason to structure it this way is to avoid having a transaction that we think is prepared in our local bookkeeping when, on the remote side, it really isn't. But that is an unavoidable problem, because the whole system could crash after the remote prepare has been done and before prepare_foreign_xact even returns control. Or we could fail after XLogInsert() and before XLogFlush(). AtEOXact_FdwXacts has the same problem. I think you can fix make this a lot cleaner if you make it part of the API contract that resolver shouldn't fail when asked to roll back a transaction that was never prepared. Then it can work like this: just before we prepare the remote transaction, we note the information in shared memory and in the WAL. If that fails, we just abort. When the resolver sees that our xact is no longer running and did not commit, it concludes that we must have failed and tries to roll back all the remotely-prepared xacts. If they don't exist, then the PREPARE failed; if they do, then we roll 'em back. On the other hand, if all of the prepares succeed, then we commit. Seeing that our XID is no longer running and did commit, the resolver tries to commit those remotely-prepared xacts. In the commit case, we log a complaint if we don't find the xact, but in the abort case, it's got to be an expected outcome. If we really wanted to get paranoid about this, we could log a WAL record after finishing all the prepares -- or even each prepare -- saying "hey, after this point the remote xact should definitely exist!". And then the resolver could complaints a nonexistent remote xact when rolling back only if that record hasn't been issued yet. But to me that looks like overkill; the window between issuing that record and issuing the actual commit record would be very narrow, and in most cases they would be flushed together anyway. We could of course force the new record to be separately flushed first, but that's just making everything slower for no gain. Note that this requires that we not truncate away portions of clog that contain commit status information about no-longer-running transactions that have unresolved FDW 2PC xacts, or at least that we issue a WAL record updating the state of the fdw_xact so that it doesn't refer to that portion of clog any more - e.g. by setting the XID to either FrozenTransactionId or InvalidTransactionId, though that would be a problem since it would break the unique-XID assumption. I don't see the patch doing either of those things right now, although I may have missed it. Note that here again the differences between local 2PC and FDW 2PC really make a difference. Local 2PC has a PGPROC+PGXACT, so the regular snaphot-taking code suffices to prevent clog truncation, because the relevant XIDs are showing up in snapshots. The PrescanPreparedTransactions stuff only needs to nail things down until we reach consistency, and then the regular mechanisms take over. We have no such easy out for this system. + if (unlink(path)) + if (errno != ENOENT || giveWarning) Poor style. Use &&, not nested ifs. Oh, I guess you copied this from the twophase.c code; but let's fix it anyway. It's not exactly clear to me what the point of "locked" FdwXacts is, but whatever that point may be, how can remove_fdw_xact() get away with summarily releasing a lock that may be held by some other process? If we're the process that has the FdwXact locked, then we'll delete it from MyLockedFdwXacts, but if some other process has it locked, nothing will happen. If that's safe for some non-obvious reason, it at least needs a comment. I think this whole function could be written with a lot less nesting if you first write a loop to find the appropriate value for cnt, then error out if we end up with cnt >= FdwXactCtl->numFdwXacts, and then finally do all of the stuff that happens once we identify a match. That saves two levels of indentation for most of the function. The delayCkpt interlocking which twophase.c uses is absent here. That's important, because otherwise a checkpoint can happen between the time we write to WAL and the time we actually perform the on-disk operations. If a crash happens just before the operation is actually performed, then it never happens on the master but still happens on the standbys. Oops. +void +FdwXactRedoRemove(TransactionId xid, Oid serverid, Oid userid) +{ + FdwXact fdw_xact; + + Assert(RecoveryInProgress()); + + fdw_xact = get_fdw_xact(xid, serverid, userid); + + if (fdw_xact) + { + /* Now we can clean up any files we already left */ + Assert(fdw_xact->inredo); + remove_fdw_xact(fdw_xact); + } + else + { + /* + * Entry could be on disk. Call with giveWarning = false + * since it can be expected during replay. + */ + RemoveFdwXactFile(xid, serverid, userid, false); + } +} I hope this won't sound too harsh, but he phrase that comes to mind here is "spaghetti code". First, we look for a matching FdwXact in shared memory and, if we find one, do all of the cleanup inside remove_fdw_xact() which also removes it from the disk. Otherwise, we try to remove it from disk anyway. if (condition) do_a_and_b(); else do_b(); is not generally a good way to structure code. Moreover, it's not clear why we should be doing it like this in the first place. There's no similar logic in twophase.c; PrepareRedoRemove does nothing on disk if the state isn't found in memory. The fact that you've got this set up this way suggests that you haven't structured things so as to guarantee that the in-memory state is always accurate. If so, that should be fixed; if not, this code isn't needed anyway. + if (!fdwXactExitRegistered) + { + before_shmem_exit(AtProcExit_FdwXact, 0); + fdwXactExitRegistered = true; + } Sadly, this code has a latent hazard. If somebody ever calls this from inside a PG_ENSURE_ERROR_CLEANUP() block, then they can end up failing to unregister their handler, because of the limitations described in before_shmem_cleanup()'s handler. It's better to do this in FdwXactShmemInit(). + if (list_length(entries_to_resolve) == 0) Here again, just test against NIL. fdwxact_resolver.c is very light on comments. +static +void FdwXactRslvLoop(void) Not project style. There are other, similar instances. + need_twophase = TwoPhaseCommitRequired(); I think this nomenclature is going to cause confusion. We need to distinguish somehow between using remote 2PC for foreign transactions and using local 2PC. The TwoPhase naming is already well-established as referring to the latter, so I think we should name this some other way. + if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srvId, InvalidOid)) + { + Form_pg_foreign_server srvForm = (Form_pg_foreign_server) GETSTRUCT(tp); + ereport(ERROR, + (errmsg("server \"%s\" has unresolved prepared transactions on it", + NameStr(srvForm->srvname)))); + } I think if this happens, it would be more appropriate to just issue a WARNING and forget about those transactions. Blocking DROP is not nice, and shouldn't be done without a really good reason. + (errmsg("preparing foreign transactions (max_prepared_foreign_transactions > 0) requires maX_foreign_xact_resolvers > 0"))); Bogus capitalization. +#define FDW_XACT_ID_LEN (2 + 1 + 8 + 1 + 8 + 1 + 8) I am very much not impressed by this uncommented macro definition. You can probably guess the reason. :-) + ereport(WARNING, (errmsg("could not resolve dangling foreign transaction for xid %u, foreign server %u and user %d", + fdwxact->local_xid, fdwxact->serverid, fdwxact->userid))); Formatting is wrong. My ability to concentrate on this patch is just about exhausted for today so I think I'm going to have to stop here. But in general I would say this patch still needs a lot of work. As noted above, the concurrency, crash-safety, and error-handing issues don't seem to have been thought through carefully enough, and there are even a fairly large number of trivial spelling errors and coding and/or message style violations. Comments are lacking in some places where they are clearly needed. There seems to be a fair amount of work needed to ensure that each thing has exactly one name: not two, and not a shared name with something else, and that all of those names are clear. There are a few TODO items remaining in the code. I think that it is going to take a significant effort to get all of this cleaned up. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: