Thread: abort-time portal cleanup
I've been studying At{Sub,}{Abort,Cleanup}_Portals() for the last few days and have come to the conclusion that the code is not entirely up to our usual standards. I believe that a good deal of the reason for this is attributable to the poor quality of the code comments in this area, although there are perhaps some other contributing factors as well, such as bullheadedness on my part and perhaps others. The trouble starts with the header comment for AtAbort_Portals, which states that "At this point we run the cleanup hook if present, but we can't release the portal's memory until the cleanup call." At the time this logic was introduced in commit de28dc9a04c4df5d711815b7a518501b43535a26 (2003-05-02), AtAbort_Portals() affected all non-held portals without caring whether they were active or not, and, UserAbortTransactionBlock() called AbortTransaction() directly, so typing "ROLLBACK;" would cause AtAbort_Portals() to be reached from within PortalRun(). Even if PortalRun() managed to return without crashing, the caller would next try to call PortalDrop() on what was now an invalid pointer. However, commit 8f9f1986034a2273e09ad10671e10d1adda21d1f (2004-09-16) changed things so that UserAbortEndTransaction() just sets things up so that the subsequent call to CommitTransactionCommand() would call AbortTransaction() instead of trying to do it right away, and that doesn't happen until after we're done with the portal. As far as I can see, that change made this comment mostly false, but the comment has nevertheless managed to survive for another ~15 years. I think we can, and in fact should, just drop the portal right here. As far as actually making that work, there are a few wrinkles. The first is that we might be in the middle of FATAL. In that case, unlike the ROLLBACK case, a call to PortalRun() is still on the stack, but we'll exit the process rather than returning, so the fact that we've created a dangling pointer for the caller won't matter. However, as shown by commit ad9a274778d2d88c46b90309212b92ee7fdf9afe (2018-02-01) and the report that led up to it at https://www.postgresql.org/message-id/20180128034531.h6o4w3727ifof3jy%40alap3.anarazel.de it's not a good idea to try to clean up the portal in that case, because we might've already started shutting down critical systems. It seems not only risky but also unnecessary: our process-local state is about to go away, and the executor shouldn't need to clean up any shared memory state that won't also get cleaned up by some other mechanism. So, it seems to me that if we reach AtAbort_Portals() during FATAL processing, we should either (1) do nothing at all and just return or (2) forget about all the existing portals without cleaning them up and then return. The second option seems a little safer to me, because it guarantees that if we somehow reach code that might try to look up a portal later, it won't find anything. But I think it's arguable. The second wrinkle is that there might be an active portal. Apart from the FATAL case already mentioned, I think the only way this can happen is some C code that calls purposefully calls AbortTransaction() in the middle of executing a command. It can't be an ERROR, because then the portal would be marked as failed before we get here, and it can't be an explicit ROLLBACK, because as noted above, that case was changed 15 years ago. It's got to be some other case where C code calls AbortTransaction() voluntarily in the middle of a statement. For over a decade, there were no cases at all of this type, but the code in this function catered to hypothetical cases by marking the portal failed. By 2016, Noah had figured out that this was bogus, and that any future cases would likely require different handling, but Tom and I shouted him down: http://postgr.es/m/67674.1454259004@sss.pgh.pa.us The end result of that discussion was commit 41baee7a9312eefb315b6b2973ac058c9efaa9cf (2016-02-05) which left the code as it was but added comments nothing that it was wrong. It actually wasn't entirely wrong, because it handled the FATAL case mentioned above by the byzantine mechanism of invoking the portal's cleanup callback after first setting the status to PORTAL_FAILED. Since the only existing cleanup callback arranges to do nothing if the status is PORTAL_FAILED, this worked out to a complicated way of (correctly) skipping callback in the FATAL case. But, probably because that wasn't documented in any understandable way, possibly because nobody really understood it, when commit 8561e4840c81f7e345be2df170839846814fa004 (2018-01-22) added support for transaction control in procedures, it just removed the code marking active portals as failed, just as Noah had predicted would be necessary ~2 years earlier. Andres noticed that this broke the FATAL case and tracked it back to the removal of this code, resulting it it getting put back, but just for the FATAL case, in commit ad9a274778d2d88c46b90309212b92ee7fdf9afe (2018-02-01). See also discussion at: https://www.postgresql.org/message-id/20180128034531.h6o4w3727ifof3jy%40alap3.anarazel.de I think that the code here still isn't really right. Apart from the fact that the comments don't explain anything very clearly and the FATAL case is handled in a way that looks overly complicated and accidental, the handling in AtSubAbort_Portals() hasn't been updated, so it's now inconsistent with AtAbort_Portals() as to both substance and comments. I think that's only held together because stored procedures don't yet support explicit control over subtransactions, only top-level transactions. Stepping back a bit, stored procedures are a good example of a command that uses multiple transactions internally. We have a few others, such as VACUUM, but at present, that case only commits transactions internally; it does not roll them back internally. If it did, it would want the same thing that procedures want, namely, to leave the active portal alone. It doesn't quite work to leave the active portal completely alone, because the portal has got a pointer to a ResourceOwner which is about to be freed by end-of-transaction cleanup; at the least, we've got to clear the pointer to that when the multi-transaction statement eventually finishes in some future transaction, it doesn't try to do anything with a dangling pointer. But note that the resource owner can't really be in use in this situation anyway; if it is, then the transaction abort is about to blow away resources that the statement still needs. Similarly, the statement can't be using any transaction-local memory, because that too is about to get blown away. The only thing that can work at all here is a statement that's been carefully designed to be able to survive starting and ending transactions internally. Such a statement must not rely on any transaction-local resources. The only thing I'm sure we have to do is set portal->resowner to NULL. Calling the cleanup hook, as the current code does, can't be right, because we'd be cleaning up something that isn't going away. I think it only works now because this is another case where the cleanup hook arranges to do nothing in the cases where calling it is wrong in the first place. The current code also calls PortalReleaseCachedPlan in this case; I'm not 100% certain whether that's appropriate or not. Attached is a patch that (1) removes At(Sub)Cleanup_Portals() entirely in favor of dropping portals on the spot in At(Sub)Abort_Portals(), (2) overhauls the comments in this area, and (3) makes AtSubAbort_Portals() consistent with AtAbort_Portals(). One possible concern here - which Andres mentioned to me during off-list discussion - is that someone might create a portal for a ROLLBACK statement, and then try to execute that portal after an error has occurred. In theory, keeping the portal around between abort time and cleanup time might allow this to work, but it actually doesn't work, so there's no functional regression. As noted by Amit Kapila also during off-list discussion, IsTransactionStmtList() only works if portal->stmts is set, and PortalReleaseCachedPlan() clears portal->stmts, so once we've aborted the transaction, any previously-created portals are unrecognizable as transaction statements. This area is incredibly confusing to understand, so it's entirely possible that I've gotten some things wrong. However, I think it's worth the effort to try to do some cleanup here, because I think it's overly complex and under-documented. On top of the commits already mentioned, some of which I think demonstrate that the issues here haven't been completely understood, I found other bug fix commits that look like bugs that might've never happened in the first place if this weren't all so confusing. Feedback appreciated, -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Fri, Sep 13, 2019 at 2:13 AM Robert Haas <robertmhaas@gmail.com> wrote: > > I've been studying At{Sub,}{Abort,Cleanup}_Portals() for the last few > days and have come to the conclusion that the code is not entirely up > to our usual standards. I believe that a good deal of the reason for > this is attributable to the poor quality of the code comments in this > area, although there are perhaps some other contributing factors as > well, such as bullheadedness on my part and perhaps others. > > The trouble starts with the header comment for AtAbort_Portals, which > states that "At this point we run the cleanup hook if present, but we > can't release the portal's memory until the cleanup call." At the time > this logic was introduced in commit > de28dc9a04c4df5d711815b7a518501b43535a26 (2003-05-02), > AtAbort_Portals() affected all non-held portals without caring whether > they were active or not, and, UserAbortTransactionBlock() called > AbortTransaction() directly, so typing "ROLLBACK;" would cause > AtAbort_Portals() to be reached from within PortalRun(). Even if > PortalRun() managed to return without crashing, the caller would next > try to call PortalDrop() on what was now an invalid pointer. However, > commit 8f9f1986034a2273e09ad10671e10d1adda21d1f (2004-09-16) changed > things so that UserAbortEndTransaction() just sets things up so that > the subsequent call to CommitTransactionCommand() would call > AbortTransaction() instead of trying to do it right away, and that > doesn't happen until after we're done with the portal. As far as I > can see, that change made this comment mostly false, but the comment > has nevertheless managed to survive for another ~15 years. I think we > can, and in fact should, just drop the portal right here. > > As far as actually making that work, there are a few wrinkles. The > first is that we might be in the middle of FATAL. In that case, unlike > the ROLLBACK case, a call to PortalRun() is still on the stack, but > we'll exit the process rather than returning, so the fact that we've > created a dangling pointer for the caller won't matter. However, as > shown by commit ad9a274778d2d88c46b90309212b92ee7fdf9afe (2018-02-01) > and the report that led up to it at > https://www.postgresql.org/message-id/20180128034531.h6o4w3727ifof3jy%40alap3.anarazel.de > it's not a good idea to try to clean up the portal in that case, > because we might've already started shutting down critical systems. > It seems not only risky but also unnecessary: our process-local state > is about to go away, and the executor shouldn't need to clean up any > shared memory state that won't also get cleaned up by some other > mechanism. So, it seems to me that if we reach AtAbort_Portals() > during FATAL processing, we should either (1) do nothing at all and > just return or (2) forget about all the existing portals without > cleaning them up and then return. The second option seems a little > safer to me, because it guarantees that if we somehow reach code that > might try to look up a portal later, it won't find anything. But I > think it's arguable. > I agree with your position on this. > > Attached is a patch that (1) removes At(Sub)Cleanup_Portals() entirely > in favor of dropping portals on the spot in At(Sub)Abort_Portals(), > (2) overhauls the comments in this area, and (3) makes > AtSubAbort_Portals() consistent with AtAbort_Portals(). The overall idea seems good to me, but I have a few comments on the changes. 1. @@ -2756,7 +2756,6 @@ CleanupTransaction(void) /* * do abort cleanup processing */ - AtCleanup_Portals(); /* now safe to release portal memory */ AtEOXact_Snapshot(false, true); /* and release the transaction's snapshots */ CurrentResourceOwner = NULL; /* and resource owner */ @@ -5032,8 +5031,6 @@ CleanupSubTransaction(void) elog(WARNING, "CleanupSubTransaction while in %s state", TransStateAsString(s->state)); - AtSubCleanup_Portals(s->subTransactionId); - After this cleanup, I think we don't need At(Sub)Abort_Portals in AbortOutOfAnyTransaction() for the states TBLOCK_(SUB)ABORT and friends. This is because AbortTransaction itself would have zapped the portal. 2. You seem to forgot removing AtCleanup_Portals() from portal.h 3. /* - * If it was created in the current transaction, we can't do normal - * shutdown on a READY portal either; it might refer to objects - * created in the failed transaction. See comments in - * AtSubAbort_Portals. - */ - if (portal->status == PORTAL_READY) - MarkPortalFailed(portal); - Why it is safe to remove this check? It has been explained in commit 7981c342 why we need that check. I don't see any explanation in email or patch which justifies this code removal. Is it because you removed PortalCleanup? If so, that is still called from PortalDrop? 4. -AtCleanup_Portals(void) -{ - HASH_SEQ_STATUS status; - PortalHashEnt *hentry; - - hash_seq_init(&status, PortalHashTable); - - while ((hentry = (PortalHashEnt *) hash_seq_search(&status)) != NULL) - { - Portal portal = hentry->portal; - - /* - * Do not touch active portals --- this can only happen in the case of - * a multi-transaction command. + * If the status is PORTAL_ACTIVE, then we must be executing a command + * that uses multiple transactions internally. In that case, the + * command in question must be one that does not internally rely on + * any transaction-lifetime resources, because they would disappear + * in the upcoming transaction-wide cleanup. */ if (portal->status == PORTAL_ACTIVE) I am not able to understand how we can reach with the portal state as 'active' for a multi-transaction command. It seems wherever we mark portal as active, we don't relinquish the control unless its state is changed. Can you share some example where this can happen? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Sep 13, 2019 at 2:13 AM Robert Haas <robertmhaas@gmail.com> wrote: > /* * Otherwise, do nothing to cursors held over from a previous * transaction. */ if (portal->createSubid == InvalidSubTransactionId) continue; /* * Do nothing to auto-held cursors. This is similar to the case of a * cursor from a previous transaction, but it could also be that the * cursor was auto-held in this transaction, so it wants to live on. */ if (portal->autoHeld) continue; I have one doubt that why do we need the second check. Because before setting portal->autoHeld to true we always call HoldPortal therein we set portal->createSubid to InvalidSubTransactionId. So it seems to me that the second condition will never reach. Am I missing something? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Hi, On 2019-09-12 16:42:46 -0400, Robert Haas wrote: > The trouble starts with the header comment for AtAbort_Portals, which > states that "At this point we run the cleanup hook if present, but we > can't release the portal's memory until the cleanup call." At the time > this logic was introduced in commit > de28dc9a04c4df5d711815b7a518501b43535a26 (2003-05-02), > AtAbort_Portals() affected all non-held portals without caring whether > they were active or not, and, UserAbortTransactionBlock() called > AbortTransaction() directly, so typing "ROLLBACK;" would cause > AtAbort_Portals() to be reached from within PortalRun(). Even if > PortalRun() managed to return without crashing, the caller would next > try to call PortalDrop() on what was now an invalid pointer. However, > commit 8f9f1986034a2273e09ad10671e10d1adda21d1f (2004-09-16) changed > things so that UserAbortEndTransaction() just sets things up so that > the subsequent call to CommitTransactionCommand() would call > AbortTransaction() instead of trying to do it right away, and that > doesn't happen until after we're done with the portal. As far as I > can see, that change made this comment mostly false, but the comment > has nevertheless managed to survive for another ~15 years. I think we > can, and in fact should, just drop the portal right here. Nice digging. > As far as actually making that work, there are a few wrinkles. The > first is that we might be in the middle of FATAL. In that case, unlike > the ROLLBACK case, a call to PortalRun() is still on the stack, but > we'll exit the process rather than returning, so the fact that we've > created a dangling pointer for the caller won't matter. However, as > shown by commit ad9a274778d2d88c46b90309212b92ee7fdf9afe (2018-02-01) > and the report that led up to it at > https://www.postgresql.org/message-id/20180128034531.h6o4w3727ifof3jy%40alap3.anarazel.de > it's not a good idea to try to clean up the portal in that case, > because we might've already started shutting down critical systems. > It seems not only risky but also unnecessary: our process-local state > is about to go away, and the executor shouldn't need to clean up any > shared memory state that won't also get cleaned up by some other > mechanism. So, it seems to me that if we reach AtAbort_Portals() > during FATAL processing, we should either (1) do nothing at all and > just return or (2) forget about all the existing portals without > cleaning them up and then return. The second option seems a little > safer to me, because it guarantees that if we somehow reach code that > might try to look up a portal later, it won't find anything. But I > think it's arguable. Hm. Doing that cleanup requires digging through all the portals etc. I'd rather rely on less state being correct than more during FATAL processing. > The second wrinkle is that there might be an active portal. Apart > from the FATAL case already mentioned, I think the only way this can > happen is some C code that calls purposefully calls AbortTransaction() > in the middle of executing a command. It can't be an ERROR, because > then the portal would be marked as failed before we get here, and it > can't be an explicit ROLLBACK, because as noted above, that case was > changed 15 years ago. It's got to be some other case where C code > calls AbortTransaction() voluntarily in the middle of a statement. For > over a decade, there were no cases at all of this type, but the code > in this function catered to hypothetical cases by marking the portal > failed. By 2016, Noah had figured out that this was bogus, and that > any future cases would likely require different handling, but Tom and > I shouted him down: Hm. But wouldn't doing so run into a ton of problems anyway? I mean we need to do a ton of checks and special case hangups to make CommitTransactionCommand();StartTransactionCommand(); work for VACUUM, CIC, ... The cases where one can use AbortTransaction() (via AbortCurrentTransaction() presumably) are ones where either there's no surrounding code relying on the transaction (e.g. autovacuum, postgres.c), or where special care has been taken with portals (e.g. _SPI_rollback()). We didn't have the pin mechanism back then, so I think even if we accept your/Tom's reasoning from back then (I don't really), it's outdated now that the pin mechanism exists. I'd be happy if we added some defenses against such bogus cases being introduced (i.e. erroring out if we encounter an active portal during abort processing). > Stepping back a bit, stored procedures are a good example of a command > that uses multiple transactions internally. We have a few others, such > as VACUUM, but at present, that case only commits transactions > internally; it does not roll them back internally. If it did, it > would want the same thing that procedures want, namely, to leave the > active portal alone. It doesn't quite work to leave the active portal > completely alone, because the portal has got a pointer to a > ResourceOwner which is about to be freed by end-of-transaction > cleanup; Well, that's why _SPI_commit()/_SPI_rollback() do a HoldPinnedPortals(), which, via HoldPortal(), sets portal->resowner to = NULL. > The current code also calls PortalReleaseCachedPlan in this case; I'm > not 100% certain whether that's appropriate or not. I think it's appropriate, because we cannot guarantee that the plan is still usable. Besides normal plan invalidation issues, the catalog contents the plan might rely on might have only existed in the aborted transaction - which seems like a fatal problem. That's why holding portals persists the portalstore. Which then also obsoletes the plan. > Attached is a patch that (1) removes At(Sub)Cleanup_Portals() entirely > in favor of dropping portals on the spot in At(Sub)Abort_Portals(), > (2) overhauls the comments in this area, and (3) makes > AtSubAbort_Portals() consistent with AtAbort_Portals(). One possible > concern here - which Andres mentioned to me during off-list discussion > - is that someone might create a portal for a ROLLBACK statement, and > then try to execute that portal after an error has occurred. Besides not quite happening as I though, as you reference below, I think that only mattered if a rollback gets executed in a failed transaction - that has to happen after a sync message, because otherwise we'd just skip the command. But to be problematic, the bind would have to be from before the sync, and the exec from after - which'd be an extremely absurd use of the protocol. > /* > * Abort processing for portals. > * > - * At this point we run the cleanup hook if present, but we can't release the > - * portal's memory until the cleanup call. > + * Most portals don't and shouldn't survive transaction abort, but there are > + * some important special cases where they do and must: (1) held portals must > + * survive by definition, and (2) any active portal must be part of a command > + * that uses multiple transactions internally, and needs to survive until > + * execution of that command has completed. Hm. Why are active, rather than pinnned, portals relevant here? Normal multi-transactional commands (e.g. vacuum, CIC) shouldn't get here with an active portal, as they don't catch errors. And procedures should have pinned the relevant portals? Greetings, Andres Freund
On Tue, Sep 24, 2019 at 4:45 PM Andres Freund <andres@anarazel.de> wrote: > Hm. Doing that cleanup requires digging through all the portals etc. I'd > rather rely on less state being correct than more during FATAL > processing. I agree with the principle, but the advantage of removing the hash table entries is that it protects against any other code that might try to access the portal machinery later; I thought that was worthwhile enough to justify doing this here. I don't feel super-strongly about it, but I do still like that approach. > > The second wrinkle is that there might be an active portal. Apart > > from the FATAL case already mentioned, I think the only way this can > > happen is some C code that calls purposefully calls AbortTransaction() > > in the middle of executing a command. It can't be an ERROR, because > > then the portal would be marked as failed before we get here, and it > > can't be an explicit ROLLBACK, because as noted above, that case was > > changed 15 years ago. It's got to be some other case where C code > > calls AbortTransaction() voluntarily in the middle of a statement. For > > over a decade, there were no cases at all of this type, but the code > > in this function catered to hypothetical cases by marking the portal > > failed. By 2016, Noah had figured out that this was bogus, and that > > any future cases would likely require different handling, but Tom and > > I shouted him down: > > Hm. But wouldn't doing so run into a ton of problems anyway? I mean we > need to do a ton of checks and special case hangups to make > CommitTransactionCommand();StartTransactionCommand(); work for VACUUM, > CIC, ... Right... those are the kinds of cases I'm talking about here, just for abort rather than commit. > The cases where one can use AbortTransaction() (via > AbortCurrentTransaction() presumably) are ones where either there's no > surrounding code relying on the transaction (e.g. autovacuum, > postgres.c), or where special care has been taken with portals > (e.g. _SPI_rollback()). We didn't have the pin mechanism back then, so > I think even if we accept your/Tom's reasoning from back then (I don't > really), it's outdated now that the pin mechanism exists. It isn't, actually. To respond to this and also your question below about why I'm looking at active portal rather than pinned portals, try adding this debugging code to AtAbort_Portals: + if (portal->status == PORTAL_ACTIVE) + elog(NOTICE, "this portal is ACTIVE and %spinned", + portal->portalPinned ? "" : "NOT "); Then run 'make -C src/pl/plpgsql check' and check src/pl/plpgsql/src/regression.diffs and you'll see a whole lot of this: +NOTICE: this portal is ACTIVE and NOT pinned The PLs pin the portals they generate internally, but they don't force the surrounding portal in which the toplevel query is executing to be pinned. AFAICT, pinning is mostly guarding against explicit user-initiated drops of portals that were automatically generated by a PL, whereas the portal's state is about tracking what the system is doing with the portal. (I think this could be a lot better documented than it is, but looking at the commit history, I'm fairly sure that's what is happening here.) > I'd be happy if we added some defenses against such bogus cases being > introduced (i.e. erroring out if we encounter an active portal during > abort processing). Erroring out during error handling is probably a bad idea, but also, see above. > > Stepping back a bit, stored procedures are a good example of a command > > that uses multiple transactions internally. We have a few others, such > > as VACUUM, but at present, that case only commits transactions > > internally; it does not roll them back internally. If it did, it > > would want the same thing that procedures want, namely, to leave the > > active portal alone. It doesn't quite work to leave the active portal > > completely alone, because the portal has got a pointer to a > > ResourceOwner which is about to be freed by end-of-transaction > > cleanup; > > Well, that's why _SPI_commit()/_SPI_rollback() do a HoldPinnedPortals(), > which, via HoldPortal(), sets portal->resowner to = NULL. Right, but it's still necessary for AtAbort_Portals() to do the same thing, again because the top-level portal isn't pinned. If you apply my patch, comment out the line that does portal->resowner = NULL; for an active portal, and run make -C src/pl/plpgsql check, it will seg fault inside exec_simple_query -> PortalDrop -> ResourceOwnerRelease -> etc. > > The current code also calls PortalReleaseCachedPlan in this case; I'm > > not 100% certain whether that's appropriate or not. > > I think it's appropriate, because we cannot guarantee that the plan is > still usable. Besides normal plan invalidation issues, the catalog > contents the plan might rely on might have only existed in the aborted > transaction - which seems like a fatal problem. That's why holding > portals persists the portalstore. Which then also obsoletes the plan. In a case like this, it can't really be an actual planable statement. If the executor were involved, aborting the running transaction and starting a new one would certainly result in a crash, because the memory context in which we had saved all of our working state would be vanish out from under us. It's got to be a procedure call or maybe some kind of DDL command that has been specially-crafted to survive a mid-command abort. It's not clear to me that the issues for plan invalidation and catalog contents are the same in those kinds of cases as they are for planable statements, but perhaps they are. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Sep 24, 2019 at 5:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > After this cleanup, I think we don't need At(Sub)Abort_Portals in > AbortOutOfAnyTransaction() for the states TBLOCK_(SUB)ABORT and > friends. This is because AbortTransaction itself would have zapped the > portal. Not if the ROLLBACK itself failed - in that case, the portal would have been active at the time, and thus not subject to removal. And, as the existing comments in xact.c state, that's exactly why that function call is there. > 2. You seem to forgot removing AtCleanup_Portals() from portal.h Oops. Fixed in the attached version. > - if (portal->status == PORTAL_READY) > - MarkPortalFailed(portal); > > Why it is safe to remove this check? It has been explained in commit > 7981c342 why we need that check. I don't see any explanation in email > or patch which justifies this code removal. Is it because you removed > PortalCleanup? If so, that is still called from PortalDrop? All MarkPortalFailed() does is change the status to PORTAL_FAILED and call the cleanup hook. PortalDrop() calls the cleanup hook, and we don't need to change the status if we're removing it completely. > 4. > + * If the status is PORTAL_ACTIVE, then we must be > executing a command > + * that uses multiple transactions internally. In that > case, the > + * command in question must be one that does not > internally rely on > + * any transaction-lifetime resources, because they > would disappear > + * in the upcoming transaction-wide cleanup. > */ > if (portal->status == PORTAL_ACTIVE) > > I am not able to understand how we can reach with the portal state as > 'active' for a multi-transaction command. It seems wherever we mark > portal as active, we don't relinquish the control unless its state is > changed. Can you share some example where this can happen? Yeah -- a plpgsql function or procedure that does "ROLLBACK;" internally. The calling code doesn't relinquish control, but it does reach AbortTransaction(). If you want to see it happen, just put an elog() inside that block and run make -C src/pl/plpgsql check. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Tue, Sep 24, 2019 at 6:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > On Fri, Sep 13, 2019 at 2:13 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > /* > * Otherwise, do nothing to cursors held over from a previous > * transaction. > */ > if (portal->createSubid == InvalidSubTransactionId) > continue; > > /* > * Do nothing to auto-held cursors. This is similar to the case of a > * cursor from a previous transaction, but it could also be that the > * cursor was auto-held in this transaction, so it wants to live on. > */ > if (portal->autoHeld) > continue; > > I have one doubt that why do we need the second check. Because before > setting portal->autoHeld to true we always call HoldPortal therein we > set portal->createSubid to InvalidSubTransactionId. So it seems to me > that the second condition will never reach. Am I missing something? Not that I can see, but I don't necessarily think this patch needs to change it, either. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2019-10-07 12:14:52 -0400, Robert Haas wrote: > > - if (portal->status == PORTAL_READY) > > - MarkPortalFailed(portal); > > > > Why it is safe to remove this check? It has been explained in commit > > 7981c342 why we need that check. I don't see any explanation in email > > or patch which justifies this code removal. Is it because you removed > > PortalCleanup? If so, that is still called from PortalDrop? > > All MarkPortalFailed() does is change the status to PORTAL_FAILED and > call the cleanup hook. PortalDrop() calls the cleanup hook, and we > don't need to change the status if we're removing it completely. Note that currently PortalCleanup() behaves differently depending on whether the portal is set to failed or not... - Andres
On Tue, Oct 8, 2019 at 2:10 PM Andres Freund <andres@anarazel.de> wrote: > On 2019-10-07 12:14:52 -0400, Robert Haas wrote: > > > - if (portal->status == PORTAL_READY) > > > - MarkPortalFailed(portal); > > > > > > Why it is safe to remove this check? It has been explained in commit > > > 7981c342 why we need that check. I don't see any explanation in email > > > or patch which justifies this code removal. Is it because you removed > > > PortalCleanup? If so, that is still called from PortalDrop? > > > > All MarkPortalFailed() does is change the status to PORTAL_FAILED and > > call the cleanup hook. PortalDrop() calls the cleanup hook, and we > > don't need to change the status if we're removing it completely. > > Note that currently PortalCleanup() behaves differently depending on > whether the portal is set to failed or not... Urk, yeah, I forgot about that. I think that's a wretched hack that somebody ought to untangle at some point, but maybe for purposes of this patch it makes more sense to just put the MarkPortalFailed call back. It's unclear to me why there's a special case here specifically for PORTAL_READY. Like, why is PORTAL_NEW or PORTAL_DEFINED or PORTAL_DONE any different? It seems like if we're aborting the transaction, we should not be calling ExecutorFinish()/ExecutorEnd() for anything. We could achieve that result by just nulling out the cleanup hook unconditionally instead of having this complicated dance where we mark ready portals failed, which calls the cleanup hook, which decides not to do anything because the portal has been marked failed. It'd be great if there were a few more comments in this file explaining what the thinking behind all this was. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Oct 9, 2019 at 6:56 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Oct 8, 2019 at 2:10 PM Andres Freund <andres@anarazel.de> wrote: > > On 2019-10-07 12:14:52 -0400, Robert Haas wrote: > > > > - if (portal->status == PORTAL_READY) > > > > - MarkPortalFailed(portal); > > > > > > > > Why it is safe to remove this check? It has been explained in commit > > > > 7981c342 why we need that check. I don't see any explanation in email > > > > or patch which justifies this code removal. Is it because you removed > > > > PortalCleanup? If so, that is still called from PortalDrop? > > > > > > All MarkPortalFailed() does is change the status to PORTAL_FAILED and > > > call the cleanup hook. PortalDrop() calls the cleanup hook, and we > > > don't need to change the status if we're removing it completely. > > > > Note that currently PortalCleanup() behaves differently depending on > > whether the portal is set to failed or not... > > Yeah, this is the reason, I mentioned it. > Urk, yeah, I forgot about that. I think that's a wretched hack that > somebody ought to untangle at some point, but maybe for purposes of > this patch it makes more sense to just put the MarkPortalFailed call > back. > +1. > It's unclear to me why there's a special case here specifically for > PORTAL_READY. Like, why is PORTAL_NEW or PORTAL_DEFINED or > PORTAL_DONE any different? > If read the commit message of commit 7981c34279 [1] which introduced this, then we might get some clue. It is quite possible that we need same handling for PORTAL_NEW, PORTAL_DEFINED, etc. but it seems we just hit the problem mentioned in commit 7981c34279 for PORTAL_READY state. I think as per commit, if we don't mark it failed, then with auto_explain things can go wrong. [1] - commit 7981c34279fbddc254cfccb9a2eec4b35e692a12 Author: Tom Lane <tgl@sss.pgh.pa.us> Date: Thu Feb 18 03:06:46 2010 +0000 Force READY portals into FAILED state when a transaction or subtransaction is aborted, if they were created within the failed xact. This prevents ExecutorEnd from being run on them, which is a good idea because they may contain references to tables or other objects that no longer exist. In particular this is hazardous when auto_explain is active, but it's really rather surprising that nobody has seen an issue with this before. I'm back-patching this to 8.4, since that's the first version that contains auto_explain or an ExecutorEnd hook, but I wonder whether we shouldn't back-patch further. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com