Thread: abort-time portal cleanup

abort-time portal cleanup

From

Robert Haas

Date:

12 September 2019, 20:42:46

I've been studying At{Sub,}{Abort,Cleanup}_Portals() for the last few
days and have come to the conclusion that the code is not entirely up
to our usual standards. I believe that a good deal of the reason for
this is attributable to the poor quality of the code comments in this
area, although there are perhaps some other contributing factors as
well, such as bullheadedness on my part and perhaps others.

The trouble starts with the header comment for AtAbort_Portals, which
states that "At this point we run the cleanup hook if present, but we
can't release the portal's memory until the cleanup call." At the time
this logic was introduced in commit
de28dc9a04c4df5d711815b7a518501b43535a26 (2003-05-02),
AtAbort_Portals() affected all non-held portals without caring whether
they were active or not, and, UserAbortTransactionBlock() called
AbortTransaction() directly, so typing "ROLLBACK;" would cause
AtAbort_Portals() to be reached from within PortalRun().  Even if
PortalRun() managed to return without crashing, the caller would next
try to call PortalDrop() on what was now an invalid pointer. However,
commit 8f9f1986034a2273e09ad10671e10d1adda21d1f (2004-09-16) changed
things so that UserAbortEndTransaction() just sets things up so that
the subsequent call to CommitTransactionCommand() would call
AbortTransaction() instead of trying to do it right away, and that
doesn't happen until after we're done with the portal.  As far as I
can see, that change made this comment mostly false, but the comment
has nevertheless managed to survive for another ~15 years. I think we
can, and in fact should, just drop the portal right here.

As far as actually making that work, there are a few wrinkles. The
first is that we might be in the middle of FATAL. In that case, unlike
the ROLLBACK case, a call to PortalRun() is still on the stack, but
we'll exit the process rather than returning, so the fact that we've
created a dangling pointer for the caller won't matter. However, as
shown by commit ad9a274778d2d88c46b90309212b92ee7fdf9afe (2018-02-01)
and the report that led up to it at
https://www.postgresql.org/message-id/20180128034531.h6o4w3727ifof3jy%40alap3.anarazel.de
it's not a good idea to try to clean up the portal in that case,
because we might've already started shutting down critical systems.
It seems not only risky but also unnecessary: our process-local state
is about to go away, and the executor shouldn't need to clean up any
shared memory state that won't also get cleaned up by some other
mechanism. So, it seems to me that if we reach AtAbort_Portals()
during FATAL processing, we should either (1) do nothing at all and
just return or (2) forget about all the existing portals without
cleaning them up and then return.  The second option seems a little
safer to me, because it guarantees that if we somehow reach code that
might try to look up a portal later, it won't find anything. But I
think it's arguable.

The second wrinkle is that there might be an active portal.  Apart
from the FATAL case already mentioned, I think the only way this can
happen is some C code that calls purposefully calls AbortTransaction()
in the middle of executing a command.  It can't be an ERROR, because
then the portal would be marked as failed before we get here, and it
can't be an explicit ROLLBACK, because as noted above, that case was
changed 15 years ago.  It's got to be some other case where C code
calls AbortTransaction() voluntarily in the middle of a statement. For
over a decade, there were no cases at all of this type, but the code
in this function catered to hypothetical cases by marking the portal
failed. By 2016, Noah had figured out that this was bogus, and that
any future cases would likely require different handling, but Tom and
I shouted him down:

http://postgr.es/m/67674.1454259004@sss.pgh.pa.us

The end result of that discussion was commit
41baee7a9312eefb315b6b2973ac058c9efaa9cf (2016-02-05) which left the
code as it was but added comments nothing that it was wrong. It
actually wasn't entirely wrong, because it handled the FATAL case
mentioned above by the byzantine mechanism of invoking the portal's
cleanup callback after first setting the status to PORTAL_FAILED.
Since the only existing cleanup callback arranges to do nothing if the
status is PORTAL_FAILED, this worked out to a complicated way of
(correctly) skipping callback in the FATAL case.

But, probably because that wasn't documented in any understandable
way, possibly because nobody really understood it, when commit
8561e4840c81f7e345be2df170839846814fa004 (2018-01-22) added support
for transaction control in procedures, it just removed the code
marking active portals as failed, just as Noah had predicted would be
necessary ~2 years earlier. Andres noticed that this broke the FATAL
case and tracked it back to the removal of this code, resulting it it
getting put back, but just for the FATAL case, in commit
ad9a274778d2d88c46b90309212b92ee7fdf9afe (2018-02-01). See also
discussion at:

https://www.postgresql.org/message-id/20180128034531.h6o4w3727ifof3jy%40alap3.anarazel.de

I think that the code here still isn't really right.  Apart from the
fact that the comments don't explain anything very clearly and the
FATAL case is handled in a way that looks overly complicated and
accidental, the handling in AtSubAbort_Portals() hasn't been updated,
so it's now inconsistent with AtAbort_Portals() as to both substance
and comments. I think that's only held together because stored
procedures don't yet support explicit control over subtransactions,
only top-level transactions.

Stepping back a bit, stored procedures are a good example of a command
that uses multiple transactions internally. We have a few others, such
as VACUUM, but at present, that case only commits transactions
internally; it does not roll them back internally.  If it did, it
would want the same thing that procedures want, namely, to leave the
active portal alone. It doesn't quite work to leave the active portal
completely alone, because the portal has got a pointer to a
ResourceOwner which is about to be freed by end-of-transaction
cleanup; at the least, we've got to clear the pointer to that when the
multi-transaction statement eventually finishes in some future
transaction, it doesn't try to do anything with a dangling pointer.
But note that the resource owner can't really be in use in this
situation anyway; if it is, then the transaction abort is about to
blow away resources that the statement still needs. Similarly, the
statement can't be using any transaction-local memory, because that
too is about to get blown away. The only thing that can work at all
here is a statement that's been carefully designed to be able to
survive starting and ending transactions internally. Such a statement
must not rely on any transaction-local resources. The only thing I'm
sure we have to do is set portal->resowner to NULL.  Calling the
cleanup hook, as the current code does, can't be right, because we'd
be cleaning up something that isn't going away. I think it only works
now because this is another case where the cleanup hook arranges to do
nothing in the cases where calling it is wrong in the first place. The
current code also calls PortalReleaseCachedPlan in this case; I'm not
100% certain whether that's appropriate or not.

Attached is a patch that (1) removes At(Sub)Cleanup_Portals() entirely
in favor of dropping portals on the spot in At(Sub)Abort_Portals(),
(2) overhauls the comments in this area, and (3) makes
AtSubAbort_Portals() consistent with AtAbort_Portals(). One possible
concern here - which Andres mentioned to me during off-list discussion
- is that someone might create a portal for a ROLLBACK statement, and
then try to execute that portal after an error has occurred. In
theory, keeping the portal around between abort time and cleanup time
might allow this to work, but it actually doesn't work, so there's no
functional regression. As noted by Amit Kapila also during off-list
discussion, IsTransactionStmtList() only works if portal->stmts is
set, and PortalReleaseCachedPlan() clears portal->stmts, so once we've
aborted the transaction, any previously-created portals are
unrecognizable as transaction statements.

This area is incredibly confusing to understand, so it's entirely
possible that I've gotten some things wrong. However, I think it's
worth the effort to try to do some cleanup here, because I think it's
overly complex and under-documented. On top of the commits already
mentioned, some of which I think demonstrate that the issues here
haven't been completely understood, I found other bug fix commits that
look like bugs that might've never happened in the first place if this
weren't all so confusing.

Feedback appreciated,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

0001-Improve-handling-of-portals-after-sub-transaction-ab.patch

Re: abort-time portal cleanup

From

Amit Kapila

Date:

24 September 2019, 09:30:48

On Fri, Sep 13, 2019 at 2:13 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> I've been studying At{Sub,}{Abort,Cleanup}_Portals() for the last few
> days and have come to the conclusion that the code is not entirely up
> to our usual standards. I believe that a good deal of the reason for
> this is attributable to the poor quality of the code comments in this
> area, although there are perhaps some other contributing factors as
> well, such as bullheadedness on my part and perhaps others.
>
> The trouble starts with the header comment for AtAbort_Portals, which
> states that "At this point we run the cleanup hook if present, but we
> can't release the portal's memory until the cleanup call." At the time
> this logic was introduced in commit
> de28dc9a04c4df5d711815b7a518501b43535a26 (2003-05-02),
> AtAbort_Portals() affected all non-held portals without caring whether
> they were active or not, and, UserAbortTransactionBlock() called
> AbortTransaction() directly, so typing "ROLLBACK;" would cause
> AtAbort_Portals() to be reached from within PortalRun().  Even if
> PortalRun() managed to return without crashing, the caller would next
> try to call PortalDrop() on what was now an invalid pointer. However,
> commit 8f9f1986034a2273e09ad10671e10d1adda21d1f (2004-09-16) changed
> things so that UserAbortEndTransaction() just sets things up so that
> the subsequent call to CommitTransactionCommand() would call
> AbortTransaction() instead of trying to do it right away, and that
> doesn't happen until after we're done with the portal.  As far as I
> can see, that change made this comment mostly false, but the comment
> has nevertheless managed to survive for another ~15 years. I think we
> can, and in fact should, just drop the portal right here.
>
> As far as actually making that work, there are a few wrinkles. The
> first is that we might be in the middle of FATAL. In that case, unlike
> the ROLLBACK case, a call to PortalRun() is still on the stack, but
> we'll exit the process rather than returning, so the fact that we've
> created a dangling pointer for the caller won't matter. However, as
> shown by commit ad9a274778d2d88c46b90309212b92ee7fdf9afe (2018-02-01)
> and the report that led up to it at
> https://www.postgresql.org/message-id/20180128034531.h6o4w3727ifof3jy%40alap3.anarazel.de
> it's not a good idea to try to clean up the portal in that case,
> because we might've already started shutting down critical systems.
> It seems not only risky but also unnecessary: our process-local state
> is about to go away, and the executor shouldn't need to clean up any
> shared memory state that won't also get cleaned up by some other
> mechanism. So, it seems to me that if we reach AtAbort_Portals()
> during FATAL processing, we should either (1) do nothing at all and
> just return or (2) forget about all the existing portals without
> cleaning them up and then return.  The second option seems a little
> safer to me, because it guarantees that if we somehow reach code that
> might try to look up a portal later, it won't find anything. But I
> think it's arguable.
>

I agree with your position on this.

>
> Attached is a patch that (1) removes At(Sub)Cleanup_Portals() entirely
> in favor of dropping portals on the spot in At(Sub)Abort_Portals(),
> (2) overhauls the comments in this area, and (3) makes
> AtSubAbort_Portals() consistent with AtAbort_Portals().

The overall idea seems good to me, but I have a few comments on the changes.

1.
@@ -2756,7 +2756,6 @@ CleanupTransaction(void)
  /*
  * do abort cleanup processing
  */
- AtCleanup_Portals(); /* now safe to release portal
memory */
  AtEOXact_Snapshot(false, true); /* and release the transaction's
snapshots */

  CurrentResourceOwner = NULL; /* and resource owner */
@@ -5032,8 +5031,6 @@ CleanupSubTransaction(void)
  elog(WARNING, "CleanupSubTransaction while in %s state",
  TransStateAsString(s->state));

- AtSubCleanup_Portals(s->subTransactionId);
-

After this cleanup, I think we don't need At(Sub)Abort_Portals in
AbortOutOfAnyTransaction() for the states TBLOCK_(SUB)ABORT and
friends. This is because AbortTransaction itself would have zapped the
portal.

2. You seem to forgot removing AtCleanup_Portals() from portal.h

3.
  /*
- * If it was created in the current transaction, we
can't do normal
- * shutdown on a READY portal either; it might refer to
objects
- * created in the failed transaction.  See comments in
- * AtSubAbort_Portals.
- */
- if (portal->status == PORTAL_READY)
- MarkPortalFailed(portal);
-

Why it is safe to remove this check?  It has been explained in commit
7981c342 why we need that check.  I don't see any explanation in email
or patch which justifies this code removal.  Is it because you removed
PortalCleanup?  If so, that is still called from PortalDrop?

4.
-AtCleanup_Portals(void)
-{
- HASH_SEQ_STATUS status;
- PortalHashEnt *hentry;
-
- hash_seq_init(&status, PortalHashTable);
-
- while ((hentry = (PortalHashEnt *) hash_seq_search(&status)) !=
NULL)
- {
- Portal portal = hentry->portal;
-
- /*
- * Do not touch active portals --- this can only happen
in the case of
- * a multi-transaction command.
+ * If the status is PORTAL_ACTIVE, then we must be
executing a command
+ * that uses multiple transactions internally. In that
case, the
+ * command in question must be one that does not
internally rely on
+ * any transaction-lifetime resources, because they
would disappear
+ * in the upcoming transaction-wide cleanup.
  */
  if (portal->status == PORTAL_ACTIVE)

I am not able to understand how we can reach with the portal state as
'active' for a multi-transaction command.  It seems wherever we mark
portal as active, we don't relinquish the control unless its state is
changed.  Can you share some example where this can happen?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: abort-time portal cleanup

From

Dilip Kumar

Date:

24 September 2019, 10:34:17

On Fri, Sep 13, 2019 at 2:13 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
/*
* Otherwise, do nothing to cursors held over from a previous
* transaction.
*/
if (portal->createSubid == InvalidSubTransactionId)
continue;

/*
* Do nothing to auto-held cursors.  This is similar to the case of a
* cursor from a previous transaction, but it could also be that the
* cursor was auto-held in this transaction, so it wants to live on.
*/
if (portal->autoHeld)
continue;

I have one doubt that why do we need the second check. Because before
setting portal->autoHeld to true we always call HoldPortal therein we
set portal->createSubid to InvalidSubTransactionId.  So it seems to me
that the second condition will never reach.  Am I missing something?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: abort-time portal cleanup

From

Andres Freund

Date:

24 September 2019, 20:45:10

Hi,

On 2019-09-12 16:42:46 -0400, Robert Haas wrote:
> The trouble starts with the header comment for AtAbort_Portals, which
> states that "At this point we run the cleanup hook if present, but we
> can't release the portal's memory until the cleanup call." At the time
> this logic was introduced in commit
> de28dc9a04c4df5d711815b7a518501b43535a26 (2003-05-02),
> AtAbort_Portals() affected all non-held portals without caring whether
> they were active or not, and, UserAbortTransactionBlock() called
> AbortTransaction() directly, so typing "ROLLBACK;" would cause
> AtAbort_Portals() to be reached from within PortalRun().  Even if
> PortalRun() managed to return without crashing, the caller would next
> try to call PortalDrop() on what was now an invalid pointer. However,
> commit 8f9f1986034a2273e09ad10671e10d1adda21d1f (2004-09-16) changed
> things so that UserAbortEndTransaction() just sets things up so that
> the subsequent call to CommitTransactionCommand() would call
> AbortTransaction() instead of trying to do it right away, and that
> doesn't happen until after we're done with the portal.  As far as I
> can see, that change made this comment mostly false, but the comment
> has nevertheless managed to survive for another ~15 years. I think we
> can, and in fact should, just drop the portal right here.

Nice digging.


> As far as actually making that work, there are a few wrinkles. The
> first is that we might be in the middle of FATAL. In that case, unlike
> the ROLLBACK case, a call to PortalRun() is still on the stack, but
> we'll exit the process rather than returning, so the fact that we've
> created a dangling pointer for the caller won't matter. However, as
> shown by commit ad9a274778d2d88c46b90309212b92ee7fdf9afe (2018-02-01)
> and the report that led up to it at
> https://www.postgresql.org/message-id/20180128034531.h6o4w3727ifof3jy%40alap3.anarazel.de
> it's not a good idea to try to clean up the portal in that case,
> because we might've already started shutting down critical systems.
> It seems not only risky but also unnecessary: our process-local state
> is about to go away, and the executor shouldn't need to clean up any
> shared memory state that won't also get cleaned up by some other
> mechanism. So, it seems to me that if we reach AtAbort_Portals()
> during FATAL processing, we should either (1) do nothing at all and
> just return or (2) forget about all the existing portals without
> cleaning them up and then return.  The second option seems a little
> safer to me, because it guarantees that if we somehow reach code that
> might try to look up a portal later, it won't find anything. But I
> think it's arguable.

Hm. Doing that cleanup requires digging through all the portals etc. I'd
rather rely on less state being correct than more during FATAL
processing.


> The second wrinkle is that there might be an active portal.  Apart
> from the FATAL case already mentioned, I think the only way this can
> happen is some C code that calls purposefully calls AbortTransaction()
> in the middle of executing a command.  It can't be an ERROR, because
> then the portal would be marked as failed before we get here, and it
> can't be an explicit ROLLBACK, because as noted above, that case was
> changed 15 years ago.  It's got to be some other case where C code
> calls AbortTransaction() voluntarily in the middle of a statement. For
> over a decade, there were no cases at all of this type, but the code
> in this function catered to hypothetical cases by marking the portal
> failed. By 2016, Noah had figured out that this was bogus, and that
> any future cases would likely require different handling, but Tom and
> I shouted him down:

Hm. But wouldn't doing so run into a ton of problems anyway? I mean we
need to do a ton of checks and special case hangups to make
CommitTransactionCommand();StartTransactionCommand(); work for VACUUM,
CIC, ...

The cases where one can use AbortTransaction() (via
AbortCurrentTransaction() presumably) are ones where either there's no
surrounding code relying on the transaction (e.g. autovacuum,
postgres.c), or where special care has been taken with portals
(e.g. _SPI_rollback()).  We didn't have the pin mechanism back then, so
I think even if we accept your/Tom's reasoning from back then (I don't
really), it's outdated now that the pin mechanism exists.

I'd be happy if we added some defenses against such bogus cases being
introduced (i.e. erroring out if we encounter an active portal during
abort processing).

> Stepping back a bit, stored procedures are a good example of a command
> that uses multiple transactions internally. We have a few others, such
> as VACUUM, but at present, that case only commits transactions
> internally; it does not roll them back internally.  If it did, it
> would want the same thing that procedures want, namely, to leave the
> active portal alone. It doesn't quite work to leave the active portal
> completely alone, because the portal has got a pointer to a
> ResourceOwner which is about to be freed by end-of-transaction
> cleanup;

Well, that's why _SPI_commit()/_SPI_rollback() do a HoldPinnedPortals(),
which, via HoldPortal(), sets portal->resowner to = NULL.


> The current code also calls PortalReleaseCachedPlan in this case; I'm
> not 100% certain whether that's appropriate or not.

I think it's appropriate, because we cannot guarantee that the plan is
still usable. Besides normal plan invalidation issues, the catalog
contents the plan might rely on might have only existed in the aborted
transaction - which seems like a fatal problem. That's why holding
portals persists the portalstore. Which then also obsoletes the plan.


> Attached is a patch that (1) removes At(Sub)Cleanup_Portals() entirely
> in favor of dropping portals on the spot in At(Sub)Abort_Portals(),
> (2) overhauls the comments in this area, and (3) makes
> AtSubAbort_Portals() consistent with AtAbort_Portals(). One possible
> concern here - which Andres mentioned to me during off-list discussion
> - is that someone might create a portal for a ROLLBACK statement, and
> then try to execute that portal after an error has occurred.

Besides not quite happening as I though, as you reference below, I think
that only mattered if a rollback gets executed in a failed transaction -
that has to happen after a sync message, because otherwise we'd just
skip the command. But to be problematic, the bind would have to be from
before the sync, and the exec from after - which'd be an extremely
absurd use of the protocol.


>  /*
>   * Abort processing for portals.
>   *
> - * At this point we run the cleanup hook if present, but we can't release the
> - * portal's memory until the cleanup call.
> + * Most portals don't and shouldn't survive transaction abort, but there are
> + * some important special cases where they do and must: (1) held portals must
> + * survive by definition, and (2) any active portal must be part of a command
> + * that uses multiple transactions internally, and needs to survive until
> + * execution of that command has completed.

Hm. Why are active, rather than pinnned, portals relevant here? Normal
multi-transactional commands (e.g. vacuum, CIC) shouldn't get here with
an active portal, as they don't catch errors. And procedures should have
pinned the relevant portals?


Greetings,

Andres Freund

Re: abort-time portal cleanup

From

Robert Haas

Date:

27 September 2019, 14:35:47

On Tue, Sep 24, 2019 at 4:45 PM Andres Freund <andres@anarazel.de> wrote:
> Hm. Doing that cleanup requires digging through all the portals etc. I'd
> rather rely on less state being correct than more during FATAL
> processing.

I agree with the principle, but the advantage of removing the hash
table entries is that it protects against any other code that might
try to access the portal machinery later; I thought that was
worthwhile enough to justify doing this here. I don't feel
super-strongly about it, but I do still like that approach.

> > The second wrinkle is that there might be an active portal.  Apart
> > from the FATAL case already mentioned, I think the only way this can
> > happen is some C code that calls purposefully calls AbortTransaction()
> > in the middle of executing a command.  It can't be an ERROR, because
> > then the portal would be marked as failed before we get here, and it
> > can't be an explicit ROLLBACK, because as noted above, that case was
> > changed 15 years ago.  It's got to be some other case where C code
> > calls AbortTransaction() voluntarily in the middle of a statement. For
> > over a decade, there were no cases at all of this type, but the code
> > in this function catered to hypothetical cases by marking the portal
> > failed. By 2016, Noah had figured out that this was bogus, and that
> > any future cases would likely require different handling, but Tom and
> > I shouted him down:
>
> Hm. But wouldn't doing so run into a ton of problems anyway? I mean we
> need to do a ton of checks and special case hangups to make
> CommitTransactionCommand();StartTransactionCommand(); work for VACUUM,
> CIC, ...

Right... those are the kinds of cases I'm talking about here, just for
abort rather than commit.

> The cases where one can use AbortTransaction() (via
> AbortCurrentTransaction() presumably) are ones where either there's no
> surrounding code relying on the transaction (e.g. autovacuum,
> postgres.c), or where special care has been taken with portals
> (e.g. _SPI_rollback()).  We didn't have the pin mechanism back then, so
> I think even if we accept your/Tom's reasoning from back then (I don't
> really), it's outdated now that the pin mechanism exists.

It isn't, actually.  To respond to this and also your question below
about why I'm looking at active portal rather than pinned portals, try
adding this debugging code to AtAbort_Portals:

+        if (portal->status == PORTAL_ACTIVE)
+            elog(NOTICE, "this portal is ACTIVE and %spinned",
+                 portal->portalPinned ? "" : "NOT ");

Then run 'make -C src/pl/plpgsql check' and check
src/pl/plpgsql/src/regression.diffs and you'll see a whole lot of
this:

+NOTICE:  this portal is ACTIVE and NOT pinned

The PLs pin the portals they generate internally, but they don't force
the surrounding portal in which the toplevel query is executing to be
pinned.  AFAICT, pinning is mostly guarding against explicit
user-initiated drops of portals that were automatically generated by a
PL, whereas the portal's state is about tracking what the system is
doing with the portal.

(I think this could be a lot better documented than it is, but looking
at the commit history, I'm fairly sure that's what is happening here.)

> I'd be happy if we added some defenses against such bogus cases being
> introduced (i.e. erroring out if we encounter an active portal during
> abort processing).

Erroring out during error handling is probably a bad idea, but also, see above.

> > Stepping back a bit, stored procedures are a good example of a command
> > that uses multiple transactions internally. We have a few others, such
> > as VACUUM, but at present, that case only commits transactions
> > internally; it does not roll them back internally.  If it did, it
> > would want the same thing that procedures want, namely, to leave the
> > active portal alone. It doesn't quite work to leave the active portal
> > completely alone, because the portal has got a pointer to a
> > ResourceOwner which is about to be freed by end-of-transaction
> > cleanup;
>
> Well, that's why _SPI_commit()/_SPI_rollback() do a HoldPinnedPortals(),
> which, via HoldPortal(), sets portal->resowner to = NULL.

Right, but it's still necessary for AtAbort_Portals() to do the same
thing, again because the top-level portal isn't pinned.  If you apply
my patch, comment out the line that does portal->resowner = NULL; for
an active portal, and run make -C src/pl/plpgsql check, it will seg
fault inside exec_simple_query -> PortalDrop -> ResourceOwnerRelease
-> etc.

> > The current code also calls PortalReleaseCachedPlan in this case; I'm
> > not 100% certain whether that's appropriate or not.
>
> I think it's appropriate, because we cannot guarantee that the plan is
> still usable. Besides normal plan invalidation issues, the catalog
> contents the plan might rely on might have only existed in the aborted
> transaction - which seems like a fatal problem. That's why holding
> portals persists the portalstore. Which then also obsoletes the plan.

In a case like this, it can't really be an actual planable statement.
If the executor were involved, aborting the running transaction and
starting a new one would certainly result in a crash, because the
memory context in which we had saved all of our working state would be
vanish out from under us. It's got to be a procedure call or maybe
some kind of DDL command that has been specially-crafted to survive a
mid-command abort. It's not clear to me that the issues for plan
invalidation and catalog contents are the same in those kinds of cases
as they are for planable statements, but perhaps they are.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: abort-time portal cleanup

From

Robert Haas

Date:

07 October 2019, 16:14:52

On Tue, Sep 24, 2019 at 5:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> After this cleanup, I think we don't need At(Sub)Abort_Portals in
> AbortOutOfAnyTransaction() for the states TBLOCK_(SUB)ABORT and
> friends. This is because AbortTransaction itself would have zapped the
> portal.

Not if the ROLLBACK itself failed - in that case, the portal would
have been active at the time, and thus not subject to removal. And, as
the existing comments in xact.c state, that's exactly why that
function call is there.

> 2. You seem to forgot removing AtCleanup_Portals() from portal.h

Oops. Fixed in the attached version.

> - if (portal->status == PORTAL_READY)
> - MarkPortalFailed(portal);
>
> Why it is safe to remove this check?  It has been explained in commit
> 7981c342 why we need that check.  I don't see any explanation in email
> or patch which justifies this code removal.  Is it because you removed
> PortalCleanup?  If so, that is still called from PortalDrop?

All MarkPortalFailed() does is change the status to PORTAL_FAILED and
call the cleanup hook. PortalDrop() calls the cleanup hook, and we
don't need to change the status if we're removing it completely.

> 4.
> + * If the status is PORTAL_ACTIVE, then we must be
> executing a command
> + * that uses multiple transactions internally. In that
> case, the
> + * command in question must be one that does not
> internally rely on
> + * any transaction-lifetime resources, because they
> would disappear
> + * in the upcoming transaction-wide cleanup.
>   */
>   if (portal->status == PORTAL_ACTIVE)
>
> I am not able to understand how we can reach with the portal state as
> 'active' for a multi-transaction command.  It seems wherever we mark
> portal as active, we don't relinquish the control unless its state is
> changed.  Can you share some example where this can happen?

Yeah -- a plpgsql function or procedure that does "ROLLBACK;"
internally. The calling code doesn't relinquish control, but it does
reach AbortTransaction().

If you want to see it happen, just put an elog() inside that block and
run make -C src/pl/plpgsql check.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

v2-0001-Improve-handling-of-portals-after-sub-transaction.patch

Re: abort-time portal cleanup

From

Robert Haas

Date:

07 October 2019, 16:27:55

On Tue, Sep 24, 2019 at 6:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Fri, Sep 13, 2019 at 2:13 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> /*
> * Otherwise, do nothing to cursors held over from a previous
> * transaction.
> */
> if (portal->createSubid == InvalidSubTransactionId)
> continue;
>
> /*
> * Do nothing to auto-held cursors.  This is similar to the case of a
> * cursor from a previous transaction, but it could also be that the
> * cursor was auto-held in this transaction, so it wants to live on.
> */
> if (portal->autoHeld)
> continue;
>
> I have one doubt that why do we need the second check. Because before
> setting portal->autoHeld to true we always call HoldPortal therein we
> set portal->createSubid to InvalidSubTransactionId.  So it seems to me
> that the second condition will never reach.  Am I missing something?

Not that I can see, but I don't necessarily think this patch needs to
change it, either.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: abort-time portal cleanup

From

Andres Freund

Date:

08 October 2019, 18:10:41

Hi,

On 2019-10-07 12:14:52 -0400, Robert Haas wrote:
> > - if (portal->status == PORTAL_READY)
> > - MarkPortalFailed(portal);
> >
> > Why it is safe to remove this check?  It has been explained in commit
> > 7981c342 why we need that check.  I don't see any explanation in email
> > or patch which justifies this code removal.  Is it because you removed
> > PortalCleanup?  If so, that is still called from PortalDrop?
>
> All MarkPortalFailed() does is change the status to PORTAL_FAILED and
> call the cleanup hook. PortalDrop() calls the cleanup hook, and we
> don't need to change the status if we're removing it completely.

Note that currently PortalCleanup() behaves differently depending on
whether the portal is set to failed or not...

- Andres

Re: abort-time portal cleanup

From

Robert Haas

Date:

09 October 2019, 13:25:47

On Tue, Oct 8, 2019 at 2:10 PM Andres Freund <andres@anarazel.de> wrote:
> On 2019-10-07 12:14:52 -0400, Robert Haas wrote:
> > > - if (portal->status == PORTAL_READY)
> > > - MarkPortalFailed(portal);
> > >
> > > Why it is safe to remove this check?  It has been explained in commit
> > > 7981c342 why we need that check.  I don't see any explanation in email
> > > or patch which justifies this code removal.  Is it because you removed
> > > PortalCleanup?  If so, that is still called from PortalDrop?
> >
> > All MarkPortalFailed() does is change the status to PORTAL_FAILED and
> > call the cleanup hook. PortalDrop() calls the cleanup hook, and we
> > don't need to change the status if we're removing it completely.
>
> Note that currently PortalCleanup() behaves differently depending on
> whether the portal is set to failed or not...

Urk, yeah, I forgot about that.  I think that's a wretched hack that
somebody ought to untangle at some point, but maybe for purposes of
this patch it makes more sense to just put the MarkPortalFailed call
back.

It's unclear to me why there's a special case here specifically for
PORTAL_READY.  Like, why is PORTAL_NEW or PORTAL_DEFINED or
PORTAL_DONE any different? It seems like if we're aborting the
transaction, we should not be calling ExecutorFinish()/ExecutorEnd()
for anything. We could achieve that result by just nulling out the
cleanup hook unconditionally instead of having this complicated dance
where we mark ready portals failed, which calls the cleanup hook,
which decides not to do anything because the portal has been marked
failed. It'd be great if there were a few more comments in this file
explaining what the thinking behind all this was.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: abort-time portal cleanup

From

Amit Kapila

Date:

10 October 2019, 12:33:01

On Wed, Oct 9, 2019 at 6:56 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Oct 8, 2019 at 2:10 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2019-10-07 12:14:52 -0400, Robert Haas wrote:
> > > > - if (portal->status == PORTAL_READY)
> > > > - MarkPortalFailed(portal);
> > > >
> > > > Why it is safe to remove this check?  It has been explained in commit
> > > > 7981c342 why we need that check.  I don't see any explanation in email
> > > > or patch which justifies this code removal.  Is it because you removed
> > > > PortalCleanup?  If so, that is still called from PortalDrop?
> > >
> > > All MarkPortalFailed() does is change the status to PORTAL_FAILED and
> > > call the cleanup hook. PortalDrop() calls the cleanup hook, and we
> > > don't need to change the status if we're removing it completely.
> >
> > Note that currently PortalCleanup() behaves differently depending on
> > whether the portal is set to failed or not...
> >

Yeah, this is the reason, I mentioned it.

> Urk, yeah, I forgot about that.  I think that's a wretched hack that
> somebody ought to untangle at some point, but maybe for purposes of
> this patch it makes more sense to just put the MarkPortalFailed call
> back.
>

+1.

> It's unclear to me why there's a special case here specifically for
> PORTAL_READY.  Like, why is PORTAL_NEW or PORTAL_DEFINED or
> PORTAL_DONE any different?
>

If read the commit message of commit 7981c34279 [1] which introduced
this, then we might get some clue.  It is quite possible that we need
same handling for PORTAL_NEW, PORTAL_DEFINED, etc. but it seems we
just hit the problem mentioned in commit 7981c34279 for PORTAL_READY
state.  I think as per commit, if we don't mark it failed, then with
auto_explain things can go wrong.

[1] -
commit 7981c34279fbddc254cfccb9a2eec4b35e692a12
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Thu Feb 18 03:06:46 2010 +0000

Force READY portals into FAILED state when a transaction or
subtransaction is aborted, if they were created within the failed
xact.  This  prevents ExecutorEnd from being run on them, which is a
good idea because they may contain references to tables or other
objects that no longer exist.  In particular this is hazardous when
auto_explain is active, but it's really rather surprising that nobody
has seen an issue with this before.  I'm back-patching this to 8.4,
since that's the first version that contains auto_explain or an
ExecutorEnd hook, but I wonder whether we shouldn't back-patch
further.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com