Thread: Re: Prevent an error on attaching/creating a DSM/DSA from an interrupt handler.

Hi, 

Please find the attached updated and rebased patch.
I have added a test in the test_dsa module that uses a function
to create a dsa area. This function is called after 
the resowner->releasing is set to true, using an injection point.

Thank you,
Rahila Syed
Attachment
> On 24 Mar 2025, at 20:31, Rahila Syed <rahilasyed90@gmail.com> wrote:

> Please find the attached updated and rebased patch.

Thanks for this rebase, as was mentioned in the other thread I plan to get this
committed fairly soon.  A few comments on the code, all performed in the
attached v3.


+    else
+    {
+        /* Log failure of unpinning */
+        elog(DEBUG2, "unable to unpin the segment %u as CurrentResourceOwner is NULL or releasing",
+             seg);
+        seg->resowner = NULL;
+    }
I removed the elog() calls since I can't see it adding enough value, and the
assignment to NULL can be removed as well since we've already asserted that
seg->resowner is NULL.


+        INJECTION_POINT("dsa_create_on_res_release");
This feels like a name which limits its use to this one test, whereas it is a
general purpose injection point.  Renamed, and also moved to using dashes
rather than underscore as the former is project style.


+void
+test_dsa_inj(const char *name, const void *private_data)
Rather than duplicating the code I created an internal function for this test
which can be called from the existing basic test as well as this new test.

I also did a little bit of renaming to make it more readable.

As it can only really be tested with an injection point I plan on only
backpatching to 17 initially.  Searching the archives I didn't find any mention
of this bug ever being hit so it seems safe to let it prove itself in testable
versions before going further back with it.

--
Daniel Gustafsson


Attachment
Hi Daniel,

Thank you for your review and code improvements. 

Please find below some observations.

1. dsm_unpin_mapping(dsm_segment *seg)
+       if (CurrentResourceOwner && IsResourceOwnerReleasing(CurrentResourceOwner))
+               return;

Given that the function can return without setting resowner to a 
CurrentResourceOwner which is not NULL, shall we change the function
signature to return true when "unpin" is successful and false when not?
This behavior existed earlier too, but adding the check has made it explicit.
Although this function is not currently in use, it would be beneficial to make
the API more verbose.

2.  If value of IsResourceOwnerReleasing changes between dsm_create_descriptor
and attach_internal, the dsm segment and dsa area will end up with different resource owners.
Earlier the chances of CurrentResourceOwner changing between the two functions were zero.
May be something can be done to keep resowner assignments under both these functions
in sync.

Thank you,
Rahila Syed
> On 11 Apr 2025, at 21:08, Rahila Syed <rahilasyed90@gmail.com> wrote:
>
> Hi Daniel,
>
> Thank you for your review and code improvements.
>
> Please find below some observations.
>
> 1. dsm_unpin_mapping(dsm_segment *seg)
> +       if (CurrentResourceOwner && IsResourceOwnerReleasing(CurrentResourceOwner))
> +               return;
>
> Given that the function can return without setting resowner to a
> CurrentResourceOwner which is not NULL, shall we change the function
> signature to return true when "unpin" is successful and false when not?

Maybe, but at this point in the cycle we cannot change the prototype like that.
Food for thought for v19 though.

> 2.  If value of IsResourceOwnerReleasing changes between dsm_create_descriptor
> and attach_internal, the dsm segment and dsa area will end up with different resource owners.
> Earlier the chances of CurrentResourceOwner changing between the two functions were zero.
> May be something can be done to keep resowner assignments under both these functions
> in sync.

Hmm, that's a good question, not sure. Do you have a repro for this handy?

Attached is a current v4 with a few small tweaks.

--
Daniel Gustafsson


Attachment
On Wed, Apr 30, 2025 at 5:24 PM Daniel Gustafsson <daniel@yesql.se> wrote:
> Attached is a current v4 with a few small tweaks.

Sorry to turn up late here, but I strongly disagree with the notion
that this is a bug in the DSM or DSA code. It seems to me that it is
the caller's responsibility to provide a valid resource owner, not the
job of the called code to ignore the resource owner when it's
unusable. I suspect that there are many other parts of the system that
rely on the ResourceOwner machinery which likewise assume that the
ResourceOwner that they are passed is valid.

--
Robert Haas
EDB: http://www.enterprisedb.com



On Wed, Apr 30, 2025 at 06:03:49PM -0400, Robert Haas wrote:
> Sorry to turn up late here, but I strongly disagree with the notion
> that this is a bug in the DSM or DSA code. It seems to me that it is
> the caller's responsibility to provide a valid resource owner, not the
> job of the called code to ignore the resource owner when it's
> unusable. I suspect that there are many other parts of the system that
> rely on the ResourceOwner machinery which likewise assume that the
> ResourceOwner that they are passed is valid.

Yeah, dshash would be one, I think.  It feels to me that if you want
to enforce this kind of policy to be checked, this is something that
should be done in the shape of one or more assertion based the state
of the resource owner expected in these low-level paths rather than
tweaking the DSA and DSM code to do what you are expecting here, and
only enforce such new policies on HEAD to avoid disruption with
existing systems.

I'm actually rather scared of the patch, isn't there a risk of
breaking existing patterns that worked out of the box by forcing the
resowner to not be set?  My spidey sense tingles when I see such
patterns, because this is enforcing assumptions directly hidden to the
callers.
--
Michael

Attachment
> On 1 May 2025, at 00:04, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Apr 30, 2025 at 5:24 PM Daniel Gustafsson <daniel@yesql.se> wrote:
>> Attached is a current v4 with a few small tweaks.
>
> Sorry to turn up late here, but I strongly disagree with the notion
> that this is a bug in the DSM or DSA code. It seems to me that it is
> the caller's responsibility to provide a valid resource owner, not the
> job of the called code to ignore the resource owner when it's
> unusable. I suspect that there are many other parts of the system that
> rely on the ResourceOwner machinery which likewise assume that the
> ResourceOwner that they are passed is valid.

Thanks for review, I’m currently travelling over the extended weekend but will rework the approach shortly when back at
theoffice. 

./daniel


On Wed, Apr 30, 2025 at 6:56 PM Michael Paquier <michael@paquier.xyz> wrote:
> I'm actually rather scared of the patch, isn't there a risk of
> breaking existing patterns that worked out of the box by forcing the
> resowner to not be set?  My spidey sense tingles when I see such
> patterns, because this is enforcing assumptions directly hidden to the
> callers.

I'm not worried about this particular problem. We use resowner == NULL
in various places to mean "for the lifetime of the backend," which
makes a lot of sense if you think about it: the job of the resowner is
to release resources when the resowner is cleaned up; if you don't
ever want to release the resources, then you don't need a resowner.

What I'm concerned about is that I think that (as I said on the other
thread) is that ProcessGetMemoryContextInterrupt is not really at all
safe to execute at an arbitrary CHECK_FOR_INTERRUPTS(). In other
places where we use resource owners, we don't run into the problem
that happened here because we are careful about the timing of our
resource owner operations. You can see that caution in, for example,
AbortTransaction() and AbortSubTransaction(), where we make sure to
close down higher-level subsystems first, so that when we get to
lower-level cleanup, we know that no more resources are going to be
allocated afterward. But here, we are executing complex operations
that require a "clean" state at pretty much any point in the code,
where we don't actually know that things are sane enough for this code
to execute successfully. This particular error is just a symptom of
that more general problem.

In my mind, the possible fixes here are (1) revert that patch, (2)
redesign things so that the problematic code can only get called when
we know that the backend state is sane, or (3) redesign the code so
that it has fewer requirements for correct operation. Regarding (3),
this particular problem could maybe be solved by not relying on the
resowner machinery at all: we always want the DSA to be pinned, so if
we could create it pre-pinned rather than pinning it as an
after-the-fact step, we wouldn't need the resowner at all. Except that
I don't really think that would be entirely safe, because that seems
to open up the possibility that we'd just leak the DSA area entirely:
say you enter dsa_create with no resowner and then fail someplace
inside that function after you've allocated resources. There's nothing
to clean up those resources, but since you never returned the DSA
handle, it's not stored anywhere, so those resources have just been
leaked. I think this is actually why the interface works the way it
does. Maybe there's some way to fix it by creating a temporary
resource owner that is used by just this code ... but in general I'm
skeptical that we can really set up something that is OK to do in an
aborted transaction, because our ability to handle any further errors
at that point is extremely limited, and this code is definitely
complex enough that it could error out.

--
Robert Haas
EDB: http://www.enterprisedb.com



Hi,

On Thu, May 1, 2025 at 4:26 AM Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Apr 30, 2025 at 06:03:49PM -0400, Robert Haas wrote:
> Sorry to turn up late here, but I strongly disagree with the notion
> that this is a bug in the DSM or DSA code. It seems to me that it is
> the caller's responsibility to provide a valid resource owner, not the
> job of the called code to ignore the resource owner when it's
> unusable. I suspect that there are many other parts of the system that
> rely on the ResourceOwner machinery which likewise assume that the
> ResourceOwner that they are passed is valid.

Yeah, dshash would be one, I think.  It feels to me that if you want
to enforce this kind of policy to be checked, this is something that
should be done in the shape of one or more assertion based the state
of the resource owner expected in these low-level paths rather than
tweaking the DSA and DSM code to do what you are expecting here, and
only enforce such new policies on HEAD to avoid disruption with
existing systems.

I believe it would be particularly useful to add an assertion in dsm_unpin_mapping().
This function relies on CurrentResourceOwner being non-NULL and not releasing to
successfully unpin a mapping.
 

I'm actually rather scared of the patch, isn't there a risk of
breaking existing patterns that worked out of the box by forcing the
resowner to not be set?
 
In this context, resowner not being set means that the dsm segment
would remain attached until the session ends or until it is explicitly
detached.  IIUC, the key difference is that a segment will stay mapped
for longer than expected in cases where the mapping was created
when a transaction was aborting.

Thank you for the review comments, it makes sense to include this
check in the callers of the DSA API.

Wrapping the dsa_create/dsa_attach call in the following snippet helps
resolve the issue in case of ProcessGetMemoryContextInterrupt.

+               ResourceOwner   current_owner;
+               bool            res_releasing = false;
+
+               if (CurrentResourceOwner && IsResourceOwnerReleasing(CurrentResourceOwner))
+               {
+                       current_owner = CurrentResourceOwner;
+                       CurrentResourceOwner = NULL;
+                       res_releasing = true;
+               }
+
                MemoryStatsDsaArea = dsa_create(memCxtArea->lw_lock.tranche);

+               if (res_releasing)
+                       CurrentResourceOwner = current_owner;

Kindly let me know your views.

Thank you,
Rahila Syed




Hello, everyone.

Not sure if it helps but encountered the same problem in relation to
injection points:

https://www.postgresql.org/message-id/flat/CANtu0oiTgFW47QgpTwrMOVm3Bq4N0Y5bjvTy5sP0gYWLQuVgjw%40mail.gmail.com