Thread: Using Expanded Objects other than Arrays from plpgsql
Hello!
I'm working on the OneSparse Postgres extension that wraps the GraphBLAS API with a SQL interface for doing graph analytics and other sparse linear algebra operations:
OneSparse wraps the GraphBLAS opaque handles in Expanded Object Header structs that register ExpandedObjectMethods for flattening and expanding objects from their "live" handle that can be passed to the SuiteSparse API, and their "flat" representations are de/serialized and get written as TOAST values. This works perfectly.
However during some single source shortest path (sssp) benchmarking I was getting good numbers but not as good as I expected, and noticed some sublinear scaling as the problems got bigger. It seems my objects are getting constantly flattened/expanded from plpgsql during the iterative phases of an algorithm. As the solution grows the result vector gets bigger and the expand/flatten cost increases on each iteration.
I found this thread from the original path implementation from Tom Lane in 2015:
In this initial implementation, a few heuristics have been hard-wired
into plpgsql to improve performance for arrays that are stored in
plpgsql variables. We would like to generalize those hacks so that
other datatypes can obtain similar improvements, but figuring out some
appropriate APIs is left as a task for future work.
Sure enough looking at the code I see this condition:
This is a showstopper for me as I can't see a good way around it, I tried to "fake" an array but didn't get too far down that approach but I may still pull it off as GraphBLAS objects are very much array-like, but I figured I'd also open the discussion on how we can fix this permanently so that future extensions don't run into this penalty.
My first thought was to add a flag to CREATE TYPE like "EXPANDED = true" or some other better name that indicates that the object can be safely taken ownership of in its expanded state and not copied. The GraphBLAS is specific in its API in that the object handle holder is the owner of the reference, so that would work fine for me. Another option I guess is some kind of whitelist or blacklist telling plpgsql which types can be kept expanded.
And then there is just removing the existing restriction on arrays only. Is any other expanded object out there really interested in being flattened/expanded over and over again?
Thanks,
-Michel
Michel Pelletier <pelletier.michel@gmail.com> writes: > I found this thread from the original path implementation from Tom Lane in > 2015: > https://www.postgresql.org/message-id/E1Ysvgz-0000s0-DP%40gemulon.postgresql.org >> In this initial implementation, a few heuristics have been hard-wired >> into plpgsql to improve performance for arrays that are stored in >> plpgsql variables. We would like to generalize those hacks so that >> other datatypes can obtain similar improvements, but figuring out some >> appropriate APIs is left as a task for future work. Yeah, we thought that it wouldn't be appropriate to try to design general APIs till we had more kinds of expandable objects. Maybe now is a good time to push forward on that. > My first thought was to add a flag to CREATE TYPE like "EXPANDED = true" or > some other better name that indicates that the object can be safely taken > ownership of in its expanded state and not copied. Isn't that inherent in the notion of R/W vs R/O expanded-object pointers? > And then there is just removing the existing restriction on arrays only. > Is any other expanded object out there really interested in being > flattened/expanded over and over again? I'm not sure. It seems certain that if the object is already expanded (either R/W or R/O), the paths for that in plpgsql_exec_function could be taken regardless of its specific type. The thing that is debatable is "if the object is in a flat state, should we forcibly expand it here?". That could be a win if the function later does object accesses that would benefit --- but it might never do so. We chose to always expand arrays, and we've gotten little pushback on that, but the tradeoffs might be different for other sorts of expanded objects. The other problem is that plpgsql only knows how to do such expansion for arrays, and it's not obvious how to extend that part. But it seems like we could get an easy win by adjusting plpgsql_exec_function along the lines of l. 549: - if (!var->isnull && var->datatype->typisarray) + if (!var->isnull) l. 564: - else + else if (var->datatype->typisarray) How far does that improve matters for you? The comment above line 549 cross-references exec_assign_value, but it looks like that's already set up to be similarly type-agnostic about values that are already expanded. regards, tom lane
On Sun, Oct 20, 2024 at 10:13 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Michel Pelletier <pelletier.michel@gmail.com> writes:
> I found this thread from the original path implementation from Tom Lane in
>> appropriate APIs is left as a task for future work.
Yeah, we thought that it wouldn't be appropriate to try to design
general APIs till we had more kinds of expandable objects. Maybe
now is a good time to push forward on that.
Great, I'm happy to be involved!
> My first thought was to add a flag to CREATE TYPE like "EXPANDED = true" or
> some other better name that indicates that the object can be safely taken
> ownership of in its expanded state and not copied.
Isn't that inherent in the notion of R/W vs R/O expanded-object
pointers?
I'm not sure I'm qualified enough to agree with you but I see your point.
> And then there is just removing the existing restriction on arrays only.
> Is any other expanded object out there really interested in being
> flattened/expanded over and over again?
I'm not sure. It seems certain that if the object is already expanded
(either R/W or R/O), the paths for that in plpgsql_exec_function could
be taken regardless of its specific type.
Agreed.
The thing that is debatable
is "if the object is in a flat state, should we forcibly expand it
here?". That could be a win if the function later does object
accesses that would benefit --- but it might never do so. We chose
to always expand arrays, and we've gotten little pushback on that,
but the tradeoffs might be different for other sorts of expanded
objects.
The OneSparse objects will always expand themselves on first use via a DatumGetExpandedArray like macro wrapper, there is no run-time benefit to their flat representation, so I'm fine with not forcing their expansion ahead of time, but once I expand it I'd like it to stay expanded until the function returns (as much as possible) the serialization costs for very large sparse matrices is heavy.
The other problem is that plpgsql only knows how to do such expansion
for arrays, and it's not obvious how to extend that part.
Perhaps a third member function for ExpandedObjectMethods that formalizes the expansion interface like found in DatumGetExpandedArray? I closely follow that same pattern in my code.
But it seems like we could get an easy win by adjusting
plpgsql_exec_function along the lines of
...
How far does that improve matters for you?
I'll give it a try in my local benchmarking code and get back to you, thank you for the fast reply and the insight!
-Michel
Michel Pelletier <pelletier.michel@gmail.com> writes: > On Sun, Oct 20, 2024 at 10:13 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> The other problem is that plpgsql only knows how to do such expansion >> for arrays, and it's not obvious how to extend that part. > Perhaps a third member function for ExpandedObjectMethods that formalizes > the expansion interface like found in DatumGetExpandedArray? I closely > follow that same pattern in my code. The trouble is we don't have an expanded object to consult at this point --- only a flat Datum. plpgsql has hard-wired knowledge that it's okay to apply expand_array if the datatype passes the typisarray tests, but I'm pretty unclear on how to provide similar knowledge for extension datatypes. regards, tom lane
On Sun, Oct 20, 2024 at 10:13 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'm not sure. It seems certain that if the object is already expanded
(either R/W or R/O), the paths for that in plpgsql_exec_function could
be taken regardless of its specific type.
But it seems like we could get an easy win by adjusting
plpgsql_exec_function along the lines of
l. 549:
- if (!var->isnull && var->datatype->typisarray)
+ if (!var->isnull)
l. 564:
- else
+ else if (var->datatype->typisarray)
How far does that improve matters for you?
I tried this change and couldn't get it to work, on the next line:
if (!var->isnull)
{
if (VARATT_IS_EXTERNAL_EXPANDED_RW(DatumGetPointer(var->value)))
{
if (VARATT_IS_EXTERNAL_EXPANDED_RW(DatumGetPointer(var->value)))
var->value might not be a pointer, as it seems at least from my gdb scratching, but say an integer. This segfaults on non-array but non-expandable datum.
I guess this gets back into knowing if a flat thing is expandable or not. I'm going to spend some more time looking at it, I haven't been in this corner of Postgres before.
Another comment that caught my eye was this one:
Not sure what the implication is there.
-Michel
Michel Pelletier <pelletier.michel@gmail.com> writes: > On Sun, Oct 20, 2024 at 10:13 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> But it seems like we could get an easy win by adjusting >> plpgsql_exec_function along the lines of >> ... > I tried this change and couldn't get it to work, on the next line: > if (VARATT_IS_EXTERNAL_EXPANDED_RW(DatumGetPointer(var->value))) > var->value might not be a pointer, as it seems at least from my gdb > scratching, but say an integer. This segfaults on non-array but > non-expandable datum. Oh, duh --- the typisarray test serves to eliminate pass-by-value types. We need the same test that exec_assign_value makes, !var->datatype->typbyval, before it's safe to apply DatumGetPointer. So line 549 needs to be more like - if (!var->isnull && var->datatype->typisarray) + if (!var->isnull && !var->datatype->typbyval) > Another comment that caught my eye was this one: > https://github.com/postgres/postgres/blob/master/src/pl/plpgsql/src/pl_exec.c#L8304 > Not sure what the implication is there. Yeah, that's some more unfinished business. I'm not sure if it matters to your use-case or not. BTW, we probably should move this thread to pgsql-hackers. regards, tom lane