Home > mailing lists

Re: Parallelized polymorphic aggs, and aggtype vs aggoutputtype - Mailing list pgsql-hackers

From	Robert Haas
Subject	Re: Parallelized polymorphic aggs, and aggtype vs aggoutputtype
Date	June 17, 2016 18:34:56
Msg-id	CA+TgmoYkMqdGnjg7LAbxF9DgYUeJS50ecXV1HOVUAs0W_qzvKA@mail.gmail.com Whole thread Raw
In response to	Re: Parallelized polymorphic aggs, and aggtype vs aggoutputtype (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: Parallelized polymorphic aggs, and aggtype vs aggoutputtype
List	pgsql-hackers

Tree view

On Fri, Jun 17, 2016 at 1:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> I don't mind if you feel the need to refactor this.
>
>> I'm not sure yet.  What I think we need to work out first is exactly
>> how polymorphic parallel aggregates ought to identify which concrete
>> data types they are using.  There were well-defined rules before,
>> but how do we adapt those to two levels of aggregate evaluation?
>
> After reviewing things a bit, it seems like the first design-level issue
> there is what about polymorphic combine functions.  So far as the type
> system is concerned, there's no issue: the declared signature is always
> going to be "combine(foo, foo) returns foo" and it doesn't matter whether
> foo is an ordinary type, a polymorphic one, or INTERNAL.  The other
> question is whether the function itself knows what it's operating on in
> the current invocation.  For regular polymorphic types this seems no
> different from any other usage.  If the transtype is INTERNAL, then the
> type system provides no help; but we could reasonably assume that the
> internal representation itself contains as much information as the combine
> func needs.  We could add extra arguments like the "finalfunc_extra"
> option does for the finalfunc, but I don't really see a need --- that hack
> is mainly to satisfy the type system that the finalfunc's signature is
> sane, not to tell the finalfunc something it has no other way to find out.
> So I think we're probably OK as far as the combine function is concerned.

OK....

> The situation is much direr as far as serialize/deserialize are concerned.
> These basically don't work at all for polymorphic transtypes: if you try
> to declare "deserialize(bytea) returns anyarray", the type system won't
> let you.  Perhaps that's not an issue because you shouldn't really need
> serialize/deserialize for anything except INTERNAL transtype.

In fact, I think that it won't even let you declare a serialize or
deserialize function for anything other than an INTERNAL transtype
right now.  There was a PostGIS use case that made it seem like we
might want to relax that, but currently I believe that's the law of
the land.

> However,
> there's a far bigger problem which is that "deserialize(bytea) returns
> internal" is a blatant security hole, which I see you ignored the defense
> against in opr_sanity.sql.  The claim in the comment there that it's okay
> if we check for aggregate context is a joke --- or haven't you heard that
> we allow users to create aggregates?  That's not nearly good enough to
> prevent unsafe usage of such functions.

That may be true, but it isn't obvious to me.  The user can't simply
say SELECT numeric_avg_deserialize(...) or whatever because the
function won't execute when called that way.  How would you arrange to
call such a function in a context where it wouldn't fail outright but
would return a value that you could then hand to some other function
that expects a type-internal input?

> Not to mention that CREATE
> FUNCTION won't allow creation of such functions, so extensions are locked
> out of using this feature.

Oops.

> This has to be redesigned or else reverted entirely.  I'm not going to
> take no for an answer.

Please don't act as if I've been refusing to fix bugs.  I've been
doing practically nothing else.

> A possible solution is to give deserialize an extra dummy argument, along
> the lines of "deserialize(bytea, internal) returns internal", thereby
> ensuring it can't be called in any non-system-originated contexts.

Seems promising.

> This
> is still rather dangerous if the other argument is variable, as somebody
> might be able to abuse an internal-taking function by naming it as the
> deserialize function for a maliciously-designed aggregate.

The complex and, as far as I can tell, largely undocumented set of
rules about what sorts of internal-taking-and-returning functions we
allow to exist strikes me as a house of cards just waiting for the
next stiff breeze to come along.

> What I'm
> inclined to do to lock it down further is to drop the "serialtype"
> argument to CREATE AGGREGATE, which seems rather pointless (what else
> would you ever use besides bytea?).  Instead, insist that
> serialize/deserialize apply *only* when the transtype is INTERNAL, and
> their signatures are exactly "serialize(internal) returns bytea" and
> "deserialize(bytea, internal) returns internal", never anything else.

I originally had the thought that you might want to make the
serialized representation something that could be human-readable.  I'd
actually like to have an SQL syntax that outputs the transition state
or, if it's internal, the serialized representation of it.  That would
allow us to do partitionwise aggregation given a query like SELECT
avg(foo) FROM parent_table_for_several_foreign_tables.  You'd need to
be able to ship something like SELECT PARTIAL avg(foo) FROM t1 to each
child.  And it might be nicer if the results came back in some
human-readable format rather than as a bytea, but it's not essential,
of course.

> A different way of locking things down, which might be cleaner in the
> long run, is to invent a new pseudo-type for the sole purpose of being
> the serialization type, that is we'd insist on the signatures being
> "serialize(internal) returns serialized_internal" and
> "deserialize(serialized_internal) returns internal", where
> serialized_internal has the representation properties of bytea but is
> usable for no other purpose than this.  Not sure it's worth the trouble
> though.

I think we should break up internal into various kinds of internal
depending on what kind of a thing we've actually got a pointer to.  We
don't need more synonyms for bytea, IMO.

> Anyway, setting aside the security angle, it doesn't seem like there is
> anything wrong with the high-level design for polymorphic cases.  I'm now
> thinking the problem I saw is just a garden variety implementation bug
> (or bugs).  I'm still not very happy with the confusion around aggtype
> though, not least because I suspect it contributed directly to this bug.
> I also feel that both the code comments and the user-facing documentation
> for this feature are well below project standards, eg where's the
> discussion in section 35.10?

Good point.  That needs to be addressed -- ideally by David, but if he
doesn't then I guess I'll have to do it.  If you have other specific
comments, please send them.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

pgsql-hackers by date:

From: Aleksey Demakov
Date: 17 June 2016, 18:34:46
Subject: Re: Experimental dynamic memory allocation of postgresql shared memory

From: "David G. Johnston"
Date: 17 June 2016, 18:37:54
Subject: Re: Experimental dynamic memory allocation of postgresql shared memory

Re: Parallelized polymorphic aggs, and aggtype vs aggoutputtype - Mailing list pgsql-hackers

Previous

Next