Re: Parallelized polymorphic aggs, and aggtype vs aggoutputtype - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: Parallelized polymorphic aggs, and aggtype vs aggoutputtype |
Date | |
Msg-id | CA+TgmoYkMqdGnjg7LAbxF9DgYUeJS50ecXV1HOVUAs0W_qzvKA@mail.gmail.com Whole thread Raw |
In response to | Re: Parallelized polymorphic aggs, and aggtype vs aggoutputtype (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Parallelized polymorphic aggs, and aggtype vs aggoutputtype
|
List | pgsql-hackers |
On Fri, Jun 17, 2016 at 1:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> I don't mind if you feel the need to refactor this. > >> I'm not sure yet. What I think we need to work out first is exactly >> how polymorphic parallel aggregates ought to identify which concrete >> data types they are using. There were well-defined rules before, >> but how do we adapt those to two levels of aggregate evaluation? > > After reviewing things a bit, it seems like the first design-level issue > there is what about polymorphic combine functions. So far as the type > system is concerned, there's no issue: the declared signature is always > going to be "combine(foo, foo) returns foo" and it doesn't matter whether > foo is an ordinary type, a polymorphic one, or INTERNAL. The other > question is whether the function itself knows what it's operating on in > the current invocation. For regular polymorphic types this seems no > different from any other usage. If the transtype is INTERNAL, then the > type system provides no help; but we could reasonably assume that the > internal representation itself contains as much information as the combine > func needs. We could add extra arguments like the "finalfunc_extra" > option does for the finalfunc, but I don't really see a need --- that hack > is mainly to satisfy the type system that the finalfunc's signature is > sane, not to tell the finalfunc something it has no other way to find out. > So I think we're probably OK as far as the combine function is concerned. OK.... > The situation is much direr as far as serialize/deserialize are concerned. > These basically don't work at all for polymorphic transtypes: if you try > to declare "deserialize(bytea) returns anyarray", the type system won't > let you. Perhaps that's not an issue because you shouldn't really need > serialize/deserialize for anything except INTERNAL transtype. In fact, I think that it won't even let you declare a serialize or deserialize function for anything other than an INTERNAL transtype right now. There was a PostGIS use case that made it seem like we might want to relax that, but currently I believe that's the law of the land. > However, > there's a far bigger problem which is that "deserialize(bytea) returns > internal" is a blatant security hole, which I see you ignored the defense > against in opr_sanity.sql. The claim in the comment there that it's okay > if we check for aggregate context is a joke --- or haven't you heard that > we allow users to create aggregates? That's not nearly good enough to > prevent unsafe usage of such functions. That may be true, but it isn't obvious to me. The user can't simply say SELECT numeric_avg_deserialize(...) or whatever because the function won't execute when called that way. How would you arrange to call such a function in a context where it wouldn't fail outright but would return a value that you could then hand to some other function that expects a type-internal input? > Not to mention that CREATE > FUNCTION won't allow creation of such functions, so extensions are locked > out of using this feature. Oops. > This has to be redesigned or else reverted entirely. I'm not going to > take no for an answer. Please don't act as if I've been refusing to fix bugs. I've been doing practically nothing else. > A possible solution is to give deserialize an extra dummy argument, along > the lines of "deserialize(bytea, internal) returns internal", thereby > ensuring it can't be called in any non-system-originated contexts. Seems promising. > This > is still rather dangerous if the other argument is variable, as somebody > might be able to abuse an internal-taking function by naming it as the > deserialize function for a maliciously-designed aggregate. The complex and, as far as I can tell, largely undocumented set of rules about what sorts of internal-taking-and-returning functions we allow to exist strikes me as a house of cards just waiting for the next stiff breeze to come along. > What I'm > inclined to do to lock it down further is to drop the "serialtype" > argument to CREATE AGGREGATE, which seems rather pointless (what else > would you ever use besides bytea?). Instead, insist that > serialize/deserialize apply *only* when the transtype is INTERNAL, and > their signatures are exactly "serialize(internal) returns bytea" and > "deserialize(bytea, internal) returns internal", never anything else. I originally had the thought that you might want to make the serialized representation something that could be human-readable. I'd actually like to have an SQL syntax that outputs the transition state or, if it's internal, the serialized representation of it. That would allow us to do partitionwise aggregation given a query like SELECT avg(foo) FROM parent_table_for_several_foreign_tables. You'd need to be able to ship something like SELECT PARTIAL avg(foo) FROM t1 to each child. And it might be nicer if the results came back in some human-readable format rather than as a bytea, but it's not essential, of course. > A different way of locking things down, which might be cleaner in the > long run, is to invent a new pseudo-type for the sole purpose of being > the serialization type, that is we'd insist on the signatures being > "serialize(internal) returns serialized_internal" and > "deserialize(serialized_internal) returns internal", where > serialized_internal has the representation properties of bytea but is > usable for no other purpose than this. Not sure it's worth the trouble > though. I think we should break up internal into various kinds of internal depending on what kind of a thing we've actually got a pointer to. We don't need more synonyms for bytea, IMO. > Anyway, setting aside the security angle, it doesn't seem like there is > anything wrong with the high-level design for polymorphic cases. I'm now > thinking the problem I saw is just a garden variety implementation bug > (or bugs). I'm still not very happy with the confusion around aggtype > though, not least because I suspect it contributed directly to this bug. > I also feel that both the code comments and the user-facing documentation > for this feature are well below project standards, eg where's the > discussion in section 35.10? Good point. That needs to be addressed -- ideally by David, but if he doesn't then I guess I'll have to do it. If you have other specific comments, please send them. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: