Re: planner support functions: handle GROUP BY estimates ? - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: planner support functions: handle GROUP BY estimates ? |
Date | |
Msg-id | 27578.1579032741@sss.pgh.pa.us Whole thread Raw |
In response to | Re: planner support functions: handle GROUP BY estimates ? (Justin Pryzby <pryzby@telsasoft.com>) |
Responses |
Re: planner support functions: handle GROUP BY estimates ?
|
List | pgsql-hackers |
Justin Pryzby <pryzby@telsasoft.com> writes: > On Sun, Dec 22, 2019 at 06:16:48PM -0600, Justin Pryzby wrote: >> On Tue, Nov 19, 2019 at 01:34:21PM -0600, Justin Pryzby wrote: >>> Tom implemented "Planner support functions": >>> https://www.postgresql.org/docs/12/xfunc-optimization.html >>> I wondered whether there was any consideration to extend that to allow >>> providing improved estimates of "group by". That currently requires manually >>> by creating an expression index, if the function is IMMUTABLE (which is not >>> true for eg. date_trunc of timestamptz). >> I didn't hear back so tried implementing this for date_trunc(). Currently, the >> ... >> If the input timestamps have (say) hourly granularity, rowcount will be >> *underestimated* by 3600x, which is worse than the behavior in master of >> overestimating by (for "day") 24x. While I don't have any objection in principle to extending the set of things planner support functions can do, it doesn't seem like the idea is giving you all that much traction for this problem. There isn't that much knowledge that's specific to date_trunc in this, and instead you've got a bunch of generic problems (that would have to be solved again in every other function's planner support). Another issue is that it seems like this doesn't compose nicely --- if the GROUP BY expression is "f(g(x))", how do f's support function and g's support function interact? The direction that I've been wanting to go in for this kind of problem is to allow CREATE STATISTICS on an expression, ie if you were concerned about the estimation accuracy for GROUP BY or anything else, you could do something like CREATE STATISTICS foo ON date_trunc('day', mod_time) FROM my_table; This would have the effect of cueing ANALYZE to gather stats on the value of that expression, which the planner could then use, very much as if you'd created an index on the expression. The advantages of doing this rather than making an index are (1) you don't have to pay the maintenance costs for an index, (2) we don't have to restrict it to immutable expressions. (Volatile expressions would have to be disallowed, if only because of fear of side-effects; but I think we could allow stable expressions just fine. Worst case problem is that the stats are stale, but so what?) With a solution like this, we don't have to solve any of the difficult problems of how the pieces of the expression interact with each other or with the statistics of the underlying column(s). We just use the stats if available, and the estimate will be as good as it'd be for a plain column reference. I'm not sure how much new infrastructure would have to be built for this. We designed the CREATE STATISTICS syntax to support this (partly at my insistence IIRC) but I do not think any of the existing plumbing is ready for it. I don't think it'd be very hard to plug this into ANALYZE or the planner, but there might be quite some work to be done on the catalog infrastructure, pg_dump, etc. cc'ing Tomas in case he has any thoughts about it. regards, tom lane
pgsql-hackers by date: