Re: multivariate statistics (v19) - Mailing list pgsql-hackers
From: Dean Rasheed
Subject: Re: multivariate statistics (v19)
Date:
Msg-id: CAEZATCUtGR+U5+QTwjHhe9rLG2nguEysHQ5NaqcK=VbJ78VQFA@mail.gmail.com
In response to: Re: multivariate statistics (v19) (Michael Paquier <michael.paquier@gmail.com>)
Responses: Re: multivariate statistics (v19)
List: pgsql-hackers
On 3 August 2016 at 02:58, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> Attached is v19 of the "multivariate stats" patch series

Hi,

I started looking at this - just at a very high level - I've not read much of the detail yet, but here are some initial review comments.

I think the overall infrastructure approach for CREATE STATISTICS makes sense, and I agree with other suggestions upthread that it would be useful to be able to build statistics on arbitrary expressions. Although that doesn't need to be part of this patch, it's useful to keep it in mind as a possible future extension of this initial design.

I can imagine it being useful to be able to create user-defined statistics on an arbitrary list of expressions, and I think that would include univariate as well as multivariate statistics. Perhaps that's something to take into account in the naming of things, e.g., as David Rowley suggested, something like pg_statistic_ext, rather than pg_mv_statistic.

I also like the idea that this might one day be extended to support statistics across multiple tables, although I think that might be challenging to achieve -- you'd need a method of taking a random sample of rows from a join between 2 or more tables. However, if the intention is to be able to support that one day, I think that needs to be accounted for in the syntax now -- specifically, I think it will be too limiting to only support things extending the current syntax of the form table1(col1, col2, ...), table2(col1, col2, ...), because that precludes building statistics on an expression referring to columns from more than one table.

So I think we should plan further ahead and use a syntax giving greater flexibility in the future, for example something structured more like a query (like CREATE VIEW):

  CREATE STATISTICS name
    [ WITH (options) ]
    ON expression [, ...]
    FROM table [, ...]
    WHERE condition

where the first version of the patch would only support expressions that are simple column references, and would require at least 2 such columns from a single table with no WHERE clause, i.e.:

  CREATE STATISTICS name
    [ WITH (options) ]
    ON column1, column2 [, ...]
    FROM table

For multi-table statistics, a WHERE clause would typically be needed to specify how the tables are expected to be joined, but potentially such a clause might also be useful in single-table statistics, to build partial statistics on a commonly queried subset of the table, just like a partial index.
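To make that a little more concrete, here is a rough sketch of how the two forms might be used (the statistics names, the "dependencies" option and the table/column names are invented purely for illustration, and the multi-table form is speculative future syntax, not something the current patch would support):

  -- single-table statistics on plain column references (the initial form)
  CREATE STATISTICS zip_city_stats
    WITH (dependencies)
    ON zipcode, city
    FROM addresses;

  -- hypothetical future multi-table form, with the join condition in WHERE
  CREATE STATISTICS order_customer_stats
    ON orders.status, customers.region
    FROM orders, customers
    WHERE orders.customer_id = customers.id;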
Of course, I'm not suggesting that the current patch do any of that -- it's big enough as it is. I'm just throwing out possible future directions this might go in, so that we don't get painted into a corner when designing the syntax for the current patch.

Regarding the statistics themselves, I read the description of soft functional dependencies, and I'm somewhat skeptical about that algorithm. I don't like the arbitrary thresholds or the sudden jump from independence to dependence and clause reduction. As others have said, I think this should account for a continuous spectrum of dependence from fully independent to fully dependent, and combine clause selectivities in a way based on the degree of dependence.

For example, if you computed an estimate for the fraction 'f' of the table's rows for which a -> b, then it might be reasonable to combine the selectivities using

  P(a,b) = P(a) * (f + (1-f) * P(b))

Of course, having just a single number that tells you the columns are correlated tells you nothing about whether the clauses on those columns are consistent with that correlation. For example, in the following table

  CREATE TABLE t(a int, b int);
  INSERT INTO t SELECT x/10, ((x/10)*789)%100 FROM generate_series(0,999) g(x);

'b' is functionally dependent on 'a' (and vice versa), but if you query the rows with a<50 and with b<50, those clauses behave essentially independently, because they're not consistent with the functional dependence between 'a' and 'b', so the best way to combine their selectivities is just to multiply them, as we currently do.
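As a quick sanity check of that claim, the relevant counts can be run directly against the table above (the row counts in the comments are what I'd expect from the generated data, so treat them as approximate):

  SELECT count(*) FROM t WHERE a < 50;             -- 500 rows, i.e. 50%
  SELECT count(*) FROM t WHERE b < 50;             -- 500 rows, i.e. 50%
  SELECT count(*) FROM t WHERE a < 50 AND b < 50;  -- ~260 rows, close to the
                                                   -- ~250 (25%) that treating
                                                   -- the clauses as independent
                                                   -- would predict

So despite the exact functional dependence between 'a' and 'b', multiplying the two selectivities gives nearly the right answer here, whereas a "full dependence" clause reduction (using just the more selective clause, i.e. 50%) would presumably be roughly a 2x overestimate.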
So whilst it may be interesting to determine that 'b' is functionally dependent on 'a', it's not obvious whether that fact by itself should be used in the selectivity estimates. Perhaps it should, on the grounds that it's best to attempt to use all the available information, but only if there are no more detailed statistics available. In any case, knowing that there is a correlation can be used as an indicator that it may be worthwhile to build more detailed multivariate statistics like an MCV list or a histogram on those columns.

Looking at the ndistinct coefficient 'q', I think it would be better if the recorded statistic were just the estimate for ndistinct(a,b,...) rather than a ratio of ndistinct values. That's a more fundamental statistic, and it's easier to document and easier to interpret. Also, I don't believe that the coefficient 'q' is the right number to use for clause estimation:

Looking at README.ndistinct, I'm skeptical about the selectivity estimation argument. In the case where a -> b, you'd have q = ndistinct(b), so P(a=1 & b=2) would become 1/ndistinct(a), which is fine for a uniform distribution. But typically there would be univariate statistics on a and b, so if, for example, a=1 were 100x more likely than average, you'd probably know that, and the existing code computing P(a=1) would reflect it, whereas simply using P(a=1 & b=2) = 1/ndistinct(a) would be a significant underestimate, since it would be ignoring known information about the distribution of a.

But likewise, if, as is later argued, you were to use 'q' as a correction factor applied to the individual clause selectivities, you could end up with significant overestimates: if you said P(a=1 & b=2) = q * P(a=1) * P(b=2), and a=1 were 100x more likely than average, and a -> b, then b=2 would also be 100x more likely than average (assuming that b=2 was the value implied by the functional dependency), and that would also be reflected in the univariate statistics on b, so you'd end up with an overall selectivity of around 10000/ndistinct(a), which would be 100x too big. In fact, since a -> b means that q = ndistinct(b), there's a good chance of hitting data for which q * P(b) is greater than 1, so this formula would lead to a combined selectivity greater than P(a), which is obviously nonsense.

Having a better estimate for ndistinct(a,b,...) looks very useful by itself for GROUP BY estimation, and there may be other places that would benefit from it, but I don't think it's the best statistic for determining functional dependence or combining clause selectivities.

That's as much as I've looked at so far. It's such a big patch that it's difficult to consider all at once. I think perhaps the smallest committable self-contained unit providing a tangible benefit would be something containing the core infrastructure plus the ndistinct estimate and the improved GROUP BY estimation.

Regards,
Dean