Thread: improving GROUP BY estimation
Hi,

while working on using correlation statistics when estimating GROUP BY (or generally whatever uses estimate_num_groups), I've noticed that we apply the selectivity in a rather naive way. After estimating the number of groups in the whole table - either by reading pg_stats.n_distinct for a single column, or by multiplying (and doing some additional magic) for a group of columns - we do this:

    reldistinct *= rel->rows / rel->tuples;

which essentially applies the selectivity to the ndistinct estimate. So when you have a table with 1000 distinct values in a column and we read 10% of the table, we expect to get 100 distinct values.

But consider for example a table with 100M rows, where the column has n_distinct=10000 and we're reading 1% of the table (i.e. 1M rows). The current code will estimate the GROUP BY cardinality to be 100 (1% of n_distinct), but in reality we're more likely to see all 10000 distinct values.

Example:

CREATE TABLE t (a INT, b FLOAT);
INSERT INTO t SELECT (100000 * random())::int, random()
  FROM generate_series(1,10000000) s(i);
ANALYZE t;

SELECT n_distinct FROM pg_stats WHERE attname = 'a';

 n_distinct
------------
      98882

-- 1% of the table
EXPLAIN ANALYZE SELECT a FROM t WHERE b < 0.01 GROUP BY a;

                             QUERY PLAN
-------------------------------------------------------------------------
 HashAggregate  (cost=179514.33..179523.45 rows=912 width=4) (actual time=916.224..955.136 rows=63492 loops=1)
   Group Key: a
   ->  Seq Scan on t  (cost=0.00..179053.25 rows=92216 width=4) (actual time=0.103..830.104 rows=100268 loops=1)
         Filter: (b < '0.01'::double precision)
         Rows Removed by Filter: 9899732
 Planning time: 1.237 ms
 Execution time: 982.600 ms
(7 rows)

-- 5% of the table
EXPLAIN ANALYZE SELECT a FROM t WHERE b < 0.05 GROUP BY a;

                             QUERY PLAN
-------------------------------------------------------------------------
 HashAggregate  (cost=180296.06..180345.22 rows=4916 width=4) (actual time=1379.519..1440.005 rows=99348 loops=1)
   Group Key: a
   ->  Seq Scan on t  (cost=0.00..179053.25 rows=497123 width=4) (actual time=0.096..1000.494 rows=500320 loops=1)
         Filter: (b < '0.05'::double precision)
         Rows Removed by Filter: 9499680
 Planning time: 0.129 ms
 Execution time: 1480.986 ms
(7 rows)

This gets especially bad when reading only a small fraction of a huge table - it's easy to construct cases with arbitrarily large error. And it's worth noting that the error is almost exclusively a massive under-estimate, i.e. in the "wrong" direction, as HashAggregate is still vulnerable to OOM (and the larger the table, the worse it gets).

So I think a more appropriate way to estimate this would be to find the expected number of distinct values when sampling with replacement, as explained for example in [1].
Applied to the code, it means changing the line from

    reldistinct *= rel->rows / rel->tuples;

to

    reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows));

With this change, the estimates for the examples look like this:

                             QUERY PLAN
-------------------------------------------------------------------------
 HashAggregate  (cost=179283.79..179883.48 rows=59969 width=4) (actual time=897.071..934.764 rows=63492 loops=1)
   Group Key: a
   ->  Seq Scan on t  (cost=0.00..179053.25 rows=92216 width=4) (actual time=0.104..820.276 rows=100268 loops=1)
         Filter: (b < '0.01'::double precision)
         Rows Removed by Filter: 9899732
 Planning time: 1.261 ms
 Execution time: 962.812 ms
(7 rows)

and

                             QUERY PLAN
-------------------------------------------------------------------------
 Group  (cost=232886.15..235371.76 rows=98234 width=4) (actual time=1519.545..2104.447 rows=99348 loops=1)
   Group Key: a
   ->  Sort  (cost=232886.15..234128.95 rows=497123 width=4) (actual time=1519.540..1838.575 rows=500320 loops=1)
         Sort Key: a
         Sort Method: external merge  Disk: 6824kB
         ->  Seq Scan on t  (cost=0.00..179053.25 rows=497123 width=4) (actual time=0.090..1044.099 rows=500320 loops=1)
               Filter: (b < '0.05'::double precision)
               Rows Removed by Filter: 9499680
 Planning time: 0.133 ms
 Execution time: 2147.340 ms
(10 rows)

So much better. Clearly, there are cases where this will over-estimate the cardinality - for example when the values are somehow correlated. But I'd argue over-estimating is better because of the OOM issues in Hash Aggregate.

regards

[1] http://math.stackexchange.com/questions/72223/finding-expected-number-of-distinct-values-selected-from-a-set-of-integers

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
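To make the difference concrete, here is a small standalone sketch (illustrative only - the helper names are made up and this is not part of the patch) that evaluates both estimators for roughly the numbers in the 1% example above:

#include <math.h>
#include <stdio.h>

/* current approach: scale ndistinct by the restriction selectivity */
static double
groups_current(double ndistinct, double rows, double tuples)
{
	return ndistinct * (rows / tuples);
}

/* proposed approach: expected number of distinct values when drawing
 * "rows" values with replacement from "ndistinct" equally likely values */
static double
groups_proposed(double ndistinct, double rows)
{
	return ndistinct * (1.0 - powl((ndistinct - 1) / ndistinct, rows));
}

int
main(void)
{
	double	tuples = 10000000;	/* rows in table t */
	double	ndistinct = 100000;	/* distinct values in column a */
	double	rows = 100000;		/* rows surviving "b < 0.01" (~1%) */

	printf("current:  %.0f\n", groups_current(ndistinct, rows, tuples));	/* ~1000 */
	printf("proposed: %.0f\n", groups_proposed(ndistinct, rows));		/* ~63212 */

	return 0;
}

The proposed estimate is much closer to the ~63k groups the HashAggregate actually produced.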
Attachment
> On Feb 23, 2016, at 5:12 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> <snip>
>
> So much better. Clearly, there are cases where this will over-estimate the cardinality - for example when the values are somehow correlated.
>

I applied your patch, which caused a few regression tests to fail. Attached is a patch that includes the necessary changes to the expected test results.

It is not hard to write test cases where your patched version overestimates the number of rows by a very similar factor as the old code underestimates them. My very first test, which was not specifically designed to demonstrate this, happens to be one such example:

CREATE TABLE t (a INT, b int);
INSERT INTO t SELECT sqrt(gs)::int, gs FROM generate_series(1,10000000) gs;
ANALYZE t;
EXPLAIN SELECT a FROM t WHERE b < 1000 GROUP BY a;
                           QUERY PLAN
---------------------------------------------------------------
 HashAggregate  (cost=169250.21..169258.71 rows=850 width=4)
   Group Key: a
   ->  Seq Scan on t  (cost=0.00..169247.71 rows=1000 width=4)
         Filter: (b < 1000)
(4 rows)

SELECT COUNT(*) FROM (SELECT a FROM t WHERE b < 1000 GROUP BY a) AS ss;
 count
-------
    32
(1 row)

So, it estimates 850 rows where only 32 are returned. Without applying your patch, it estimates just 1 row where 32 are returned. That's an overestimate of roughly 26 times, rather than an underestimate of 32 times.

As a patch review, I'd say that your patch does what you claim it does, and it applies cleanly, and passes the regression tests with my included modifications. I think there needs to be some discussion on the list about whether the patch is a good idea.

Mark Dilger
Attachment
> On Feb 25, 2016, at 3:16 PM, Mark Dilger <hornschnorter@gmail.com> wrote:
>
> <snip>
>
> As a patch review, I'd say that your patch does what you claim it does, and it applies
> cleanly, and passes the regression tests with my included modifications. I think there
> needs to be some discussion on the list about whether the patch is a good idea.
>
> Mark Dilger
>
> <estimate-num-groups-v2.txt>

Assuming the goal is to minimize the degree to which the estimates are inaccurate, I get better results by a kind of averaging of the old values from the current code base and the new values from Tomas's patch. I tried this, and at least for the few examples we've been throwing around, I found the results were not as wildly inaccurate in the worst case examples as in either of the other two approaches:

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 46c95b0..c83135a 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3414,6 +3414,8 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 			Assert(rel->reloptkind == RELOPT_BASEREL);
 			if (rel->tuples > 0)
 			{
+				double		old_factor;		/* The way it is currently done on master */
+				double		new_factor;		/* The way Tomas Vondra proposes doing it */
 				/*
 				 * Clamp to size of rel, or size of rel / 10 if multiple Vars. The
 				 * fudge factor is because the Vars are probably correlated but we
@@ -3440,7 +3442,10 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 				/*
 				 * Multiply by restriction selectivity.
 				 */
-				reldistinct *= rel->rows / rel->tuples;
+				old_factor = rel->rows / rel->tuples;	/* old way */
+				new_factor = (1 - powl((reldistinct - 1) / reldistinct, rel->rows));	/* Tomas's way */
+
+				reldistinct *= sqrt(old_factor * new_factor);	/* average of old way and new way, sort of */

 				/*
 				 * Update estimate of total distinct groups.
Hi,

On 02/26/2016 12:16 AM, Mark Dilger wrote:
>
> It is not hard to write test cases where your patched version overestimates
> the number of rows by a very similar factor as the old code underestimates
> them. My very first test, which was not specifically designed to demonstrate
> this, happens to be one such example:
>
> <snip>
>
> So, it estimates 850 rows where only 32 are returned. Without
> applying your patch, it estimates just 1 row where 32 are returned.
> That's an overestimate of roughly 26 times, rather than an
> underestimate of 32 times.

The culprit here is that the two columns are not independent, but are (rather strongly) correlated due to the way you've generated the data.

In cases like this (with correlated columns), it's mostly just luck how accurate estimates you get, no matter which of the estimators you use. It's pretty easy to generate arbitrary errors by breaking the independence assumption, and it's not just this particular place of the planner. And it won't change until we add some sort of smartness about dependencies between columns.

I think over-estimates are a bit more acceptable in this case, e.g. because of the HashAggregate OOM risk. Also, I've seen too many nested loop cascades due to under-estimates recently, but that's a bit unrelated.

> As a patch review, I'd say that your patch does what you claim it
> does, and it applies cleanly, and passes the regression tests with
> my included modifications. I think there needs to be some discussion
> on the list about whether the patch is a good idea.

Yes, that's why I haven't bothered with fixing the regression tests in the patch, actually.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> On Feb 25, 2016, at 4:59 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> Hi,
>
> On 02/26/2016 12:16 AM, Mark Dilger wrote:
>>
>> It is not hard to write test cases where your patched version overestimates
>> the number of rows by a very similar factor as the old code underestimates
>> them. My very first test, which was not specifically designed to demonstrate
>> this, happens to be one such example:
>>
>> <snip>
>>
>> So, it estimates 850 rows where only 32 are returned. Without
>> applying your patch, it estimates just 1 row where 32 are returned.
>> That's an overestimate of roughly 26 times, rather than an
>> underestimate of 32 times.
>
> The culprit here is that the two columns are not independent, but are (rather strongly) correlated due to the way you've generated the data.

Yes, that was intentional. Your formula is exactly correct, so far as I can tell, for completely uncorrelated data. I don't have any tables with completely uncorrelated data, and was therefore interested in what happens when the data is correlated and your patch is applied.

BTW, the whole reason I responded to your post is that I think I would like to have your changes in the code base. I'm just playing Devil's Advocate here, to see if there is room for any improvement.

> In cases like this (with correlated columns), it's mostly just luck how accurate estimates you get, no matter which of the estimators you use. It's pretty easy to generate arbitrary errors by breaking the independence assumption, and it's not just this particular place of the planner. And it won't change until we add some sort of smartness about dependencies between columns.
>
> I think over-estimates are a bit more acceptable in this case, e.g. because of the HashAggregate OOM risk. Also, I've seen too many nested loop cascades due to under-estimates recently, but that's a bit unrelated.

I have seen similar problems in systems I maintain, hence my interest in your patch.

>> As a patch review, I'd say that your patch does what you claim it
>> does, and it applies cleanly, and passes the regression tests with
>> my included modifications. I think there needs to be some discussion
>> on the list about whether the patch is a good idea.
>
> Yes, that's why I haven't bothered with fixing the regression tests in the patch, actually.

Right, but for the group as a whole, I thought it might generate some feedback if people saw what else changed in the regression results. If you look, the changes have to do with plans chosen and row estimates -- exactly the sort of stuff you would expect to change.

Thanks for the patch submission. I hope my effort to review it is on the whole more helpful than harmful.

Mark Dilger
Hi,

On 02/26/2016 04:32 AM, Mark Dilger wrote:
>
>> On Feb 25, 2016, at 4:59 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> ...
>>
>> The culprit here is that the two columns are not independent, but
>> are (rather strongly) correlated due to the way you've generated
>> the data.
>
> Yes, that was intentional. Your formula is exactly correct, so far as
> I can tell, for completely uncorrelated data. I don't have any tables
> with completely uncorrelated data, and was therefore interested in
> what happens when the data is correlated and your patch is applied.
>
> BTW, the whole reason I responded to your post is that I think I would like
> to have your changes in the code base. I'm just playing Devil's Advocate
> here, to see if there is room for any improvement.

Sure, that's how I understood it. I just wanted to point out the correlation, as that might not have been obvious to others.

> Thanks for the patch submission. I hope my effort to review it is on
> the whole more helpful than harmful.

Thanks for the feedback!

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, Tomas!
I've assigned to review this patch.
I've checked version estimate-num-groups-v2.txt by Mark Dilger.
It applies to head cleanly, passes corrected regression tests.
About correlated/uncorrelated cases: I think the optimizer currently mostly assumes all distributions to be independent.
I think we should follow this assumption in this case as well, until we have a fundamentally better option (like your multivariate statistics).
@@ -3438,9 +3438,9 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
reldistinct = clamp;
/*
- * Multiply by restriction selectivity.
+ * Estimate number of distinct values expected in given number of rows.
*/
- reldistinct *= rel->rows / rel->tuples;
+ reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows));
/*
* Update estimate of total distinct groups.
I think we need way more explanation in comments here (possibly with external links). For now, it looks like a formula which appears from nowhere.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi,

On 03/03/2016 12:53 PM, Alexander Korotkov wrote:
> Hi, Tomas!
>
> I've assigned to review this patch.
>
> I've checked version estimate-num-groups-v2.txt by Mark Dilger.
> It applies to head cleanly, passes corrected regression tests.
>
> About correlated/uncorrelated cases: I think the optimizer currently mostly assumes
> all distributions to be independent. I think we should follow this
> assumption in this case as well, until we have a fundamentally better option
> (like your multivariate statistics).
>
> @@ -3438,9 +3438,9 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs,
> double input_rows,
>     reldistinct = clamp;
>
>     /*
> -    * Multiply by restriction selectivity.
> +    * Estimate number of distinct values expected in given number of rows.
>     */
> -   reldistinct *= rel->rows / rel->tuples;
> +   reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows));
>
>     /*
>     * Update estimate of total distinct groups.
>
> I think we need way more explanation in comments here (possibly with
> external links). For now, it looks like a formula which appears from nowhere.

I thought the details (particularly the link to stackexchange, discussing the formula) would be enough, but let me elaborate.

The current formula

    reldistinct *= rel->rows / rel->tuples;

simply multiplies the ndistinct estimate with selectivity. So when you select 1% of the table, the estimate says we'll see 1% of ndistinct values. But clearly that's naive, because for example when you have 10k distinct values and you select 10M rows, you should expect nearly all of them in the sample.

And that's what the formula does - it gives you the expected number of distinct values when selecting 'k' values from 'd' distinct values with replacement:

    count(k, d) = d * (1 - ((d-1)/d) ^ k)

It's assuming the distinct values are equally probable (uniform distribution) and that the probabilities do not change (that's the "with replacement" part).

I guess we could relax those assumptions and for example use the MCV statistics to further improve the estimate, and also relax the 'with replacement' assumption, but that'd make the formula way more complicated.

[1] http://math.stackexchange.com/questions/72223/finding-expected-number-of-distinct-values-selected-from-a-set-of-integers

> Also, note that rel->tuples gone away from formula parameters. So, we
> actually don't care about how many tuples are in the table. This is because
> this formula assumes that the same tuple could be selected multiple times. For
> low numbers this can lead to some errors. Consider selecting 2 from 3
> distinct tuples. If you can't select same tuple twice then you always
> selecting 2 distinct tuples. But according to formula you will select 5/3 in
> average. I think this error in small numbers is negligible, especially
> because getting rid of this error would require way more computations. But
> it worth mentioning in comments though.

Well, yeah. That's the consequence of the 'with replacement' assumption.

I guess we could handle this somehow, though. For example we can compute the minimum possible number of distinct values like this:

    average_rows_per_value = (tuples / ndistinct);
    min_estimate = ceil(rows / average_rows_per_value);

and use that as a minimum for the estimate. In the example you've given this would say

    average_rows_per_value = (3 / 3) = 1;
    min_estimate = ceil(2 / 1) = 2;

So it serves as a correction for the 'with replacement' assumption.

With only 2 distinct values we'd get this:

    average_rows_per_value = (3 / 2) = 1.5;
    min_estimate = ceil(2 / 1.5) = 2;

Of course, it's just a heuristic, so it may fail in some other cases.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
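To illustrate how the suggested correction could fit together with the estimator, here is a small standalone sketch (the function name and the clamping are made up; this is not the patch itself):

#include <math.h>

/*
 * Expected number of distinct values among "rows" rows sampled from a
 * table with "tuples" rows and "ndistinct" distinct values, using the
 * "with replacement" formula plus the minimum-based correction above.
 */
static double
estimate_groups(double ndistinct, double rows, double tuples)
{
	double	estimate;
	double	min_estimate;

	/* expected #distinct when drawing "rows" values with replacement */
	estimate = ndistinct * (1.0 - powl((ndistinct - 1) / ndistinct, rows));

	/* each value occupies (tuples / ndistinct) rows on average, so the
	 * selected rows must contain at least this many distinct values */
	min_estimate = ceil(rows / (tuples / ndistinct));

	if (estimate < min_estimate)
		estimate = min_estimate;
	if (estimate > ndistinct)
		estimate = ndistinct;

	return estimate;
}

For the 2-of-3 example above this returns 2 rather than 5/3.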
> On Mar 3, 2016, at 8:35 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> <snip>
>
> The current formula
>
>     reldistinct *= rel->rows / rel->tuples;
>
> simply multiplies the ndistinct estimate with selectivity. So when you
> select 1% of the table, the estimate says we'll see 1% of ndistinct
> values. But clearly that's naive, because for example when you have 10k
> distinct values and you select 10M rows, you should expect nearly all of
> them in the sample.

The current formula assumes total dependence between the columns. You can create tables where this is true, and the formula gives precisely the right estimate. I'm not claiming this is common, or that it should be the default assumption, but it is one end of the spectrum of possible correct estimates.

> And that's what the formula does - it gives you the expected number of
> distinct values when selecting 'k' values from 'd' distinct values with
> replacement:
>
>     count(k, d) = d * (1 - ((d-1)/d) ^ k)
>
> It's assuming the distinct values are equally probable (uniform
> distribution) and that the probabilities do not change (that's the "with
> replacement" part).

The new formula assumes total independence between the columns. You can likewise create tables where this is true, and you did so upthread, and for those tables it gives precisely the right estimate. This is the other end of the spectrum of possible correct estimates.

In the absence of multivariate statistics, either because you haven't committed that work yet, or because the multivariate data the planner needs hasn't been collected, choosing an estimate exactly in the middle of the spectrum (whatever that would mean mathematically) would minimize the maximum possible error in the estimate.

I have the sense that my point of view is in the minority, because the expectation is the problems caused by independence assumptions will only be fixed when multivariate statistics are implemented and available, and so we should just keep the independence assumptions everywhere until that time. I tend to disagree with that, on the grounds that even when that work is finished, we'll never have complete coverage of every possible set of columns and what their degree of dependence is. Perhaps I am badly mistaken. Can somebody more familiar with the issues than I am please disabuse me of my wrongheadedness?

Mark Dilger
On Thu, Mar 3, 2016 at 7:35 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> On 03/03/2016 12:53 PM, Alexander Korotkov wrote:
>> I think we need way more explanation in comments here (possibly with
>> external links). For now, it looks like a formula which appears from nowhere.
>
> I thought the details (particularly the link to stackexchange, discussing the formula) would be enough, but let me elaborate.
>
> <snip>
Your explanation in the first mail was good enough. Now it's even better. But actually I mean that we need to include some brief explanation in the source code (either comments or a README). It would be good if someone who wants to understand this could do so without searching the mailing list archives or git history.
>> Also, note that rel->tuples gone away from formula parameters. So, we
>> actually don't care about how many tuples are in the table. This is because
>> this formula assumes that the same tuple could be selected multiple times.
>>
>> <snip>
>
> Well, yeah. That's the consequence of the 'with replacement' assumption.
>
> I guess we could handle this somehow, though. For example we can compute the minimum possible number of distinct values like this:
>
>     average_rows_per_value = (tuples / ndistinct);
>     min_estimate = ceil(rows / average_rows_per_value);
>
> and use that as a minimum for the estimate.
>
> <snip>
>
> Of course, it's just a heuristic, so it may fail in some other cases.
I'm not sure we actually need it. Even in the worst case the error doesn't seem to be very big. But feel free to add this heuristic; it looks OK to me.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi,

On 03/03/2016 06:40 PM, Mark Dilger wrote:
> The current formula assumes total dependence between the columns. You can
> create tables where this is true, and the formula gives precisely the right
> estimate. I'm not claiming this is common, or that it should be the default
> assumption, but it is one end of the spectrum of possible correct
> estimates.

I'm not really sure what you mean by dependence between columns, as both the old and new formula only work with the total cardinality and selectivity, and have absolutely no idea about multiple columns.

In case you mean dependence between the columns in the GROUP BY and the columns used to compute the selectivity (i.e. referenced in WHERE), then perhaps you could say it like that, although I think there's no such explicit assumption.

> The new formula assumes total independence between the columns. You can
> likewise create tables where this is true, and you did so upthread, and for
> those tables it gives precisely the right estimate. This is the other end of
> the spectrum of possible correct estimates.
>
> In the absence of multivariate statistics, either because you haven't
> committed that work yet, or because the multivariate data the planner needs
> hasn't been collected, choosing an estimate exactly in the middle of the
> spectrum (whatever that would mean mathematically) would minimize the
> maximum possible error in the estimate.

FWIW the current version of multivariate statistics can't really help in this case. Perhaps it will help in the future, but it's way more complicated than it might seem, as it requires transferring information from the WHERE clause to the GROUP BY clause.

> I have the sense that my point of view is in the minority, because the
> expectation is the problems caused by independence assumptions will only be
> fixed when multivariate statistics are implemented and available, and so we
> should just keep the independence assumptions everywhere until that time. I
> tend to disagree with that, on the grounds that even when that work is
> finished, we'll never have complete coverage of every possible set of
> columns and what their degree of dependence is.

Well, in a sense you're right that those estimates work best in different situations. The problem with constructing an estimator the way you propose (simply returning an average) is that in both the extreme cases (perfect dependence or independence) one of those estimates is really bad.

Moreover, the new formula typically produces higher values than the old one. Consider for example this:

CREATE TABLE t AS SELECT (10000 * random())::int a,
                         (10000 * random())::int b
                    FROM generate_series(1,1000000) s(i);

and let's see estimates for queries like this:

SELECT a FROM t WHERE b < $1 GROUP BY a;

Then for different values of $1 we get this:

        |  10 |   50 |  100 |  500 |  1000 |  5000
--------------------------------------------------------
actual  | 919 | 3829 | 6244 | 9944 | 10001 | 10001
current |  10 |   50 |  102 |  516 |  1018 |  4996
new     | 973 | 4001 | 6382 | 9897 |  9951 |  9951
average | 491 | 2025 | 3205 | 5206 |  5484 |  7473

and for a table with perfectly dependent columns:

CREATE TABLE t AS SELECT mod(i,10000) a,
                         mod(i,10000) b
                    FROM generate_series(1,1000000) s(i);

we get this:

        |  10 |   50 |  100 |  500 |  1000 |  5000
--------------------------------------------------------
actual  |  10 |   50 |  100 |  500 |  1000 |  5000
current |  10 |   53 |  105 |  508 |  1016 |  5014
new     | 880 | 4105 | 6472 | 9955 | 10018 | 10018
average | 445 | 2079 | 3288 | 5231 |  5517 |  7516

So yes, each estimator works great for exactly the opposite cases. But notice that typically, the results of the new formula are much higher than the old one, sometimes by two orders of magnitude (and it shouldn't be difficult to construct examples of much higher differences).

The table also includes the 'average' estimator you propose, but it's rather obvious that the result is always much closer to the new value, simply because

    (small number) + (huge number)
    ------------------------------
                  2

is always much closer to the huge number. We're usually quite happy when the estimates are within the same order of magnitude, so whether it's K or K/2 makes pretty much no difference.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Mar 3, 2016 at 10:16 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> So yes, each estimator works great for exactly the opposite cases. But notice that typically, the results of the new formula are much higher than the old one, sometimes by two orders of magnitude (and it shouldn't be difficult to construct examples of much higher differences).
>
> The table also includes the 'average' estimator you propose, but it's rather obvious that the result is always much closer to the new value, simply because
>
>     (small number) + (huge number)
>     ------------------------------
>                   2
>
> is always much closer to the huge number. We're usually quite happy when the estimates are within the same order of magnitude, so whether it's K or K/2 makes pretty much no difference.
I believe that Mark means geometrical average, i.e. sqrt((small number) * (huge number)).
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
> On Mar 3, 2016, at 11:27 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
>
> On Thu, Mar 3, 2016 at 10:16 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> So yes, each estimator works great for exactly the opposite cases. <snip>
>>
>> is always much closer to the huge number. We're usually quite happy when the estimates are within the same order of magnitude, so whether it's K or K/2 makes pretty much no difference.
>
> I believe that Mark means geometrical average, i.e. sqrt((small number) * (huge number)).

Yes, that is what I proposed upthread. I'm not wedded to that, though. In general, I am with Tomas on this one, believing that his estimate will be much better than the current estimate. But I believe the *best* estimate will be somewhere between his and the current one, and I'm fishing for any decent way of coming up with a weighted average that is closer to his than to the current, but not simply equal to his proposal.

The reason I want the formula to be closer to Tomas's than to the current is that I think that on average, across all tables, across all databases, in practice it will be closer to the right estimate than the current formula. That's just my intuition, and so I can't defend it. But if my intuition is right, the best formula we can adopt would be one that is moderated from his by a little bit, and in the direction of the estimate that the current code generates.

I could easily lose this debate merely for lack of a principled basis for saying how far toward the current estimate the new estimate should be adjusted. The geometric average is one suggestion, but I don't have a principled argument for it.

Like I said above, I'm fishing for a decent formula here.

Mark Dilger
On Thu, 2016-03-03 at 11:42 -0800, Mark Dilger wrote:
>> On Mar 3, 2016, at 11:27 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
>>
>> I believe that Mark means geometrical average, i.e. sqrt((small number) * (huge number)).

Ah, OK. Haven't realized you meant the geometric mean. With that it looks like this:

1) independent

                 10     50    100    500   1000   5000
------------------------------------------------------------------
actual          919   3829   6244   9944  10001  10001
current          10     50    102    516   1018   4996
new             973   4001   6382   9897   9951   9951
average         491   2025   3205   5206   5484   7473
geom. mean       99    447    807   2260   3183   7051

2) dependent

                 10     50    100    500   1000   5000
------------------------------------------------------------------
actual           10     50    100    500   1000   5000
current          10     53    105    508   1016   5014
new             880   4105   6472   9955  10018  10018
average         445   2079   3288   5231   5517   7516
geom. mean       94    466    824   2249   3190   7087

To better illustrate the differences, we may use the "ratio error" defined as

    err(a,b) = MAX(a/b, b/a)

where 'a' is the actual value and 'b' is the estimate. That gives us:

1) independent

                 10     50    100    500   1000   5000
------------------------------------------------------------------
current       91.90  76.58  61.22  19.27   9.82   2.00
new            1.06   1.04   1.02   1.00   1.01   1.01
average        1.87   1.89   1.95   1.91   1.82   1.34
geom. mean     9.32   8.56   7.74   4.40   3.14   1.42

2) dependent

                 10     50    100    500   1000   5000
------------------------------------------------------------------
current        1.00   1.06   1.05   1.02   1.02   1.00
new           88.00  82.10  64.72  19.91  10.02   2.00
average       44.50  41.58  32.88  10.46   5.52   1.50
geom. mean     9.38   9.33   8.24   4.50   3.19   1.42

So the geometric mean seems to be working better than the plain average. But I don't think we can simply conclude we should use the geometric mean of the estimates, as that assumes the risks of over- and under-estimating the cardinality are the same. Because they aren't.

> Yes, that is what I proposed upthread. I'm not wedded to that, though.
> In general, I am with Tomas on this one, believing that his estimate
> will be much better than the current estimate. But I believe the *best*
> estimate will be somewhere between his and the current one, and I'm
> fishing for any decent way of coming up with a weighted average that
> is closer to his than to the current, but not simply equal to his proposal.
>
> The reason I want the formula to be closer to Tomas's than to the
> current is that I think that on average, across all tables, across all
> databases, in practice it will be closer to the right estimate than the
> current formula. That's just my intuition, and so I can't defend it.
> But if my intuition is right, the best formula we can adopt would be one
> that is moderated from his by a little bit, and in the direction of the
> estimate that the current code generates.

I think looking merely at what fraction of data sets (or queries) uses dependent GROUP BY and WHERE clauses is not sufficient.

The general behavior of the two GROUP BY estimators is that the current one tends to under-estimate, while the new one tends to over-estimate. (It's not difficult to construct counter-examples by using more complex dependencies between the columns etc., but let's ignore that.)

The risk associated with over-estimation is that we may switch from HashAggregate to GroupAggregate, and generally select paths better suited for a large number of rows. Not great, because the plan may not be optimal and thus run much slower - but that usually only happens on small tables, because on large tables we would probably end up using those same paths anyway.

With under-estimation, the main risks are much graver, as we may end up using HashAggregate only to get killed by OOM. When we're lucky not to hit that, in my experience this often leads to a cascade of NestedLoop nodes processing orders of magnitude more tuples than expected. Awful.

So I think the risk is asymmetric, and that's why I like the new estimator more, because it tends to over-estimate, but in a very reasonable way.

> I could easily lose this debate merely for lack of a principled basis
> for saying how far toward the current estimate the new estimate should
> be adjusted. The geometric average is one suggestion, but I don't have
> a principled argument for it.
>
> Like I said above, I'm fishing for a decent formula here.

Thanks for the valuable feedback!

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
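For anyone who wants to reproduce the comparison without running the queries, here is a small standalone sketch (illustrative only; the input values are rounded approximations of the first column of the tables above, and the names are made up):

#include <math.h>
#include <stdio.h>

/* "ratio error": MAX(actual/estimate, estimate/actual) */
static double
ratio_error(double actual, double estimate)
{
	return (actual > estimate) ? actual / estimate : estimate / actual;
}

int
main(void)
{
	double	tuples = 1000000;	/* rows in table t */
	double	ndistinct = 10000;	/* distinct values in column a */
	double	rows = 1000;		/* roughly the rows matching "b < 10" */
	double	actual = 919;		/* observed groups (independent case) */

	double	cur = ndistinct * rows / tuples;
	double	new_est = ndistinct * (1.0 - powl((ndistinct - 1) / ndistinct, rows));
	double	avg = (cur + new_est) / 2.0;
	double	geo = sqrt(cur * new_est);

	printf("current %.0f, new %.0f, average %.0f, geom. mean %.0f\n",
		   cur, new_est, avg, geo);
	printf("ratio errors: %.2f, %.2f, %.2f, %.2f\n",
		   ratio_error(actual, cur), ratio_error(actual, new_est),
		   ratio_error(actual, avg), ratio_error(actual, geo));

	return 0;
}

The output comes out roughly like the first column above (current ~10, new ~950, average ~480, geom. mean ~100), and the ratio errors show the same asymmetric pattern.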
On 4 March 2016 at 13:10, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> The risk associated with over-estimation is that we may switch from
> HashAggregate to GroupAggregate, and generally select paths better
> suited for a large number of rows. Not great, because the plan may not be
> optimal and thus run much slower - but that usually only happens on
> small tables, because on large tables we would probably end up using
> those same paths anyway.
>
> With under-estimation, the main risks are much graver, as we may end up
> using HashAggregate only to get killed by OOM. When we're lucky not to
> hit that, in my experience this often leads to a cascade of NestedLoop
> nodes processing orders of magnitude more tuples than expected. Awful.
>
> So I think the risk is asymmetric, and that's why I like the new
> estimator more, because it tends to over-estimate, but in a very
> reasonable way.

Agreed. Under-estimating is worse than over-estimating.

-    reldistinct *= rel->rows / rel->tuples;
+    reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows));

Looking at this, I agree that this new formula from [1] is generally better than the old one in most (but not all) cases.

One problem with it is that it no longer takes into account rel->tuples, which means that when returning all the tuples from the table, it no longer gives an estimate equal to the total number of distinct values in the table. In fact it tends to underestimate when returning nearly all the rows from the table.

The old formula is clearly not a good general-purpose estimate. However, there is one case where it does return an accurate estimate -- when all the values in the underlying table are distinct. In this case the new formula consistently produces underestimates, while the old formula works well. For example:

CREATE TABLE t AS SELECT (100000000 * random())::int a,
                         i::int b
                    FROM generate_series(1,1000000) s(i);
ANALYSE t;

EXPLAIN ANALYSE
SELECT a FROM t GROUP BY a;

So there are clearly around 1 million distinct values for "a", but the new formula gives an estimate of around 630,000. That's not a massive underestimate, but it's sufficiently obvious and visible that it would be good to do better if we can.

I think that a better formula to use would be

reldistinct *= (1 - powl(1 - rel->rows / rel->tuples, rel->tuples / reldistinct));

This comes from [2], which gives a general formula for selection without replacement, and then gives the above as an approximation to the uniform distribution case. It's not really made clear in that paper, but that approximation is actually the "with replacement" approximation, but starting from different initial assumptions to give a formula that has better boundary conditions and special-case behaviour, as described below.

For the record, here's a quick analysis of how the 2 formulae come about:

Assume there are:
  N rows in the table
  n distinct values (n <= N)
  p rows are chosen at random (p <= N)

so the problem is to estimate how many distinct values there will be in the p rows chosen. Both approaches work by first estimating the probability that some particular value x is *not* chosen.

[1] starts from the assumption that each row of the table has a probability of 1/n of being equal to x. So the probability that x is not the first value chosen is

  P(not x, 1) = 1 - 1/n

Then, if the value chosen first is replaced, the probability that x is not the second value chosen is the same. The "with replacement" approximation makes each choice an independent probability, so the overall probability that x is not in any of the p rows chosen is just the product of the p individual probabilities, which is just

  P(not x, p) = (1 - 1/n)^p

Then of course the probability that x *is* chosen at least once in the p rows is

  P(x, p) = 1 - (1 - 1/n)^p

Finally, since there are n possible values of x, and they're all equally likely in the table, the expected number of distinct values is

  D(p) = n * (1 - (1 - 1/n)^p)

The flaw in this approach is that for large p the "with replacement" approximation gets worse and worse, and the probabilities P(x, p) systematically under-estimate the actual probability that x is chosen, which increases as more values not equal to x are chosen. By the time the last value is chosen, P(x, p=N) should actually be 1.

[2] starts from a different initial assumption (the uniform distribution case) -- for any given value x, assume that the table contains N/n instances of x (ignoring the fact that that might not be an integer). Note that with this assumption there is guaranteed to be at least one instance of x, which is not the case with the above approach.

Now consider the first instance of x in the table. If we choose p rows from the table, the probability that that first instance of x is not chosen is

  P(not x[1], p) = 1 - p / N

Making the same "with replacement" simplification, the probability that the second instance of x is not chosen is the same, and they're independent probabilities again. So, if there are N/n instances of x, the overall probability that none of them is chosen is

  P(not x, p) = (1 - p / N)^(N/n)

So then the formula for the expected number of distinct values becomes

  D(p) = n * (1 - (1 - p / N)^(N/n))

This alternate formula has a few nice properties:

1). D(p=0) = 0

2). D(p=N) = n (choosing all the rows results in all the distinct values being chosen)

3). When n=N, D(p) = p (when all the values in the table are distinct, so are all the values chosen). This case matches the current formula.

The first formula for D(p) lacks properties (2) and (3).

The reason this second formula generally works better is that the "with replacement" approximation is only being made for up to (N/n) selections, rather than up to N selections. So it remains a good approximation as long as (N/n) is small compared to N, i.e., as long as n is fairly large. In fact even for n=1 or 2, it still works adequately if the results are rounded to the nearest integer.

The following snippet can be used in gnuplot to visualise the results for various values of n and N. I've included the exact "without replacement" formula too, by rewriting it in terms of the gamma function:

N=1000000.0; n=500000.0; plot [p=0:N] [0:n] p*n/N, n*(1-(1-1/n)**p), n*(1-(1-p/N)**(N/n)), n*(1 - exp(lgamma(N-N/n+1) + lgamma(N-p+1) - lgamma(N+1) - lgamma(N-N/n-p+1)))

For most values of N and n, the approximation from [2] is almost indistinguishable from the exact formula, whereas the formula from [1] tends to underestimate when selecting a large number of distinct values (e.g., try setting n=900000 in the plot above).

Regards,
Dean

[1] http://math.stackexchange.com/questions/72223/finding-expected-number-of-distinct-values-selected-from-a-set-of-integers
[2] http://www.adellera.it/investigations/distinct_balls/distinct_balls.pdf
Hi,

On Sun, 2016-03-13 at 15:24 +0000, Dean Rasheed wrote:
> On 4 March 2016 at 13:10, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> So I think the risk is asymmetric, and that's why I like the new
>> estimator more, because it tends to over-estimate, but in a very
>> reasonable way.
>
> Agreed. Under-estimating is worse than over-estimating.
>
> Looking at this, I agree that this new formula from [1] is generally
> better than the old one in most (but not all) cases.
>
> One problem with it is that it no longer takes into account
> rel->tuples, which means that when returning all the tuples from the
> table, it no longer gives an estimate equal to the total number of
> distinct values in the table. In fact it tends to underestimate when
> returning nearly all the rows from the table.

Yes, that's a good point.

> The old formula is clearly not a good general-purpose estimate.
> However, there is one case where it does return an accurate estimate
> -- when all the values in the underlying table are distinct. In this
> case the new formula consistently produces underestimates, while the
> old formula works well.
>
> <snip>
>
> I think that a better formula to use would be
>
> reldistinct *= (1 - powl(1 - rel->rows / rel->tuples, rel->tuples / reldistinct));
>
> This comes from [2], which gives a general formula for selection
> without replacement, and then gives the above as an approximation to
> the uniform distribution case.
>
> <snip>
>
> For most values of N and n, the approximation from [2] is almost
> indistinguishable from the exact formula, whereas the formula from
> [1] tends to underestimate when selecting a large number of distinct
> values (e.g., try setting n=900000 in the plot above).

Yes, I agree that the formula you propose seems to behave better.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 03/13/2016 11:09 PM, Tomas Vondra wrote:
> Hi,
>
> On Sun, 2016-03-13 at 15:24 +0000, Dean Rasheed wrote:
>> On 4 March 2016 at 13:10, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>> ...
>>
>> I think that a better formula to use would be
>>
>> reldistinct *= (1 - powl(1 - rel->rows / rel->tuples, rel->tuples / reldistinct));

Attached is a v3 of the patch using this formula instead of the original one. Interestingly, that apparently reduces the number of regression tests that get broken to a single one.

I'm not sure whether we need to provide a link to the PDF the formula comes from - perhaps we should?

I've also repeated the tests for the two tables (dependent and independent columns), comparing the actual number of groups and the different estimates, and the results look like this (v3 is the formula used in this patch):

1) independent

         |   10 |   50 |  100 |  500 |  1000 |  5000
---------------------------------------------------------
actual   |  919 | 3829 | 6244 | 9944 | 10001 | 10001
current  |   10 |   50 |  102 |  516 |  1018 |  4996
new (v1) |  973 | 4001 | 6382 | 9897 |  9951 |  9951
new (v3) | 1117 | 3852 | 6229 | 9943 | 10004 | 10004

2) dependent

         |   10 |   50 |  100 |  500 |  1000 |  5000
--------------------------------------------------------
actual   |   10 |   50 |  100 |  500 |  1000 |  5000
current  |   10 |   53 |  105 |  508 |  1016 |  5014
new (v1) |  880 | 4105 | 6472 | 9955 | 10018 | 10018
new (v3) |  807 | 3680 | 6050 | 9916 |  9983 |  9983

I only collected numbers for the new estimator, the other numbers are just a copy from the previous message. So there might be minor differences due to slightly different ndistinct estimates etc.

Anyway, the numbers are obviously quite close to the formula from v1 of the patch, plus the formula gives better estimates when scanning nearly all rows.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
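The v3 patch itself is in the attachment, but based on the formula quoted above, the core change in estimate_num_groups() presumably boils down to something like this (a sketch, not the actual diff):

-    reldistinct *= rel->rows / rel->tuples;
+    reldistinct *= (1 - powl(1 - rel->rows / rel->tuples, rel->tuples / reldistinct));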
Attachment
On 18 March 2016 at 00:37, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> On Sun, 2016-03-13 at 15:24 +0000, Dean Rasheed wrote:
>>> I think that a better formula to use would be
>>>
>>> reldistinct *= (1 - powl(1 - rel->rows / rel->tuples, rel->tuples / reldistinct));
>
> Attached is a v3 of the patch using this formula instead of the original
> one. Interestingly, that apparently reduces the number of regression tests
> that get broken to a single one.

Cool.

> I'm not sure whether we need to provide a link to the PDF the formula comes
> from - perhaps we should?

Probably a better URL to give is http://www.adellera.it/investigations/distinct_balls/ which has a link to the PDF version of the paper and also some supporting material.

However, while that paper is in general very clear, I don't think it gives a very clear explanation of that particular formula, and it doesn't explain what it represents. It merely says that if "i" can be ignored "for some reason (e.g. i << Nr)", then that formula is an approximation to the exact "without replacement" formula, which is the subject of that paper.

But actually, that formula is the exact formula for the expected number of distinct values when selecting *with replacement* from a collection where each of the distinct values occurs an equal number of times. So I think we should say that.

Perhaps something along the lines of:

    /*
     * Update the estimate based on the restriction selectivity.
     *
     * This uses the formula for the expected number of distinct values
     * when selecting with replacement from a collection where each of
     * the distinct values occurs an equal number of times (a uniform
     * distribution of values). This is a very close approximation to
     * the more correct method of selecting without replacement when
     * the number of distinct values is quite large --- for example,
     * see http://www.adellera.it/investigations/distinct_balls/.
     * Additionally, it also turns out to work very well even when the
     * number of distinct values is small.
     */

Regards,
Dean
Hi, Dean!
On Fri, Mar 18, 2016 at 1:20 PM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
> Probably a better URL to give is
> http://www.adellera.it/investigations/distinct_balls/ which has a link
> to the PDF version of the paper and also some supporting material.
>
> However, while that paper is in general very clear, I don't think it
> gives a very clear explanation of that particular formula, and it
> doesn't explain what it represents. It merely says that if "i" can be
> ignored "for some reason (e.g. i << Nr)", then that formula is an
> approximation to the exact "without replacement" formula, which is the
> subject of that paper.
>
> But actually, that formula is the exact formula for the expected
> number of distinct values when selecting *with replacement* from a
> collection where each of the distinct values occurs an equal number of
> times. So I think we should say that.
>
> Perhaps something along the lines of:
>
> /*
>  * Update the estimate based on the restriction selectivity.
>  *
>  * This uses the formula for the expected number of distinct values
>  * when selecting with replacement from a collection where each of
>  * the distinct values occurs an equal number of times (a uniform
>  * distribution of values). This is a very close approximation to
>  * the more correct method of selecting without replacement when
>  * the number of distinct values is quite large --- for example,
>  * see http://www.adellera.it/investigations/distinct_balls/.
>  * Additionally, it also turns out to work very well even when the
>  * number of distinct values is small.
>  */
+1

Thank you for your work on this patch. The formula you propose and the explanation look great!

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi, Tomas!
On Mon, Mar 21, 2016 at 2:39 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
> +1
>
> Thank you for your work on this patch. The formula you propose and the
> explanation look great!
I think you should send a revision of the patch including the comments proposed by Dean Rasheed.
I'm switching the patch status to Waiting on Author in the commitfest.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi Tomas,

On 3/22/16 10:40 AM, Alexander Korotkov wrote:
> I think you should send a revision of the patch including the comments
> proposed by Dean Rasheed.
> I'm switching the patch status to Waiting on Author in the commitfest.

We're getting pretty close to the end of the CF. Do you know when you will have a new patch ready?

Thanks,
--
-David
david@pgmasters.net
Hi,

On 03/22/2016 03:40 PM, Alexander Korotkov wrote:
> I think you should send a revision of the patch including the comments
> proposed by Dean Rasheed.
> I'm switching the patch status to Waiting on Author in the commitfest.

Attached is v4 of the patch - the only difference w.r.t. v3 is that I've used the comment as proposed by Dean.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 30 March 2016 at 14:03, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> Attached is v4 of the patch

Thanks, I think this is good to go, except that I think we need to use pow() rather than powl(), because AIUI powl() is new to C99 and so won't necessarily be available on all supported platforms. I don't think we need worry about loss of precision, since that would only be an issue if rel->rows / rel->tuples were smaller than maybe 10^-14 or so, and it seems unlikely we'll get anywhere near that any time soon.

I think this is a good, well thought-out change, so unless anyone objects I'll push it (probably this weekend).

Regards,
Dean
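A tiny standalone check of the precision point (a sketch, not planner code): for inputs in the range the planner actually sees, pow() and powl() agree to far more digits than the estimates feeding into them are good for; the numbers below are the ones used earlier in the thread:

#include <math.h>
#include <stdio.h>

int
main(void)
{
    double      tuples = 10000000.0;
    double      rows = 92216.0;
    double      reldistinct = 98882.0;

    double      d = 1 - pow(1 - rows / tuples, tuples / reldistinct);
    long double l = 1 - powl(1 - (long double) rows / (long double) tuples,
                             (long double) tuples / (long double) reldistinct);

    printf("pow  : %.17g\n", d);
    printf("powl : %.17Lg\n", l);
    printf("diff : %.3Lg\n", (long double) d - l);

    return 0;
}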
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> On 30 March 2016 at 14:03, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> Attached is v4 of the patch

> Thanks, I think this is good to go, except that I think we need to use
> pow() rather than powl() because AIUI powl() is new to C99, and so
> won't necessarily be available on all supported platforms. I don't
> think we need worry about loss of precision, since that would only be
> an issue if rel->rows / rel->tuples were smaller than maybe 10^-14 or
> so, and it seems unlikely we'll get anywhere near that any time soon.

I took a quick look. I concur with using pow() not powl(); the latter is not in SUS v2, which is our baseline portability expectation, and in fact there is *noplace* where we expect long double to work. Moreover, I don't believe that any of the estimates we're working with are so accurate that a double-width power result would be a useful improvement.

Also, I wonder if it'd be a good idea to provide a guard against division by zero --- we know rel->tuples > 0 at this point, but I'm less sure that reldistinct can't be zero. In the same vein, I'm worried about the first argument of pow() being slightly negative due to roundoff error, leading to a NaN result.

Maybe we should also consider clamping the final reldistinct estimate to an integer with clamp_row_est(). The existing code doesn't do that, but it seems like a good idea on general principles.

Another minor gripe is the use of a random URL as justification. This code will still be around when that URL exists nowhere but the Wayback Machine. Can't we find a more formal citation to use?

regards, tom lane
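To illustrate the NaN worry concretely (a standalone sketch, not planner code, with hypothetical values): if roundoff ever pushed 1 - rel->rows / rel->tuples slightly below zero, pow() with the non-integer exponent used here would return NaN:

#include <math.h>
#include <stdio.h>

int
main(void)
{
    double  base = -1e-17;      /* what roundoff could conceivably make of 1 - rows/tuples */
    double  exponent = 101.13;  /* a typical tuples / reldistinct value */

    /* negative finite base with a non-integer exponent is a domain error: NaN */
    printf("%g\n", pow(base, exponent));

    return 0;
}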
Tom Lane wrote:
> Another minor gripe is the use of a random URL as justification. This
> code will still be around when that URL exists nowhere but the Wayback
> Machine. Can't we find a more formal citation to use?

The article text refers to this 1977 S. B. Yao paper, "Approximating block accesses in database organizations", which doesn't appear to be available online except behind ACM's paywall at http://dl.acm.org/citation.cfm?id=359475

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> Another minor gripe is the use of a random URL as justification. This
>> code will still be around when that URL exists nowhere but the Wayback
>> Machine. Can't we find a more formal citation to use?

> The article text refers to this 1977 S. B. Yao paper, "Approximating
> block accesses in database organizations", which doesn't appear to be
> available online except behind ACM's paywall at
> http://dl.acm.org/citation.cfm?id=359475

Well, a CACM citation is perfectly fine by my lights (especially one that's that far back and therefore certainly patent-free ...)

Let's use something like this:

See "Approximating block accesses in database organizations", S. B. Yao,
Communications of the ACM, Volume 20 Issue 4, April 1977, Pages 260-261

regards, tom lane
On 31 March 2016 at 20:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Also, I wonder if it'd be a good idea to provide a guard against division
> by zero --- we know rel->tuples > 0 at this point, but I'm less sure that
> reldistinct can't be zero. In the same vein, I'm worried about the first
> argument of pow() being slightly negative due to roundoff error, leading
> to a NaN result.

Yeah, that makes sense. In fact, if we only apply the adjustment when reldistinct > 0 and rel->rows < rel->tuples, and rewrite the first argument to pow() as (rel->tuples - rel->rows) / rel->tuples, then it is guaranteed to be non-negative. If rel->rows >= rel->tuples (not sure if it can be greater), then we just want the original reldistinct.

> Maybe we should also consider clamping the final reldistinct estimate to
> an integer with clamp_row_est(). The existing code doesn't do that but
> it seems like a good idea on general principles.

OK, that seems sensible.

Regards,
Dean
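A standalone sketch of what that guarded update might look like (plain doubles stand in for the RelOptInfo fields, and the planner's clamp_row_est() is imitated by a local helper with the same intent; this is not the final committed code):

#include <math.h>
#include <stdio.h>

/* stand-in for clamp_row_est(): make it an integer and never go below 1 */
static double
clamp_row_est_sketch(double nrows)
{
    return (nrows <= 1.0) ? 1.0 : rint(nrows);
}

static double
adjust_reldistinct(double reldistinct, double rows, double tuples)
{
    if (reldistinct > 0 && rows < tuples)
    {
        /*
         * (tuples - rows) / tuples is strictly positive here, so pow()
         * never sees a negative base, and reldistinct > 0 keeps the
         * exponent finite.
         */
        reldistinct *=
            (1 - pow((tuples - rows) / tuples, tuples / reldistinct));
    }

    return clamp_row_est_sketch(reldistinct);
}

int
main(void)
{
    /* the 1% example from upthread: expect an estimate around 60000 */
    printf("%.0f\n", adjust_reldistinct(98882.0, 92216.0, 10000000.0));
    return 0;
}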
On 31 March 2016 at 21:40, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> Tom Lane wrote:
>>> Another minor gripe is the use of a random URL as justification. This
>>> code will still be around when that URL exists nowhere but the Wayback
>>> Machine. Can't we find a more formal citation to use?
>
>> The article text refers to this 1977 S. B. Yao paper, "Approximating
>> block accesses in database organizations", which doesn't appear to be
>> available online except behind ACM's paywall at
>> http://dl.acm.org/citation.cfm?id=359475
>
> Well, a CACM citation is perfectly fine by my lights (especially one
> that's that far back and therefore certainly patent-free ...)
>
> Let's use something like this:
>
> See "Approximating block accesses in database organizations", S. B. Yao,
> Communications of the ACM, Volume 20 Issue 4, April 1977, Pages 260-261

Sounds good.

Regards,
Dean
Tom Lane wrote:
>> The article text refers to this 1977 S. B. Yao paper, "Approximating
>> block accesses in database organizations", which doesn't appear to be
>> available online except behind ACM's paywall at
>> http://dl.acm.org/citation.cfm?id=359475
>
> Well, a CACM citation is perfectly fine by my lights (especially one
> that's that far back and therefore certainly patent-free ...)
>
> Let's use something like this:
>
> See "Approximating block accesses in database organizations", S. B. Yao,
> Communications of the ACM, Volume 20 Issue 4, April 1977, Pages 260-261

That sounds nice all right, but I'm not sure it's actually helpful, because the article text is not available anywhere. I doubt most people will spend 15 bucks to buy that paper ... so we don't actually know whether the paper supports the chosen formula :-) unless you have a CACM subscription and can verify it.

I think it's good to have the ACM reference anyway, for posterity, but it'd be good to (additionally) have something that people can read.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> On 31 March 2016 at 21:40, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Let's use something like this:
>> See "Approximating block accesses in database organizations", S. B. Yao,
>> Communications of the ACM, Volume 20 Issue 4, April 1977, Pages 260-261

Actually, having now looked at both the Dellera paper and the Yao one, what the latter shows seems to be equivalent to Dellera's equation (2) (the first one in his section 2.2). But what the code is actually using is the approximation in the second equation in 2.2, and that one is certainly not in Yao, although it's not hard to see how you get from the first to the second if you assume i << Nr.

I think it'd be worth writing out equation (2), attributing it to the Yao paper, and then saying something like "Alberto Dellera points out that if Nd is large, so that all the values of i are much less than Nr, then this formula can be approximated by [the formula used in the code]. It turns out that that formula also works well even when Nd is small."

regards, tom lane
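Spelling that step out (keeping the Nd/Nr notation, with $n$ selected rows and each of the $N_d$ distinct values occurring $N_r/N_d$ times; this is a sketch of the argument, not text from either paper), the exact expectation is

$$ D(n) \;=\; N_d \left( 1 - \prod_{i=1}^{n} \frac{N_r - N_r/N_d - i + 1}{N_r - i + 1} \right) $$

Writing the product with factorials and regrouping gives the equivalent form

$$ \prod_{i=1}^{n} \frac{N_r - N_r/N_d - i + 1}{N_r - i + 1} \;=\; \frac{(N_r - N_r/N_d)! \, (N_r - n)!}{(N_r - N_r/N_d - n)! \, N_r!} \;=\; \prod_{i=1}^{N_r/N_d} \frac{N_r - n - i + 1}{N_r - i + 1} $$

and if every $i$ in the right-hand product is negligible next to $N_r$ and $N_r - n$ (which holds when $N_d$ is large, since $i$ only runs up to $N_r/N_d$), each factor is approximately $(N_r - n)/N_r$, so

$$ D(n) \;\approx\; N_d \left( 1 - \left( \frac{N_r - n}{N_r} \right)^{N_r/N_d} \right) $$

which is the form used in the patch.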
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> Yeah, that makes sense. In fact, if we only apply the adjustment when
> reldistinct > 0 and rel->rows < rel->tuples, and rewrite the first
> argument to pow() as (rel->tuples - rel->rows) / rel->tuples, then it
> is guaranteed to be non-negative. If rel->rows >= rel->tuples (not
> sure if it can be greater), then we just want the original
> reldistinct.

Works for me.

regards, tom lane
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> See "Approximating block accesses in database organizations", S. B. Yao,
>> Communications of the ACM, Volume 20 Issue 4, April 1977, Pages 260-261

> That sounds nice all right, but I'm not sure it's actually helpful,
> because the article text is not available anywhere. I doubt most people
> will spend 15 bucks to buy that paper ... so we don't actually know
> whether the paper supports the chosen formula :-) unless you have a
> CACM subscription and can verify it.

Well, I do and I did, see my previous response.

> I think it's good to have the ACM reference anyway, for posterity, but
> it'd be good to (additionally) have something that people can read.

I'm just concerned about what happens when the Dellera paper stops being available. I don't mind including that URL as a backup to the written-out argument I just suggested.

regards, tom lane
On 31 March 2016 at 22:02, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm just concerned about what happens when the Dellera paper stops being
> available. I don't mind including that URL as a backup to the written-out
> argument I just suggested.

Here's an updated patch with references to both papers, and a more detailed justification for the formula, along with the other changes discussed. Note that although equation (2) in the Dell'Era paper looks different from the Yao formula, it's actually the same.

Regards,
Dean
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> Here's an updated patch with references to both papers, and a more
> detailed justification for the formula, along with the other changes
> discussed. Note that although equation (2) in the Dell'Era paper looks
> different from the Yao formula, it's actually the same.

Looks good to me.

regards, tom lane