Thread: improving GROUP BY estimation
Hi,

while working on using correlation statistics when estimating GROUP BY (or generally whatever uses estimate_num_groups), I've noticed that we apply the selectivity in a rather naive way. After estimating the number of groups in the whole table - either by reading pg_stats.n_distinct for a single column, or by multiplying (and doing some additional magic) for a group of columns - we do this:

    reldistinct *= rel->rows / rel->tuples;

which essentially applies the selectivity to the ndistinct estimate. So when you have a table with 1000 distinct values in a column and we read 10% of the table, we expect to get 100 distinct values.

But consider for example a table with 100M rows, where the column has n_distinct=10000 and we're reading 1% of the table (i.e. 1M rows). The current code will estimate the GROUP BY cardinality to be 100 (1% of n_distinct), but in reality we're more likely to see all 10000 distinct values.

Example:

CREATE TABLE t (a INT, b FLOAT);
INSERT INTO t SELECT (100000 * random())::int, random()
  FROM generate_series(1,10000000) s(i);
ANALYZE t;

SELECT n_distinct FROM pg_stats WHERE attname = 'a';

 n_distinct
------------
      98882

-- 1% of the table
EXPLAIN ANALYZE SELECT a FROM t WHERE b < 0.01 GROUP BY a;

                             QUERY PLAN
-------------------------------------------------------------------------
 HashAggregate  (cost=179514.33..179523.45 rows=912 width=4) (actual time=916.224..955.136 rows=63492 loops=1)
   Group Key: a
   ->  Seq Scan on t  (cost=0.00..179053.25 rows=92216 width=4) (actual time=0.103..830.104 rows=100268 loops=1)
         Filter: (b < '0.01'::double precision)
         Rows Removed by Filter: 9899732
 Planning time: 1.237 ms
 Execution time: 982.600 ms
(7 rows)

-- 5% of the table
EXPLAIN ANALYZE SELECT a FROM t WHERE b < 0.05 GROUP BY a;

                             QUERY PLAN
-------------------------------------------------------------------------
 HashAggregate  (cost=180296.06..180345.22 rows=4916 width=4) (actual time=1379.519..1440.005 rows=99348 loops=1)
   Group Key: a
   ->  Seq Scan on t  (cost=0.00..179053.25 rows=497123 width=4) (actual time=0.096..1000.494 rows=500320 loops=1)
         Filter: (b < '0.05'::double precision)
         Rows Removed by Filter: 9499680
 Planning time: 0.129 ms
 Execution time: 1480.986 ms
(7 rows)

This gets especially bad when reading only a small fraction of a huge table - it's easy to construct cases with arbitrarily large error. And it's worth noting that the error is almost exclusively a massive under-estimate, i.e. in the "wrong" direction, as HashAggregate is still vulnerable to OOM (and the larger the table, the worse it gets).

So I think a more appropriate way to estimate this would be to find the expected number of distinct values when sampling with replacement, as explained for example in [1].
Applied to the code, it means changing the line from

    reldistinct *= rel->rows / rel->tuples;

to

    reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows));

With this change, the estimates for the examples look like this:

                             QUERY PLAN
-------------------------------------------------------------------------
 HashAggregate  (cost=179283.79..179883.48 rows=59969 width=4) (actual time=897.071..934.764 rows=63492 loops=1)
   Group Key: a
   ->  Seq Scan on t  (cost=0.00..179053.25 rows=92216 width=4) (actual time=0.104..820.276 rows=100268 loops=1)
         Filter: (b < '0.01'::double precision)
         Rows Removed by Filter: 9899732
 Planning time: 1.261 ms
 Execution time: 962.812 ms
(7 rows)

and

                             QUERY PLAN
-------------------------------------------------------------------------
 Group  (cost=232886.15..235371.76 rows=98234 width=4) (actual time=1519.545..2104.447 rows=99348 loops=1)
   Group Key: a
   ->  Sort  (cost=232886.15..234128.95 rows=497123 width=4) (actual time=1519.540..1838.575 rows=500320 loops=1)
         Sort Key: a
         Sort Method: external merge  Disk: 6824kB
         ->  Seq Scan on t  (cost=0.00..179053.25 rows=497123 width=4) (actual time=0.090..1044.099 rows=500320 loops=1)
               Filter: (b < '0.05'::double precision)
               Rows Removed by Filter: 9499680
 Planning time: 0.133 ms
 Execution time: 2147.340 ms
(10 rows)

So much better. Clearly, there are cases where this will over-estimate the cardinality - for example when the values are somehow correlated. But I'd argue over-estimating is better because of the OOM issues in Hash Aggregate.

regards

[1] http://math.stackexchange.com/questions/72223/finding-expected-number-of-distinct-values-selected-from-a-set-of-integers

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
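To make the difference concrete, here is a small standalone sketch (illustrative only - the helper names are made up and this is not part of the patch) that evaluates both estimators for roughly the numbers in the 1% example above:

#include <math.h>
#include <stdio.h>

/* current approach: scale ndistinct by the restriction selectivity */
static double
groups_current(double ndistinct, double rows, double tuples)
{
	return ndistinct * (rows / tuples);
}

/* proposed approach: expected number of distinct values when drawing
 * "rows" values with replacement from "ndistinct" equally likely values */
static double
groups_proposed(double ndistinct, double rows)
{
	return ndistinct * (1.0 - powl((ndistinct - 1) / ndistinct, rows));
}

int
main(void)
{
	double	tuples = 10000000;	/* rows in table t */
	double	ndistinct = 100000;	/* distinct values in column a */
	double	rows = 100000;		/* rows surviving "b < 0.01" (~1%) */

	printf("current:  %.0f\n", groups_current(ndistinct, rows, tuples));	/* ~1000 */
	printf("proposed: %.0f\n", groups_proposed(ndistinct, rows));		/* ~63212 */

	return 0;
}

The proposed estimate is much closer to the ~63k groups the HashAggregate actually produced.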
Attachment
> On Feb 23, 2016, at 5:12 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> <snip>
>
> So much better. Clearly, there are cases where this will over-estimate the cardinality - for example when the values are somehow correlated.
>

I applied your patch, which caused a few regression tests to fail. Attached is a patch that includes the necessary changes to the expected test results.

It is not hard to write test cases where your patched version overestimates the number of rows by a very similar factor as the old code underestimates them. My very first test, which was not specifically designed to demonstrate this, happens to be one such example:

CREATE TABLE t (a INT, b int);
INSERT INTO t SELECT sqrt(gs)::int, gs FROM generate_series(1,10000000) gs;
ANALYZE t;
EXPLAIN SELECT a FROM t WHERE b < 1000 GROUP BY a;
                           QUERY PLAN
---------------------------------------------------------------
 HashAggregate  (cost=169250.21..169258.71 rows=850 width=4)
   Group Key: a
   ->  Seq Scan on t  (cost=0.00..169247.71 rows=1000 width=4)
         Filter: (b < 1000)
(4 rows)

SELECT COUNT(*) FROM (SELECT a FROM t WHERE b < 1000 GROUP BY a) AS ss;
 count
-------
    32
(1 row)

So, it estimates 850 rows where only 32 are returned. Without applying your patch, it estimates just 1 row where 32 are returned. That's an overestimate of roughly 26 times, rather than an underestimate of 32 times.

As a patch review, I'd say that your patch does what you claim it does, and it applies cleanly, and passes the regression tests with my included modifications. I think there needs to be some discussion on the list about whether the patch is a good idea.

Mark Dilger
Attachment
> On Feb 25, 2016, at 3:16 PM, Mark Dilger <hornschnorter@gmail.com> wrote:
>
> <snip>
>
> As a patch review, I'd say that your patch does what you claim it does, and it applies
> cleanly, and passes the regression tests with my included modifications. I think there
> needs to be some discussion on the list about whether the patch is a good idea.
>
> Mark Dilger
>
> <estimate-num-groups-v2.txt>

Assuming the goal is to minimize the degree to which the estimates are inaccurate, I get better results by a kind of averaging of the old values from the current code base and the new values from Tomas's patch. I tried this, and at least for the few examples we've been throwing around, I found the results were not as wildly inaccurate in the worst case examples as in either of the other two approaches:

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 46c95b0..c83135a 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3414,6 +3414,8 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 			Assert(rel->reloptkind == RELOPT_BASEREL);
 			if (rel->tuples > 0)
 			{
+				double		old_factor;		/* The way it is currently done on master */
+				double		new_factor;		/* The way Tomas Vondra proposes doing it */
 				/*
 				 * Clamp to size of rel, or size of rel / 10 if multiple Vars. The
 				 * fudge factor is because the Vars are probably correlated but we
@@ -3440,7 +3442,10 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 				/*
 				 * Multiply by restriction selectivity.
 				 */
-				reldistinct *= rel->rows / rel->tuples;
+				old_factor = rel->rows / rel->tuples;	/* old way */
+				new_factor = (1 - powl((reldistinct - 1) / reldistinct, rel->rows));	/* Tomas's way */
+
+				reldistinct *= sqrt(old_factor * new_factor);	/* average of old way and new way, sort of */

 				/*
 				 * Update estimate of total distinct groups.
Hi,

On 02/26/2016 12:16 AM, Mark Dilger wrote:
>
> It is not hard to write test cases where your patched version overestimates
> the number of rows by a very similar factor as the old code underestimates
> them. My very first test, which was not specifically designed to demonstrate
> this, happens to be one such example:
>
> <snip>
>
> So, it estimates 850 rows where only 32 are returned. Without
> applying your patch, it estimates just 1 row where 32 are returned.
> That's an overestimate of roughly 26 times, rather than an
> underestimate of 32 times.

The culprit here is that the two columns are not independent, but are (rather strongly) correlated due to the way you've generated the data.

In cases like this (with correlated columns), it's mostly just luck how accurate estimates you get, no matter which of the estimators you use. It's pretty easy to generate arbitrary errors by breaking the independence assumption, and it's not just this particular place of the planner. And it won't change until we add some sort of smartness about dependencies between columns.

I think over-estimates are a bit more acceptable in this case, e.g. because of the HashAggregate OOM risk. Also, I've seen too many nested loop cascades due to under-estimates recently, but that's a bit unrelated.

> As a patch review, I'd say that your patch does what you claim it
> does, and it applies cleanly, and passes the regression tests with
> my included modifications. I think there needs to be some discussion
> on the list about whether the patch is a good idea.

Yes, that's why I haven't bothered with fixing the regression tests in the patch, actually.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> On Feb 25, 2016, at 4:59 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> Hi,
>
> On 02/26/2016 12:16 AM, Mark Dilger wrote:
>>
>> It is not hard to write test cases where your patched version overestimates
>> the number of rows by a very similar factor as the old code underestimates
>> them. My very first test, which was not specifically designed to demonstrate
>> this, happens to be one such example:
>>
>> <snip>
>>
>> So, it estimates 850 rows where only 32 are returned. Without
>> applying your patch, it estimates just 1 row where 32 are returned.
>> That's an overestimate of roughly 26 times, rather than an
>> underestimate of 32 times.
>
> The culprit here is that the two columns are not independent, but are (rather strongly) correlated due to the way you've generated the data.

Yes, that was intentional. Your formula is exactly correct, so far as I can tell, for completely uncorrelated data. I don't have any tables with completely uncorrelated data, and was therefore interested in what happens when the data is correlated and your patch is applied.

BTW, the whole reason I responded to your post is that I think I would like to have your changes in the code base. I'm just playing Devil's Advocate here, to see if there is room for any improvement.

> In cases like this (with correlated columns), it's mostly just luck how accurate estimates you get, no matter which of the estimators you use. It's pretty easy to generate arbitrary errors by breaking the independence assumption, and it's not just this particular place of the planner. And it won't change until we add some sort of smartness about dependencies between columns.
>
> I think over-estimates are a bit more acceptable in this case, e.g. because of the HashAggregate OOM risk. Also, I've seen too many nested loop cascades due to under-estimates recently, but that's a bit unrelated.

I have seen similar problems in systems I maintain, hence my interest in your patch.

>> As a patch review, I'd say that your patch does what you claim it
>> does, and it applies cleanly, and passes the regression tests with
>> my included modifications. I think there needs to be some discussion
>> on the list about whether the patch is a good idea.
>
> Yes, that's why I haven't bothered with fixing the regression tests in the patch, actually.

Right, but for the group as a whole, I thought it might generate some feedback if people saw what else changed in the regression results. If you look, the changes have to do with plans chosen and row estimates -- exactly the sort of stuff you would expect to change.

Thanks for the patch submission. I hope my effort to review it is on the whole more helpful than harmful.

Mark Dilger
Hi,

On 02/26/2016 04:32 AM, Mark Dilger wrote:
>
>> On Feb 25, 2016, at 4:59 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> ...
>>
>> The culprit here is that the two columns are not independent, but
>> are (rather strongly) correlated due to the way you've generated
>> the data.
>
> Yes, that was intentional. Your formula is exactly correct, so far as
> I can tell, for completely uncorrelated data. I don't have any tables
> with completely uncorrelated data, and was therefore interested in
> what happens when the data is correlated and your patch is applied.
>
> BTW, the whole reason I responded to your post is that I think I would like
> to have your changes in the code base. I'm just playing Devil's Advocate
> here, to see if there is room for any improvement.

Sure, that's how I understood it. I just wanted to point out the correlation, as that might not have been obvious to others.

> Thanks for the patch submission. I hope my effort to review it is on
> the whole more helpful than harmful.

Thanks for the feedback!

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, Tomas!
I've assigned to review this patch.
I've checked version estimate-num-groups-v2.txt by Mark Dilger.
It applies to head cleanly, passes corrected regression tests.
About correlated/uncorrelated cases: I think the optimizer currently mostly assumes all distributions to be independent.
I think we should follow this assumption in this case as well, until we have a fundamentally better option (like your multivariate statistics).
@@ -3438,9 +3438,9 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
reldistinct = clamp;
/*
- * Multiply by restriction selectivity.
+ * Estimate number of distinct values expected in given number of rows.
*/
- reldistinct *= rel->rows / rel->tuples;
+ reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows));
/*
* Update estimate of total distinct groups.
I think we need way more explanation in comments here (possibly with external links). For now, it looks like a formula which appears from nowhere.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi,

On 03/03/2016 12:53 PM, Alexander Korotkov wrote:
> Hi, Tomas!
>
> I've assigned to review this patch.
>
> I've checked version estimate-num-groups-v2.txt by Mark Dilger.
> It applies to head cleanly, passes corrected regression tests.
>
> About correlated/uncorrelated cases: I think the optimizer currently mostly assumes
> all distributions to be independent. I think we should follow this
> assumption in this case as well, until we have a fundamentally better option
> (like your multivariate statistics).
>
> @@ -3438,9 +3438,9 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs,
> double input_rows,
>     reldistinct = clamp;
>
>     /*
> -    * Multiply by restriction selectivity.
> +    * Estimate number of distinct values expected in given number of rows.
>     */
> -   reldistinct *= rel->rows / rel->tuples;
> +   reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows));
>
>     /*
>     * Update estimate of total distinct groups.
>
> I think we need way more explanation in comments here (possibly with
> external links). For now, it looks like a formula which appears from nowhere.

I thought the details (particularly the link to stackexchange, discussing the formula) would be enough, but let me elaborate.

The current formula

    reldistinct *= rel->rows / rel->tuples;

simply multiplies the ndistinct estimate with selectivity. So when you select 1% of the table, the estimate says we'll see 1% of ndistinct values. But clearly that's naive, because for example when you have 10k distinct values and you select 10M rows, you should expect nearly all of them in the sample.

And that's what the formula does - it gives you the expected number of distinct values when selecting 'k' values from 'd' distinct values with replacement:

    count(k, d) = d * (1 - ((d-1)/d) ^ k)

It's assuming the distinct values are equally probable (uniform distribution) and that the probabilities do not change (that's the "with replacement" part).

I guess we could relax those assumptions and for example use the MCV statistics to further improve the estimate, and also relax the 'with replacement' assumption, but that'd make the formula way more complicated.

[1] http://math.stackexchange.com/questions/72223/finding-expected-number-of-distinct-values-selected-from-a-set-of-integers

> Also, note that rel->tuples gone away from formula parameters. So, we
> actually don't care about how many tuples are in the table. This is because
> this formula assumes that the same tuple could be selected multiple times. For
> low numbers this can lead to some errors. Consider selecting 2 from 3
> distinct tuples. If you can't select same tuple twice then you always
> selecting 2 distinct tuples. But according to formula you will select 5/3 in
> average. I think this error in small numbers is negligible, especially
> because getting rid of this error would require way more computations. But
> it worth mentioning in comments though.

Well, yeah. That's the consequence of the 'with replacement' assumption.

I guess we could handle this somehow, though. For example we can compute the minimum possible number of distinct values like this:

    average_rows_per_value = (tuples / ndistinct);
    min_estimate = ceil(rows / average_rows_per_value);

and use that as a minimum for the estimate. In the example you've given this would say

    average_rows_per_value = (3 / 3) = 1;
    min_estimate = ceil(2 / 1) = 2;

So it serves as a correction for the 'with replacement' assumption.

With only 2 distinct values we'd get this:

    average_rows_per_value = (3 / 2) = 1.5;
    min_estimate = ceil(2 / 1.5) = 2;

Of course, it's just a heuristic, so it may fail in some other cases.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
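To illustrate how the suggested correction could fit together with the estimator, here is a small standalone sketch (the function name and the clamping are made up; this is not the patch itself):

#include <math.h>

/*
 * Expected number of distinct values among "rows" rows sampled from a
 * table with "tuples" rows and "ndistinct" distinct values, using the
 * "with replacement" formula plus the minimum-based correction above.
 */
static double
estimate_groups(double ndistinct, double rows, double tuples)
{
	double	estimate;
	double	min_estimate;

	/* expected #distinct when drawing "rows" values with replacement */
	estimate = ndistinct * (1.0 - powl((ndistinct - 1) / ndistinct, rows));

	/* each value occupies (tuples / ndistinct) rows on average, so the
	 * selected rows must contain at least this many distinct values */
	min_estimate = ceil(rows / (tuples / ndistinct));

	if (estimate < min_estimate)
		estimate = min_estimate;
	if (estimate > ndistinct)
		estimate = ndistinct;

	return estimate;
}

For the 2-of-3 example above this returns 2 rather than 5/3.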
> On Mar 3, 2016, at 8:35 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> <snip>
>
> The current formula
>
>     reldistinct *= rel->rows / rel->tuples;
>
> simply multiplies the ndistinct estimate with selectivity. So when you
> select 1% of the table, the estimate says we'll see 1% of ndistinct
> values. But clearly that's naive, because for example when you have 10k
> distinct values and you select 10M rows, you should expect nearly all of
> them in the sample.

The current formula assumes total dependence between the columns. You can create tables where this is true, and the formula gives precisely the right estimate. I'm not claiming this is common, or that it should be the default assumption, but it is one end of the spectrum of possible correct estimates.

> And that's what the formula does - it gives you the expected number of
> distinct values when selecting 'k' values from 'd' distinct values with
> replacement:
>
>     count(k, d) = d * (1 - ((d-1)/d) ^ k)
>
> It's assuming the distinct values are equally probable (uniform
> distribution) and that the probabilities do not change (that's the "with
> replacement" part).

The new formula assumes total independence between the columns. You can likewise create tables where this is true, and you did so upthread, and for those tables it gives precisely the right estimate. This is the other end of the spectrum of possible correct estimates.

In the absence of multivariate statistics, either because you haven't committed that work yet, or because the multivariate data the planner needs hasn't been collected, choosing an estimate exactly in the middle of the spectrum (whatever that would mean mathematically) would minimize the maximum possible error in the estimate.

I have the sense that my point of view is in the minority, because the expectation is the problems caused by independence assumptions will only be fixed when multivariate statistics are implemented and available, and so we should just keep the independence assumptions everywhere until that time. I tend to disagree with that, on the grounds that even when that work is finished, we'll never have complete coverage of every possible set of columns and what their degree of dependence is. Perhaps I am badly mistaken. Can somebody more familiar with the issues than I am please disabuse me of my wrongheadedness?

Mark Dilger
On Thu, Mar 3, 2016 at 7:35 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> On 03/03/2016 12:53 PM, Alexander Korotkov wrote:
>> I think we need way more explanation in comments here (possibly with
>> external links). For now, it looks like a formula which appears from nowhere.
>
> I thought the details (particularly the link to stackexchange, discussing the formula) would be enough, but let me elaborate.
>
> <snip>
Your explanation in the first mail was good enough. Now it's even better. But actually I mean that we need to include some brief explanation in the source code (either comments or a README). It would be good if someone who wants to understand this could do so without searching the mailing list archives or git history.
>> Also, note that rel->tuples gone away from formula parameters. So, we
>> actually don't care about how many tuples are in the table. This is because
>> this formula assumes that the same tuple could be selected multiple times.
>>
>> <snip>
>
> Well, yeah. That's the consequence of the 'with replacement' assumption.
>
> I guess we could handle this somehow, though. For example we can compute the minimum possible number of distinct values like this:
>
>     average_rows_per_value = (tuples / ndistinct);
>     min_estimate = ceil(rows / average_rows_per_value);
>
> and use that as a minimum for the estimate.
>
> <snip>
>
> Of course, it's just a heuristic, so it may fail in some other cases.
I'm not sure we actually need it. Even in the worst case the error doesn't seem to be very big. But feel free to add this heuristic; it looks OK to me.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi,

On 03/03/2016 06:40 PM, Mark Dilger wrote:
> The current formula assumes total dependence between the columns. You can
> create tables where this is true, and the formula gives precisely the right
> estimate. I'm not claiming this is common, or that it should be the default
> assumption, but it is one end of the spectrum of possible correct
> estimates.

I'm not really sure what you mean by dependence between columns, as both the old and new formula only work with the total cardinality and selectivity, and have absolutely no idea about multiple columns.

In case you mean dependence between the columns in the GROUP BY and the columns used to compute the selectivity (i.e. referenced in WHERE), then perhaps you could say it like that, although I think there's no such explicit assumption.

> The new formula assumes total independence between the columns. You can
> likewise create tables where this is true, and you did so upthread, and for
> those tables it gives precisely the right estimate. This is the other end of
> the spectrum of possible correct estimates.
>
> In the absence of multivariate statistics, either because you haven't
> committed that work yet, or because the multivariate data the planner needs
> hasn't been collected, choosing an estimate exactly in the middle of the
> spectrum (whatever that would mean mathematically) would minimize the
> maximum possible error in the estimate.

FWIW the current version of multivariate statistics can't really help in this case. Perhaps it will help in the future, but it's way more complicated than it might seem, as it requires transferring information from the WHERE clause to the GROUP BY clause.

> I have the sense that my point of view is in the minority, because the
> expectation is the problems caused by independence assumptions will only be
> fixed when multivariate statistics are implemented and available, and so we
> should just keep the independence assumptions everywhere until that time. I
> tend to disagree with that, on the grounds that even when that work is
> finished, we'll never have complete coverage of every possible set of
> columns and what their degree of dependence is.

Well, in a sense you're right that those estimates work best in different situations. The problem with constructing an estimator the way you propose (simply returning an average) is that in both the extreme cases (perfect dependence or independence) one of those estimates is really bad.

Moreover, the new formula typically produces higher values than the old one. Consider for example this:

CREATE TABLE t AS SELECT (10000 * random())::int a,
                         (10000 * random())::int b
                    FROM generate_series(1,1000000) s(i);

and let's see estimates for queries like this:

SELECT a FROM t WHERE b < $1 GROUP BY a;

Then for different values of $1 we get this:

        |  10 |   50 |  100 |  500 |  1000 |  5000
--------------------------------------------------------
actual  | 919 | 3829 | 6244 | 9944 | 10001 | 10001
current |  10 |   50 |  102 |  516 |  1018 |  4996
new     | 973 | 4001 | 6382 | 9897 |  9951 |  9951
average | 491 | 2025 | 3205 | 5206 |  5484 |  7473

and for a table with perfectly dependent columns:

CREATE TABLE t AS SELECT mod(i,10000) a,
                         mod(i,10000) b
                    FROM generate_series(1,1000000) s(i);

we get this:

        |  10 |   50 |  100 |  500 |  1000 |  5000
--------------------------------------------------------
actual  |  10 |   50 |  100 |  500 |  1000 |  5000
current |  10 |   53 |  105 |  508 |  1016 |  5014
new     | 880 | 4105 | 6472 | 9955 | 10018 | 10018
average | 445 | 2079 | 3288 | 5231 |  5517 |  7516

So yes, each estimator works great for exactly the opposite cases. But notice that typically, the results of the new formula are much higher than the old one, sometimes by two orders of magnitude (and it shouldn't be difficult to construct examples of much higher differences).

The table also includes the 'average' estimator you propose, but it's rather obvious that the result is always much closer to the new value, simply because

    (small number) + (huge number)
    ------------------------------
                  2

is always much closer to the huge number. We're usually quite happy when the estimates are within the same order of magnitude, so whether it's K or K/2 makes pretty much no difference.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Mar 3, 2016 at 10:16 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> So yes, each estimator works great for exactly the opposite cases. But notice that typically, the results of the new formula are much higher than the old one, sometimes by two orders of magnitude (and it shouldn't be difficult to construct examples of much higher differences).
>
> The table also includes the 'average' estimator you propose, but it's rather obvious that the result is always much closer to the new value, simply because
>
>     (small number) + (huge number)
>     ------------------------------
>                   2
>
> is always much closer to the huge number. We're usually quite happy when the estimates are within the same order of magnitude, so whether it's K or K/2 makes pretty much no difference.
I believe that Mark means geometrical average, i.e. sqrt((small number) * (huge number)).
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
> On Mar 3, 2016, at 11:27 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
>
> On Thu, Mar 3, 2016 at 10:16 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> So yes, each estimator works great for exactly the opposite cases. <snip>
>>
>> is always much closer to the huge number. We're usually quite happy when the estimates are within the same order of magnitude, so whether it's K or K/2 makes pretty much no difference.
>
> I believe that Mark means geometrical average, i.e. sqrt((small number) * (huge number)).

Yes, that is what I proposed upthread. I'm not wedded to that, though. In general, I am with Tomas on this one, believing that his estimate will be much better than the current estimate. But I believe the *best* estimate will be somewhere between his and the current one, and I'm fishing for any decent way of coming up with a weighted average that is closer to his than to the current, but not simply equal to his proposal.

The reason I want the formula to be closer to Tomas's than to the current is that I think that on average, across all tables, across all databases, in practice it will be closer to the right estimate than the current formula. That's just my intuition, and so I can't defend it. But if my intuition is right, the best formula we can adopt would be one that is moderated from his by a little bit, and in the direction of the estimate that the current code generates.

I could easily lose this debate merely for lack of a principled basis for saying how far toward the current estimate the new estimate should be adjusted. The geometric average is one suggestion, but I don't have a principled argument for it.

Like I said above, I'm fishing for a decent formula here.

Mark Dilger
On Thu, 2016-03-03 at 11:42 -0800, Mark Dilger wrote:
>> On Mar 3, 2016, at 11:27 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
>>
>> I believe that Mark means geometrical average, i.e. sqrt((small number) * (huge number)).

Ah, OK. Haven't realized you meant the geometric mean. With that it looks like this:

1) independent

                 10     50    100    500   1000   5000
------------------------------------------------------------------
actual          919   3829   6244   9944  10001  10001
current          10     50    102    516   1018   4996
new             973   4001   6382   9897   9951   9951
average         491   2025   3205   5206   5484   7473
geom. mean       99    447    807   2260   3183   7051

2) dependent

                 10     50    100    500   1000   5000
------------------------------------------------------------------
actual           10     50    100    500   1000   5000
current          10     53    105    508   1016   5014
new             880   4105   6472   9955  10018  10018
average         445   2079   3288   5231   5517   7516
geom. mean       94    466    824   2249   3190   7087

To better illustrate the differences, we may use the "ratio error" defined as

    err(a,b) = MAX(a/b, b/a)

where 'a' is the actual value and 'b' is the estimate. That gives us:

1) independent

                 10     50    100    500   1000   5000
------------------------------------------------------------------
current       91.90  76.58  61.22  19.27   9.82   2.00
new            1.06   1.04   1.02   1.00   1.01   1.01
average        1.87   1.89   1.95   1.91   1.82   1.34
geom. mean     9.32   8.56   7.74   4.40   3.14   1.42

2) dependent

                 10     50    100    500   1000   5000
------------------------------------------------------------------
current        1.00   1.06   1.05   1.02   1.02   1.00
new           88.00  82.10  64.72  19.91  10.02   2.00
average       44.50  41.58  32.88  10.46   5.52   1.50
geom. mean     9.38   9.33   8.24   4.50   3.19   1.42

So the geometric mean seems to be working better than the plain average. But I don't think we can simply conclude we should use the geometric mean of the estimates, as that assumes the risks of over- and under-estimating the cardinality are the same. Because they aren't.

> Yes, that is what I proposed upthread. I'm not wedded to that, though.
> In general, I am with Tomas on this one, believing that his estimate
> will be much better than the current estimate. But I believe the *best*
> estimate will be somewhere between his and the current one, and I'm
> fishing for any decent way of coming up with a weighted average that
> is closer to his than to the current, but not simply equal to his proposal.
>
> The reason I want the formula to be closer to Tomas's than to the
> current is that I think that on average, across all tables, across all
> databases, in practice it will be closer to the right estimate than the
> current formula. That's just my intuition, and so I can't defend it.
> But if my intuition is right, the best formula we can adopt would be one
> that is moderated from his by a little bit, and in the direction of the
> estimate that the current code generates.

I think looking merely at what fraction of data sets (or queries) uses dependent GROUP BY and WHERE clauses is not sufficient.

The general behavior of the two GROUP BY estimators is that the current one tends to under-estimate, while the new one tends to over-estimate. (It's not difficult to construct counter-examples by using more complex dependencies between the columns etc., but let's ignore that.)

The risk associated with over-estimation is that we may switch from HashAggregate to GroupAggregate, and generally select paths better suited for a large number of rows. Not great, because the plan may not be optimal and thus run much slower - but that usually only happens on small tables, because on large tables we would probably end up using those same paths anyway.

With under-estimation, the main risks are much graver, as we may end up using HashAggregate only to get killed by OOM. When we're lucky not to hit that, in my experience this often leads to a cascade of NestedLoop nodes processing orders of magnitude more tuples than expected. Awful.

So I think the risk is asymmetric, and that's why I like the new estimator more, because it tends to over-estimate, but in a very reasonable way.

> I could easily lose this debate merely for lack of a principled basis
> for saying how far toward the current estimate the new estimate should
> be adjusted. The geometric average is one suggestion, but I don't have
> a principled argument for it.
>
> Like I said above, I'm fishing for a decent formula here.

Thanks for the valuable feedback!

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
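For anyone who wants to reproduce the comparison without running the queries, here is a small standalone sketch (illustrative only; the input values are rounded approximations of the first column of the tables above, and the names are made up):

#include <math.h>
#include <stdio.h>

/* "ratio error": MAX(actual/estimate, estimate/actual) */
static double
ratio_error(double actual, double estimate)
{
	return (actual > estimate) ? actual / estimate : estimate / actual;
}

int
main(void)
{
	double	tuples = 1000000;	/* rows in table t */
	double	ndistinct = 10000;	/* distinct values in column a */
	double	rows = 1000;		/* roughly the rows matching "b < 10" */
	double	actual = 919;		/* observed groups (independent case) */

	double	cur = ndistinct * rows / tuples;
	double	new_est = ndistinct * (1.0 - powl((ndistinct - 1) / ndistinct, rows));
	double	avg = (cur + new_est) / 2.0;
	double	geo = sqrt(cur * new_est);

	printf("current %.0f, new %.0f, average %.0f, geom. mean %.0f\n",
		   cur, new_est, avg, geo);
	printf("ratio errors: %.2f, %.2f, %.2f, %.2f\n",
		   ratio_error(actual, cur), ratio_error(actual, new_est),
		   ratio_error(actual, avg), ratio_error(actual, geo));

	return 0;
}

The output comes out roughly like the first column above (current ~10, new ~950, average ~480, geom. mean ~100), and the ratio errors show the same asymmetric pattern.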
On 4 March 2016 at 13:10, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> The risk associated with over-estimation is that we may switch from
> HashAggregate to GroupAggregate, and generally select paths better
> suited for a large number of rows. Not great, because the plan may not be
> optimal and thus run much slower - but that usually only happens on
> small tables, because on large tables we would probably end up using
> those same paths anyway.
>
> With under-estimation, the main risks are much graver, as we may end up
> using HashAggregate only to get killed by OOM. When we're lucky not to
> hit that, in my experience this often leads to a cascade of NestedLoop
> nodes processing orders of magnitude more tuples than expected. Awful.
>
> So I think the risk is asymmetric, and that's why I like the new
> estimator more, because it tends to over-estimate, but in a very
> reasonable way.

Agreed. Under-estimating is worse than over-estimating.

-    reldistinct *= rel->rows / rel->tuples;
+    reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows));

Looking at this, I agree that this new formula from [1] is generally better than the old one in most (but not all) cases.

One problem with it is that it no longer takes into account rel->tuples, which means that when returning all the tuples from the table, it no longer gives an estimate equal to the total number of distinct values in the table. In fact it tends to underestimate when returning nearly all the rows from the table.

The old formula is clearly not a good general-purpose estimate. However, there is one case where it does return an accurate estimate -- when all the values in the underlying table are distinct. In this case the new formula consistently produces underestimates, while the old formula works well. For example:

CREATE TABLE t AS SELECT (100000000 * random())::int a,
                         i::int b
                    FROM generate_series(1,1000000) s(i);
ANALYSE t;

EXPLAIN ANALYSE
SELECT a FROM t GROUP BY a;

So there are clearly around 1 million distinct values for "a", but the new formula gives an estimate of around 630,000. That's not a massive underestimate, but it's sufficiently obvious and visible that it would be good to do better if we can.

I think that a better formula to use would be

reldistinct *= (1 - powl(1 - rel->rows / rel->tuples, rel->tuples / reldistinct));

This comes from [2], which gives a general formula for selection without replacement, and then gives the above as an approximation to the uniform distribution case. It's not really made clear in that paper, but that approximation is actually the "with replacement" approximation, but starting from different initial assumptions to give a formula that has better boundary conditions and special-case behaviour, as described below.

For the record, here's a quick analysis of how the 2 formulae come about:

Assume there are:
  N rows in the table
  n distinct values (n <= N)
  p rows are chosen at random (p <= N)

so the problem is to estimate how many distinct values there will be in the p rows chosen. Both approaches work by first estimating the probability that some particular value x is *not* chosen.

[1] starts from the assumption that each row of the table has a probability of 1/n of being equal to x. So the probability that x is not the first value chosen is

  P(not x, 1) = 1 - 1/n

Then, if the value chosen first is replaced, the probability that x is not the second value chosen is the same. The "with replacement" approximation makes each choice an independent probability, so the overall probability that x is not in any of the p rows chosen is just the product of the p individual probabilities, which is just

  P(not x, p) = (1 - 1/n)^p

Then of course the probability that x *is* chosen at least once in the p rows is

  P(x, p) = 1 - (1 - 1/n)^p

Finally, since there are n possible values of x, and they're all equally likely in the table, the expected number of distinct values is

  D(p) = n * (1 - (1 - 1/n)^p)

The flaw in this approach is that for large p the "with replacement" approximation gets worse and worse, and the probabilities P(x, p) systematically under-estimate the actual probability that x is chosen, which increases as more values not equal to x are chosen. By the time the last value is chosen, P(x, p=N) should actually be 1.

[2] starts from a different initial assumption (the uniform distribution case) -- for any given value x, assume that the table contains N/n instances of x (ignoring the fact that that might not be an integer). Note that with this assumption there is guaranteed to be at least one instance of x, which is not the case with the above approach.

Now consider the first instance of x in the table. If we choose p rows from the table, the probability that that first instance of x is not chosen is

  P(not x[1], p) = 1 - p / N

Making the same "with replacement" simplification, the probability that the second instance of x is not chosen is the same, and they're independent probabilities again. So, if there are N/n instances of x, the overall probability that none of them is chosen is

  P(not x, p) = (1 - p / N)^(N/n)

So then the formula for the expected number of distinct values becomes

  D(p) = n * (1 - (1 - p / N)^(N/n))

This alternate formula has a few nice properties:

1). D(p=0) = 0

2). D(p=N) = n (choosing all the rows results in all the distinct values being chosen)

3). When n=N, D(p) = p (when all the values in the table are distinct, so are all the values chosen). This case matches the current formula.

The first formula for D(p) lacks properties (2) and (3).

The reason this second formula generally works better is that the "with replacement" approximation is only being made for up to (N/n) selections, rather than up to N selections. So it remains a good approximation as long as (N/n) is small compared to N, i.e., as long as n is fairly large. In fact even for n=1 or 2, it still works adequately if the results are rounded to the nearest integer.

The following snippet can be used in gnuplot to visualise the results for various values of n and N. I've included the exact "without replacement" formula too, by rewriting it in terms of the gamma function:

N=1000000.0; n=500000.0; plot [p=0:N] [0:n] p*n/N, n*(1-(1-1/n)**p), n*(1-(1-p/N)**(N/n)), n*(1 - exp(lgamma(N-N/n+1) + lgamma(N-p+1) - lgamma(N+1) - lgamma(N-N/n-p+1)))

For most values of N and n, the approximation from [2] is almost indistinguishable from the exact formula, whereas the formula from [1] tends to underestimate when selecting a large number of distinct values (e.g., try setting n=900000 in the plot above).

Regards,
Dean

[1] http://math.stackexchange.com/questions/72223/finding-expected-number-of-distinct-values-selected-from-a-set-of-integers
[2] http://www.adellera.it/investigations/distinct_balls/distinct_balls.pdf
Hi,

On Sun, 2016-03-13 at 15:24 +0000, Dean Rasheed wrote:
> On 4 March 2016 at 13:10, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> So I think the risk is asymmetric, and that's why I like the new
>> estimator more, because it tends to over-estimate, but in a very
>> reasonable way.
>
> Agreed. Under-estimating is worse than over-estimating.
>
> Looking at this, I agree that this new formula from [1] is generally
> better than the old one in most (but not all) cases.
>
> One problem with it is that it no longer takes into account
> rel->tuples, which means that when returning all the tuples from the
> table, it no longer gives an estimate equal to the total number of
> distinct values in the table. In fact it tends to underestimate when
> returning nearly all the rows from the table.

Yes, that's a good point.

> The old formula is clearly not a good general-purpose estimate.
> However, there is one case where it does return an accurate estimate
> -- when all the values in the underlying table are distinct. In this
> case the new formula consistently produces underestimates, while the
> old formula works well.
>
> <snip>
>
> I think that a better formula to use would be
>
> reldistinct *= (1 - powl(1 - rel->rows / rel->tuples, rel->tuples / reldistinct));
>
> This comes from [2], which gives a general formula for selection
> without replacement, and then gives the above as an approximation to
> the uniform distribution case.
>
> <snip>
>
> For most values of N and n, the approximation from [2] is almost
> indistinguishable from the exact formula, whereas the formula from
> [1] tends to underestimate when selecting a large number of distinct
> values (e.g., try setting n=900000 in the plot above).

Yes, I agree that the formula you propose seems to behave better.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 03/13/2016 11:09 PM, Tomas Vondra wrote:
> Hi,
>
> On Sun, 2016-03-13 at 15:24 +0000, Dean Rasheed wrote:
>> On 4 March 2016 at 13:10, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>> ...
>>
>> I think that a better formula to use would be
>>
>> reldistinct *= (1 - powl(1 - rel->rows / rel->tuples, rel->tuples / reldistinct));

Attached is a v3 of the patch using this formula instead of the original one. Interestingly, that apparently reduces the number of regression tests that get broken to a single one.

I'm not sure whether we need to provide a link to the PDF the formula comes from - perhaps we should?

I've also repeated the tests for the two tables (dependent and independent columns), comparing the actual number of groups and the different estimates, and the results look like this (v3 is the formula used in this patch):

1) independent

         |   10 |   50 |  100 |  500 |  1000 |  5000
---------------------------------------------------------
actual   |  919 | 3829 | 6244 | 9944 | 10001 | 10001
current  |   10 |   50 |  102 |  516 |  1018 |  4996
new (v1) |  973 | 4001 | 6382 | 9897 |  9951 |  9951
new (v3) | 1117 | 3852 | 6229 | 9943 | 10004 | 10004

2) dependent

         |   10 |   50 |  100 |  500 |  1000 |  5000
--------------------------------------------------------
actual   |   10 |   50 |  100 |  500 |  1000 |  5000
current  |   10 |   53 |  105 |  508 |  1016 |  5014
new (v1) |  880 | 4105 | 6472 | 9955 | 10018 | 10018
new (v3) |  807 | 3680 | 6050 | 9916 |  9983 |  9983

I only collected numbers for the new estimator, the other numbers are just a copy from the previous message. So there might be minor differences due to slightly different ndistinct estimates etc.

Anyway, the numbers are obviously quite close to the formula from v1 of the patch, plus the formula gives better estimates when scanning nearly all rows.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
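The v3 patch itself is in the attachment, but based on the formula quoted above, the core change in estimate_num_groups() presumably boils down to something like this (a sketch, not the actual diff):

-    reldistinct *= rel->rows / rel->tuples;
+    reldistinct *= (1 - powl(1 - rel->rows / rel->tuples, rel->tuples / reldistinct));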
Attachment
On 18 March 2016 at 00:37, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> On Sun, 2016-03-13 at 15:24 +0000, Dean Rasheed wrote:
>>> I think that a better formula to use would be
>>>
>>> reldistinct *= (1 - powl(1 - rel->rows / rel->tuples, rel->tuples / reldistinct));
>
> Attached is a v3 of the patch using this formula instead of the original
> one. Interestingly, that apparently reduces the number of regression tests
> that get broken to a single one.

Cool.

> I'm not sure whether we need to provide a link to the PDF the formula comes
> from - perhaps we should?

Probably a better URL to give is http://www.adellera.it/investigations/distinct_balls/ which has a link to the PDF version of the paper and also some supporting material.

However, while that paper is in general very clear, I don't think it gives a very clear explanation of that particular formula, and it doesn't explain what it represents. It merely says that if "i" can be ignored "for some reason (e.g. i << Nr)", then that formula is an approximation to the exact "without replacement" formula, which is the subject of that paper.

But actually, that formula is the exact formula for the expected number of distinct values when selecting *with replacement* from a collection where each of the distinct values occurs an equal number of times. So I think we should say that.

Perhaps something along the lines of:

    /*
     * Update the estimate based on the restriction selectivity.
     *
     * This uses the formula for the expected number of distinct values
     * when selecting with replacement from a collection where each of
     * the distinct values occurs an equal number of times (a uniform
     * distribution of values). This is a very close approximation to
     * the more correct method of selecting without replacement when
     * the number of distinct values is quite large --- for example,
     * see http://www.adellera.it/investigations/distinct_balls/.
     * Additionally, it also turns out to work very well even when the
     * number of distinct values is small.
     */

Regards,
Dean
Hi, Dean!
On Fri, Mar 18, 2016 at 1:20 PM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
> Probably a better URL to give is
> http://www.adellera.it/investigations/distinct_balls/ which has a link
> to the PDF version of the paper and also some supporting material.
>
> However, while that paper is in general very clear, I don't think it
> gives a very clear explanation of that particular formula, and it
> doesn't explain what it represents. It merely says that if "i" can be
> ignored "for some reason (e.g. i << Nr)", then that formula is an
> approximation to the exact "without replacement" formula, which is the
> subject of that paper.
>
> But actually, that formula is the exact formula for the expected
> number of distinct values when selecting *with replacement* from a
> collection where each of the distinct values occurs an equal number of
> times. So I think we should say that.
>
> Perhaps something along the lines of:
>
> /*
>  * Update the estimate based on the restriction selectivity.
>  *
>  * This uses the formula for the expected number of distinct values
>  * when selecting with replacement from a collection where each of
>  * the distinct values occurs an equal number of times (a uniform
>  * distribution of values). This is a very close approximation to
>  * the more correct method of selecting without replacement when
>  * the number of distinct values is quite large --- for example,
>  * see http://www.adellera.it/investigations/distinct_balls/.
>  * Additionally, it also turns out to work very well even when the
>  * number of distinct values is small.
>  */
+1

Thank you for your work on this patch. The formula you propose and the explanation look great!

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi, Tomas!
On Mon, Mar 21, 2016 at 2:39 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
> +1
>
> Thank you for your work on this patch. The formula you propose and the
> explanation look great!
I think you should send a revision of the patch including the comments proposed by Dean Rasheed.
I'm switching the patch status to Waiting on Author in the commitfest.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi Tomas,

On 3/22/16 10:40 AM, Alexander Korotkov wrote:
> I think you should send a revision of the patch including the comments
> proposed by Dean Rasheed.
> I'm switching the patch status to Waiting on Author in the commitfest.

We're getting pretty close to the end of the CF. Do you know when you will have a new patch ready?

Thanks,
--
-David
david@pgmasters.net
Hi,

On 03/22/2016 03:40 PM, Alexander Korotkov wrote:
> I think you should send a revision of the patch including the comments
> proposed by Dean Rasheed.
> I'm switching the patch status to Waiting on Author in the commitfest.

Attached is v4 of the patch - the only difference w.r.t. v3 is that I've used the comment as proposed by Dean.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 30 March 2016 at 14:03, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> Attached is v4 of the patch

Thanks, I think this is good to go, except that I think we need to use pow() rather than powl(), because AIUI powl() is new to C99 and so won't necessarily be available on all supported platforms. I don't think we need worry about loss of precision, since that would only be an issue if rel->rows / rel->tuples were smaller than maybe 10^-14 or so, and it seems unlikely we'll get anywhere near that any time soon.

I think this is a good, well thought-out change, so unless anyone objects I'll push it (probably this weekend).

Regards,
Dean
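A tiny standalone check of the precision point (a sketch, not planner code): for inputs in the range the planner actually sees, pow() and powl() agree to far more digits than the estimates feeding into them are good for; the numbers below are the ones used earlier in the thread:

#include <math.h>
#include <stdio.h>

int
main(void)
{
    double      tuples = 10000000.0;
    double      rows = 92216.0;
    double      reldistinct = 98882.0;

    double      d = 1 - pow(1 - rows / tuples, tuples / reldistinct);
    long double l = 1 - powl(1 - (long double) rows / (long double) tuples,
                             (long double) tuples / (long double) reldistinct);

    printf("pow  : %.17g\n", d);
    printf("powl : %.17Lg\n", l);
    printf("diff : %.3Lg\n", (long double) d - l);

    return 0;
}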
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> On 30 March 2016 at 14:03, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> Attached is v4 of the patch

> Thanks, I think this is good to go, except that I think we need to use
> pow() rather than powl() because AIUI powl() is new to C99, and so
> won't necessarily be available on all supported platforms. I don't
> think we need worry about loss of precision, since that would only be
> an issue if rel->rows / rel->tuples were smaller than maybe 10^-14 or
> so, and it seems unlikely we'll get anywhere near that any time soon.

I took a quick look. I concur with using pow() not powl(); the latter is not in SUS v2, which is our baseline portability expectation, and in fact there is *noplace* where we expect long double to work. Moreover, I don't believe that any of the estimates we're working with are so accurate that a double-width power result would be a useful improvement.

Also, I wonder if it'd be a good idea to provide a guard against division by zero --- we know rel->tuples > 0 at this point, but I'm less sure that reldistinct can't be zero. In the same vein, I'm worried about the first argument of pow() being slightly negative due to roundoff error, leading to a NaN result.

Maybe we should also consider clamping the final reldistinct estimate to an integer with clamp_row_est(). The existing code doesn't do that, but it seems like a good idea on general principles.

Another minor gripe is the use of a random URL as justification. This code will still be around when that URL exists nowhere but the Wayback Machine. Can't we find a more formal citation to use?

regards, tom lane
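To illustrate the NaN worry concretely (a standalone sketch, not planner code, with hypothetical values): if roundoff ever pushed 1 - rel->rows / rel->tuples slightly below zero, pow() with the non-integer exponent used here would return NaN:

#include <math.h>
#include <stdio.h>

int
main(void)
{
    double  base = -1e-17;      /* what roundoff could conceivably make of 1 - rows/tuples */
    double  exponent = 101.13;  /* a typical tuples / reldistinct value */

    /* negative finite base with a non-integer exponent is a domain error: NaN */
    printf("%g\n", pow(base, exponent));

    return 0;
}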
Tom Lane wrote:
> Another minor gripe is the use of a random URL as justification. This
> code will still be around when that URL exists nowhere but the Wayback
> Machine. Can't we find a more formal citation to use?

The article text refers to this 1977 S. B. Yao paper, "Approximating block accesses in database organizations", which doesn't appear to be available online except behind ACM's paywall at http://dl.acm.org/citation.cfm?id=359475

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> Another minor gripe is the use of a random URL as justification. This
>> code will still be around when that URL exists nowhere but the Wayback
>> Machine. Can't we find a more formal citation to use?

> The article text refers to this 1977 S. B. Yao paper, "Approximating
> block accesses in database organizations", which doesn't appear to be
> available online except behind ACM's paywall at
> http://dl.acm.org/citation.cfm?id=359475

Well, a CACM citation is perfectly fine by my lights (especially one that's that far back and therefore certainly patent-free ...)

Let's use something like this:

See "Approximating block accesses in database organizations", S. B. Yao,
Communications of the ACM, Volume 20 Issue 4, April 1977, Pages 260-261

regards, tom lane
On 31 March 2016 at 20:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Also, I wonder if it'd be a good idea to provide a guard against division
> by zero --- we know rel->tuples > 0 at this point, but I'm less sure that
> reldistinct can't be zero. In the same vein, I'm worried about the first
> argument of pow() being slightly negative due to roundoff error, leading
> to a NaN result.

Yeah, that makes sense. In fact, if we only apply the adjustment when reldistinct > 0 and rel->rows < rel->tuples, and rewrite the first argument to pow() as (rel->tuples - rel->rows) / rel->tuples, then it is guaranteed to be non-negative. If rel->rows >= rel->tuples (not sure if it can be greater), then we just want the original reldistinct.

> Maybe we should also consider clamping the final reldistinct estimate to
> an integer with clamp_row_est(). The existing code doesn't do that but
> it seems like a good idea on general principles.

OK, that seems sensible.

Regards,
Dean
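A standalone sketch of what that guarded update might look like (plain doubles stand in for the RelOptInfo fields, and the planner's clamp_row_est() is imitated by a local helper with the same intent; this is not the final committed code):

#include <math.h>
#include <stdio.h>

/* stand-in for clamp_row_est(): make it an integer and never go below 1 */
static double
clamp_row_est_sketch(double nrows)
{
    return (nrows <= 1.0) ? 1.0 : rint(nrows);
}

static double
adjust_reldistinct(double reldistinct, double rows, double tuples)
{
    if (reldistinct > 0 && rows < tuples)
    {
        /*
         * (tuples - rows) / tuples is strictly positive here, so pow()
         * never sees a negative base, and reldistinct > 0 keeps the
         * exponent finite.
         */
        reldistinct *=
            (1 - pow((tuples - rows) / tuples, tuples / reldistinct));
    }

    return clamp_row_est_sketch(reldistinct);
}

int
main(void)
{
    /* the 1% example from upthread: expect an estimate around 60000 */
    printf("%.0f\n", adjust_reldistinct(98882.0, 92216.0, 10000000.0));
    return 0;
}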
On 31 March 2016 at 21:40, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> Tom Lane wrote:
>>> Another minor gripe is the use of a random URL as justification. This
>>> code will still be around when that URL exists nowhere but the Wayback
>>> Machine. Can't we find a more formal citation to use?
>
>> The article text refers to this 1977 S. B. Yao paper, "Approximating
>> block accesses in database organizations", which doesn't appear to be
>> available online except behind ACM's paywall at
>> http://dl.acm.org/citation.cfm?id=359475
>
> Well, a CACM citation is perfectly fine by my lights (especially one
> that's that far back and therefore certainly patent-free ...)
>
> Let's use something like this:
>
> See "Approximating block accesses in database organizations", S. B. Yao,
> Communications of the ACM, Volume 20 Issue 4, April 1977, Pages 260-261

Sounds good.

Regards,
Dean
Tom Lane wrote:
>> The article text refers to this 1977 S. B. Yao paper, "Approximating
>> block accesses in database organizations", which doesn't appear to be
>> available online except behind ACM's paywall at
>> http://dl.acm.org/citation.cfm?id=359475
>
> Well, a CACM citation is perfectly fine by my lights (especially one
> that's that far back and therefore certainly patent-free ...)
>
> Let's use something like this:
>
> See "Approximating block accesses in database organizations", S. B. Yao,
> Communications of the ACM, Volume 20 Issue 4, April 1977, Pages 260-261

That sounds nice all right, but I'm not sure it's actually helpful, because the article text is not available anywhere. I doubt most people will spend 15 bucks to buy that paper ... so we don't actually know whether the paper supports the chosen formula :-) unless you have a CACM subscription and can verify it.

I think it's good to have the ACM reference anyway, for posterity, but it'd be good to (additionally) have something that people can read.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> On 31 March 2016 at 21:40, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Let's use something like this:
>> See "Approximating block accesses in database organizations", S. B. Yao,
>> Communications of the ACM, Volume 20 Issue 4, April 1977, Pages 260-261

Actually, having now looked at both the Dellera paper and the Yao one, what the latter shows seems to be equivalent to Dellera's equation (2) (the first one in his section 2.2). But what the code is actually using is the approximation in the second equation in 2.2, and that one is certainly not in Yao, although it's not hard to see how you get from the first to the second if you assume i << Nr.

I think it'd be worth writing out equation (2), attributing it to the Yao paper, and then saying something like "Alberto Dellera points out that if Nd is large, so that all the values of i are much less than Nr, then this formula can be approximated by [the formula used in the code]. It turns out that that formula also works well even when Nd is small."

regards, tom lane
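Spelling that step out (keeping the Nd/Nr notation, with $n$ selected rows and each of the $N_d$ distinct values occurring $N_r/N_d$ times; this is a sketch of the argument, not text from either paper), the exact expectation is

$$ D(n) \;=\; N_d \left( 1 - \prod_{i=1}^{n} \frac{N_r - N_r/N_d - i + 1}{N_r - i + 1} \right) $$

Writing the product with factorials and regrouping gives the equivalent form

$$ \prod_{i=1}^{n} \frac{N_r - N_r/N_d - i + 1}{N_r - i + 1} \;=\; \frac{(N_r - N_r/N_d)! \, (N_r - n)!}{(N_r - N_r/N_d - n)! \, N_r!} \;=\; \prod_{i=1}^{N_r/N_d} \frac{N_r - n - i + 1}{N_r - i + 1} $$

and if every $i$ in the right-hand product is negligible next to $N_r$ and $N_r - n$ (which holds when $N_d$ is large, since $i$ only runs up to $N_r/N_d$), each factor is approximately $(N_r - n)/N_r$, so

$$ D(n) \;\approx\; N_d \left( 1 - \left( \frac{N_r - n}{N_r} \right)^{N_r/N_d} \right) $$

which is the form used in the patch.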
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> Yeah, that makes sense. In fact, if we only apply the adjustment when
> reldistinct > 0 and rel->rows < rel->tuples, and rewrite the first
> argument to pow() as (rel->tuples - rel->rows) / rel->tuples, then it
> is guaranteed to be non-negative. If rel->rows >= rel->tuples (not
> sure if it can be greater), then we just want the original
> reldistinct.

Works for me.

regards, tom lane
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> See "Approximating block accesses in database organizations", S. B. Yao,
>> Communications of the ACM, Volume 20 Issue 4, April 1977, Pages 260-261

> That sounds nice all right, but I'm not sure it's actually helpful,
> because the article text is not available anywhere. I doubt most people
> will spend 15 bucks to buy that paper ... so we don't actually know
> whether the paper supports the chosen formula :-) unless you have a
> CACM subscription and can verify it.

Well, I do and I did, see my previous response.

> I think it's good to have the ACM reference anyway, for posterity, but
> it'd be good to (additionally) have something that people can read.

I'm just concerned about what happens when the Dellera paper stops being available. I don't mind including that URL as a backup to the written-out argument I just suggested.

regards, tom lane
On 31 March 2016 at 22:02, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm just concerned about what happens when the Dellera paper stops being
> available. I don't mind including that URL as a backup to the written-out
> argument I just suggested.

Here's an updated patch with references to both papers, and a more detailed justification for the formula, along with the other changes discussed. Note that although equation (2) in the Dell'Era paper looks different from the Yao formula, it's actually the same.

Regards,
Dean
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> Here's an updated patch with references to both papers, and a more
> detailed justification for the formula, along with the other changes
> discussed. Note that although equation (2) in the Dell'Era paper looks
> different from the Yao formula, it's actually the same.

Looks good to me.

regards, tom lane