Thread: stddev returns 0 when there is one row

stddev returns 0 when there is one row

From

Bruno Wolff III

Date:

19 April 2003, 12:10:52

stddev returns 0 when the number of rows is one. I would have expected null
to be returned in this case as the standard deviation is undefined when
there is one row.

Re: stddev returns 0 when there is one row

From

Manfred Koizar

Date:

19 April 2003, 14:05:57

On Sat, 19 Apr 2003 11:11:21 -0500, Bruno Wolff III <bruno@wolff.to>
wrote:
>stddev returns 0 when the number of rows is one. I would have expected null
>to be returned in this case as the standard deviation is undefined when
>there is one row.

As far as a little googling can tell, there a two kinds of standard
deviation:  Depending on whether you are calculating the standard
deviation of the *whole population* or of a *sample* you divide by N
or by (N - 1), respectively, before you take the square root.

I'm not an expert, what I say here is from
http://www.beyondtechnology.com/tips016.shtml.  Other web pages seem
to say the same.  OpenOffice.org Calc has two flavours of standard
deviation, too.

Servus
 Manfred

Re: stddev returns 0 when there is one row

From

Joe Conway

Date:

19 April 2003, 15:50:13

Manfred Koizar wrote:
> On Sat, 19 Apr 2003 11:11:21 -0500, Bruno Wolff III <bruno@wolff.to>
> wrote:
>
>>stddev returns 0 when the number of rows is one. I would have expected null
>>to be returned in this case as the standard deviation is undefined when
>>there is one row.
>
> As far as a little googling can tell, there a two kinds of standard
> deviation:  Depending on whether you are calculating the standard
> deviation of the *whole population* or of a *sample* you divide by N
> or by (N - 1), respectively, before you take the square root.
>
> I'm not an expert, what I say here is from
> http://www.beyondtechnology.com/tips016.shtml.  Other web pages seem
> to say the same.  OpenOffice.org Calc has two flavours of standard
> deviation, too.
>

There doesn't seem to be anything in SQL99 about this, but in SQL200x I
found in section 4.16.3:

— If STDDEV_POP is specified, then the population standard deviation of
<value expression>
evaluated for each row remaining in the group, defined as the square
root of the population
variance.
— If STDDEV_SAMP is specified, then the sample standard deviation of
<value expression> evaluated
for each row remaining in the group, defined as the square root of the
sample variance.

So I'd take it that PostgreSQL's STDDEV implements STDDEV_POP.

Joe

Re: stddev returns 0 when there is one row

From

Tom Lane

Date:

19 April 2003, 16:34:03

Joe Conway <mail@joeconway.com> writes:
> So I'd take it that PostgreSQL's STDDEV implements STDDEV_POP.

No, we implement the sample standard deviation, as stated in the docs:
http://www.ca.postgresql.org/users-lounge/docs/7.3/postgres/functions-aggregate.html

The code is pretty straightforward, at least in the float8 case:

    /* We define STDDEV of no values to be NULL, of 1 value to be 0 */
    if (N == 0.0)
        PG_RETURN_NULL();

    if (N <= 1.0)
        PG_RETURN_FLOAT8(0.0);

    numerator = N * sumX2 - sumX * sumX;

    /* Watch out for roundoff error producing a negative numerator */
    if (numerator <= 0.0)
        PG_RETURN_FLOAT8(0.0);

    PG_RETURN_FLOAT8(sqrt(numerator / (N * (N - 1.0))));

I don't have a real strong feeling about whether we should change the
behavior at N=1 or not.  Does the SQL200x spec provide any guidance?

            regards, tom lane

Re: stddev returns 0 when there is one row

From

Bruno Wolff III

Date:

19 April 2003, 16:36:03

On Sat, Apr 19, 2003 at 12:50:01 -0700,
  Joe Conway <mail@joeconway.com> wrote:
>
> There doesn't seem to be anything in SQL99 about this, but in SQL200x I
> found in section 4.16.3:
>
>  If STDDEV_POP is specified, then the population standard deviation of
> <value expression>
> evaluated for each row remaining in the group, defined as the square
> root of the population
> variance.
>  If STDDEV_SAMP is specified, then the sample standard deviation of
> <value expression> evaluated
> for each row remaining in the group, defined as the square root of the
> sample variance.
>
> So I'd take it that PostgreSQL's STDDEV implements STDDEV_POP.

Postgres's stddev is n-1 weighted. It specifically checks for n=0
(for which it returns null) and n=1 for which it returns 0.
From a mathematical standpoint returning 0 is odd, since the value
is undefined.

Re: stddev returns 0 when there is one row

From

nolan@celery.tssi.com

Date:

19 April 2003, 16:54:43

> So I'd take it that PostgreSQL's STDDEV implements STDDEV_POP.

A simple test (3 rows, values 1, 2, and 3) shows that pgsql uses the
sample (N-1) formulas for variance and standard deviation, not the
population (N) formulas.

Oracle, which also implements the sample formulas, specifically defines
the variance to be 0 when there is just one row.
--
Mike Nolan

Re: stddev returns 0 when there is one row

From

Joe Conway

Date:

20 April 2003, 00:30:04

Tom Lane wrote:
[PostgreSQL's STDDEV is a sample standard deviation, *not* a population
standrad deviation]
> I don't have a real strong feeling about whether we should change the
> behavior at N=1 or not.  Does the SQL200x spec provide any guidance?
>

I guess that's what I get for jumping to conclusions ;-0

The spec does have specific guidance in section
10.9 <aggregate function>:

j) STDDEV_SAMP(X) is equivalent to SQRT(VAR_SAMP(X)).

- and -

viii) If VAR_POP or VAR_SAMP is specified, then let S1 be the sum of
       values in the column of TXA, and S2 be the sum of the squares of
       the values in the column of TXA.
       1) If VAR_POP is specified, then the result is (S2-S1*S1/N)/N.
       2) If VAR_SAMP is specified, then:
          A) If N is 1 (one), then the result is the null value.
          B) Otherwise, the result is (S2-S1*S1/N)/(N-1)

Joe

Re: stddev returns 0 when there is one row

From

Tom Lane

Date:

20 April 2003, 19:50:35

Joe Conway <mail@joeconway.com> writes:
> Tom Lane wrote:
>> I don't have a real strong feeling about whether we should change the
>> behavior at N=1 or not.  Does the SQL200x spec provide any guidance?

> The spec does have specific guidance in section
> 10.9 <aggregate function>:

>        1) If VAR_POP is specified, then the result is (S2-S1*S1/N)/N.
>        2) If VAR_SAMP is specified, then:
>           A) If N is 1 (one), then the result is the null value.
>           B) Otherwise, the result is (S2-S1*S1/N)/(N-1)

Okay, that probably trumps the Oracle precedent, especially seeing that
it seems mathematically sounder.  I'll make the changes.

            regards, tom lane

Re: stddev returns 0 when there is one row

From

Douglas Trainor

Date:

20 April 2003, 21:01:26

Tom Lane wrote:

>Joe Conway <mail@joeconway.com> writes:
>
>>Tom Lane wrote:
>>
>>>I don't have a real strong feeling about whether we should change the
>>>behavior at N=1 or not.  Does the SQL200x spec provide any guidance?
>>>
>>The spec does have specific guidance in section
>>10.9 <aggregate function>:
>>
>>       1) If VAR_POP is specified, then the result is (S2-S1*S1/N)/N.
>>       2) If VAR_SAMP is specified, then:
>>          A) If N is 1 (one), then the result is the null value.
>>          B) Otherwise, the result is (S2-S1*S1/N)/(N-1)
>>
>
>Okay, that probably trumps the Oracle precedent, especially seeing that
>it seems mathematically sounder.  I'll make the changes.
>
The above is indeed the right thing to do for samples!
(Oracle must do something else as a convenience for programmers who
don't write code that checks for a sample size of at least two.)

What's really interesting to me is that StarOffice 6.0's spreadsheet
functions,
both the standard deviation of a sample (=STDEV) and variance of a sample
(=VAR) are bug-for-bug compatible with Excel 2002!  That is, Excel has a
bug, and StarOffice has the same bug to be compatible with Excel's bug.
I assume the functions are buggy in OpenOffice as well, but I haven't
checked.

For example, both of these calculations produce answers of 0 (zero)
but they should produce answers of 1 (one):

    =stdev(80000000,80000001,80000002)
    =var(80000000,80000001,80000002)

When the numbers are smaller, like this:

    =stdev(0,1,2)
    =var(0,1,2)

They produce correct answers.

    douglas "trying to exploit the R and PostgreSQL synergy" trainor

Re: stddev returns 0 when there is one row

From

Tom Lane

Date:

20 April 2003, 21:34:25

Douglas Trainor <trainor@uic.edu> writes:
> For example, both of these calculations produce answers of 0 (zero)
> but they should produce answers of 1 (one):

>     =stdev(80000000,80000001,80000002)
>     =var(80000000,80000001,80000002)

Looks like roundoff error to me.  That's pushing the limit of what you
can hope to do in float8 math.  Postgres gets the right answer with
NUMERIC data, but not with FLOAT8:

regression=# create table foo (f1 float8, f2 numeric, f3 int);
CREATE TABLE
regression=# insert into foo values(80000000, 80000000, 80000000);
INSERT 291676 1
regression=# insert into foo values(80000001, 80000001, 80000001);
INSERT 291677 1
regression=# insert into foo values(80000002, 80000002, 80000002);
INSERT 291678 1
regression=# select * from foo;
    f1    |    f2    |    f3
----------+----------+----------
 80000000 | 80000000 | 80000000
 80000001 | 80000001 | 80000001
 80000002 | 80000002 | 80000002
(3 rows)

regression=# select stddev(f1), variance(f1) from foo;
      stddev      |     variance
------------------+------------------
 1.15470053837925 | 1.33333333333333
(1 row)

regression=# select stddev(f2), variance(f2) from foo;
         stddev         |        variance
------------------------+------------------------
 1.00000000000000000000 | 1.00000000000000000000
(1 row)

regression=# select stddev(f3), variance(f3) from foo;
         stddev         |        variance
------------------------+------------------------
 1.00000000000000000000 | 1.00000000000000000000
(1 row)


(The integer case uses NUMERIC arithmetic under the hood.)

            regards, tom lane

Re: stddev returns 0 when there is one row

From

Murthy Kambhampaty

Date:

21 April 2003, 12:05:10

If we could step back for second and look at why you would want to calculate
a standard deviation in the first place -- i.e., to measure the "spread", or
"dispersion", in the measure represented by a given field, we might get a
better answer than looking at the specs. for somebody else' software.

If you were looking at a table of sample data, and you had a sample size of
one, then the result should, of course be null, since a sample size of one
gives you no information about the variance and standard deviation (standard
error, really) of the distribution from which the sample was drawn.

But we are talking about a relational database here, where it is common to
throw away redundant information, and multiple draws from a deterministic
process, with a variance and standard deviation of zero, are redundant (did
you ever do a "select distinct from <>"?). So, it is entirely likely that
data stored in postgresql will only have include row for data drawn from a
"distribution with zero standard deviation".

The STDEV_POP and STDEV_SMPL convention is not strictly right for the
context, since we throw away "redundant" records regardless of whether they
represent a sample or a population; so it's no surprise that postgresql is
programmed the way it is: if your single record may not really represent a
single draw on a distribution, so it's the user's job to program your SQL to
return NULL for std. dev. and variance.

Of course, multiple identical records aren't redundant when they describe
statistical data, so you should be keeping a count() column around to record
the "frequency" with which each distinct set of values occurs, and using the
variance and standard deviation formulas for "grouped data" - e.g., ( sum(
X**2 * freq) - sum( X *freq)*sum( X*freq))/(sum( freq) -1) for the sample
variance, where the column "X" records the data values, and the column
"freq" records the number of times the values is repeated in the sample.
Tom, would it be too much trouble to program these in (as STDEV_POP_G and
STDEV_SMPL_G or some variant?

Thanks,
    Murthy


-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Saturday, April 19, 2003 16:33
To: Joe Conway
Cc: Manfred Koizar; Bruno Wolff III; pgsql-general@postgresql.org
Subject: Re: [GENERAL] stddev returns 0 when there is one row


Joe Conway <mail@joeconway.com> writes:
> So I'd take it that PostgreSQL's STDDEV implements STDDEV_POP.

No, we implement the sample standard deviation, as stated in the docs:
http://www.ca.postgresql.org/users-lounge/docs/7.3/postgres/functions-aggreg
ate.html

The code is pretty straightforward, at least in the float8 case:

    /* We define STDDEV of no values to be NULL, of 1 value to be 0 */
    if (N == 0.0)
        PG_RETURN_NULL();

    if (N <= 1.0)
        PG_RETURN_FLOAT8(0.0);

    numerator = N * sumX2 - sumX * sumX;

    /* Watch out for roundoff error producing a negative numerator */
    if (numerator <= 0.0)
        PG_RETURN_FLOAT8(0.0);

    PG_RETURN_FLOAT8(sqrt(numerator / (N * (N - 1.0))));

I don't have a real strong feeling about whether we should change the
behavior at N=1 or not.  Does the SQL200x spec provide any guidance?

            regards, tom lane


---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

Re: stddev returns 0 when there is one row

From

Tom Lane

Date:

21 April 2003, 12:21:43

Murthy Kambhampaty <murthy.kambhampaty@goeci.com> writes:
> Tom, would it be too much trouble to program these in (as STDEV_POP_G and
> STDEV_SMPL_G or some variant?

I'm not volunteering, no.  (For starters, I think you'd need to upgrade
the aggregate code to allow multiple-input aggregates.  Which is worth
doing, but I do not expect to have time to think about such things for
7.4.)

            regards, tom lane

Re: stddev returns 0 when there is one row

From

Murthy Kambhampaty

Date:

21 April 2003, 12:56:45

FWIW, it seems that, at least in the long run, the variance(expression) and
stddev(expression) functions, should take two optional arguments, one that
specifies whether to return the sample or population measure, and the other
that identifies the count() column, and the processing would go like this
(pardon my Bash-ism):

variance(X,ARG1,ARG2) () {
    if [[ ARG2 != NULL ]]; then
        SUMX2 = sum( X**2 *"ARG2");
        SUMX    = sum( X *"ARG2");
        N = sum( "ARG2");
    else
        SUMX2 = sum( X**2 );
        SUMX    = sum( X );
        N = count( X);
    if

    if [[ ARG1 = 0 ]]; then # let 0 specify the sample measure
        varX = (SUMX2 - SUMX *SUMX/N)/(N -1) #there was a typo in my
earlier msg: missing "/sum( freq)" in the numerator
    elif [[ ARG1 = 1 ]]; then
        varX = (SUMX2 - SUMX *SUMX/N)/(N)
    fi
}

It's easy enough to do client side, for now.
    Murthy

PS: When you said, in response to Joe Conway's post re the SQL200x spec,
that you'd make the changes, I read it to mean you were going to implement
both the sample and population measures, but on second reading I think you
were just been saying that you'd make the existing function properly return
the sample standard deviation - NULL if only 1 row given?

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Monday, April 21, 2003 12:21
To: Murthy Kambhampaty
Cc: Joe Conway; Manfred Koizar; Bruno Wolff III;
pgsql-general@postgresql.org
Subject: Re: [GENERAL] stddev returns 0 when there is one row


Murthy Kambhampaty <murthy.kambhampaty@goeci.com> writes:
> Tom, would it be too much trouble to program these in (as STDEV_POP_G and
> STDEV_SMPL_G or some variant?

I'm not volunteering, no.  (For starters, I think you'd need to upgrade
the aggregate code to allow multiple-input aggregates.  Which is worth
doing, but I do not expect to have time to think about such things for
7.4.)

            regards, tom lane

Re: stddev returns 0 when there is one row

From

nolan@celery.tssi.com

Date:

21 April 2003, 13:42:08

> I'm not volunteering, no.  (For starters, I think you'd need to upgrade
> the aggregate code to allow multiple-input aggregates.  Which is worth
> doing, but I do not expect to have time to think about such things for
> 7.4.)

Having multiple-input aggregates opens up possibilites for a large number
of multivariate statistical measures, either built in or as user-defined
functions.  (He says as he works on formatting output from an Oracle DBMS
for time-series analysis via SPSS or SAS.)

I guess I don't understand the point of the count(*) parameter, but
wouldn't the easiest way to offer both sample and population formulas for
variance and standard deviation be to just add two new functions,
varp and stddevp, most of the code for which already exists?

That way they remain single-input aggregates.

(A perfectionist would also include vars and stddevs as equivalents to
the existing var and stddev functions.)
--
Mike Nolan

Re: stddev returns 0 when there is one row

From

Murthy Kambhampaty

Date:

21 April 2003, 14:08:20

Since you mentioned SAS, lookup the documentation for PROC MEANS, the FREQ
statement ...nolan@celery.tssi.com writes:
>I guess I don't understand the point of the count(*) parameter
Since you mention SAS, lookup the documentation for PROC MEANS, the FREQ
statement ...

"specifies a numeric variable whose value represents the frequency of the
observation. If you use the FREQ statement, the procedure assumes that each
observation represents n observations, where n is the value of variable. If
n is not an integer, the SAS System truncates it. If n is less than 1 or is
missing, the procedure does not use that observation to calculate
statistics.

The sum of the frequency variable represents the total number of
observations."

Re: stddev returns 0 when there is one row

From

Joe Conway

Date:

21 April 2003, 14:15:11

nolan@celery.tssi.com wrote:
>>I'm not volunteering, no.  (For starters, I think you'd need to upgrade
>>the aggregate code to allow multiple-input aggregates.  Which is worth
>>doing, but I do not expect to have time to think about such things for
>>7.4.)
>
> Having multiple-input aggregates opens up possibilites for a large number
> of multivariate statistical measures, either built in or as user-defined
> functions.  (He says as he works on formatting output from an Oracle DBMS
> for time-series analysis via SPSS or SAS.)
>

The SQL200x standard is pretty explicit, as I posted earlier. Assuming
it doesn't change between the draft I'm looking at and the final
version, we should in fact have each of the following as unary (single
argument) aggregates: VAR_POP, VAR_SAMP, STDDEV_POP, STDDEV_SAMP.
Interestingly section 4.16.3 covers quite a bit more including a number
of binary (two argument) aggregate functions (e.g. COVAR_POP and
COVAR_SAMP). It also talks about aggregation by either groups (similar
to current) and "windows":

4.16.3 Aggregate functions
An aggregate function is a function whose result is derived from an
aggregation of rows defined by one of:
   — The grouping of a grouped table, in which case the aggregate
     function is a group aggregate function, or set function, and for
     each group there is one aggregation, which includes every row
     in the group.
   — The window frame of a row R of a windowed table relative to a
     particular window structure descriptor, in which case the aggregate
     function is a window aggregate function, and the aggregation
     consists of every row in the window frame of R, as defined by the
     window structure descriptor.

All of this is very interesting, but it isn't relevant to PostgreSQL
7.4. In the meantime, if you are interested in multivariate statistics,
see PL/R here:  http://www.joeconway.com/

Joe

Re: stddev returns 0 when there is one row

From

Tom Lane

Date:

21 April 2003, 21:28:16

nolan@celery.tssi.com writes:
> wouldn't the easiest way to offer both sample and population formulas for
> variance and standard deviation be to just add two new functions,
> varp and stddevp, most of the code for which already exists?

As Joe pointed out, SQL200x seems to have laid down the law already on
what to call these things.

> That way they remain single-input aggregates.

My comment about multiple inputs was in response to the suggestion of
grouped input, which I took to mean that you'd want to write something
like "SELECT STDDEV_GROUPED(value, num_occurrences) FROM foo".

You could hack your way around the problem by defining the aggregate
to take a two-element array type:
    SELECT STDDEV_GROUPED(ARRAY[value, num_occurrences]) FROM foo
but this seems ugly and inefficient.

            regards, tom lane