Thread: stddev returns 0 when there is one row
stddev returns 0 when the number of rows is one. I would have expected null to be returned in this case as the standard deviation is undefined when there is one row.
On Sat, 19 Apr 2003 11:11:21 -0500, Bruno Wolff III <bruno@wolff.to> wrote: >stddev returns 0 when the number of rows is one. I would have expected null >to be returned in this case as the standard deviation is undefined when >there is one row. As far as a little googling can tell, there a two kinds of standard deviation: Depending on whether you are calculating the standard deviation of the *whole population* or of a *sample* you divide by N or by (N - 1), respectively, before you take the square root. I'm not an expert, what I say here is from http://www.beyondtechnology.com/tips016.shtml. Other web pages seem to say the same. OpenOffice.org Calc has two flavours of standard deviation, too. Servus Manfred
Manfred Koizar wrote: > On Sat, 19 Apr 2003 11:11:21 -0500, Bruno Wolff III <bruno@wolff.to> > wrote: > >>stddev returns 0 when the number of rows is one. I would have expected null >>to be returned in this case as the standard deviation is undefined when >>there is one row. > > As far as a little googling can tell, there a two kinds of standard > deviation: Depending on whether you are calculating the standard > deviation of the *whole population* or of a *sample* you divide by N > or by (N - 1), respectively, before you take the square root. > > I'm not an expert, what I say here is from > http://www.beyondtechnology.com/tips016.shtml. Other web pages seem > to say the same. OpenOffice.org Calc has two flavours of standard > deviation, too. > There doesn't seem to be anything in SQL99 about this, but in SQL200x I found in section 4.16.3: — If STDDEV_POP is specified, then the population standard deviation of <value expression> evaluated for each row remaining in the group, defined as the square root of the population variance. — If STDDEV_SAMP is specified, then the sample standard deviation of <value expression> evaluated for each row remaining in the group, defined as the square root of the sample variance. So I'd take it that PostgreSQL's STDDEV implements STDDEV_POP. Joe
Joe Conway <mail@joeconway.com> writes: > So I'd take it that PostgreSQL's STDDEV implements STDDEV_POP. No, we implement the sample standard deviation, as stated in the docs: http://www.ca.postgresql.org/users-lounge/docs/7.3/postgres/functions-aggregate.html The code is pretty straightforward, at least in the float8 case: /* We define STDDEV of no values to be NULL, of 1 value to be 0 */ if (N == 0.0) PG_RETURN_NULL(); if (N <= 1.0) PG_RETURN_FLOAT8(0.0); numerator = N * sumX2 - sumX * sumX; /* Watch out for roundoff error producing a negative numerator */ if (numerator <= 0.0) PG_RETURN_FLOAT8(0.0); PG_RETURN_FLOAT8(sqrt(numerator / (N * (N - 1.0)))); I don't have a real strong feeling about whether we should change the behavior at N=1 or not. Does the SQL200x spec provide any guidance? regards, tom lane
On Sat, Apr 19, 2003 at 12:50:01 -0700, Joe Conway <mail@joeconway.com> wrote: > > There doesn't seem to be anything in SQL99 about this, but in SQL200x I > found in section 4.16.3: > > If STDDEV_POP is specified, then the population standard deviation of > <value expression> > evaluated for each row remaining in the group, defined as the square > root of the population > variance. > If STDDEV_SAMP is specified, then the sample standard deviation of > <value expression> evaluated > for each row remaining in the group, defined as the square root of the > sample variance. > > So I'd take it that PostgreSQL's STDDEV implements STDDEV_POP. Postgres's stddev is n-1 weighted. It specifically checks for n=0 (for which it returns null) and n=1 for which it returns 0. From a mathematical standpoint returning 0 is odd, since the value is undefined.
> So I'd take it that PostgreSQL's STDDEV implements STDDEV_POP. A simple test (3 rows, values 1, 2, and 3) shows that pgsql uses the sample (N-1) formulas for variance and standard deviation, not the population (N) formulas. Oracle, which also implements the sample formulas, specifically defines the variance to be 0 when there is just one row. -- Mike Nolan
Tom Lane wrote: [PostgreSQL's STDDEV is a sample standard deviation, *not* a population standrad deviation] > I don't have a real strong feeling about whether we should change the > behavior at N=1 or not. Does the SQL200x spec provide any guidance? > I guess that's what I get for jumping to conclusions ;-0 The spec does have specific guidance in section 10.9 <aggregate function>: j) STDDEV_SAMP(X) is equivalent to SQRT(VAR_SAMP(X)). - and - viii) If VAR_POP or VAR_SAMP is specified, then let S1 be the sum of values in the column of TXA, and S2 be the sum of the squares of the values in the column of TXA. 1) If VAR_POP is specified, then the result is (S2-S1*S1/N)/N. 2) If VAR_SAMP is specified, then: A) If N is 1 (one), then the result is the null value. B) Otherwise, the result is (S2-S1*S1/N)/(N-1) Joe
Joe Conway <mail@joeconway.com> writes: > Tom Lane wrote: >> I don't have a real strong feeling about whether we should change the >> behavior at N=1 or not. Does the SQL200x spec provide any guidance? > The spec does have specific guidance in section > 10.9 <aggregate function>: > 1) If VAR_POP is specified, then the result is (S2-S1*S1/N)/N. > 2) If VAR_SAMP is specified, then: > A) If N is 1 (one), then the result is the null value. > B) Otherwise, the result is (S2-S1*S1/N)/(N-1) Okay, that probably trumps the Oracle precedent, especially seeing that it seems mathematically sounder. I'll make the changes. regards, tom lane
Tom Lane wrote: >Joe Conway <mail@joeconway.com> writes: > >>Tom Lane wrote: >> >>>I don't have a real strong feeling about whether we should change the >>>behavior at N=1 or not. Does the SQL200x spec provide any guidance? >>> >>The spec does have specific guidance in section >>10.9 <aggregate function>: >> >> 1) If VAR_POP is specified, then the result is (S2-S1*S1/N)/N. >> 2) If VAR_SAMP is specified, then: >> A) If N is 1 (one), then the result is the null value. >> B) Otherwise, the result is (S2-S1*S1/N)/(N-1) >> > >Okay, that probably trumps the Oracle precedent, especially seeing that >it seems mathematically sounder. I'll make the changes. > The above is indeed the right thing to do for samples! (Oracle must do something else as a convenience for programmers who don't write code that checks for a sample size of at least two.) What's really interesting to me is that StarOffice 6.0's spreadsheet functions, both the standard deviation of a sample (=STDEV) and variance of a sample (=VAR) are bug-for-bug compatible with Excel 2002! That is, Excel has a bug, and StarOffice has the same bug to be compatible with Excel's bug. I assume the functions are buggy in OpenOffice as well, but I haven't checked. For example, both of these calculations produce answers of 0 (zero) but they should produce answers of 1 (one): =stdev(80000000,80000001,80000002) =var(80000000,80000001,80000002) When the numbers are smaller, like this: =stdev(0,1,2) =var(0,1,2) They produce correct answers. douglas "trying to exploit the R and PostgreSQL synergy" trainor
Douglas Trainor <trainor@uic.edu> writes: > For example, both of these calculations produce answers of 0 (zero) > but they should produce answers of 1 (one): > =stdev(80000000,80000001,80000002) > =var(80000000,80000001,80000002) Looks like roundoff error to me. That's pushing the limit of what you can hope to do in float8 math. Postgres gets the right answer with NUMERIC data, but not with FLOAT8: regression=# create table foo (f1 float8, f2 numeric, f3 int); CREATE TABLE regression=# insert into foo values(80000000, 80000000, 80000000); INSERT 291676 1 regression=# insert into foo values(80000001, 80000001, 80000001); INSERT 291677 1 regression=# insert into foo values(80000002, 80000002, 80000002); INSERT 291678 1 regression=# select * from foo; f1 | f2 | f3 ----------+----------+---------- 80000000 | 80000000 | 80000000 80000001 | 80000001 | 80000001 80000002 | 80000002 | 80000002 (3 rows) regression=# select stddev(f1), variance(f1) from foo; stddev | variance ------------------+------------------ 1.15470053837925 | 1.33333333333333 (1 row) regression=# select stddev(f2), variance(f2) from foo; stddev | variance ------------------------+------------------------ 1.00000000000000000000 | 1.00000000000000000000 (1 row) regression=# select stddev(f3), variance(f3) from foo; stddev | variance ------------------------+------------------------ 1.00000000000000000000 | 1.00000000000000000000 (1 row) (The integer case uses NUMERIC arithmetic under the hood.) regards, tom lane
If we could step back for second and look at why you would want to calculate a standard deviation in the first place -- i.e., to measure the "spread", or "dispersion", in the measure represented by a given field, we might get a better answer than looking at the specs. for somebody else' software. If you were looking at a table of sample data, and you had a sample size of one, then the result should, of course be null, since a sample size of one gives you no information about the variance and standard deviation (standard error, really) of the distribution from which the sample was drawn. But we are talking about a relational database here, where it is common to throw away redundant information, and multiple draws from a deterministic process, with a variance and standard deviation of zero, are redundant (did you ever do a "select distinct from <>"?). So, it is entirely likely that data stored in postgresql will only have include row for data drawn from a "distribution with zero standard deviation". The STDEV_POP and STDEV_SMPL convention is not strictly right for the context, since we throw away "redundant" records regardless of whether they represent a sample or a population; so it's no surprise that postgresql is programmed the way it is: if your single record may not really represent a single draw on a distribution, so it's the user's job to program your SQL to return NULL for std. dev. and variance. Of course, multiple identical records aren't redundant when they describe statistical data, so you should be keeping a count() column around to record the "frequency" with which each distinct set of values occurs, and using the variance and standard deviation formulas for "grouped data" - e.g., ( sum( X**2 * freq) - sum( X *freq)*sum( X*freq))/(sum( freq) -1) for the sample variance, where the column "X" records the data values, and the column "freq" records the number of times the values is repeated in the sample. Tom, would it be too much trouble to program these in (as STDEV_POP_G and STDEV_SMPL_G or some variant? Thanks, Murthy -----Original Message----- From: Tom Lane [mailto:tgl@sss.pgh.pa.us] Sent: Saturday, April 19, 2003 16:33 To: Joe Conway Cc: Manfred Koizar; Bruno Wolff III; pgsql-general@postgresql.org Subject: Re: [GENERAL] stddev returns 0 when there is one row Joe Conway <mail@joeconway.com> writes: > So I'd take it that PostgreSQL's STDDEV implements STDDEV_POP. No, we implement the sample standard deviation, as stated in the docs: http://www.ca.postgresql.org/users-lounge/docs/7.3/postgres/functions-aggreg ate.html The code is pretty straightforward, at least in the float8 case: /* We define STDDEV of no values to be NULL, of 1 value to be 0 */ if (N == 0.0) PG_RETURN_NULL(); if (N <= 1.0) PG_RETURN_FLOAT8(0.0); numerator = N * sumX2 - sumX * sumX; /* Watch out for roundoff error producing a negative numerator */ if (numerator <= 0.0) PG_RETURN_FLOAT8(0.0); PG_RETURN_FLOAT8(sqrt(numerator / (N * (N - 1.0)))); I don't have a real strong feeling about whether we should change the behavior at N=1 or not. Does the SQL200x spec provide any guidance? regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Murthy Kambhampaty <murthy.kambhampaty@goeci.com> writes: > Tom, would it be too much trouble to program these in (as STDEV_POP_G and > STDEV_SMPL_G or some variant? I'm not volunteering, no. (For starters, I think you'd need to upgrade the aggregate code to allow multiple-input aggregates. Which is worth doing, but I do not expect to have time to think about such things for 7.4.) regards, tom lane
FWIW, it seems that, at least in the long run, the variance(expression) and stddev(expression) functions, should take two optional arguments, one that specifies whether to return the sample or population measure, and the other that identifies the count() column, and the processing would go like this (pardon my Bash-ism): variance(X,ARG1,ARG2) () { if [[ ARG2 != NULL ]]; then SUMX2 = sum( X**2 *"ARG2"); SUMX = sum( X *"ARG2"); N = sum( "ARG2"); else SUMX2 = sum( X**2 ); SUMX = sum( X ); N = count( X); if if [[ ARG1 = 0 ]]; then # let 0 specify the sample measure varX = (SUMX2 - SUMX *SUMX/N)/(N -1) #there was a typo in my earlier msg: missing "/sum( freq)" in the numerator elif [[ ARG1 = 1 ]]; then varX = (SUMX2 - SUMX *SUMX/N)/(N) fi } It's easy enough to do client side, for now. Murthy PS: When you said, in response to Joe Conway's post re the SQL200x spec, that you'd make the changes, I read it to mean you were going to implement both the sample and population measures, but on second reading I think you were just been saying that you'd make the existing function properly return the sample standard deviation - NULL if only 1 row given? -----Original Message----- From: Tom Lane [mailto:tgl@sss.pgh.pa.us] Sent: Monday, April 21, 2003 12:21 To: Murthy Kambhampaty Cc: Joe Conway; Manfred Koizar; Bruno Wolff III; pgsql-general@postgresql.org Subject: Re: [GENERAL] stddev returns 0 when there is one row Murthy Kambhampaty <murthy.kambhampaty@goeci.com> writes: > Tom, would it be too much trouble to program these in (as STDEV_POP_G and > STDEV_SMPL_G or some variant? I'm not volunteering, no. (For starters, I think you'd need to upgrade the aggregate code to allow multiple-input aggregates. Which is worth doing, but I do not expect to have time to think about such things for 7.4.) regards, tom lane
> I'm not volunteering, no. (For starters, I think you'd need to upgrade > the aggregate code to allow multiple-input aggregates. Which is worth > doing, but I do not expect to have time to think about such things for > 7.4.) Having multiple-input aggregates opens up possibilites for a large number of multivariate statistical measures, either built in or as user-defined functions. (He says as he works on formatting output from an Oracle DBMS for time-series analysis via SPSS or SAS.) I guess I don't understand the point of the count(*) parameter, but wouldn't the easiest way to offer both sample and population formulas for variance and standard deviation be to just add two new functions, varp and stddevp, most of the code for which already exists? That way they remain single-input aggregates. (A perfectionist would also include vars and stddevs as equivalents to the existing var and stddev functions.) -- Mike Nolan
Since you mentioned SAS, lookup the documentation for PROC MEANS, the FREQ statement ...nolan@celery.tssi.com writes: >I guess I don't understand the point of the count(*) parameter Since you mention SAS, lookup the documentation for PROC MEANS, the FREQ statement ... "specifies a numeric variable whose value represents the frequency of the observation. If you use the FREQ statement, the procedure assumes that each observation represents n observations, where n is the value of variable. If n is not an integer, the SAS System truncates it. If n is less than 1 or is missing, the procedure does not use that observation to calculate statistics. The sum of the frequency variable represents the total number of observations."
nolan@celery.tssi.com wrote: >>I'm not volunteering, no. (For starters, I think you'd need to upgrade >>the aggregate code to allow multiple-input aggregates. Which is worth >>doing, but I do not expect to have time to think about such things for >>7.4.) > > Having multiple-input aggregates opens up possibilites for a large number > of multivariate statistical measures, either built in or as user-defined > functions. (He says as he works on formatting output from an Oracle DBMS > for time-series analysis via SPSS or SAS.) > The SQL200x standard is pretty explicit, as I posted earlier. Assuming it doesn't change between the draft I'm looking at and the final version, we should in fact have each of the following as unary (single argument) aggregates: VAR_POP, VAR_SAMP, STDDEV_POP, STDDEV_SAMP. Interestingly section 4.16.3 covers quite a bit more including a number of binary (two argument) aggregate functions (e.g. COVAR_POP and COVAR_SAMP). It also talks about aggregation by either groups (similar to current) and "windows": 4.16.3 Aggregate functions An aggregate function is a function whose result is derived from an aggregation of rows defined by one of: — The grouping of a grouped table, in which case the aggregate function is a group aggregate function, or set function, and for each group there is one aggregation, which includes every row in the group. — The window frame of a row R of a windowed table relative to a particular window structure descriptor, in which case the aggregate function is a window aggregate function, and the aggregation consists of every row in the window frame of R, as defined by the window structure descriptor. All of this is very interesting, but it isn't relevant to PostgreSQL 7.4. In the meantime, if you are interested in multivariate statistics, see PL/R here: http://www.joeconway.com/ Joe
nolan@celery.tssi.com writes: > wouldn't the easiest way to offer both sample and population formulas for > variance and standard deviation be to just add two new functions, > varp and stddevp, most of the code for which already exists? As Joe pointed out, SQL200x seems to have laid down the law already on what to call these things. > That way they remain single-input aggregates. My comment about multiple inputs was in response to the suggestion of grouped input, which I took to mean that you'd want to write something like "SELECT STDDEV_GROUPED(value, num_occurrences) FROM foo". You could hack your way around the problem by defining the aggregate to take a two-element array type: SELECT STDDEV_GROUPED(ARRAY[value, num_occurrences]) FROM foo but this seems ugly and inefficient. regards, tom lane