Re: stddev returns 0 when there is one row - Mailing list pgsql-general
From | Murthy Kambhampaty |
---|---|
Subject | Re: stddev returns 0 when there is one row |
Date | |
Msg-id | 2D92FEBFD3BE1346A6C397223A8DD3FC0921F7@THOR.goeci.com Whole thread Raw |
In response to | stddev returns 0 when there is one row (Bruno Wolff III <bruno@wolff.to>) |
Responses |
Re: stddev returns 0 when there is one row
|
List | pgsql-general |
If we could step back for second and look at why you would want to calculate a standard deviation in the first place -- i.e., to measure the "spread", or "dispersion", in the measure represented by a given field, we might get a better answer than looking at the specs. for somebody else' software. If you were looking at a table of sample data, and you had a sample size of one, then the result should, of course be null, since a sample size of one gives you no information about the variance and standard deviation (standard error, really) of the distribution from which the sample was drawn. But we are talking about a relational database here, where it is common to throw away redundant information, and multiple draws from a deterministic process, with a variance and standard deviation of zero, are redundant (did you ever do a "select distinct from <>"?). So, it is entirely likely that data stored in postgresql will only have include row for data drawn from a "distribution with zero standard deviation". The STDEV_POP and STDEV_SMPL convention is not strictly right for the context, since we throw away "redundant" records regardless of whether they represent a sample or a population; so it's no surprise that postgresql is programmed the way it is: if your single record may not really represent a single draw on a distribution, so it's the user's job to program your SQL to return NULL for std. dev. and variance. Of course, multiple identical records aren't redundant when they describe statistical data, so you should be keeping a count() column around to record the "frequency" with which each distinct set of values occurs, and using the variance and standard deviation formulas for "grouped data" - e.g., ( sum( X**2 * freq) - sum( X *freq)*sum( X*freq))/(sum( freq) -1) for the sample variance, where the column "X" records the data values, and the column "freq" records the number of times the values is repeated in the sample. Tom, would it be too much trouble to program these in (as STDEV_POP_G and STDEV_SMPL_G or some variant? Thanks, Murthy -----Original Message----- From: Tom Lane [mailto:tgl@sss.pgh.pa.us] Sent: Saturday, April 19, 2003 16:33 To: Joe Conway Cc: Manfred Koizar; Bruno Wolff III; pgsql-general@postgresql.org Subject: Re: [GENERAL] stddev returns 0 when there is one row Joe Conway <mail@joeconway.com> writes: > So I'd take it that PostgreSQL's STDDEV implements STDDEV_POP. No, we implement the sample standard deviation, as stated in the docs: http://www.ca.postgresql.org/users-lounge/docs/7.3/postgres/functions-aggreg ate.html The code is pretty straightforward, at least in the float8 case: /* We define STDDEV of no values to be NULL, of 1 value to be 0 */ if (N == 0.0) PG_RETURN_NULL(); if (N <= 1.0) PG_RETURN_FLOAT8(0.0); numerator = N * sumX2 - sumX * sumX; /* Watch out for roundoff error producing a negative numerator */ if (numerator <= 0.0) PG_RETURN_FLOAT8(0.0); PG_RETURN_FLOAT8(sqrt(numerator / (N * (N - 1.0)))); I don't have a real strong feeling about whether we should change the behavior at N=1 or not. Does the SQL200x spec provide any guidance? regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 6: Have you searched our list archives? http://archives.postgresql.org
pgsql-general by date: