Thread: No stddev() for interval?
Hi all,
I noticed a peculiarity in the default postgres aggregate functions. min(), max() and avg() support interval as an input type, but stddev() and variance() do not.
Is there a rationale behind this, or is it just something that was never implemented?
Regards,
BJ
I noticed a peculiarity in the default postgres aggregate functions. min(), max() and avg() support interval as an input type, but stddev() and variance() do not.
Is there a rationale behind this, or is it just something that was never implemented?
Regards,
BJ
"Brendan Jurd" <direvus@gmail.com> writes: > I noticed a peculiarity in the default postgres aggregate functions. min()= > , > max() and avg() support interval as an input type, but stddev() and > variance() do not. > Is there a rationale behind this, or is it just something that was never > implemented? Is it sensible to calculate standard deviation on intervals? How would you handle the multiple components? I mean, you could certainly define *something*, but how sane/useful would the result be? regards, tom lane
On 5/20/06, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Brendan Jurd" <direvus@gmail.com> writes: > > I noticed a peculiarity in the default postgres aggregate functions. min()= > > , > > max() and avg() support interval as an input type, but stddev() and > > variance() do not. > > > Is there a rationale behind this, or is it just something that was never > > implemented? > > Is it sensible to calculate standard deviation on intervals? How would > you handle the multiple components? I mean, you could certainly define > *something*, but how sane/useful would the result be? Strictly speaking there's nothing bad in intervals. Physically standart deviation on interval can be very useful without any doubts. I can make a lot of examples on this. Say you want to know stat parameters of semi-regular periodical process (avg distance in time between maximums of some value and stddev of this quasiperiod -- why not?). Regards, Ivan Zolotukhin
"Ivan Zolotukhin" <ivan.zolotukhin@gmail.com> writes: > On 5/20/06, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Is it sensible to calculate standard deviation on intervals? How would >> you handle the multiple components? I mean, you could certainly define >> *something*, but how sane/useful would the result be? > Strictly speaking there's nothing bad in intervals. Physically > standart deviation on interval can be very useful without any doubts. If the intervals are all expressed in seconds then sure, the calculation is straightforward and useful. I'm wondering what happens when nonzero values of days and months get in there. regards, tom lane
If the intervals are all expressed in seconds then sure, the calculation
is straightforward and useful. I'm wondering what happens when nonzero
values of days and months get in there.
regards, tom lane
The existing logic used in avg(interval) can be seen in backend/utils/adt/timestamp.c, refer to functions interval_accum, interval_avg and interval_div.
interval_div(interval, double) is the most interesting for this discussion. There is a helpful comment that reads /* evaluate fractional months as 30 days */.
So for example, interval_div('4 mons'::interval, 3) gives you '1 mon 10 days'.
It's not perfect, but doing arithmetic that involves converting between months and days never is. All in favour of deleting the month as unit of measurement of time say "aye".
Well that's not going to happen in my lifetime. How about we just extend this same logic over to stddev and variance? It's strange having avg but not the other two.
Regards,
interval_div(interval, double) is the most interesting for this discussion. There is a helpful comment that reads /* evaluate fractional months as 30 days */.
So for example, interval_div('4 mons'::interval, 3) gives you '1 mon 10 days'.
It's not perfect, but doing arithmetic that involves converting between months and days never is. All in favour of deleting the month as unit of measurement of time say "aye".
Well that's not going to happen in my lifetime. How about we just extend this same logic over to stddev and variance? It's strange having avg but not the other two.
Regards,
BJ
On Sun, May 21, 2006 at 01:14:15AM +1000, Brendan Jurd wrote: > Hi all, > > I noticed a peculiarity in the default postgres aggregate functions. min(), > max() and avg() support interval as an input type, but stddev() and > variance() do not. > > Is there a rationale behind this, or is it just something that was never > implemented? That's because variance of foo is measured in foo^2 units. What is the square of an interval? Cheers, D -- David Fetter <david@fetter.org> http://fetter.org/ phone: +1 415 235 3778 AIM: dfetter666 Skype: davidfetter Remember to vote!
That's because variance of foo is measured in foo^2 units. What is
the square of an interval?
Cheers,
D
Well if you're willing to accept that for the purposes of computing the aggregates, an interval "month" is equal to 30 days (which is how avg(interval) already works), then an interval is reducable to a single quantity -- a number of seconds -- which can be squared.
30 days per month is pretty rough ... we could refine it to 30.4375, which is the average number of days per month over four years including one leap year.
First of all, stddev doesn't return square of smth - so, why should we worry about intermediate results? Furthermore, statistics work with any 'units' and doesn't worry about physical meaning of variance in any case (for example, what about variance for the set of lifetime values of people from town N in XX century? ;-) ). Second, SQL standard doesn't contain definition for STDDEV function, but it has STDDEV_POP and STDDEV_SAMP (but it doesn't really matter for this discussion). As for valid datatypes for these functions, I cannot find exact definition unfortunately, but I see folliwing: 'Without Feature T621, "Enhanced numeric functions", conforming SQL language shall not contain a <computational operation> that immediately contains STDDEV_POP, STDDEV_SAMP, VAR_POP, or VAR_SAMP.' So, authors meant that these functions should accept only numeric values. Last but not least, ORA doesn't want to accept interval values for stddev() func: CREATE TABLE teststddev(id INTEGER PRIMARY KEY, val INTERVAL YEAR TO MONTH); INSERT INTO teststddev(id, val) VALUES(1, INTERVAL '300' MONTH(3)); INSERT INTO teststddev(id, val) VALUES(2, INTERVAL '2' YEAR(1)); INSERT INTO teststddev(id, val) VALUES(3, INTERVAL '-125' MONTH(3)); SELECT STDDEV(val) FROM teststddev; ORA-00932: inconsistent datatypes: expected NUMBER got INTERVAL YEAR TO MONTH As for me, I think that it's quite reasonable to expect stddev working with intervals... Why not? On 5/21/06, David Fetter <david@fetter.org> wrote: > On Sun, May 21, 2006 at 01:14:15AM +1000, Brendan Jurd wrote: > > Hi all, > > > > I noticed a peculiarity in the default postgres aggregate functions. min(), > > max() and avg() support interval as an input type, but stddev() and > > variance() do not. > > > > Is there a rationale behind this, or is it just something that was never > > implemented? > > That's because variance of foo is measured in foo^2 units. What is > the square of an interval? > > Cheers, > D > -- > David Fetter <david@fetter.org> http://fetter.org/ > phone: +1 415 235 3778 AIM: dfetter666 > Skype: davidfetter > > Remember to vote! > > ---------------------------(end of broadcast)--------------------------- > TIP 9: In versions below 8.0, the planner will ignore your desire to > choose an index scan if your joining column's datatypes do not > match > -- Best regards, Nikolay
On 5/20/06, Tom Lane <tgl@sss.pgh.pa.us> wrote: > If the intervals are all expressed in seconds then sure, the calculation > is straightforward and useful. I'm wondering what happens when nonzero > values of days and months get in there. Ah! Maybe the reason for such thoughts lies in nature of postgres intervals. SQL:2003 standard paper says: 'There are two classes of intervals. One class, called year-month intervals, has an express or implied datetime precision that includes no fields other than YEAR and MONTH, though not both are required. The other class, called day-time intervals, has an express or implied interval precision that can include any fields other than YEAR or MONTH.' So, the basic question is 'why Postgres allows to combine month and day?' Actually, is it good idea? If we have two separate interval types - we haven't the problem of '1 month VS 30 days' at all... And if we have no such a problem, we would work with intervals as with numbers (I don't see the strong reason for absense of stddev() and even variance() for INTERVAL YEAR TO MONTH, INTERVAL DAY TO SECOND, INTERVAL MINUTE TO SECOND and so on). -- Best regards, Nikolay
On 5/21/06, Brendan Jurd <direvus@gmail.com> wrote: > > Well if you're willing to accept that for the purposes of computing the > aggregates, an interval "month" is equal to 30 days (which is how > avg(interval) already works), then an interval is reducable to a single > quantity -- a number of seconds -- which can be squared. > Let me make a correction. Internally, intervals are stored as separate values of months and days (and even seconds, for daylight saving purposes). So, in almost all cases everything is OK. But it isn't so when we should multiply or devide such 'hetorogenious intevals' ("select interval '1 month 1 day';" gives us '15 days 12:00:00')... -- Best regards, Nikolay
On 5/21/06, Nikolay Samokhvalov <samokhvalov@gmail.com> wrote: > So, in almost all cases everything is OK. But it isn't so > when we should multiply or devide such 'hetorogenious intevals' > ("select interval '1 month 1 day';" gives us '15 days 12:00:00')... You obviously meant "select interval '1 month 1 day' / 2;" to get above weird result of 15 days and 12 hours. Regards, Ivan Zolotukhin
Ah!
Maybe the reason for such thoughts lies in nature of postgres intervals.
SQL:2003 standard paper says:
'There are two classes of intervals. One class, called year-month
intervals, has an express or implied datetime
precision that includes no fields other than YEAR and MONTH, though
not both are required. The other class,
called day-time intervals, has an express or implied interval
precision that can include any fields other than
YEAR or MONTH.'
So, the basic question is 'why Postgres allows to combine month and day?'
Actually, is it good idea?
If we have two separate interval types - we haven't the problem of '1
month VS 30 days' at all... And if we have no such a problem, we would
work with intervals as with numbers (I don't see the strong reason for
absense of stddev() and even variance() for INTERVAL YEAR TO MONTH,
INTERVAL DAY TO SECOND, INTERVAL MINUTE TO SECOND and so on).
--
Best regards,
Nikolay
It certainly would make life easier for interval operations if the month <-> day conversion was no longer a factor. Keeping the years and months separate from the other fields makes sense to me. It would be nice from a standards-compliance point of view too.
AFAICT, the only useful reason for the postgres interval to include the 'month' and 'year' units is so that users can perform arithmetic like "now() - '3 months'::interval". With two separate types, you could still support this kind of operation with the year-month interval, and support all other operations with the day-time interval.
Clearly it would be major effort to build the two new interval types, but I'm thinking it would be worth the trouble (read: I'm willing to sink my own time into it).
Regards
BJ
On 5/21/06, Brendan Jurd <direvus@gmail.com> wrote:
Ah!
Maybe the reason for such thoughts lies in nature of postgres intervals.
SQL:2003 standard paper says:
'There are two classes of intervals. One class, called year-month
intervals, has an express or implied datetime
precision that includes no fields other than YEAR and MONTH, though
not both are required. The other class,
called day-time intervals, has an express or implied interval
precision that can include any fields other than
YEAR or MONTH.'It certainly would make life easier for interval operations if the month <-> day conversion was no longer a factor. Keeping the years and months separate from the other fields makes sense to me. It would be nice from a standards-compliance point of view too.
AFAICT, the only useful reason for the postgres interval to include the 'month' and 'year' units is so that users can perform arithmetic like "now() - '3 months'::interval". With two separate types, you could still support this kind of operation with the year-month interval, and support all other operations with the day-time interval.
Clearly it would be major effort to build the two new interval types, but I'm thinking it would be worth the trouble (read: I'm willing to sink my own time into it).
RegardsBJ
I've been mulling this idea over for the last couple of days, and I still think it would be a big improvement to postgres' interval support. The whole day <-> month problem is a serious fly in the ointment. Nobody's shot me down yet ... is there any significant downside to having the proposed two separate interval types?
Here's how I see it working:
* The existing 'interval' type is left in place with existing functionality, so we don't break existing apps.
* Two new types are added, let's call them 'month interval' and 'second interval' for the moment. The internal representation of the month type can be a signed integer, for the second type a signed value that supports a maximum resolution of 1 microsecond.
* Textual input syntax for 'second interval' is the same general form as for 'interval', with valid fields being second, minute, hour, day and week. For 'month interval', only month, year, decade, century and millenium are considered valid.
* Add functions for specifying the new interval types with numeric input, e.g., "months(int) returns month interval", "years(int) returns month interval", "days(int) returns second interval", "seconds(double) returns second interval".
* Allow (implicit?) cast from month interval to interval, and from second interval to interval, but not the reverse.
* Eventually, the return type of date and timestamp subtraction becomes second interval.
Regards,
BJ