Thread: Q: Aggregating Weekly Production Data. How do you do it?

Q: Aggregating Weekly Production Data. How do you do it?

From: Ow Mun Heng

Just wondering how everyone is doing aggregation of production data.

The data runs in the vicinity of a few million rows a week.

What methods will effectively provide the min/max/average/count/stdev
of the weekly sample, broken down by product/software mix etc., and
still be able to answer correctly: what's the average of data_1 over
the past 2 months?

(I can't just take the average of the 8 weekly averages.)

e.g.:
wk   avg data_1
w1 - 2
w2 - 2
w3 - 2
w4 - 3
w5 - 1
w6 - 2
w7 - 2
w8 - 2
average of past 2 months = avg(w1..w8), which is statistically wrong.

Using the sum of data_1 per week would work, though. Please share your
expertise / experience.

(getting min/max/count isn't much of an issue. Stdev is the main issue I
believe)

One such instance I've read about is..

From this website : (it references using SQL Server Analysis services
but I think the concept is the same)

http://www.phptr.com/articles/printerfriendly.asp?p=337135&rl=1

1. Calculate the sum of the square of each sale
2. Multiply the result of step 1 by the sales count
3. Sum all sales
4. Square the result of step 3
5. Subtract the result of step 4 from the result of step 2
6. Multiply the sales count by one less than the sales count
   ("sales_count" * ("sales_count" - 1))
7. Divide the result of step 5 by the result of step 6
8. Stdev will be the square root of step 7

The results are valid (verified with actual data), but I don't
understand the logic. All the statistics books I've read give stdev as
sqrt(sum(x - ave(x))^2 / (n - 1)). The formula is very different, hence
the confusion.
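
For reference, the eight steps collapse into a single expression. Here is
a sketch against a hypothetical sales(amount numeric) table; it assumes at
least two rows, otherwise the divisor is zero:

-- steps 1-8 as one query:
-- sqrt( (n * sum(x^2) - (sum(x))^2) / (n * (n - 1)) )
SELECT sqrt(
         (count(amount) * sum(amount * amount) - sum(amount) ^ 2)
         / (count(amount) * (count(amount) - 1))
       ) AS stdev_of_sales
FROM sales;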


Re: Q: Aggregating Weekly Production Data. How do you do it?

From: Ron Johnson

On 09/17/07 23:34, Ow Mun Heng wrote:
> Just wondering how everyone is doing aggregation of production data.
>
> The data runs in the vicinity of a few million rows a week.
>
> What methods will effectively provide the min/max/average/count/stdev
> of the weekly sample, broken down by product/software mix etc., and
> still be able to answer correctly: what's the average of data_1 over
> the past 2 months?

That's the purpose of data warehouses and ETL, isn't it?

--
Ron Johnson, Jr.
Jefferson LA  USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!


Re: Q: Aggregating Weekly Production Data. How do you do it?

From: "John D. Burger"

Ow Mun Heng wrote:

> The results are valid (verified with actual data), but I don't
> understand the logic. All the statistics books I've read give stdev as
> sqrt(sum(x - ave(x))^2 / (n - 1)). The formula is very different,
> hence the confusion.

A formula is not an algorithm.  In particular, the naive way of
calculating variance or standard deviation has massive numerical
instability problems; anything involving sums of squares does.
There are a variety of alternate algorithms for stddev/variance, and I
presume your other algorithm is similarly trying to avoid these same
issues (though I have not looked closely at it).  You can also see
Wikipedia for one of the best known, due to Knuth/Welford:

   http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
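
For illustration, here is a rough sketch of that Knuth/Welford recurrence
packaged as a custom PostgreSQL aggregate. All of the function and
aggregate names below are invented for the example, and the built-in
stddev() / stddev_samp() should normally be preferred:

-- state is {count, mean, M2}; M2 accumulates sum((x - mean)^2) incrementally
CREATE FUNCTION welford_step(state numeric[], x numeric)
RETURNS numeric[] AS $$
DECLARE
    n     numeric;
    delta numeric;
    mean  numeric;
BEGIN
    IF x IS NULL THEN
        RETURN state;            -- skip NULLs, like the built-in aggregates
    END IF;
    n     := state[1] + 1;
    delta := x - state[2];
    mean  := state[2] + delta / n;
    RETURN ARRAY[n, mean, state[3] + delta * (x - mean)];
END;
$$ LANGUAGE plpgsql;

CREATE FUNCTION welford_final(state numeric[])
RETURNS numeric AS $$
BEGIN
    IF state[1] < 2 THEN
        RETURN NULL;             -- sample stdev is undefined for n < 2
    END IF;
    RETURN sqrt(state[3] / (state[1] - 1));
END;
$$ LANGUAGE plpgsql;

CREATE AGGREGATE stddev_welford(numeric) (
    SFUNC     = welford_step,
    STYPE     = numeric[],
    FINALFUNC = welford_final,
    INITCOND  = '{0,0,0}'
);

It can then be used like any other aggregate, e.g.
SELECT stddev_welford(data_1) FROM raw_data;
(table and column names being placeholders).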

- John D. Burger
   MITRE



Re: Q: Aggregating Weekly Production Data. How do you do it?

From: "Scott Marlowe"

On 9/17/07, Ow Mun Heng <Ow.Mun.Heng@wdc.com> wrote:

> Just wondering how everyone is doing aggregation of production data.

Where I work, we aggregate by the minute in the application, then dump
those stats entries into the database en masse from each machine.  So,
we wind up with rows something like this: (simplified)

id | parentid | host    | timestamp           | request   | total | success | fail1 | fail2 | fail3 | totalresptime | maxresptime
 1 | NULL     | jboss01 | 2007-09-17 12:02:03 | authcc    |   120 |     112 |     4 |     2 |     2 |        480000 |       12039
 2 | 1        | jboss01 | 2007-09-17 12:02:03 | ccconnect |   120 |     118 |     0 |     0 |     2 |        423900 |       10394

where the id comes from a sequence, and parent ID ties our children
stats to their parents.  I.e. in this example authcc called ccconnect,
and so on.  The fail1/2/3 are types of failures, the totalresptime
time is all the successful requests added together, so that
totalresptime/success = average for that minute, and max is the
longest running request ending in that minute.

Then we can aggregate those minutes together, monitor individual
machine performance, etc.  Ours is actually more complex than this,
but you get the idea.
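
In DDL terms, the simplified table above might look something like this
(a sketch; the table name and column types are my guesses):

CREATE TABLE minute_stats
(
    id            bigint PRIMARY KEY,  -- from a sequence
    parentid      bigint,              -- ties child stats to their parent
    host          text,
    "timestamp"   timestamp,
    request       text,
    total         integer,
    success       integer,
    fail1         integer,
    fail2         integer,
    fail3         integer,
    totalresptime bigint,              -- successful request times summed
    maxresptime   bigint               -- longest request ending that minute
);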

We have a cron job that checks the statistics every x minutes for high
failure rates and generates an alert email if any of our requests go
over a preset threshold.  This catches problems long before anything
interesting shows up in the logs, most of the time.

> (getting min/max/count isn't much of an issue. Stdev is the main issue I
> believe)
>
> One such instance I've read about is..

Isn't stddev() working for you?

What I do is aggregate the minute time slices by grouping by
date_trunc('xxx', timestamp) and then use that as a subselect to a
query that does the stddev() outside of that.  It works pretty well
for us, and makes it easy to identify trends.
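
As a sketch of that pattern (reusing the hypothetical minute_stats table
from above, and rolling the per-minute slices up to hours):

SELECT date_trunc('hour', minute_slice) AS hour_slice,
       avg(avg_resptime)                AS avg_of_minute_avgs,
       stddev(avg_resptime)             AS stddev_of_minute_avgs
FROM (
    SELECT date_trunc('minute', "timestamp") AS minute_slice,
           sum(totalresptime)::numeric / NULLIF(sum(success), 0) AS avg_resptime
    FROM minute_stats
    GROUP BY date_trunc('minute', "timestamp")
) AS per_minute
GROUP BY date_trunc('hour', minute_slice)
ORDER BY 1;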

One of the things we started doing is to aggregate the last six weeks'
data by day / hour and then compare the last 24 hours' worth of data to
those six weeks' worth, to see trends for each hour of each day.  The
queries are huge and ugly, but they work, and they run in about 4
minutes on a fairly good-sized chunk of data.  We have about 150k to
300k entries a day put into this db.
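
A very stripped-down sketch of that comparison (the real queries are, as
noted, much bigger; the names are again taken from the hypothetical table
above):

SELECT b.dow, b.hr,
       d.avg_resp AS last_24h,
       b.avg_resp AS six_week_baseline
FROM (
    -- per day-of-week / hour-of-day baseline over the last six weeks
    SELECT extract(dow  FROM "timestamp") AS dow,
           extract(hour FROM "timestamp") AS hr,
           sum(totalresptime)::numeric / NULLIF(sum(success), 0) AS avg_resp
    FROM minute_stats
    WHERE "timestamp" >= now() - interval '6 weeks'
    GROUP BY 1, 2
) AS b
JOIN (
    -- the same figures for just the last 24 hours
    SELECT extract(dow  FROM "timestamp") AS dow,
           extract(hour FROM "timestamp") AS hr,
           sum(totalresptime)::numeric / NULLIF(sum(success), 0) AS avg_resp
    FROM minute_stats
    WHERE "timestamp" >= now() - interval '1 day'
    GROUP BY 1, 2
) AS d ON d.dow = b.dow AND d.hr = b.hr
ORDER BY 1, 2;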

Re: Q: Aggregating Weekly Production Data. How do you do it?

From: btober@ct.metrocast.net

Ow Mun Heng wrote:
 > The data runs in the vicinity of a few million rows a week.
 >
 > What methods will effectively provide the min/max/average/count/stdev
 > of the weekly sample, broken down by product/software mix etc., and
 > still be able to answer correctly: what's the average of data_1 over
 > the past 2 months?


So, are you viewing the problem as one of figuring out how to avoid
having to store all this raw data permanently, while still having the
statistical summary values available without recalculating them each
time?


 >
 > (I can't just take the average of the 8 weekly averages.)
 >
 > e.g.:
 > wk   avg data_1
 > w1 - 2
 > ...average of past 2 months = avg(w1..w8), which is statistically wrong.


One approach would be to save the count of values as well as the
average. Then your eight-week average is calculated as a weighted
average, i.e., each week's average is weighted by the corresponding
count of data values:

CREATE TABLE summary_data
(
summary_date timestamp,
num_of_values integer,
avg_of_values numeric
);

Then, for instance, the eight-week average is computed as

SELECT
    SUM(num_of_values * avg_of_values) / SUM(num_of_values) AS eight_week_avg
FROM summary_data
WHERE summary_date BETWEEN now() - interval '8 weeks' AND now();



 >
 > (getting min/max/count isn't much of an issue. Stdev is the main issue I
 > believe)

You probably need to store each of those weekly summary statistic values
too, at least for min and max, i.e.,

CREATE TABLE summary_data
(
summary_date timestamp,
num_of_values integer,
avg_of_values numeric,
min_of_values numeric,
max_of_values numeric
);

Then, for instance, overall basic statistics are computed as

SELECT
    SUM(num_of_values) AS total_count,
    SUM(num_of_values * avg_of_values) / SUM(num_of_values) AS overall_avg,
    MIN(min_of_values) AS overall_min,
    MAX(max_of_values) AS overall_max
FROM summary_data
WHERE summary_date BETWEEN now() - interval '8 weeks' AND now();


Extending this design to include the variance is more complicated.
While you can compute the average for the past eight weeks using a
weighted average of each of the separate eight weeks, I think you
actually need the raw data values for the whole eight weeks to figure
the standard deviation; I don't readily see how you could without them.
A different possibility would be to maintain a running average and
variance (rather than windowing those statistics over the sliding
eight-week period), and then take a weekly snapshot of the running
values to use for trending.

 >
 > One such instance I've read about is..
 >
 > 1. Calculate sum of square of each sale
 > ...
 > 8. Stdev will be the square root of step 7
 >
 > The results are valid (verified with actual data), but I don't
 > understand the logic. All the statistics books I've read give stdev as
 > sqrt(sum(x - ave(x))^2 / (n - 1)). The formula is very different,
 > hence the confusion.

I think you copied the std dev formula incorrectly, but the eight-step
algorithm can be shown to get you to the same point, after a little
analytical manipulation. The advantage of one over the other is that
the eight-step procedure can be used to implement an on-the-fly
calculation of mean and variance, i.e., you can maintain a "running
average" and update the statistics as you collect more data, without
having to maintain the entire detail data set history, as would be
required by an attempt to implement the definition directly.
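
For instance (a sketch, with made-up names): if each week's summary also
stores the count, the sum, and the sum of squares, the same identity used
in the eight-step procedure recovers the standard deviation over any
group of weeks without keeping the raw detail:

CREATE TABLE weekly_summary
(
    week_start date,
    n          bigint,    -- count of raw values that week
    sum_x      numeric,   -- sum of the raw values
    sum_x2     numeric    -- sum of the squared raw values
);

SELECT sqrt(
         (SUM(n) * SUM(sum_x2) - SUM(sum_x) ^ 2)
         / (SUM(n) * (SUM(n) - 1))
       ) AS eight_week_stdev
FROM weekly_summary
WHERE week_start >= current_date - interval '8 weeks';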






Re: Q: Aggregating Weekly Production Data. How do you do it?

From: Ow Mun Heng

On Tue, 2007-09-18 at 09:55 -0500, Scott Marlowe wrote:
> On 9/17/07, Ow Mun Heng <Ow.Mun.Heng@wdc.com> wrote:
>
> > Just wondering how everyone is doing aggregation of production data.
>
> Where I work, we aggregate by the minute in the application, then dump
> those stats entries into the database en masse from each machine.  So,
[snip]
> The fail1/2/3 are types of failures, the totalresptime
> time is all the successful requests added together, so that
> totalresptime/success = average for that minute, and max is the
> longest running request ending in that minute.

So, it's basically a summation rather than using avg() all the way.
That would mean using something like bigint to deal with the large
numbers.
>
> > (getting min/max/count isn't much of an issue. Stdev is the main issue I
> > believe)
> >
> > One such instance I've read about is..
>
> Isn't stddev() working for you?

Stdev is based on a population of data. If I do slices and then want a
stdev across 2 months (8 weeks, e.g.), it would be wrong.

>
> What I do is aggregate the minute time slices by grouping by
> date_trunc('xxx',timestamp) and then use that as a subselect to a
> query that does the stddev() outside of that.  works pretty well for
> us, and makes it easy to identify trends.

Trending analysis is very critical and the only reference I've found on
how to get "stdev" is based on what I posted.


> One of the things we started doing is to aggregate the last six weeks
> data by the day / hour and then comparing the last 24 hours worth of
> data to those six weeks worth to see trends for each hour of each day.
>  The queries are huge and ugly, but they work, and run in about 4
> minutes on a fairly good sized chunk of data.  We have about 150k to
> 300k entries a day put into this db.

I'm not sure how many K entries a day (I've yet to count it), but I'm
getting into trouble with one-to-many relationships, and PG is choosing
to do nested loops etc. (lots of left joins on the same tables).

So, I've had to resort to using SRFs and function scans. A complex
query that took >2 hours (on my laptop with PG; it runs in 20 min on
high-end MSSQL) is reduced to ~60 secs.


Thanks for the response.

BTW, are user queries an issue?  Do they interfere with the loading?
(That's one of my concerns over here.) I noticed that you do the calc
at the app before sending it to the DB, which is a good thing, and
every minute too. (Is it advisable to do it in chunks of 1 min? Less
data per minute vs. per 30 min / 1 hour.)

Re: Q: Aggregating Weekly Production Data. How do you do it?

From: "Scott Marlowe"

On 9/18/07, Ow Mun Heng <Ow.Mun.Heng@wdc.com> wrote:
> On Tue, 2007-09-18 at 09:55 -0500, Scott Marlowe wrote:
> > On 9/17/07, Ow Mun Heng <Ow.Mun.Heng@wdc.com> wrote:
> >
> > > Just wondering how everyone is doing aggregation of production data.
> >
> > Where I work, we aggregate by the minute in the application, then dump
> > those stats entries into the database en masse from each machine.  So,
> [snip]
> > The fail1/2/3 are types of failures, the totalresptime
> > time is all the successful requests added together, so that
> > totalresptime/success = average for that minute, and max is the
> > longest running request ending in that minute.
>
> So, it's basically a summation rather than using avg() all the way.
> That would mean using something like bigint to deal with the large
> numbers.

Yeah.  We use numeric(18) (bonus points if anyone knows what other
database I'm pumping data out of to get a numeric(18) as a default.)

The advantage is that I don't have to re-weight my measurements.  When
the system was first designed, it stored avg response and number of
responses, which I had to then multiply to get the original total time
and process to get a weighted average.  This method removes the
averaging step in the client software AND the multiplying step to get
the raw numbers back on my end, win-win.

> > > (getting min/max/count isn't much of an issue. Stdev is the main issue I
> > > believe)
> > >
> > > One such instance I've read about is..
> >
> > Isn't stddev() working for you?
>
> Stdev is based on a population of data. If I do slices and then want a
> stdev across 2 months (8 weeks, e.g.), it would be wrong.

But if you ask for the data set grouped by
date_trunc('minute', timestamp) in an inner select, then in the outer
select you can do a stddev() on that set and get it.  Or is that not
accurate?

> > What I do is aggregate the minute time slices by grouping by
> > date_trunc('xxx',timestamp) and then use that as a subselect to a
> > query that does the stddev() outside of that.  works pretty well for
> > us, and makes it easy to identify trends.
>
> Trending analysis is very critical and the only reference I've found on
> how to get "stdev" is based on what I posted.

Well, what we're doing seems to be getting us what look like proper numbers.

> > One of the things we started doing is to aggregate the last six weeks
> > data by the day / hour and then comparing the last 24 hours worth of
> > data to those six weeks worth to see trends for each hour of each day.
> >  The queries are huge and ugly, but they work, and run in about 4
> > minutes on a fairly good sized chunk of data.  We have about 150k to
> > 300k entries a day put into this db.
>
> I'm not sure how many K entries a day (I've yet to count it), but I'm
> getting into trouble with one-to-many relationships, and PG is
> choosing to do nested loops etc. (lots of left joins on the same
> tables).

Yeah, for this kind of thing, you want to pre-process your tables into
one big flat table if you can.  We do have one-to-many relationships
within the same table (parent / child system requests are one to many),
and when we join the table to itself several times it can get hairy.
We had some nested loop problems until I upped the stats target on a
few of the columns and scheduled regular analyzes for the main table.
Now the system picks the right plan 99.9% or more of the time.  And
when it doesn't, it's usually something fresh and new that's the issue.
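
For reference, raising the statistics target and re-analyzing looks
something like this (the table / column names and the target value of
500 are placeholders):

-- raise the planner's statistics target on the columns that were getting
-- poor row estimates, then gather fresh statistics
ALTER TABLE minute_stats ALTER COLUMN request SET STATISTICS 500;
ALTER TABLE minute_stats ALTER COLUMN parentid SET STATISTICS 500;
ANALYZE minute_stats;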

> So, I've had to resort to using SRFs and function scans. A complex
> query that took >2 hours (on my laptop with PG; it runs in 20 min on
> high-end MSSQL) is reduced to ~60 secs.

That's an excellent improvement!

> Thanks for the response.
>
> BTW, are user queries an issue?  Do they interfere with the loading?

Not usually. I've got them trained pretty well, and the system has
lots more I/O bandwidth than CPU horsepower, so generally a long
running query just waits on I/O.  RAID-10 has made this system much
faster than the old single drive workstation (mine) it was running on.

The data loading on this system runs once a minute and it pumps the
data out of the production db, and puts it into the reporting db.
Average run time to insert those rows is < 1 second.

> (That's one of my concerns over here.) I noticed that you do the calc
> at the app before sending it to the DB, which is a good thing, and
> every minute too. (Is it advisable to do it in chunks of 1 min? Less
> data per minute vs. per 30 min / 1 hour.)

Exactly.  Each minute is easy to keep track of, and if the app goes
crazy for a bit and loses the data / gets restarted, you only lose a
minute or so's worth of data.

Also, we use this system for real-time alerting and performance
monitoring.  If some request or other starts taking too long or the
failure rate shoots up, it generates alerts.  If the data were only
collected every 30 minutes or an hour, that would be far too late.