Postgresql Historical data store - Mailing list pgsql-novice
| From | ben.carlyle@invensys.com |
|---|---|
| Subject | Postgresql Historical data store |
| Date | |
| Msg-id | OFBFC2E035.E89BBA31-ON4A256BD5.007BD758@westsig.com.au |
| Responses | Re: Postgresql Historical data store |
| List | pgsql-novice |
G'day,

As part of my work-related software development I've been asked to develop a system, related to my company's product, that is capable of storing time-ordered values for a set of inputs. The data is in the form of an IEEE-style double and must be timestamped with at least millisecond accuracy over a period of up to a year (or perhaps longer for other projects). I can expect around 20 million changes per day across all the inputs, although many inputs will not change very much at all. I also need redundancy. :)

I'm on top of most aspects of this piece of software; in fact I've written a prototype to model the kind of data distribution I want, from a software-guy point of view, using a very simple but elegant SQL server called sqlite[1]. I'm now rethinking the entire design from the standpoint of implementing it in a PostgreSQL database. Since I'm not a database guru I thought I'd call upon the combined wisdom of the HUMBUG oracle on two main points, which I'll discuss in some detail: 1) sample-based queries and 2) PostgreSQL built-in redundancy vs an external redundancy solution.

Sample-based queries:

To minimise space and to provide an accurate representation of the data I'm storing, I've so far opted to store an initial value and time for each input at startup of the history system, then store transitions with transition timestamps for each of the values. This is a reasonable approach, but it makes a mockery of the basic SQL aggregate functions. It's meaningless to calculate an average or standard deviation on a dataset where each entry can have any period of time associated with it. In fact, the appropriate way to calculate the average is something more like:

    Avg = sum(entry value * entry period) / total time

I also need to be able to extract fixed-period samples or instantaneous values for particular instants in time. There are more complications, including a requirement to know when the data was "good" and being collected correctly versus when it was down due to a failure of the source or the history or some other factor... but I think I can deal with that myself ;)

So, based on data of the form[2]:

    Time    Value
    1.01    1.005
    1.02    1.006
    1.09    2.504
    2.34    2.459

I should be able to request a sample every 0.02 seconds from 1.01 and get a result of:

    Time    Value
    1.01    1.005
    1.03    1.006
    1.05    1.006
    1.07    1.006
    1.09    2.504
    1.11    2.504
    ... snip ...
    2.35    2.459

and be able to do averages and other transforms on that result set, and also be able to do the true average calculation, which needs the time difference between each entry and its next entry:

    Avg = ((1.02-1.01)*1.005 + (1.09-1.02)*1.006 + (2.34-1.09)*2.504) / (2.34 - 1.01)

So... coming to the actual question: Are these kinds of transforms possible using simple SQL? If not, are they possible using PL/pgSQL, or by coding some C into a shared library and introducing that as a function? Do temporary tables have a part to play in this, and how does one ensure that the temporary tables "live" only as long as the query takes to process? Although I've read most of the documentation relating to these functions on the PostgreSQL web site, it all seems to refer to functions that return simple typed values rather than performing any kind of transformation on the data set. Ideally I'd like any client application to be able to do a select on a data set and trigger the appropriate functions internal to the server, rather than bolting my own post-processing application on top of the system.
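To make the intent concrete, here's a rough sketch of the sort of query I'm imagining, against a hypothetical transitions table (all names here are invented for illustration, not a real schema). The idea is to pair each transition with the timestamp of the following transition and weight each value by how long it was in force:

    -- Hypothetical schema: one row per transition of one input.
    CREATE TABLE history (
        input_id  integer,
        ts        timestamp,          -- transition time (absolute UTC)
        value     double precision    -- value in force from ts onwards
    );

    -- Time-weighted average for one input over its recorded span.
    -- Each transition is weighted by the interval until the next
    -- transition; the final (open-ended) transition is left out.
    SELECT sum(v.value * extract(epoch FROM v.next_ts - v.ts))
         / extract(epoch FROM max(v.next_ts) - min(v.ts)) AS tw_avg
      FROM (SELECT h.ts, h.value,
                   (SELECT min(h2.ts)
                      FROM history h2
                     WHERE h2.input_id = h.input_id
                       AND h2.ts > h.ts) AS next_ts
              FROM history h
             WHERE h.input_id = 42) AS v
     WHERE v.next_ts IS NOT NULL;

At 20 million changes a day the "next transition" subquery presumably wants an index on (input_id, ts); I'm only sketching the shape of the query, not claiming anything about performance.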
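For the fixed-period samples, something like the following is the shape I'm after, assuming a generate_series() that can step a timestamp by an interval (newer PostgreSQL releases have this as a built-in set-returning function; on older ones the series could come from a small PL/pgSQL function). The start and stop timestamps are invented for the example:

    -- One output row per sample instant; the value at each instant is
    -- the most recent transition at or before that instant.
    SELECT s.sample_ts,
           (SELECT h.value
              FROM history h
             WHERE h.input_id = 42
               AND h.ts <= s.sample_ts
             ORDER BY h.ts DESC
             LIMIT 1) AS value
      FROM generate_series(timestamp '2002-06-28 00:00:01.010',
                           timestamp '2002-06-28 00:00:02.350',
                           interval '0.02 seconds') AS s(sample_ts);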
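On the question of functions that return whole result sets rather than single values: my (possibly mistaken) understanding is that a set-returning function is the way to wrap this up so that a client can simply SELECT from it. A very rough PL/pgSQL sketch, again with invented names, and assuming the history table above:

    -- Composite type describing one output row of the resampler.
    CREATE TYPE sample_row AS (sample_ts timestamp, value double precision);

    -- Set-returning PL/pgSQL sketch: step from p_start to p_stop and emit
    -- the value in force at each step.  Intended to be callable as
    --   SELECT * FROM resample(42, ..., ..., interval '0.02 seconds');
    CREATE OR REPLACE FUNCTION resample(p_input integer,
                                        p_start timestamp,
                                        p_stop  timestamp,
                                        p_step  interval)
    RETURNS SETOF sample_row AS '
    DECLARE
        r sample_row;
        t timestamp;
    BEGIN
        t := p_start;
        WHILE t <= p_stop LOOP
            r.sample_ts := t;
            SELECT h.value INTO r.value
              FROM history h
             WHERE h.input_id = p_input
               AND h.ts <= t
             ORDER BY h.ts DESC
             LIMIT 1;
            RETURN NEXT r;  -- value is NULL if no transition precedes t
            t := t + p_step;
        END LOOP;
        RETURN;
    END;
    ' LANGUAGE plpgsql;

Whether that is sane at my data volumes, or whether the C shared-library route is the better one, is exactly the sort of thing I'm hoping someone can tell me.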
... anyway. On to question 2:

PostgreSQL built-in redundancy vs an external redundancy solution:

The requirements for redundancy are fairly simple:

1) Get data to disk as soon as possible
2) Get data to redundant copies as soon as possible
3) Redundant copies are completely redundant databases
4) Islanding is possible at any time for any duration
5) Failure is possible at any time for any duration
6) The data of communicating peers must become consistent as soon as possible after restoration from failure or islanding
7) Redundancy is n-way, based on configuration parameters
8) Data can arrive in any order with any timestamp

In my prototypical model I've done all the redundancy in software. I have a good infrastructure at my disposal for distributed assignment of control and also for communication. I've based the redundancy on a stream concept where the controlling history is the source of data at any time and writes changes directly into a new file associated with the stream and labelled with the host identifier and start time of the stream. A subscription mechanism is provided so that all non-controller entities can subscribe to the controller's version of the streams list and to any stream that they do not have a complete copy of.

In the case of host failure, control may transition and a new stream may be started. The old stream will be left incomplete in all servers until the failed host restores and the controller is able to determine whether the dead host has information in its database that had not made it into the other server's store, and vice versa. For this reason the controller keeps an eye on all non-controllers, to grab any information that they may have and it does not, so that it can both store it locally and distribute it out to other non-controllers.

In the case of host islanding, multiple controllers may be assigned and therefore multiple simultaneous streams can be started. When the islanding is recovered from, all streams are synchronised so that both sets of data are available to all databases. This also has SQL implications, because for overlapping data from different streams we really want to perform an intelligent merge (a rough SQL sketch is in the PS at the end of this mail):

    Bad data mixed with bad data gives us bad data
    Good data mixed with bad data gives us good data
    Good data mixed with conflicting good data gives us bad data
    Good data mixed with non-conflicting good data gives us good data

All of this work in the prototype basically happens on the query end of the system. If a host comes up and its database has been completely destroyed, it will automatically download the list of streams and start synchronising them, thereby ensuring consistency as soon as possible.

So the question is... Can I achieve similar reliability using PostgreSQL, or am I better off adapting my existing system to interface to PostgreSQL?

[1] sqlite is public domain software, and quite nice for small tasks :)
[2] Time is actually an absolute UTC time extracted from a timeval, and probably stored in the database as a timestamp type
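PS: to show what I mean by the merge rules above, here is a rough sketch in plain SQL. It assumes (purely for illustration) that overlapping data from different streams lands in a single hypothetical table stream_data(stream_id, input_id, ts, value, good), and that "conflicting" means two good rows recording different values for the same input at exactly the same timestamp:

    -- Per (input_id, ts): only bad data -> bad; agreeing good data -> good;
    -- conflicting good data -> bad.  Exact value equality counts as agreement.
    SELECT input_id,
           ts,
           CASE
               WHEN max(CASE WHEN good THEN 1 ELSE 0 END) = 0
                   THEN false   -- nothing but bad data
               WHEN count(DISTINCT CASE WHEN good THEN value END) > 1
                   THEN false   -- conflicting good data
               ELSE true        -- agreeing good data
           END AS merged_good,
           max(CASE WHEN good THEN value END) AS merged_value
      FROM stream_data
     GROUP BY input_id, ts;

That only covers the value-level merge, of course; the stream bookkeeping and resynchronisation around it is the part I'm really asking about.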