Ranges for well-ordered types - Mailing list pgsql-hackers
From | Michael Glaesemann |
---|---|
Subject | Ranges for well-ordered types |
Date | |
Msg-id | 10AC5536-280E-463F-A335-71FEC1FBB774@seespotcode.net Whole thread Raw |
Responses |
Re: Ranges for well-ordered types
Re: Ranges for well-ordered types Re: Ranges for well-ordered types Re: Ranges for well-ordered types |
List | pgsql-hackers |
I've been interested in representing and manipulating time ranges in PostgreSQL, where a time range is defined by a start time and an end time. Time ranges are useful, for example, in representing when some predicate was known valid. Similarly, time ranges can be used to represent "transaction time": the version history of the data itself. This is similar to the "time travel" feature in previous versions of PostgreSQL. (Tables that include both valid time and transaction time information are sometimes called bitemporal tables.) Ranges can also be useful in scheduling applications. The original use case that prompted me to explore this area was tracking what time periods teachers were assigned to schools (a valid- time table). Here's one way to represent this in PostgreSQL: create table teachers__schools_1 ( teacher text not null , school text not null , from_date date not null , to_date date not null , check (from_date<= to_date) , unique (teacher, school, from_date) , unique (teacher, school, to_date) ); Each row of this table represents the time range (from from_date to to_date) during which a teacher was assigned to a particular school. (Teachers can be assigned to more than one school at a time.) The check constraint guarantees that [from_date, to_date] represents a valid closed-closed interval (where the end points are included in the range). Two unique constraints are necessary to guarantee uniqueness for each row. In my use case, it would also be desirable to apply constraints to prevent overlapping and ensure continuity-- that teachers are always at some school. This can be done using constraint triggers, though making sure all of the comparisons for four dates (beginning and end points for two ranges) are done properly can be a little daunting and prone to bugs. The problem of representing time ranges in a relational database has been worked on for quite some time. Indeed, the PostgreSQL source still includes a tinterval type, where the beginning and end points are of type abstime, though this type is no longer included in the documentation. Drafts of SQL3 included the PERIOD constructor to represent date, time, and timestamp ranges as part of SQL/Temporal, but this section of the specification was not included in the final SQL:2003 standard. At least two relatively prominent books [1][2] as well as numerous papers have been written on the subject. An aside regarding terminology: In the literature I've read, time ranges are most often referred to as intervals. However, SQL already uses INTERVAL to refer to another distinct type. As mentioned above, SQL3 used the term PERIOD for a constructor to create types with a beginning point and an end point. RANGE is a reserved keyword in SQL: 2003 (related, I believe, to windowed tables). Range also has a distinct meaning in relational calculus. I'm at a bit of a loss as to how to refer to these structures with a beginning and an end point with a term that isn't already reserved in SQL or may be in the future. Suggestions welcome :) Span? Reuse tinterval? For the time being, I'll arbitrarily continue to use range. In the first six months of 2006 alone, I've noted quite a few threads related to time ranges in the various PostgreSQL mailing lists, so it seems this issue is ripe for a solution. Just two weeks ago, Albert Hertroys started work on a vector type[3] , where he rightly notes that time ranges are just an application of a general range (or vector, to use his term) that could just as easily be used with "integers, reals, points, dates, etc". For the past couple of months, I've been working with composite types and PL/pgSQL to define and manipulate date ranges and integer ranges, and see what can be abstracted out to a general range type. In the general case, a particular range value can be represented as two point values. For example, the date range [2006-01-01, 2006-05-31] (using the closed-closed interval representation) is represented by the two date "point" values 2006-01-01 and 2006-05-31. The interval range [3,9] is represented by the two integer "point" values 3 and 9. A range can be formed for any point type, where a point type is any type that's well-ordered:* the range of values is bounded (the number of values in the type is finite)* comparisons are well-defined for any two values, and* for any point p, the next point can be found using a successor function Given a well-ordered "point" type, common questions of ranges can be be answered, such as, for two ranges r1 and r2* Do r1 and r2 overlap?* Does r1 meet r2?* Is the union of r1 and r2 defined,and if so, what is it?* Is the intersection of r1 and r2 defined, and if so, what is it? The way I've thought of implementing ranges in PostgreSQL is through an additional system catalog table (called pg_ordered_type for convenience) that would store the necessary information for each point type:* ordtype: the point type (referencing pg_type.oid)* ordcmpfunc: the comparison function (referencing pg_proc.oid)*ordsucc: the successor function (referencing pg_proc.oid)* ordmin, ordmax the minimum and maximum values forthe point type (not sure how this would be represented) From one point of view, the successor function and minimum and maximum values (and perhaps the comparison function) are intrinsic attributes of the type itself. However, I can see potential usefulness using the same "base" type with different successor functions or different maximum and minimum values. For example, to represent a point type of even numbers, the base type could be integer, and the successor function would "p + 2". (This raises the question of how to restrict the values of the type to even numbers: perhaps using domains would be a solution.) Another example would be using timestamps of different precision: timestamp(6) has a different successor function than timestamp(0). It may be desirable to include in pg_ordered_type a "predecessor" function (returning the point prior to a point p) and a duration function (providing a more efficient way to return the number of points included in the interval than iterating over the successor function). With comparison and the successor function, one can define the 13 predicates (often called Allen operators) describing the relationship between any two range values: equals, before, meets, overlaps, starts, during, finishes, and inverses of the last six. Note that two of the built-in numeric types--real and double- precision--are not well-ordered, and therefore ranges for real and double-precision would not be defined. The primary reason for this is that without a successor function, it's not possible to determine whether or not two range values meet (determined by the meets Allen operator). And without a definition for meets, it's not possible to define continuity--to use the above example, that a teacher must always be assigned to some school. In that respect, the implementation I'm proposing here differs from the vector type proposed by Albert Hertroys. Perhaps there's a way to have a "looser" form of ranges that do not include a successor function, and for those looser ranges, continuity constraints couldn't be defined. However, the issues arising from the inexactness of real and double precision calls into question the usefulness of functions that rely on testing for equality (e.g., equals, meets, starts, finishes, and inverses of the last three). Another issue for a general range implementation is uniqueness. Currently btree indexes are used to enforce uniqueness in PostgreSQL. By somewhat arbitrarily defining a comparison result for the 13 possible predicates, I've been able to define an opclass for ranges so btree indexes can be defined. Note that in defining an opclass, comparision is already required, so perhaps the comparison function could be stored someplace other than the pg_ordered_type table. As for other indexes, I've looked at some of the GiST material and think that it probably can be usefully applied to ranges: for example the ip4r type[4], which defines ranges of IPv4 addresses, uses GiST for indexing. However, GiST is currently well over my head, so this is speculation on my part. Returning to my original example, with a "date_range" type, the table could be defined as: create table teachers__schools_2 ( teacher text not null , school text not null , period date_range not null , primary key (teacher, school, period) ); The original check constraint is handled by the date_range type and the two unique constraints are replaced by a single primary key constraint. Constraints for overlapping and continuity are still handled using constraint triggers, but are easier to implement using functions available to compare ranges rather than handling beginning and end points individually. I believe ranges in PostgreSQL would be useful and satisfy a need expressed in numerous mailing list threads. I hope to do my part in their implementation. I look forward to your feedback. Michael Glaesemann grzm seespotcode net [1] Richard T. Snodgrass, Developing Time-Oriented Database Applications in SQL, Morgan Kaufmann, 2000. ISBN 1-55860-436-7. Out of print, but available from the author as a PDF. (http://www.cs.arizona.edu/people/rts/tdbbook.pdf) [2] Chris Date, Hugh Darwen, Nikolas Lorentzos, Temporal Data and the Relational Model, Morgan Kaufmann, 2002. ISBN 1-55860-855-9 (http:// www.amazon.com/gp/product/1558608559/) [3] Vector type (Re: [GENERAL] challenging constraint situation - how do I make it) (http://archives.postgresql.org/pgsql-general/2006-05/ msg01279.php) [4] ip4r (http://pgfoundry.org/projects/ip4r/)
pgsql-hackers by date: