[PROPOSAL] Temporal query processing with range types - Mailing list pgsql-hackers
From | Anton Dignös |
---|---|
Subject | [PROPOSAL] Temporal query processing with range types |
Date | |
Msg-id | CALNdv1h7TUP24Nro53KecvWB2kwA67p+PByDuP6_1GeESTFgSA@mail.gmail.com Whole thread Raw |
Responses |
Re: [PROPOSAL] Temporal query processing with range types
Re: [PROPOSAL] Temporal query processing with range types Re: [HACKERS] [PROPOSAL] Temporal query processing with range types |
List | pgsql-hackers |
Hi hackers, we are a group of researches that work on temporal databases. Our main focus is the processing of data with time intervals, such as the range types in PostgreSQL. For this type of processing the range types that are stored with the tuples are considered to represent the tuples' valid time, such as for instance the time period an employee works for a project. Intuitively, the processing of temporal operators corresponds to the use of "at each time point" in natural language. For example, a temporal count on such a relation (temporal relation) shall determine the count at each time point, while an anti-join shall compute an anti-join at each time point. We would like to contribute to PostgreSQL a solution that supports the query processing of "at each time point". The basic idea is to offer two new operators, NORMALIZE and ALIGN, whose purpose is to adjust (or split) the ranges of tuples so that subsequent queries can use the usual grouping and equality conditions to get the intended results. Query processing works as follows: <temporal operator> := <NORMALIZE | ALIGN> + <traditional operator> Attached is a proposal patch file that implements this extension and on which the example queries below can be run. The support includes distinct, aggregation, set operations, Cartesian product, Join, Left Join, Right Join, Full Join, and Anti-join. In our prototype the syntax for the two new operators is: ... FROM (table_ref NORMALIZE table_ref USING(att_list) WITH (columnref, columnref) ) alias_clause ... and ... FROM (table_ref ALIGN table_ref ON a_expr WITH (columnref, columnref) ) alias_clause ... where NORMALIZE is used for distinct, aggregation and set operations and ALIGN is used for Cartesian product, Join, Left Join, Right Join, Full Join, and Anti-join. ------------------------------------------------------------------------------- EXAMPLE FOR AGGREGATION ------------------------------------------------------------------------------- Relation 'projects' records when employees are assigned to projects. CREATE TABLE projects (empl VARCHAR, proj VARCHAR, pay FLOAT, t DATERANGE); INSERT INTO projects VALUES ('Ann', 'P1', 60, DATERANGE('2016-01-01', '2016-09-01')), ('Sam', 'P1', 40, DATERANGE('2016-06-01', '2016-12-01')), ('Joe', 'P2', 80, DATERANGE('2016-03-01', '2016-06-01')); TABLE projects; empl | proj | pay | t ------+------+-----+------------------------- Ann | P1 | 60 | [2016-01-01,2016-09-01) Sam | P1 | 40 | [2016-06-01,2016-12-01) Joe | P2 | 80 | [2016-03-01,2016-06-01) (3 rows) QUERY1: At each time point, what is the number of employees assigned to projects? First, we use NORMALIZE to adjust the ranges so that they can be aggregated as usual by using grouping on the adjusted timestamp t: SELECT * FROM (projects p1 NORMALIZE projects p2 USING() WITH(t,t)) p_adjusted; empl | proj | pay | t ------+------+-----+------------------------- Ann | P1 | 60 | [2016-01-01,2016-03-01) Ann | P1 | 60 | [2016-03-01,2016-06-01) Ann | P1 | 60 | [2016-06-01,2016-09-01) Sam | P1 | 40 | [2016-06-01,2016-09-01) Sam | P1 | 40 | [2016-09-01,2016-12-01) Joe | P2 | 80 | [2016-03-01,2016-06-01) (6 rows) Observe that the time ranges have been adjusted, and it is now trivial to compute the count of employees at each time point by grouping on t: SELECT COUNT(*), t FROM (projects p1 NORMALIZE projects p2 USING() WITH(t,t)) p_adjusted GROUP BY t; count | t -------+------------------------- 1 | [2016-01-01,2016-03-01) 2 | [2016-03-01,2016-06-01) 2 | [2016-06-01,2016-09-01) 1 | [2016-09-01,2016-12-01) (4 rows) QUERY2: At each time point, what is the number of employees assigned to EACH project? SELECT * FROM (projects p1 NORMALIZE projects p2 USING(proj) WITH(t,t)) p_adjusted; empl | proj | pay | t ------+------+-----+------------------------- Ann | P1 | 60 | [2016-01-01,2016-06-01) Ann | P1 | 60 | [2016-06-01,2016-09-01) Sam | P1 | 40 | [2016-06-01,2016-09-01) Sam | P1 | 40 | [2016-09-01,2016-12-01) Joe | P2 | 80 | [2016-03-01,2016-06-01) (5 rows) and SELECT COUNT(*), proj, t FROM (projects p1 NORMALIZE projects p2 USING(proj) WITH(t,t)) p_adjusted GROUP BY proj, t; count | proj | t -------+------+------------------------- 1 | P2 | [2016-03-01,2016-06-01) 1 | P1 | [2016-01-01,2016-06-01) 2 | P1 | [2016-06-01,2016-09-01) 1 | P1 | [2016-09-01,2016-12-01) (4 rows) ------------------------------------------------------------------------------- EXAMPLE FOR ANTI-JOIN ------------------------------------------------------------------------------- QUERY3: At each time point, deterime the employees with the highest salary. Without time this query is answered with a NOT EXISTS subquery. With time periods everything remains, but the time ranges must be adjusted first: SELECT * FROM (projects p1 ALIGN projects p2 ON p1.pay < p2.pay WITH (t,t)) p1_adjusted WHERE NOT EXISTS ( SELECT * FROM (projects p2 ALIGN projects p1 ON p1.pay < p2.pay WITH (t,t)) p2_adjusted WHERE p1_adjusted.pay < p2_adjusted.pay AND p1_adjusted.t = p2_adjusted.t ); empl | proj | pay | t ------+------+-----+------------------------- Ann | P1 | 60 | [2016-01-01,2016-03-01) Ann | P1 | 60 | [2016-06-01,2016-09-01) Sam | P1 | 40 | [2016-09-01,2016-12-01) Joe | P2 | 80 | [2016-03-01,2016-06-01) (4 rows) ------------------------------------------------------------------------------- EXAMPLE FOR JOINS ------------------------------------------------------------------------------- CREATE TABLE managers (mgr VARCHAR, proj VARCHAR, t DATERANGE); INSERT INTO managers VALUES ('Tom', 'P1', DATERANGE('2016-09-01', '2016-12-01')), ('Frank', 'P2', DATERANGE('2016-01-01', '2016-12-01')); TABLE managers; mgr | proj | t -------+------+------------------------- Tom | P1 | [2016-09-01,2016-12-01) Frank | P2 | [2016-01-01,2016-12-01) (2 rows) QUERY4: At each time point, determine employees and their managers (including the periods with no employee or no manager). WITH p_ext AS (SELECT *, t u FROM projects), m_ext AS (SELECT *, t v FROM managers) SELECT proj, empl, mgr, t FROM (p_ext ALIGN m_ext ON m_ext.proj = p_ext.proj WITH (t,t)) p_adjusted FULL OUTER JOIN (m_ext ALIGN p_ext ON m_ext.proj = p_ext.proj WITH (t,t)) m_adjusted USING (proj, t) WHERE t = u * v OR v IS NULL OR u IS NULL; proj | empl | mgr | t ------+------+-------+------------------------- P1 | Ann | | [2016-01-01,2016-09-01) P1 | Sam | | [2016-06-01,2016-09-01) P1 | Sam | Tom | [2016-09-01,2016-12-01) P2 | | Frank | [2016-01-01,2016-03-01) P2 | Joe | Frank | [2016-03-01,2016-06-01) P2 | | Frank | [2016-06-01,2016-12-01) (6 rows) ------------------------------------------------------------------------------- IMPLEMENTATION ------------------------------------------------------------------------------- Our implementation approach is to use two operators that take care of adjusting the ranges in such a way that afterwards the corresponding nontemporal operator can be used to compute the result: input --> {NORMALIZE/ALIGN} --> nontemporal operator --> result. The two operators (NORMALIZE and ALIGN) in this prototype are rewritten into subqueries, which then serve as input for a new executor function ExecTemporalAdjustment. The executor function takes care of the splitting of the range types. For NORMALIZE the tuples' ranges need to be split into all sub-ranges according to all matching ranges of the second relation. For this we create a subquery that first joins one relation with the range boundaries of the other and then sorts the result. The executor function splits the ranges in a sweep-line based manner. For ALIGN the tuples' ranges must be split into all intersections and differences with the other relation according to the join condition. For this we create a subquery that first joins the two relations and then sorts the result. The executor function splits the ranges accordingly in a sweep-line based manner. After the splitting it is possible to only use equality (GROUP BY, DISTINCT, JOIN CONDITION, ...) on the range types to get the intended result. The subquery is 'visible' with the EXPLAIN command. Range types need to be of half-open, i.e., [lower, upper). Unbound ranges (Infinity), NaN, and NULL values (of ranges and range bounds) are not yet supported. ------------------------------------------------------------------------------- We are grateful for any feedback. Best regards, Anton, Johann, Michael, Peter
Attachment
pgsql-hackers by date: