Home > mailing lists
[PROPOSAL] Temporal query processing with range types - Mailing list pgsql-hackers

From	Anton Dignös
Subject	[PROPOSAL] Temporal query processing with range types
Date	July 22, 2016 16:37:39
Msg-id	CALNdv1h7TUP24Nro53KecvWB2kwA67p+PByDuP6_1GeESTFgSA@mail.gmail.com Whole thread Raw
Responses	Re: [PROPOSAL] Temporal query processing with range types Re: [PROPOSAL] Temporal query processing with range types Re: [HACKERS] [PROPOSAL] Temporal query processing with range types
List	pgsql-hackers
Tree view
Hi hackers,

we are a group of researches that work on temporal databases. Our main focus is
the processing of data with time intervals, such as the range types in
PostgreSQL.

For this type of processing the range types that are stored with the tuples are
considered to represent the tuples' valid time, such as for instance the time
period an employee works for a project. Intuitively, the processing of temporal
operators corresponds to the use of "at each time point" in natural language.
For example, a temporal count on such a relation (temporal relation) shall
determine the count at each time point, while an anti-join shall compute an
anti-join at each time point.

We would like to contribute to PostgreSQL a solution that supports the query
processing of "at each time point". The basic idea is to offer two new
operators, NORMALIZE and ALIGN, whose purpose is to adjust (or split) the
ranges of tuples so that subsequent queries can use the usual grouping and
equality conditions to get the intended results.

Query processing works as follows:
<temporal operator> := <NORMALIZE | ALIGN> + <traditional operator>

Attached is a proposal patch file that implements this extension and on which
the example queries below can be run. The support includes distinct,
aggregation, set operations, Cartesian product, Join, Left Join, Right Join,
Full Join, and Anti-join.

In our prototype the syntax for the two new operators is:

... FROM (table_ref NORMALIZE table_ref
          USING(att_list)
          WITH (columnref, columnref)
         ) alias_clause ...

and

... FROM (table_ref ALIGN table_ref
          ON a_expr
          WITH (columnref, columnref)
         ) alias_clause ...

where NORMALIZE is used for distinct, aggregation and set operations and ALIGN
is used for Cartesian product, Join, Left Join, Right Join, Full Join, and
Anti-join.

-------------------------------------------------------------------------------
EXAMPLE FOR AGGREGATION
-------------------------------------------------------------------------------

Relation 'projects' records when employees are assigned to projects.

CREATE TABLE projects (empl VARCHAR, proj VARCHAR, pay FLOAT, t DATERANGE);
INSERT INTO projects VALUES
('Ann', 'P1', 60, DATERANGE('2016-01-01', '2016-09-01')),
('Sam', 'P1', 40, DATERANGE('2016-06-01', '2016-12-01')),
('Joe', 'P2', 80, DATERANGE('2016-03-01', '2016-06-01'));

TABLE projects;

 empl | proj | pay |            t
------+------+-----+-------------------------
 Ann  | P1   |  60 | [2016-01-01,2016-09-01)
 Sam  | P1   |  40 | [2016-06-01,2016-12-01)
 Joe  | P2   |  80 | [2016-03-01,2016-06-01)
(3 rows)

QUERY1: At each time point, what is the number of employees assigned
to projects?

First, we use NORMALIZE to adjust the ranges so that they can be
aggregated as usual by using grouping on the adjusted timestamp t:

SELECT *
FROM (projects p1 NORMALIZE projects p2 USING() WITH(t,t)) p_adjusted;

 empl | proj | pay |            t
------+------+-----+-------------------------
 Ann  | P1   |  60 | [2016-01-01,2016-03-01)
 Ann  | P1   |  60 | [2016-03-01,2016-06-01)
 Ann  | P1   |  60 | [2016-06-01,2016-09-01)
 Sam  | P1   |  40 | [2016-06-01,2016-09-01)
 Sam  | P1   |  40 | [2016-09-01,2016-12-01)
 Joe  | P2   |  80 | [2016-03-01,2016-06-01)
(6 rows)

Observe that the time ranges have been adjusted, and it is now trivial
to compute the count of employees at each time point by grouping on t:

SELECT COUNT(*), t
FROM (projects p1 NORMALIZE projects p2 USING() WITH(t,t)) p_adjusted
GROUP BY t;

 count |            t
-------+-------------------------
     1 | [2016-01-01,2016-03-01)
     2 | [2016-03-01,2016-06-01)
     2 | [2016-06-01,2016-09-01)
     1 | [2016-09-01,2016-12-01)
(4 rows)

QUERY2: At each time point, what is the number of employees assigned
to EACH project?

SELECT *
FROM (projects p1 NORMALIZE projects p2 USING(proj) WITH(t,t)) p_adjusted;

 empl | proj | pay |            t
------+------+-----+-------------------------
 Ann  | P1   |  60 | [2016-01-01,2016-06-01)
 Ann  | P1   |  60 | [2016-06-01,2016-09-01)
 Sam  | P1   |  40 | [2016-06-01,2016-09-01)
 Sam  | P1   |  40 | [2016-09-01,2016-12-01)
 Joe  | P2   |  80 | [2016-03-01,2016-06-01)
(5 rows)

and

SELECT COUNT(*), proj, t
FROM (projects p1 NORMALIZE projects p2 USING(proj) WITH(t,t)) p_adjusted
GROUP BY proj, t;

 count | proj |            t
-------+------+-------------------------
     1 | P2   | [2016-03-01,2016-06-01)
     1 | P1   | [2016-01-01,2016-06-01)
     2 | P1   | [2016-06-01,2016-09-01)
     1 | P1   | [2016-09-01,2016-12-01)
(4 rows)

-------------------------------------------------------------------------------
EXAMPLE FOR ANTI-JOIN
-------------------------------------------------------------------------------

QUERY3: At each time point, deterime the employees with the highest
salary.

Without time this query is answered with a NOT EXISTS subquery. With
time periods everything remains, but the time ranges must be adjusted
first:

SELECT *
FROM (projects p1 ALIGN projects p2 ON p1.pay < p2.pay WITH (t,t)) p1_adjusted
WHERE NOT EXISTS (
 SELECT *
 FROM (projects p2 ALIGN projects p1 ON p1.pay < p2.pay WITH (t,t)) p2_adjusted
 WHERE p1_adjusted.pay < p2_adjusted.pay
       AND p1_adjusted.t = p2_adjusted.t );

 empl | proj | pay |            t
------+------+-----+-------------------------
 Ann  | P1   |  60 | [2016-01-01,2016-03-01)
 Ann  | P1   |  60 | [2016-06-01,2016-09-01)
 Sam  | P1   |  40 | [2016-09-01,2016-12-01)
 Joe  | P2   |  80 | [2016-03-01,2016-06-01)
(4 rows)


-------------------------------------------------------------------------------
EXAMPLE FOR JOINS
-------------------------------------------------------------------------------

CREATE TABLE managers (mgr VARCHAR, proj VARCHAR, t DATERANGE);
INSERT INTO managers VALUES
('Tom', 'P1', DATERANGE('2016-09-01', '2016-12-01')),
('Frank', 'P2', DATERANGE('2016-01-01', '2016-12-01'));

TABLE managers;

  mgr  | proj |            t
-------+------+-------------------------
 Tom   | P1   | [2016-09-01,2016-12-01)
 Frank | P2   | [2016-01-01,2016-12-01)
(2 rows)

QUERY4: At each time point, determine employees and their managers
(including the periods with no employee or no manager).

WITH p_ext AS (SELECT *, t u FROM projects),
     m_ext AS (SELECT *, t v FROM managers)
SELECT proj, empl, mgr, t
FROM (p_ext ALIGN m_ext ON m_ext.proj = p_ext.proj WITH (t,t)) p_adjusted
       FULL OUTER JOIN
     (m_ext ALIGN p_ext ON m_ext.proj = p_ext.proj WITH (t,t)) m_adjusted
       USING (proj, t)
WHERE t = u * v OR v IS NULL OR u IS NULL;

 proj | empl |  mgr  |            t
------+------+-------+-------------------------
 P1   | Ann  |       | [2016-01-01,2016-09-01)
 P1   | Sam  |       | [2016-06-01,2016-09-01)
 P1   | Sam  | Tom   | [2016-09-01,2016-12-01)
 P2   |      | Frank | [2016-01-01,2016-03-01)
 P2   | Joe  | Frank | [2016-03-01,2016-06-01)
 P2   |      | Frank | [2016-06-01,2016-12-01)
(6 rows)

-------------------------------------------------------------------------------
IMPLEMENTATION
-------------------------------------------------------------------------------

Our implementation approach is to use two operators that take care of
adjusting the ranges in such a way that afterwards the corresponding
nontemporal operator can be used to compute the result:
input --> {NORMALIZE/ALIGN} --> nontemporal operator --> result.

The two operators (NORMALIZE and ALIGN) in this prototype are rewritten into
subqueries, which then serve as input for a new executor function
ExecTemporalAdjustment. The executor function takes care of the splitting of
the range types.

For NORMALIZE the tuples' ranges need to be split into all sub-ranges
according to all matching ranges of the second relation. For this we
create a subquery that first joins one relation with the range
boundaries of the other and then sorts the result. The executor
function splits the ranges in a sweep-line based manner.

For ALIGN the tuples' ranges must be split into all intersections and
differences with the other relation according to the join condition.
For this we create a subquery that first joins the two relations and
then sorts the result. The executor function splits the ranges
accordingly in a sweep-line based manner.

After the splitting it is possible to only use equality (GROUP BY,
DISTINCT, JOIN CONDITION, ...) on the range types to get the intended
result.

The subquery is 'visible' with the EXPLAIN command.

Range types need to be of half-open, i.e., [lower, upper).

Unbound ranges (Infinity), NaN, and NULL values (of ranges and range
bounds) are not yet supported.

-------------------------------------------------------------------------------

We are grateful for any feedback.

Best regards,

Anton, Johann, Michael, Peter
Attachment

tpg_primitives_out_v1.patch
pgsql-hackers by date:
From: Tom Lane
Date: 22 July 2016, 15:33:11
Subject: Re: Oddity in handling of cached plans for FDW queries
From: Guangzhou Zhang
Date: 22 July 2016, 16:41:08
Subject: Re: Unexpected memory usage for repeated inserts within plpgsql function
[PROPOSAL] Temporal query processing with range types - Mailing list pgsql-hackers

Attachment

Previous

Next