Thread: cross-table constraints?
Hello Postgres Novice Helpers,

I don't know if this is the correct place for this question, but please redirect as appropriate.

I am trying to create a datastore for a mathematical optimization model I've built. The math model data structure has a lot of interdependencies, and data integrity has been a constant battle over the past year. I've been made aware that data integrity is the C (consistency) in ACID, and is exactly the reason RDBMSs were created. Great!

I've spent some time putting together a schema that appears to make sense, taking care with concepts such as "single point of authority" and "normalization". I've tried to normalize it as best I could, but I don't have much practice designing schemas. I'm now running into a situation where I can't create a constraint that I think I need. My research suggests that the kind of constraint I want is colloquially known as a "cross-table constraint", and that wanting one implies I have not properly normalized or otherwise structured my schema. So, I would greatly appreciate any pointers this list could give me.

For the sake of the schema below, let me briefly describe what the model needs and does.

The model is an optimization model, so at its core it finds a maximum or minimum value of an "objective function", subject to any number of constraints. For example, a typical objective function would be "Find the minimum cost to supply energy to this system."

The model is organized into analyses, and each analysis can have a set of processes. I define 'process' as a technology-vintage tuple <t, v>. For example, a technology 'e_coal' might describe an electric coal power plant, and <e_coal, 1980> would describe a coal power plant built in 1980.

We might define the amount of capacity installed as:

    CAP[e_coal, 1980] = 100  # some number of units

The analysis also has the notion of a first period. For example, if the analysis in question had these vintages:

    1980, 1990, 2000, 2010, 2020, 2030, 2040

and period_0 (the first period) was 2010, then the model only makes decisions about the periods 2010, 2020, and 2030.

This means that to /utilize/ the <e_coal, 1980> process, we might define some "activity" variables for it:

    ACT[2010, e_coal, 1980]
    ACT[2020, e_coal, 1980]
    ACT[2030, e_coal, 1980]
    ACT[2040, e_coal, 1980]

The ACTivity variable is indexed by period, technology, and vintage <p, t, v>. *Note that the set of periods (e.g., 2010, 2020, 2030, 2040) is a subset of the set of vintages.*

Here's where I get to how this relates to my question: since the model can only make decisions with variables that describe future time periods (e.g., ACT[2010, ...], ACT[2020, ...], ACT[2030, ...], ...), it makes no sense to define any characteristics for periods prior to 2010. For example, if use of the <e_coal, 1980> process cost different amounts in different periods, I might have data like this:

    Cost[2010, e_coal, 1980] = 0.10  # $/kWh
    Cost[2020, e_coal, 1980] = 0.30  # $/kWh
    Cost[2030, e_coal, 1980] = 0.60  # $/kWh
    Cost[2040, e_coal, 1980] = 0.33  # $/kWh

However, what's currently to prevent a user from giving data for 2000, or 1990? At the moment, nothing other than my application logic.
For example, consider this (minimized) schema:

    CREATE TABLE analysis (
        id        INTEGER NOT NULL PRIMARY KEY,
        name      VARCHAR(32767) NOT NULL UNIQUE,
        period_0  INTEGER NOT NULL
    );

    CREATE TABLE vintage (
        id           INTEGER NOT NULL PRIMARY KEY,
        analysis_id  INTEGER NOT NULL REFERENCES analysis (id),
        vintage      INTEGER NOT NULL
    );

    CREATE TABLE technology (
        id    INTEGER NOT NULL PRIMARY KEY,
        name  VARCHAR(32767) NOT NULL UNIQUE
    );

    CREATE TABLE process (
        id             INTEGER NOT NULL PRIMARY KEY,
        analysis_id    INTEGER NOT NULL REFERENCES analysis (id),
        technology_id  INTEGER NOT NULL REFERENCES technology (id),
        vintage_id     INTEGER NOT NULL REFERENCES vintage (id),

        UNIQUE (analysis_id, technology_id, vintage_id)
    );

    CREATE TABLE cost (
        id          INTEGER NOT NULL PRIMARY KEY,
        period_id   INTEGER NOT NULL REFERENCES vintage (id),
        process_id  INTEGER NOT NULL REFERENCES process (id),
        value       REAL NOT NULL,

        UNIQUE (period_id, process_id)
    );

Ideally, I'd have a constraint on the 'cost' table requiring that the period referenced by period_id not be less than the associated analysis period_0. I would like any INSERT statement that mimics this printf-style format string to fail:

    INSERT INTO cost (period_id, process_id, value)
    VALUES (%d, %d, %f);

(where period_id maps to a vintage.vintage that is less than analysis.period_0)

What would be a good way to reorganize my schema to enable a constraint that effectively limits the allowable rows of cost, subject to the value?

Many thanks (and apologies for the long post!),

Kevin
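To make the failure mode concrete, here is a minimal sketch against the schema above (the id values are hypothetical, chosen only for illustration):

    -- Hypothetical sample data: analysis 'demo' with period_0 = 2010.
    INSERT INTO analysis   (id, name, period_0)       VALUES (1, 'demo', 2010);
    INSERT INTO vintage    (id, analysis_id, vintage) VALUES (2, 1, 1980);
    INSERT INTO vintage    (id, analysis_id, vintage) VALUES (3, 1, 1990);
    INSERT INTO vintage    (id, analysis_id, vintage) VALUES (4, 1, 2010);
    INSERT INTO technology (id, name)                 VALUES (7, 'e_coal');
    INSERT INTO process    (id, analysis_id, technology_id, vintage_id)
        VALUES (9, 1, 7, 2);   -- the process <e_coal, 1980>

    -- Every declared constraint accepts this row, yet it is meaningless
    -- to the model: period_id 3 is the year 1990, before period_0 = 2010.
    INSERT INTO cost (id, period_id, process_id, value) VALUES (1, 3, 9, 0.10);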
On Tue, Aug 27, 2013 at 2:41 PM, Kevin Hunter Kesling <kmhunte2@ncsu.edu> wrote:
> [...snip: full original message quoted above...]
> What would be a good way to reorganize my schema to enable a constraint
> that effectively limits the allowable rows of cost, subject to the value?

Kevin,

I am not a database expert, and I don't see a way to enforce what you want directly in the uniqueness and foreign key constraints of the schema. But I believe you should be able to enforce the constraint with a trigger:

http://www.postgresql.org/docs/current/static/plpgsql-trigger.html

I may have misunderstood your example, but I think this _might_ do what you want (note that cost.period_id is a surrogate key into vintage, so it has to be resolved to an actual year before comparing against period_0):

    CREATE FUNCTION cost_period_check() RETURNS trigger AS $cost_period_check$
    DECLARE
        a_row  analysis%ROWTYPE;
        period integer;
    BEGIN
        -- Find the analysis that owns the process being costed.
        SELECT INTO a_row analysis.*
          FROM analysis
          INNER JOIN process ON process.analysis_id = analysis.id
         WHERE process.id = NEW.process_id
         LIMIT 1;

        -- Resolve the period_id surrogate key to its year.
        SELECT INTO period vintage FROM vintage WHERE id = NEW.period_id;

        IF a_row.period_0 > period THEN
            RAISE EXCEPTION 'period % is before analysis period_0 %',
                period, a_row.period_0;
        END IF;

        RETURN NEW;
    END;
    $cost_period_check$ LANGUAGE plpgsql;

    CREATE TRIGGER cost_trigger BEFORE INSERT OR UPDATE ON cost
        FOR EACH ROW EXECUTE PROCEDURE cost_period_check();

Does that help?

-Mike
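With the trigger in place, the invalid row from the earlier sketch should now be rejected while a legitimate one passes (again using the hypothetical ids from that sketch):

    -- period_id 3 resolves to 1990, which precedes period_0 = 2010:
    INSERT INTO cost (id, period_id, process_id, value) VALUES (1, 3, 9, 0.10);
    -- ERROR:  period 1990 is before analysis period_0 2010

    -- period_id 4 resolves to 2010, so this row is accepted:
    INSERT INTO cost (id, period_id, process_id, value) VALUES (2, 4, 9, 0.10);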
Kevin Hunter Kesling wrote
> CREATE TABLE cost (
>     id          INTEGER NOT NULL PRIMARY KEY,
>     period_id   INTEGER NOT NULL REFERENCES vintage (id),
>     process_id  INTEGER NOT NULL REFERENCES process (id),
>     value       REAL NOT NULL,
>
>     UNIQUE (period_id, process_id)
> );

A check constraint can only reference the table on which it is defined, so you will most likely need to either define a trigger or wrap the relevant constraint checking in an API function and modify the relevant database items only via that function.

That said, you can create a relation containing all the valid combinations and then use a foreign key constraint on the cost side so that only defined combinations can be costed out.

Two other comments to consider.

The "cost" relation defined above, if you de-normalize it via the foreign keys, ends up having two "analysis_id" fields: the one on vintage and the one on process. It is possible for these two fields to fall out of sync, in addition to the "minimum period" error you have already identified.

The reason for this problem is that you are using artificial keys for your relationships instead of natural keys. I may expound on this more later but cannot at the moment. While surrogate keys are not evil, they are also not something to be used lightly, and in as complex a model as this, using natural keys does have its advantages. Since you can define multi-column foreign keys, the same analysis_id on the cost table can be related to other tables in two separately defined "references". Something to ponder; it may help solve both problems.

David J.
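A minimal sketch of the natural-key design David describes, reusing the table and column names from the original post (this restructures the keys, so it is a redesign rather than a drop-in change). Because every child table carries analysis_id inside its key, the two foreign keys on cost share that column and can never point at different analyses:

    -- analysis and technology stay as in the original post.
    CREATE TABLE vintage (
        analysis_id  INTEGER NOT NULL REFERENCES analysis (id),
        vintage      INTEGER NOT NULL,
        PRIMARY KEY (analysis_id, vintage)
    );

    CREATE TABLE process (
        analysis_id    INTEGER NOT NULL,
        technology_id  INTEGER NOT NULL REFERENCES technology (id),
        vintage        INTEGER NOT NULL,
        PRIMARY KEY (analysis_id, technology_id, vintage),
        FOREIGN KEY (analysis_id, vintage)
            REFERENCES vintage (analysis_id, vintage)
    );

    CREATE TABLE cost (
        analysis_id    INTEGER NOT NULL,
        period         INTEGER NOT NULL,
        technology_id  INTEGER NOT NULL,
        vintage        INTEGER NOT NULL,
        value          REAL NOT NULL,
        PRIMARY KEY (analysis_id, period, technology_id, vintage),
        -- Both references constrain the same analysis_id column, so the
        -- two de-normalized paths to analysis cannot fall out of sync.
        FOREIGN KEY (analysis_id, period)
            REFERENCES vintage (analysis_id, vintage),
        FOREIGN KEY (analysis_id, technology_id, vintage)
            REFERENCES process (analysis_id, technology_id, vintage)
    );

The period_0 floor still needs a trigger or David's valid-combinations relation, but the duplicate-analysis_id anomaly disappears by construction.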
At 3:05pm -0400 Tue, 27 Aug 2013, Michael Swierczek wrote:
>> What would be a good way to reorganize my schema to enable a
>> constraint that effectively limits the allowable rows of cost,
>> subject to the value [from another table and column]?
>
> I am not a database expert, and I don't see a way to enforce what you
> want directly in the uniqueness and foreign key constraints of the
> schema. But I believe you should be able to enforce the constraint
> with a trigger:
>
> http://www.postgresql.org/docs/current/static/plpgsql-trigger.html
>
> I may have misunderstood your example, but I think this _might_ do
> what you want:
>
> CREATE FUNCTION [...]
> CREATE TRIGGER cost_trigger BEFORE INSERT OR UPDATE ON cost
>     FOR EACH ROW EXECUTE PROCEDURE cost_period_check();

Thank you very much for this example code. That is a couple hours of tinkering/learning now reduced to changing a couple of characters.

> Does that help?

Indeed it does. I was hoping to avoid a trigger, and thought that I was simply missing some requisite understanding of RDB internals and the SQL language. Alas.

Thanks!

Kevin
At 3:33pm -0400 Tue, 27 Aug 2013, David Johnston wrote:
> Kevin Hunter Kesling wrote
>> CREATE TABLE cost (
>>     id          INTEGER NOT NULL PRIMARY KEY,
>>     period_id   INTEGER NOT NULL REFERENCES vintage (id),
>>     process_id  INTEGER NOT NULL REFERENCES process (id),
>>     value       REAL NOT NULL,
>>
>>     UNIQUE (period_id, process_id)
>> );
>
> A check constraint can only reference the table on which it is
> defined, so you will most likely need to either define a trigger or
> wrap the relevant constraint checking in an API function and modify
> the relevant database items only via that function.
>
> That said, you can create a relation containing all the valid
> combinations and then use a foreign key constraint on the cost side
> so that only defined combinations can be costed out.

I had thought about that, but it would require moving the trigger logic from a check on the cost table to the creation of rows in the "valid" table. Is that a more "correct" approach than simply performing the check via a trigger on the cost table? If so, why?

> Two other comments to consider. The "cost" relation defined above, if
> you de-normalize it via the foreign keys, ends up having two
> "analysis_id" fields: the one on vintage and the one on process.

This is a really good observation! Thank you for pointing it out.

> It is possible for these two fields to fall out of sync, in addition
> to the "minimum period" error you have already identified.

Heh, while I will do my utmost to keep my application logic correct, this is a constant worry for me. It would be /really/ nice to be able to figure this out at the DB level. Hopefully I can work around this with triggers.

> The reason for this problem is that you are using artificial keys
> for your relationships instead of natural keys.
>
> I may expound on this more later but cannot at the moment. While
> surrogate keys are not evil, they are also not something to be used
> lightly, and in as complex a model as this, using natural keys does
> have its advantages. Since you can define multi-column foreign keys,
> the same analysis_id on the cost table can be related to other tables
> in two separately defined "references".

I really like this point. I really do. Unfortunately, the large detail that I left out of my question is that I've created this model with Django, which (currently) requires that a single column be the primary key for its ORM logic. [1]

    "Each model requires exactly one field to have primary_key=True."
    (where 'field' = DB column)

[2]

    "Currently Django models only support a single column in this set,
    denying many designs where the natural primary key of a table is
    multiple columns. Django currently can't work with these schemas;
    they must instead introduce a redundant single-column key (a
    'surrogate' key), forcing applications to make arbitrary and
    otherwise-unnecessary choices about which key to use for the table
    in any given instance."

The choice of Django has drastically helped my project in other areas, but leaves some of the data integrity and modeling issues to the application logic. I suppose "one pays their monies and makes their choices!"

Thank you very much for your helpful insights.

Kevin

[1] https://docs.djangoproject.com/en/dev/topics/db/models/#automatic-primary-key-fields
[2] https://code.djangoproject.com/wiki/MultipleColumnPrimaryKeys
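For what it is worth, PostgreSQL lets a foreign key target any column set covered by a UNIQUE constraint, not only the primary key. So one possible middle ground, sketched here under the assumption that Django keeps its single-column id as each table's primary key, is to add redundant UNIQUE constraints over (id, analysis_id) pairs and point composite foreign keys at them; Django would ignore these constraints, so they would have to be installed with raw SQL outside the ORM:

    -- Hypothetical variant of the original schema: id stays the PK for
    -- Django, while UNIQUE (id, analysis_id) exposes a composite FK target.
    CREATE TABLE vintage (
        id           INTEGER NOT NULL PRIMARY KEY,
        analysis_id  INTEGER NOT NULL REFERENCES analysis (id),
        vintage      INTEGER NOT NULL,
        UNIQUE (id, analysis_id)
    );

    CREATE TABLE process (
        id             INTEGER NOT NULL PRIMARY KEY,
        analysis_id    INTEGER NOT NULL REFERENCES analysis (id),
        technology_id  INTEGER NOT NULL REFERENCES technology (id),
        vintage_id     INTEGER NOT NULL REFERENCES vintage (id),
        UNIQUE (analysis_id, technology_id, vintage_id),
        UNIQUE (id, analysis_id)
    );

    CREATE TABLE cost (
        id           INTEGER NOT NULL PRIMARY KEY,
        analysis_id  INTEGER NOT NULL,
        period_id    INTEGER NOT NULL,
        process_id   INTEGER NOT NULL,
        value        REAL NOT NULL,
        UNIQUE (period_id, process_id),
        -- Both composite FKs must agree on analysis_id, closing the
        -- "two analysis_id paths" hole David pointed out.
        FOREIGN KEY (period_id, analysis_id)
            REFERENCES vintage (id, analysis_id),
        FOREIGN KEY (process_id, analysis_id)
            REFERENCES process (id, analysis_id)
    );

The period_0 check still needs the trigger, but the surrogate keys that Django insists on remain intact.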