Thread: cross-table constraints?
Hello Postgres Novice Helpers,

I don't know if this is the correct place for this question, but please redirect as appropriate.

I am trying to create a datastore for a mathematical optimization model I've built. The math model data structure has a lot of interdependencies, and data integrity has been a constant battle over the past year. I've been made aware that data integrity is the C (consistency) in ACID, and is exactly the reason RDBMSs were created. Great!

I've spent some time putting together a schema that appears to make sense, taking care with concepts such as "single point of authority" and "normalization". I've tried to normalize it as best I could, but I don't have much practice designing schemas. I'm now running into a situation where I can't create a constraint that I think I need. My research suggests that the kind of constraint I want is colloquially known as a "cross-table constraint", and that wanting one implies I have not properly normalized or otherwise structured my schema. So, I would greatly appreciate any pointers this list could give me.

For the sake of the schema below, let me briefly describe what the model needs and does.

The model is an optimization model, so at its core it finds a maximum or minimum value of an "objective function", subject to any number of constraints. For example, a typical objective function would be "Find the minimum cost to supply energy to this system."

The model is organized into analyses, and each analysis can have a set of processes. I define 'process' as a technology-vintage tuple <t, v>. For example, a technology 'e_coal' might describe an electric coal power plant, and <e_coal, 1980> would describe a coal power plant built in 1980.

We might define the amount of capacity installed as:

    CAP[e_coal, 1980] = 100  # some number of units

The analysis also has the notion of a first period. For example, if the analysis in question had these vintages:

    1980, 1990, 2000, 2010, 2020, 2030, 2040

and period_0 (the first period) was 2010, then the model only makes decisions about the periods 2010, 2020, and 2030.

This means that to /utilize/ the <e_coal, 1980> process, we might define some "activity" variables for it:

    ACT[2010, e_coal, 1980]
    ACT[2020, e_coal, 1980]
    ACT[2030, e_coal, 1980]
    ACT[2040, e_coal, 1980]

The ACTivity variable is indexed by period, technology, and vintage <p, t, v>. *Note that the set of periods (e.g., 2010, 2020, 2030, 2040) is a subset of the set of vintages.*

Here's where I get to how this relates to my question: since the model can only make decisions with variables that describe future time periods (e.g., ACT[2010, ...], ACT[2020, ...], ACT[2030, ...], ...), it makes no sense to define any characteristics for periods prior to 2010. For example, if use of the <e_coal, 1980> process cost different amounts in different periods, I might have data like this:

    Cost[2010, e_coal, 1980] = 0.10  # $/kWh
    Cost[2020, e_coal, 1980] = 0.30  # $/kWh
    Cost[2030, e_coal, 1980] = 0.60  # $/kWh
    Cost[2040, e_coal, 1980] = 0.33  # $/kWh

However, what's currently to prevent a user from giving data for 2000, or 1990? At the moment, nothing other than my application logic.
For example, consider this (minimized) schema:

    CREATE TABLE analysis (
        id        INTEGER NOT NULL PRIMARY KEY,
        name      VARCHAR(32767) NOT NULL UNIQUE,
        period_0  INTEGER NOT NULL
    );

    CREATE TABLE vintage (
        id           INTEGER NOT NULL PRIMARY KEY,
        analysis_id  INTEGER NOT NULL REFERENCES analysis (id),
        vintage      INTEGER NOT NULL
    );

    CREATE TABLE technology (
        id    INTEGER NOT NULL PRIMARY KEY,
        name  VARCHAR(32767) NOT NULL UNIQUE
    );

    CREATE TABLE process (
        id             INTEGER NOT NULL PRIMARY KEY,
        analysis_id    INTEGER NOT NULL REFERENCES analysis (id),
        technology_id  INTEGER NOT NULL REFERENCES technology (id),
        vintage_id     INTEGER NOT NULL REFERENCES vintage (id),

        UNIQUE (analysis_id, technology_id, vintage_id)
    );

    CREATE TABLE cost (
        id          INTEGER NOT NULL PRIMARY KEY,
        period_id   INTEGER NOT NULL REFERENCES vintage (id),
        process_id  INTEGER NOT NULL REFERENCES process (id),
        value       REAL NOT NULL,

        UNIQUE (period_id, process_id)
    );

Ideally, I'd have a constraint on the 'cost' table requiring that the period referenced by period_id not be less than the associated analysis period_0. I would like any INSERT statement that mimics this printf-style format string to fail:

    INSERT INTO cost (period_id, process_id, value)
    VALUES (%d, %d, %f);

(where period_id maps to a vintage.vintage that is less than analysis.period_0)

What would be a good way to reorganize my schema to enable a constraint that effectively limits the allowable rows of cost, subject to the value?

Many thanks (and apologies for the long post!),

Kevin
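To make the failure mode concrete, here is a minimal sketch against the schema above (the id values are hypothetical, chosen only for illustration):

    -- Hypothetical sample data: analysis 'demo' with period_0 = 2010.
    INSERT INTO analysis   (id, name, period_0)       VALUES (1, 'demo', 2010);
    INSERT INTO vintage    (id, analysis_id, vintage) VALUES (2, 1, 1980);
    INSERT INTO vintage    (id, analysis_id, vintage) VALUES (3, 1, 1990);
    INSERT INTO vintage    (id, analysis_id, vintage) VALUES (4, 1, 2010);
    INSERT INTO technology (id, name)                 VALUES (7, 'e_coal');
    INSERT INTO process    (id, analysis_id, technology_id, vintage_id)
        VALUES (9, 1, 7, 2);   -- the process <e_coal, 1980>

    -- Every declared constraint accepts this row, yet it is meaningless
    -- to the model: period_id 3 is the year 1990, before period_0 = 2010.
    INSERT INTO cost (id, period_id, process_id, value) VALUES (1, 3, 9, 0.10);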
On Tue, Aug 27, 2013 at 2:41 PM, Kevin Hunter Kesling <kmhunte2@ncsu.edu> wrote:
> [...snip: full original message quoted above...]
> What would be a good way to reorganize my schema to enable a constraint
> that effectively limits the allowable rows of cost, subject to the value?

Kevin,

I am not a database expert, and I don't see a way to enforce what you want directly in the uniqueness and foreign key constraints of the schema. But I believe you should be able to enforce the constraint with a trigger:

http://www.postgresql.org/docs/current/static/plpgsql-trigger.html

I may have misunderstood your example, but I think this _might_ do what you want (note that cost.period_id is a surrogate key into vintage, so it has to be resolved to an actual year before comparing against period_0):

    CREATE FUNCTION cost_period_check() RETURNS trigger AS $cost_period_check$
    DECLARE
        a_row  analysis%ROWTYPE;
        period integer;
    BEGIN
        -- Find the analysis that owns the process being costed.
        SELECT INTO a_row analysis.*
          FROM analysis
          INNER JOIN process ON process.analysis_id = analysis.id
         WHERE process.id = NEW.process_id
         LIMIT 1;

        -- Resolve the period_id surrogate key to its year.
        SELECT INTO period vintage FROM vintage WHERE id = NEW.period_id;

        IF a_row.period_0 > period THEN
            RAISE EXCEPTION 'period % is before analysis period_0 %',
                period, a_row.period_0;
        END IF;

        RETURN NEW;
    END;
    $cost_period_check$ LANGUAGE plpgsql;

    CREATE TRIGGER cost_trigger BEFORE INSERT OR UPDATE ON cost
        FOR EACH ROW EXECUTE PROCEDURE cost_period_check();

Does that help?

-Mike
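With the trigger in place, the invalid row from the earlier sketch should now be rejected while a legitimate one passes (again using the hypothetical ids from that sketch):

    -- period_id 3 resolves to 1990, which precedes period_0 = 2010:
    INSERT INTO cost (id, period_id, process_id, value) VALUES (1, 3, 9, 0.10);
    -- ERROR:  period 1990 is before analysis period_0 2010

    -- period_id 4 resolves to 2010, so this row is accepted:
    INSERT INTO cost (id, period_id, process_id, value) VALUES (2, 4, 9, 0.10);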
Kevin Hunter Kesling wrote
> CREATE TABLE cost (
>     id          INTEGER NOT NULL PRIMARY KEY,
>     period_id   INTEGER NOT NULL REFERENCES vintage (id),
>     process_id  INTEGER NOT NULL REFERENCES process (id),
>     value       REAL NOT NULL,
>
>     UNIQUE (period_id, process_id)
> );

A check constraint can only reference the table on which it is defined, so you will most likely need to either define a trigger or wrap the relevant constraint checking in an API function and modify the relevant database items only via that function.

That said, you can create a relation containing all the valid combinations and then use a foreign key constraint on the cost side so that only defined combinations can be costed out.

Two other comments to consider.

The "cost" relation defined above, if you de-normalize it via the foreign keys, ends up having two "analysis_id" fields: the one on vintage and the one on process. It is possible for these two fields to fall out of sync, in addition to the "minimum period" error you have already identified.

The reason for this problem is that you are using artificial keys for your relationships instead of natural keys. I may expound on this more later but cannot at the moment. While surrogate keys are not evil, they are also not something to be used lightly, and in as complex a model as this, using natural keys does have its advantages. Since you can define multi-column foreign keys, the same analysis_id on the cost table can be related to other tables in two separately defined "references". Something to ponder; it may help solve both problems.

David J.
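A minimal sketch of the natural-key design David describes, reusing the table and column names from the original post (this restructures the keys, so it is a redesign rather than a drop-in change). Because every child table carries analysis_id inside its key, the two foreign keys on cost share that column and can never point at different analyses:

    -- analysis and technology stay as in the original post.
    CREATE TABLE vintage (
        analysis_id  INTEGER NOT NULL REFERENCES analysis (id),
        vintage      INTEGER NOT NULL,
        PRIMARY KEY (analysis_id, vintage)
    );

    CREATE TABLE process (
        analysis_id    INTEGER NOT NULL,
        technology_id  INTEGER NOT NULL REFERENCES technology (id),
        vintage        INTEGER NOT NULL,
        PRIMARY KEY (analysis_id, technology_id, vintage),
        FOREIGN KEY (analysis_id, vintage)
            REFERENCES vintage (analysis_id, vintage)
    );

    CREATE TABLE cost (
        analysis_id    INTEGER NOT NULL,
        period         INTEGER NOT NULL,
        technology_id  INTEGER NOT NULL,
        vintage        INTEGER NOT NULL,
        value          REAL NOT NULL,
        PRIMARY KEY (analysis_id, period, technology_id, vintage),
        -- Both references constrain the same analysis_id column, so the
        -- two de-normalized paths to analysis cannot fall out of sync.
        FOREIGN KEY (analysis_id, period)
            REFERENCES vintage (analysis_id, vintage),
        FOREIGN KEY (analysis_id, technology_id, vintage)
            REFERENCES process (analysis_id, technology_id, vintage)
    );

The period_0 floor still needs a trigger or David's valid-combinations relation, but the duplicate-analysis_id anomaly disappears by construction.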
At 3:05pm -0400 Tue, 27 Aug 2013, Michael Swierczek wrote:
>> What would be a good way to reorganize my schema to enable a
>> constraint that effectively limits the allowable rows of cost,
>> subject to the value [from another table and column]?
>
> I am not a database expert, and I don't see a way to enforce what you
> want directly in the uniqueness and foreign key constraints of the
> schema. But I believe you should be able to enforce the constraint
> with a trigger:
>
> http://www.postgresql.org/docs/current/static/plpgsql-trigger.html
>
> I may have misunderstood your example, but I think this _might_ do
> what you want:
>
> CREATE FUNCTION [...]
> CREATE TRIGGER cost_trigger BEFORE INSERT OR UPDATE ON cost
>     FOR EACH ROW EXECUTE PROCEDURE cost_period_check();

Thank you very much for this example code. That is a couple hours of tinkering/learning now reduced to changing a couple of characters.

> Does that help?

Indeed it does. I was hoping to avoid a trigger, and thought that I was simply missing some requisite understanding of RDB internals and the SQL language. Alas.

Thanks!

Kevin
At 3:33pm -0400 Tue, 27 Aug 2013, David Johnston wrote:
> Kevin Hunter Kesling wrote
>> CREATE TABLE cost (
>>     id          INTEGER NOT NULL PRIMARY KEY,
>>     period_id   INTEGER NOT NULL REFERENCES vintage (id),
>>     process_id  INTEGER NOT NULL REFERENCES process (id),
>>     value       REAL NOT NULL,
>>
>>     UNIQUE (period_id, process_id)
>> );
>
> A check constraint can only reference the table on which it is
> defined, so you will most likely need to either define a trigger or
> wrap the relevant constraint checking in an API function and modify
> the relevant database items only via that function.
>
> That said, you can create a relation containing all the valid
> combinations and then use a foreign key constraint on the cost side
> so that only defined combinations can be costed out.

I had thought about that, but it would require moving the trigger logic from a check on the cost table to the creation of rows in the "valid" table. Is that a more "correct" approach than simply performing the check via a trigger on the cost table? If so, why?

> Two other comments to consider. The "cost" relation defined above, if
> you de-normalize it via the foreign keys, ends up having two
> "analysis_id" fields: the one on vintage and the one on process.

This is a really good observation! Thank you for pointing it out.

> It is possible for these two fields to fall out of sync, in addition
> to the "minimum period" error you have already identified.

Heh, while I will do my utmost to keep my application logic correct, this is a constant worry for me. It would be /really/ nice to be able to figure this out at the DB level. Hopefully I can work around this with triggers.

> The reason for this problem is that you are using artificial keys
> for your relationships instead of natural keys.
>
> I may expound on this more later but cannot at the moment. While
> surrogate keys are not evil, they are also not something to be used
> lightly, and in as complex a model as this, using natural keys does
> have its advantages. Since you can define multi-column foreign keys,
> the same analysis_id on the cost table can be related to other tables
> in two separately defined "references".

I really like this point. I really do. Unfortunately, the large detail that I left out of my question is that I've created this model with Django, which (currently) requires that a single column be the primary key for its ORM logic. [1]

    "Each model requires exactly one field to have primary_key=True."
    (where 'field' = DB column)

[2]

    "Currently Django models only support a single column in this set,
    denying many designs where the natural primary key of a table is
    multiple columns. Django currently can't work with these schemas;
    they must instead introduce a redundant single-column key (a
    'surrogate' key), forcing applications to make arbitrary and
    otherwise-unnecessary choices about which key to use for the table
    in any given instance."

The choice of Django has drastically helped my project in other areas, but leaves some of the data integrity and modeling issues to the application logic. I suppose "one pays their monies and makes their choices!"

Thank you very much for your helpful insights.

Kevin

[1] https://docs.djangoproject.com/en/dev/topics/db/models/#automatic-primary-key-fields
[2] https://code.djangoproject.com/wiki/MultipleColumnPrimaryKeys
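For what it is worth, PostgreSQL lets a foreign key target any column set covered by a UNIQUE constraint, not only the primary key. So one possible middle ground, sketched here under the assumption that Django keeps its single-column id as each table's primary key, is to add redundant UNIQUE constraints over (id, analysis_id) pairs and point composite foreign keys at them; Django would ignore these constraints, so they would have to be installed with raw SQL outside the ORM:

    -- Hypothetical variant of the original schema: id stays the PK for
    -- Django, while UNIQUE (id, analysis_id) exposes a composite FK target.
    CREATE TABLE vintage (
        id           INTEGER NOT NULL PRIMARY KEY,
        analysis_id  INTEGER NOT NULL REFERENCES analysis (id),
        vintage      INTEGER NOT NULL,
        UNIQUE (id, analysis_id)
    );

    CREATE TABLE process (
        id             INTEGER NOT NULL PRIMARY KEY,
        analysis_id    INTEGER NOT NULL REFERENCES analysis (id),
        technology_id  INTEGER NOT NULL REFERENCES technology (id),
        vintage_id     INTEGER NOT NULL REFERENCES vintage (id),
        UNIQUE (analysis_id, technology_id, vintage_id),
        UNIQUE (id, analysis_id)
    );

    CREATE TABLE cost (
        id           INTEGER NOT NULL PRIMARY KEY,
        analysis_id  INTEGER NOT NULL,
        period_id    INTEGER NOT NULL,
        process_id   INTEGER NOT NULL,
        value        REAL NOT NULL,
        UNIQUE (period_id, process_id),
        -- Both composite FKs must agree on analysis_id, closing the
        -- "two analysis_id paths" hole David pointed out.
        FOREIGN KEY (period_id, analysis_id)
            REFERENCES vintage (id, analysis_id),
        FOREIGN KEY (process_id, analysis_id)
            REFERENCES process (id, analysis_id)
    );

The period_0 check still needs the trigger, but the surrogate keys that Django insists on remain intact.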