Re: pervasiveness of surrogate (also called synthetic) keys - Mailing list pgsql-general
| From | David Johnston |
|---|---|
| Subject | Re: pervasiveness of surrogate (also called synthetic) keys |
| Date | |
| Msg-id | B6BA5655-A7DF-4208-B3FA-5F40B5519C7F@yahoo.com |
| In response to | Re: pervasiveness of surrogate (also called synthetic) keys (Greg Smith <greg@2ndquadrant.com>) |
| Responses | Re: pervasiveness of surrogate (also called synthetic) keys |
| List | pgsql-general |
On May 3, 2011, at 22:03, Greg Smith <greg@2ndquadrant.com> wrote:

> Merlin Moncure wrote:
>> If your data modeler made the assumption that a MAC is unique (a mistake, obviously), at least the other tables are protected from violations of that assumption because the database would reject them with an error, which is a Good Thing. Without a uniqueness constraint you now have ambiguous data, which is a Very Bad Thing.
>
> With a uniqueness constraint in this situation, the unexpected data--a row with a non-unique MAC--will be rejected and possibly lost when the insertion happens. You say that's a good thing; plenty of people will say that's the worst possible thing that can happen. When dealing with external data, it's often impossible to know everything you're going to see later at design time. Approaching that problem with the idea that you're going to lose any data that doesn't fit into the original model is not what everyone finds reasonable behavior.
>
> I don't think it's possible to decide in a generic way which of these is the better approach: to reject unexpected data and force the problem back at the application immediately (commit failure), or to accept it because you're using a surrogate key and discover the problems down the line. Both are valid approaches with a very different type of risk associated with them. I think it's fair to say that real-world data is not always well known enough at design time to follow the idea you're suggesting, though, and that does factor into why there is such a preference for surrogate keys in the industry.
>
> --
> Greg Smith   2ndQuadrant US   greg@2ndQuadrant.com   Baltimore, MD
> PostgreSQL Training, Services, and 24x7 Support   www.2ndQuadrant.us
> "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

If you implicitly expect MAC to be unique but do not enforce it, then you are likely to write queries that rely upon that uniqueness; if you do enforce it, then the constraint failure occurs anyway.

A scalar sub-query that links via the MAC will fail when the duplicate data is encountered, and normal queries will return too many records. A detail report may make the problem obvious, but if you are summarizing the data the specific offending record is going to require some effort to find.

I guess if you are the compromising type you can loosely enforce the uniqueness by running a check query periodically to see whether supposedly unique values have been duplicated.

I agree there is no right answer - the designer needs to make trade-offs - but I'd rather reject new data and leave the system in a status-quo stable state instead of introducing invalid data and putting the system into a state where it requires effort to get it functioning again. If you accept the invalid data, the likely scenario when something breaks is that someone finds the offending record and removes it until the application and database can be fixed properly - which is where we are at with validation. The common exception is where identifiers are reused over time and you remove the old record in order to keep/allow the newer record to remain.
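To make the contrast concrete, here is a minimal sketch (PostgreSQL syntax, with a hypothetical "devices" table) of up-front enforcement versus after-the-fact checking:

```sql
-- Hypothetical example: declared uniqueness vs. periodic checking.

-- With a declared constraint, a duplicate MAC is rejected at insert time.
CREATE TABLE devices (
    device_id serial  PRIMARY KEY,        -- surrogate key
    mac       macaddr NOT NULL UNIQUE,    -- the attribute assumed to be unique
    label     text
);

INSERT INTO devices (mac, label) VALUES ('08:00:2b:01:02:03', 'switch A');
INSERT INTO devices (mac, label) VALUES ('08:00:2b:01:02:03', 'switch B');
-- ERROR:  duplicate key value violates unique constraint "devices_mac_key"

-- Without the constraint the second row is silently accepted, and a scalar
-- sub-query that relies on the implied uniqueness fails later instead, e.g.:
--   SELECT l.seen_at,
--          (SELECT d.label FROM devices d WHERE d.mac = l.mac)
--   FROM   traffic_log l;
--   ERROR:  more than one row returned by a subquery used as an expression

-- The periodic "loose enforcement" check mentioned above: report any
-- supposedly unique values that have been duplicated.
SELECT mac, count(*) AS copies
FROM   devices
GROUP  BY mac
HAVING count(*) > 1;
```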
On a tangential course, something I've started considering is a setup whereby you basically have two identifiers for a record. One is end-user facing and updatable, whereas the other is static and used in intra-table relations. You can create a new record with the same user-facing id as an existing id, but the existing id will be replaced with its system id. This is useful when users will be using the id often and it can reasonably be assumed to be unique over a moderate period of time (say a year). Invoice numbers and customer numbers are two common examples. The lookup id itself may require additional fields in order to qualify as a primary (natural) key, but the static key wants to be a single field. Often simply putting a date with the original id (and parent identifiers) is sufficient due to the infrequency of updates. The downside is that with string-based parent identifiers the PK value can be quite long. I currently have PKs of 40-50 characters, but during my new design my first pass on a couple of tables indicated keys of more than 100 characters.

Are there any rules of thumb on the performance of a PK as a function of key length? I like using varchar-based identifiers since I tend to query tables directly, and writing WHERE clauses is much easier if you can avoid the joins. I'm likely better off creating views and querying those, but I am still curious about any basic thoughts on having a 100+ character primary key.
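A rough sketch of the two-identifier idea, with hypothetical invoice tables and names: the user-facing number only has to stay unique within its period, while the narrow static id carries the relationships so the long composite varchar key never has to be propagated into child tables.

```sql
-- Rough sketch of the two-identifier idea (hypothetical names).
CREATE TABLE invoice (
    invoice_id     bigserial   PRIMARY KEY,  -- static, system-facing, used in relations
    invoice_number varchar(20) NOT NULL,     -- user-facing, reusable over time
    issued_on      date        NOT NULL,
    UNIQUE (invoice_number, issued_on)       -- the natural key needs the date qualifier
);

CREATE TABLE invoice_line (
    invoice_id bigint        NOT NULL REFERENCES invoice (invoice_id),
    line_no    integer       NOT NULL,
    amount     numeric(12,2) NOT NULL,
    PRIMARY KEY (invoice_id, line_no)
);

-- Users look rows up by the familiar identifier...
SELECT invoice_id
FROM   invoice
WHERE  invoice_number = 'INV-0042'
AND    issued_on >= date '2011-01-01';

-- ...and a view keeps ad-hoc WHERE clauses simple without exposing the joins.
CREATE VIEW invoice_line_v AS
SELECT i.invoice_number, i.issued_on, l.line_no, l.amount
FROM   invoice i
JOIN   invoice_line l USING (invoice_id);
```

The narrow integer key also keeps the child table and its indexes small even when the natural identifier, plus its qualifying date and parent identifiers, would run past 100 characters.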
David J.