Re: pervasiveness of surrogate (also called synthetic) keys - Mailing list pgsql-general
| From | David Johnston |
|---|---|
| Subject | Re: pervasiveness of surrogate (also called synthetic) keys |
| Date | |
| Msg-id | B6BA5655-A7DF-4208-B3FA-5F40B5519C7F@yahoo.com |
| In response to | Re: pervasiveness of surrogate (also called synthetic) keys (Greg Smith <greg@2ndquadrant.com>) |
| Responses | Re: pervasiveness of surrogate (also called synthetic) keys |
| List | pgsql-general |
On May 3, 2011, at 22:03, Greg Smith <greg@2ndquadrant.com> wrote:

> Merlin Moncure wrote:
>> If your data modeler made the assumption that a MAC is unique (a mistake, obviously), at least the other tables are protected from violations of that assumption because the database would reject them with an error, which is a Good Thing. Without a uniqueness constraint you now have ambiguous data, which is a Very Bad Thing.
>
> With a uniqueness constraint in this situation, the unexpected data--a row with a non-unique MAC--will be rejected and possibly lost when the insertion happens. You say that's a good thing; plenty of people will say that's the worst possible thing that can happen. When dealing with external data, it's often impossible to know everything you're going to see later at design time. Approaching that problem with the idea that you're going to lose any data that doesn't fit into the original model is not what everyone finds reasonable behavior.
>
> I don't think it's possible to decide in a generic way which of these is the better approach: to reject unexpected data and force the problem back at the application immediately (commit failure), or to accept it because you're using a surrogate key and discover the problems down the line. Both are valid approaches with a very different type of risk associated with them. I think it's fair to say that real-world data is not always well known enough at design time to follow the idea you're suggesting, though, and that does factor into why there is such a preference for surrogate keys in the industry.
>
> --
> Greg Smith   2ndQuadrant US   greg@2ndQuadrant.com   Baltimore, MD
> PostgreSQL Training, Services, and 24x7 Support   www.2ndQuadrant.us
> "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

If you implicitly expect MAC to be unique but do not enforce it, then you are likely to write queries that rely upon that uniqueness; if you do enforce it, then the constraint failure occurs anyway.

A scalar sub-query that links via the MAC will fail when the duplicate data is encountered, and normal queries will return too many records. A detail report may make the problem obvious, but if you are summarizing the data the specific offending record is going to require some effort to find.

I guess if you are the compromising type you can loosely enforce the uniqueness by running a check query periodically to see whether supposedly unique values have been duplicated.

I agree there is no right answer - the designer needs to make trade-offs - but I'd rather reject new data and leave the system in a status-quo stable state instead of introducing invalid data and putting the system into a state where it requires effort to get it functioning again. If you accept the invalid data, the likely scenario when something breaks is that someone finds the offending record and removes it until the application and database can be fixed properly - which is where we are at with validation. The common exception is where identifiers are reused over time and you remove the old record in order to keep/allow the newer record to remain.
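To make the contrast concrete, here is a minimal sketch (PostgreSQL syntax, with a hypothetical "devices" table) of up-front enforcement versus after-the-fact checking:

```sql
-- Hypothetical example: declared uniqueness vs. periodic checking.

-- With a declared constraint, a duplicate MAC is rejected at insert time.
CREATE TABLE devices (
    device_id serial  PRIMARY KEY,        -- surrogate key
    mac       macaddr NOT NULL UNIQUE,    -- the attribute assumed to be unique
    label     text
);

INSERT INTO devices (mac, label) VALUES ('08:00:2b:01:02:03', 'switch A');
INSERT INTO devices (mac, label) VALUES ('08:00:2b:01:02:03', 'switch B');
-- ERROR:  duplicate key value violates unique constraint "devices_mac_key"

-- Without the constraint the second row is silently accepted, and a scalar
-- sub-query that relies on the implied uniqueness fails later instead, e.g.:
--   SELECT l.seen_at,
--          (SELECT d.label FROM devices d WHERE d.mac = l.mac)
--   FROM   traffic_log l;
--   ERROR:  more than one row returned by a subquery used as an expression

-- The periodic "loose enforcement" check mentioned above: report any
-- supposedly unique values that have been duplicated.
SELECT mac, count(*) AS copies
FROM   devices
GROUP  BY mac
HAVING count(*) > 1;
```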
On a tangential course, something I've started considering is a setup whereby you basically have two identifiers for a record. One is end-user facing and updatable, whereas the other is static and used in intra-table relations. You can create a new record with the same user-facing id as an existing id, but the existing id will be replaced with its system id. This is useful when users will be using the id often and it can reasonably be assumed to be unique over a moderate period of time (say a year). Invoice numbers and customer numbers are two common examples. The lookup id itself may require additional fields in order to qualify as a primary (natural) key, but the static key wants to be a single field. Often simply putting a date with the original id (and parent identifiers) is sufficient due to the infrequency of updates. The downside is that with string-based parent identifiers the PK value can be quite long. I currently have PKs of 40-50 characters, but during my new design my first pass on a couple of tables indicated keys of more than 100 characters.

Are there any rules of thumb on the performance of a PK as a function of key length? I like using varchar-based identifiers since I tend to query tables directly, and writing WHERE clauses is much easier if you can avoid the joins. I'm likely better off creating views and querying those, but I am still curious about any basic thoughts on having a 100+ character primary key.
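A rough sketch of the two-identifier idea, with hypothetical invoice tables and names: the user-facing number only has to stay unique within its period, while the narrow static id carries the relationships so the long composite varchar key never has to be propagated into child tables.

```sql
-- Rough sketch of the two-identifier idea (hypothetical names).
CREATE TABLE invoice (
    invoice_id     bigserial   PRIMARY KEY,  -- static, system-facing, used in relations
    invoice_number varchar(20) NOT NULL,     -- user-facing, reusable over time
    issued_on      date        NOT NULL,
    UNIQUE (invoice_number, issued_on)       -- the natural key needs the date qualifier
);

CREATE TABLE invoice_line (
    invoice_id bigint        NOT NULL REFERENCES invoice (invoice_id),
    line_no    integer       NOT NULL,
    amount     numeric(12,2) NOT NULL,
    PRIMARY KEY (invoice_id, line_no)
);

-- Users look rows up by the familiar identifier...
SELECT invoice_id
FROM   invoice
WHERE  invoice_number = 'INV-0042'
AND    issued_on >= date '2011-01-01';

-- ...and a view keeps ad-hoc WHERE clauses simple without exposing the joins.
CREATE VIEW invoice_line_v AS
SELECT i.invoice_number, i.issued_on, l.line_no, l.amount
FROM   invoice i
JOIN   invoice_line l USING (invoice_id);
```

The narrow integer key also keeps the child table and its indexes small even when the natural identifier, plus its qualifying date and parent identifiers, would run past 100 characters.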
David J.