Thread: FW: Surrogate keys (Was: enums)
Ooops, fat-finger'd -hackers... -----Original Message----- Adding -hackers back to the list. > From: Leandro Guimarães Faria Corcete Dutra > Em Seg, 2006-01-16 às 12:52 -0600, Jim C. Nasby escreveu: > > On Sat, Jan 14, 2006 at 07:28:21PM +0900, Michael Glaesemann wrote: > > > > > > For UPDATEs and INSERTs, the "proper" primary key also > needs to be > > > checked, but keys are used for more than just checking > uniqueness: > > > they're also often used in JOINs. Joining against a > single integer > > > I'd think it quite a different proposition (I'd think > faster in terms > > > of performance) than joining against, say, a text column or a > > > composite key. > > How different is that? Comparing two ints is much, much faster than comparing two text fields. For a small number of comparisons, it doesn't matter.When you're joining tables together, it's a different story. > > a) the optimizer does a really poor job on multi-column > index statistics > > Then it should eventually be fixed? It's on the to-do, but it's not an easy nut to crack. > > b) If each parent record will have many children, the space > savings from > > using a surrogate key can be quite large > > Only where the surrogate is significantly smaller than > the natural? #define significant Here's a real-life example: the primary table for stats.distributed.net has about 120M rows. One field in that table (participant_id)links back to the participant table; it's an int. If instead we used participant_name and that averaged 8characters in length, that would grow the main table by 1GB (8 chars takes 8 bytes instead of 4, plus there's the varlenaheader of 4 bytes). The machine that stats runs on has 4G of memory, so cutting 1G of wasted space out of that tablehelps quite a bit. (In actuality, there isn't participant_name... participants are identified by email address (not a great idea, but I wasn'taround when that was chosen). As you can imagine, email addresses are substantially longer than 4 bytes. When we normalizedemail out of that main table things got substantially faster. That was a number of years ago, so the table wasprobably 15-25% of it's current size, but it still made a huge difference.) > > c) depending on how you view things, putting actual keys > all over the > > place is denormalized > > How come? See my other reply... :) > > Generally, I just use surrogate keys for everything unless > performance > > dictates something else. > > Shouldn't it be the other way round, for the user's sake? Why should it? It's trivial to create views that abstract surrogate keys out, and if you really want to you can even makethe views updatable. But here's two other things to consider: In many cases you can't define a single field as a unique key. So you end up with having to add many extra keys to all yourjoin clauses. Not very friendly, and prone to error. Not every language has equal support for text comparisons (and in my experience, almost all real keys are mostly text). -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
>> Comparing two ints is much, much faster than comparing two text >> fields. For a small number of comparisons, it doesn't matter. When >> you're joining tables together, it's a different story. > > That is where data independence would come handy... like a better enum, >with possreps and hidden implementation. > > Forgive me my ignorance, but are ints inherently faster to compare than >strings, or is it just an implementation detail? Ideally, if this is so >a fully data-independent system would create a hash behind the back of >user in order to get performance. The CPU can do an integer comparison with one instruction; it can't do that with a text string. (Well, theoretically if the string was 3/4 bytes exactly (dependin on null termination) you could compare just as fast, but I'm pretty certain that no compiler is that fancy.) >> Here's a real-life example: the primary table for >> stats.distributed.net has about 120M rows. One field in that table >> (participant_id) links back to the participant table; it's an int. If >> instead we used participant_name and that averaged 8 characters in >> length, that would grow the main table by 1GB (8 chars takes 8 bytes >> instead of 4, plus there's the varlena header of 4 bytes). The machine >> that stats runs on has 4G of memory, so cutting 1G of wasted space out >> of that table helps quite a bit. > > OK, hardly a typical example. As I think I left clear, my problem is >not using surrogate keys, but using them by default, or even >exclusively. No? It's certainly not uncommon to have tables with 100M+ rows. And keep in mind that this applies to every row of every table that has foreign keys. I'd bet it's actually common to save 1G or more with surrogate keys in moderately sized databases. Of course, you do have to be intelligent here, too. The only key defined on the table in my example is participant_id, project_id, date; there is no surrogate key because there's no real reason to have one. >> (In actuality, there isn't participant_name... participants are >> identified by email address (not a great idea, but I wasn't around >> when that was chosen). As you can imagine, email addresses are >> substantially longer than 4 bytes. When we normalized email out of >> that main table things got substantially faster. That was a number of >> years ago, so the table was probably 15-25% of it's current size, but >> it still made a huge difference.) > > This isn't normalisation at all, as far as I understand it. It is just I don't have the rules of normalization memorized enough to know what form this breaks, but I'm 99% certain it breaks at least one of them. Look at it this way: if someone wants to change their email address, best case scenario is that you have cascading RI setup and it updates thousands of rows in that table. Worst case scenario, you just de-linked a whole bunch of data. But with a surrogate key, all you have to do is update one row in one table and you're done. >that we don't have data independence... so you had to expose an >implementation detail? Expose to what? The application? First, this is a pretty minor thing to expose; second, if it's that big a concern you can completely hide it by using a view. But the reality is, dealing with a numeric ID can be a heck of a lot easier than an email address. Look at URLs that embbed one versus the other for a good example. >> Why should it? It's trivial to create views that abstract surrogate >> keys out, and if you really want to you can even make the views >> updatable. But here's two other things to consider: > > These views, in heavy querying environments, can be prohibitive. "Normalize 'til it hurts; denormalize 'til it works." Yes, the added overhead of rules for updates/inserts/deletes could start to add up in performance-critical code. But if performance is that critical you're far more likely to run into other bottlenecks first. And worst-case, you abstract behind a stored procedure that just has the right queries hard-coded. As for select-only views you'll have a hard time showing any meaningful performance penalty. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Em Qua, 2006-01-18 às 17:22 -0600, Jim C. Nasby escreveu: > > > > Forgive me my ignorance, but are ints inherently faster to compare than > >strings, or is it just an implementation detail? Ideally, if this is so > >a fully data-independent system would create a hash behind the back of > >user in order to get performance. > > The CPU can do an integer comparison with one instruction; it can't do > that with a text string. OK. Again, data independence should be the goal here. > > OK, hardly a typical example. As I think I left clear, my problem is > >not using surrogate keys, but using them by default, or even > >exclusively. > > No? It's certainly not uncommon to have tables with 100M+ rows. No, but neither are they *that* common. Certainly, lots of database have a few of them. But then, they have dozens, hundreds, thousands of much smaller tables. > And keep > in mind that this applies to every row of every table that has foreign > keys. I'd bet it's actually common to save 1G or more with surrogate > keys in moderately sized databases. Only if you have quite some children, because otherwise, in the main tables, the surrogate keys add a field, an index and a sequence to an otherwise smaller table and index. > Of course, you do have to be intelligent here, too. The only key defined > on the table in my example is participant_id, project_id, date; there is > no surrogate key because there's no real reason to have one. Quite. > >> (In actuality, there isn't participant_name... participants are > >> identified by email address (not a great idea, but I wasn't around > >> when that was chosen). As you can imagine, email addresses are > >> substantially longer than 4 bytes. When we normalized email out of > >> that main table things got substantially faster. That was a number of > >> years ago, so the table was probably 15-25% of it's current size, but > >> it still made a huge difference.) > > > > This isn't normalisation at all, as far as I understand it. It is just > > I don't have the rules of normalization memorized enough to know what > form this breaks, but I'm 99% certain it breaks at least one of them. No, never. Normalisation is about eliminating redundancy and, therefore, update anomalies. Making all the table dependent on only the keys and the whole keys, by projecting relations to eliminate entity mixups. What you mention is actually exposing an implementation detail, namely an integer that serves as a hash of the key. > Look at it this way: if someone wants to change their email address, > best case scenario is that you have cascading RI setup and it updates > thousands of rows in that table. Worst case scenario, you just de-linked > a whole bunch of data. But with a surrogate key, all you have to do is > update one row in one table and you're done. OK, if you have lots of linked data. But most tables are really dead ends. > >that we don't have data independence... so you had to expose an > >implementation detail? > > Expose to what? The application? First, this is a pretty minor thing to > expose; second, if it's that big a concern you can completely hide it by > using a view. As someone said, you end up with ids everywhere, and no user-understandable data at all... > But the reality is, dealing with a numeric ID can be a heck of a lot > easier than an email address. Look at URLs that embbed one versus the > other for a good example. Again, implementation details... levels mixup. > >> Why should it? It's trivial to create views that abstract surrogate > >> keys out, and if you really want to you can even make the views > >> updatable. But here's two other things to consider: > > > > These views, in heavy querying environments, can be prohibitive. > > "Normalize 'til it hurts; denormalize 'til it works." Lack of data implementation biting us again. > Yes, the added overhead of rules for updates/inserts/deletes could start > to add up in performance-critical code. But if performance is that > critical you're far more likely to run into other bottlenecks first. And > worst-case, you abstract behind a stored procedure that just has the > right queries hard-coded. > > As for select-only views you'll have a hard time showing any meaningful > performance penalty. Yet real user-defined data types could make it all much simpler. -- +55 (11) 5685 2219 xmpp:leandrod@jabber.org +55 (11) 9406 7191 Yahoo!: lgcdutra +55 (11) 5686 9607 MSN: leandro@dutra.fastmail.fm +55 (11) 4390 5383 ICQ/AIM: 61287803
On Jan 19, 2006, at 9:31 , Leandro Guimarães Faria Corcete Dutra wrote: > OK. Again, data independence should be the goal here. <snip /> > Again, implementation details... levels mixup. <snip /> > Lack of data implementation biting us again. <snip /> > Yet real user-defined data types could make it all much simpler. Again, again, and again, patches welcome! PostgreSQL is an open- source project, and people contribute in a variety of ways, two of which include submitting code and sponsoring others to develop code. If you look at the todo list, there are *lots* of things people would like to see improved in PostgreSQL, but the pace at which PostgreSQL is improved and what is improved is driven in large part by what people are willing to do themselves or sponsor. If these are things you're interested in (and it certainly appears you are), why not contribute? Michael Glaesemann grzm myrealbox com
Maybe it goes better into Advocacy or something, but I have found a quote by database big-wigs that I strongly disagree with: From: http://www.db.ucsd.edu/cse132B/Thirdmanifesto.pdf We have this. "PROPOSITION 1.4: Unique Identifiers (UIDs) for records should be assigned by the DBMS only if a user-defined primary keyis not available. Second generation systems support the notion of a primary key, which is a user-assigned unique identifier. If a primarykey exists for a collection that is known never to change, for example social security number, student registrationnumber, or employee number, then no additional system-assigned UID is required. An immutable primary key hasan extra advantage over a system-assigned unique identifier because it has a natural, human readable meaning. Consequently,in data interchange or debugging this may be an advantage. If no primary key is available for a collection,then it is imperative that a system-assigned UID be provided. Because SQL supports update through a cursor, secondgeneration systems must be able to update the last record retrieved, and this is only possible if it can be uniquelyidentified. If no primary key serves this purpose, the system must include an extra UID. Therefore, several secondgeneration systems already obey this proposition. Moreover, as will be noted in Proposition 2.3, some collections,e.g. views, do not necessarily have system assigned UIDs, so building a system that requires them is likely tobe proven undesirable. We close our discussion on Tenet 1 with a final proposition that deals with the notion of rules." This is a bad idea. Let's take the example of a Social Security Number. Not everyone has one: http://www.ssa.gov/pubs/10002.html#how2 If people do have one, they can definitely change it. If someone has stolen a SSN, then the wronged party is able to gettheir SSN changed: http://101-identitytheft.com/ssn.htm The odds of this happening are low, but if you cannot handle it, then the damage caused is considerable. Now what happens if you want to have customers outside of the USA? {Don't worry, we'll never go global...} I hope that myobjections are very plain and obvious. The primary key should be immutable, meaning that its value should not be changed during the course of normal operationsof the database. What natural key is immutable? The answer is that such an attribute does not exist. To usethem for such a purpose is begging for trouble. I saw the argument that there is a great volume of space wasted by adding a column that does not naturally occur in the data. That argument is simply absurd. Consider a database with 10 billion rows of data in it. Each of those tables getsan 8 byte primary key added for every row, resulting in 80 GB consumed. The cost of 80 GB is perhaps $200. With a databasethat large (where the extra space consumed by an artificial key column has a cost that can easily be measured) theodds of a problem arising due to a natural column changing its value are huge. The cost of such a tragedy is certainlymore than the $200 pittance! If there is an argument that we also have the parent key values propagated into the child tables as foreign keys, that argumenthas no merit. The other attribute that would have been chosen would also be propagated. And so (for instance) thereis no savings to propagating a SSN field into child tables verses propagating an 8 byte integer. I also saw an argument that the propagated ID values are confusing to end-users. That is the fault of the database designerwho game them a stupid name. If they were things like InvoiceID and LineItemID then there will not be the same sortof confusion. The meaning and purpose of the column is immediately apparent. As an alternative, the ubiquitous OIDname for a column on a table is also very transparent. Of course, when it is used in a foreign key, it must be givena role name to avoid confusion in that case. At any rate, the use of natural keys is a mistake made by people who have never had to deal with very large database systems. IMO-YMMV. > -----Original Message----- > From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers- > owner@postgresql.org] On Behalf Of Leandro Guimarães Faria Corcete Dutra > Sent: Wednesday, January 18, 2006 4:31 PM > To: Jim C. Nasby > Cc: pgsql-hackers@postgresql.org > Subject: Re: FW: [HACKERS] Surrogate keys (Was: enums) > > Em Qua, 2006-01-18 às 17:22 -0600, Jim C. Nasby escreveu: > > > > > > Forgive me my ignorance, but are ints inherently faster to compare > than > > >strings, or is it just an implementation detail? Ideally, if this is > so > > >a fully data-independent system would create a hash behind the back of > > >user in order to get performance. > > > > The CPU can do an integer comparison with one instruction; it can't do > > that with a text string. > > OK. Again, data independence should be the goal here. > > > > > OK, hardly a typical example. As I think I left clear, my problem > is > > >not using surrogate keys, but using them by default, or even > > >exclusively. > > > > No? It's certainly not uncommon to have tables with 100M+ rows. > > No, but neither are they *that* common. > > Certainly, lots of database have a few of them. But then, they have > dozens, hundreds, thousands of much smaller tables. > > > > And keep > > in mind that this applies to every row of every table that has foreign > > keys. I'd bet it's actually common to save 1G or more with surrogate > > keys in moderately sized databases. > > Only if you have quite some children, because otherwise, in the main > tables, the surrogate keys add a field, an index and a sequence to an > otherwise smaller table and index. > > > > Of course, you do have to be intelligent here, too. The only key defined > > on the table in my example is participant_id, project_id, date; there is > > no surrogate key because there's no real reason to have one. > > Quite. > > > > >> (In actuality, there isn't participant_name... participants are > > >> identified by email address (not a great idea, but I wasn't around > > >> when that was chosen). As you can imagine, email addresses are > > >> substantially longer than 4 bytes. When we normalized email out of > > >> that main table things got substantially faster. That was a number of > > >> years ago, so the table was probably 15-25% of it's current size, but > > >> it still made a huge difference.) > > > > > > This isn't normalisation at all, as far as I understand it. It is > just > > > > I don't have the rules of normalization memorized enough to know what > > form this breaks, but I'm 99% certain it breaks at least one of them. > > No, never. > > Normalisation is about eliminating redundancy and, therefore, update > anomalies. Making all the table dependent on only the keys and the > whole keys, by projecting relations to eliminate entity mixups. > > What you mention is actually exposing an implementation detail, > namely > an integer that serves as a hash of the key. > > > > Look at it this way: if someone wants to change their email address, > > best case scenario is that you have cascading RI setup and it updates > > thousands of rows in that table. Worst case scenario, you just de-linked > > a whole bunch of data. But with a surrogate key, all you have to do is > > update one row in one table and you're done. > > OK, if you have lots of linked data. But most tables are really > dead > ends. > > > > >that we don't have data independence... so you had to expose an > > >implementation detail? > > > > Expose to what? The application? First, this is a pretty minor thing to > > expose; second, if it's that big a concern you can completely hide it by > > using a view. > > As someone said, you end up with ids everywhere, and no > user-understandable data at all... > > > > But the reality is, dealing with a numeric ID can be a heck of a lot > > easier than an email address. Look at URLs that embbed one versus the > > other for a good example. > > Again, implementation details... levels mixup. > > > > >> Why should it? It's trivial to create views that abstract surrogate > > >> keys out, and if you really want to you can even make the views > > >> updatable. But here's two other things to consider: > > > > > > These views, in heavy querying environments, can be prohibitive. > > > > "Normalize 'til it hurts; denormalize 'til it works." > > Lack of data implementation biting us again. > > > > Yes, the added overhead of rules for updates/inserts/deletes could start > > to add up in performance-critical code. But if performance is that > > critical you're far more likely to run into other bottlenecks first. And > > worst-case, you abstract behind a stored procedure that just has the > > right queries hard-coded. > > > > As for select-only views you'll have a hard time showing any meaningful > > performance penalty. > > Yet real user-defined data types could make it all much simpler. > > -- > +55 (11) 5685 2219 xmpp:leandrod@jabber.org > +55 (11) 9406 7191 Yahoo!: lgcdutra > +55 (11) 5686 9607 MSN: leandro@dutra.fastmail.fm > +55 (11) 4390 5383 ICQ/AIM: 61287803 > > > ---------------------------(end of broadcast)--------------------------- > TIP 1: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly
On Jan 19, 2006, at 10:34 , Dann Corbit wrote: > http://www.db.ucsd.edu/cse132B/Thirdmanifesto.pdf > "PROPOSITION 1.4: Unique Identifiers (UIDs) for records should be > assigned by the DBMS only if a user-defined primary key is not > available. <snip /> > An immutable primary key has an extra advantage over a system- > assigned unique identifier because it has a natural, human readable > meaning. Consequently, in data interchange or debugging this may be > an advantage. If no primary key is available for a collection, > then it is imperative that a system-assigned UID be provided. <snip /> Dann Corbit: > The primary key should be immutable, meaning that its value should > not be changed during the course of normal operations of the > database. What natural key is immutable? The answer is that such > an attribute does not exist. To use them for such a purpose is > begging for trouble. As far as I can tell, the only difference between your position, Dann, and Date and Darwen's, is that you think no natural key is immutable. If you *could* find an immutable natural key, would it be an acceptable key for you? Date and Darwen say explicitly that if no immutable (natural) (primary) key is available a system-assigned UID is required. If you think there is no immutable natural key available, Darwen and Date would agree that you should use a system- generated key. Or do you think I'm misreading you or The Third Manifesto? Michael Glaesemann grzm myrealbox com
Dann, > The primary key should be immutable, meaning that its value should not be > changed during the course of normal operations of the database. Why? I don't find this statement to be self-evident. Why would we have ON UPDATE CASCADE if keys didn't change sometimes? > At any rate, the use of natural keys is a mistake made by people who have > never had to deal with very large database systems. Oh, I guess I'm dumb then. The biggest database system I ever had to deal with was merely 5 TB ... Anyway, my opinion on this, in detail, will be on the ITToolBox blog. You can argue with me there. -- Josh Berkus Aglio Database Solutions San Francisco
Michael Glaesemann <grzm@myrealbox.com> writes: > As far as I can tell, the only difference between your position, > Dann, and Date and Darwen's, is that you think no natural key is > immutable. D&D's examples of "natural" keys are worth a second look though: >> If a primary key exists for a collection that is known never to change, >> for example social security number, student registration number, or >> employee number, then no additional system-assigned UID is required. The problem with SSN is that somebody other than you controls it. If you are the college registrar, then you control the student's registration number, and you don't have to change it. In fact, guess what: you probably generated it in the same way as a surrogate key. I'd argue that all of these are in reality the exact same thing as a surrogate key --- from the point of view of the issuing authority. But from anyone else's point of view, they are external data and you can't hang your own database design on the assumption that they won't change. regards, tom lane
> >> If a primary key exists for a collection that is known never to change, > >> for example social security number, student registration number, or > >> employee number, then no additional system-assigned UID is required. In point of fact Social security numbers *can* change. -- greg
Em Qui, 2006-01-19 às 09:54 +0900, Michael Glaesemann escreveu: > On Jan 19, 2006, at 9:31 , Leandro Guimarães Faria Corcete Dutra wrote: > > If these are things > you're interested in (and it certainly appears you are), why not > contribute? 'Cause unfortunately am not a hacker... -- +55 (11) 5685 2219 xmpp:leandrod@jabber.org +55 (11) 9406 7191 Yahoo!: lgcdutra +55 (11) 5686 9607 MSN: leandro@dutra.fastmail.fm +55 (11) 4390 5383 ICQ/AIM: 61287803
On Thu, Jan 19, 2006 at 12:06:41AM -0500, Tom Lane wrote: > Michael Glaesemann <grzm@myrealbox.com> writes: > > As far as I can tell, the only difference between your position, > > Dann, and Date and Darwen's, is that you think no natural key is > > immutable. > > D&D's examples of "natural" keys are worth a second look though: > > >> If a primary key exists for a collection that is known never to > >> change, for example social security number, student registration > >> number, or employee number, then no additional system-assigned > >> UID is required. > > The problem with SSN is that somebody other than you controls it. No, that's not the big problem. The big problem is that it's very likely illegal for you to use it for anything unless you happen to be the Social Security Administration. > If you are the college registrar, then you control the student's > registration number, and you don't have to change it. In fact, > guess what: you probably generated it in the same way as a surrogate > key. True. > I'd argue that all of these are in reality the exact same thing as a > surrogate key --- from the point of view of the issuing authority. > But from anyone else's point of view, they are external data and you > can't hang your own database design on the assumption that they > won't change. Right :) Cheers, D -- David Fetter david@fetter.org http://fetter.org/ phone: +1 415 235 3778 Remember to vote!
On Thu, Jan 19, 2006 at 12:50:52AM -0800, David Fetter wrote: > On Thu, Jan 19, 2006 at 12:06:41AM -0500, Tom Lane wrote: > No, that's not the big problem. The big problem is that it's very > likely illegal for you to use it for anything unless you happen to be > the Social Security Administration. Actually no. From reading here[1] it appears anyone is allowed to ask you your SSN and they can do what they like with it. What you're describing is more like the TFN in Australia. Apart from the fact you're not required to have one or provide it if asked, if you're not a bank, or share registrary or some other such institution, you're not allowed to ask for it, let alone store it. Medicare number the same, if you're not a health service provider, you can't ask for it. Anyway, this doesn't mean an SSN is a good key, for all sorts of other reasons people have already stated. [1] http://www.cpsr.org/prevsite/cpsr/privacy/ssn/ssn.faq.html#IsItIllegalToAsk Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
On Thu, Jan 19, 2006 at 00:06:41 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > The problem with SSN is that somebody other than you controls it. > If you are the college registrar, then you control the student's > registration number, and you don't have to change it. In fact, guess > what: you probably generated it in the same way as a surrogate key. I work for a University and even the numbers assigned by us get changed on a regular basis as it is very easy for people to get entered into the system multiple times. (And for a while campus ids were SSNs by default and we are still in the process of making them different for everyone.) There are several effectively surrogate keys (campus id and emplid), but they don't map 1 to 1 to real people. I believe we keep a history of campus ids, and delete emplids for duplicates.