Thread: DB design advice: lots of small tables?
Hello.

We are currently redesigning a medium/large office management web
application. There are 75 tables in our existing PostgreSQL database,
but that number is artificially low, due to some unfortunate design
choices.

The main culprits are two tables named "catalog" and "catalog_entries".
They contain all those data sets that the previous designer deemed too
small for a separate table, so now they are all stored together. The
values in catalog_entries are typically used to populate dropdown
select fields.

The catalog table (simplified):

  id          SERIAL   -- artificial record ID
  code        VARCHAR  -- unique text identifier
  description VARCHAR  -- (info only, otherwise unused)

The catalog_entries table (simplified):

  id         SERIAL   -- artificial record ID
  catalog_id INTEGER  -- parent catalog ID
  code       VARCHAR  -- unique (per catalog) text identifier
  rank       INTEGER  -- used for sorting the values
  text_short VARCHAR  -- text for display (short version)
  text_long  TEXT     -- text for display (long version)

Here are some examples of what the tables contain:

  Catalog: department
  Entries: it, sales, accounting, cases, ...

  Catalog: printers
  Entries: ma_color, pa_color, pa_black, pdf, ...

  Catalog: invoice_status
  Entries: open, locked, entered, booked, cancelled, ...

  Catalog: coverage
  Entries: national, regional, international, obsolete

Records in other tables reference the values in catalog_entries by id.
For example, the "invoices" table has a "status_id" column pointing to
a record in catalog_entries. Of course, this leads to possible
integrity issues (there is nothing to prevent an invoice record
referencing the "ma_color" value instead of "open" in its status_id
field).

There are 64 "catalogs" (data sets), in addition to the 75 real tables.

Now we have finally got the go-ahead to refactor this mess. Most of
the old "catalogs" will probably end up as separate tables. Others
might be replaced with ENUMs or booleans, especially the ones with
only 2-3 values.
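For concreteness, here is a sketch of what one refactored catalog
could look like (table and constraint names are illustrative, not from
our actual schema; the column layout is kept from above):

```sql
-- One dedicated lookup table per former catalog.
CREATE TABLE invoice_status (
    id         SERIAL PRIMARY KEY,       -- artificial record ID (for now)
    code       VARCHAR NOT NULL UNIQUE,  -- 'open', 'locked', 'entered', ...
    rank       INTEGER NOT NULL,         -- used for sorting the values
    text_short VARCHAR NOT NULL,
    text_long  TEXT
);

-- The referencing table gets a real foreign key, so the database now
-- rejects a status_id that points at, say, a printer entry.
ALTER TABLE invoices
    ADD CONSTRAINT invoices_status_fk
    FOREIGN KEY (status_id) REFERENCES invoice_status (id);

-- A small, fixed catalog could become an ENUM instead:
CREATE TYPE coverage AS ENUM
    ('national', 'regional', 'international', 'obsolete');
```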
The reason why I'm hesitating and asking for advice now is that after
refactoring, we'll end up with ~60 new tables, all of them rather
small and with practically identical columns. (Only five catalogs have
more than 20 entries; about half have five entries or less; five
catalogs have only two entries.)

So, my first main question would be: is it "normal" or desirable to
have that many tiny tables? And is it a problem that many of the
tables have the same (or similar) column definitions?

The second point is that we have redundant unique identifiers in
catalog_entries (id and code). The code value is used by the
application whenever we need to find one of the values. For example,
for a query like "show all open invoices", we would either -

1) select the id from catalog_entries where catalog_id refers to the
   "invoice_status" catalog and the code is "open"
2) use that id to filter select * from invoices

- or do the same in one query using joins. This pattern occurs
hundreds of times in the application code. From a programming
viewpoint, having all-text ids would make things a lot simpler and
cleaner (i.e., keep only the "code" column).

The "id" column was used (AFAIK) to reduce the storage size. Most of
the data tables have less than 100k records, so the overhead wouldn't
be too dramatic, but a few tables (~10) have more; one of them has
1.2m records. These tables can also refer to the old catalog_entries
table from more than one column. Changing all these references from
INT to VARCHAR would increase the DB size, and probably make scans
less performant. I'm not sure how indexes on these columns would be
affected.

To summarize, the second question is whether we should ditch the
artificial numeric IDs and just use the "code" column as the primary
key in the new tiny tables.

Thanks in advance for your advice.

crl
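To make the pattern concrete, the two variants of the "show all open
invoices" lookup might look like this (a sketch; table and column
names other than catalog/catalog_entries are assumed):

```sql
-- Variant 1: surrogate ids -- two lookups, or one query with joins.
SELECT i.*
  FROM invoices i
  JOIN catalog_entries ce ON ce.id = i.status_id
  JOIN catalog c          ON c.id  = ce.catalog_id
 WHERE c.code  = 'invoice_status'
   AND ce.code = 'open';

-- Variant 2: natural key -- the code is stored directly, no join needed.
SELECT *
  FROM invoices
 WHERE status = 'open';
```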
lender <crlender@gmail.com> wrote:

> So, my first main question would be: is it "normal" or desirable
> to have that many tiny tables?

Yes.

> And is it a problem that many of the tables have the same (or a
> similar) column definitions?

No.

> To summarize, the second question is whether we should ditch the
> artificial numeric IDs and just use the "code" column as primary
> key in the new tiny tables.

This one becomes more of a judgment call, but I generally lean
toward using the visible unique identifier (your "code") as the
primary key and ditching a numeric "id". The only two
contraindications would be if the "code" values have a significant
chance of being changed, or if you will have extreme numbers
(billions) of narrow rows which must reference the table.

<soapbox-rant>
I occasionally hear someone maintaining that having a meaningless
sequential ID column as the primary key of each table is required
by the relational model. At those moments I swear I can actually
hear E.F. Codd turning in his grave. It was a requirement of old
pre-relational databases from the '60s and '70s, and some equally
primitive ORMs still like to have one, but a big point of
relational databases is that you don't need to navigate artificial
linkages between tables -- the relationship can generally be
determined by the fact that they contain common data elements. If
these are natural, meaningful values which are visible to the user,
it often allows complex queries to be much better optimized, since
they aren't forced through a single navigational linkage.
</soapbox-rant>

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Kevin Grittner, 15.03.2013 14:36:
> <soapbox-rant>
> I occasionally hear someone maintaining that having a meaningless
> sequential ID column as the primary key of each table is required
> by the relational model. At those moments I swear I can actually
> hear E.F. Codd turning in his grave. It was a requirement of old
> pre-relational databases from the '60s and '70s, and some equally
> primitive ORMs still like to have one, but a big point of
> relational databases is that you don't need to navigate artificial
> linkages between tables -- the relationship can generally be
> determined by the fact that they contain common data elements. If
> these are natural, meaningful values which are visible to the user
> it often allows complex queries to be much better optimized, since
> they aren't forced through a single navigational linkage.
> </soapbox-rant>

You might be interested in a discussion regarding this topic on
comp.databases.theory:

https://groups.google.com/forum/?fromgroups=#!topic/comp.databases.theory/mqZZw3ojnjA
On 2013-03-15, at 09:58, Thomas Kellerer wrote:

> Kevin Grittner, 15.03.2013 14:36:
>> <soapbox-rant>
>> I occasionally hear someone maintaining that having a meaningless
>> sequential ID column as the primary key of each table is required
>> by the relational model. [...]
>> </soapbox-rant>
>
> You might be interested in a discussion regarding this topic on
> comp.databases.theory:
>
> https://groups.google.com/forum/?fromgroups=#!topic/comp.databases.theory/mqZZw3ojnjA

Along those lines, I love what Kenneth Downs says on his blog, The
Database Programmer. Start at
http://database-programmer.blogspot.ca/2010/11/database-skills.html
and look for "Understanding Primary Keys, Foreign Keys and
Constraints".

Ken suggests having a data dictionary and generating the schema from
the dictionary. He has a PHP tool, being rewritten but with very slow
progress.

Keeping a meaningless ID is not a problem in and of itself. It makes
it easier to edit records from the UI, since you can reference the ID
in the UPDATE and DELETE statements, without fear of colliding with
anything else. It's not so much a problem on small lookup tables, but
on larger entities (people, companies, etc), referencing through the
ID is much, much easier.

Hope that helps!
François Beausoleil
Thanks everybody for your comments.

On 2013-03-15 16:03, François Beausoleil wrote:
> Keeping a meaningless ID is not a problem in and of itself. It makes
> it easier to edit records from the UI, since you can reference the ID
> in the UPDATE and DELETE statements, without fear of colliding with
> anything else. It's not so much a problem on small lookup tables, but
> on larger entities (people, companies, etc), referencing through the
> ID is much, much easier.

I'm not so much concerned with giving records in a large data set
surrogate IDs. There is often no good candidate for a natural key, or
the candidates aren't static enough. The small tables I mentioned
earlier all have clear and very static keys (the "code" column). I'm
only concerned about the practical consequences of introducing strings
where there used to be integers.

To give a practical example, this is how it looks with artificial IDs:

Table documents:

   id   | name        | type_id
 -------+-------------+---------
  62307 | Example.odt | 413

Table document_types:

  id  | code                             | text_short
 -----+----------------------------------+---------------------------
  413 | information_disclosure_statement | Information Disclosure [...]

Using the natural key, it would look like this:

Table documents:

  id | name        | type
 ----+-------------+----------------------------------
  23 | Example.odt | information_disclosure_statement

Table document_types:

  code                             | text_short
 ----------------------------------+---------------------------
  information_disclosure_statement | Information Disclosure [...]

(Admittedly, "information_disclosure_statement" is one of the longer
codes we use. The average is about 14 characters; the longest is 38
characters.)

Now, what if we have hundreds of thousands of records in the
"documents" table? Apart from the increased storage requirements, will
scanning the table take noticeably longer? Will the indexes suffer?
Will creating, updating, importing, dumping, restoring etc. take
(much) longer?
Comparing two integers is computationally less expensive than
comparing two variable-length strings, of course, but I have no
empirical notion of how much of a performance hit to expect. I know
that these questions cannot be answered with any accuracy without
knowing all the details, but that's the type of thing that has me a
little worried right now.

I'm fine with getting a little less performance; that should be mostly
offset by the other changes and improvements we're making. I just
don't want to introduce a fundamental mistake at this stage.

Thanks again,
crl
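Spelled out as DDL, the natural-key variant of the documents example
above would be something like this (a sketch; names and lengths taken
from the example, the FK index comment is my own reading of the
trade-off being asked about):

```sql
CREATE TABLE document_types (
    code       VARCHAR(40) PRIMARY KEY,  -- e.g. 'information_disclosure_statement'
    text_short VARCHAR NOT NULL
);

CREATE TABLE documents (
    id   SERIAL PRIMARY KEY,
    name VARCHAR NOT NULL,
    type VARCHAR(40) NOT NULL REFERENCES document_types (code)
);

-- The "type" column, and any index on it, now stores the full code
-- string in every documents row instead of a 4-byte integer.
```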
On 03/15/2013 08:36 AM, Kevin Grittner wrote:

> I occasionally hear someone maintaining that having a meaningless
> sequential ID column as the primary key of each table is required
> by the relational model.

You know, I've heard you mention this a couple times, and I still
don't understand why you maintain such a position. Artificial the
linkages may be, but unless I'm missing something, the level of
abstraction is often necessary to avoid excessive storage
requirements and bulky foreign keys.

For me, it comes down to two things:

1. Copy a "natural" arbitrary-length text string possibly millions of
   times, or use a 32/64-bit integer lookup value.
2. Use multiple natural columns in a primary key, necessitating
   copying all columns to child tables for foreign keys, or use one
   surrogate key for both.

Yes, it complicates the schema. Possibly needlessly so. But until
someone comes up with a database storage method that automatically
deduplicates stored data, I can never advocate using arbitrary
strings as natural keys, no matter how unique and meaningful they
are.

Maybe I just say that because I was burned by just such a column in a
previous engagement. It was 64 characters of arbitrary text, and was
used as a lookup value for dozens of tables. Had it been mapped to a
"meaningless" surrogate key, several tables would have been halved
(or more) in size. Don't even get me started on indexing that
horrible monstrosity in every table it lived in.

Small tables? Yeah, whatever. I don't care. But anything that has the
potential to be duplicated millions of times? You better bet that's
going to be a serial identifier.

Now, you've been doing this longer than I have, in a wider array of
contexts, so you clearly have some perspective I don't. From where
I'm sitting though, I don't get the barely suppressed rage. ;)

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com

______________________________________________
See http://www.peak6.com/email_disclaimer/ for terms and conditions
related to this email
On 2013-03-15, lender <crlender@gmail.com> wrote:
> Hello.
>
> We are currently redesigning a medium/large office management web
> application. There are 75 tables in our existing PostgreSQL database,
> but that number is artificially low, due to some unfortunate design
> choices.
>
> The main culprits are two tables named "catalog" and "catalog_entries".
> They contain all those data sets that the previous designer deemed too
> small for a separate table, so now they are all stored together. The
> values in catalog_entries are typically used to populate dropdown
> select fields.
>
> So, my first main question would be: is it "normal" or desirable to
> have that many tiny tables? And is it a problem that many of the
> tables have the same (or similar) column definitions?

Dunno about "normal", but certainly "Normal" (as in "-form"). No
problem.

> The second point is that we have redundant unique identifiers in
> catalog_entries (id and code). The code value is used by the
> application whenever we need to find one of the values. For example,
> for a query like "show all open invoices", we would either -
>
> 1) select the id from catalog_entries where catalog_id refers to the
>    "invoice_status" catalog and the code is "open"
> 2) use that id to filter select * from invoices
>
> - or do the same in one query using joins. This pattern occurs
> hundreds of times in the application code. From a programming
> viewpoint, having all-text ids would make things a lot simpler and
> cleaner (i.e., keep only the "code" column).
>
> The "id" column was used (AFAIK) to reduce the storage size. Most of
> the data tables have less than 100k records, so the overhead wouldn't
> be too dramatic, but a few tables (~10) have more; one of them has
> 1.2m records. These tables can also refer to the old catalog_entries
> table from more than one column. Changing all these references from
> INT to VARCHAR would increase the DB size, and probably make scans
> less performant. I'm not sure how indexes on these columns would be
> affected.
>
> To summarize, the second question is whether we should ditch the
> artificial numeric IDs and just use the "code" column as primary key
> in the new tiny tables.

If they aren't hurting you, keep them.

> Thanks in advance for your advice.

If you're worried about clutter, it may make sense to put all the
small tables in a separate schema.

--
⚂⚃ 100% natural
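Putting the lookup tables in their own schema, as suggested, is cheap
to do (a sketch; the schema name "lookup" is just an example):

```sql
CREATE SCHEMA lookup;

-- Move an existing small table there:
ALTER TABLE invoice_status SET SCHEMA lookup;

-- Queries can reference it as lookup.invoice_status, or the schema
-- can be added to search_path so unqualified names keep working:
SET search_path TO public, lookup;
```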
Shaun Thomas <sthomas@optionshouse.com> wrote:
> On 03/15/2013 08:36 AM, Kevin Grittner wrote:
>
>> I occasionally hear someone maintaining that having a meaningless
>> sequential ID column as the primary key of each table is required
>> by the relational model.
>
> You know, I've heard you mention this a couple times, and I still
> don't understand why you maintain such a position. Artificial the
> linkages may be, but unless I'm missing something, the level of
> abstraction is often necessary to avoid excessive storage
> requirements and bulky foreign keys.
>
> For me, it comes down to two things:
>
> 1. Copy a "natural" arbitrary-length text string possibly millions
>    of times, or use a 32/64-bit integer lookup value.
> 2. Use multiple natural columns in a primary key, necessitating
>    copying all columns to child tables for foreign keys, or use one
>    surrogate key for both.

When I was working for the Wisconsin Courts, the Circuit Court
databases had some tables which were maintained by a central Court
Operations group and some tables holding data generated by each
county. The primary key of each county-maintained table included a
county number (smallint). Most county-maintained tables were related
to court cases, and included a court case number as part of the
primary key (varchar(14)). Beyond that, other fields were added to
each table, and where there was a natural hierarchy to the data the
"child" normally had the PK of its parent plus something to make it
unique within that set. Often this was a unique number that started
at 1 for each parent.

Some of these tables, when replicated to the central state-wide site,
had hundreds of millions of rows. You would probably be surprised how
many queries were able to optimize much better than they could have
with a single-column ID in each table. The plans often surprised me,
and when I would force the planner to take the plan which seemed
obvious to me, 90% of the time the planner had me beat. The other
times often suggested new indexes or adjustments to cost factors
which benefited the whole workload.

> Yes, it complicates the schema.

That's not the point.

> But until someone comes up with a database storage method that
> automatically deduplicates stored data, I can never advocate using
> arbitrary strings as natural keys, no matter how unique and
> meaningful they are.

If micro-managing disk space usage is the primary concern,
single-column synthetic keys are likely to win in many circumstances.
I have no doubt that there are cases where it can benefit
performance. I can guarantee you that many of the queries we ran at
Wisconsin Courts would have been much slower with such synthetic
keys, because they limit the available plans and force some unnatural
"navigation". I know because I saw cases where people forced the type
of navigation the planner would need to do if synthetic keys were
used, and the result was performance orders of magnitude worse.

> Maybe I just say that because I was burned by just such a column in
> a previous engagement. It was 64 characters of arbitrary text, and
> was used as a lookup value for dozens of tables. Had it been mapped
> to a "meaningless" surrogate key, several tables would have been
> halved (or more) in size. Don't even get me started on indexing that
> horrible monstrosity in every table it lived in.

That is getting pretty extreme.

> Small tables? Yeah, whatever. I don't care. But anything that has
> the potential to be duplicated millions of times? You better bet
> that's going to be a serial identifier.

Like I said, Wisconsin Courts has a table that has hundreds of
millions of rows, and the primary key is a smallint, a varchar(14),
and another smallint. You can poke around in it at this site by
drilling down on the "Court Record Events" button for a case:

http://wcca.wicourts.gov/

> Now, you've been doing this longer than I have, in a wider array of
> contexts, so you clearly have some perspective I don't. From where
> I'm sitting though, I don't get the barely suppressed rage. ;)

Oh, every time a programmer who had been working in Access or MySQL
was hired, I had to have this discussion all over again. You're
ripping the scabs off the old wounds from those battles. :-) I
usually had to pull out plans from complex (and fast!) queries to
make my point.

The other thing is that it brings back memories from the early '80s
of working in a database product called TOTAL, which required
explicit navigation over links. Relational technology was such a
breath of fresh air compared to that; I dread a pendulum swing back
to it.

That's been my experience, anyway.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
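Reconstructed from the description above, the Wisconsin Courts layout
would look roughly like this (a sketch only; table and column names
are assumed, the key types are from the text):

```sql
CREATE TABLE court_case (
    county_no SMALLINT,
    case_no   VARCHAR(14),
    PRIMARY KEY (county_no, case_no)
);

-- A child table repeats the parent's key plus a per-parent sequence
-- number starting at 1; the composite key gives the planner more
-- access paths than a single surrogate id would.
CREATE TABLE court_record_event (
    county_no SMALLINT,
    case_no   VARCHAR(14),
    event_no  SMALLINT,
    PRIMARY KEY (county_no, case_no, event_no),
    FOREIGN KEY (county_no, case_no)
        REFERENCES court_case (county_no, case_no)
);
```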