Wednesday, April 8, 2009

Are you normal?

In 1972 Dr. E. F. Codd published a paper titled, "Further Normalization of the Data Base Relational Model", which has become a best seller for insomniacs worldwide. Dr. Codd was brilliant to be sure, but neither of my kids ever requested a second reading of his material at bedtime. This begs the question, "did you really read them technical manuals at night?" The answer is, when your child has been up for 18 straight hours... and you're exhausted... and you just want them to sleep... you find yourself "thinking outside the box."

Strangely, neither of my offspring ever considered a technology career path. I blame Goodnight Moon, Dr. Seuss, and Shel Silverstein.

Dr. Codd proposed a series of rules to normalize the design of databases. I can state with some authority that his skills with arithmetic far and away exceed his talent as a story teller. I would go so far as to suggest that Further Normalization of the Data Base Relational Model is the mathematical equivalent of Muzak.

The first three of his rules can be collapsed into this; if a piece of data exists in two places then it is wrong in one of them. Of course, we can find countless examples where this is not true which in turn lulls us into a false sense of trust. So, if the same piece of data exists in two different places, then it is possible for it to be wrong in one of them and it is impossible to predict where or when. It is in understanding this over-simplified construct that issues relating to database integrity and scalability can be managed.

Codd's formulas are often expressed as the Rules for Normalization; rules which we are far too willing to violate because "it just makes sense in this case", "normalization is too theoretical", and the ever popular "THERE'S NO TIME." As a recap, here are the first three Normalization Rules:

First Normal Form (1NF): No Repeating Columns or Groups of Columns. (Wrong: AREA_CODE_1, PHONE_1, AREA_CODE_2, PHONE_2, ...).

Second Normal Form (2NF): No Partial Dependencies on a combination key. This only applies to tables with a primary key which is comprised of multiple columns (combination key). This means that we have to make sure that none of the non-key columns of the table are dependent on some, but not all, of the key columns. Tables that have a one column primary key (assuming they pass First Normal Form) are automatically compliant with the Second Normal Form.

Third Normal Form (3NF): No Dependencies on Non-Key columns. Sometimes we'll have a pair of columns in a table that really should be a foreign key pointing to a row of another table. A classic example is Zip code, City, and State. Given the zip code, the city and state could be looked up. The downside to carrying all three (BTW - everybody does this) is that if you change any one of the three data items you create a referential integrity issue for the other two data items.

Now wasn't that refreshing. No, not the rules, the little nap you took right after "First Nor....zzzzz" Did it again, huh? Yeah they're a real sleeper set. But if you can just get through them one time, you'll find that strict adherence to these rules will save you a lot of time, money, and aggravation when maintaining transactional database tables. For read-only reporting tables you can have at it - make them as flat and wide as you want. Who cares, no one believes that data anyways.

Normalization is one of those silly little best practice thingies we are delighted to skip because we are too smart to be trapped by the fallout later on. So, here is some bad news. You're not that bright. Don't get me wrong, I'm not that bright either; nobody is. No one can remember all of the little rules we break en-route to a new application or function. So, following the rules of normalization, even when you think the database is too small, or too big, is an important declaration that you accept the human condition. Referential integrity and scalability (normalization) are functions of physics - they are based on mathematics, not some arbitrary set of majority-wins guidelines established by ivory tower, monolithic governance bodies bent on imposing intellectual will over the unwashed.

A database design should not be based on the needs of an application, rather it should reflect the realities of reality. The data design should mimic, to the highest degree possible, the true relationships that exist in the domain being modeled. All applications, and therefore the databases, start small and grow. Begin with the right (extensible) design and your life, and that of your successor, will be easier. Don't be lulled into the false trap of "this is a little app so I can break the rules, no one will ever notice."

Follow by Email