Database Normalization - Who's Right

Database normalization - who's right?

You are absolutely correct! One of the aims of normalization is to eliminate attributes whose values can easily be derived from other attributes, i.e. by performing some simple calculation. In your case, the total units column can be obtained by simply adding up the individual unit values.

Tell your professor that such a column is a clear sign of a derived attribute (a transitive dependency), and that according to Third Normal Form (3NF) it is recommended to remove attributes like that.
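As a rough sketch of the idea (the table and column names here are invented, since the original schema isn't shown): store the per-course units once and derive the total when you need it, rather than keeping a total units column.

```python
import sqlite3

# Hypothetical schema: per-course units are stored once; the total is derived.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE enrollment (
        student_id  INTEGER NOT NULL,
        course_code TEXT    NOT NULL,
        units       INTEGER NOT NULL,
        PRIMARY KEY (student_id, course_code)
    );
    INSERT INTO enrollment VALUES (1, 'MATH101', 3), (1, 'PHYS201', 4), (1, 'ENGL105', 3);
""")

# No total_units column anywhere -- derive it when needed.
total = conn.execute(
    "SELECT SUM(units) FROM enrollment WHERE student_id = ?", (1,)
).fetchone()[0]
print(total)  # 10
```

If the derived total is needed constantly, a view gives the convenience of a stored column without the redundancy.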

Normalizing Database Table

You've done it right according to:

First Normal Form

  1. Eliminate repeating groups in individual tables.
  2. Create a separate table for each set of related data.
  3. Identify each set of related data with a primary key.

The only thing to point out is your "month" attribute, which I would change to a date instead, as a bare month limits the employee to being employed for only one year (as pointed out by another comment).
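For illustration only (the table and column names are guesses, since the original design isn't reproduced here), a layout that follows those three rules and uses a full date instead of a bare month might look like this:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Each set of related data gets its own table with a primary key (1NF).
    CREATE TABLE employee (
        employee_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    -- One row per employee per pay period; a full DATE instead of a bare month,
    -- so the data is not limited to a single year.
    CREATE TABLE salary_payment (
        employee_id INTEGER NOT NULL REFERENCES employee(employee_id),
        pay_date    TEXT    NOT NULL,      -- ISO date, e.g. '2013-02-28'
        amount      NUMERIC NOT NULL,
        PRIMARY KEY (employee_id, pay_date)
    );
""")
conn.execute("INSERT INTO employee VALUES (1, 'Alice')")
conn.execute("INSERT INTO salary_payment VALUES (1, '2013-02-28', 2500)")
conn.execute("INSERT INTO salary_payment VALUES (1, '2014-02-28', 2600)")  # same month, different year
print(conn.execute("SELECT COUNT(*) FROM salary_payment WHERE employee_id = 1").fetchone()[0])  # 2
```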

Is database normalization still necessary?

It depends on what type of application(s) are using the database.

For OLTP apps (principally data entry, with many INSERTs, UPDATEs and DELETEs, along with SELECTs), a normalized schema is generally a good thing.

For OLAP and reporting apps, normalization is not helpful: SELECT queries will run much more quickly against a denormalized schema, which can be provided on top of the normalized tables with views.
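A small sketch of that last point, with invented table names: the base tables stay normalized for the OLTP workload, while reporting reads from a denormalized view layered on top.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized base tables for OLTP work.
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        order_total NUMERIC NOT NULL
    );

    -- A denormalized, report-friendly shape exposed as a view.
    CREATE VIEW order_report AS
    SELECT o.order_id, o.order_total, c.customer_id, c.name AS customer_name
    FROM orders o
    JOIN customer c ON c.customer_id = o.customer_id;
""")
conn.execute("INSERT INTO customer VALUES (1, 'Acme Ltd')")
conn.execute("INSERT INTO orders VALUES (100, 1, 49.95)")
for row in conn.execute("SELECT * FROM order_report"):
    print(row)  # (100, 49.95, 1, 'Acme Ltd')
```

In a heavier OLAP setup the denormalized shape would more likely be materialized (for example a star schema refreshed on a schedule), but a plain view is often enough for lighter reporting.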

You might also find some helpful information in these very popular similar questions:

Should I normalize my DB or not?

In terms of databases, is “Normalize for correctness, denormalize for performance” a right mantra?

What is the resource impact from normalizing a database?

How to convince someone to normalize a database?

Is it really better to use normalized tables?

Normalization: What does repeating groups mean?

The term "repeating group" originally meant the concept in CODASYL and COBOL based languages where a single field could contain an array of repeating values. When E.F.Codd described his First Normal Form that was what he meant by a repeating group. The concept does not exist in any modern relational or SQL-based DBMS.

The term "repeating group" has also come to be used informally and imprecisely by database designers to mean a repeating set of columns, meaning a collection of columns containing similar kinds of values in a table. This is different to its original meaning in relation to 1NF. For instance in the case of a table called Families with columns named Parent1, Parent2, Child1, Child2, Child3, ... etc the collection of Child N columns is sometimes referred to as a repeating group and assumed to be in violation of 1NF even though it is not a repeating group in the sense that Codd intended.

This latter sense of a so-called repeating group is not technically a violation of 1NF if each attribute is only single-valued. The attributes themselves do not contain repeating values, so there is no violation of 1NF for that reason. Such a design is often considered an anti-pattern, however, because it constrains the table to a predetermined fixed number of values (a maximum of N children in a family) and because it forces queries and other business logic to be repeated for each of the columns. In other words, it violates the "DRY" principle of design. Because it is generally considered poor design, it suits database designers, and sometimes even teachers, to refer to repeated columns of this kind as a "repeating group" and a violation of the spirit of First Normal Form.

This informal usage of terminology is slightly unfortunate because it can be a little arbitrary and confusing (when does a set of columns actually constitute a repetition?) and also because it is a distraction from a more fundamental issue, namely the Null problem. All of the Normal Forms are concerned with relations that don't permit the possibility of nulls. If a table permits a null in any column then it doesn't meet the requirements of a relation schema satisfying 1NF. In the case of our Families table, if the Child columns permit nulls (to represent families who have fewer than N children) then the Families table doesn't satisfy 1NF. The possibility of nulls is often forgotten or ignored in normalization exercises but the avoidance of unnecessary nullable columns is one very good reason for avoiding repeating sets of columns, whether or not you call them "repeating groups".
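To make the Families example concrete, here is a sketch of both shapes (the ParentN/ChildN column names follow the example above; everything else is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The "repeating columns" shape: technically single-valued attributes,
    -- but capped at 3 children and full of NULLs for smaller families.
    CREATE TABLE families_wide (
        family_id INTEGER PRIMARY KEY,
        parent1   TEXT,
        parent2   TEXT,
        child1    TEXT,
        child2    TEXT,
        child3    TEXT
    );

    -- The usual alternative: one row per person, no artificial cap, no NULL padding.
    CREATE TABLE family (
        family_id INTEGER PRIMARY KEY
    );
    CREATE TABLE family_member (
        family_id INTEGER NOT NULL REFERENCES family(family_id),
        name      TEXT    NOT NULL,
        role      TEXT    NOT NULL CHECK (role IN ('parent', 'child')),
        PRIMARY KEY (family_id, name)
    );
""")
conn.execute("INSERT INTO family VALUES (1)")
conn.executemany(
    "INSERT INTO family_member VALUES (1, ?, ?)",
    [("Pat", "parent"), ("Sam", "child"), ("Alex", "child")],
)
# One query covers any number of children, instead of one expression per ChildN column.
print(conn.execute(
    "SELECT COUNT(*) FROM family_member WHERE family_id = 1 AND role = 'child'"
).fetchone()[0])  # 2
```

The wide table is arguably still in 1NF if every column is single-valued, but the one-row-per-member design avoids the fixed cap, the repeated query logic, and the nullable ChildN columns discussed above.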

See also this article.

When is a good time to break normalization rules?

The rule is: normalize till it hurts, then denormalize till it works. (Who said that?)

In general, I often denormalize when I have a lot of parent-child relationships and I know I would often have to join five or six large tables to get one piece of data (say the client id, for instance) and will not need any of the information from the intermediate tables much of the time. If at all possible, I try to denormalize things that will not change frequently (such as id fields). But any time you denormalize, you have to write triggers or some other process (normally triggers, if it isn't something that can be handled through a PK/FK relationship and cascading updates) to make sure the data stays in sync. If you fail to do this at the database level, you will have data integrity problems and your data becomes useless. Do not think you can maintain the denormalization through application code; that is a recipe for disaster, as databases are often updated from places other than the application.
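As a sketch of that trigger approach (tables invented for illustration; the syntax shown is SQLite's): a client_id is copied down onto a detail table so queries don't have to join through the intermediate tables, and triggers keep the copy in sync at the database level.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE client  (client_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE project (project_id INTEGER PRIMARY KEY,
                          client_id INTEGER NOT NULL REFERENCES client(client_id));
    -- Denormalized: task carries client_id directly, so reports on tasks
    -- don't have to join through project (and whatever sits between them).
    CREATE TABLE task (
        task_id    INTEGER PRIMARY KEY,
        project_id INTEGER NOT NULL REFERENCES project(project_id),
        client_id  INTEGER NOT NULL
    );

    -- Triggers keep the copied client_id consistent at the database level.
    CREATE TRIGGER task_set_client
    AFTER INSERT ON task
    BEGIN
        UPDATE task
        SET client_id = (SELECT client_id FROM project WHERE project_id = NEW.project_id)
        WHERE task_id = NEW.task_id;
    END;

    CREATE TRIGGER project_client_changed
    AFTER UPDATE OF client_id ON project
    BEGIN
        UPDATE task SET client_id = NEW.client_id WHERE project_id = NEW.project_id;
    END;
""")
conn.execute("INSERT INTO client VALUES (1, 'Acme'), (2, 'Globex')")
conn.execute("INSERT INTO project VALUES (10, 1)")
conn.execute("INSERT INTO task (task_id, project_id, client_id) VALUES (100, 10, 0)")
print(conn.execute("SELECT client_id FROM task WHERE task_id = 100").fetchone()[0])  # 1
conn.execute("UPDATE project SET client_id = 2 WHERE project_id = 10")
print(conn.execute("SELECT client_id FROM task WHERE task_id = 100").fetchone()[0])  # 2
```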

Denormalizing, even when done correctly, can slow inserts, updates and deletes, especially if you need to process large batches of data. It may or may not improve select query speed, depending on how you need to query the data. If you end up needing to do a lot of self-joins to get the data, it is possible you would have been better off not denormalizing. Never denormalize without testing whether you have actually improved performance. Remember that slowing inserts/updates/deletes will have an overall effect on the system when many users are using it. By denormalizing to fix one problem, you may be introducing a worse problem in the overall system. Don't just test the one query you are trying to speed up; test the performance of the whole system. You might speed up a query that runs once a month and slow down other queries that run thousands of times a day.

Denormalizing is often done for data warehouses which are a special case as they are generally updated automatically on a schedule rather than one record at a time by a user. DBAs who specialize in data warehousing also tend to build them and they know how to avoid the data integrity issues.

Another common denormalizing technique is to create a staging table for data related to a complex report that doesn't need to be run against real-time data. This is a sort of poor man's data warehouse and should never be done without a way to update the staging table on a schedule (as infrequently as you can get away with; this uses server resources that could be better spent elsewhere most of the time). Often these types of tables are updated when there are few users on the system and lag a full day behind the real-time data. Don't consider doing this unless the query you are staging the data for is truly slow and cannot otherwise be optimized. Many slow queries can be optimized without denormalization, as developers often use the easiest-to-understand way to select data rather than the most performant one.
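A bare-bones sketch of the staging-table idea (names invented; in practice the refresh would be kicked off by a scheduler such as cron or an agent job, not by user requests):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sale (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER NOT NULL,
        amount     NUMERIC NOT NULL,
        sold_on    TEXT    NOT NULL
    );
    -- "Poor man's data warehouse": pre-aggregated rows for a slow report.
    CREATE TABLE sales_report_staging (
        product_id   INTEGER PRIMARY KEY,
        total_sales  NUMERIC NOT NULL,
        refreshed_at TEXT    NOT NULL
    );
""")

def refresh_sales_report(conn: sqlite3.Connection) -> None:
    """Rebuild the staging table; meant to be called from a scheduled job,
    not on every user request."""
    with conn:  # one transaction, so readers never see a half-built table
        conn.execute("DELETE FROM sales_report_staging")
        conn.execute("""
            INSERT INTO sales_report_staging (product_id, total_sales, refreshed_at)
            SELECT product_id, SUM(amount), datetime('now')
            FROM sale
            GROUP BY product_id
        """)

conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)",
                 [(1, 7, 10.0, '2024-01-01'), (2, 7, 5.0, '2024-01-02'), (3, 9, 2.5, '2024-01-02')])
refresh_sales_report(conn)
print(conn.execute("SELECT product_id, total_sales FROM sales_report_staging ORDER BY product_id").fetchall())
# [(7, 15.0), (9, 2.5)]
```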

What is the resource impact from normalizing a database?

This cannot really be answered in a general manner, as the impact varies heavily depending on the specifics of the database in question and the apps using it.

So you basically stated the general expectations concerning the impact:

  1. Overall storage requirements should go down, as redundant data is removed.
  2. CPU needs might go up, as queries might get more expensive (note that in many cases queries on a normalized database will actually be faster, even if they are more complex, as the query engine has more optimization options); see the sketch after this list.
  3. Development resource needs might go up, as developers might need to construct more elaborate queries (but on the other hand, you need less development effort to maintain data integrity).
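A tiny sketch of points 1 and 2, with invented tables: the normalized layout stores the department name once instead of once per employee, at the price of a join in the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Denormalized: the department name is repeated on every employee row.
    CREATE TABLE employee_flat (
        employee_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        dept_name   TEXT NOT NULL
    );

    -- Normalized: the department name is stored once; queries need a join.
    CREATE TABLE department (dept_id INTEGER PRIMARY KEY, dept_name TEXT NOT NULL);
    CREATE TABLE employee (
        employee_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        dept_id     INTEGER NOT NULL REFERENCES department(dept_id)
    );
""")
conn.executemany("INSERT INTO employee_flat VALUES (?, ?, ?)",
                 [(1, 'Alice', 'Accounting'), (2, 'Bob', 'Accounting')])
conn.execute("INSERT INTO department VALUES (1, 'Accounting')")
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)", [(1, 'Alice', 1), (2, 'Bob', 1)])

# Point 1: redundancy removed (one 'Accounting' string instead of one per employee).
# Point 2: the query takes a little more work to write, but gives the optimizer more to work with.
rows = conn.execute("""
    SELECT e.name, d.dept_name
    FROM employee e JOIN department d ON d.dept_id = e.dept_id
""").fetchall()
print(rows)  # [('Alice', 'Accounting'), ('Bob', 'Accounting')]
```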

So the only real answer is the usual: it depends ;)

Note: This assumes that we are talking about cautious and intentional denormalization. If you are referring to the 'just throw some tables together as data comes along' approach, all too common with inexperienced developers, I'd risk the statement that normalization will reduce resource needs on all levels ;)


Edit: Concerning the specific context added by cdeszaq, I'd say 'Good luck getting your point through' ;)

Obviously, with over 300 tables and no constraints (!), the answer to your question is definitely 'normalizing will reduce resource needs on all levels' (and probably very substantially), but:

Refactoring such a mess will be a major undertaking. If there is only one app using this database, it is already dreadful - if there are many, it might become a nightmare!

So even if normalizing would substantially reduce resource needs in the long run, it might not be worth the trouble, depending on circumstances. The main questions here are about long term scope - how important is this database, how long will it be used, will there be more apps using it in the future, is the current maintenance effort constant or increasing, etc. ...

Don't ignore that it is a running system - even if it's ugly and horrible, according to your description it is not (yet) broken ;-)


