Normalization in Plain English

Normalization in plain English

Well, if I had to explain it to my wife it would have been something like that:

The main idea is to avoid duplication of large data.

Let's take a look at a list of people and the country they came from. Instead of holding the name of the country which can be as long as "Bosnia & Herzegovina" for every person, we simply hold a number that references a table of countries. So instead of holding 100 "Bosnia & Herzegovina"s, we hold 100 #45. Now in the future, as often happens with Balkan countries, they split to two countries: Bosnia and Herzegovina, I will have to change it only in one place. well, sort of.

Now, to explain 2NF, I would have changed the example, and let's assume that we hold the list of countries every person visited.
Instead of holding a table like:

Person   CountryVisited   AnotherInformation   D.O.B.
Faruz    USA              Blah Blah            1/1/2000
Faruz    Canada           Blah Blah            1/1/2000

I would have created three tables, one table with the list of countries, one table with the list of persons and another table to connect them both. That gives me the most freedom I can get changing person's information or country information. This enables me to "remove duplicate rows" as normalization expects.

What is Normalisation (or Normalization)?

Normalization is basically to design a database schema such that duplicate and redundant data is avoided. If the same information is repeated in multiple places in the database, there is the risk that it is updated in one place but not the other, leading to data corruption.

There is a number of normalization levels from 1. normal form through 5. normal form. Each normal form describes how to get rid of some specific problem.

First normal form (1NF) is special because it is not about redundancy. 1NF disallows nested tables, more specifically columns which allows tables as values. Nested tables are not supported by SQL in the first place, so most normal relational databases will be in 1NF by default. So we can ignore 1NF for the rest of the discussions.

The normal forms 2NF to 5NF all concerns scenarios where the same information is represented multiple times in the same table.

For example consider a database of moons and planets:

Moon(PK) | Planet  | Planet kind
------------------------------
Phobos   | Mars    | Rock
Daimos   | Mars    | Rock
Io       | Jupiter | Gas
Europa   | Jupiter | Gas
Ganymede | Jupiter | Gas

The redundancy is obvious: The fact that Jupiter is a gas planet is repeated three times, one for each moon. This is a waste of space, but much more seriously this schema makes inconsistent information possible:

Moon(PK) | Planet  | Planet kind
------------------------------
Phobos   | Mars    | Rock
Deimos   | Mars    | Rock
Io       | Jupiter | Gas
Europa   | Jupiter | Rock <-- Oh no!
Ganymede | Jupiter | Gas

A query can now give inconsistent results which can have disastrous consequences.

(Of course a database cannot protect against wrong information being entered. But it can protect against inconsistent information, which is just as serious a problem.)

The normalized design would split the table into two tables:

Moon(PK) | Planet(FK)     Planet(PK) | Planet kind
---------------------     ------------------------
Phobos   | Mars           Mars       | Rock
Deimos   | Mars           Jupiter    | Gas
Io       | Jupiter 
Europa   | Jupiter 
Ganymede | Jupiter

Now no fact is repeated multiple times, so there is no possibility of inconsistent data. (It may look like there still is some repetition since the planet names are repeated, but repeating primary key values as foreign keys does not violate normalization since it does not introduce a risk of inconsistent data.)

Rule of thumb
If the same information can be represented with fewer individual cell values, not counting foreign keys, then the table should be normalized by splitting it into more tables. For example the first table has 12 individual values, while the two tables only have 9 individual (non-FK) values. This means we eliminate 3 redundant values.

We know the same information is still there, since we can write a join query which return the same data as the original un-normalized table.

How do I avoid such problems?
Normalization problems are easily avoided by giving a bit of though to the conceptual model, for example by drawing an entity-relationship diagram. Planets and moons have a one-to-many relationship which means they should be represented in two different tables with a foreign key-association. Normalization issues happen when multiple entities with a one-to-many or many-to-many relationship are represented in the same table row.

Is normalization it important? Yes, it is very important. By having a database with normalization errors, you open the risk of getting invalid or corrupt data into the database. Since data "lives forever" it is very hard to get rid of corrupt data when first it has entered the database.

But I don't really think it is important to distinguish between the different normal forms from 2NF to 5NF. It is typically pretty obvious when a schema contains redundancies - whether it is 3NF or 5NF which is violated is less important as long as the problem is fixed.

(There are also some additional normal forms like DKNF and 6NF which are only relevant for special purpose systems like data-warehouses.)

Don't be scared of normalization. The official technical definitions of the normalization levels are quite obtuse. It makes it sound like normalization is a complicated mathematical process. However, normalization is basically just the common sense, and you will find that if you design a database schema using common sense it will typically be fully normalized.

There are a number of misconceptions around normalization:

some believe that normalized databases are slower, and the denormalization improves performance. This is only true in very special cases however. Typically a normalized database is also the fastest.
sometimes normalization is described as a gradual design process and you have to decide "when to stop". But actually the normalization levels just describe different specific problems. The problem solved by normal forms above 3rd NF are pretty rare problems in the first place, so chances are that your schema is already in 5NF.

Does it apply to anything outside of databases? Not directly, no. The principles of normalization is quite specific for relational databases. However the general underlying theme - that you shouldn't have duplicate data if the different instances can get out of sync - can be applied broadly. This is basically the DRY principle.

How do you determine how far to normalize a database?

You want to start designing a normalized database up to 3rd normal form. As you develop the business logic layer you may decide you have to denormalize a bit but never, never go below the 3rd form. Always, keep 1st and 2nd form compliant. You want to denormalize for simplicity of code, not for performance. Use indexes and stored procedures for that :)

The reason not "normalize as you go" is that you would have to modify the code you already have written most every time you modify the database design.

There are a couple of good articles:

http://www.agiledata.org/essays/dataNormalization.html

How to answer this interview question?

Way too broad of a question for an interview - it could fill a small book. I would simply remember a few key points about the first 3 normal forms (4 and 5 for extra credit). Here's a somewhat decent summary of them.

If I were interviewing you, and asked the question, I would want to hear above anything else that most db designers strive for at least 3NF but should be able to deviate from that for X reasons. Knowing when to stray from normalization and why is way more important and telling than knowing the definitions.

Which normalization form is commonly used?

It depends on the reqquirment whcih normal form to use, Most of the people make use of 3NF.

But one thing is there as the normalization level increase joins will increase to get data from database server.

How to apply normalization on mysql using php

Ok, couple of things:

php has got nothing to do with this. normalization is about modelling data
normalization is not about saving disk space. It is about organizing data so that it is easily maintainable, which in turn is a way to maintain data-integrity.
normalization is typically described in a few stages or 'normal forms'. In practice, people that design relational databases often intuitively 'get it right' most of the time. But it is still good to be aware of the normal forms and what their characteristics are. There is a lot of documentation on that on the internet (fe http://en.wikipedia.org/wiki/Database_normalization), and you should certainly do you own research, but the most important stages are:

unormalized data: in this stage, data is not truly tabular ('relational'). There is a lot of discussion of what tabular really means, and experts disagree with one another. but most people agree that data is unnormalized in case there are multi-valued attributes (=columns that can for one row contain lists as value), or in case there are repeating groups (=multiple columns or multiple groups of columns for storing the same type of data)

Example of multi-valued column: person (first_name, last_name, phonenumbers)
Here, phonenumbers implies there could be more phonenumbers, stored in one column

Example of repeating group: person(first_name, last_name, child1_first_name, child1_birth_date, child2_first_name, child2_birth_date..., childN_first_name, childN_birth_date)
Here, the person table has a number of column pairs (child_first_name, child_birth_date) to store the person's children.

Note that something like order (shipping_address, billing_address) is not a repeating group: the addresses for billing and shipping may be similar pieces of data, but each has its own distinct role for an order, both just represent a different aspect of an order. child1 thru child10 do not - children do not have specific roles, and the list of children is variable (you never know how many groups you should reserve in advance)

In both cases, multi-valued columns and repeating groups, you basically have "nested table" structure - a table within a table. Data is said to be in 1NF (first normal form) if neither of these occur.

The 1NF is about structural characeristics: the tabular form of the data. All subsequenct normal forms have to do with eliminating redundancy. Redundancy occurs when the same information is independently stored multiple times. Redundancy is bad: if you want to change some fact, you have to change it in multiple places. If you forget to chance one of them, you have inconsistent data - the data is contradicting itself.

There are a lot of processes that can eliminate redundancy, each leading to a higher normal form, all the way from 1nf up to 6nf. However, typically most databases are adequately normalized at 3nf (or a lsight variation of that called boyce-codd normal form, BCNF) You should study 2nf and 3nf, but the principle is very simple: a table is adequately normalized, if:

the table is in 1nf
the table has a key (a column or column combination whose values are required, and which uniquely identifies a row - ie. there can be only one row having that combination of values in the key columns)
there are no functional dependencies between the non-key columns
non-key columns are not functionally dependent upon part of the key (but are completely functionally dependent upon the entire key).

functional dependency means that a column's value can be derived from another column. simple example:

order_item (order_id, item_number, customer_id, product_code, product_description, amount)

let's assume (order_id, item_number) is key. product_code and product description are functionally dependent upon each other: for one particular product_code, you will always find the same product description (as if product description is a function of product_code). The problem is now: suppose a product description changes for a particualr product code, you have to change all orders that us that product_code. forget only one and you have an inconsistent database.

The way to solve it is to create a new product table with (product_code, product_description), having (product_code) as key, and then instead of storing all product fields in order, only store a reference to a row in the product table in the order_item records (in this case, order_item should only keep product_code, which is sufficient to look up a row in the product table and find the product_description)

So as you u can see, with this solution you do actually save space (by not storing all these product descriptions in each order_item that happens to order the product) and you do get more tables (split off product from order_item) But just remember that it is not because of saving diskspace: it is because you eliminate redundancy, thus making it easier to maintain the data. because now you only have to change one row in the product table to change the description

Normalization in Plain English