QRdvark.com

Introduction To Database Normalization

Introduction To Database Normalization

Contents

  1. What is Normalization?
  2. The Problem
  3. First Normal Form (1NF)
  4. Second Normal Form (2NF)
  5. Third Normal Form (3NF)

What is normalization?

Database normalization is a design, or process of organizing data within a database to eliminate data redundancy that arranges data into tabular a form by removing duplicate data from relation tables, ensuring data dependencies make sense and data is logically stored.

The Problem

Data which is not normalized becomes difficult to manage. Simple Create, Read, Update and Delete (CRUD) tasks become laboious and slow and increase the risk of data loss. Data anomolies are frequent as demonstrated below.

Consider a table of school teachers. The table, in non normalized form, might looks like this..

Full Name First Name Surname Address Zip Code Department
Emma Smith Emma Smith 1 Smith St 1234 Physics
Roger Ramjet Roger Ramjet 2 Fast St 2345 Aeronautics
Chris Pitt Chris Pitt 4 Pit Lane 3456 Mechanics
Emma Smith Emma Smith 93 Jones St 4567 Biology
Chen Lee Chen Lee 101 Alphabet Rd English

In this simple example, we see the teacher Emma Smith teaches both Physics and Biology. Problems arise when we try to Create a record. If a teacher has not yet been assigned a subject to teach, the Department field will be NULL. Reading data from this type of table will be slower, as each record for Emma Smith is not related in any way. When updating a record for a teacher who teaches multiple subjects, each of the records needs to be updated, or the data will be inconsistent. Deleting a teachers record will become problematic as each record needs to be found and deleted.

To eliminate these anomolies, and potential data loss, Normalization is used. Normalization has four basic tenets.

  1. First Normal Form (1NF)
  2. Second Normal Form (2NF)
  3. Third Normal Form (3NF)
  4. Boyce & Codd Normal Form (BCNF)

First Normal Form (1NF)

A row of data must not contain repeated data. This means each column must have a unique value. Each row of data must have a unique identifier, such as a Primary Key. In our Example Table we see several problems. There is not unique identifier, such as a Primary Key column, so this column cannot be guarenteed to be unique. The full name field is redundant, as this could be gained from the first name and surname fields, therefore the data is not atomic. There is no way of telling if the Emma Smith who teaches Physics, is the same Emma Smith who teaches Biology.

The solution to the problem of the Example Table to satisfy 1NF is to break the data into seperate tables and add unique identifiers.

Teacher ID First Name Surname Address Zip Code
1 Emma Smith 1 Smith St 1234
2 Roger Ramjet 2 Fast St 2345
3 Chris Pitt 4 Pit Lane 3456
4 Chen Lee 101 Alphabet Rd 4567
Department ID Teacher ID Department
1 1 Physics
2 2 Ramjet Aeronautics
3 3 Mechanics
4 1 Biology
5 4 English

With the data seperated each record (row) has a unique identifier, the ID field, which is the Primary Key for this table in the database. The user Emma Smith now has the ID of 1. In the second table, the teacher ID of 1 is used for both Physics and Biology, thus the data is now normalized to First Normal Form (1NF).

But there is also another issue when a Primary Key for the second table is a concatenation of Department ID, and the Teacher ID.

Second Normal Form (2NF)

For a table to meet Second Normal Form (2NF), the data must satisfy the needs of 1NF, and there must not be any partial dependency of any column on primary key. The above, using a concatination of Department ID and Teacher ID does not satisy this second condition. There is no link between the Deparment Name, and the teacher.

To satisfy Second Normal Form (2NF), the tables are once again reduced to provide three tables.

  1. Teacher Table
  2. Department Table
  3. Teacher Department Link Table
Teacher ID First Name Surname Address Zip Code
1 Emma Smith 1 Smith St 1234
2 Roger Ramjet 2 Fast St 2345
3 Chris Pitt 4 Pit Lane 3456
4 Chen Lee 101 Alphabet Rd 4567
Department ID Department
1 Physics
2 Aeronautics
3 Mechanics
4 Biology
5 English
Department ID Teacher ID
1 1
2 2
3 3
4 1
5 4

With this third table in place, the tables now satisfy Second Normal Form (2NF). Each of the data rows is now accessable by its primary key (ID). But, I hear you say, are table JOINs not slow? No. JOINS on primary key fields are ATOMIC.

However, there is still a problem here. The address of the teacher depends upon the Zip Code field. This is called a "Transitive Functional Dependency".

Third Normal Form (3NF)

A table is said to satisfy Third Normal Form when it means the requirements of Second Normal Form, and has no Transitive Dependencies.

To fix the issue of the Transitive Dependency, one again the address data can be broken out into a seperate table. The zip code itself could be used as the primary key, or a seperate Primary Key field could be used. For the purpose of this tutorial, a seperate address table and ID field will be created with an address linking table to connect them.

Teacher ID First Name Surname Address ID
1 Emma Smith 101
2 Roger Ramjet 102
3 Chris Pitt 103
4 Chen Lee 104
Department ID Department
1 Physics
2 Aeronautics
3 Mechanics
4 Biology
5 English
Department ID Teacher ID
1 1
2 2
3 3
4 1
5 4
Address ID Address Zip Code
101 1 Smith St 1234
102 2 Fast St 2345
103 4 Pit Lane 3456
104 101 Alphabet Rd 4567

The teacher table now relates to the address table via the Primary Key field Address IO. With this in place, the transitive dependency is gone and the table now satisfies third Normal Form (3NF).