Data Schema Normalization
by Scot Becker

ABSTRACT
The purpose of this white paper is to outline the various normal forms used in ("manual") data schema normalization. This paper begins with a discussion as to why a database designer is concerned with normalization. Immediately following is a discussion of functional dependencies and an introduction to eight normal forms. This paper is closed with a discussion as to how normalization fits in with the data modeling technique, Object-Role Modeling (ORM). The reader should be familiar with basic database concepts (such as tables, columns, rows, and keys) before reading this paper.

INTRODUCTION
The main reasons that a designer normalizes data structures include:

Semantic grouping of related elements, reduction of redundant values in tuples (one or more columns) which can cause insert, delete and update anomalies.

Reduction of null values in tuples.

Disallow spurious tuples (incorrect join combinations of data values due to improper structures).

Further, normalization tends to aid in the discovery process, allows more precise capture of business logic, and aids in model management and integration (even across the entire Enterprise).

FUNCTIONAL DEPENDENCIES
The main root of a normalization technique centers on functional dependencies1. This simply means that the values of a component "Y" in a relationship depend on, or are determined by, the value of a component "X". Conversely, one could state that the values of X determine the values of Y (this functional dependency is denoted: X
® Y). Further, a functional dependency is invalid if we have two records with the same "X" value but different "Y" values.

For example, in many data structures a "Social Security Number (SSN)" determines a "Person", which is commonly (but usually not uniquely) referenced by a "NAME". Thus, we say that NAME is functionally dependent on SSN. Keep in mind that the attributes X and Y may be composite structures (more than one attribute or, in other words, column).

The root of normalization centers on ensuring that non-key attributes of a table are functionally dependent on the key of the same table.

THE NORMAL FORMS
A given data structure can be at one of several levels, or stages of completeness, of normalization. These stages are known as normal forms. The eight (most common) normal forms are First Normal Form, Second Normal Form, Third Normal Form, Elementary Key Normal Form, Boyce-Codd Normal Form, Fourth Normal Form, Fifth Normal Form, and Project-Join Normal Form.

FIRST NORMAL FORM
First Normal Form (1NF) is now generally considered part of the formal definition of a relation. Historically, 1NF was intended to disallow multi-valued attributes. 1NF dictates that the domains (allowable values) of attributes must include only atomic (simple, indivisible) values and that any given value of an instance of an attribute must be a single value from the domain of that attribute. In short, a given cell of a column in a table can contain only one value.

The following table violates 1NF because the second row contains more than one value in the COLOR column:
 

CAR COLOR
123 ABC Blue
321 CBA Black
Gray
White

Figure 1: This table violates 1NF.

To ensure a table is in 1NF, one simply needs to decompose grouped attributes into separate rows (or in some cases, tables). The following representation of the above data is in (at least) 1NF:

CAR COLOR
123 ABC Blue
321 CBA Black
321 CBA Gray
321 CBA White

Figure 2: This table is in (at least) 1NF.

SECOND NORMAL FORM
Second Normal Form (2NF) is based on the concept of "full" functional dependency. A functional dependency, X
® Y, is a full functional dependency if removal of any attribute from X means that the dependency does not hold anymore. For example, given a table that tracks hours (HOURS) a given employee (SSN) devotes to a given project (PROJNUM), we note that HOURS is functionally dependent on the combination of SSN and PROJNUM because a given employee can work on more than one project. Removal of either SSN or PROJNUM from the functional dependency results in an incorrect relationship. For example, if we were to remove SSN from the previous functional dependency, we are left with HOURS and PROJNUM, but we donât know which SSN worked those HOURS! Thus, we say SSN, PROJNUM is a full functional dependency of HOURS.

A table is in 2NF if that table is in 1NF and every non-prime (is not involved in a primary key of the table) attribute in that table is fully functionally dependent on the primary key of the table.

For example, the following table is in 1NF but is not in 2NF because PNAME and PLOCATION are dependent on only part of the primary key (PROJNUM and SSN). Likewise, ENAME is also only dependent on SSN. (Primary keys will be denoted in this document by underlining the included columns.)

Employee_Project_Table

SSN PROJNUM HOURS ENAME PNAME PLOCATION

Figure 3: This table violates 2NF.

To correct this schema, we need to create additional tables and decompose the partial dependencies into these new tables, as in figure 4.

(Please note that, for simplicity, these diagrams will not denote the resulting referential integrity, such as foreign keys, that would need to be added to these decomposed schemas.)

Employee_Table

SSN ENAME

Project_Table

PROJNUM PNAME PLOCATION

Hours_Table

SSN PROJNUM HOURS

Figure 4: The table from Figure 3 has been decomposed to form a 2NF schema.

THIRD NORMAL FORM
Third Normal Form (3NF) is based on the concept of "transitive dependency". A transitive dependency can be loosely defined as a dependency that does not involve the primary key. For example, in the table below we see that while all elements have functional dependencies on the key, SSN, there also exist other, transitive dependencies. Namely, DEPTNAME is dependent on DEPTNUM.

Employee_Department_Table

SSN ENAME BDATE DEPTNUM DEPTNAME

Figure 5: This table violates 3NF.

A table is in 3NF if it is in 2NF and no non-key attributes are dependent on other non-key attributes.

We can decompose the above table into 3NF by creating a second table for department. Thus, the following structure is in 3NF:

Employee_Table

SSN ENAME BDATE DEPTNUM

Department_Table

DEPTNUM DEPTNAME

Figure 6: The table from Figure 5 has been decomposed to form a 3NF schema.

There is a subtle difference between 2NF and 3NF. In 2NF, we were concerned about non-key fields being dependent on subsets of the key. In 3NF, we are concerned about non-key fields being dependent on other non-key fields. Another way to say this has been nicely summarized as: any non-key field must be "... dependent on the key, the whole key, and nothing but the key."
[Kent]

ELEMENTARY KEY NORMAL FORM
Elementary Key Normal Form (EKNF) is a subtle enhancement on 3NF (by definition, EKNF tables are also in 3NF) that most often occurs when there is more than one unique composite key (more than one column) which overlap (one or more columns are involved in both keys) in a table2. Such cases can cause redundant information in the overlapping column(s). For example, in the following table let's assume that a subject title (SUBJECTTITLE) is also a unique identifier for a given subject in the following table:

Enrollment_Table

STUDENTNUM SUBJECTCODE SUBJECTTITLE
1 CS100 ER
1 CS114 ORM
2 CS114 ORM

Figure 7: This table, although it is in 3NF, violates EKNF.

The primary key of the above table is the combination of STUDENTNUM and SUBJECTCODE. However, we can also see a (non-primary) uniqueness constraint (alternate key) that should span the STUDENTNUM and SUBJECTTITLE columns as well. The above schema could result in update and deletion anomalies because values of both SUBJECTCODE and SUBJECTTITLE tend to be repeated for a given subject.

The following schema is a decomposition of the above table in order to satisfy EKNF:

Subject_Table

SUBJECTCODE SUBJECTTITLE
CS100 ER
CS114 ORM

Enrollment_Table

STUDENTNUM SUBJECTCODE
1 CS100
1 CS114
2 CS114

Figure 8: The table from Figure 7 has been decomposed to form an EKNF schema.

For reasons that will become obvious in the following section, ensuring a table is in EKNF is usually skipped, as most designers will move directly on to Boyce-Codd Normal Form after ensuring that a schema is in 3NF. Thus, EKNF is included here only for reasons of historical accuracy and completeness.

BOYCE-CODD NORMAL FORM
Like EKNF, the only time a table is in 3NF but is not in Boyce-Codd Normal form (BCNF) is when the table contains two or more candidate keys that overlap. Beyond that, there is only a subtle difference between EKNF and BCNF, which I will outline below.

Consider the same example we used to illustrate EKNF, but we now add a column (GRADE) to denote a studentâs grade received in the course. (Further, for illustrative simplicity, let's assume that a student can only take a course once.)

Enrollment_Grade_Table

STUDENTNUM SUBJECTCODE SUBJECTTITLE GRADE
1 CS100 ER C
1 CS114 ORM A
2 CS114 ORM A

Figure 9: This table violates BCNF.

We see here that the GRADE column is dependent only on a given enrollment pair and that the keys are now elementary 3 (which satisfies EKNF). However, SUBJECTTITLE is dependent on SUBJECTCODE. Since the key of the table is STUDENTNUM and SUBJECTCODE, we decompose this structure into the following two tables which satisfy BCNF:

Course_Table

SUBJECTCODE SUBJECTTITLE
CS100 ER
CS114 ORM

Enrollment_Grade_Table

STUDENTNUM SUBJECTCODE GRADE
1 CS100 C
1 CS114 A
2 CS114 A

Figure 10: The table from Figure 9 has been decomposed to form a BCNF schema.

One may note that this would also have happened to solve our EKNF problem in the previous section. For that very reason, most designers seldom worry about EKNF and move straight on to BCNF.

FOURTH NORMAL FORM
The final normal forms are concerned with multi-valued facts. We can also note that they are concerned with composite keys, as they tend to minimize the number of fields involved in a composite key.

A table is in Fourth Normal form if it is in BCNF and all functional dependencies are "single valued". Another way to state this is to say that a table cannot contain two or more independent "multi-valued"3 facts
[Kent]. By "independent", we mean to say that there is no direct connection between the two (or more) multi-valued facts. This vague definition is better handled by example. In the following table (in BCNF, since it is entirely composed of attributes involved in the key), we record people (NAME), instruments they play (INSTRUMENT), and music styles (MUSICSTYLE) they play.

NAME INSTRUMENT MUSICSTYLE
Able Piano Classical
Able French Horn Classical
Able Kazoo Blues
Baker Trumpet Jazz
Able Piano Blues

Figure 11: This table violates 4NF.

We see that redundancy occurs because a given person (NAME) can play more than one INSTRUMENT and play more than one MUSICSTYLE (the fact that "
Able" plays the "Piano" is repeated, as is the fact that she plays the "Blues" and "Classical"). Further, this table seems to suggest a link between instruments and music styles. Can "Able" play "Blues" with a French Horn4?

In other words, we see that there are two independent multi-valued facts in the above table. The first is that a person (NAME) can play more than one INSTRUMENT while the second is that a person (NAME) can play more than one MUSICSTYLE. These facts are independent because these two facts have no bearing on each other. Decomposing this table into two tables (below) solves the problem.

Plays_Table

NAME INSTRUMENT
Able Piano
Able French Horn
Able Kazoo
Baker Trumpet

Styles_Table

NAME STYLE
Able Classical
Able Blues
Baker Trumpet

Figure 12: The table in Figure 11 has been decomposed into a 4NF schema.

One should note that 4NF only applies to tables with three or more attributes (it eliminates overlapping multi-valued dependencies which, by definition, require three or more attributes) and only when all attributes compose the primary key of the table.

FIFTH NORMAL FORM AND PROJECT JOIN NORMAL FORM
Cases where a table is in 4NF but is not in Fifth Normal Form (5NF) are extremely rare. Further, Project Join Normal Form (PJNF) is a slightly stronger (although this is debated) case of 5NF, and in virtually all cases it can be treated as an equivalent. Therefore, PJNF is included here for completeness.

As in 4NF, 5NF considerations apply only to tables with three or more attributes, all of which comprise the primary key.

The formal definition of 5NF and PJNF requires that we must first define a "projection". A projection of a table is a subset of the total number of columns with no duplicate rows. For example, the following table:

Trains_Table

EMPLOYEE CLASSTYPE COMPANY
Durham ORM Microseft
Durham UML Microseft
Durham ER Microseft
Durham Diagramming Microseft
Becker ER Oricle
Becker ORM Microseft
Becker UML Microseft
Becker ER Microseft
Becker Diagramming Microseft

Figure 13: This table, we shall see later, can violate 5NF.

has the following projections:

Trains1

EMPLOYEE CLASSTYPE
Durham ORM
Durham UML
Durham ER
Durham Diagramming
Becker ER
Becker ORM
Becker UML
Becker Diagramming

Trains2

EMPLOYEE COMPANY
Durham Microseft
Becker Oricle
Becker Microseft

Trains3

CLASSTYPE COMPANY
ORM Visio
UML Visio
ER Visio
Diagramming Visio
ER Oracle

Figure 14: Projections of the table in Figure 13.

The training table in Figure 13 may or may not be in 5NF depending on the business rules. Say we have to enforce the rule:

An EMPLOYEE trains a CLASSTYPE for a COMPANY if and only if an EMPLOYEE trains a CLASSTYPE, the EMPLOYEE trains for a COMPANY, and the COMPANY the EMPLOYEE trains for makes a tool that implements the CLASSTYPE the EMPLOYEE trains.

If we enforce the above rule the table in Figure 13 is not in 5NF and must be reduced to three tables represented by the above projections of the original table.

To achieve 5NF, one checks all-key tables for decompositions whose joins result in the same information. A cautionary note, however, is that such decompositions can lead to a loss of constraint knowledge. For example, in the above case, we need to create database code to handle the specified rule between an EMPLOYEE, the CLASSTYPEs they train, and the COMPANY who makes the tool that implements the CLASSTYPE.

The root concept behind 4NF, 5NF, and PJNF is that the tables not in these normal forms can be derived from simpler, more fundamental relationships.
[Simsion] Further, 5NF does not differ from 4NF unless there are other rules (semantic constraints) that dictate correct data population [Kent]. Lastly, 5NF differs from 4NF in that the fact combinations we are concerned with are no longer independent from each other (due to the semantic constraints) [Kent].

NORMALIZATION AND DATA MODELING
The previous discussion centers on what I call "manual normalization". As you may now realize, this normalization technique can be tricky and difficult to explain (the fact that I used five references, not counting my own, to help me explain this topic is a good indication of this). I prefer to use a data modeling technique that, in addition to many, many other benefits, also happens to completely normalize my data structures with little extra effort on my part.

This technique, called Object-Role Modeling (ORM), centers around objects playing roles. We don't worry about entities (tables) and attributes (columns) at the level that ORM is concerned with (called the conceptual level). Because of this, dependencies and semantic constraints are rooted out right away, and the resulting (generated) logical and physical schemas are in "optimal normal form" (assuming the fact types are "elementary"
[Becker]). The details behind ORM are beyond the scope of this paper, but I encourage the reader to explore this technique further. In particular, the reader should reference "Normalization and ORM" [Becker].

SUMMARY
It is important to note that while the normal forms are subsets of each other (for example, a schema in 3NF is in 2NF by definition) it is not necessary to move from one to the other in a sequential fashion. In most cases, initial schemas already meet the criteria for 1NF and 2NF. Further, as we have seen, many structures that are decomposed to meet the requirements of 3NF also happen to meet the requirements of EKNF, BCNF, 4NF, 5NF, and PJNF with no additional decompositions. For this reason, many data modelers strive to ensure that their designs meet BCNF and only worry about ensuring the higher normal forms in special circumstances. Further, the reader will note that authors seldom delve into discussions of EKNF, 4NF, 5NF, and PJNF, as they are often met by the earlier normal forms (up to BCNF).

FOOT NOTES
1) A functional dependency is formally defined as follows: A functional dependency, denoted by X
® Y, between two sets of attributes X and Y that are subsets of R specifies a constraint on the possible tuples that can form a relation instance r of R. The constraint states that, for any two tuples t1 and t2 in r such that t1[X] = t2[X], that we must also have t1[Y] = t2[Y] [Elmasri, et al.].

2) A more formal definition of EKNF requires some additional definitions. First, a functional dependency, X
® Y, is said to be "trivial" if and only if X contains (is the same as) Y. The same functional dependency is said to be "full" (as opposed to "partial") if and only if there is a functional dependency from part of X to part of Y (where X and Y are composite attributes, or, in other words, have more than one column). Now, we can define an "elementary functional dependency" as a full, non-trivial functional dependency. Further, a key is an "elementary key" if there exists some elementary functional dependency from it to some other attribute in the table. We can now say a table is in EKNF if and only if all of its elementary functional dependencies begin at whole keys or end at elementary key attributes.

3) A functional dependency is said to be "multi-valued" in the following case: Given a set of (usually, all-key) attributes X, Y, and Z, X is involved in a multi-valued dependency if and only if the set of Y values is only dependent on X. Similarly, Z will also be involved in a multi-valued dependency (is "multi-dependent") with X [Halpin].

4) A bit more explanation may be required here. Many texts, when detailing 4NF, will use an example set that contains obviously unrelated facts. For example, one text uses Person, the Languages they speak, and Sports they play [Halpin]. This often leads to the rebuttal, "Well, I'd never have lumped languages and sports in the same table." True enough. Thus, I have attempted to form a table with a more likely mistake in it. Therefore, this example may initially be a bit confusing, as one tends to associate instruments with a music style. For example, have you ever heard a Steel Guitar played in any form of music other than Country/Western? Perhaps not, but the fact remains that a good steel Guitar player could attempt belt out Beethoven's Ninth Symphony, if they wanted to (shudder at the thought!).

REFERENCES
1) Becker, Scot A., "Normalization and ORM"

2) Elmasri and Navathe, Fundamentals of Database Systems, Second Edition (Chapter 12)

3) Halpin, Terry, Conceptual Schema and Relational Database Design, Second Edition (Section 10.2)

4) Kent, William, "A Simple Guide to Five Normal Forms in Relational Database Theory", Communications of the ACM, 26(2), February, 1983, 120-125

5) Reingruber and Gregory, The Data Modeling Handbook: A Best-Practice Approach to Building Quality Data Models (Chapter 8)

6) Simsion, Graeme, Data Modeling Essentials: Analysis, Design, and Innovation (Chapters 2 and 7)

In addition to the above references, the author would like to thank (in no particular order) Dr. Terry Halpin, A.J. Durham, Mike Bisek, Pat Hallock, Dr. Gordon Everest, and Nicole Birdsall for providing valuable feedback about this article while it was still in draft format. Any remaining errors are, of course, my own.


This is a revised version of a paper that was originally published as the basis of a Virtual Foxpro User Group lecture.