04bNormalization

Database Management

Mick McQuaid

University of Texas at Austin

03 May 2026

Overview

  • Normalization is the process of transforming a database so that anomalies, insertion, deletion, and update anomalies, are unlikely if not impossible
  • Normalization is usually conducted in phases called forms
  • Pretty much every database I’ve ever worked with has been normalized to at least third normal form, while many advance to fourth normal form or Boyce-Codd normal form

Relational Notation

The convention we’ll observe for writing out relations is as follows.

relationName( primaryKey, foreignKey, field1, field2, \(\ldots\) )

Some books follow the convention that the name of the relation is in all caps. We will not observe that convention here. We will follow the convention of underlining the primary key and italicizing foreign keys. If you are writing in plain text without the ability to underline or italicize, please obey the following convention.

relationName( primaryKey (pk), foreignKey (fk), field1, field2, \(\ldots\) )

1NF: First normal form

A relation is in first normal form, 1NF, if and only if all columns contain only atomic values—that is, each column can have only one value for each row in the table.

Whether an item is atomic is a matter determined by people.

SQL has no concept of non-atomic items. Oracle has an extension to SQL that explicitly defines some items as non-atomic. Hence, any relation specified in SQL is in 1NF unless that relation was specified using an extension to SQL.

Repeating groups

Many database textbooks teach that 1NF eliminates repeating groups, based on the notion that a repeating group is non-atomic. Some textbooks hearken back to the original definition of non-atomic and move the eliminate repeating groups into a more appropriate place in 2NF. Going forward in this class, I would like you to do just that—relegate eliminate repeating groups to 2NF.

Note that this means that you may, if you wish, specify a relational notation for a table in 1NF without reference to a primary key. For example

Supplier(sid, sname, rating, city)

is in 1NF without reference to a primary key, provided that each row is unique.

2NF: Second normal form

A relation is in second normal form, 2NF, if and only if it is in 1NF and every nonkey attribute is fully dependent on the primary key. An attribute is fully dependent on the primary key if it is is on the right side of a functional dependency for which the left side is either the primary key or something that can be derived from the primary key using the transitivity of functional dependencies.

Functional dependencies and 2NF

In many if not most applications, functional dependencies are implied in the ER diagram but not certain. For example, Apple obtains displays for its iPhones from several different suppliers. To uniquely identify a display item might require a concatenation of the item identifier and the supplier identifier. Alternatively, each item might have a separate identifier that incorporates the information about which supplier provided the item.

Functional dependency example

Suppose we know the functional dependencies

{Supplier}\(\rightarrow\){Name, Street, City, State, Zip}

and suppose further that we know that

{Zipcode}\(\rightarrow\){City, State}

Then the following ITEM relation is in 2NF

Item(ItemID, ItemName, Description, Returnable, Perishable, ShelfQty, Notes, Colors, RetailPrice, Cost, Supplier)

Creating a new table to satisfy 2NF

Notice that the information that is dependent on Supplier is no longer present. To create the required tables, I would also specify

Supplier(Supplier, Name, Street, Zipcode)

as a separate relation. But what happened to City and State? The above table would not be in 2NF if I included them because City and State ordinarily depend only on Zip. Since I noted that explicitly above, I’m going to skip the 1NF version of a Supplier table go right to a combination of Supplier and Zip as follows

Zipcode(Zipcode, City, State)

Repetition of the same name for two different things

You may object to using the same word to identify an entity and an attribute. Although we usually add ID to attributes named for entities, I don’t think SQL requires it. Also, SQL offers us a way to qualify ambiguous identifiers.

A questionable functional dependency

What about Colors? Is the following functional dependency legitimate?

{Item}\(\rightarrow\){Colors}

No. Knowing the Item does not determine the Colors, given that Colors has been described as multi-valued in the ER diagram. Therefore, it should be eliminated from the ITEM relation and a new relation should be specified as

Itemcolors(Item,Colors)

where each row uniquely identifies a combination of Item and Colors. There is only one superkey, one candidate key, and one primary key possible.

Third normal form

A relation is in third normal form, 3NF, if and only if it is in 2NF and every nonkey attribute is nontransitively dependent on the primary key.

What is transitivity? A transitive dependency exists if a non-key attribute determines a non-key attribute.

Codd’s edict

Edgar F Codd, the inventor of the relational databases (he actually called them data banks) asserted the following edict: A non-key column should depend on the key, the whole key, and nothing but the key where “the whole key” refers to 2NF and “nothing but the key” refers to 3NF

Example of a 3NF violation

I asked Opus 4.7 for an example of a 3NF problem. The result (edited):

Given the following relation and functional dependencies

Employee(EmpID, Name, DeptID, DeptName, DeptLocation)

{ EmpID } → { Name, DeptID } (key determines all)

{ DeptID } → { DeptName, DeptLocation } (department determines its own attributes)

DeptName and DeptLocation depend on EmpID only through DeptID. That’s transitive. DeptID is not a superkey, and DeptName/DeptLocation aren’t prime—so this violates 3NF.

Problems this causes (anomalies)

  • Update anomaly — change a department’s location and you must update every employee row in that department.
  • Insertion anomaly — you can’t record a new department until at least one employee is assigned to it.
  • Deletion anomaly — deleting the last employee in a department loses the department’s information entirely.

Decomposition into 3NF

Split into two relations:

Employee(EmpID, Name, DeptID) — DeptID is a foreign key

Department(DeptID, DeptName, DeptLocation)

Now each non-key attribute depends only on its table’s key. The decomposition is lossless (joinable back without spurious rows) and dependency-preserving.

Boyce-Codd normal form

A relation is in Boyce-Codd normal form, BCNF, if and only if every determinant (LHS of a functional dependency) is a candidate key.

Claude Opus says: 3NF allows one loophole: a non-trivial dependency X → A is permitted if A is a prime attribute, even when X isn’t a superkey. Boyce-Codd Normal Form (BCNF) closes that loophole — X must always be a superkey. So BCNF ⊆ 3NF. The tradeoff: every relation has a lossless, dependency-preserving 3NF decomposition, but not always a dependency-preserving BCNF one. That’s why 3NF remains the practical target in most designs.

Fourth normal form

Date (2004), page 385, gives the following definition of fourth normal form: Relvar R is in 4NF if and only if, whenever there exist subsets A and B of the attributes of R such that the nontrivial MVD A \(\twoheadrightarrow\) B is satisified, then all attributes of R are also functionally dependent on A.

To understand this definition, it is necessary to review Date’s definition of multi-valued dependence, MVD, which is as follows. Let R be a relvar and let A, B, and C be subsets of the attributes of R. Then we say that B is multidependent on A—in symbols A \(\twoheadrightarrow\) B (read A multi-determines B or simply A double arrow B)—if and only if, in every legal value of R, the set of B values matching a given AC value pair depends only on the A value and is independent of the C value.

Rationale

Data-intensive systems

  • We can group most computer systems into two categories
    • data-intensive systems
    • compute-intensive systems
  • In this class, we only consider data-intensive systems
  • Data-intensive systems can be grouped into two categories, according to Kleppmann (2017), page 385
    • Systems of record
    • Derived data systems
  • We’re mainly interested in systems of record here

Systems of record

  • A system of record, also known as source of truth,
  • holds the authoritative version of your data.
  • When new data comes in, e.g., as user input, it is first written here.
  • Each fact is represented exactly once (the representation is typically normalized).
  • If there is any discrepancy between another system and the system of record, then the value in the system of record is (by definition) the correct one.

Derived data systems

  • Data in a derived system is the result of taking some existing data from another

  • System and transforming or processing it in some way.

  • If you lose derived data, you can recreate it from the original source.

  • A classic example is a cache: data can be served from the cache if present, but if the cache doesn’t contain what you need, you can fall back to the underlying database.

  • Denormalized values, indexes, and materialized views also fall into this category.

  • In recommendation systems, predictive summary data is often derived from usage logs.

The importance of derived data

  • Technically speaking, derived data is redundant, in the sense that it duplicates existing information.
  • However, it is often essential for getting good performance on read queries.
  • It is commonly denormalized.
  • You can derive several different datasets from a single source, enabling you to look at the data from different “points of view.”

Systems of record and derived systems work together

  • Normalization protects systems of record
  • They must not have insertion, deletion, or update anomalies
  • Normalization may incur a performance penalty
  • Denormalizing a derived system may speed up performance

END

References

Date, C. J. 2004. Introduction to Database Systems. Pearson Education.
Kleppmann, Martin. 2017. Designing Data-Intensive Applications. O’Reilly.

Colophon

This slideshow was produced using quarto

Fonts are Roboto, Roboto Light, and Victor Mono Nerd Font