Data normalization
What a normalized database looks like and why table structure matters.
Data normalization is the process of structuring information in a database to cut down on redundancy and make that database more efficient. Think of normalization as a way to make sure that every field and table in your database is organized logically, so that you can avoid data anomalies when inserting, updating, or deleting records. This process is carried out according to specific rules that dictate how tables should be organized.
Normalization is one part of the larger data cleaning and standardization process, which also involves confirming that your data is accurate, complete, and doesn’t contain duplicate records, as well as ensuring that you’ve selected the appropriate data types for your fields. If you’re starting with denormalized tables, the normalization process will involve creating additional, smaller tables that can be joined to one another by a foreign key. Maybe you’ve become frustrated with having to update the same information in multiple places across your database after a single value changes, or are finding that you’re losing valuable data when a record gets deleted. Normalizing your tables will help in both of these cases.
The principles we’ll cover in this lesson apply to relational database management systems (RDBMS). If you’re using a NoSQL or document-based database like MongoDB, the information below won’t apply.
Simplification and reduced storage: benefits of normalized data
Normalization is all about making your data more efficient, so that your team can find and use the information they need. These benefits and rules may seem like common sense once you have some familiarity with how databases work, but it pays to know the explicit purpose of each table and field within your database. Benefits of normalized data include:
- Simplifying transactional queries. With normalized data, a query for customer addresses only needs to look in the single field that stores those addresses. If you store customer addresses multiple times in different locations across your database or even keep multiple addresses within the same field, that query will take longer to execute.
- Decreasing the size of your database. If you repeat customer data in several locations across your database, that means you’ve made space to store that information several times. This may not be a major concern if your database only contains a few tables, but if you’re working on a larger scale, disk space can be at a premium. Reducing duplicate information means cutting storage costs, whether you’re running a local server or relying on a cloud-hosted database.
- Making database maintenance easier. Think of that same customer data stored several times in your database. Each time a customer changes their address, it will need to be updated in every instance of a `Customer Address` field, which leaves a lot of room for error. If your data is normalized, you’ll only have one `Customer Address` field, which joins to other relevant tables like `Orders` (see the SQL sketch below).
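To make the query and maintenance benefits concrete, here’s a minimal SQL sketch (SQLite-flavored; the `Customers` and `Orders` tables and their columns are illustrative assumptions, not a prescribed schema):

```sql
-- Each address is stored exactly once, on the customer.
CREATE TABLE Customers (
    customer_id      INTEGER PRIMARY KEY,
    customer_address TEXT NOT NULL
);

CREATE TABLE Orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES Customers (customer_id)
);

-- One UPDATE propagates everywhere, because orders reach the
-- address through a join rather than holding their own copy.
UPDATE Customers
SET customer_address = '42 New Street'
WHERE customer_id = 7;

-- A transactional query has a single field to look in.
SELECT o.order_id, c.customer_address
FROM Orders AS o
JOIN Customers AS c ON c.customer_id = o.customer_id;
```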
Data anomalies
Data anomalies are inconsistencies in how information is stored within a database. These structural flaws become apparent when something goes wrong as a record is updated, added, or deleted. Fortunately, adhering to the rules of normalization can prevent these anomalies from happening in the first place.
Update anomaly
Update anomalies stem from data redundancy. For example, let’s say your database stores customer address information in fields across several tables. A customer changing their address may result in only one of those fields updating to include the new information, leaving you with inconsistent data.
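Here’s a sketch of how that plays out in SQL (a hypothetical schema, invented for illustration): when the same address lives in two tables, an UPDATE that touches only one of them leaves the database inconsistent.

```sql
-- Denormalized: the same address is stored in two places.
CREATE TABLE Customers (
    customer_id INTEGER PRIMARY KEY,
    address     TEXT NOT NULL
);

CREATE TABLE Shipments (
    shipment_id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    address     TEXT NOT NULL    -- redundant copy of Customers.address
);

-- The customer moves. If only one table is updated...
UPDATE Customers SET address = '42 New Street' WHERE customer_id = 7;
-- ...Shipments still holds the old address: an update anomaly.
```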
Insertion anomaly
An insertion anomaly occurs when a record can’t be created without certain fields containing data — data that may not exist yet. For example, a denormalized database may be structured so that a customer account can’t be created unless that customer has placed an order. Normalizing that database would solve this problem through the creation of separate `Orders` and `Customers` tables, with no rule prohibiting null values.
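Here’s a standalone sketch of that situation (the tables and NOT NULL constraints are assumptions for illustration): the denormalized table can’t accept a customer without an order, while the normalized pair can.

```sql
-- Denormalized: customer and order data share a row, and the
-- order columns are required.
CREATE TABLE CustomerOrders (
    customer_id   INTEGER NOT NULL,
    customer_name TEXT    NOT NULL,
    order_id      INTEGER NOT NULL,  -- forces an order to exist
    order_date    TEXT    NOT NULL
);

-- This fails (NOT NULL violation): the new customer has no order yet.
-- INSERT INTO CustomerOrders (customer_id, customer_name)
-- VALUES (8, 'Ada Palmer');

-- Normalized: a customer row no longer needs an order.
CREATE TABLE Customers (customer_id INTEGER PRIMARY KEY,
                        customer_name TEXT NOT NULL);
CREATE TABLE Orders (order_id INTEGER PRIMARY KEY,
                     order_date TEXT NOT NULL,
                     customer_id INTEGER REFERENCES Customers (customer_id));

INSERT INTO Customers VALUES (8, 'Ada Palmer');  -- succeeds, no order on file
```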
Deletion anomaly
Unintentional information loss is the result of a deletion anomaly. Let’s say a table in your database includes information about university courses and the students who take those courses. If one course were cancelled due to low enrollment, you might inadvertently lose valuable student information by removing that course record. As with insertion anomalies, breaking your data out into multiple, specific tables would prevent this issue.
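A quick sketch of the courses example (a hypothetical single-table layout): cancelling a course deletes the only rows that mention its students’ names.

```sql
-- Denormalized: course and student facts share one table.
CREATE TABLE CourseEnrollments (
    course_name  TEXT NOT NULL,
    student_id   INTEGER NOT NULL,
    student_name TEXT NOT NULL
);

-- Cancelling the course removes every row that mentions it,
-- and the student names stored on those rows disappear too.
DELETE FROM CourseEnrollments WHERE course_name = 'Intro to Weaving';
-- With separate Students, Courses, and Enrollments tables, dropping
-- the course would leave the student records intact.
```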
Rules of normalization
The rules for normalizing data were first introduced in the early 1970s. These rules are grouped in tiers called normal forms. Each tier builds upon the last — you can only apply the second tier of rules if your data already meets the first tier of rules, and so on. While there are several more normal forms beyond the three listed below, these first three are sufficient for most use cases.
As we covered in our introduction to databases, tables within a database should contain an entity key, also known as a primary key. This field distinguishes each row within a table according to a unique ID, and is helpful when joining tables. Before we can even get into first normal form, your table needs to have an entity key field.
First normal form (1NF)
The first normal form (1NF) dictates that each field within a table should only store one value, and that your table shouldn’t contain multiple fields that store similar information, like columns titled Address1 and Address2.
Here’s an example of a table that we’ll normalize according to first normal form. This table includes information about college courses and who teaches them.
Professor table
| Professor ID | Professor name | Course name |
| --- | --- | --- |
| P001 | Gene Watson | Intro to Philosophy; Ethics |
| P002 | Melissa King | Quantum Mechanics |
| P003 | Errol Tyson | Macroeconomics |
| P004 | Mary Jacobson | Graphic Novels |
We notice that while our fields are distinct, one professor (Gene Watson, in the first row) is teaching two courses, and that information is currently stored within a single cell. If we normalize this table according to 1NF, we’ll need to break our data out into multiple tables:
Normalized professor table
| Professor ID | Professor name |
| --- | --- |
| P001 | Gene Watson |
| P002 | Melissa King |
| P003 | Errol Tyson |
| P004 | Mary Jacobson |
Normalized course table
| Course ID | Course name | Professor ID |
| --- | --- | --- |
| C001 | Intro to Philosophy | P001 |
| C002 | Ethics | P001 |
| C003 | Quantum Mechanics | P002 |
| C004 | Macroeconomics | P003 |
| C005 | Graphic Novels | P004 |
Since one professor can teach more than one course, we’ve broken this data out into two tables. Now, our `Professor` table has a one-to-many relationship with our `Course` table. This new table structure meets first normal form and joins the two tables via a foreign key, the `Professor ID` field.
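Expressed as SQL DDL (a sketch; the column types are our assumptions), the normalized structure looks like this, with `Professor ID` serving as the foreign key:

```sql
CREATE TABLE Professor (
    professor_id   TEXT PRIMARY KEY,  -- e.g. 'P001'
    professor_name TEXT NOT NULL
);

CREATE TABLE Course (
    course_id    TEXT PRIMARY KEY,    -- e.g. 'C001'
    course_name  TEXT NOT NULL,
    professor_id TEXT NOT NULL REFERENCES Professor (professor_id)
);

-- One-to-many: both of Gene Watson's courses point back to one row.
INSERT INTO Professor VALUES ('P001', 'Gene Watson');
INSERT INTO Course VALUES ('C001', 'Intro to Philosophy', 'P001');
INSERT INTO Course VALUES ('C002', 'Ethics', 'P001');
```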
Second normal form (2NF)
Second normal form is about reducing redundancy and making sure that every field describes something about what the entity key identifies. To meet 2NF, all fields in a table that aren’t the entity key must be fully dependent on the table’s entity key (which may be a composite key made up of two fields). Let’s look at a new example — a table that includes information about your employees’ birthdays.
Employee birthday table
| Employee ID | Birthday | Department |
| --- | --- | --- |
| E001 | November 18 | Accounting |
| E002 | March 29 | Sales |
| E003 | June 1 | Marketing |
| E004 | February 7 | Accounting |
This table meets 1NF, because each column is distinct and only holds one value within each cell. However, this table has a composite key: `Employee ID` + `Birthday` combined make up the table’s entity key. This table does not meet 2NF in its current state, because the `Department` field only partially depends on the composite key: an employee’s department doesn’t depend on their birthday, only on their employee ID. To fix this, we’ll break this information out into two tables:
Normalized employee birthday table
| Employee ID | Birthday |
| --- | --- |
| E001 | November 18 |
| E002 | March 29 |
| E003 | June 1 |
| E004 | February 7 |
Normalized employee department table
| Employee ID | Department |
| --- | --- |
| E001 | Accounting |
| E002 | Sales |
| E003 | Marketing |
| E004 | Accounting |
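In DDL form (a sketch; column types assumed), each non-key field now depends on the whole key of its table, and a join reassembles the original view:

```sql
CREATE TABLE EmployeeBirthday (
    employee_id TEXT NOT NULL,
    birthday    TEXT NOT NULL,
    PRIMARY KEY (employee_id, birthday)  -- the composite key from above
);

CREATE TABLE EmployeeDepartment (
    employee_id TEXT PRIMARY KEY,
    department  TEXT NOT NULL            -- depends only on employee_id
);

-- The pre-split view of the data is still recoverable.
SELECT b.employee_id, b.birthday, d.department
FROM EmployeeBirthday AS b
JOIN EmployeeDepartment AS d ON d.employee_id = b.employee_id;
```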
Third normal form (3NF)
A table meets third normal form if (in addition to meeting 2NF) it doesn’t contain any transitive dependency. Transitive dependency happens when Column A depends on Column B, and Column B depends on the entity key. If you want to normalize according to 3NF, you’ll need to remove Column A from the table, since it does not depend on the entity key directly, and place it in a different table with its own entity key.
Orders table
| Order ID | Order date | Customer ID | Customer zip code |
| --- | --- | --- | --- |
| R001 | 01/17/2021 | C032 | 99702 |
| R002 | 03/01/2021 | C004 | 39204 |
| R003 | 06/30/2021 | C054 | 06505 |
| R004 | 08/22/2021 | C010 | 84098 |
| R005 | 09/27/2021 | C004 | 39204 |
This table isn’t in third normal form because the `Customer zip code` field is dependent on `Customer ID`, which is not this table’s entity key (the entity key here is `Order ID`). Our current structure could lead to unwanted information loss; if customer C032 returned their order and we needed to delete this record, we’d unintentionally lose their zip code information. If customer C004 ever moved and their zip code changed, we’d also have to update it in two places, since they’ve placed multiple orders. To bring this table into 3NF — you guessed it — we’re going to break it out into two tables.
Normalized orders table
| Order ID | Order date | Customer ID |
| --- | --- | --- |
| R001 | 01/17/2021 | C032 |
| R002 | 03/01/2021 | C004 |
| R003 | 06/30/2021 | C054 |
| R004 | 08/22/2021 | C010 |
| R005 | 09/27/2021 | C004 |
Normalized customers table
| Customer ID | Customer zip code |
| --- | --- |
| C032 | 99702 |
| C004 | 39204 |
| C054 | 06505 |
| C010 | 84098 |
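As a sketch in SQL (column types assumed), the 3NF split means a move is one UPDATE and a returned order is one DELETE, with no collateral loss:

```sql
CREATE TABLE Customers (
    customer_id TEXT PRIMARY KEY,
    zip_code    TEXT NOT NULL
);

CREATE TABLE Orders (
    order_id    TEXT PRIMARY KEY,
    order_date  TEXT NOT NULL,
    customer_id TEXT NOT NULL REFERENCES Customers (customer_id)
);

-- C004 moves: one UPDATE covers both of their orders at once.
UPDATE Customers SET zip_code = '39211' WHERE customer_id = 'C004';

-- C032 returns their order: only the order row goes away,
-- and their zip code survives in Customers.
DELETE FROM Orders WHERE order_id = 'R001';
```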
Drawbacks to normalization: when to denormalize
Once you reach higher levels of normalization, your database may perform certain analytical queries at a slower rate — particularly those that need to grab a lot of data. Since normalized data demands that a database tap into several tables to perform a query, this can take longer, especially as your database grows in complexity. The tradeoff is speed for space: normalized data takes up less room but requires more joins at query time, which is why read-heavy analytical systems are sometimes deliberately denormalized.
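For a sense of the cost, here’s a hypothetical analytical query over a fully normalized schema (every table and column name below is invented for illustration; `strftime` is SQLite’s date formatter). A denormalized reporting table could answer the same question without the chain of joins:

```sql
-- Revenue per department per month, walked across five tables.
SELECT d.department_name,
       strftime('%Y-%m', o.order_date) AS month,
       SUM(li.quantity * p.unit_price) AS revenue
FROM Orders AS o
JOIN LineItems   AS li ON li.order_id     = o.order_id
JOIN Products    AS p  ON p.product_id    = li.product_id
JOIN Employees   AS e  ON e.employee_id   = o.salesperson_id
JOIN Departments AS d  ON d.department_id = e.department_id
GROUP BY d.department_name, month;
```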