Database Normalization 3/4: De-normalization Anomalies
Data Modeling is an important skill for any data worker to understand. Data Engineers in particular should understand this topic and be able to apply it to their work. Normalization is an important part of Data Modeling and part one can be found here.
For this post, we will be going over the first part of the De-normalization Anamolies section of the Database Normalization Wikipedia page.
pt 3: De-normalization anomalies - this post
Anomaly.0
When an attempt is made to modify (update, insert into, or delete from) a relation, the following undesirable side-effects may arise in relations that have not been sufficiently normalized:
If we don’t normalize our database enough, problems/anomalies can happen. This section is a great reminder of Data Manipulation Language (DML) which can be expressed in the following SQL.
>> from part 2:
Normal Forms allow for easier Data Manipulation Language so that you don’t have to do many operations for simple insertions, updates, or deletes.
From the following informative article:
Examples of DML:
INSERT – is used to insert data into a table.
UPDATE – is used to update existing data within a table.
DELETE – is used to delete records from a database table
Update Anomaly.1
The same information can be expressed on multiple rows; therefore updates to the relation may result in logical inconsistencies.
One of normalization’s objectives is to remove duplicate data.
For example, each record in an "Employees' Skills" relation might contain an Employee ID, Employee Address, and Skill; thus a change of address for a particular employee may need to be applied to multiple records (one for each skill).
Removing duplicates allows for single updates to cover every single instance of our data that we’re updating.
If the update is only partially successful – the employee's address is updated on some records but not others – then the relation is left in an inconsistent state. Specifically, the relation provides conflicting answers to the question of what this particular employee's address is. This phenomenon is known as an update anomaly.
It is normal for different data sources to have slightly different versions of the truth. A much worse scenario is if we have a single source/database saying two completely different things because of an update anomaly like the one described above.
Insertion Anomaly.2
There are circumstances in which certain facts cannot be recorded at all.
This is an important reminder. Not all real life occurrences can be recorded neatly into tables rows and columns.
For example, each record in a "Faculty and Their Courses" relation might contain a Faculty ID, Faculty Name, Faculty Hire Date, and Course Code. Therefore, we can record the details of any faculty member who teaches at least one course, but we cannot record a newly hired faculty member who has not yet been assigned to teach any courses, except by setting the Course Code to null. This phenomenon is known as an insertion anomaly.
A normalized database should take into account for the possibility of edge cases or gray areas like the one described above. These can be handled with being less strict on NOT NULL constraints, but like everything, it’s context-specific decision making.
Deletion Anomaly.3
Under certain circumstances, deletion of data representing certain facts necessitates deletion of data representing completely different facts.
Deletion of data is always scary. You don’t know the future, so you can’t know with certainty that you won’t need the data you are deleting.
The "Faculty and Their Courses" relation described in the previous example suffers from this type of anomaly, for if a faculty member temporarily ceases to be assigned to any courses, we must delete the last of the records on which that faculty member appears, effectively also deleting the faculty member, unless we set the Course Code to null. This phenomenon is known as a deletion anomaly.
Being too strict with constraints can lead to not being able to store data that may fit in gray areas. As a callback to my very first data modeling newsletter, we can reference an Entity Relationship Diagram and remember that an entity can have zero references to another.
Handling no references is done in practice by not being so trigger happy with NOT NULL in your table creation SQL commands.
Finale
It really seems like we’ve hit some of the harder topics in other newsletters but this is still good database material to review, comb through, and mellow on. To emphasize the point, Data Engineers should be really good at thinking and speaking through data modeling problems. Data Analysts, Scientists, and any data worker only benefits from these skills as their SQL skills get stronger as a result.
Thanks for reading!