In my, So You Want to be a Data Engineer? post, I list the most fundamental skills for Data Engineers. The list includes; SQL, Python, Data Modeling, ETL, and API knowledge. In my opinion, the one list element with the least accessible learning materials out there is Data Modeling. Also known as Dimensional Modeling.
This long-term series of letters is my attempt to fill that void of accessible materials with digestible parts.
What is a Data Warehouse?
Before we begin, the reader must understand what a Data Warehouse is. A Data Warehouse is soft of a summary database. A Data Warehouse often uses multiple primary business databases and/or 3rd party data sources. It is built for Analysts to query business-relevant data for their analysis all in one location!
I’ll go over Wikipedia’s page on Dimensional Modeling, section by section.
First Wikipedia Sentence
Dimensional modeling (DM) is part of the Business Dimensional Lifecycle methodology developed by Ralph Kimball which includes a set of methods, techniques and concepts for use in data warehouse design.[1]:1258-1260[2]
Ralph Kimball most well known for architecting relational database techniques.
Perhaps the most well known one is the Star Schema. The Star Schema is a way to model your data, or organize, in a data warehouse. You don’t need to know the details of the Star Schema yet, but the basic concept is to set up tables “talking” to each other. The center table, or data, is the numbers. With the numbers, or facts, the strings/dates/etc. can surround it. Using the Star Schema to model your data warehouse can be seen a as little bit too traditional now because of how cheap storage is*, but it is an effective and proven way for Analysts to easily pull information to answer business questions.
*I’ll explain the connection between storage costs and data modeling in another post.
source: Star and SnowFlake Schema in Data Warehousing
The rest of the 1st paragraph
The approach focuses on identifying the key business processes within a business and modelling and implementing these first before adding additional business processes, a bottom-up approach.[1]:1258-1260 An alternative approach from Inmon advocates a top down design of the model of all the enterprise data using tools such as entity-relationship modeling (ER).[1]:1258-1260
Breaking down these sentences into the following 4 sections
1. Business processes
Data Modeling can be seen as coming from the business’s needs. If our business only sells teddy bears, it wouldn’t do much good to have a `chainsaws_sold` table in our Data Warehouse. That may be a trivial example, but the important principle is to not put the cart before the horse. Think of purpose before technicalities.
2. Bottom->up Modeling
Data Modeling is how we organize and pare down our business’s primary source data sources. If we only had data sources on teddy bears sold, then we would model our data warehouse organization on what we currently sell. If we were to add a chainsaw table in our source databases later on, then we model that into our databases at that time. But not sooner!
In other words, we model the warehouse only after source data is changed.
No pre-optimization.
3. Alternative top->down
What is the Inmon top-down approach? It means that the data warehouse is designed first and then the primary sources of data, or databases, are built off of the data warehouse. This approach is often more correlated with Entity Relationship Modeling (ERD)s.
4. What is Entity-Relationship Modeling (ERD) and how does it relate to Data Modeling?
ERDs are Data Modeling’s fastidious cousin. Both ERDs and Data Models visualize how tables are organized and how they might be linked together. However, ERDs go several steps further into detail. The lines between tables can have little symbols at the end. These symbols (arrows, two angled lines, two dashes, circles, etc.) define what type of relationship two tables (entities) have.
Ex:
For every Location, one and only one (two dashes) Location may have zero (circle) or many (two 45 degree angled lines) Persons born in it.
Data Modeling is generally not this detailed on its visualizations, but these table relationships exist and it is good to be aware of the logic behind them.
Finale
We’ll hit Wikipedia’s Description next time. Thanks for reading!