The Data Modeling Series pt.2
Digestible Data Modeling Blogging
Data Modeling is an important skill for any data worker to understand. Data Engineers in particular should understand this topic and be able to apply it to their work.
This post is a continuation of a data or dimensional modeling series. This series will probably have 5 parts. For pt.2 , we will go over the Description section in Wikipedia’s page on Dimensional Modeling.
Dimensional modeling always uses the concepts of facts (measures), and dimensions (context). Facts are typically (but not always) numeric values that can be aggregated, and dimensions are groups of hierarchies and descriptors that define the facts. For example, sales amount is a fact; timestamp, product, register#, store#, etc. are elements of dimensions.
Facts and Dimensions is a way of partitioning out columns (aka fields) out to different tables. So, in referencing back to the Star Schema section from pt.1, a fact table is going to have fact fields and dimension table is going to have dimension fields. These tables are linked together in a way that makes JOINs possible in SQL queries. More detail on these table links, or relationships, later.
Dimensional models are built by business process area, e.g. store sales, inventory, claims, etc. Because the different business process areas share some but not all dimensions, efficiency in design, operation, and consistency, is achieved using conformed dimensions, i.e. using one copy of the shared dimension across subject areas.
Conformed Dimensions definition from another Wikipedia page:
A conformed dimension is a set of data attributes that have been physically referenced in multiple database tables using the same key value to refer to the same structure, attributes, domain values, definitions and concepts. A conformed dimension cuts across many facts.
That’s a whole bunch of words. Basically what they are saying is that, the database modeling flows from a business’s needs. The horse, or business, goes before the cart, the data modeling.
Conformed dimensions is something new to me but from what I read, it is a dimension (or string/date/etc.) table that must meet at least one of two requirements.
Identical to the most granular, detailed, dimension
A dimension is a strict mathematical subset of another
This is a little conceptual so I’ll give out a few examples.
Conformed Dimensions ex. 1 — If I have a `days` table, it must be a perfect sub-set of a `weeks` table.
Conformed Dimensions ex. 2 — If I join between `date_dim` table and `sales_fact` table, the data from `date_dim` remains identical if I join between `date_dim` and `returns_fact` table.
Referential Integrity tangent
A conformed dimension is a much more specific form of referential integrity. Referential integrity could be it’s own series, like Data Modeling, to be honest. The main gist of referential integrity is that all references to data within the database, must be accurate and exist.
Referential Integrity ex. 1 — If I have bought a meal, on a University campus, with my student ID, then my ID must exist in the University’s `student` table. Not only must it exist, it has to also be a unique identifier with accurate field values about me.
Referential Integrity counter-ex. 1 — If I drop out of University, my ID is purged from a `student` table. I then attempt to, and succeed, in buying a meal on campus with my Student ID. That purchase's ID is now orphaned, in a way, and won't point to anything. Since I don't exist in the `student` table anymore. This violates referential integrity.
Dimensional modeling does not necessarily involve a relational database. The same modeling approach, at the logical level, can be used for any physical form, such as multidimensional database or even flat files. It is oriented around understandability and performance.
Dimensional modeling is not necessarily tied to relational databases. One loose example of real life dimensional modeling is a physical Dictionary.
IRL Data Modeling ex. 1 — You heard an obscure word, and are kind of hazy on what it means. You know another source of data, a dictionary, would have the definition. You don't have access to this definition, within your memory. So you look up this word on Webster's. You assume that the word you look up, is the same one you heard, so that relationship holds. Since that relationship holds, you now have access to the word's definition within your brain.
This section went over conformed dimensions and referential integrity. I hope they both give an idea on how fastidious databases can be. It is beneficial to care about small details with regards to Data Modeling. More on de-normalization and breaking all these strict rules in pt.4!
picture for thumbnail :)