For many budding Data Scientists and Engineers, SQL tends to be the dark horse technology. Usually sidelined by a justified emphasis on scripting, or “real”, programming languages. In the SQL world, there are two even lesser languages. They’re not technically SQL, but reports may conflict on that last claim. Either way, their syntax is very similar to SQL. On top of that, all DML and DDL commands are run into query prompts exactly like regular queries.
Data Manipulation and Definition
Data Manipulation Language (DML) and Data Definition Language (DDL) are often given a very brief introduction and quickly forgotten about until you have to dip into the Database Administrator pool. Database Administrators are expected to be well versed in both careful Data Manipulation and thoughtful Definitions.
Engineers who are building an analytics infrastructure centered around a Data Warehouse, or one stop shop for analysis, must become a Database Administrator of a single database. Concerning themselves with topics such as; performance, availability, timeliness, and quality.
Careful Data Manipulation
Data Manipulation Language (DML) looks like INSERT, UPDATE, and DELETE. Meaning, we are adding rows, altering existing database objects, or removing them. Any kind of manual intervention like this has to be carefully done.
Carefully, double and triple checking the affected data both technically and with stakeholders. A technical tactic I have seen used, is making sure that the number of rows affected matches the expected, and rolling back changes if the match is not there.
Thoughtful Data Definitions
Data Definition Language (DDL) looks like CREATE, ALTER, PRIMARY KEY, and INTEGER/TEXT/DATE. Meaning, we are defining how our data is stored, organized, and gathered. Skipping this part of Data Warehouse administration can lead to exorbitant computational costs and inefficient queries.
Ultimately, an Analyst is the end user of a Data Warehouse. I recommend asking them what the most commonly used tables and columns are. If a database object is read often, then it must be efficient to read it. For learning more about efficient reads, I recommend the following book; SQL Performance Explained. It covers just about anything an Engineer would want to know about utilizing indexes for efficient reads.
However, in a Data Warehouse, the data is often stored in a columnar database. Data Warehouses are typically different than Relational Databases. Relational databases use keys to communicate between tables, and efficient key setting and getting is critical. On the other hand, Data Warehouses store data in a columnar fashion. Meaning, that large, analytical queries may perform better because all data from the same column is stored physically close to itself. So then, your DDL optimization comes in the form of column sorting keys or datatype definition, rather than index optimization.
Sorting Keys define which columns to prioritize in queries, much like an index would in a relational database, but with different guts. The other DDL concept is data type definition. I recommend storing your datatypes in a constant file of some sort, so that there are no surprises down the road when it comes to dirty data. Data Quality tests become increasingly important as you go down the road of thoughtful data definitions.
Conclusion
DML and DDL are the dark horses of SQL. SQL is the dark horse of the data industry (perhaps less so in recent years). So, that means DML and DDL are dark horses squared. There are learnings available in DML and DDL, despite how often they are ignored by interview processes.
Setting down a careful and thoughtful process for setting and manipulating database objects allows your insights to become more streamlined. A Data Warehouse that has stringent Database Object definition and manipulation, can spend less time being hampered by bloated computational or storage costs.
Curated Content
Advanced SQL window functions Quiz
Migrating Data Warehouses. Great to read what companies do in practice.
Basic Linux Commands Guide to more easily automate your infrastructure
About the Author and Newsletter
I automate data processes. I work mostly in Python, SQL, and bash. To schedule a consultancy appointment, contact me through ocelotdata.com.
At Modern Data Infrastructure, I democratize the knowledge it takes to understand an open-source, transparent, and reproducible Data Infrastructure. In my spare time, I collect Lenca pottery, walk my dog, and listen to music.
More at; What is Modern Data Infrastructure.