How to Tackle an Intermediate SQL Problem
Walkthrough of a relational database problem co-authored by Milton D'az
There aren’t many intermediate SQL tutorials out there. As soon as you get to JOINs, it seems to either get really hard, really fast, or doesn’t exist. This blog post is an attempt to bridge that gap.
It is co-authored by Milton D’az and Angel D’az. Milton is a Business Analytics graduate student who focuses on writing ETL with Python and SQL.
The Tips
Start with the Data Model / Relational Database
In our opinion, any SQL question ideally starts with questions about the entities and relationships. Without understanding what ingredients we’re working with, we set ourselves up for a bad result. In our opinion, Data Modeling is critical to all data workers.
Filter Early and Often
We treat each query step as bulk ingredients for the following query. Each solve-able step has a query that works and the results from each of those queries are used to do *something*, whether that’s to filter further or update a column in a table.
We hope that this walk-through reviews these intermediate SQL skills and concepts in a way that goes beyond important and fundamental JOINs or GROUP BYs.
Pre-requisites
Since this is a technical post, here is a list of concepts we assume for the reader to already have under their belt
JOINs and ON statements
GROUP BY
Sub, Inner, or Nested Queries
If you would like to query these tables yourself, a MySQL server instance on your local machine is needed.
Schema Description
We have three tables; Inbound, Outbound, and LastChange.
Each of these three tables have three columns; DateComplete, Package, and Environment.
The CREATE TABLE and INSERT data SQL commands are available in Milton’s GitHub page here. Each table should be populated like the following three screenshots.
Note: DateComplete is all NULLs in LastChange
The Question
Using the **Outbound** and the **Inbound** tables, write a SQL script to find the package and environment table combination with the most recent **DateComplete** and update the LastChange table's DateComplete using SQL.
Our first thought is to break this question down into multiple, easier to digest, parts. Before writing a line of SQL queries, we draw out what’s going on. We have two blue source tables and one purple destination table.
The purple destination table is L (LastChange). This is the table that needs to be changed based on data on the blue source tables. Relational database concepts are critical to understanding how the SQL will be written and are nearly always a good starting point.
Gathering all Possible Recipe Ingredients
We think first SQL query we write, should combine our sources into one result.
We SELECT every column, no WHERE conditions, from both tables and we UNION so that we can see all rows from both source tables.
This is the collection of all of our data. And from here on out; we filter, filter, filter.
Filtering One Step Further
Now that we have all rows being returned, we want to filter out any Package/Environment (ABC/YZ) combinations that are not the most recent.
We decided to fit the above UNION query into the FROM statement for another query, as it acts like a table that has all relevant data available to query further.
5 rows are returned with every Package/Environment (ABC/YZ) combination possible, given the data that is available. These are the DateCompletes we want in LastChange table, or our destination table.
The MAX(datecomplete) and GROUP BY here are very important. They remove any duplicate Package/Environment (ABC/YZ) combinations that are do not have the most recent DateComplete value.
We are getting closer to achieving the updates wanted.
Explaining Sub-Queries
A pattern on how we are problem solving emerges. We query a large dataset, then use that result as a table to query a smaller dataset, then use that result as a table to query… and so on. It’ll help if we visualize this process
Each time we wrap the previous query, the resulting data gets smaller, as we are putting more conditions, or grouping, on what we want. There’s several ways to do this process, some more efficient than others, but our focus is that this process exists.
Temp tables of sorts
We need to store these Package/Environment combinations and their MAX dates somehow. In Microsoft SQL Server, a Common Table Expression (CTE) is possible. A Microsoft’s T-SQL would look like this:
But since we’re using MySQL, the same goal can be accomplished with a temp table. Preceding our latest query, we need a single line before the query, so that the results can be stored for our final UPDATE command.
We can SELECT from this temporary table. It is also available to JOIN on.
The Final JOIN (or Countdown)
We have a highly curated table, temp, that holds each ABC/YZ combination and it’s MAX date available.
We can JOIN this table on the destination (LastChange) table in order to update the destination’s DateComplete column. This final step can be easy to mess up, so if you are following along, I recommend having the following command handy to reset LastChange.
We want to update LastChange based on the dates from our temp table.
We don’t want any conditions except in the ON statement from the JOIN. The only condition is that for each row JOINing, the ABC/YZ combinations match.
Since we know temp’s DateComplete is accurate, we want to set LastChange’s DateComplete equal to temp’s DateComplete. And… viola!
In case you wouldn’t like to scroll back for the question:
Using the **Outbound** and the **Inbound** tables, write a SQL script to find the package and environment table combination with the most recent **DateComplete** and update the LastChange table's DateComplete using SQL.
LastChange table, DateComplete column, now has the most recent dates for each Package/Environment combination! Thanks for reading.