What does 'Advanced SQL' mean?
It's all over job requirements, but you never get a full-picture answer from anyone
I interviewed for a Metrics Analyst position in 2017, and the requirements included intermediate SQL. I wasn't sure what this really meant, and I hoped that my two upper-level database classes would be enough. The interview went something like this:
Interviewer: Can you tell me of a challenge you’ve solved in SQL?
Me: Sure thing. I used SQL at my previous internship to create a table for an executive dashboard. To create this table, or view, I wrote a query to handle a recursive one-to-many relationship. Basically, a parent row has many child rows, and those child rows have many child rows of their own, three levels deep. I used common table expressions to query each hierarchy level before joining them in a final view! I'm not sure how advanced that is. I saw you were looking for intermediate.
*awkward moment*
Interviewer: Ok, in case anybody asks you in the future if you know advanced SQL: you do.
Yet there's much more to SQL than recursive queries or self-joins.
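For a concrete picture of the pattern from that interview story, here is a minimal sketch of a recursive common table expression. The categories table and its id/parent_id columns are hypothetical stand-ins for a real hierarchy, and the syntax is PostgreSQL-flavored:

    -- Walk a parent/child hierarchy three levels deep.
    -- `categories(id, parent_id, name)` is a hypothetical table.
    WITH RECURSIVE hierarchy AS (
        SELECT id, parent_id, name, 1 AS depth
        FROM categories
        WHERE parent_id IS NULL          -- start at the root rows
        UNION ALL
        SELECT c.id, c.parent_id, c.name, h.depth + 1
        FROM categories c
        JOIN hierarchy h ON c.parent_id = h.id
        WHERE h.depth < 3                -- descend two more levels
    )
    SELECT * FROM hierarchy;

The recursive member keeps joining children onto the rows found so far, so one query covers every level instead of needing a separate CTE per level.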
My first answer to what advanced SQL means is data modeling.
What’s in a Data Modeling name?
Data Modeling can refer to statistical modeling or to how you organize your database objects. In this newsletter, I take Data Modeling to be about databases, as I have a collection of Data Modeling articles.
Two different ways of using data modeling
As an analyst, advanced SQL definitely entails strong data modeling knowledge. With data modeling, an analyst can tie together (JOINs), or break apart (normalization), the available and relevant data sources for a more complete analysis. Analyst queries tend to be more ephemeral, although this is becoming less true with industry-leading tools*.
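As a hedged illustration of tying sources together, here is the shape of a typical analyst query over two hypothetical normalized tables, customers and orders:

    -- Join normalized sources back together for analysis.
    -- `customers` and `orders` are hypothetical tables.
    SELECT c.region,
           SUM(o.amount) AS total_revenue
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    WHERE o.ordered_at >= DATE '2020-01-01'
    GROUP BY c.region;

Knowing how the sources are modeled is what tells you the join key, and whether the join will duplicate rows.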
Data modeling, for an engineer, is more about measuring twice before cutting. Our database commands, be they data definition, manipulation, or queries, are often automated by machines. Because these commands are automated, an engineer's data modeling decisions revolve around minimizing maintenance cost while maximizing impact on team efficiency and effectiveness.
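One way to picture measuring twice: encode your modeling assumptions as constraints, so automated commands fail loudly instead of silently corrupting data. This is only a sketch, with hypothetical table and column names:

    -- Constraints document the model and defend it against bad loads.
    CREATE TABLE order_items (
        order_id   BIGINT         NOT NULL REFERENCES orders (order_id),
        product_id BIGINT         NOT NULL REFERENCES products (product_id),
        quantity   INTEGER        NOT NULL CHECK (quantity > 0),
        unit_price NUMERIC(10, 2) NOT NULL,
        PRIMARY KEY (order_id, product_id)  -- one row per order/product pair
    );

A malformed automated insert now fails at write time, which is far cheaper to maintain than cleaning up after it downstream.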
Data Modeling is one of the first places my mind goes when I think about advanced SQL, but there's more that may be implied by job descriptions.
My second answer is understanding and using the underlying data structures: indexes or columns.
Data structures as a cost-cutting service
If you're querying a relational database, use the index as much as possible. If your database administrator has done a good job of indexing, using each table's index as much as possible will lead to faster queries and lower cloud computing bills. Relational databases use the B-tree data structure to store data, and the index is how the database traverses these B-trees quickly instead of scanning every row.
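As a hedged sketch of what using the index means in practice, here is a composite index matched to a frequent filter, with hypothetical table and column names:

    -- An index that matches the query's filter columns lets the
    -- database traverse the B-tree instead of scanning the table.
    CREATE INDEX idx_orders_customer_date
        ON orders (customer_id, ordered_at);

    SELECT order_id, amount
    FROM orders
    WHERE customer_id = 42
      AND ordered_at >= DATE '2020-01-01';

The filter columns appear in the same order as the index columns, which is what lets the B-tree traversal do the work.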
Resources
B-trees and their many variants are covered in much more detail in the first half of Database Internals.
If you also want to learn how and where to use the index, I recommend SQL Performance Explained, the very first book on my reading list for Data Engineers.
What about columnar databases?
Columnar databases do not use indexes the way relational databases do. They use unidirectional vectors, or column stores, to compress their data. This saves compute in big analytical queries, in which your database is gathering a lot of predictably similar data. More about columnar compression in my previous letter.
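To make that concrete, here is a hedged example of the kind of query a column store excels at, over a hypothetical page_views table:

    -- A columnar engine reads and decompresses only the columns named
    -- here and skips the rest of the table entirely.
    SELECT event_date,
           COUNT(*) AS views
    FROM page_views
    GROUP BY event_date
    ORDER BY event_date;

Because every value in the event_date vector is predictably similar, it compresses well, and the scan touches a fraction of the bytes a row store would read.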
The final advanced topics for engineers are transactional management and distributed databases.
What lurks beneath indexes and column stores
Transactional management can be thought of as the lifetime of a query. That lifetime flows through all of the following:
Client Communications Manager - client protocols
Process Manager - admission, dispatch, and scheduling
Query Parsing and Authorization - optimizer, plan executor, DDL
Transactional Storage Manager - access methods, concurrency control with buffer, lock, and log managers
Shared Components and Utilities - memory and replication management
That's a lot. So there's an acronym for us: ACID.
ACID is the set of four standards applied to database transactions: atomicity, consistency, isolation, and durability. As an analyst, you generally don't have to think about ACID, but an engineer might be required to think about these infrastructure aspects of SQL.
In 1983, Andreas Reuter and Theo Härder coined the acronym ACID, building on earlier work by Jim Gray, who named atomicity, consistency, and durability, but not isolation, when characterizing the transaction concept. These four properties are the major guarantees of the transaction paradigm, which has influenced many aspects of development in database systems.
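Here is a minimal PostgreSQL-flavored sketch of what those guarantees buy you, using a hypothetical accounts table:

    -- Move money between two accounts inside one transaction.
    BEGIN;

    UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
    UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;

    -- Atomicity: both updates persist together or not at all.
    -- Durability: once COMMIT returns, the change survives a crash.
    COMMIT;

    -- On any error before COMMIT, a ROLLBACK restores the prior state.

Isolation is what keeps a concurrent reader from seeing the moment when the money has left one account but not yet arrived in the other.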
There’s more to concurrent storage and retrieval
Concurrency management is a dense sub-topic as well. Database storage consistency is a hard problem, but if you are reading about lock and log managers, you're headed in the right direction.
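For a small taste of the lock manager, here is a hedged PostgreSQL-flavored sketch that takes an explicit row lock; the accounts table is again hypothetical:

    BEGIN;

    -- FOR UPDATE asks the lock manager to lock the matched row, so a
    -- concurrent transaction writing the same row waits for our COMMIT.
    SELECT balance
    FROM accounts
    WHERE account_id = 1
    FOR UPDATE;

    UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;

    COMMIT;  -- the lock is released; any waiting transaction proceeds

Meanwhile, the log manager is writing each change to the write-ahead log so the database can recover a consistent state after a crash.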
Access methods do not differ very much across common databases because the algorithms for data retrieval are pretty well established at the moment. You can read more about these algorithms in Architecture of a Database System. Be warned, it is expensive and dense. However, I will be writing a more fleshed-out post on transactional management and distributed databases soon / eventually.
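If you want to see which access method your database actually picked, here is a hedged PostgreSQL-flavored sketch, with the same hypothetical tables as above:

    -- EXPLAIN prints the plan without running the query: the access
    -- method per table (sequential scan, index scan, and so on) and
    -- the join algorithm the optimizer chose.
    EXPLAIN
    SELECT c.region, SUM(o.amount)
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.region;

Reading plans like this is the most direct bridge between writing SQL and the internals described in that paper.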
Conclusion
There may be all sorts of technical avenues to pursue, but ultimately, truly effective SQL is the SQL that solves business problems. The set of business problems to solve differs between engineers and analysts, so SQL expertise should reflect that difference. Thanks for reading!
Curated Content
*Tools like dbt and Dataform are making transformation less ephemeral and more effective for long-term business goals.
History of Joins in PostgreSQL and the algorithm used to plan JOIN operations
In a high level query and data manipulation language such as SQL, requests are stated non-procedurally, without reference to access paths. This paper describes how System R chooses access paths for both simple (single relation) and complex queries (such as joins), given a user specification of desired data as a boolean expression of predicates. System R is an experimental database management system developed to carry out research on the relational model of data. System R was designed and built by members of the IBM San Jose Research Laboratory.

How to Cultivate Production Excellence with Liz Fong-Jones, Honeycomb
About the Author and Newsletter
I am a small tech business owner of a data infrastructure consultancy shop, Ocelot Data. We have taken on consultancy projects at Stripe, Capital One, and Redmonk. If you would like to contract out an 8-12 week database infrastructure project, get in touch.
At Modern Data Infrastructure, I democratize the knowledge it takes to understand an open-source, transparent, and reproducible Data Infrastructure. In my spare time, I collect Lenca pottery, walk my dog, and listen to music.
More at: What is Modern Data Infrastructure.