Review of 11 Best Practices For Data Engineers
A Modern Data Infrastructure take on a vendor email
To {{ Data_Science_Manager }},
Ideally, you already have an engaged Data Engineer on your team who is willing to lend a skeptical eye to vendor emails! Unfortunately, you’re trying to learn more about Data Engineering for the very reason that eye isn’t available. You’re looking to pivot to hiring Data Engineers so that your Data Scientists can focus on what they like: models.
Until you’ve turned that ship around, this letter can be your critical eye for vendor emails.
Snowflake is a major Data Warehouse vendor. Snowflake even has billboards. Like many other software vendors, they put out learning content and blogs that mix useful advice with marketing bias.
I recently received an email from them titled:
11 Best Practices for Data Engineering
And that piqued my interest, because I thought, “Hahah. I do that.” I will review each Best Practice and give my expert opinion from a Modern Data Infrastructure point of view.
1. ENABLE YOUR PIPELINE TO HANDLE CONCURRENT WORKLOADS
A modern data pipeline that lives in the cloud features an elastic multi-cluster, shared data architecture that enables the handling of concurrent workloads.
Are your workflows in SQL, with the data going stale before the workflows finish? Then this best practice is true. I will never argue against a hard business requirement for timely data.
However, I don’t see this as a need for companies whose SQL pipelines are already timely enough. Don’t pre-optimize with parallel/concurrent computing (Spark, Scala) if you don’t need to.
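If you do hit that requirement, the fix can often stay in plain SQL. Here is a minimal sketch, assuming a Snowflake account with multi-cluster warehouses enabled (an Enterprise Edition feature); the warehouse name is made up, and it only scales out when concurrent queries start queuing:

-- A multi-cluster warehouse that adds clusters only under concurrent load
CREATE WAREHOUSE IF NOT EXISTS elt_wh
  WAREHOUSE_SIZE = 'SMALL'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3         -- scale out when queries queue up
  SCALING_POLICY = 'STANDARD'
  AUTO_SUSPEND = 300            -- suspend after 5 idle minutes
  AUTO_RESUME = TRUE;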
2. TAP INTO EXISTING SKILLS TO GET THE JOB DONE
Absolutely. Maximize SQL as much as you can, and automate your team’s existing, recurring SQL queries.
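As a sketch of what that automation can look like, assuming Snowflake tasks are available on your account (the table, warehouse, and schedule below are invented for illustration):

-- Run a recurring rollup query every morning without an external orchestrator
CREATE TASK IF NOT EXISTS refresh_daily_sales
  WAREHOUSE = elt_wh
  SCHEDULE = 'USING CRON 0 6 * * * UTC'   -- 06:00 UTC, daily
AS
  INSERT INTO daily_sales
  SELECT order_date, SUM(amount) AS revenue
  FROM raw_orders
  GROUP BY order_date;

ALTER TASK refresh_daily_sales RESUME;    -- tasks are created suspended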
3. USE DATA STREAMING INSTEAD OF BATCH INGESTION
Having extremely fine-grained data, like real-time shopping data, is a luxury in a world where data integrity/quality issues exist.
Not a “best practice,” as I would define it, for 90%+ of companies.
4. STREAMLINE PIPELINE DEVELOPMENT PROCESSES
Paraphrase: Use a Cloud Data Platform to manage test and production environments in ELT development.
Using cloud test and production environments is a Software Engineering fundamental.
As a general proponent of data ops, I believe in bringing software engineering best practices 🙄 over to data teams. This “best practice” is a subset of the following one.
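To be fair to the vendor, a test environment on a cloud warehouse can be a one-liner. A minimal sketch, assuming Snowflake’s zero-copy cloning and hypothetical database names; the clone shares storage with production, so it costs nothing extra until you modify data:

-- Spin up a full test copy of production without duplicating storage
CREATE DATABASE analytics_test CLONE analytics_prod;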
5. OPERATIONALIZE PIPELINE DEVELOPMENT
Paraphrase: Follow Data Ops
I would generally agree. I write “generally” because 18 principles (the DataOps Manifesto’s full count) seems like overkill against my personal keep-it-simple values.
6. INVEST IN TOOLS WITH BUILT-IN CONNECTIVITY
Paraphrase: Cloud storage lets you easily store data from REST APIs and databases.
Absolutely agree. Built-in connectivity makes your code simpler. Being able to write a simple load statement in SQL (many flat files -> single database table) like the following makes life easier (sketched in Snowflake’s dialect; the bucket and credentials are placeholders):

-- Loads every file under the S3 prefix into one table
COPY INTO tablename
FROM 's3://my-bucket/exports/'
CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
FILE_FORMAT = (TYPE = CSV);
7. INCORPORATE EXTENSIBILITY
By using APIs and pipelining tools, you can create a data flow that uses outside code seamlessly.
Hard Agree. Extraction is a fundamental.
3rd party applications and vendors make your data available through three main methods: CSV export, a dashboard that doesn’t answer your specific questions, and a REST API.
Of the three, leveraging 3rd party APIs lets you put the maximum amount of generated data into your Data Lake or Warehouse.
8. ENABLE DATA SHARING IN YOUR PIPELINES
Your Data Infrastructure code should all live in the same place; at the very least, the code dealing with Automation, Extraction, and finally Loading*. This eliminates data silos.
*Climbing up the Data Hierarchy of Needs.
9. CHOOSE THE RIGHT TOOL FOR DATA WRANGLING
Everyone wants cleaner data. Data Integrity and Quality issues rarely stem from tool choice, however. They usually stem from business priorities or disparate data sources.
10. BUILD DATA CATALOGING INTO YOUR ENGINEERING STRATEGY
Analysts may have questions about the data in your pipeline such as where it came from, who has accessed it, or which business process owns it. A data scientist may need to view the data in its raw form in order to ensure its veracity… Build a data catalog that keeps track of the data lineage so you can trace the data if needed.
I would rank Data Cataloging, or Data Lineage, as one of the last things a company should do before taking on Analytics Optimization in Spark. Historical views of your data can be achieved by building your own Temporal Tables, paying a 3rd party some money, or using your existing cloud connectivity with the AWS Glue metadata repository.
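A hand-rolled Temporal Table is less exotic than it sounds. A minimal, type-2-style sketch with invented names: every change closes the old row and opens a new one, so you can query the data as of any date.

-- History table: a NULL valid_to marks the current row
CREATE TABLE customer_history (
  customer_id INTEGER,
  email       VARCHAR,
  valid_from  TIMESTAMP,
  valid_to    TIMESTAMP
);

-- "As of" query: what was this customer's email on 2020-06-01?
SELECT email
FROM customer_history
WHERE customer_id = 42
  AND valid_from <= '2020-06-01'
  AND (valid_to > '2020-06-01' OR valid_to IS NULL);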
11. RELY ON DATA OWNERS TO SET SECURITY POLICY
I believe this principle goes beyond security. Data Owners/Stakeholders will always know their data the best. Meet your stakeholders where they’re at.
To be specific, this means the Data Engineer does the work so that the data is easy to use with whatever tool the stakeholder prefers: SQL, Python, or Excel. Showing that level of dedication will make it easier for the stakeholder/owner to weigh in on which tables or columns are of most interest. Insist on the teamwork.
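In practice, “doing the work” is often as simple as publishing a clean view over the raw tables. A hedged sketch with invented schema and column names; SQL users query the view directly, and Python or Excel users connect to the same object:

-- One analyst-friendly view instead of three explanations of the raw schema
CREATE OR REPLACE VIEW marketing.campaign_performance AS
SELECT
  c.campaign_name,
  DATE_TRUNC('week', e.event_ts) AS week,
  COUNT(*) AS clicks
FROM raw.ad_events AS e
JOIN raw.campaigns AS c
  ON c.campaign_id = e.campaign_id
GROUP BY c.campaign_name, DATE_TRUNC('week', e.event_ts);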
Conclusion
I agree the most with numbers 5, 6, 7, 8, and 11. Achieving these five fundamentals will put your Data Infrastructure on solid footing, without data silos or, in the worst-case scenario, data loss. Thanks for reading!
Data Engineering Content
This fantastic talk touches on all of the principles of a transparent and reproducible Data Infrastructure covered in this newsletter.
Follow Emilie here.
Columnar Databases for Data Warehouses, explained by Stitch Data.
What’s an API? - Technically Newsletter
About the Author and Newsletter
I automate data processes. I work mostly in Python, SQL, and bash. In my spare time, I collect Lenca pottery, walk my dog, and listen to music.
At Modern Data Infrastructure, I democratize the knowledge it takes to understand an open-source, transparent, and reproducible Data Infrastructure.
More at: What is Modern Data Infrastructure.