Nicole Mitich is the content marketing manager @ Census. She's carried a love for reading and writing since childhood, but her particular focus is on streamlining technical communication through writing. She loves seeing (and helping) technical folks share their wisdom.
Somewhere along the line, folks began defining data engineers as people who worked in specific technologies like Hadoop or Spark. From there, they gained a reputation for being super complex and technical.
While they can be both of those things, that doesn't mean analysts and business users should just ignore everything data engineering-related. Sure, technology changes, but the goal of data engineers remains the same: to turn raw, messy, complex data into high-quality information other teams can use. And because engineering makes the whole data machine run, understanding the lifecycle data moves through makes you a better data collaborator, regardless of your role.
What is a data engineering lifecycle?
The data engineering lifecycle is a method for overseeing data engineering processes including data acquisition, integration, storage, processing, and analysis. This lifecycle incorporates structured and interconnected stages aimed at consistently delivering high-quality data engineering projects. The main objective is to assist data engineers in creating reliable, high-quality data sets that are suitable for their intended use and can aid in business decision-making.
In his talk, Matt split the data engineering lifecycle into five stages (generation, storage, ingestion, transformation, and serving). As a teaser (if you haven't read it already), he drew on concepts from "Fundamentals of Data Engineering" to introduce these stages to data practitioners of all flavors and help them collaborate better with data engineers to deliver outstanding data products.
Data engineering challenges
Matt started out by tackling the two biggest challenges currently facing data engineers: Communication and holistic thinking.
For starters, communication needs to flow both ways. Engineers need to clearly express what's happening to data, and stakeholders need to clearly express what raw data is or what finished data should look like.
"We need communication to make sure data engineers understand what the data is before they turn it into a useful product, and to make sure those products are the right things," Matt explained. "If you can become a better communicator as a stakeholder or as an engineer, you'll be way more successful in delivering results."
Holistic thinking, on the other hand, means getting past the technology and considering the big picture. That's where understanding the data engineering lifecycle comes in clutch.
"We need to think about where data starts, where it ends up, how it flows through the pipeline, and how we maintain quality as it moves," Matt said. "That's what a 'holistic view of data' means."
The data engineering lifecycle
In "Fundamentals of Data Engineering," Matt and his co-author, Joe Reis, introduced the lifecycle concept. Essentially, the lifecycle breaks the data pipeline we all know and love into its critical pieces to see where data starts and how we maintain quality as it moves to the end.
Generation
In the first stage of the lifecycle, data is born. It's created in a variety of platforms, and data engineers rarely have control over any of them. The engineers need to communicate with app developers and platform experts to understand what's being created.
At this stage, the data "works" only within its source platform. It's not yet ready for consumption by operational analytics, BI, machine learning, reverse ETL, or any other application.
Ingestion
In the next stage, the raw data moves into the engineer's pipeline. It's still messy, but now it's in a place where we can clean it up.
Transformation
In the transformation stage, data engineers start working with the data. It gets modeled, filtered, and joined. Depending on the desired result, engineers might start working with statistics and aggregations. Our homely little data caterpillar is quickly becoming a butterfly.
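To make "filtered, joined, and aggregated" a bit more concrete, here's a minimal sketch in plain Python. It isn't from Matt's talk or the book: the records, field names, and the `transform` function are all made up for illustration, and a real pipeline would more likely do this in SQL or a dataframe library.

```python
from collections import defaultdict

# Hypothetical raw records, roughly as they might look right after ingestion.
raw_orders = [
    {"order_id": 1, "customer_id": "c1", "amount": 40.0, "status": "complete"},
    {"order_id": 2, "customer_id": "c2", "amount": 15.5, "status": "cancelled"},
    {"order_id": 3, "customer_id": "c1", "amount": 10.0, "status": "complete"},
    {"order_id": 4, "customer_id": "c3", "amount": 20.0, "status": "complete"},
    {"order_id": 5, "customer_id": "c2", "amount": None, "status": "complete"},
]
raw_customers = [
    {"customer_id": "c1", "region": "EMEA"},
    {"customer_id": "c2", "region": "AMER"},
    {"customer_id": "c3", "region": "AMER"},
]


def transform(orders, customers):
    """Filter, join, and aggregate raw records into revenue per region."""
    # Filter: keep completed orders that have a usable amount.
    clean = [
        o for o in orders
        if o["status"] == "complete" and o["amount"] is not None
    ]
    # Join: look up each order's region through its customer.
    region_by_customer = {c["customer_id"]: c["region"] for c in customers}
    # Aggregate: total revenue per region.
    revenue = defaultdict(float)
    for order in clean:
        region = region_by_customer.get(order["customer_id"], "UNKNOWN")
        revenue[region] += order["amount"]
    return dict(revenue)


revenue_by_region = transform(raw_orders, raw_customers)
print(revenue_by_region)
```

The cancelled order and the order with a missing amount are dropped by the filter, so only three orders make it into the regional totals.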
Serving
In the final stage of the data engineering lifecycle, the data (now transformed into useful information) gets served up to the end user. It might be delivered to a dashboard, a data science team, or a BI platform.
Undercurrents flow across lifecycle stages
Six foundational data engineering concepts flow across the stages of the lifecycle. Engineers need to keep these undercurrents in mind no matter what stage of the pipeline they're working with.
Security. As data pros, we can never forget the importance of security. At every stage of the data engineering lifecycle, we need to keep private data private and protected from misuse.
Data management. Matt defines this as best practices like governance, maintaining data quality, and keeping track of what data is collected and where it's stored.
Data architecture. We're not just talking about individual technologies that come and go. This is the big picture of how data gets processed and flows through systems.
Orchestration. Orchestrating data means coordinating all the moving parts of the lifecycle.
Software engineering. OK, yes, data engineers aren't defined by their tools, but they do need to be proficient in the pipeline's technology.
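As a rough illustration of the orchestration undercurrent, "coordinating all the moving parts" often boils down to running tasks in an order that respects their dependencies. The sketch below uses Python's standard-library `graphlib` (Python 3.9+) to work out such an order. The task names and the `dependencies` mapping are hypothetical, and real orchestrators layer scheduling, retries, and monitoring on top of this idea.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "ingest_orders": set(),
    "ingest_customers": set(),
    "transform": {"ingest_orders", "ingest_customers"},
    "quality_check": {"transform"},
    "serve": {"quality_check"},
}


def run_order(deps):
    """Return one valid execution order in which every task follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

The two ingestion tasks don't depend on each other, so either may come first. What the ordering guarantees is that `transform` runs after both of them and `serve` runs last.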
You don't get data science without data engineering
Monica Rogati's infamous data science hierarchy of needs illustration
But you can't get there without a solid foundation of less-sexy layers like instrumentation, infrastructure, and anomaly detection.
"Everyone wants to deliver a really awesome dashboard or a model to transcribe speech or a live operational analytics dashboard for situational awareness," Matt said. "But to do that, you have to have these layers underneath and you have to have communication between the different layers."
Even if you have every layer of the pyramid in place, Matt said, if the data engineers in the basement don't communicate the awesome things they're doing with the analysts halfway up, and the analysts don't tell the engineers what kind of insights they want to create, you won't successfully execute the functions at the top.
So, despite its reputation, data engineering is not a mystical art. Nor is it simply being really, really good at Airflow. It's the mechanism that allows data science to produce the amazing insights we're all aiming for.
And the better analysts and other stakeholders understand how data engineering works, the better they'll work with engineers and get the results they're after.
Want to learn more? Check out Matt's full presentation on the data engineering lifecycle from our Summer Community Days here.