OA Book Club round up: A data person by any other name… and other data team management lessons | Census

Zachary Pollatsek | April 14, 2023

Zachary is a data scientist with a strong track record in finding business insights through data science and machine learning. Previously, he worked at Traeger Pellet Grills as a test engineer and product design engineer, but recently, he's decided to pursue his passion for data science through the Flatiron School. Salt Lake City, Utah, United States

What if I told you that just seven years ago, only 15% of businesses actually deployed their big data projects into production? How is that possible? 🤯

Big Data was the biggest trend of the 2010s, yet only a fraction of the companies investing in the technology saw any returns. According to a Gartner press release from the end of 2016, just 15% of businesses had deployed their big data projects to production, barely up from 14% the year before. While there have been significant advancements in both technology and domain knowledge since then (cloud, distributed systems, etc.), most companies still fail to find success with data-driven insights.

The problem? More often than not, management doesn’t truly understand the complexities of becoming a data-driven organization, or how to make the most of their data teams. 

In this next edition of the Operational Analytics Book Club catch-up series, we're covering the perfect book to address this issue: Data Teams: A Unified Management Model for Successful Data-Focused Teams by Jesse Anderson. This post is the first half of a two-part series covering the book, along with the book club discussion of its first two parts: Introducing Data Teams and Building Your Data Team.

Interested in joining a wonderful community of data professionals? Head over to this link to check out the Operational Analytics Club. The book club is simply one portion of this fantastic club; there are also workshops, mentorship opportunities, and much more. 

Alright, let’s dive into the highlight lessons from the first two parts of our book. 👇

I want to preface this by saying that the book does not cover how to implement the latest and greatest data technologies (distributed systems, ETL and reverse ETL tools, orchestration tools, etc.). Jesse Anderson has worked with countless management teams to determine what works and what leads to failure within a data team, and he wrote this book as a guide to help management better understand the pain points and sources of friction among data practitioners.

This book specifically targets C-suite members, upper management, and data team managers who want to expand their knowledge of the data space and better connect/succeed with their technical team members. So if you’re one of those folks (or hope to be someday soon), this book is for you. 🙌

After reading through the first two parts of this book, I can bucket the takeaways into two themes: 

  1. It’s all just data, right? Wrong.
  2. The importance of a seasoned Data Engineer cannot be overstated.

Below, we'll break down each of these in more detail and explain how you can apply them to managing in the data world.

It’s all just data, right? Wrong.

When upper management wants to elevate their business to the next level with data analytics, AI/ML, and data science, the majority of them are simply thinking, “Let’s take advantage of our data.”

But how much data?

Is the data considered “big data” or “small data”?

Will distributed systems come into play?

These are questions that won't occur to anyone unfamiliar with the current data landscape; one of the author's key arguments is that management needs (at minimum) a baseline understanding of what constitutes "big data."

In our book club discussion, Jesse mentioned that he coined Jesse’s law: You can’t build a simple distributed system.

A distributed system might split a single task across multiple machines, or store data across multiple computers. When you add distributed systems to your data pipelines, complexity increases 10-15x, which underscores why management must understand how much data they actually have before choosing an architecture.

Another major disconnect that occurs between management and the technical members actually performing the software/data engineering and data science is that management views each data skill set as interchangeable with the next.

“We just need to hire a data scientist and then we’ll begin to drive value from our data,” is a common thought when the C-suite asks their VPs or directors to acquire a “data person.” Jesse clearly delineates the structure of a data team that will optimize an organization’s odds of success; three teams are necessary to maximize the ROI for putting time and capital into data. Here are the one-sentence definitions that Jesse gives in his book:

  1. Data Science: “A data scientist is someone who has augmented their math and statistics background with programming to analyze data and create applied mathematical models.” 
  2. Data Engineering: “A data engineer is someone who has specialized their skills in creating software solutions around big data.”
  3. Operations: “An operations engineer is someone with an operational or systems engineering background who has specialized their skills in big data operations, understands data, and has learned some programming.”

As Jesse points out early on in Data Teams: “From a management’s 30,000-foot view – and this is where management creates the problem – they all look like the same thing. They all transform data, they all program, and so on… This misunderstanding is what really kills teams and projects.” 

The idea that all data professionals are created equal has led many managers to select a data scientist as their first data hire. But this choice never unfolds in the direction managers expect.

Data scientists, by nature, are interested in advanced math, statistics, and the creation of machine-learning models. These individuals rarely have the software engineering experience to ingest the data needed for model creation, organize the data in a logical schema, or automate ingesting, cleaning, and transforming data.

Making a data scientist the first data hire often leads to two key issues:

  1. The data scientist begins spending all of their time cleaning, organizing, and moving data when they would prefer to apply statistics and math to gain insights from the data
  2. Management feels frustrated with the lack of/slow progress achieved by the “data team”

The cartoon below perfectly illustrates how I view a data team comprised of a single data scientist:

The importance of a seasoned Data Engineer cannot be overstated

If you’re looking for one key argument to latch onto from these first two parts of the book, here it is: Businesses will always require data engineers, as ETL tools will never scale to replace these individuals. 

Having a seasoned, veteran data engineer in the infancy of a business will set the organization up for success. Data engineers possess the software engineering skills necessary to build robust data pipelines that won’t break or collapse as soon as something “weird” enters the pipeline. Given their proficiency with software engineering practices, data engineers can create products with well-built and organized code rather than code that routinely breaks due to poor design and programming logic.
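To make that concrete, here's a minimal sketch (the field names and business rule are hypothetical, not from the book) of the kind of defensive validation a data engineer builds into an ingest step so that a "weird" record gets quarantined instead of crashing every downstream job:

```python
# Hypothetical ingest step: validate records up front and quarantine
# bad rows rather than letting them break the pipeline downstream.

def validate_record(record: dict) -> bool:
    """Return True if the record has the fields downstream steps rely on."""
    required = {"order_id": str, "amount": (int, float)}
    for field, expected_type in required.items():
        if field not in record or not isinstance(record[field], expected_type):
            return False
    return record["amount"] >= 0  # example business rule: no negative amounts

def ingest(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split incoming rows into clean rows and quarantined rows."""
    clean, quarantined = [], []
    for r in records:
        (clean if validate_record(r) else quarantined).append(r)
    return clean, quarantined
```

The design choice here is the point: anticipating malformed input and routing it somewhere observable is exactly the software engineering habit that separates a robust pipeline from one held together with duct tape.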

But if a company makes the mistake of onboarding data engineers too late, the data engineers cannot save poorly-built data products. As Jesse explains, data teams that have DBAs or SQL-focused developers building data products without the collaboration or mentorship of data engineers will create “‘production’ system[s] held together with more duct tape and hope than anything else.”

So, what sets data engineers apart from their data scientist and analyst counterparts? Here are just a few key skills that businesses should look for in a data engineer.

Distributed systems

If your business truly needs to analyze and gain insight from “big data,” you need distributed systems. Data engineers should at least have an intermediate-level ability with distributed systems since they're fairly complex (see Jesse’s Law). Distributed systems require knowledge of resource allocation, network bandwidth requirements, virtual machine/container creation, proper dataset partitioning, and failure resolution.
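As one small illustration of the "proper dataset partitioning" skill above, here's a hedged sketch (the key and worker count are hypothetical) of hash-partitioning records across workers so that every record with the same key always lands on the same worker:

```python
# Hypothetical example of hash partitioning: the kind of decision a
# data engineer makes when a job no longer fits on one machine.
import hashlib

def partition(records: list[dict], key: str, num_workers: int) -> list[list[dict]]:
    """Assign each record to a worker shard by hashing its partition key,
    so identical keys are always processed together."""
    shards = [[] for _ in range(num_workers)]
    for r in records:
        digest = hashlib.md5(str(r[key]).encode()).hexdigest()
        shards[int(digest, 16) % num_workers].append(r)
    return shards
```

Even this toy version hints at the real complexity: choosing a key that avoids skewed shards, and handling what happens when a worker holding a shard fails.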

Advanced programming

Many data scientists and analysts possess programming skills, but data engineers possess the skills to take programs into production. Jesse clearly explains that advanced programming is not solely understanding a language’s syntax, but also understanding how to perform continuous integration, unit testing, and best software engineering practices.
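To show what that looks like in practice, here's a minimal sketch of a unit test a data engineer might write so a transformation can be wired into continuous integration (the transformation itself, normalizing country codes, is a made-up example, not from the book):

```python
# Hypothetical transformation plus the unit test that lets CI catch
# regressions before they reach production.

def normalize_country(code: str) -> str:
    """Trim and uppercase a country code; remap a known legacy value."""
    cleaned = code.strip().upper()
    return {"UK": "GB"}.get(cleaned, cleaned)

def test_normalize_country():
    assert normalize_country(" us ") == "US"
    assert normalize_country("uk") == "GB"  # legacy value remapped
    assert normalize_country("DE") == "DE"
```

Run under a test runner such as pytest on every commit, a suite of tests like this is what turns "code that works on my laptop" into a production-grade data product.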

Domain knowledge

While data engineers possess a range of technical skills, they must also grasp how those skills translate to business success. Data engineers must understand the domain in which their company operates in order to design robust data products that effectively solve their customers' issues and drive business growth. The notion that all technical team members must understand business goals at a high level echoes Fundamentals of Data Engineering (our book club's previous read).

Now that you know a few key skills required of data engineers, you can clearly see that data engineers are not data scientists. Oftentimes, management naively views data engineers as movers of data from point A to point B; this belief drastically downplays their expertise and skills. For a simple depiction of the difference between data engineers and scientists, Jesse gives a perfect visual.


What do managers need to understand? “Data” and the unique data practitioners within their business.

I found the first two parts of Data Teams extremely enlightening. While the book targets leaders in the data space, many of Jesse's ideas and concepts translate to any industry: If management doesn't understand their employees' struggles and pain points, distrust and frustration will slowly bubble to the surface. I'm looking forward to diving into Part 3, which digs into the meat of the book: managing data teams.

In reading the first portion of this book, I can safely say managers and leaders have nothing to lose by reading this book. I would not have found this book without the Operational Analytics Club; I hope to see you for our next Book Club meeting!

👉 If you’re interested in joining the OA Book Club, head to this link. Parker Rogers, data community advocate at Census, leads it every two weeks for about an hour.
