
Final OA Book Club wrap-up: Author AMA highlights & last lessons on data engineering fundamentals

Zachary Pollatsek
January 12, 2023

Zachary is a data scientist with a strong track record of finding business insights through data science and machine learning. He previously worked at Traeger Pellet Grills as a test engineer and product design engineer, and recently decided to pursue his passion for data science through the Flatiron School. He's based in Salt Lake City, Utah.

Welcome to the third and final volume of the OA Book Club catch-up series on The Fundamentals of Data Engineering by Joe Reis and Matt Housley! If you missed the first or second volume, you can find those here (Volume 1) and here (Volume 2).

To recap: Part one focused on the decision frameworks covered in the first section of reading for our group. Part two covered more intricate details from the data engineering lifecycle, including storage. This third installment discusses the Ask Me Anything session with one of the authors, Joe Reis, and wraps up our discussion on a fantastic read. 

As our book club meetings conclude, I’m still astounded by the sheer amount of information covered throughout the book. For anyone who has followed this catch-up series or just recently stumbled upon it, the main takeaway from this book should be: Focus on the abstract rather than concrete tools and technologies.

How can we interpret this theme? Business needs must be the driving factor when selecting each component of the data engineering lifecycle. While the authors went into extensive detail on the technologies regarding each step of the lifecycle, they continuously returned to the idea that the business and its data needs should drive the selection of specific technologies. 

Moving on to the Ask Me Anything session with Joe Reis, here are the three key points I took from his answers 👇

  1. Companies interested in using data/having data teams must build a culture of open communication and accepting mistakes.
  2. Data teams must find a way to drive growth for their organizations rather than simply becoming a cost sink.
  3. Analytics that recommend future actions, rather than simply describing the past, will become more mainstream within organizations. 

As for the final points regarding the book, I’ve consolidated our discussion takeaways into three main points as well:

  1. Read this book if you’re new to the data world or looking for a refresher as a veteran data practitioner.
  2. As an organization, determine exactly what you need to learn from your data prior to jumping in headfirst.
  3. Focus on the abstract data needs of your organization prior to selecting relevant technologies or tools. 

Below, I’ll dive a little deeper into these key takeaways and expand a bit on why the OA Book Club was so valuable. 🤓

Three key takeaways from the AMA with Joe Reis for data folks of all levels

To help us all get the most out of the reading, author Joe Reis joined us for an AMA during this last section of the discussion. He shared a ton of great insight around data engineering best practices, but three points in particular stood out to me: 

  1. Focus on building a culture of open communication and accepting mistakes 
  2. Drive business growth with your data team (and share your wins) 
  3. Prescriptive analytics are on the verge of dethroning descriptive analytics

Here’s a bit more detail about each of these three points. 👇

1. Build a culture of open communication and accepting mistakes

Joe discussed the importance of fostering a culture within businesses that values open communication and accepting mistakes. 👐

He explained that he has seen numerous examples of data teams being punished or ostracized for mistakes or errors that occurred somewhere along the data engineering lifecycle. That kind of punishment can lead to an “us vs. them” mentality between data teams and the rest of the organization, spurring distrust throughout a company and hindering forward progress on data projects. 

Open communication between stakeholders and data teams is necessary to align everyone on company goals for analysis and reports. The ability to quickly accept mistakes, learn from them, and move on separates healthy company cultures from less-collaborative ones. 

When Joe discussed fostering this type of culture, my mind immediately went to Ray Dalio’s notion of “radical transparency.” Radical transparency ensures that issues and mistakes are surfaced and tackled immediately rather than swept under the rug. 

2. Drive business growth with your data team (and share your wins) 

Moving on to the second takeaway from Joe’s AMA: Data teams should strive to promote growth for the organization (revenue, improved analytics, etc.) rather than becoming another cost sink. 🌱

When there’s little to no visibility into the day-to-day work of data teams, it’s easy for management to overlook that work and see only the costs of building data infrastructure. Data teams should instead promote their own successes and convey specific financial metrics (such as ROI) to management to showcase the value they create. 

Beyond communicating your wins to leadership, data teams also need to understand the costs associated with different parts of the data engineering lifecycle so they can identify where to trim expenses. This combination of communicating your wins and advocating for cost savings makes it easier to demonstrate the value of the data team and its work. 
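To make the “communicate your value” point concrete, here’s a minimal back-of-the-envelope sketch of the kind of ROI math a data team might share with leadership. All of the figures and line items below are illustrative assumptions, not numbers from the book or the AMA.

```python
def roi(value_delivered: float, total_cost: float) -> float:
    """Return ROI as a fraction: (value delivered - cost) / cost."""
    return (value_delivered - total_cost) / total_cost

# Hypothetical monthly figures for a churn-reduction project (USD)
warehouse_spend = 4_000      # warehouse compute
tooling_spend = 1_500        # orchestration + reverse ETL tooling
headcount_share = 12_000     # share of team salaries spent on the project
retained_revenue = 45_000    # revenue retained by reducing churn

total_cost = warehouse_spend + tooling_spend + headcount_share
print(f"Project ROI: {roi(retained_revenue, total_cost):.0%}")  # -> "Project ROI: 157%"
```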

3. Prescriptive analytics are on the verge of dethroning descriptive analytics

Perhaps the most exciting portion of the AMA with Joe was his breakdown of the ongoing shift from descriptive analytics to prescriptive analytics. 

Today, the majority of analytics in business are descriptive; in other words, these analytics describe something that has already happened. An example of this could be a report that displays customer churn rate over the past 5 years. 

Prescriptive analytics, on the other hand, prescribe actions to take in the future; these types of analytics are still in their infancy as they rely on machine learning and artificial intelligence. Prescriptive analytics aim to answer questions like, “Which marketing campaign should we run next quarter to maximize revenue for a new product launch?” 🤔
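To make the contrast tangible, here’s a minimal sketch in Python. The data, column names, and the toy decision rule are all hypothetical; a real prescriptive system would rely on a trained ML model rather than a simple threshold.

```python
import pandas as pd

# Hypothetical churn data; values are made up for illustration.
history = pd.DataFrame({
    "year": [2019, 2020, 2021, 2022, 2023],
    "customers_start": [1000, 1100, 1250, 1400, 1500],
    "customers_churned": [120, 121, 150, 154, 180],
})

# Descriptive analytics: report what already happened (churn rate by year).
history["churn_rate"] = history["customers_churned"] / history["customers_start"]
print(history[["year", "churn_rate"]])

# Prescriptive analytics: recommend what to do next. A trivial rule stands in
# for the ML model a real system would use to score candidate campaigns.
latest = history.iloc[-1]
if latest["churn_rate"] > history["churn_rate"].mean():
    print("Recommendation: prioritize a retention campaign next quarter")
else:
    print("Recommendation: prioritize the new-product launch campaign")
```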

That question is just one example of the future of analytics, and Joe says these prescriptive approaches are still very early, as ML and AI technologies are rapidly evolving. However, while we wait for ML and AI tech to catch up to our data dreams, we can begin to shift our analytics strategies from descriptive to prescriptive using tools like reverse ETL to power downstream marketing and sales platforms. 
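Here’s a rough sketch of what that reverse ETL hand-off could look like: read a model’s recommendation out of the warehouse and push it into a downstream marketing tool. The table name, API endpoint, and token are hypothetical placeholders, not Census’s actual API or any specific vendor’s.

```python
import requests

def fetch_recommendations(conn):
    """Read prescriptive model output (next-best campaign per customer) from the warehouse."""
    # Assumes a DB-API-style connection; the table and columns are hypothetical.
    cursor = conn.cursor()
    cursor.execute(
        "SELECT customer_id, recommended_campaign FROM analytics.next_best_campaign"
    )
    return cursor.fetchall()

def push_to_marketing_tool(rows, api_token):
    """Send each recommendation to a (hypothetical) marketing platform API."""
    for customer_id, campaign in rows:
        requests.post(
            "https://api.example-marketing-tool.com/v1/audiences",
            headers={"Authorization": f"Bearer {api_token}"},
            json={"customer_id": customer_id, "campaign": campaign},
            timeout=10,
        )
```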

Three final learnings from the Fundamentals of Data Engineering: Everyone should read this book & other takeaways

In case you haven’t gathered, there’s a ton of useful information in this book for data professionals of all seasons and stages (even if you’re brand new to the field or adjacent to data engineering). Now that we’ve wrapped up the reading, here are my top takeaways: 👀

1. Seasoned veteran or data newbie? You should read this book

This book has absolutely surpassed my expectations. I was excited to read it with the OA book club, but after finishing it, I view the book as a must-read for anyone remotely connected to the data space. When I say anyone, I truly mean anyone; with data becoming the currency of the future for businesses, everyone will benefit from reading this book (even non-data engineers). 

The Fundamentals of Data Engineering dives into every aspect of the data engineering lifecycle with amazing insights from the authors. As a relatively new member of the data world, this book has given me a rock-solid foundation on the intricacies of data generation, storage, transformation, and so on.

2. Discuss business needs/wants prior to building out your data infrastructure

This second point goes hand in hand with the main overarching theme throughout the book: Focus on what value/information you want to gain prior to selecting any specific components of your infrastructure. 🤝

Members from across your organization, both technical and non-technical, should help determine each component of the infrastructure. Stakeholders and data engineers must decide how the business can benefit from its data, and in what ways it will be used, prior to building out the infrastructure. 

This idea connects well to a key point from our first discussion: Make reversible decisions whenever possible. Reversible decisions let your organization move quickly without fear of lasting consequences. If your company favors reversible decisions, it will be a piece of cake to adjust your data pipeline as business needs morph or pivot. 

3. Keep it abstract; specific tools and technologies come second

Last but not least, I’ll wrap up our final installment of the catch-up series with the main theme of the book: Abstract business needs drive which tools and technologies are selected. 🏎️

In today’s work environment, it’s easy to get roped into the newest hot tech without considering what features of this technology truly augment your current data stack. When building out a new data stack, or adding a new component to your existing stack, always determine what value you need for your business prior to selecting anything. Countless companies have fallen into the trap of selecting technologies that don’t fit their business model. Keep it abstract.

Wrapping up: Joe’s words of wisdom & how to join the OA Book Club

Thanks, everyone, for coming along on this book club recap journey with me! I have learned so much from this book, and I plan to reread a few chapters to ensure I didn’t miss anything. Hearing Joe discuss building a healthy culture, driving growth with data, and the rise of prescriptive analytics during his AMA was fantastic. The main takeaways from Fundamentals of Data Engineering as a whole are: 

  • Wherever you are in your data journey, read this book.
  • Discuss your business needs prior to building anything.
  • Keep it abstract. 

If you’re interested in joining the OA Book Club, head here. Parker Rogers, data community advocate at Census, leads a discussion every two weeks for about an hour and will be launching the next iteration of the book club soon. It’s incredible, and I can’t emphasize enough just how much I’ve gained from my time in the club after just one book. I hope to see you in our next Book Club call! 📚
