
OA Book Club Catch Up Vol. 1: Decisions, decisions, decisions from the Fundamentals of Data Engineering

Zachary Pollatsek · December 08, 2022

Zachary is a data scientist with a strong track record of finding business insights through data science and machine learning. Previously, he worked at Traeger Pellet Grills as a test engineer and product design engineer, but recently he's decided to pursue his passion for data science through the Flatiron School. He's based in Salt Lake City, Utah.

When I first joined the Operational Analytics Club, I initially viewed this as simply another means to network with members of the data community.

Instead, I quickly learned there are many opportunities to connect more deeply with folks within this club, including technical workshops, mentorship programs, and, my personal favorite, a book club. I immediately joined the book club to motivate myself to regularly read books on the data world and gain insights from veterans in the data space. 

TitleCase: OA Book Club

Originally, I joined the OA Book Club because its chosen book piqued my interest: Fundamentals of Data Engineering: Plan and Build Robust Data Systems by Joe Reis and Matt Housley. (Fun fact: I'd actually had the opportunity to chat with Joe at a couple of data meetups in the Greater Salt Lake City area.)

This book is a must-read for anyone entering the tech space — whether that's data engineers/scientists looking for a more abstract view of an extremely complex landscape or business leaders who want to better understand the intricacies of the data lifecycle in order to drive strong ROI from the data gathered by their businesses. 

Before we get too deep into the summary of this first part of the book, a note on the overall theme: Focus on the abstract rather than specific tools and technologies. Before reading, I honestly felt lost about which direction to take and what tools I needed to learn before others. 🤷 This book has completely shifted my view on how to understand data engineering; I need to understand the entirety of the data engineering lifecycle without focusing too much on the specific tools. 

This article is the first of a three-part series covering each book club meeting on Fundamentals of Data Engineering in the OA Club (here’s more info if you’d like to join). Part one covers the discussion of chapters 1-3. The two main takeaways I gathered from the first meeting are:

  1. Every decision regarding data needs to be driven by business needs.
  2. Reversible decisions should be prioritized for each step in the data engineering lifecycle.

Let’s get into it! 🤓

Business needs drive decisions on data

At the end of the day, everyone working on data within an organization needs to start by asking themselves: “How does this project/tool/technology create value for my business?” 

There are many metrics you can use to figure out how much value a new project or tool can create, including ROI and opportunity cost. The authors illustrate how a business can instill this question into company culture by discussing their data maturity model. There are three tiers to this model 👇

  1. Starting with Data: These companies are early in building out their data infrastructure and teams. Or, as Housley and Reis write, “Reports or analyses lack formal structure, and most requests for data are ad hoc.” 
  2. Scaling with Data: These companies have formal data practices in place, but they have not established scalable architecture that can shift at a moment’s notice based on business needs or technological disruptions. 
  3. Leading with Data: These companies are truly data-driven. Again, Housley and Reis are spot-on with: “The automated [data] pipelines and systems created by data engineers allow people within the company to do self-service analytics and ML. Introducing new data sources is seamless, and tangible value is derived.”

I truly believe the most successful companies for the foreseeable future will all be data-led companies. Companies that create robust, modular data architectures and base all decisions on driving value for the business will leave their competitors in the dust. 🚀

Business leaders cannot afford to stop thinking about their end users or customers; once a company begins to pursue new technologies or projects simply for their own sake, it has lost sight of the end goal. As the authors write, “Data has value when it’s used for practical purposes. Data that is not consumed or queried is simply inert. Data vanity projects are a major risk for companies… Data projects must be intentional across the lifecycle.”

Vanity data projects, or “passion projects,” while fun, can quickly pull a data team’s focus away from creating value for the organization and meeting end users’ needs. These projects tend to be little more than a massive waste of time and capital.

Anything that a data team undertakes simply to use a new technology or tool without considering how the project will benefit their organization could be considered a vanity data project. An interesting point from the book is that “technology distractions are a more significant danger [for companies that lead with data]. There’s a temptation to pursue expensive hobby projects that don’t deliver value to the business.”

A good representation of how to avoid “passion projects” is The Data Science Hierarchy of Needs (shown below). 👇

Think of AI as the top of a pyramid of needs. AI is great, in theory, but before you get to the "nice-to-haves" you need to master the basics: data literacy, collection, and infrastructure.

If your organization does not have a reliable data infrastructure that can collect, move/store, and transform data, then it is probably best to hold off on creating any ML models or using complex AI. Or, as Housley and Reis put it, “[We have] seen countless data teams get stuck and fall short when they try to jump to ML without building a solid data foundation.”

These teams most likely wanted to jump into machine learning without considering how this would drive meaningful value for their organization. 

Reversible decisions keep your business agile

Reversible decisions — those that can be undone if results are subpar or unpredictable — should be the bread-and-butter of all decisions made within a company. Irreversible decisions leave your organization stuck with the outcomes, and these decisions will often cause significant pain down the road. While this concept seems relatively obvious, in practice it is anything but. 

This principle applies neatly to data architecture decisions. As many of us know, creating data architecture within an organization is full of trade-offs. If each decision can be reversed at a moment’s notice (or close to it), future business risks can be mitigated.

The authors sum up this sentiment well: “Since the stakes attached to each reversible decision (two-way door) are low, organizations can make more decisions, iterating, improving, and collecting data rapidly.” 🚪 Jeff Bezos coined the terms one-way vs. two-way doors to describe how organizations/people can make decisions 👇

  • A one-way door means that you walk through the door, and if you don’t like what you see, you can’t go back through the door.
  • A two-way door, on the other hand, allows you to survey the outcome of going through the door, and if you aren’t satisfied, you can go back through it (i.e., reverse the decision). 🔙

The idea of making reversible decisions goes hand-in-hand with building loosely coupled systems. 🤝 In a tightly coupled system, there are significant interdependencies among components of the system. In other words, if one component of the system fails, the entire system fails. In a loosely coupled system, components work independently from one another. If one component fails, the entire system does not collapse. 

Building loosely coupled architecture was a topic that came up throughout the first book club meeting. Many members noted that one benefit of loosely coupled architecture is that it allows for plug-and-play solutions, enabling modularity and aligning with the principle of making reversible decisions.

In a loosely coupled architecture, if you and your organization realize one tool in your data stack has either become obsolete or no longer suits your business requirements, you can simply replace the tool without significant downtime.

However, if your organization has a tightly coupled architecture, you’ll experience significantly more headaches when trying to replace one specific tool. So, if you and your organization are trying to completely revamp your data stack, it would be wise to identify the aspects of your architecture that are hardest to reverse and replace them first. 
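To make the loose-coupling idea a bit more concrete, here’s a minimal Python sketch (my own illustration, not from the book) of a pipeline that depends on a small WarehouseLoader interface rather than on any specific vendor. The WarehouseLoader, SnowflakeLoader, and BigQueryLoader names are hypothetical stand-ins, with print statements in place of real connector calls.

```python
from typing import Protocol


class WarehouseLoader(Protocol):
    """Any destination that can accept a batch of rows."""

    def load(self, table: str, rows: list[dict]) -> None:
        ...


class SnowflakeLoader:
    """Hypothetical loader; a real one would call the Snowflake connector."""

    def load(self, table: str, rows: list[dict]) -> None:
        print(f"Loading {len(rows)} rows into Snowflake table {table}")


class BigQueryLoader:
    """Hypothetical loader; a real one would call the BigQuery client."""

    def load(self, table: str, rows: list[dict]) -> None:
        print(f"Loading {len(rows)} rows into BigQuery table {table}")


def run_pipeline(loader: WarehouseLoader) -> None:
    # The pipeline depends only on the WarehouseLoader interface,
    # not on any specific vendor, so the destination stays swappable.
    rows = [{"id": 1, "event": "signup"}, {"id": 2, "event": "purchase"}]
    loader.load("events", rows)


if __name__ == "__main__":
    run_pipeline(SnowflakeLoader())    # today's warehouse
    # run_pipeline(BigQueryLoader())   # a reversible, one-line swap later
```

Because the pipeline only knows about the interface, swapping the destination is a one-line change at the call site, which is the kind of two-way door the authors encourage.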

Beyond decisions: Looking ahead at our next learnings

I have learned much more than I could have anticipated just from the first few chapters, and I am thrilled to dive deeper into the subject of data engineering with this group. Again, the key arguments that I have picked up from the authors thus far are: 

  • Business needs ultimately drive all decisions regarding data.
  • Make reversible decisions within your business as much as possible. 

Honestly, I would not have found this book had I not joined the OA club, and I’m ecstatic to be part of a community full of data practitioners who are so willing to offer guidance and advice to someone much earlier in their data career. 

The OA Club and TitleCase have been a blast, and I can’t emphasize enough just how much I’ve gained from this first segment! I can’t wait to see you at the next Book Club meeting. 📚 👀

👉 If you’re interested in joining the TitleCase OA Book Club, head to this link. Parker Rogers, data community advocate at Census, leads it every two weeks for about an hour.
