
Serious about rETL? You need data observability too

Mei Tao · May 31, 2022

Since the debut of dashboards, there have been broken dashboards – something every data engineer laments. 😩

But neither the modern data platform nor the use cases it powers are as simple as they used to be, so what started as minor inconveniences, or perhaps the occasional email from an annoyed employee in the marketing department, have mutated into serious business disruptions (and a glaring spotlight from the C-suite). 🔦

The data downtime equation

Monte Carlo coined the term “data downtime” to refer to this terrifying period of time when your data is partial, erroneous, missing, or otherwise inaccurate.

The incredibly complex and multifaceted nature of modern data operations virtually necessitates that every organization suffers some period of data downtime. Unfortunately, while it's a quantifiable metric with an impact that can be measured and optimized, downtime is typically addressed in a reactive (costly) manner. 💰

Much like Thanos, data downtime is inevitable

Progressive data teams that have (wisely) implemented reverse ETL have made their data more visible, actionable, and valuable. However, when you make your data more valuable, you also make any data downtime more costly – and when you make your data more visible, bad data erodes trust more quickly.

In this post, you'll learn:

  • What data observability is
  • How data observability reduces data downtime
  • How data observability is different from testing or monitoring
  • Why data observability and reverse ETL are better together

What is data observability?

Data observability is an organization’s ability to fully understand the health of the data across the entire system. Like its DevOps counterpart, data observability uses automated monitoring, alerting, and triaging to identify and evaluate data quality and discoverability issues.

The 5 key pillars of data observability

We’ve broken data observability into five pillars: freshness, distribution, volume, schema, and lineage. These components meld together to provide valuable insight into the quality and reliability of your data. 🤝

  • Freshness: Is the data recent? When was the last time it was generated?
  • Distribution: Is the data within accepted ranges?
  • Volume: Has all the expected data arrived?
  • Schema: What is the schema, and how has it changed? Who has made these changes and for what reasons?
  • Lineage: Where did my data break? Which tables or reports were affected?
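To make a couple of these pillars concrete, here's a minimal sketch of freshness and volume checks against a warehouse table. The table name, the loaded_at column, and the thresholds are hypothetical; a real observability platform derives these checks and their thresholds from historical behavior rather than hard-coding them.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds -- an observability platform would learn these
# from historical behavior instead of hard-coding them.
MAX_STALENESS = timedelta(hours=6)
MIN_EXPECTED_ROWS = 10_000

def check_freshness_and_volume(conn, table: str = "analytics.orders") -> None:
    """Run simple freshness and volume checks against a warehouse table.

    `conn` is any DB-API connection; `loaded_at` is assumed to be a
    timezone-aware UTC timestamp column on the table.
    """
    cur = conn.cursor()

    # Freshness: when was the most recent row loaded?
    cur.execute(f"SELECT MAX(loaded_at) FROM {table}")
    last_loaded = cur.fetchone()[0]
    staleness = datetime.now(timezone.utc) - last_loaded
    if staleness > MAX_STALENESS:
        print(f"[freshness] {table} is stale by {staleness}")

    # Volume: did roughly the expected number of rows arrive today?
    cur.execute(f"SELECT COUNT(*) FROM {table} WHERE loaded_at >= CURRENT_DATE")
    row_count = cur.fetchone()[0]
    if row_count < MIN_EXPECTED_ROWS:
        print(f"[volume] {table} only received {row_count} rows today")
```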

How data observability reduces data downtime

Data observability solutions connect to your existing stack without having to modify pipelines or write new code. They can monitor data at rest (without having to extract it from where it's stored) with no prior mapping and, by properly leveraging machine learning, with minimal configuration.

With proactive monitoring and alerting in place, data issue detection time is drastically reduced. Issues that might have previously taken teams hours, days or even weeks (‼️) to notice are caught and sent to teams in Slack (or another communication tool) within minutes. ⏱️

A data team coordinating on incident triage
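As a rough sketch of that last mile, the snippet below posts an incident alert to a Slack incoming webhook. The webhook URL, message format, and check names are placeholders, not anything prescribed by a particular observability tool.

```python
import requests

# Placeholder Slack incoming-webhook URL for the data team's alert channel.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def send_data_incident_alert(table: str, check: str, details: str) -> None:
    """Post a data-incident alert to Slack so the issue is seen in minutes,
    not discovered weeks later by a frustrated stakeholder."""
    message = (
        ":rotating_light: *Data incident detected*\n"
        f"• Table: `{table}`\n"
        f"• Failed check: {check}\n"
        f"• Details: {details}"
    )
    requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)

# Example: wire it up to the freshness check sketched earlier.
# send_data_incident_alert("analytics.orders", "freshness", "stale by 9 hours")
```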

Shortening the time to detection also naturally shortens the time to resolution, as Choozle Chief Technical Officer Adam Woods explains when discussing how his team reduced data downtime by 88%.

“We see about 2 to 3 real incidents every week of varying severity. Those issues are resolved in an hour whereas before it might take a full day,” said Adam. “When you are alerted closer to the time of the breakage, it’s a quicker cognitive jump to understand what has changed in the environment.”

Data observability also shortens the time to resolution with end-to-end lineage, which monitors data assets and pipelines along the entire data lifecycle and is often used alongside reverse ETL change logs and live debuggers. This pinpoints where the breakage has occurred and greatly accelerates root cause analysis.

Field-level lineage provides insights into how, why, and where data pipelines break
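As a toy illustration of how lineage accelerates impact analysis, the sketch below walks a simplified lineage graph to find every downstream asset affected by a broken table. The graph here is hand-written for the example; observability tools construct it automatically from query logs and metadata.

```python
from collections import deque

# A toy lineage graph: each asset maps to the assets that depend on it.
# In practice this graph is derived automatically, not maintained by hand.
LINEAGE = {
    "raw.stripe_charges": ["analytics.orders"],
    "analytics.orders": ["analytics.revenue_daily", "reverse_etl.salesforce_accounts"],
    "analytics.revenue_daily": ["looker.revenue_dashboard"],
}

def downstream_impact(broken_asset: str) -> set[str]:
    """Return every table, dashboard, or sync downstream of a broken asset."""
    affected, queue = set(), deque([broken_asset])
    while queue:
        for child in LINEAGE.get(queue.popleft(), []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected

# downstream_impact("analytics.orders") returns the daily revenue model,
# the Looker dashboard built on it, and the reverse ETL sync to Salesforce.
```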

So, data observability solutions can reduce time to detection and resolution with active monitoring, rapid triage, and end-to-end lineage, but what about reducing the number of data quality issues in the first place? 🤔

This is where data health analytics come into play. After all, you can only improve what you can measure. Red Ventures Senior Data Scientist Brandon Beidel leverages data observability to set and track data SLAs. In his words:

“The next layer is measuring performance. How well are the systems performing? If there are tons of issues, then maybe we aren’t building our system in an effective way. Or, it could tell us where to optimize our time and resources. Maybe 6 of our 7 warehouses are running smoothly, so let’s take a closer look at the one that isn’t.”

Data observability helps inform where to allocate data quality resources
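As a sketch of what tracking a data SLA like Beidel's might look like, the snippet below computes a pass rate per warehouse from a list of check results and flags any warehouse falling below a target. The check results, warehouse names, and 99% target are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical check results collected over the reporting period:
# (warehouse, check_name, passed)
CHECK_RESULTS = [
    ("warehouse_1", "freshness", True),
    ("warehouse_1", "volume", True),
    ("warehouse_7", "freshness", False),
    ("warehouse_7", "schema", False),
    # ...many more rows in practice
]

SLA_TARGET = 0.99  # e.g. 99% of checks should pass

def sla_report(results) -> None:
    """Print the pass rate per warehouse and flag any below the SLA target."""
    totals, passes = defaultdict(int), defaultdict(int)
    for warehouse, _check, passed in results:
        totals[warehouse] += 1
        passes[warehouse] += passed
    for warehouse, total in totals.items():
        rate = passes[warehouse] / total
        status = "OK" if rate >= SLA_TARGET else "below SLA -- investigate"
        print(f"{warehouse}: {rate:.1%} of checks passing ({status})")

sla_report(CHECK_RESULTS)
```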

Data observability tools can also be used to tag and certify tables, letting data analysts know which ones are safe to use in their reports rather than having to choose between pulling a possibly out-of-date table and pinging a data engineer for the fifth time that day.

How data observability is different from testing or monitoring

Similar to how software engineers use unit tests to identify buggy code before it's pushed to production, data engineers often leverage tests to detect and prevent potential data quality issues from moving further downstream. This approach was (mostly) fine until companies began ingesting so much data that testing alone couldn't cover every potential point of failure.

I’ve encountered countless data teams that suffer consistent data quality issues despite a rigorous testing regime. It’s deflating – and a bad use of your engineers’ time.

The reason even the best testing processes are insufficient is that there are two types of data quality issues: those you can predict (known unknowns) and those you can't (unknown unknowns).

Some teams have hundreds of tests in place to cover most known unknowns, but they don't have an effective way to cover unknown unknowns. Some examples of unknown unknowns covered by data observability include:

  • A Looker dashboard or report that is not updating, and the stale data goes unnoticed for several months—until a business executive goes to access it at the end of the quarter and notices the data is wrong.
  • A small change to your organization’s codebase that causes an API to stop collecting data that powers a critical field in your Tableau dashboard.
  • An accidental change to your JSON schema that turns 50,000 rows into 500,000 overnight.
  • An unintended change to your ETL, ELT, or reverse ETL pipeline that causes some tests not to run, leading to data quality issues that go unnoticed for a few days.
  • A test that has been a part of your pipelines for years but has not been updated recently to reflect the current business logic.

In a Medium article, Vimeo Senior Data Engineer Gilboa Reif describes how using data observability and dimension monitors at scale helps address the unknown unknowns gap that open source and transformation tools leave open. “For example, if the null percentage on a certain column is anomalous, this might be a proxy of a deeper issue that is more difficult to anticipate and test.”
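As a hedged illustration of the kind of dimension monitor Reif describes, here is a minimal null-rate anomaly check based on a simple z-score over recent history. A fixed z-score cutoff is a simplification; production monitors learn their thresholds rather than hard-coding them.

```python
from statistics import mean, stdev

def null_rate_is_anomalous(history: list[float], today: float,
                           z_threshold: float = 3.0) -> bool:
    """Flag today's null percentage if it deviates sharply from recent history.

    `history` is the column's daily null rate over, say, the last 30 days.
    """
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > z_threshold

# A column that normally has ~2% nulls suddenly jumps to 30%:
# null_rate_is_anomalous([0.02, 0.018, 0.022, 0.019, 0.021], 0.30) -> True
```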

This chart shows a sample Vimeo use case where, prior to an incident triggered by Monte Carlo, 60 percent of users are onboarded from the web, 30 percent through iOS, and 10 percent through Android. Monte Carlo triggered an alert when the percentage of users onboarding from iOS increased to 65 percent at the expense of the web, which decreased to 25 percent. Image originally published on Medium.
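A distribution check in the same spirit might look like the sketch below, which flags the scenario described above when any platform's share of onboarded users drifts far from its historical baseline. The baseline shares and drift threshold are illustrative, not how Monte Carlo actually computes its alerts.

```python
# Historical baseline share of onboarded users by platform (illustrative).
EXPECTED_SHARE = {"web": 0.60, "ios": 0.30, "android": 0.10}

def distribution_shifted(observed_counts: dict[str, int],
                         threshold: float = 0.15) -> bool:
    """Return True when any platform's observed share drifts from its baseline."""
    total = sum(observed_counts.values())
    shifted = False
    for platform, expected in EXPECTED_SHARE.items():
        observed = observed_counts.get(platform, 0) / total
        if abs(observed - expected) > threshold:
            print(f"[distribution] {platform}: {observed:.0%} vs expected {expected:.0%}")
            shifted = True
    return shifted

# The scenario above: iOS jumps to 65% while web drops to 25%.
# distribution_shifted({"web": 2500, "ios": 6500, "android": 1000}) -> True
```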

Choozle CTO Adam Woods says data observability gives his team a deeper insight than manual testing or monitoring could provide. “Without a [data observability tool], we might have monitoring coverage on final resulting tables, but that can hide a lot of issues. You might not see something pertaining to a small fraction of the tens of thousands of campaigns in that table, but the [customer] running that campaign is going to see it. With [data observability] we are at a level where we don’t have to compromise. We can have alerting on all of our 3,500 tables.”

In short, data observability is different from, and often more comprehensive than, testing because it provides end-to-end coverage, scales across your stack, and includes lineage that helps with impact analysis. If that's not enough, it's also more proactive than monitoring because it takes the next step with incident triage and impact analysis.

Why data observability and reverse ETL are better together

The two components of risk are the likelihood of an event and its severity.

The promise of reverse ETL is to make data more actionable within the tools and systems that different departments across the organization rely on in their day-to-day workflows.

In some cases, the data may be used as part of sophisticated automation, such as offering a coupon to website visitors whose profile indicates they are less likely to purchase. These use cases often surface product information to operations teams at product-led companies, a practice known as Operational Analytics.

Because these use cases hold tremendous value to the company, data downtime becomes a more severe event. The same holds true as data directly informs and fuels external-facing processes: it would completely change the tone of a customer interaction if the customer's actual usage were only half of what the CRM/CS tool says.

Can you imagine if companies operated with the same reactive, ad-hoc attitudes toward their website or product reliability that many data teams still do regarding their data quality? 😬

The likelihood of data downtime has also increased alongside the volume of data and the increasingly complex underlying infrastructure. Simply put, more data and more moving parts increase the chances something goes awry.

I want to be clear that the benefits of reverse ETL far outweigh any potential risk. Let’s face it: Data is at its best when it is unchained from dashboards and actionable – and it’s also at its best when it can be trusted.

Just as modern braking systems allowed cars to operate safely at higher speeds, data observability can do the same for data teams. To foster a culture of innovation and data-driven excellence, teams need to move boldly, knowing they will be the first to know of any issues.

As we embark on an exciting journey to elevate data from the back office to optimize business processes, let's also be sure our next step is to proportionally elevate our approach to data quality through data observability.

Interested in getting started with data observability? Reach out to the Monte Carlo team. Have thoughts on how data observability pairs with reverse ETL? Share them in The Operational Analytics Club.
