Best Practices

Analytics engineers own the Modern Data Stack | Census

Sylvain Giuliani · September 01, 2020

Syl is the Head of Growth & Operations at Census. He's a revenue leader and mentor with a decade of experience building go-to-market strategies for developer tools. San Francisco, California, United States

For the first 10 years of my career, the roles on a typical data warehousing project remained static. First, there was the Business Analyst, who spent the majority of their time supporting the business with basic reports and simple analytics. Their desktop would be littered with Excel documents full of complex pivot tables and VLOOKUPs into other spreadsheets. They possessed an intimate understanding of how the business operated and which metrics were required to create the insights, forecasts, and projections that helped leadership make better business decisions.

The second was the Business Intelligence Developer. As data volumes increased and requests from analysts became more formalized, Business Intelligence Developers were responsible for building daily or nightly extract, transform, and load (ETL) processes to pull data automatically and store it in highly structured data warehouses, with metrics, calculations, and mappings built right into the code. They created canned reports to replace some of the more tedious Excel-based reporting defined by the Business Analysts, and they broke down an organization's data silos, ultimately giving the business a single source of truth.

From a technology stack perspective, ETL tools such as SQL Server Integration Services, Informatica, or even raw code would be used to create source-to-target mapping logic to load data into data warehouses, usually built on database technologies such as Oracle, Teradata, SQL Server, or DB2. Tools such as Cognos or SQL Server Reporting Services would then provide enterprise-wide reporting.
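To make the classic nightly job concrete, here is a minimal sketch of ETL with source-to-target mapping logic. It is illustrative only: sqlite3 stands in for both the source system and the warehouse, and all table, column, and value names are invented for the example.

```python
import sqlite3

# Stand-ins for a source operational database and a target data warehouse.
source = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")

# Hypothetical source table with raw operational data.
source.execute("CREATE TABLE orders (id INTEGER, amount_cents INTEGER, region TEXT)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1999, "us-west"), (2, 4500, "emea")],
)

# Hypothetical warehouse fact table with metrics baked into the load.
warehouse.execute(
    "CREATE TABLE fact_orders (order_id INTEGER, amount_usd REAL, region TEXT)"
)

# Extract, then apply the source-to-target mapping: cents become dollars,
# region codes are normalized, before loading into the warehouse.
rows = source.execute("SELECT id, amount_cents, region FROM orders").fetchall()
mapped = [(oid, cents / 100.0, region.upper()) for oid, cents, region in rows]
warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", mapped)

print(warehouse.execute("SELECT * FROM fact_orders").fetchall())
# → [(1, 19.99, 'US-WEST'), (2, 45.0, 'EMEA')]
```

In the era described above, this logic would live inside an SSIS package or Informatica mapping rather than a script, but the shape of the work was the same.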

In the last 5-7 years, however, the business intelligence landscape began to shift as the big data revolution came about. With storage costs shrinking and large cloud-hosted data centers becoming the norm, the amount of data an organization can capture has grown exponentially, thanks to services such as Amazon Redshift, Google BigQuery, and Snowflake.

An Explosion of Data

Data analysis exploded from millions of rows into the billions and trillions. As this shift happened, business intelligence tools also moved to the cloud, giving rise to products such as Looker, Mode, and Power BI. These tools are specifically designed to handle this new volume of data and offer new or expanded analytics capabilities such as:

  • Ad Hoc and Self Service Reporting
  • Slice and Dice
  • Collaboration and Sharing
  • Live and In-memory Data Analysis
  • Integration of a variety of data sources
  • Advanced Visualizations
  • Maps
  • Robust Security
  • Mobile Views
  • Natural Language Q & A Reporting
  • Basic Modeling and Artificial Intelligence

Predictive Analytics evolved into Prescriptive Analytics. Nightly data refreshes can now happen in real time, processing millions of rows every few seconds. Staging databases have become massive, infinitely scaling data lakes. In addition, single-server data warehouses have become massively parallel processing (MPP) warehouses running across multiple clusters and nodes, giving organizations the power to throw a virtually unlimited number of compute cycles at their queries and data processing.

More Data, More Responsibilities

As these changes in technology and capability took hold, the Business Analyst and Business Intelligence Developer could no longer support all the needs driven by them on their own. The Business Analyst role is now supplemented by the Data Scientist.

Data Scientists may not be as well versed in the intricacies of the business, but they bring Ph.D.-level statistical analysis to the table, allowing in-depth, complex analytics previously out of reach for Business Analysts. Data Scientists leverage languages such as Python, R, or SAS to create complex analytical models that drive the business forward.

The Business Intelligence Developer role, meanwhile, has become more fragmented due to the various new steps in a Modern Data Stack. Basic data warehousing principles still apply in this architecture, but the data warehouse is now just one component rather than the sole home for all reporting and data. With a robust and clean data lake, Business Analysts and Data Scientists no longer need to wait on a highly structured data warehouse to answer the C-suite's questions with due dates of yesterday. A new role, the Analytics Engineer, has emerged to bridge the gap between Data Analysts and Data Scientists and the lightning-fast, exploratory analytics they need to create.

Analytics Engineers help maintain flexible but clean data lakes while creating easy-to-process data pipelines that feed downstream data warehouses, automation, or real-time analytics. They have enough technical chops to understand concepts such as source control, software deployment life cycles, and the other nuances of a software development project. At the same time, they have enough business acumen to ensure the correct data points are stored and categorized across every section of the modern data stack, so all teams that need the data can operate without confusion and with autonomy.

Analytics Engineers are well versed in the Swiss Army knife of cloud-based tools. They can extract, transform, load, map, and merge data from source systems using tools such as dbt, Snowpipe, AWS Glue, or plain Python, to name a few. The data can then be loaded into leading-edge MPP data warehouses such as Amazon Redshift or Snowflake for structured and clean reporting.

It may seem as though letting Data Analysts and Data Scientists connect directly to data lakes would eliminate the need for Analytics Engineers or Business Intelligence Developers, but in practice this creates two major problems.

The first is that, without proper care, data lakes are known to become data swamps. Data is stored in massive volume, but without basic architectural and engineering best practices it ends up uncategorized, duplicated, or stored incorrectly, among a myriad of other issues, making it unusable for Analysts and Scientists alike.
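The kind of basic hygiene that keeps a lake from swamping can be automated. Below is a hedged sketch of a data-quality audit an Analytics Engineer might run over a lake table: it flags duplicate keys, missing values, and unrecognized category codes. The records, field names, and valid-region list are all hypothetical.

```python
from collections import Counter

# Hypothetical reference set of valid category codes.
VALID_REGIONS = {"US-WEST", "EMEA", "APAC"}

# Hypothetical raw records landed in the lake, exhibiting typical swamp symptoms.
records = [
    {"id": 1, "region": "US-WEST", "amount": 19.99},
    {"id": 2, "region": "EMEA", "amount": None},     # missing value
    {"id": 2, "region": "EMEA", "amount": 45.00},    # duplicate key
    {"id": 3, "region": "LATAM?", "amount": 12.50},  # unknown category
]

def audit(rows):
    """Return a list of human-readable data-quality issues found in rows."""
    issues = []
    counts = Counter(r["id"] for r in rows)
    issues += [f"duplicate id {i}" for i, n in counts.items() if n > 1]
    issues += [f"missing amount for id {r['id']}" for r in rows if r["amount"] is None]
    issues += [
        f"unknown region {r['region']!r} for id {r['id']}"
        for r in rows
        if r["region"] not in VALID_REGIONS
    ]
    return issues

for issue in audit(records):
    print(issue)
```

In a real stack these checks would run inside the pipeline itself (as tests in dbt, for example) so bad records are caught before Analysts and Scientists ever query them.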

The other issue is that, without an intimate understanding of the data, its correct structures, and its relationships, the recipients of the data risk creating incorrect metrics that do not align with similar metrics built by other business units to answer similar questions. The problem from the early data warehousing days reemerges: siloed shadow IT departments with their own Excel reports, reporting different numbers than adjacent business teams. A single source of truth is no longer achievable when everyone can run wild in an unorganized modern data stack, one that should be owned and facilitated by the Analytics Engineers.
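One common remedy is to define each metric exactly once, in shared code that every team imports, rather than letting each team re-derive it in its own spreadsheet. A minimal sketch, with an invented metric and invented field names:

```python
# Hypothetical shared metrics module: the one place "MRR" is defined,
# so marketing and finance cannot drift apart on the number.
def monthly_recurring_revenue(subscriptions):
    """Sum of monthly fees across active subscriptions only."""
    return sum(s["monthly_fee"] for s in subscriptions if s["status"] == "active")

# Hypothetical subscription records pulled from the warehouse.
subs = [
    {"status": "active", "monthly_fee": 99.0},
    {"status": "churned", "monthly_fee": 49.0},
    {"status": "active", "monthly_fee": 29.0},
]

print(monthly_recurring_revenue(subs))  # → 128.0
```

The same idea shows up in practice as shared dbt models or a metrics layer: the definition lives in version-controlled code owned by the Analytics Engineers, and downstream tools consume it rather than reimplement it.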

What does the future hold?

Enabling Analytics Engineers in turn enables every other player in a well-structured data organization. But an Analytics Engineer, like a good car mechanic, needs the right tools to do the job correctly. This is why you see more and more organizations adopting a modern data stack: by leveraging tools such as dbt, Fivetran, BetterException, and others, the Analytics Engineer has all the power to do just that. If you are interested in building a better data organization or data stack, contact us; we do that for a living!
