
Why dbt is disrupting ETL & ELT

Sylvain Giuliani · February 02, 2021

Syl is the Head of Growth & Operations at Census. He's a revenue leader and mentor with a decade of experience building go-to-market strategies for developer tools.

Every day, petabytes upon petabytes of data are collected, operated on, and stored for a vast range of analytical purposes all across the world. Without pipelines to move this data and put it to work, large-scale data science simply wouldn't be possible. Traditionally, one of two processes, dubbed ETL and ELT, was used to grab large amounts of data, pick apart the bits that mattered, and load the results into a data lake or data store. However, both of these pipelines have their drawbacks, and today - as the world becomes ever more dependent on analytics and real-time data - ETL and ELT on their own simply aren't the sharpest tools in the shed anymore.

In this article, I'm going to compare ETL and ELT, summarize how each works and how they've conventionally been used, and explain why many data leaders think these pipelines are no longer enough on their own.

ELT, ETL, What’s the difference?

ELT stands for Extract, Load, Transform, while its partner ETL similarly stands for Extract, Transform, Load. These three steps sit at the core of almost every data pipeline. Whether you realize it or not, they're used in millions of applications all across the globe. Every time you purchase an item from your nearby grocery store, your transaction, whether anonymous or identified, gets shuffled down one of these pipelines for financial and marketing analytics. Let's see how ELT stacks up against ETL.

If we're collecting data from various sources - such as multiple stores across a country or, for a more scientific example, many different instruments at different points along a dam - we need to gather all this data together and then take only the parts that matter for the analytics we want to create. That could be the net sales from each store, in which case you'd need to normalize and total up all the transactions. Or, in the dam example, it could be all the water pressure readings that need to be collated. This is the transform process, and it is essential for creating analytics. Specifically, it is what allows us to use business intelligence tools like Tableau or Periscope.
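To make that transform step concrete, here's a minimal sketch of what totaling up net sales per store might look like in plain SQL. The table and column names (raw_transactions, store_id, amount, transaction_ts) are assumptions made up for this example.

    -- Hypothetical transform: normalize raw transactions into net sales per store per day.
    -- Table and column names are illustrative only.
    SELECT
        store_id,
        DATE(transaction_ts) AS sales_date,
        SUM(amount)          AS net_sales,
        COUNT(*)             AS transaction_count
    FROM raw_transactions
    GROUP BY store_id, DATE(transaction_ts);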

When we're using ELT - that is, Extract, Load, Transform - our intention is to spare our source systems from performing all these computationally expensive operations by doing them in the data warehouse instead. Instead of using the grocery store's computer to total up the transactions, we send the raw data to a data lake or warehouse, and only then do we run the transform stage to get the overall profit - adding up transactions and subtracting costs, for example.
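As a rough sketch of that ordering, imagine the raw rows have already been loaded into the warehouse untouched, and the transform then runs inside the warehouse itself. The table names (raw_transactions, raw_costs, store_profit) are again hypothetical.

    -- ELT sketch: the raw data is already sitting in the warehouse;
    -- this transform step runs there, after the load.
    CREATE TABLE store_profit AS
    WITH sales AS (
        SELECT store_id, SUM(amount) AS total_sales
        FROM raw_transactions
        GROUP BY store_id
    ),
    costs AS (
        SELECT store_id, SUM(cost) AS total_costs
        FROM raw_costs
        GROUP BY store_id
    )
    SELECT
        s.store_id,
        s.total_sales - COALESCE(c.total_costs, 0) AS profit
    FROM sales AS s
    LEFT JOIN costs AS c ON c.store_id = s.store_id;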

ELT is great for large amounts of data where we're just doing simple calculations, as in the grocery store example. We can extract the data from all our sources (e.g., the card readers), load it into our data storage, and then transform it so we can easily conduct analytics.

On the other hand, you have ETL. This works better with the dam example. We don't necessarily collect a lot of data in total, but there are many different kinds of readings, and it's likely you'll want to perform lots of calculations on them for insightful analytics. Preferably, we also want this data in real time, so we can prevent any floods! In this case, Extract, Transform, Load is better suited. Instead of sending our raw data to storage and then doing our operations, we perform them as the data is being sent to storage, in what are called "transformation stages". This way we can establish a continuous stream of data that is processed before it is even loaded into the data storage!

These two analogies highlight the advantages and disadvantages of ETL vs. ELT. ELT is great when you have a ton of data, such as hundreds of transactions, but you only want to perform some relatively simple operations, like calculating profit or mapping sales to time of day.

Meanwhile, ETL works better for real-time cases where we don't have tons of data, but we do have lots of specialized data that needs to be sorted properly, and therefore more calculations are needed.

Tools for ETL and ELT

Luckily for the modern world, we don’t have to do a ton of programming to create a streamlined pipeline for our data anymore! There are many ETL and ELT tools that allow us to perform these functions, from a wide variety of data sources to an extensive range of data warehouses or machines.

For example, the Hevo no-code data pipeline is popular among retailers and other physical businesses that want to collect sales data or activity information about their stores. It's also great for real-time data, so if you want to measure the foot traffic outside your storefront and map it over time, you can use Hevo for that too!

There's also Fivetran, which is built around pre-built connectors for a "plug and play" experience.

dbt - a better approach!

ELT and ETL might sound like perfectly reasonable methods for getting data from A to B for analytics, but they're actually both pretty inconvenient on their own. With ELT and ETL, you have to know exactly what analytics you want to create before loading your data. Luckily, the extract and load steps are quite trivial with modern tools like Fivetran, Airflow, and Stitch, and cloud warehouses like BigQuery, Snowflake, and Redshift.

Even then, the hard part remains the transform layer. The transform layer is a crucial element in your data pipeline, but if it's the bottleneck keeping you from the most relevant insights, there must be a better way.

With a more advanced approach, however, we can open up our options and create many different kinds of analytics without having to resend data through the pipeline and redo the transformations each time.

It's called dbt, or data build tool, and it's a super flexible command-line tool that lets us transform and model data for analytics really fast and really easily. There's no need with dbt to completely reprogram your pipeline.

dbt is still built on SQL, like conventional databases, but it has additional functionality layered on top using templating engines like Jinja. This effectively allows you to bring more logic (loops, functions, and so on) into your SQL to access, rearrange, and organize your data. It's kind of like programming your dataset, but with much more flexibility and many more options.
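To give a feel for what this looks like, here's a minimal sketch of a dbt model that pivots payments by payment method. The source model name (stg_payments) and the list of payment methods are assumptions for this example; the ref() function and the Jinja loop are standard dbt constructs.

    -- models/order_payments.sql (hypothetical dbt model)
    -- The Jinja loop generates one aggregated column per payment method,
    -- so adding a new method is a one-line change rather than more hand-written SQL.
    {% set payment_methods = ['credit_card', 'gift_card', 'bank_transfer'] %}
    select
        order_id,
        {% for method in payment_methods %}
        sum(case when payment_method = '{{ method }}' then amount else 0 end)
            as {{ method }}_amount
            {% if not loop.last %},{% endif %}
        {% endfor %}
    from {{ ref('stg_payments') }}
    group by order_id

When dbt compiles this model, the loop expands into plain SQL and the ref() call resolves to the actual table built by the upstream model, which also gives dbt the dependency graph it uses to run models in the right order.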

With this code, you can then use dbt's run command to compile your models and execute them against your data, producing exactly the transformations you're looking for. Models can also be quickly written, tested, and modified without huge waiting times while they churn through all your data, meaning you can ship new, better versions on a tight schedule.
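As a rough illustration of how you can iterate without reprocessing everything, the sketch below uses dbt's incremental materialization so a re-run only touches new rows rather than the full history. The model and column names (stg_events, event_ts) are hypothetical.

    -- models/daily_events.sql (hypothetical incremental dbt model)
    {{ config(materialized='incremental', unique_key='event_date') }}
    select
        date(event_ts) as event_date,
        count(*)       as event_count
    from {{ ref('stg_events') }}
    {% if is_incremental() %}
      -- On incremental runs, only process days newer than what's already built.
      where date(event_ts) > (select max(event_date) from {{ this }})
    {% endif %}
    group by date(event_ts)

Running dbt run builds the model, and dbt test checks any tests you've defined against it, so the edit-run-test loop stays fast even as the underlying data grows.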

dbt does not entirely replace ELT or ETL, but it does allow for significantly more flexibility - it supercharges your "T"ransform layer. With dbt, you can aggregate, normalize, and sort the data again and again, however you like, without constantly updating your pipeline and resending data through it.

dbt isn't a replacement for ETL or ELT, but these pipeline methods on their own are becoming obsolete as modern tooling takes their place. Whether you follow ETL or ELT, one thing is for sure: dbt is a big improvement for the T(ransform) layer in every way you can think of.

So what’s next?

I encourage you to take dbt for a spin. You can start with the quick start guide here and join their super helpful community here. Trust me, when you start using dbt, you will wonder how you did any data modeling work before.

And after that? Check out our reverse ETL dbt integration to turn your models into reality inside all of your operational tools.

Do more with your dbt models than just reporting and start syncing for free today with Census.
