
Analytics Engineering with dbt: A FinTech application

Carlo Scalisi, October 04, 2022

Carlo is a Senior Data Analyst at N26, a European challenger bank. He has extensive experience in consumer tech and is passionate about analytics engineering, clean code, and his home country of Italy.

A new data transformation tool has recently seen remarkable adoption among tech companies large and small, thanks to its versatility and powerful data engineering features. dbt lets users set up complex, production-grade data pipelines using SELECT statements, while the software takes care of the admin work (e.g., permission grants and table clean-ups), freeing up time for the data engineers who usually perform these tasks.

In this post, I will present a real-world application for dbt—the primary analytics tool in my company, N26—including the path to its adoption, its benefits, and the lessons learned from our implementation.

A data backstory

At N26, I'm part of the Data Analytics (DA) team. Although this group belongs to the larger Data organization (including Data Science and Data Engineering), the data infrastructure topics we wrangle are pretty different.

N26 is a fully-licensed neo-bank from Germany. As an app-centric company, they're fully focused on creating an excellent product.

As the business grew and the organization evolved, so did the data challenges we faced—and the data team's structure mirrored that. Before I joined the company in 2018, we had only 1 FTE data analyst. We then had a hiring surge to bring in new analysts to support the expanding marketing, operations, and product departments. Then, in 2019, N26 launched the US branch of their business, so we spun up a new local data warehouse, and, more importantly, we incorporated dbt in our data services architecture.

Fast-forward three years: Today, we're looking to add a new analytics engineer to maintain and contribute to our ever-growing data projects. Undoubtedly, this is a testament to the impact dbt has had on our team's work: It's boosted our productivity and broadened our scope.

Before dbt, most of the data pipelines at N26 were created and managed by the Data Warehouse Engineering (DWH) team, which sits within Data Engineering. Almost all of those pipelines ingested data stored in backend services, data from third-party external sources (e.g., Salesforce), or data produced by our website/app tracking. The only way data analysts could set up data pipelines or create data models independently was through a light ETL microservice we had built in-house and the automation around it.

N26 runs the open-source version of dbt, but there's a paid, web-based IDE, too.

Over time, the service’s scope expanded, but, at its core, it queried vendors’ APIs (e.g., Google Ads) to push or pull data to and from the DWH. Some more sophisticated data models, like the customer attribution model, were developed and run there, but adding new data models to the service was cumbersome: It required data analysts who could confidently code ETL pipelines in a mixture of Python and SQL, on a service with little standardization. We realized that our set-up had several bottlenecks in tooling and data governance, so we decided to try something new. Enter dbt.

A reform movement

dbt core is an open-source tool for building complex data pipelines out of SQL SELECT statements. dbt Labs claims that “anyone who knows SQL can build production-grade data pipelines,” and we can validate that claim. Data analysts who only know SQL for data retrieval operations (like SELECT statements) can create robust data transformation pipelines with dbt. Effectively, you no longer need to know “admin” SQL to create and manage data models.
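
To make the claim concrete, here is a minimal sketch of a dbt model; the model, source, and column names are invented for illustration and are not part of N26's actual project:

```sql
-- models/marts/daily_signups.sql (hypothetical model name)
-- A dbt model is just a SELECT statement saved as a file; dbt takes care of the
-- CREATE/DROP statements and grants when it materializes the result.
{{ config(materialized='table') }}

select
    date_trunc('day', created_at) as signup_date,
    count(*)                      as signups
from {{ source('app', 'users') }}   -- assumed raw source, declared in a .yml file
group by 1
```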

The core tool has rich features: It's data-warehouse agnostic, generates documentation automatically (with a front-end to browse it), integrates with CI/CD, and makes testing and dependency management easy. We use the open-source version, so we host the project on our own infrastructure, but there is also a popular paid product, dbt cloud, which offers a browser-based IDE and is hosted by dbt Labs.

The introduction of dbt has revolutionized how the data analytics team works at N26. It allowed us to easily add business logic to raw data sources without taking precious time from the DWH team. Now, we can fully own the design, creation, implementation, and maintenance of the core data models that power any modern digital business. Some advanced data models already existed, but having dbt drastically improved the quantity and quality of the pipelines we could create as a team.

dbt cloud in action

There was some prep work for implementing dbt in our architecture. We had to create hundreds of markdown files to document all the existing tables in the DWH, since that's the only way to properly use the ref and source functions, which are the backbone of dbt’s visual lineage graphs for dependencies. Since both functions leverage Jinja templating, we all had to learn exactly how that worked. On the plus side, though, we now have documentation connected to what’s actually in the DWH, which is worlds better than the static Confluence pages we had before.
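
As a rough illustration of what that documentation work looks like (file, table, and doc-block names here are placeholders, not our real ones), a source is declared in YAML, its description can live in a markdown doc block, and models then reference it with the source and ref functions:

```yaml
# models/sources/app.yml -- illustrative source declaration
version: 2

sources:
  - name: app
    schema: raw_app                             # warehouse schema holding the raw table
    tables:
      - name: users
        description: '{{ doc("app_users") }}'   # pulls the text from the markdown file below
```

```markdown
{% docs app_users %}
One row per registered customer, as ingested from the backend service.
{% enddocs %}
```

Any model can then select from {{ source('app', 'users') }} or {{ ref('another_model') }}, and the dependency shows up in the lineage graph automatically.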

Next, we converted the analytics models we had at the time into dbt logic. Sometimes this meant struggling with recursive model alternation and with column-level data type definitions, since dbt infers data types from whatever source table you use.
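
One pattern that helps here, sketched below with made-up column names, is casting explicitly in staging models instead of relying on inferred types:

```sql
-- models/staging/stg_transactions.sql -- illustrative staging model
-- Explicit casts pin the column types of the materialized table rather than
-- leaving them to whatever dbt infers from the upstream source.
select
    cast(transaction_id as varchar(36)) as transaction_id,
    cast(amount_cents   as bigint)      as amount_cents,
    cast(booked_at      as timestamp)   as booked_at
from {{ source('app', 'transactions') }}
```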

Then we introduced a system to trigger on-demand job execution through our automation tool, Jenkins, and its “build with parameters” page. It's useful for manually re-triggering a pipeline if the previous build didn’t succeed or for fine-tuning model runs and picking what to update.
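
Under the hood, the Jenkins parameters map onto dbt's node selection syntax, along these lines (the tag and model names are assumptions, not our real ones):

```bash
# Re-run only what is needed after a failed nightly build
dbt run  --select tag:daily            # everything tagged for the daily schedule
dbt run  --select +some_core_model     # one model plus all of its upstream parents
dbt test --select some_core_model+     # the model and everything downstream of it
```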

Luckily, dbt comes with powerful step-by-step logging, which we integrated into our Elasticsearch/Kibana stack using Python’s standard logging module and our custom library.

A look at N26 today

Three years into dbt’s adoption, our data projects have become quite large and complex: We run six dbt projects, with well over 800 data models running daily or weekly. Models are grouped using tags based on the team they were created for, with several core models serving as foundations for the project. At N26, dbt runs as a containerized application on a Kubernetes (k8s) cluster, leveraging Kubernetes’ built-in scheduling (CronJobs) as the automation server, while Jenkins is used for on-demand execution and CI/CD.
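
For readers unfamiliar with that setup, a Kubernetes CronJob wrapping a dbt container looks roughly like the sketch below; the image name, schedule, and model selection are placeholders rather than our actual manifests:

```yaml
# Hedged sketch of a CronJob that runs the models tagged "daily"
apiVersion: batch/v1
kind: CronJob
metadata:
  name: dbt-daily-run
spec:
  schedule: "0 4 * * *"           # every day at 04:00 UTC
  concurrencyPolicy: Forbid       # don't start a new run while the previous one is alive
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: dbt
              image: registry.example.com/analytics/dbt:latest   # assumed internal image
              args: ["run", "--select", "tag:daily"]
```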

All the dbt models

We started simple, with only tables (incremental and not) and seeds (CSV uploads), because we believe that the level of abstraction of your dbt project should increase according to the complexity of the project itself. As N26 became more complex and multi-faceted, so did our dbt project. Today, we run tables, seeds, snapshots, macros, hooks, tests, and sources with automated freshness checks on raw data sources. These are all different object types; however, our 10+ directed acyclic graphs (DAGs) are still relatively simple, grouping objects based on schedule frequency (daily, weekly, etc.).
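
Source freshness checks, for instance, are just a few lines of YAML on the source definition plus a dbt source freshness run; the thresholds and field names below are illustrative:

```yaml
# models/sources/app.yml -- freshness configuration on a raw source (illustrative)
version: 2

sources:
  - name: app
    loaded_at_field: _ingested_at            # timestamp written by the ingestion job
    freshness:
      warn_after:  {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: transactions
```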

Not everything went super smoothly, though. We hadn't fully anticipated the need for solid data architecture knowledge to stand up an efficient and robust analytics engineering service. Most of the team at that point was used to writing one-off queries for ad-hoc analyses, data retrieval, and dashboards, and had no experience with data pipelines. This is why we plan to boost data warehousing knowledge in our team by adding analytics engineers to the roster. To support the business during hypergrowth, with priorities changing rapidly, we did not initially enforce strict development guidelines for data models, which has left us with a somewhat tight dependency network in our DAGs.

As a result of this interdependency, the daily DAG, our biggest one, is heavy and fragile because of the number of models packed into it. While this only affects internal analytics, as our team’s work is never customer-facing, it does sometimes mean that dashboards take longer to update in the morning, an annoyance all the same.

Daily DAG's lineage graph

We run tests on data models through dbt's test functionality, and all the core models are covered. While we can monitor dependencies within the dbt projects, thanks to dbt's rich dependency management, the service it runs on is unaware of dependencies outside of it. That light ETL service I mentioned at the beginning? It produces some of the raw data that powers many models.
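
A hedged example of what those tests look like on a core model (the model, columns, and values are placeholders):

```yaml
# models/marts/core/schema.yml -- generic tests on a core model (illustrative)
version: 2

models:
  - name: dim_customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
      - name: country_code
        tests:
          - accepted_values:
              values: ['DE', 'FR', 'ES', 'IT', 'US']
```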

The concept of a “status” for any dbt run is not fully fleshed out and can’t be referenced during job execution, and we discovered that k8s is better tuned for high-availability services (i.e., backend services) than for cronjob-based ones. There have been multiple cases of jobs running twice because an old k8s pod didn’t shut down properly or pods spontaneously restarted. But even with all the bumps in the road, dbt has still changed the way the N26 Data team works for the better.

The road ahead

What’s next? Certainly, dbt is here to stay, but we are changing how we maintain the project and rethinking who will be responsible for developing it. We plan to hire analytics engineers to help us plan and execute the healthy growth of our dbt data projects and ensure sound data warehouse principles. On the other hand, we want to separate core business analytics responsibilities from development and data modeling duties and move the former out to the business departments.

We are also considering adding a workflow orchestrator like Airflow to our infrastructure. With that addition, we hope to address inter-service dependency issues, improve non-happy paths management, and expand the concept of job status. Introducing higher severity tests to guarantee data quality is on the roadmap, too — and we are just about to launch automated SQL format checks to standardize how our code looks!
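
dbt's built-in test severity settings are one way to implement those higher-severity checks; a minimal sketch, with made-up model names and thresholds:

```yaml
# Promoting a test from a warning to a hard failure (illustrative config)
version: 2

models:
  - name: fct_transactions
    columns:
      - name: transaction_id
        tests:
          - not_null:
              config:
                severity: error    # fail the run outright
          - unique:
              config:
                severity: warn
                warn_if: ">100"    # tolerate a handful of duplicates before warning
```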

All in all, going with dbt was a great decision. It has turbocharged our team’s output, in both quality and quantity, and it is the right tool to support the expansion of our data platform as N26 grows its business with new products and new geographies.

☀️ Catch Carlo's full talk from Summer Community Days here.

😎 Then, head on over to the Operational Analytics Club to share your thoughts with Carlo and others in the community!
