Engineering

Architecting Multi-tenant Data Integrations with Census Embedded

Sean Lynch
January 12, 2024

Sean is co-founder and Chief Product Officer (CPO) at Census. He loves enabling data-driven organizations, so he's energized by introducing the world to Data Activation.

The concept of multi-tenancy (“tenant” meaning user or customer) is the backbone of modern SaaS products. The idea is simple: Rather than each customer having their own dedicated instance of an app, they all use the same instance and only see the data, views, and workflows that apply to them as a customer. Using a single, shared instance makes the service much easier to run and scale, and is functionally the core of the “As-A-Service” part of SaaS.

Multi-tenancy is an easy concept to understand, but implementing it requires a bit more thought, and it’s extremely important to get right. Nothing can damage customer confidence in your app faster than showing details about Customer A to Customer B.

At a high level, architecting data integrations for multi-tenancy means thinking about when data is “split” from a single repository into individualized repositories for each customer, or, vice versa, when multiple streams are merged together and how each customer’s ownership of that data is maintained.

Building integrations with Census Embedded on top of data warehouses gives you a number of complementary approaches for ensuring your customers’ data is siloed, and because they’re complementary, they can be used in combination to provide strong multi-tenancy guarantees. Census Embedded integrations usually follow one of two patterns:

  • Activation - Taking data you have and sending it to your customers.
  • Ingestion - Getting data from your customers into your application.


Activation

In activation cases, your goal is to take your own data and send it to your end users.

Though in some cases you may be activating the same data set to every customer, the most common activation pattern is using Census Embedded to send each customer their own data. For example, this might be a log of activity they’ve generated in your app that they’d like pushed into a marketing tool as events.

To accomplish this, each customer’s integration needs to be configured to split your data repository down to the subset of data specific to them. Each integration is made up of multiple resources.

The integration “stack” is a set of resources connecting data to the destination

In this stack, there are three places where the split can occur:

  1. Separate tables/views for each customer in your data source - This is the first potential layer of data isolation. It requires setting up a new database object for each customer as they’re onboarded. This can be a bit heavyweight, but tools like dbt provide macros that make it easy to define N tables once (see the dbt sketch after this list).
  2. Use row-level security to filter data - Row-level security is an underappreciated capability in most data warehouses. It allows you to dynamically apply a filter to data based on who is querying a table. Use this capability in combination with a per-customer connection credential to filter data automatically for each customer (see the policy sketch below).
  3. Per-customer Models - This is the easiest step to start with. For each customer, create a model that selects only the data that applies to them before creating a sync configuration.
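
As a sketch of the first approach, a dbt macro can loop over your customer list and create one filtered view per customer. The macro, schema, and column names below are illustrative assumptions, not Census or dbt built-ins:

```sql
-- macros/create_customer_views.sql (hypothetical macro; assumes an
-- analytics.events table with a customer_id column).
{% macro create_customer_views(customers) %}
  {% for customer in customers %}
    {% do run_query(
      "create or replace view analytics.events_" ~ customer ~
      " as select * from analytics.events where customer_id = '" ~ customer ~ "'"
    ) %}
  {% endfor %}
{% endmacro %}
```

Invoking `dbt run-operation create_customer_views --args '{customers: [acme, globex]}'` then (re)creates one view per customer in a single command.

For the second approach, here’s a minimal sketch using Snowflake’s row access policies (other warehouses offer similar features); the mapping table and role naming are assumptions:

```sql
-- Assumes a security.role_customer_map table that maps each per-customer
-- connection role to the customer_id values it is allowed to read.
create or replace row access policy customer_rls
  as (customer_id varchar) returns boolean ->
    exists (
      select 1
      from security.role_customer_map m
      where m.role_name = current_role()
        and m.customer_id = customer_id
    );

-- Attach the policy so every query against the table is filtered
-- automatically, including queries made through per-customer connections.
alter table analytics.events
  add row access policy customer_rls on (customer_id);
```

The third approach is just a one-line filter saved as that customer’s model, e.g. `select * from analytics.events where customer_id = 'acme'`.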

Splitting higher in the stack requires duplicating more configuration downstream of the split, but it has a key benefit: these splits can be used in combination. Maintaining multi-tenant isolation when sending data to your customers is incredibly important, and having multiple layers of enforcement can protect you from headaches if any one layer breaks.


Ingestion

What about data flowing in the other direction? Here, you’re using Census Embedded to load data from your customers’ data sources into your product. Maintaining multi-tenancy guarantees is just as important in this direction.

The first step of an ingestion integration is having the customer provide a connection to their data source. You can have them point to a specific table, or optionally provide a SQL query that you save as a model.
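
For example, a customer-provided query saved as a model might select just the columns your app expects to ingest (the schema and column names here are hypothetical):

```sql
-- A query the customer writes against their own warehouse,
-- saved as a Census model for the ingestion sync.
select
  user_id,
  email,
  plan_tier,
  updated_at
from analytics.dim_users
where email is not null
```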

The next step is loading that data into your app. How your app receives that data is up to you. You can use one of Census’s existing destinations, including S3 files, database tables, and even webhooks. You can also implement a Custom Destination connector that talks to your API, which gives you the ability to expose more fine-grained errors.

The architectural decision in this case is at what point the data will be merged, and how it will remain associated with each customer once it is:

  • Isolated staging - Load data into separate destinations, for example individual Parquet files in S3 identified by customer IDs in the file path. These files can then be queried using a data lake approach (e.g., Snowflake can expose S3 files as “external tables”); see the first sketch after this list.
  • Merge on load - Load data into a single resource directly; see the second sketch after this list. In this case, you’ll need to ensure two things:
    1. All data being loaded includes a Customer Identifier - You can use a static mapping in a sync configuration to provide this customer ID.
    2. Loaded data cannot conflict or overwrite - Your integration should treat data as appends. If, for any reason, a unique ID for a record is required, that identifier should also include the customer ID to ensure data from Customer A doesn’t accidentally overwrite data from Customer B.
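
Here’s a sketch of the isolated-staging approach on Snowflake; the stage, bucket layout, and column names are assumptions (credentials/storage integration omitted for brevity):

```sql
-- Bucket layout assumed: s3://my-app-ingest/<customer_id>/<file>.parquet
create or replace stage customer_drops
  url = 's3://my-app-ingest/'
  file_format = (type = parquet);

-- One external table over all customers' files; the tenant is recovered
-- from the file path, so ownership survives the merge.
create or replace external table ingest.customer_events (
  customer_id varchar as (split_part(metadata$filename, '/', 1)),
  payload variant as (value)
)
location = @customer_drops
file_format = (type = parquet);
```

And for merge-on-load, a minimal destination table whose composite key bakes the tenant into record identity (names again are assumptions; note that Snowflake treats primary keys as informational, so your loader should enforce uniqueness):

```sql
create table app.ingested_records (
  customer_id varchar not null,   -- static value mapped per sync configuration
  record_id   varchar not null,   -- customer-scoped ID from the source
  payload     variant,
  loaded_at   timestamp_ntz default current_timestamp(),
  -- Composite key: Customer A's record 42 can never
  -- collide with Customer B's record 42.
  primary key (customer_id, record_id)
);
```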


Either Way, Use Workspaces

Regardless of which integration type you’re building, we recommend using a separate Census Workspace for each of your customers. Workspaces provide an isolated space per customer, ensuring that their sources and destinations are never accidentally reused for other customers.

Workspaces require a bit more management work, as you’ll need to replicate connections, syncs, and models across each Workspace, but this also allows simple per-customer customization and can save you a lot of pain. The Management API makes it easy to create tokens per workspace and manage user access. And of course, Workspaces are the only way to go if you plan to grant your end users access to the Census UI as well.


Wrapping Up

Hopefully this clarifies how to enforce multi-tenant constraints within Census. As always, if you need a hand getting Census Embedded configured for your environment, don’t hesitate to reach out to our support team at support@getcensus.com.
