
Stars, tables, and activities: How do we model the real world?

Stephen Ebrey, November 15, 2022

Stephen is the founder of Sawtelle Analytics and has worked in operations, product management, and data across gaming, advertising, and eCommerce companies. He loves trying out new products and being part of growing startups, and recently moved from a full-time job at Omaze to data consulting.

For 25 years, Ralph Kimball and Margy Ross’s The Data Warehouse Toolkit has provided the mainstream method of data modeling, alternatively called Kimball-style warehouses, dimensional models, star schemas, or just “fact and dimension tables.”

In fact, it’s so ingrained in the data industry that nearly every data conference has a talk arguing for or against its continued relevance – or suggesting a new alternative altogether.

The History

For those of you who are new to the concept, fact tables are lists of events that rarely change (e.g. orders), and dimension tables are lists of entities that often change (e.g. a list of customers). To keep this straight in my mind, I often think about how a consultant once explained it to me:

“You can tell what’s a fact and what’s a dimension by explaining your report with the word ‘by.’”

Sales by customer, refunds by product, and so on: it’s always fact by dimension. Here’s an example 👇

Example tables: fact_orders and dim_customer

The idea here is simple: it wouldn’t make sense to include the email, location, or lifetime value (LTV) of the customer on the order. Why? Those things change, and you don’t want to be updating year-old orders when your orders table has millions of rows.
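To make that concrete, here is a minimal sketch of what the two tables and a “sales by customer” query might look like. The column names are illustrative assumptions, not taken from the article.

```sql
-- Hypothetical fact table: one row per order. Facts rarely change once written.
CREATE TABLE fact_orders (
    order_id    INTEGER,
    customer_id INTEGER,   -- foreign key into dim_customer
    order_date  DATE,
    order_total NUMERIC
);

-- Hypothetical dimension table: one row per customer. These attributes change over time.
CREATE TABLE dim_customer (
    customer_id    INTEGER,
    email          VARCHAR,
    location       VARCHAR,
    lifetime_value NUMERIC
);

-- "Sales by customer" = fact by dimension, joined at query time.
SELECT
    c.customer_id,
    c.email,
    SUM(o.order_total) AS total_sales
FROM fact_orders o
JOIN dim_customer c ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.email;
```

Because email, location, and LTV live only in dim_customer, updating them never touches the millions of rows in fact_orders.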

The Case For Kimball

The basic organization – orders, customers, and products being separate tables – makes sense to most people. It’s easy to join the tables together and to know what to put in which table.

Even if you want to make Wide Tables, combining fact and dimensions is often the easiest way to create them, so why not make them available? Looker, for example, is well suited to dimensional models because it takes care of the joins that can make Kimball warehouses hard to navigate for business users.
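Building on the hypothetical tables sketched above, a Wide Table can be as simple as a view that pre-joins the star:

```sql
-- Hypothetical wide table: order-grain rows with customer attributes denormalized onto them.
CREATE VIEW orders_wide AS
SELECT
    o.order_id,
    o.order_date,
    o.order_total,
    c.email,
    c.location,
    c.lifetime_value
FROM fact_orders o
LEFT JOIN dim_customer c ON o.customer_id = c.customer_id;
```

A single-table BI tool can then point at orders_wide without business users ever writing a join, while the underlying facts and dimensions stay available for everything else.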

But one of the main advantages of Kimball is that it provides a blueprint. 📘 I’ve worked with companies that don’t model the data coming out of their ELT tools at all, and fact and dimension tables are a great model for a data warehouse.

It’s also a common standard that new team members will be able to contribute to quickly. By comparison, if you skipped the star schema and just created Sales by Product and Sales by Customer tables, you could easily run into problems when someone asks for a Sales by Customer by Product table.

The Case Against Kimball

Every modern talk about Kimball – even the ones that are generally positive – includes a long list of practices from the book that are no longer relevant to cloud data warehouses.

Some of the more extreme Kimball practices (like creating a city dimension and referring to it as city ID 123 instead of the actual city name) only made sense when analytical databases were row-based and did not store repeated names efficiently. 

Plus, a huge portion of the book is spent on managing storage costs, and storage is practically free nowadays. This raises the question of whether dimensional modeling is an artifact of a constrained era, like how the QWERTY layout was designed around the mechanical limitations of early typewriters.

Generally, the arguments against Kimball fall into 2 main buckets:

  • Business people don’t understand joins. This is the argument for Wide Tables or One Big Table, such as a table called sales_by_customer_by_product. Certain BI tools, such as Metabase, provide a good self-service interface when dealing with a single table, but are useless with star schemas.
  • Separating activities into multiple fact tables makes it hard to track a user’s behavior over time. If a customer gets an email that’s in fct_email_events, visits the website which creates a few rows in fct_page_views, adds something to their cart which is recorded in fct_events, and then makes a purchase which is stored in fct_orders, it becomes hard to tell a cohesive story about why the customer made the purchase. 🤔 This is the argument for activity schemas, which combine all fact tables into a single one with a few standard columns.
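To make that second bucket concrete, here is a rough sketch of collapsing several fact tables into one activity-style table. This is not Narrator’s exact specification; the timestamp and activity names are assumptions for illustration.

```sql
-- Illustrative activity-style view: one row per customer touchpoint, all in the same shape.
-- Source tables are the ones named in the bullet above; timestamp column names are assumed.
CREATE VIEW customer_activity AS
SELECT customer_id, sent_at     AS activity_ts, 'received_email'  AS activity FROM fct_email_events
UNION ALL
SELECT customer_id, viewed_at   AS activity_ts, 'viewed_page'     AS activity FROM fct_page_views
UNION ALL
SELECT customer_id, occurred_at AS activity_ts, 'added_to_cart'   AS activity FROM fct_events
UNION ALL
SELECT customer_id, ordered_at  AS activity_ts, 'completed_order' AS activity FROM fct_orders;
```

With every touchpoint in one table, “what happened before this customer purchased?” becomes a single ordered scan per customer instead of a four-way join.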

Everyone has their own incentives

Tools that work well with Kimball models will suggest that you use Kimball models, while ones that do well with Wide Tables or Activity Schemas will push for those. As Aram Panasenco said on the dbt Slack:

While Narrator is a cool tool — and there’s nothing wrong with implementing a schema that works well with it — you should know whether you’re implementing something for best practices or for a tool. 🧰 I’d personally love to see more discussion of giant fact tables combining all your important events (not limited to just 3 columns) that play well in a particular tool. It’s a fascinating idea!

Modern data modeling provides the answers

With modern data modeling tools like dbt, it’s very easy to create fact and dimension tables, then spin up Wide Tables as needed. Maybe your CX team wants to see Orders by Customer in Metabase, or your marketing team wants to sync Orders by Product to Salesforce with Census. Either way, it takes very little time to create views for these use cases.  
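As a sketch of how little code that takes, an Orders by Product view in dbt might be a single model file like the one below. The model and column names are hypothetical, and it assumes a product_id on each order row.

```sql
-- models/marts/orders_by_product.sql (hypothetical dbt model)
-- One row per product, ready to expose in a BI tool or sync to Salesforce with Census.
{{ config(materialized='view') }}

SELECT
    p.product_id,
    p.product_name,
    COUNT(*)           AS order_count,
    SUM(o.order_total) AS total_revenue
FROM {{ ref('fct_orders') }} AS o
JOIN {{ ref('dim_product') }} AS p
    ON o.product_id = p.product_id
GROUP BY p.product_id, p.product_name
```

Because it is just a view over the existing star, it can be reshaped or dropped later without touching the fact and dimension tables underneath.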

You can also create a view that turns your fact tables into a strict Activity Schema if you want to try out Narrator. Then, if you later implement Looker, you can join the fact and dimension tables directly in your Explores. Kimball modeling as the core (but not the extent) of a modern data warehouse is hard to beat in flexibility and ease of use!

👉 Want to start sending data to all your go-to-market tools to make it actionable for your teams? Book a demo with a Census product specialist.
