
Intro to Survival Analysis in Python | Census

Allie Beazell · May 06, 2021

Allie Beazell is director of developer marketing @ Census. She loves getting to connect with data practitioners about their day-to-day work, helping technical folks communicate their expertise through writing, and bringing people together to learn from each other.

Two of the hottest data trends right now are machine learning and deep learning. But before you jump completely to the cutting edge, there are a variety of analytical tools and techniques that still, arguably, provide even more business value: revenue cohort analyses, marketing attribution modeling, and survival analyses.

For this piece, we’re going to focus on the last item in that list: the statistical technique of survival analysis, which determines the expected duration of time until an event occurs. We’ll take a look at what survival analysis is, its business applications, how it works, and an example of survival analysis in Python.

Let’s dive in.

What is survival analysis?

Survival analysis helps us estimate the expected duration of time before a specific event occurs. The "event" is whatever outcome you care about: the occurrence of a heart attack, a cancer diagnosis, the death of a patient, the failure of a device, and so on.

This kind of analysis can help us understand our customer and product lifecycles in a variety of ways, such as predicting the cost of medical care, assessing reliability, and estimating customer lifespan. Observation starts at a defined point (e.g. diagnosis, signup, or installation) and continues until the event of interest (failure, churn, or death) occurs or the subject leaves the study, and the resulting data has several incredibly valuable business applications across nearly every industry.

Business applications for survival analysis in Python

As we said up top, survival analyses in Python aren’t an industry-specific technique. No matter who your customers are, what your product is, or what stack you use to support both, your team will benefit from gaining a better understanding of key metrics, such as:

  • Active user survival rate
  • Product time to purchase
  • Campaign effectiveness evaluation
  • Employee churn estimation
  • Machine lifecycle measurement

Here’s a breakdown of these common use cases for survival analysis in Python.

Active user survival rate

Survival analysis can help you predict or identify customers whose survival rates (how long they’ll be active users, not a life and death thing here ☠️) are low or decreasing. This information can help you head off customer satisfaction and retention issues before they come to a head, and makes it easier to pull together an actionable marketing strategy.

For example, if you saw that a cohort of users associated with a specific account or team type had decreasing active user survival rates (or had gone fully inactive), you might check into their account usage and see that those folks aren’t getting as much out of your product as they could. Knowing this, you can have your AEs reach out to reengage those folks by showing them the features they’d get the most value out of, saving your accounts.

Product time to purchase

Survival analysis isn’t just good for understanding your customer’s activity rate. This method can also help you and your data team predict each customer’s time to purchase a product to improve your revenue forecasting.

Using survival analysis in this way can also help you predict and assess the percentage of customers that will stay subscribed to a given service over time.

Campaign effectiveness evaluation

Monitoring how a particular campaign affects survival rates gives you a fuller picture of a customer’s lifetime value (LTV). Survival analysis gives you the ability to gain further insight into the effectiveness of each of your campaigns.

For example, real estate and mortgage companies can leverage survival analysis to get a better understanding of time to mortgage redemption, which makes for more accurate account forecasting.

Employee churn estimation

The use cases for survival analysis in Python don’t end with your customers, however. This technique can help your people ops teams gain insight into the employee lifecycle and churn, too.

By calculating the survival rate for each employee (similar to how you’d measure it for customer lifecycles), you get an estimate of how long the average employee stays with your company, and you can either work to prevent departures or start preparing for them ahead of time.

Machine lifecycle measurement

If your company relies on machines to get things done (like pretty much any company today), survival analysis can help you predict when core components and tools will need to be replaced. Just like employee and customer lifecycle estimation, you’re measuring the time to failure for each specific machine you’re interested in.

For example, in manufacturing, companies can apply survival analysis to predict when a particular part of a process or assembly line will fail and need repair. This helps organizations avoid costly downtime and stay on top of operational health.

Math basics behind survival analysis

To understand how survival analysis works, we need to understand the basics of some core math principles. We’ll spare you a deep dive and just intuitively go over the relevant concepts below:

  • Survival time
  • Survival function
  • Hazard function
  • Kaplan-Meier method

Survival time

Survival time, denoted T, simply represents the time until an event occurs. For example, it can refer to the time until a customer churns, the time until a customer converts, or the time until a machine fails.

Survival function

The survival function, denoted as S(t), represents the probability that the event of interest has not occurred by some time t.
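In symbols, if T is the survival time, then:

S(t) = P(T > t)

S(0) = 1 (no one has experienced the event at the start), and S(t) can only stay flat or decrease as t grows.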

For example, if we plotted the survival function for three clients (call them client_A, client_B, and client_C), we might see that client_C is most likely to churn before client_A and client_B, and that it’s very likely client_C will churn within the first two weeks.
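As a rough illustration of how you could produce that kind of per-client comparison, here’s a minimal sketch using the lifelines library (introduced in more detail later in this post). The column names and the data below are hypothetical, purely for illustration, and assume each client account has several users whose tenure we’ve tracked:

import pandas as pd
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

# Hypothetical per-user data: weeks each user stayed active, whether they
# eventually churned, and which client account they belong to
df = pd.DataFrame({
    'tenure_weeks': [12, 30, 45, 8, 22, 40, 2, 3, 6, 10],
    'churned':      [1, 0, 0, 1, 1, 0, 1, 1, 1, 0],
    'client':       ['client_A', 'client_A', 'client_A',
                     'client_B', 'client_B', 'client_B',
                     'client_C', 'client_C', 'client_C', 'client_C'],
})

# Fit and plot one Kaplan-Meier curve per client so their survival
# probabilities can be compared over time
ax = plt.subplot(111)
for name, group in df.groupby('client'):
    kmf = KaplanMeierFitter()
    kmf.fit(group['tenure_weeks'], event_observed=group['churned'], label=name)
    kmf.plot_survival_function(ax=ax)
plt.xlabel('Weeks since signup')
plt.ylabel('Probability still active')
plt.show()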

Hazard function

The hazard function, denoted as h(t), represents the instantaneous rate at which the event occurs at time t, given that it has not occurred before then. Generally, this is not the focus of survival analysis; instead, the emphasis is on the survival function.
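Formally, the hazard is the event rate at time t among the subjects that have made it to time t without the event:

h(t) = lim_{Δt → 0} P(t ≤ T < t + Δt | T ≥ t) / Δt

A high hazard at week two, for example, means customers who are still active at week two are at high risk of churning right around then.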

Kaplan-Meier method

The Kaplan-Meier method is a non-parametric way to estimate the survival function and is the one you’ll see most often in applications. As in the survival function example above, the resulting survival curve helps you determine the fraction of subjects surviving past each point in time (again, duration of accounts, not death ☠️). The estimate is built up from survival probabilities computed at each observed event time (the formula is sketched right after the list below). There are three assumptions that the Kaplan-Meier method makes:

  1. Participants who drop out (i.e. are censored) have the same survival prospects as those who stay in the study
  2. Survival probabilities are the same for participants who join the study early or late
  3. The event happens at the time specified (event times are recorded precisely)
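Concretely, the Kaplan-Meier estimate is a running product over the observed event times:

Ŝ(t) = ∏ over all event times tᵢ ≤ t of (1 − dᵢ / nᵢ)

where dᵢ is the number of events observed at time tᵢ and nᵢ is the number of subjects still at risk just before tᵢ. Censored subjects shrink the risk set nᵢ but never count as events.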

To learn about the underlying mathematics behind survival analysis, check out this awesome article from Square.

Now, let’s dive into the nitty-gritty of survival analysis in Python.

Coding implementation for survival analysis in Python

Python provides us with an amazing library called lifelines for survival analysis. In this demonstration, we’re using the Kaplan-Meier estimator. We’ll work with lifelines’ built-in Waltons dataset, which gives us a duration for each subject along with an event indicator telling us whether the event was observed or the observation was censored.


import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter
from lifelines.datasets import load_waltons

# Read the example data into a DataFrame
df = load_waltons()

# Assume any preprocessing has already been done. For the sake of this
# analysis, treat T as the number of weeks until a customer churns and E as
# whether the churn event was actually observed (1) or censored (0).
kmp = KaplanMeierFitter()
T = df['T']    # Duration until the event (or until censoring)
E = df['E']    # Whether the event was observed or the sample was censored
kmp.fit(T, event_observed=E)

# Plot the estimated survival curve
kmp.plot()
plt.xlabel("# of weeks until churn")
plt.ylabel("Survival likelihood")
plt.show()

Here, the dataset has columns T and E, representing duration and censoring, respectively. We apply the Kaplan-Meier estimator to the data and plot the result: a curve showing the estimated probability that a customer is still active after each number of weeks.
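Once fitted, the estimator also exposes the numbers behind the curve. As a minimal follow-up sketch (assuming a reasonably recent version of lifelines, where these attributes and methods are available):

# Median number of weeks until churn: the time at which S(t) crosses 0.5
print(kmp.median_survival_time_)

# Estimated probability that a customer is still active after 10 weeks
print(kmp.predict(10))

# The full estimated survival function as a pandas DataFrame
print(kmp.survival_function_.head())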

Harness survival analysis in Python to improve customer insights

Overall, survival analysis in Python has significant value when used in the right context, such as helping you gain more insight into your customer and campaign lifecycles, as well as the longevity of your equipment. When you understand how to successfully leverage survival analysis, you can level up your insights into customer behavior, operations management, and marketing initiatives.

If you’re looking to further supercharge your customer data insights and operationalize your data, reach out for a 1:1 demo or start your free Census trial.
