Improving Data Quality with Entity Resolution

Daisy McLogan
January 05, 2024

I'm a Customer Data Architect at Census, and I help our customers implement best practices for cleaning, transforming, and activating their data.

In this blog post, we will explore the concept of Entity Resolution and its significance in enhancing data quality for marketing purposes.

Signs you could benefit from Entity Resolution

If you find that your marketing campaigns are not reaching the intended audience, or if you are experiencing a high rate of duplicate customer records in your database, it might be a sign that you could benefit from implementing Entity Resolution. Entity Resolution helps to identify and merge duplicate or similar records, giving you a more accurate, consolidated view of your customer data (a customer 360). By eliminating duplicate records, you can improve the effectiveness of your marketing efforts and avoid wasting resources on targeting the same customer multiple times.

Another sign that you might benefit from Entity Resolution is if you are struggling with data inconsistency. Inconsistent data, such as different spellings or variations of customer names, addresses, or contact information, can lead to errors in your marketing campaigns and hinder your ability to effectively segment your audience. Entity Resolution can help standardize and reconcile inconsistent data, ensuring that you have reliable and consistent information to work with.
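As a simple illustration of reconciling inconsistent values, a normalization step typically runs before any matching happens. The sketch below uses only Python's standard library; the sample name variants are hypothetical:

```python
import re

def normalize_name(name: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace so that
    variants like "O'Brien,  Patrick" and "obrien patrick" compare equal."""
    name = name.lower()
    name = re.sub(r"[^\w\s]", "", name)       # drop punctuation
    return re.sub(r"\s+", " ", name).strip()  # collapse internal/edge whitespace

# Hypothetical variants of the same customer record:
variants = ["O'Brien,  Patrick", "obrien patrick", "OBRIEN PATRICK "]
print({normalize_name(v) for v in variants})  # all three collapse to one key
```

Standardizing values this way means downstream matching compares like with like, instead of treating every spelling variant as a distinct customer.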

Challenges in Data Quality

Data quality is a common challenge faced by many marketers. Poor data quality can negatively impact marketing efforts and lead to inaccurate targeting, wasted resources, and missed opportunities. Some of the key challenges in data quality include:

  • Duplicate records: Duplicate customer records can create confusion and lead to redundant marketing efforts.
  • Inconsistent data: Inconsistent or incomplete data can result in errors and make it difficult to segment and target the right audience.
  • Data integration: Integrating data from multiple sources can be challenging and may result in data discrepancies and inconsistencies.
  • Data validation: Ensuring the accuracy and reliability of data can be a time-consuming task, especially when dealing with large datasets.

Addressing these challenges requires a systematic approach, and Entity Resolution can play a crucial role in improving data quality for marketing purposes. Entity Resolution is commonly used within a Customer Data Platform (CDP) or Master Data Management (MDM) approach, but it can also be implemented on its own.

Benefits of Entity Resolution for Marketing

Entity Resolution offers several benefits for marketing:

1. Improved customer data accuracy: By resolving and merging duplicate or similar records, Entity Resolution helps to ensure that your customer data is accurate and up-to-date. This allows you to make better-informed marketing decisions and deliver personalized and targeted campaigns.

2. Enhanced customer segmentation: Entity Resolution enables you to create more precise customer segments by eliminating duplicate records and reconciling inconsistent data. This helps you to tailor your marketing messages and offers to specific customer groups, increasing the effectiveness of your campaigns.

3. Cost and resource savings: By eliminating duplicate records and improving data quality, Entity Resolution helps you avoid wasting resources on redundant marketing efforts. This can result in cost savings and increased efficiency in your marketing operations.

4. Improved campaign effectiveness: With accurate and reliable customer data, you can better understand your target audience and create more impactful marketing campaigns. Entity Resolution helps you to identify key customer insights and trends, enabling you to deliver relevant and timely messages to your customers.

By leveraging the benefits of Entity Resolution, you can improve the quality of your data and optimize your marketing efforts to drive better results.

4 ways to implement Entity Resolution: AWS Entity Resolution, Census, Python RecordLinkage, Zingg

There are multiple ways to implement Entity Resolution depending on your requirements and technical capabilities. Here are four popular options:

1. AWS Entity Resolution: AWS offers a powerful and scalable Entity Resolution service that leverages machine learning and data matching algorithms to identify and resolve duplicate or similar records. It provides a comprehensive set of features and integration capabilities with other AWS services.

2. Census Entity Resolution: This solution is best for data and marketing teams looking to increase the quality of their customer data. It offers complex matching capabilities for dynamic operational environments and can update your CRM, marketing platform, BI, and warehouse with clean data.

3. Python RecordLinkage: Python RecordLinkage is a Python library that provides tools for record linkage and deduplication. It offers various matching algorithms and methods for comparing and linking records based on similarity measures. (see our tutorial)

4. Zingg: Zingg is an open-source Entity Resolution toolkit. It provides a flexible and customizable framework for resolving entities and can be integrated into your existing data processing pipelines.

These are just a few examples, and there are other Entity Resolution tools and frameworks available in the market. Choose the one that aligns with your technical requirements and resources.
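Whichever tool you choose, the core record-linkage idea is the same: generate candidate pairs, score attribute similarity, and flag pairs above a threshold. The sketch below illustrates that idea using only Python's standard library (difflib stands in for the string comparators these libraries provide; the records and threshold are illustrative):

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a: str, b: str) -> float:
    """Score two strings between 0 (no overlap) and 1 (identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

records = [
    {"id": 1, "name": "Jon Smith",  "email": "jon.smith@example.com"},
    {"id": 2, "name": "John Smith", "email": "jon.smith@example.com"},
    {"id": 3, "name": "Ann Lee",    "email": "ann.lee@example.com"},
]

# Compare every candidate pair; flag likely duplicates above a threshold
# or on an exact email match.
THRESHOLD = 0.85
for left, right in combinations(records, 2):
    score = similarity(left["name"], right["name"])
    if score >= THRESHOLD or left["email"] == right["email"]:
        print(f"possible duplicate: {left['id']} ~ {right['id']} (name score {score:.2f})")
```

Production tools add blocking (only comparing records that share a key, to avoid scoring every pair) and purpose-built comparators such as Jaro-Winkler, but the pipeline shape is the same.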

Best Practices for Effective Entity Resolution

To ensure effective implementation and utilization of Entity Resolution, consider the following best practices:

1. Understand your data: Gain a deep understanding of your data sources, data quality issues, and data characteristics. This will help you make informed decisions and choose the right Entity Resolution approach.

2. Establish data governance: Implement data governance practices to ensure data quality, consistency, and integrity throughout the entity resolution process. Define data standards, data cleansing procedures, and data ownership responsibilities.

3. Use multiple matching criteria: Instead of relying on a single matching criterion, use a combination of attributes and matching algorithms to increase the accuracy of entity resolution. Consider factors like name, address, email, phone number, and social media handles.

4. Regularly update and maintain your entity resolution system: As new data comes in and your customer database evolves, it's important to regularly update and maintain your entity resolution system. This will help you identify and resolve new duplicates and inconsistencies.

5. Monitor and measure the effectiveness: Continuously monitor the performance and effectiveness of your entity resolution system. Measure key metrics such as precision, recall, and F1 score to assess the accuracy and impact of entity resolution on your data quality and marketing efforts.
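To make the metrics in point 5 concrete, precision, recall, and F1 can be computed from a hand-labeled sample of match decisions. The labeled pairs below are hypothetical:

```python
# Each pair is (predicted_match, actually_a_match) for one candidate pair,
# judged against hand labels.
decisions = [
    (True, True), (True, True), (True, False),   # 2 true positives, 1 false positive
    (False, True),                               # 1 false negative (a missed duplicate)
    (False, False), (False, False),              # true negatives
]

tp = sum(1 for pred, truth in decisions if pred and truth)
fp = sum(1 for pred, truth in decisions if pred and not truth)
fn = sum(1 for pred, truth in decisions if not pred and truth)

precision = tp / (tp + fp)  # share of flagged pairs that are real duplicates
recall = tp / (tp + fn)     # share of real duplicates that were flagged
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Low precision means you are wrongly merging distinct customers; low recall means duplicates are slipping through. Tuning your matching threshold trades one against the other.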

By following these best practices, you can ensure that your entity resolution efforts are successful and contribute to improved data quality for marketing purposes.

Recommended next steps to start implementing Entity Resolution

If you are considering implementing Entity Resolution to improve your data quality for marketing, here are some recommended next steps to get started:

1. Assess your current data quality: Evaluate the state of your data and identify areas where data quality issues are impacting your marketing efforts.

2. Define your goals and objectives: Clearly define what you want to achieve with Entity Resolution. This could be reducing duplicate records, standardizing data, or improving data consistency.

3. Choose the right Entity Resolution tool: There are various Entity Resolution tools available, such as AWS Entity Resolution, Python RecordLinkage, and Zingg. Research and select the tool that best fits your requirements and budget.

4. Develop a data integration and cleaning strategy: Determine how you will integrate and clean your data to prepare it for Entity Resolution. This may involve data preprocessing, data deduplication, and data standardization.

5. Implement and test the Entity Resolution solution: Implement the chosen Entity Resolution tool and test its effectiveness in improving data quality. Monitor the results and make any necessary adjustments.

By following these steps, you can start harnessing the power of Entity Resolution to enhance your data quality for marketing purposes.
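Putting those steps together, here is a minimal end-to-end sketch: match records, then merge duplicates into a single consolidated record using a simple survivorship rule. All records, rules, and thresholds here are illustrative, not a prescription:

```python
from difflib import SequenceMatcher

records = [
    {"name": "Jon Smith",  "email": "jon.smith@example.com", "phone": ""},
    {"name": "John Smith", "email": "jon.smith@example.com", "phone": "555-0100"},
]

def is_match(a: dict, b: dict) -> bool:
    # Match on exact email or a high name-similarity score.
    name_score = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return a["email"] == b["email"] or name_score > 0.9

def merge(a: dict, b: dict) -> dict:
    # Survivorship rules: keep the longer name, prefer non-empty values.
    return {
        "name": max(a["name"], b["name"], key=len),
        "email": a["email"] or b["email"],
        "phone": a["phone"] or b["phone"],
    }

golden = records[0]
for rec in records[1:]:
    if is_match(golden, rec):
        golden = merge(golden, rec)
print(golden)  # one consolidated record with the most complete values
```

Real survivorship rules are usually richer (most recent source wins, trusted system wins), but the pattern of match-then-merge is the heart of every Entity Resolution pipeline.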

Understanding Entity Resolution

Entity Resolution is the process of identifying and merging duplicate or similar records within a dataset. It involves comparing and matching data attributes to determine if two or more records refer to the same real-world entity.

The goal of Entity Resolution is to create a single, consolidated view of an entity by eliminating redundancy and inconsistencies. This consolidated view is often called a customer 360 or a golden record. It is particularly valuable in marketing, as it allows for more accurate customer profiling, segmentation, and targeting.

Entity Resolution can be achieved through various techniques, including deterministic matching, probabilistic matching, and machine learning-based approaches. The choice of technique depends on the complexity of the data and the desired level of accuracy.
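To make the distinction between the first two techniques concrete, the sketch below contrasts deterministic matching (exact equality on a chosen key) with probabilistic matching (a similarity score compared to a threshold). The records and threshold are illustrative:

```python
from difflib import SequenceMatcher

a = {"name": "Katherine Jones", "email": "kjones@example.com"}
b = {"name": "Kathrine Jones",  "email": "k.jones@example.com"}

# Deterministic: records match only if a chosen key is exactly equal.
deterministic_match = a["email"] == b["email"]  # False: the emails differ

# Probabilistic: score attribute similarity and compare to a threshold,
# so near-miss spellings can still match.
name_score = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
probabilistic_match = name_score > 0.9

print(deterministic_match, probabilistic_match, round(name_score, 2))
```

Deterministic rules are cheap and predictable but brittle against typos; probabilistic matching catches variants like the misspelled first name above at the cost of tuning a threshold. Machine learning-based approaches learn that threshold (and the attribute weights) from labeled examples.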

Overall, Entity Resolution is a powerful tool for improving data quality in marketing. By ensuring that your customer data is accurate, consistent, and up-to-date, you can enhance the effectiveness of your marketing campaigns and drive better results.
