
Snowflake external functions for data enrichment with AWS example

Stephen Allwright, September 29, 2022

Stephen is a data science leader based in Oslo, Norway. After making the self-taught career transition from finance to data himself, he now enjoys helping others do the same by writing simple-to-understand tutorials that are accessible for all skill levels. If you’d like to read more of Stephen’s writing, you can find it at https://stephenallwright.com/.

Your Snowflake instance is likely full of data that’s been collected over the years. While, on the surface, it might seem like you can’t use data from years ago, don’t dismiss it just yet – that data is still useful.

It can provide value to you even today, guide your key decisions and help you to understand changes in your company. But what if you could enrich it even further by incorporating code from outside of Snowflake? 📈

This is where External Functions in Snowflake come in — and that’s exactly what we'll be looking at in this article.

What are external functions in Snowflake?

An external function is a user-defined function (UDF) that calls code stored and executed outside of Snowflake. Whether that code runs on AWS, GCP, or Microsoft Azure, these functions let you call API endpoints using SQL from within Snowflake.

Simply put, with external functions, you: 

  • Expand what you can do within Snowflake
  • Reduce the need for data pipelines when using external endpoints

How easy is it to set up Snowflake external functions?

It sounds like external functions are great, right? So the big question is: How easy is this to actually set up? 🤔

The answer to that question depends on your starting point. 📍 If you have an existing API that you would like to make requests against, then this is a relatively easy process. However, if you need to create and maintain a new API, then you’ll need some software engineering expertise to help with the process (which requires significantly more time and effort).

Step-by-step example for creating external functions for customer data in Snowflake

Let’s look at a typical use case for external functions.

In this tutorial, we’ll create an external function to enrich sales leads with a ranking score created by an existing machine learning model API. The end result? A much more valuable dataset that you can utilize in an external tool to better prioritize sales leads. 🔥

What you need to get started 

To complete this tutorial, there are a few items that you will need to have in place 👇

  • An existing API endpoint in a cloud platform of your choice (AWS, GCP, or Azure)
  • Account admin role in Snowflake

For this example, we’ll assume that we’re creating this integration with an API endpoint in AWS.

Integrate a machine learning API with Snowflake 

First, we’ll create an API integration object between Snowflake and the cloud platform so that our external function can call the machine learning model’s API.

To do this, we use the command CREATE OR REPLACE API INTEGRATION in our account admin role:

CREATE OR REPLACE API INTEGRATION <api_integration_name>
  api_provider = <api_provider_name>
  api_aws_role_arn = '<IAM_role_ARN>'
  api_allowed_prefixes = ('<resource_url>')
  enabled = true
;

CREATE OR REPLACE API INTEGRATION leads_score_api_integration
  api_provider = aws_api_gateway
  api_aws_role_arn = '<IAM_role_ARN>'
  api_allowed_prefixes = ('https://…')
  enabled = true
;

The specific fields for this command will change slightly depending on the cloud platform that you’re using, but for our AWS example, the fields here are:

  • <api_integration_name> - The name we want to give to our API integration
  • api_provider - The object name of the API provider we’re using (there are default object names for each platform)
  • api_aws_role_arn - The Amazon Resource Name (ARN) for our Identity and Access Management (IAM) role
  • api_allowed_prefixes - The resource invocation URL

Authenticate Snowflake with your cloud platform

Now that the integration object between Snowflake and the API has been created, we need to set up authentication so that requests coming from Snowflake are trusted by the cloud platform.

You can do this in your IAM role within the cloud platform GUI. The details you need are returned by running the command DESCRIBE INTEGRATION:

DESCRIBE INTEGRATION leads_score_api_integration;

For AWS, the fields returned by DESCRIBE INTEGRATION that we will need to create a trusted relationship are:

  • API_AWS_IAM_USER_ARN
  • API_AWS_EXTERNAL_ID
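In AWS, these two values go into the trust policy of the IAM role referenced by api_aws_role_arn, so that Snowflake is allowed to assume that role. As a rough sketch (the exact policy depends on your AWS setup, and the placeholders below stand in for the values returned by DESCRIBE INTEGRATION), the trust relationship looks something like this:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": { "AWS": "<API_AWS_IAM_USER_ARN>" },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": { "sts:ExternalId": "<API_AWS_EXTERNAL_ID>" }
            }
        }
    ]
}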

Create an external function object in Snowflake

Now that the integration and trust relationship between our Snowflake instance and the API endpoint have been created, we can use the integration object to create the external function that will enrich our sales leads data.

To do so, we’ll run the command CREATE EXTERNAL FUNCTION with details of the incoming and outgoing data.

CREATE EXTERNAL FUNCTION <name>(<arg_name> <arg_data_type>)
    RETURNS <result_data_type>
    api_integration = <api_integration_name>
    AS '<url_of_proxy_and_resource>'
    ;

CREATE EXTERNAL FUNCTION leads_score(number_of_days_since_signup integer, number_of_visits integer, total_time_spent float)
    RETURNS variant
    api_integration = leads_score_api_integration
    AS '<url_of_proxy_and_resource>'
    ;

The fields in this command that we need to fill in are:

  • <name> - The name we want to give to our external function
  • <arg_name> and <arg_data_type> - The external function’s input arguments and corresponding data types
  • <result_data_type> - The data type that we can expect to receive from the model
  • <api_integration_name> - The name of the integration that we created earlier
  • <url_of_proxy_and_resource> - The invocation URL of the proxy service that Snowflake uses to call the model
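Before wiring the function into any views, it’s worth running a quick ad-hoc call to confirm that the integration works end to end. A minimal sanity check, using made-up argument values, might look like this:

SELECT leads_score(30, 5, 120.5);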

Use Snowflake external functions to enrich data in a view or table

Now, we can use the external function object that we created in the previous section to enrich data in our instance. To put the function to work, we can use it in a CREATE statement for a view or table, or in a simple ad-hoc SELECT statement.

The syntax for using the external function in a query is similar to calling a function in other programming languages. Let’s look at a simple select statement example where we want to score our sales leads.

SELECT lead, leads_score(number_of_days_since_signup, number_of_visits, total_time_spent) as lead_score FROM sales_leads;

Incorporating this function into a create statement will look very similar – we’ll simply create a new view with our newly scored sales leads.

CREATE OR REPLACE VIEW scored_sales_leads AS 
SELECT lead, leads_score(number_of_days_since_signup, number_of_visits, total_time_spent) as lead_score FROM sales_leads;
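From here, downstream queries can work directly with the score. For example, assuming the model returns categorical labels like the ones shown in the response format below, you could pull out only the high-value leads (the cast is needed because the function returns a VARIANT):

SELECT lead, lead_score
FROM scored_sales_leads
WHERE lead_score::string = 'High value';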

Parsing the JSON object from the remote service into Snowflake

While using our new external function may seem simple on the surface, Snowflake is doing a lot of work behind the scenes that we can’t see. For that work to succeed, the data returned by the remote service needs to be in the correct format.

When our leads_score external function is run, the remote service returns a JSON object that is parsed by Snowflake. For Snowflake to be able to interpret that data correctly behind the scenes, it should be in the following format:

{
    "statusCode": <http_status_code>,
    "body": {
        "data": [
            [ 0, "High value" ],
            [ 1, "Medium value" ],
            [ 2, "Low value" ],
            ...
        ]
    }
}

Let’s walk through each part of this data structure:

  • statusCode - The status of the request
  • body - The only element in the body is the data object
  • data - The data that is parsed by Snowflake. It should contain one row for each row sent by Snowflake, and each row should contain:
    • The row number as a 0-based index
    • The values returned by the function, which in our case is the ranked score of the sales lead
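For context, the request body that Snowflake sends to the remote service uses the same row-oriented structure: a data array in which each row starts with the 0-based row number, followed by the function’s arguments. For our three-argument leads_score function, a batch of two rows would look roughly like this (the argument values are purely illustrative):

{
    "data": [
        [ 0, 30, 5, 120.5 ],
        [ 1, 10, 2, 45.0 ]
    ]
}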

Advantages and disadvantages of using external functions

There’s no doubt that using external functions in your Snowflake instance will bring extra value. The key advantages of using them – which we’ve also seen in this tutorial – are:

  1. Data enrichment directly in Snowflake without the need for external data pipelines
  2. The code for the remote service can be written in whichever programming language you prefer
  3. The external code can use functions and libraries that are not accessible natively within Snowflake

While there are some tempting upsides, there are also a number of disadvantages that should be considered when using external functions, including:

  1. The time and opportunity cost required to create and maintain the integration with the API (as well as the API itself)
  2. Snowflake makes no guarantees about how often an external function is called or whether calls are duplicated
  3. The remote service has to be hosted on either AWS, Azure, or GCP
  4. Each cloud platform has specific requirements for its integrations

Send enriched Snowflake data to CRM tools with the help of Census

In this tutorial, we were able to enrich our sales data with an external machine learning model to better prioritize our sales leads. Although this is helpful, the real value comes from using this data in the tools that our sales professionals use to contact these leads. To do this, you need a tool that can easily and securely move this data. This is where Census comes in. 💪

Census is the leading reverse ETL tool available, allowing you to move data from Snowflake into all of your tools without needing to spend time setting up and maintaining these integrations. Instead, you can focus your time and resources on what matters: Getting value from and taking action on your data.

👉 Want to start syncing your data? Book a demo with one of our product specialists to learn how we work for your specific operational analytics needs.
