
How to unload data from Snowflake

Michel Zurkirchen
October 25, 2022

Michel Zurkirchen is an Amsterdam-based digital analyst and a regular contributor to the Census blog. After starting his career in digital marketing, Michel started learning Python and has been an enthusiastic learner and practitioner of data analytics ever since.

If you're one of the new kids on the Snowflake block, you might be wondering how you can unload your data from Snowflake so that you can put it to use in other places. Maybe you just moved here from the Redshift block and are wondering why Snowflake doesn't have an UNLOAD command – or you're just in a general state of confusion. 😶‍🌫️

Either way, we'll help you get started on your Snowflake data unloading journey by showing you how to get the job done. While it might be tempting to jump in and get started on your own, trust us: Spending a few minutes to read this article can save you time and money in the long run.

(We'll even throw in a few things you should consider before you get started).

Why unload data from Snowflake?

The first step in this process is always the same: Ask yourself why you're unloading data to begin with. There are different ways to get the job done, but which way is the best depends on your situation. 👇

Are you using the data in a different business tool?

Gathering data is great, but at some point, you'll want to do something with it. You can build a dashboard in Tableau, merge it with other data in your CRM, or do a number of other useful things with it. 

But you run the risk of wasting a lot of your time if you automate an export process, only to find out later that a single manual export would have sufficed. Conversely, you don't want to perform the same manual export every day – unless you consider it a meditative experience. So, evaluate whether you’ll need this data in your tools more than once before deciding whether you’ll manually or automatically unload your data.

Are you moving data to a different data service?

Your company probably uses multiple data storage services, including (at a minimum) a data lake and a data warehouse. At some point, then, you'll want to easily unload data from Snowflake to another cloud environment, such as Amazon S3 or Google Cloud Storage. Great news: You can. 👍

Back up your data

Regardless of how you're using your data, make sure the raw data is stored safely. If you're manipulating your data in any way, back it up – or it could be gone for good. Depending on the scale of what you're about to do, a suitable backup could be as simple as a different table in the same database. However, if you have a more complex case, it might mean copying your entire database to a different platform. 
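
If your case is simple, a zero-copy clone inside Snowflake is a quick way to keep an untouched copy of the raw table before you start manipulating it. A minimal sketch (the table names here are placeholders):

-- Keep an untouched copy of the raw table before transforming it
CREATE TABLE customer_backup CLONE customer;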

How to unload data from Snowflake

Now that we’ve gone through some different reasons why you might be unloading data from Snowflake and how that changes your approach, it’s time to get to the good stuff: The how. 

The COPY INTO command enables you to copy an entire table or a query result to a Snowflake stage, from where you can download the data to your local system. Alternatively, you can immediately unload the data to Amazon S3, Google Cloud Storage, or Microsoft Azure. The syntax for the command is as follows.

COPY INTO { internalStage | externalStage | externalLocation }
     FROM { [<namespace>.]<table_name> | ( <query> ) }
[ STORAGE_INTEGRATION = <integration_name> ]
[ PARTITION BY <expr> ]
[ FILE_FORMAT = ( { FORMAT_NAME = '[<namespace>.]<file_format_name>' |
                    TYPE = { CSV | JSON | PARQUET } [ formatTypeOptions ] } ) ]
[ copyOptions ]
[ VALIDATION_MODE = RETURN_ROWS ]
[ HEADER ]

Let's break this down line by line. COPY INTO specifies the destination to unload data to, whether that’s external or internal.

-- Internal or external stage
COPY INTO @your-stage/your-data.csv

-- Google Cloud Bucket
COPY INTO 'gcs://your-bucket/your-folder/'

-- Amazon S3
COPY INTO 's3://your-bucket/your-folder/'

-- Microsoft Azure
COPY INTO 'azure://your-account.blob.core.windows.net/your-bucket/your-folder/'

FROM specifies the source, and can be either a query or a table.

-- Copying an entire table
FROM SNOWFLAKE_SAMPLE_DATA.TPCDS_SF100TCL.CUSTOMER

-- Copying using a query
FROM (SELECT C_CUSTOMER_ID, C_FIRST_NAME, C_LAST_NAME, C_EMAIL_ADDRESS FROM SNOWFLAKE_SAMPLE_DATA.TPCDS_SF100TCL.CUSTOMER)

STORAGE_INTEGRATION is only needed when unloading to an external storage location. The integration grants Snowflake the authorization it needs to write data to that location. Setting it up is a separate process that you need to complete before unloading your data.

STORAGE_INTEGRATION = YOUR_INTEGRATION
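
If you haven't created an integration yet, the setup looks roughly like this for Google Cloud Storage (a sketch; the integration name and bucket path are placeholders, and your cloud administrator still has to grant the service account that Snowflake generates access to the bucket):

-- One-time setup: allow Snowflake to write to a GCS location
CREATE STORAGE INTEGRATION YOUR_INTEGRATION
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'GCS'
  ENABLED = TRUE
  STORAGE_ALLOWED_LOCATIONS = ('gcs://your-bucket/your-folder/');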

PARTITION BY splits the data into separate files based on a string expression. In this example, we'll split our data based on the first letter of each customer's last name. Make sure you don't PARTITION BY any columns containing sensitive data, or you might inadvertently cause a data leak. 😱

PARTITION BY LEFT(C_LAST_NAME, 1)

We'll set our file type to CSV for this example (there are a few other options, which we'll come back to in the next section) and compress the output with GZIP. If you plan on unloading data repeatedly, you'll want to keep these settings consistent. In that case, consider creating a named file format, which lets you define the file type, compression, and other parameters up front.

-- Setting a file type and compression
FILE_FORMAT = (TYPE = 'CSV' COMPRESSION = 'GZIP')

-- Applying a named file format
FILE_FORMAT = (FORMAT_NAME = YOUR_NAMED_FORMAT)
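
Creating a named file format is a one-time step along these lines (a sketch; the name and options are placeholders you'd adapt to your data):

-- Define file type, compression, and other parameters once, then reuse
CREATE FILE FORMAT YOUR_NAMED_FORMAT
  TYPE = 'CSV'
  COMPRESSION = 'GZIP'
  FIELD_DELIMITER = ','
  FIELD_OPTIONALLY_ENCLOSED_BY = '"';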

The copyOptions are optional parameters you can include to further tweak the result of COPY INTO. We'll stick to using only one of them, which will output the path, name, size and number of rows of each file that you unloaded, so you can verify that everything went correctly.

DETAILED_OUTPUT = TRUE
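
A few other copy options can also come in handy when unloading; the values below are purely illustrative:

-- Cap each output file at roughly 100 MB (the value is in bytes)
MAX_FILE_SIZE = 104857600

-- Replace files with matching names in the destination folder
OVERWRITE = TRUE

-- Produce a single output file instead of several parallel ones
SINGLE = TRUE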

VALIDATION_MODE only accepts one possible value, which is RETURN_ROWS. Including VALIDATION_MODE = RETURN_ROWS tells Snowflake to return the rows instead of writing to your destination, so you can determine whether you have the correct data. This is definitely a good precaution to take so that you don't spend time and budget writing the wrong data. Once you're certain that you're all good to go, remove VALIDATION_MODE to write the data to your destination.
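
As an illustration, a dry run of the example we're building up looks like this (same placeholders as before); as long as VALIDATION_MODE is present, nothing is written to the bucket:

-- Dry run: returns the query result instead of writing files
COPY INTO 'gcs://your-bucket/your-folder/'
FROM (SELECT C_CUSTOMER_ID, C_FIRST_NAME, C_LAST_NAME, C_EMAIL_ADDRESS FROM SNOWFLAKE_SAMPLE_DATA.TPCDS_SF100TCL.CUSTOMER)
STORAGE_INTEGRATION = YOUR_INTEGRATION
FILE_FORMAT = (TYPE = 'CSV' COMPRESSION = 'GZIP')
VALIDATION_MODE = RETURN_ROWS;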

With HEADER you can specify whether you want the data to include… 🥁🥁🥁

Headers! Shocking. It can be set to TRUE or FALSE, with FALSE being the default.

HEADER = TRUE

If we combine all of the above, we get the query below. Barring any errors, such as your storage integration not being set up correctly, executing this query will unload your Snowflake data to your Google Cloud Storage bucket. Note that we left out VALIDATION_MODE = RETURN_ROWS, so this query writes the data straight away. Always confirm at the destination that the unload succeeded.

COPY INTO 'gcs://your-bucket/your-folder/'
FROM (SELECT C_CUSTOMER_ID, C_FIRST_NAME, C_LAST_NAME, C_EMAIL_ADDRESS FROM SNOWFLAKE_SAMPLE_DATA.TPCDS_SF100TCL.CUSTOMER)
STORAGE_INTEGRATION = YOUR_INTEGRATION
PARTITION BY LEFT(C_LAST_NAME, 1)
FILE_FORMAT = (TYPE = 'CSV' COMPRESSION = 'GZIP')
DETAILED_OUTPUT = TRUE
HEADER = TRUE;

If you unloaded to an internal stage rather than an external location, you could download your data to your local machine with the GET command, like so.

-- Linux/macOS
GET @%your-table file:///your-local-folder/;

-- Windows
GET @%your-table file://C:\your-local-folder\;
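
If you want to double-check which files the unload produced before downloading them, LIST shows the contents of a stage (an optional step, using the same placeholder stage as the GET examples above):

-- Show the files currently sitting in the stage
LIST @%your-table;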

Which file type to unload to

Before actually unloading data, consider which file type is the best option. Each file type comes with its own pros and cons, but a good starting point is checking which file types can handle your data. For example, JSON can only be used for VARIANT columns, whereas you can’t use Parquet when you have nested data in VARIANT columns. 

You also need to determine which file types your destination can handle. Storing your data as a Parquet file isn't a problem for major data platforms such as Google Cloud Storage, but you won't be making any friends with your colleague who wants to do a quick analysis in Excel. 😬

Also, think of how much data you’re unloading and ingesting. It doesn't really matter what you do if your data is only a few MBs, but if you're unloading terabytes of data, you’ll want to reduce your file size by choosing Parquet when possible – plus it will greatly cut down on costs. 
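
If Parquet does fit your destination, switching the earlier example over is just a matter of changing the file format clause (a sketch; Snowflake compresses Parquet output with Snappy by default, so no COMPRESSION option is needed):

-- Swap the CSV file format for Parquet in the COPY INTO statement
FILE_FORMAT = (TYPE = 'PARQUET')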

Where to go from here

Hopefully, you ended up a little wiser than before you started, and you now feel secure in handling whichever Snowflake unloading task comes your way. But if all the manual or automatic unloading setup sounds like a bear, consider Census reverse ETL instead. 

Census has over 85 destination integrations for your Snowflake data, helping you get your data where you need it (without the incurred frustration from the added step). With sync speeds over 4,000 records per second, detailed observability, and 50 fields for free, it’s easy to get started and grow to any scale. 🚀

Book a demo with a product specialist to see how Census can help you operationalize your data. 💪
