
Redshift UNLOAD command: SQL, SDK, and CLI examples | Census

Volodymyr Grin
August 18, 2022

Volodymyr is a talented engineer with a knack for solving challenging problems with simple, yet elegant solutions. He takes pride in his work and enjoys mentoring junior engineers. He specializes in writing technical tutorials in his spare time.

If you’ve been around the Amazon Redshift block a time or two, you’re probably familiar with Redshift’s COPY command. Well, allow us to introduce you to its partner in crime: the UNLOAD command. 🦹‍♂️

Redshift’s UNLOAD command allows Redshift users to export data from a SQL query run in the data warehouse into an Amazon S3 bucket – essentially doing the reverse of the COPY command. ◀️

In this article, we’ve captured SQL, CLI, and SDK examples so you’re totally aware of what data unloading options are available to you (and what you need to pay extra-close attention to). Welcome to your complete Redshift data unloading guide. 👐

Why would you want to unload data from Redshift?

Before running your first UNLOAD command, consider your goal. Why do you want to unload your data from Redshift? Without a clear goal in mind, you’re susceptible to inefficiency traps that might delay your data operations. 

Maybe you want to get your data into other business tools

Do you want to move your data from Redshift into other business apps (like Salesforce or HubSpot)? Within your organization, different business teams have different needs and expertise, so some teams might work in Excel while others live in a CRM. 

Regardless of the different types of software used across your org, data is at the heart of all your business operations. ❤️ It plays a critical role across all your teams – empowering everyone to do their best work from a single source of truth.

Because most software that utilizes your data has a feature to import specific file types, you can use the UNLOAD query to export data out of Redshift in CSV or JSON (or other formats as needed). And, if you give your business users access to the specified buckets, you can cut yourself out as the middle man and give them self-serve access to the data they need, when they need it. 💡
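
For example, a minimal export for a hypothetical customers table might look like the sketch below (the bucket and IAM role are placeholders, and the JSON output assumes your cluster supports the FORMAT AS JSON option):

-- Export three columns as JSON files for downstream import
UNLOAD ('SELECT id, name, email FROM customers')
TO 's3://your-bucket/customers_json/'
IAM_ROLE 'arn:aws:iam::1234567890123:role/MyRedshiftRole'
FORMAT AS JSON;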

Or you want to migrate it into another data storage service

Like many companies, you probably use multiple data storage services, like a relational database, a data warehouse, and a data lake. When you run the UNLOAD query, you can generate data files and use them to move data across different storage destinations – even in different cloud environments. ☁️

I mean, who doesn’t love flexible data? Better yet, you can use this unloading feature as part of your ETL pipeline to let other AWS services consume the exported data.

Either way, you should back up your data

As a best practice, before you run a manipulation query against a Redshift table, you should always create a backup to prevent data loss (just in case). Needless to say, there are a number of ways to back up data in Redshift, but unloading data into files can be a secure and simple backup option. 🔒

When you unload data into an S3 bucket, you store that data in an isolated external location where it can then be used as a table snapshot.
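
As a sketch (the bucket path, table, and role below are placeholders), a dated backup and its matching restore could look like this:

-- Back up the table as gzipped CSV files, plus a manifest listing every file
UNLOAD ('SELECT * FROM students')
TO 's3://your-backup-bucket/students/2022-08-18/'
IAM_ROLE 'arn:aws:iam::1234567890123:role/MyRedshiftRole'
CSV
GZIP
MANIFEST
ALLOWOVERWRITE;

-- Restore later by pointing COPY at the manifest file
COPY students
FROM 's3://your-backup-bucket/students/2022-08-18/manifest'
IAM_ROLE 'arn:aws:iam::1234567890123:role/MyRedshiftRole'
CSV
GZIP
MANIFEST;

The manifest keeps the restore honest: COPY reads exactly the files that UNLOAD wrote, nothing more.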

What are some Redshift-specific methods to unload your data?

Redshift users can unload data in two main ways: 

  1. Using the SQL UNLOAD command
  2. Downloading the query result from a client

UNLOAD SQL

The most convenient way of unloading data from Redshift is by using the UNLOAD command in a SQL IDE. You can open Redshift’s native editor by going to https://console.aws.amazon.com/redshift/.

This method is particularly effective when you want to export really large data sets because you can define various parameters in the syntax to control the export operation. The sample code below shows a simple example.


UNLOAD ('select * from students')
TO 's3://your-bucket/students/' 
IAM_ROLE 'arn:aws:iam::1234567890123:role/MyRedshiftRole'
CSV;

The first parameter of the UNLOAD command is the SELECT query that you’ll use to define what data you’re exporting. Psstt… 👂 Don’t forget your parentheses and quotes! 

You can add WHERE conditions or JOIN with other tables just like your normal SQL queries, but the TO parameter is critical; this is where you define the target S3 bucket (and prefix) that the query results will be unloaded into. 

As a prerequisite, the IAM role you reference will need write permission on the target S3 bucket. For a CSV export, you can simply add the CSV option.

But your customization preferences don’t end there! Redshift provides many more options to manage your data export, as shown in the syntax reference below, followed by an example that combines a few of them.


UNLOAD ('SELECT * FROM ...')
TO 's3://bucket/prefix'
authorization
[ option [ ... ] ]
where option is
{ [ FORMAT [ AS ] ] CSV | PARQUET | JSON
| PARTITION BY ( column_name [, ... ] ) [ INCLUDE ]
| MANIFEST [ VERBOSE ] 
| HEADER           
| DELIMITER [ AS ] 'delimiter-char' 
| FIXEDWIDTH [ AS ] 'fixedwidth-spec'   
| ENCRYPTED [ AUTO ]
| BZIP2  
| GZIP 
| ZSTD
| ADDQUOTES 
| NULL [ AS ] 'null-string'
| ESCAPE
| ALLOWOVERWRITE
| CLEANPATH
| PARALLEL [ { ON | TRUE } | { OFF | FALSE } ]
| MAXFILESIZE [AS] max-size [ MB | GB ] 
| REGION [AS] 'aws-region' }
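
To make that reference a little more concrete, here's one hedged example (hypothetical events table, placeholder bucket and role) that combines a couple of those options: columnar Parquet output partitioned by date.

-- Write Parquet files under one S3 prefix per event_date value
UNLOAD ('SELECT user_id, event_type, event_date FROM events')
TO 's3://your-bucket/events/'
IAM_ROLE 'arn:aws:iam::1234567890123:role/MyRedshiftRole'
FORMAT AS PARQUET
PARTITION BY (event_date);

Partitioned Parquet output like this is handy when other services (Redshift Spectrum, Athena, or Spark jobs) will read the files later.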

AWS SDK

AWS provides SDKs for most major programming languages to allow users to programmatically interact with AWS resources. Using an SDK, you can also run SQL statements against Redshift, including UNLOAD, COPY, and other DML and DQL queries. 

For example, you can execute a SELECT query, get the result as a DataFrame, transform it, and then export it to an S3 bucket or to your local storage. To give you an idea of the syntax, we'll use examples from the Python SDK, boto3. The SDKs for other languages work similarly.

To initialize the library, import it and declare a client.


import boto3
client = boto3.client('redshift-data')

Then, you can send a query request to your Redshift cluster, like so.


response = client.execute_statement(
    ClusterIdentifier='string',
    Database='string',
    DbUser='string',
    Parameters=[
        {
            'name': 'string',
            'value': 'string'
        },
    ],
    SecretArn='string',
    Sql='string',
    StatementName='string',
    WithEvent=True|False
)

Put your SQL query inside the Sql field. Once the statement is submitted, the call returns a response containing a UUID that identifies your query. 


{
    'ClusterIdentifier': 'string',
    'CreatedAt': datetime(2022, 7, 1),
    'Database': 'string',
    'DbUser': 'string',
    'Id': 'string', # UUID of your requested query
    'SecretArn': 'string'
}

From the response, get the Id value and call get_statement_result to retrieve your query result once the statement has finished running.


response = client.get_statement_result(
    Id='string',
    NextToken='string'
)

If your data set is big, you will have to paginate using the NextToken value. For more details about the response, check the boto3 RedshiftDataAPIService documentation.

AWS CLI

In addition to the SDK, you can use the AWS CLI to interact with Redshift, but before using it, you’ll need to download and install the command line tool (see the AWS CLI installation guide). The Redshift Data API command will look like the following.


aws redshift-data execute-statement \
    --region us-west-2 \
    --secret-arn arn:aws:secretsmanager:us-west-2:123456789012:secret:my-secret \
    --cluster-identifier mycluster-dev \
    --sql "select * from table_name limit 100" \
    --database dev

Notice that the command structure is similar to that of the SDK. Write your query in the --sql option; when the command executes, you'll receive an Id as part of the response. Using that value, you can retrieve the query result.


aws redshift-data get-statement-result --id f3r3fdfsdf-sdf3-sdf3-113f-dsfjkwe8wr23

SQL client

If your data set is not particularly large (or if you just want an ad hoc way to export your data), downloading it directly from your SQL client is the simplest option. 

Exporting results as CSV is the most common feature in SQL client software, but other formats are often available too. After you execute your query, the result will show up in your client with an option to download it to your local storage. Since each client has a different menu, the best way to find the option is by referring to its user guide. 📖

Redshift UNLOAD parameters and options

Whether you use the AWS SDK, the CLI, or your SQL client, you’ll need to know what options you have at your disposal to export the data exactly as you want it. Check out these UNLOAD options so you can see everything you can achieve with the UNLOAD feature (we'll pull a few of them together in an example after the list). The more you know, right?

  • File types: You can choose either CSV, JSON, or PARQUET. If you plan to unload your data in order to import it into another tool, you’ll need to know what file type is compatible with the recipient app. If the target tool has an import feature, it’s likely to support CSV format.  
  • PARTITION BY ( column_name [, ... ] ) [INCLUDE]: If you want to split your exported data by specific columns (like a date field), add this option with the column name. When you add the INCLUDE option, that column is also written inside the data files themselves, not just in the S3 prefix.
  • BZIP2, GZIP, or ZSTD: By adding one of these options, your data will be exported as compressed files of the corresponding type, which can make a big difference in file size.
  • ADDQUOTES: This option protects against mis-delimited rows caused by delimiter characters inside column values. For example, if a single column’s value is “apple, orange, banana”, that value could be read as three columns because of the commas. Wrapping values in double quotes prevents this kind of malformatting.
  • PARALLEL: By default, when you execute the UNLOAD query, it generates multiple files in parallel. To write the output serially to a single file (up to the maximum file size), add PARALLEL FALSE or PARALLEL OFF.
  • DELIMITER AS 'delimiter_character': By default, the delimiter is a comma (,), but if your data contains commas and you want to avoid a malformed CSV file, consider changing it to a pipe (|) or a tab (\t) with this option. 
  • Max file size: You can specify the maximum size of the files that are unloaded into your S3 bucket (between 5 MB and 6.2 GB). By default, the maximum size is 6.2 GB. 
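
Pulling a few of these together, a pipe-delimited, quoted, compressed export written serially might look like the following sketch (the table, bucket, and role names are placeholders):

-- Pipe-delimited, quoted, gzipped output written serially
-- (files still split if they exceed MAXFILESIZE)
UNLOAD ('SELECT * FROM orders WHERE order_date >= ''2022-01-01''')
TO 's3://your-bucket/orders_export/'
IAM_ROLE 'arn:aws:iam::1234567890123:role/MyRedshiftRole'
DELIMITER AS '|'
ADDQUOTES
HEADER
GZIP
PARALLEL OFF
MAXFILESIZE 100 MB;

If you ever load these files back in with COPY, the matching REMOVEQUOTES and IGNOREHEADER 1 options will come in handy.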

Is there a better way?

AWS offers three major ways to unload data from Redshift, plus you can use your SQL client to quickly download query results, but the UNLOAD command is just a vehicle. Is unloading your data to S3 really the best way to achieve your goal? 🤔

If you're pulling your data out of Redshift and into S3 just to move it to another system, consider Census rETL instead. Census has over 85 destination integrations for your Redshift data – from Algolia to Zendesk – helping you get your data where you need it (without the incurred frustration from the added step). With sync speeds over 4,000 records per second, detailed observability, and 50 fields for free, it’s easy to get started and grow to any scale.

Book a demo with a product specialist to see how Census can help you operationalize your data. 💪
