$ viu feature_img.png

Terraforming Snowflake ❄️

$ whoami

Lidia-Ana-Maria Baciu

$ date

2022-07-18

Terraforming Snowflake ❄️

It should go without saying that data is a critical asset for any organization. As a result, it is important that the platform handling all this data is able to do so with scalability and speed in mind.

Enter... 🥁🥁🥁 Snowflake!

Snowflake is a cloud platform for data x, where x = [warehousing, lakes, engineering, science, applications, sharing]. So, basically, everything that is data-related.

Terraform, on the other hand, is an infrastructure-as-code tool that allows you to define cloud resources in human-readable configuration files that you can version, reuse and share.  

What better combo could there be than Snowflake ++ Terraform?

In this article, we will demonstrate how we used Terraform to set up our Snowflake resources when working on Weather Nowcasting - deploying a model on edge. On top of that, we will present how to load the content of an Azure container into a Snowflake table.


The Terraform Registry hosts more than 1700 providers for managing all kinds of resources and services. Among them you can find several providers for Snowflake; the most widely used one is maintained by chanzuckerberg. We based our Terraform template on that provider and will describe, step by step, how you can create Snowflake resources with our template.

Snowflake account information

In order to connect Snowflake to the Terraform code, you first need to find out some information about your account.

An account identifier uniquely identifies a Snowflake account within your organization, as well as throughout the global network of Snowflake-supported cloud platforms and cloud regions. [snowflake.com]

The URL provides all the information needed: https://app.snowflake.com/<region-id>.<cloud-provider>/<account-locator>/worksheets

The account locator is an identifier assigned by Snowflake when the account is created. Its value can be seen either in the URL or by running the following command in a Worksheet: SELECT current_account() as YOUR_ACCOUNT_LOCATOR;

The Snowflake region id represents the cloud region where the account is located. Its value can be seen either in the URL or by running the following command in a Worksheet: SELECT current_region() as YOUR_SNOWFLAKE_REGION_ID;
The returned value must then be looked up (in this table) to select its corresponding cloud region id.

At this point, your information should be, for example:

SNOWFLAKE_ACCOUNT = xy12345
SNOWFLAKE_REGION = north-europe.azure


Connect Snowflake & Terraform

In order to establish a connection between Snowflake and Terraform, you also need to provide some credentials, namely a user and an authentication method.

Create RSA key for authentication

We will use an RSA key, but there are other ways of authenticating. You can check them out here.

The following commands will generate a public/private key pair:

cd ~/.ssh
openssl genrsa -out snowflake_key 4096
openssl rsa -in snowflake_key -pubout -out snowflake_key.pub

Now you should be able to view your public key, whose content should look similar to this:

-----BEGIN PUBLIC KEY-----
<PUBLIC_KEY>
-----END PUBLIC KEY-----
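One gotcha: the RSA_PUBLIC_KEY value Snowflake expects is only the base64 body of the key, without the PEM header/footer or line breaks. A small self-contained sketch of how to extract it (the temporary directory is just for illustration; in practice use the ~/.ssh paths from above):

```shell
# Generate a key pair as above (temporary directory for illustration only),
# then strip the PEM header/footer and newlines; the resulting one-line
# value is what goes into RSA_PUBLIC_KEY='<PUBLIC_KEY>'.
tmp=$(mktemp -d)
openssl genrsa -out "$tmp/snowflake_key" 4096 2>/dev/null
openssl rsa -in "$tmp/snowflake_key" -pubout -out "$tmp/snowflake_key.pub" 2>/dev/null
PUBLIC_KEY=$(grep -v "PUBLIC KEY" "$tmp/snowflake_key.pub" | tr -d '\n')
echo "$PUBLIC_KEY"
```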

Create service user for authentication

To create the new user that Terraform will use to connect to Snowflake, you can run the following commands in a Worksheet. Notice how we directly link the user to the newly created RSA key as its authentication method:

USE ROLE ACCOUNTADMIN;
CREATE OR REPLACE USER "tf-snow" RSA_PUBLIC_KEY='<PUBLIC_KEY>' DEFAULT_ROLE=PUBLIC MUST_CHANGE_PASSWORD=FALSE;
GRANT ROLE SYSADMIN TO USER "tf-snow";
GRANT ROLE SECURITYADMIN TO USER "tf-snow";

Up until this point, you should have the following information:

SNOWFLAKE_USER = tf-snow
SNOWFLAKE_PRIVATE_KEY_PATH = ~/.ssh/snowflake_key

Note that with key-pair authentication there is no password; Terraform authenticates with the private key matching the public key registered on the user.
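One way to hand these values to the provider is through environment variables, which the chanzuckerberg provider picks up at plan/apply time (variable names as documented by that provider; treat this as a sketch, assuming key-pair authentication as set up above):

```shell
# Connection details for the Snowflake Terraform provider, using the
# example account information and the service user created above.
export SNOWFLAKE_ACCOUNT=xy12345
export SNOWFLAKE_REGION=north-europe.azure
export SNOWFLAKE_USER=tf-snow
export SNOWFLAKE_PRIVATE_KEY_PATH=~/.ssh/snowflake_key
echo "Connecting as $SNOWFLAKE_USER to $SNOWFLAKE_ACCOUNT.$SNOWFLAKE_REGION"
```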


Project structure

Figure 1. Structure of the Terraform folder

The root folder contains the main.tf, variables.tf and .terraform.lock.hcl files. The latter represents the dependency lock file, and you can read more about it here.

We have chosen to create only two environments, prod and staging. Thus, you will notice that these are the names of the tfvars files we present below.
Moreover, in the code you will often see var.env or local.env_upper, the latter being the upper-case version of var.env.

In the backend folder we have prod.tfvars and staging.tfvars. Each file contains a key value, which is the name of the file that will hold the state of the resources at any time; e.g. prod.tfvars has the content key = "prod.tfstate"

Figure 2. Backend files

The vars folder also contains a prod.tfvars and a staging.tfvars, but these hold references to environment variables. For example, prod.tfvars is set to:

env             = "prod"
subscription_id = env.ARM_SUBSCRIPTION_ID
tenant_id       = env.ARM_TENANT_ID

You can notice two more folders, namely azure and snowflake. These are the modules that will later be described in more detail. You can read more about Terraform modules here.


Set up the Snowflake provider

In order to define the Snowflake provider, you should include the following blocks of code in your main.tf file:

terraform {
  required_version = ">= 1.1"
  required_providers {
    snowflake = {
      source  = "chanzuckerberg/snowflake"
      version = "0.31.0"
    }
  }
}

provider "snowflake" {
  alias = "terraform"
  role  = "ACCOUNTADMIN"
}

module "snowflake" {
  source = "./snowflake"
  env    = var.env
}

Now we have to create a main.tf file in the snowflake module, as this is the source defined in the previous step. Its content should contain the following:

terraform {
  required_providers {
    snowflake = {
      source  = "chanzuckerberg/snowflake"
      version = "0.31.0"
    }
  }
}

provider "snowflake" {
  alias = "terraform"
  role  = "ACCOUNTADMIN"
}

Next, we will continue to describe all the necessary steps to set up the resources properly. You can find all the code in our Github repository.


Initialize Snowflake

There are three mandatory resources that we need to create: a database, a schema and a warehouse.

resource "snowflake_database" "main" {
  provider = snowflake.terraform
  name     = "WEATHER_${local.env_upper}"
}

resource "snowflake_warehouse" "weather" {
  provider       = snowflake.terraform
  name           = "WEATHER_WHS_${local.env_upper}"
  warehouse_size = "xsmall"
  auto_suspend   = 60
}

resource "snowflake_schema" "datawarehouse" {
  provider   = snowflake.terraform
  database   = snowflake_database.main.name
  name       = "DATAWAREHOUSE_${local.env_upper}"
  is_managed = false
}

Our scenario included retrieving the content of a JSON file in real time from an Azure container. Thus, we only needed a table with a single column of type VARIANT.

resource "snowflake_table" "sensor" {
  provider = snowflake.terraform
  database = snowflake_database.main.name
  schema   = snowflake_schema.datawarehouse.name
  name     = "WEATHER_JSON_${local.env_upper}"
  column {
    name    = "var"
    type    = "VARIANT"
    comment = "Raw sensor data"
  }
}

In order to be able to customize our JSON input, we have chosen to create a custom file format:

resource "snowflake_file_format" "json" {
  provider             = snowflake.terraform
  name                 = "JSON_DQ_${local.env_upper}"
  database             = snowflake_database.main.name
  schema               = snowflake_schema.datawarehouse.name
  format_type          = "JSON"
  strip_outer_array    = true
  compression          = "NONE"
  binary_format        = "HEX"
  date_format          = "AUTO"
  time_format          = "AUTO"
  timestamp_format     = "AUTO"
  skip_byte_order_mark = true
}

Create storage integration

A storage integration is a Snowflake object that stores a generated identity and access management (IAM) entity for your external cloud storage, along with an optional set of allowed or blocked storage locations. [snowflake.com]

We have now reached the part where the chosen cloud provider becomes important; in our case that is Azure. For this step, you should already have created a storage account (Standard general-purpose v2) and a container in your Azure account. You can now link Snowflake to the container like below:

resource "snowflake_storage_integration" "azure_integration" {
  provider                  = snowflake.terraform
  type                      = "EXTERNAL_STAGE"
  storage_provider          = "AZURE"
  azure_tenant_id           = data.azurerm_client_config.current.tenant_id
  name                      = "AZURE_INTEGRATION_${local.env_upper}"
  comment                   = "A storage integration for Azure blobs."
  enabled                   = true
  storage_allowed_locations = ["azure://<storage-account-name>.blob.core.windows.net/<container-name>"]
}

Create notification integration

A notification integration is a Snowflake object that provides an interface between Snowflake and third-party cloud message queuing services. [snowflake.com]

On your Azure account, you should have already created a queue inside your storage account, which you can then also link to Snowflake.

resource "snowflake_notification_integration" "azure_queue" {
  provider                        = snowflake.terraform
  name                            = "WEATHER_NOTIFICATION_${local.env_upper}"
  enabled                         = true
  type                            = "QUEUE"
  notification_provider           = "AZURE_STORAGE_QUEUE"
  azure_storage_queue_primary_uri = "https://<storage-account-name>.queue.core.windows.net/<queue-name>"
  azure_tenant_id                 = data.azurerm_client_config.current.tenant_id
}

Create a stage

A stage is used for loading data from files into Snowflake tables and unloading data from tables into files. [snowflake.com]

For this step, it is important to note that you need to generate a SAS token for your container.

resource "snowflake_stage" "snowstage" {
  provider    = snowflake.terraform
  name        = "SNOWPIPE_STAGE_${local.env_upper}"
  url         = "azure://<storage-account-name>.blob.core.windows.net/<container-name>"
  database    = snowflake_database.main.name
  schema      = snowflake_schema.datawarehouse.name
  credentials = "AZURE_SAS_TOKEN='${var.sas_token}'"
  file_format = "FORMAT_NAME = ${local.qualified_file_format_name}"
}

Create a pipe

A pipe is created for defining the COPY INTO <table> statement used by Snowpipe to load data from an ingestion queue into tables. [snowflake.com]

Now that we have successfully linked the Azure blob storage to Snowflake, we can create the pipe that will load the data from the blob storage into our Snowflake table.

resource "snowflake_pipe" "snowpipe" {
  provider = snowflake.terraform
  database = snowflake_database.main.name
  schema   = snowflake_schema.datawarehouse.name
  name     = "WEATHER_PIPE_DQ_${local.env_upper}"

  copy_statement = "copy into ${local.qualified_sensor_table_name} from @${local.qualified_stage_name} file_format = (format_name = ${local.qualified_file_format_name})"
  auto_ingest    = true
  integration    = snowflake_notification_integration.azure_queue.name
}
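To verify that the pipe is up and listening, you can query its status from a Worksheet. A sketch, assuming the prod environment and the fully qualified names produced by the resources above:

```sql
-- Returns a JSON document describing the pipe: executionState,
-- pendingFileCount, the notification channel, and so on.
SELECT SYSTEM$PIPE_STATUS('WEATHER_PROD.DATAWAREHOUSE_PROD.WEATHER_PIPE_DQ_PROD');
```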

Manual steps

At this point, all your resources are created and everything should be set up. But if you try to test the data flow, you will notice an annoying error similar to:

COULD NOT ACCESS QUEUE; THIS REQUEST IS NOT AUTHORIZED TO PERFORM THIS OPERATION USING THIS PERMISSION. (STATUS CODE:403; ERROR CODE:AUTHORIZATIONPERMISSIONMISMATCH)

Now it's time to move again to your Snowflake worksheets and to grant access to your storage container. In order to do that, you have to follow two steps.

The first step consists of finding the information for your storage integration. This can be done with the command: DESC STORAGE INTEGRATION <NAME OF YOUR STORAGE INTEGRATION>;
You will then see a table with all the properties of your storage integration. The important ones are:

  • AZURE_CONSENT_URL, which you should visit and grant permission
  • AZURE_MULTI_TENANT_APP_NAME, which is of the form SnowflakePACInt1111_1234567899876 - you should only take into account the first part, SnowflakePACInt1111
    In your Azure storage account, you should then assign the roles Storage Queue Data Contributor and Storage Blob Data Contributor to SnowflakePACInt1111

The second step is to perform the same steps, but for the notification integration: DESC NOTIFICATION INTEGRATION <NAME OF YOUR NOTIFICATION INTEGRATION>;


Applying changes

As you may already know, to apply the changes made in your Terraform code to your Snowflake resources, you need to run a few commands. We will briefly present what needs to be done at this step.

Terraform initialize

used to initialize a working directory containing Terraform configuration files. This is the first command that should be run after writing a new Terraform configuration or cloning an existing one from version control. It is safe to run this command multiple times. [terraform.com]

e.g. terraform init -backend-config backend/staging.tfvars

Terraform format

used to rewrite Terraform configuration files to a canonical format and style. This command applies a subset of the Terraform language style conventions, along with other minor adjustments for readability. [terraform.com]

e.g. terraform fmt -check -recursive

Terraform validate

validates the configuration files in a directory, referring only to the configuration and not accessing any remote services such as remote state, provider APIs, etc. [terraform.com]

e.g. terraform validate

Terraform plan

creates an execution plan, which lets you preview the changes that Terraform plans to make to your infrastructure. [terraform.com]

e.g. terraform plan -var-file vars/staging.tfvars

Terraform apply

executes the actions proposed in a Terraform plan. [terraform.com]

e.g. terraform apply -var-file vars/staging.tfvars


Conclusion

Now everything should be set up and as soon as you upload a file into your container, its content should be copied into the table. For example, let's consider a JSON file with the following content:

{
  "Body": {
    "ts": "2022-04-08T15:07:29.130826",
    "tz": "CEST",
    "message": {
      "temperature": 29.17,
      "humidity": 22.5,
      "pressure": 994.53,
      "light": 159.26,
      "uv": 0.04,
      "ir": 278
    }
  }
}

We will be able to see this row in our table, as can be observed in Figure 3.

Figure 3. Table entry from JSON file
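The nested fields can then be pulled out of the VARIANT column with Snowflake's path syntax. A sketch, again assuming the prod environment and the table, schema and database names created above:

```sql
-- Extract typed fields from the raw VARIANT column "var".
SELECT
  var:Body.ts::timestamp              AS measured_at,
  var:Body.message.temperature::float AS temperature,
  var:Body.message.humidity::float    AS humidity
FROM WEATHER_PROD.DATAWAREHOUSE_PROD.WEATHER_JSON_PROD;
```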

In this blog post we showed how to ingest data in real time from Azure storage containers into Snowflake, while creating the resources using Terraform.

Using terraform plan, we verify that the changes to be applied to the resources are indeed what we intend. Then terraform apply actually applies those changes!

By managing our Snowflake account through Terraform, we automated this error-prone process: there is no longer any need to manually update the same resource in every environment. On top of that, using CI/CD simplified resource management even further.

Indeed, the Snowflake provider for Terraform is documented, but you will find that not all resources are kept up to date and that some docs contain the occasional copy-paste bug. Don't take everything for granted, and ask yourself whether the properties make sense!

Finally, we challenge you to recreate this project and let us know how it went!


Got an idea?

Let's talk!

sign up to our weekly AI & data digest ❤️