Snowflake: A Data Engineering Tool
Snowflake began as the best cloud storage tool. It set out to simply store data, while also providing the infinite scalability of the cloud. Right out of the gate it offered incredible flexibility with its decoupled storage and compute. You could model your data in a Kimball dimensional model, a Data Vault, or even a loosely structured data lake all in the same instance. It also handled table clustering, auto scaling, security, caching, and many other features most data engineers and analysts don’t want to worry about. However, it fell short when it came to data extraction and transformation. Other platforms like Redshift, BigQuery, and Synapse integrated so well with their respective cloud data stacks, making data engineering and processing as simple as it gets. Snowflake, on the other hand, has three options for loading and transforming data.
Option 1: Snowpipe and Stages
Snowflake has a proprietary continuous data loading solution called Snowpipe. It uses named Snowflake objects called ‘pipes’, which contain a COPY statement to take data from a stage and load it into a table. A stage is a location where data files are stored in the cloud. These can be internal (within Snowflake itself) or external (on a cloud platform like AWS S3 or Azure Data Lake Storage). Snowpipe can be called by a public REST API endpoint or by using a cloud event messaging service like AWS SNS.
- Log files are uploaded to an AWS S3 bucket
- AWS SNS or S3 Event Notifications notify Snowpipe new data is available
- Snowpipe grabs the new data and loads it into the target table
There are numerous benefits to this approach:
- It’s serverless, so you don’t need to configure or monitor resources.
- It’s continuous, so it’s a good approach for constantly flowing data or small incremental loads.
- You’re only billed for the resources Snowflake uses to move the data, as opposed to the large billing model for virtual warehouses.
- If you already have a cloud data lake, this is a simple way to migrate data into Snowflake using native technology.
However, Snowpipe may not be appropriate for the following reasons:
- Snowpipe alone doesn’t solve Change-Data-Capture (CDC), which requires the use of streams
- Snowpipe requires additional cloud resources:
- Cloud data store
- Event messaging service
- Snowflake file format and copy options
- Integration between your Snowflake instance and your cloud instance is required.
- File size, queuing, etc.
Option 2: Third-Party ETL Tools
From the beginning, Snowflake created a robust partner network and integrations with some of the most popular ETL tools in the market. The ETL tools also create native integrations with Snowflake and have simplified the process of extracting, transforming, and loading data from various sources into Snowflake. While each ETL tool is different, they should all be able to offer the following:
- One single tool for extraction, transformation, and loading of data
- Native CDC capabilities
- Orchestration and automation of ETL/ELT processes
- Less setup than using Snowpipe and external stages
- No-code solutions for data transformations
There are many use cases where a third party ETL tool is the right choice, and when implemented successfully, can save time and money for data engineering teams. There are also reasons not to use third party tools:
- Price for proprietary software can be very expensive
- Some tools are not fully managed, meaning your team will have to set up the cloud infrastructure to serve the ETL tool
- Potentially less scalability than Snowflake
- Difficult to switch from one tool to another
- Each tool requires additional skills in order to implement effectively
- Continuous, near-real-time, and real-time loading is immature or nonexistent
There are other options for loading data into Snowflake:
- BULK loading via COPY command
- Data loading wizard in the Snowflake UI
The data loading wizard allows you to load data directly into tables with no actual code. This is intended for smaller datasets and in rare circumstances and should not be used for regular data loading. BULK loading via the COPY command allows you to manually load data by staging it and then copying it to a table. This requires a running virtual warehouse. This too should not be used for regularly scheduled loads or any kind of volatile or business-critical data.
The Current State of Data Engineering in Snowflake
While Snowflake started off as a cloud storage and querying platform, competition has forced it to create more data engineering capabilities. Platforms like Databricks are starting to encroach on data storage, and Snowflake has responded by adding new features:
Tasks are named objects that execute a SQL statement at a scheduled time. The emphasis is on the scheduled aspect, as this is the first semblance of orchestration in Snowflake. A task will use a virtual warehouse to run a SQL statement, such as calling a stored procedure. In other words, you can combine tasks and stored procedures to create data pipelines!
Because tasks are named database objects, you can use them in third party ETL solutions as you would a table or view. Certain limitations may apply depending on the tool.
These are the last piece of the puzzle for creating a data engineering environment. Streams are named objects created off an existing table. A stream will have the same columns as the table, but with three additional columns: METADATA$ACTION, METADATA$ISUPDATE, and METADATA$ROW_ID.
- METADATA$ACTION specifies if the row is an insert or delete
- METADATA$ISUPDATE specifies if the insert or delete is an update
- METADATA$ROW_ID specifies the unique and immutable row for the table.
With these three columns, you can implement Change-Data-Capture on tables. This article shows how to build a relatively simple Type II slowly changing dimension and all the upsert logic associated with it.
Streams and stored procedures still need to be used together for CDC, but the process is simplified. Instead of comparing the ODS table with the new raw data, the stream will determine the table changes for you. The stream and corresponding view that manages the upsert logic are used in a simple stored procedure to expired changed and deleted records, insert new and current changed records, and append metadata to all records.
Bringing it all together
With these native Snowflake objects, data processing no longer requires third party tools. Achieving a robust persistent staging layer with auditability is now possible with streams, and when chained with tasks and called by stored procedures, the entire process can be automated. Pipes can continuously bring new data into Snowflake, so your data will never be stale. The data orchestration and processing components that were previously in a third-party tool now resides entirely within Snowflake, and the code is entirely SQL.
Interested in Learning More about Snowflake as a Data Engineering Tool?
Strive Consulting is a business and technology consulting firm, and proud partner of Snowflake, having direct experience with query usage and helping our clients understand and monopolize the benefits the Snowflake Data Platform presents. Our team of experts can work hand-in-hand with you to determine if leveraging Snowflake is right for your organization. Check out Strive’s additional Snowflake thought leadership here.
Snowflake delivers the Data Cloud – a global network where thousands of organizations mobilize data with near-unlimited scale, concurrency, and performance. Inside the Data Cloud, organizations unite their siloed data, easily discover and securely share governed data, and execute diverse analytic workloads. Join the Data Cloud at Snowflake.com.