Why Choose Open-Source Technologies?
In 2022, almost every enterprise has some cloud footprint, especially around its data. These cloud platforms offer closed-source tools that, while beneficial in many ways, are not always the best choice for every organization. First and foremost, these proprietary services can be expensive. In addition to paying for the storage and compute needed to hold and access your data, you also pay for the software itself. You could become locked into a multi-year contract, or you might find yourself locked into a single cloud’s tech stack. Once that happens, it’s very difficult (and expensive) to migrate to a different technology or re-tool your stack. Put simply, if you ever hit a roadblock your closed-source tool can’t solve, there may be no workaround.
Since closed-source technologies can create a whole host of issues, open-source technologies may be the right choice. Open-source tech is not owned by any single vendor; anyone can use, modify, repackage, and distribute it under its license. Several companies have monetized open-source technology by packaging and distributing it in innovative ways. Databricks, for example, built a platform on Apache Spark, a big-data processing framework. In addition to providing Spark as a managed service, Databricks offers many other features that organizations find valuable. However, a smaller organization might not have the capital for a managed service like Databricks, or the use case it aims to solve. Instead, you can deploy Apache Spark on your own server or a cloud compute instance and retain total control. This is especially attractive when addressing security concerns: an organization can benefit from a tool like Spark without involving a third party and risking exposure of its data.
Another benefit is fine-tuning resource provisioning.
Because you’re deploying the code on your own server or compute instance, you can configure the specifications however you want. That way, you can avoid over-provisioning or under-provisioning. You can also manage scaling, failover, redundancy, security, and more yourself. While many managed platforms offer auto-scaling and failover, the control is rarely as granular as when you provision resources yourself.
Many proprietary tools, specifically ETL (Extract, Transform, Load) and data integration tools, are no-code, GUI-based solutions that still require prior experience to implement correctly. While the GUIs are intended to make it easier for analysts and less-technical people to create data solutions, more technical engineers can find them frustrating. Unfortunately, as the market becomes inundated with new tools, it can be difficult to find proper training and resources; even documentation can be iffy. Open-source technologies have their own quirks, but it’s entirely possible to build a complete data stack – data engineering, modeling, analytics, and more – using popular open-source tech. These tools will almost certainly lack a no-code GUI, but they are compatible with your favorite programming languages. Spark, for instance, supports Scala, Python, Java, SQL, and R, so anyone who knows one of those languages can be effective with it.
But how does this work with cloud environments?
You can choose how much of the open-source stack you want to incorporate. A fully open-source stack would simply run all your open-source data components on cloud compute instances: database, data lake, ETL, data warehouse, and analytics, all on virtual machine(s). However, that’s quite a bit of infrastructure to set up, so it may make sense to offload some parts to cloud-native technologies. Instead of creating and maintaining your own data lake, you could use AWS S3, Azure Data Lake Storage Gen2, or Google Cloud Storage. Instead of managing a compute instance for a database, you could use AWS RDS, Azure SQL DB, or Google Cloud SQL with an open-source database engine like MySQL or MariaDB. Instead of managing a Spark cluster, it might make sense to let the cloud handle the scaling, software patching, and other maintenance through AWS EMR, Azure HDInsight, or Google Dataproc. You could also abandon the idea of using compute instances altogether and architect a solution around a cloud’s managed open-source offerings: AWS EMR, AWS MWAA, AWS RDS, Azure Database, Azure HDInsight, GCP’s Dataproc and Cloud Composer, and those are just the data-specific services. As mentioned before, these native services take on some responsibility for maintaining the compute/storage, software versions, runtimes, and failover. As a result, a managed offering will be more expensive than doing it yourself, but you’re still not paying software licensing costs.
In the end, there’s a tradeoff.
The tradeoff is between total control on one hand and ease of use, maintenance, and cost optimization on the other, but there is a myriad of options for building an open-source data stack. You have the flexibility to host it on-premises or in the cloud of your choice. Most importantly, you can reduce spend significantly by avoiding software licensing costs.
Interested in Learning More About Open-Source Technologies?
Here at Strive Consulting, our subject matter experts team up with you to understand your core business needs while taking a deeper dive into your organization’s growth strategy. Whether you’re interested in modern data integration or an overall data and analytics assessment, Strive Consulting is dedicated to being your partner, committed to success. Learn more about our Data & Analytics practice HERE.
Snowflake: A Data Engineering Tool
Snowflake began as a best-in-class cloud data storage and querying tool. It set out to simply store data while providing the near-unlimited scalability of the cloud. Right out of the gate, it offered remarkable flexibility with its decoupled storage and compute. You could model your data in a Kimball dimensional model, a Data Vault, or even a loosely structured data lake, all in the same instance. It also handled table clustering, auto-scaling, security, caching, and many other features most data engineers and analysts don’t want to worry about. However, it fell short when it came to data extraction and transformation. Other platforms like Redshift, BigQuery, and Synapse integrate tightly with their respective cloud data stacks, making data engineering and processing as simple as it gets. Snowflake, on the other hand, has a few options for loading and transforming data.
Option 1: Snowpipe and Stages
Snowflake has a proprietary continuous data loading solution called Snowpipe. It uses named Snowflake objects called ‘pipes’, each of which wraps a COPY statement that takes data from a stage and loads it into a table. A stage is a location where data files are stored in the cloud; stages can be internal (within Snowflake itself) or external (on a cloud platform like AWS S3 or Azure Data Lake Storage). Snowpipe can be invoked through a public REST API endpoint or triggered by a cloud event messaging service like AWS SNS. A typical auto-ingest flow looks like this:
- Log files are uploaded to an AWS S3 bucket
- AWS SNS or S3 Event Notifications notify Snowpipe that new data is available
- Snowpipe grabs the new data and loads it into the target table
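To make this concrete, here is a minimal sketch of what the stage and pipe definitions behind such a flow might look like; the bucket, storage integration, file format, and table names are placeholders, not part of the original example.

```sql
-- Hypothetical external stage pointing at the S3 bucket that receives the log files
CREATE OR REPLACE STAGE raw_logs_stage
  URL = 's3://my-log-bucket/logs/'               -- placeholder bucket/path
  STORAGE_INTEGRATION = my_s3_integration        -- placeholder storage integration
  FILE_FORMAT = (TYPE = JSON);

-- A pipe is a named wrapper around a COPY statement.
-- AUTO_INGEST = TRUE lets S3 event notifications trigger the load.
CREATE OR REPLACE PIPE raw_logs_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO raw_logs                             -- placeholder target table
  FROM @raw_logs_stage;
```

Once the pipe exists, each new file landing in the bucket is picked up and loaded without a dedicated virtual warehouse.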
There are numerous benefits to this approach:
- It’s serverless, so you don’t need to configure or monitor resources.
- It’s continuous, so it’s a good approach for constantly flowing data or small incremental loads.
- You’re only billed for the compute Snowflake uses to move the data, as opposed to the coarser billing model of a dedicated virtual warehouse.
- If you already have a cloud data lake, this is a simple way to migrate data into Snowflake using native technology.
However, Snowpipe may not be appropriate for the following reasons:
- Snowpipe alone doesn’t solve Change Data Capture (CDC), which requires the use of streams
- Snowpipe requires additional cloud resources and Snowflake objects:
  - A cloud data store
  - An event messaging service
  - A Snowflake file format and copy options
- Integration between your Snowflake instance and your cloud account is required
- Operational details such as file sizing and queuing need to be considered
Option 2: Third-Party ETL Tools
From the beginning, Snowflake built a robust partner network and integrations with some of the most popular ETL tools on the market. These tools, in turn, have created native integrations with Snowflake and simplified the process of extracting, transforming, and loading data from various sources into Snowflake. While each ETL tool is different, they should all offer the following:
- One single tool for extraction, transformation, and loading of data
- Native CDC capabilities
- Orchestration and automation of ETL/ELT processes
- Less setup than using Snowpipe and external stages
- No-code solutions for data transformations
There are many use cases where a third-party ETL tool is the right choice, and when implemented successfully, it can save time and money for data engineering teams. There are also reasons not to use third-party tools:
- Licensing for proprietary software can be very expensive
- Some tools are not fully managed, meaning your team will have to set up the cloud infrastructure to serve the ETL tool
- Potentially less scalability than Snowflake
- Difficult to switch from one tool to another
- Each tool requires additional skills in order to implement effectively
- Continuous, near-real-time, and real-time loading is immature or nonexistent
There are other options for loading data into Snowflake:
- Bulk loading via the COPY command
- Data loading wizard in the Snowflake UI
The data loading wizard allows you to load data directly into tables without writing any code. It is intended for small datasets and occasional use, and should not be used for regular data loading. Bulk loading via the COPY command allows you to manually load data by staging it and then copying it into a table; this requires a running virtual warehouse. It, too, should not be used for regularly scheduled loads or for volatile or business-critical data.
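For illustration, a one-off bulk load might look like the sketch below; the file path, table name, and file format are hypothetical, and the PUT command runs from a client such as SnowSQL.

```sql
-- Upload a local file to the table stage for ORDERS (runs from a client like SnowSQL)
PUT file:///tmp/orders.csv @%orders;

-- Copy the staged file into the table; this requires a running virtual warehouse
COPY INTO orders
  FROM @%orders
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
  PURGE = TRUE;   -- remove the staged file after a successful load
```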
The Current State of Data Engineering in Snowflake
While Snowflake started off as a cloud storage and querying platform, competition has forced it to create more data engineering capabilities. Platforms like Databricks are starting to encroach on data storage, and Snowflake has responded by adding new features:
Tasks are named objects that execute a SQL statement at a scheduled time. The emphasis is on the scheduled aspect, as this is the first semblance of orchestration in Snowflake. A task will use a virtual warehouse to run a SQL statement, such as calling a stored procedure. In other words, you can combine tasks and stored procedures to create data pipelines!
Because tasks are named database objects, you can use them in third party ETL solutions as you would a table or view. Certain limitations may apply depending on the tool.
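As a rough sketch, a task that calls a stored procedure every hour might look like this; the warehouse, task, and procedure names are placeholders.

```sql
-- Hypothetical task that calls a stored procedure on a fixed schedule
CREATE OR REPLACE TASK load_dim_customer_task
  WAREHOUSE = etl_wh              -- virtual warehouse the task runs on (placeholder)
  SCHEDULE  = '60 MINUTE'         -- run every hour
AS
  CALL load_dim_customer();       -- placeholder stored procedure containing the load logic

-- Tasks are created in a suspended state; resume the task to start the schedule
ALTER TASK load_dim_customer_task RESUME;
```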
Streams are the last piece of the puzzle for creating a data engineering environment in Snowflake. A stream is a named object created on an existing table. It has the same columns as the table, plus three additional columns: METADATA$ACTION, METADATA$ISUPDATE, and METADATA$ROW_ID.
- METADATA$ACTION specifies whether the change is an insert or a delete
- METADATA$ISUPDATE specifies whether the insert or delete is part of an update
- METADATA$ROW_ID specifies the unique and immutable ID of the row in the table
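For example, a stream might be created and inspected like this; the table, stream, and column names are hypothetical.

```sql
-- Hypothetical stream tracking changes on an ODS table
CREATE OR REPLACE STREAM customer_stream ON TABLE customer_ods;

-- Querying the stream returns only the changed rows, plus the metadata columns
SELECT customer_id,
       name,
       METADATA$ACTION,      -- 'INSERT' or 'DELETE'
       METADATA$ISUPDATE,    -- TRUE when the insert/delete is part of an update
       METADATA$ROW_ID       -- unique, immutable identifier for the row
FROM customer_stream;
```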
With these three columns, you can implement Change-Data-Capture on tables. This article shows how to build a relatively simple Type II slowly changing dimension and all the upsert logic associated with it.
Streams and stored procedures still need to be used together for CDC, but the process is simplified. Instead of comparing the ODS table with the new raw data, the stream determines the table changes for you. The stream and a corresponding view that manages the upsert logic are used in a simple stored procedure to expire changed and deleted records, insert new and current changed records, and append metadata to all records.
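The full Type II logic is beyond a short example, but the sketch below shows the core idea of consuming a stream in a MERGE to upsert a target table; the table and column names are hypothetical, and running the statement advances the stream’s offset.

```sql
MERGE INTO dim_customer AS t
USING (
    -- Keep the INSERT side of updates plus true inserts and true deletes,
    -- so each key appears at most once in the source
    SELECT *
    FROM customer_stream
    WHERE METADATA$ACTION = 'INSERT' OR NOT METADATA$ISUPDATE
) AS s
  ON t.customer_id = s.customer_id
WHEN MATCHED AND s.METADATA$ACTION = 'DELETE' THEN
  DELETE
WHEN MATCHED AND s.METADATA$ACTION = 'INSERT' THEN
  UPDATE SET t.name = s.name, t.email = s.email
WHEN NOT MATCHED AND s.METADATA$ACTION = 'INSERT' THEN
  INSERT (customer_id, name, email)
  VALUES (s.customer_id, s.name, s.email);
```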
Bringing It All Together
With these native Snowflake objects, data processing no longer requires third-party tools. Achieving a robust persistent staging layer with auditability is now possible with streams, and when these are chained with tasks and called by stored procedures, the entire process can be automated. Pipes can continuously bring new data into Snowflake, so your data will never be stale. The data orchestration and processing components that previously lived in a third-party tool now reside entirely within Snowflake, and the code is entirely SQL.
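As a final sketch of how these pieces chain together, a task can be gated on a stream so the transformation runs only when new changes have arrived; the warehouse, stream, task, and procedure names below are hypothetical.

```sql
-- Hypothetical task that runs the upsert only when the stream has captured changes
CREATE OR REPLACE TASK merge_dim_customer_task
  WAREHOUSE = etl_wh
  SCHEDULE  = '15 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('CUSTOMER_STREAM')
AS
  CALL merge_dim_customer();    -- placeholder stored procedure wrapping the MERGE logic

ALTER TASK merge_dim_customer_task RESUME;
```

With a pipe landing raw data, a stream capturing the changes, and a task applying them on a schedule, the pipeline is end-to-end SQL inside Snowflake.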
Interested in Learning More about Snowflake as a Data Engineering Tool?
Strive Consulting is a business and technology consulting firm and a proud partner of Snowflake, with direct experience in query usage and in helping our clients understand and maximize the benefits the Snowflake Data Platform presents. Our team of experts can work hand-in-hand with you to determine if leveraging Snowflake is right for your organization. Check out Strive’s additional Snowflake thought leadership here.
Snowflake delivers the Data Cloud – a global network where thousands of organizations mobilize data with near-unlimited scale, concurrency, and performance. Inside the Data Cloud, organizations unite their siloed data, easily discover and securely share governed data, and execute diverse analytic workloads. Join the Data Cloud at Snowflake.com.