Adventures in Snowflake Cost Management

Pay for use is both an exciting and challenging aspect of using the Snowflake Data Cloud. I’ve led workshops and proudly proclaimed “Snowflake has brought an end to capacity planning!” And it has. You never have to figure out how much storage or processing power you are going to need. You don’t have to plan for three-year storage needs and hope that you haven’t bought too little. That used to be a constant dance – but no more. With Snowflake you can just add whatever data you need and pay only for what you use. The same is true for query processing power. When Black Friday hits, you have power on demand, yet you’re not paying for that power all year long.

Now budget planning? That’s a different story… Typically, you will have bought a certain size machine to run your database or contracted for a certain amount of cloud capacity…and whether you use it a little or a lot, you pay the same. When you see your Snowflake costs skyrocket, you’ll start to think about usage in ways you never had to before.

Here are some tips for being more efficient with your Snowflake spend.

Think Small, Run Big

Thinking and development time should be spent on an X-Small or Small compute warehouse. When it comes time to run a job or a long query, spin up a larger warehouse, run the job, and then shut the warehouse down. You have capacity on demand, so size your warehouse to optimize cost both in what Snowflake charges and in human capital. Why wait two hours on a long job when you can run it in 15 minutes on a warehouse eight times the size? For the most part, run times are cut roughly in half while the per-hour cost doubles at each size up. So, the total Snowflake cost is roughly neutral on a bigger warehouse, but it saves human cost.
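As a rough sketch (the warehouse name here is hypothetical), the size-up, run, size-down pattern is just a few statements:

    ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'XLARGE';  -- size up for the heavy job
    -- ... run the long job or query here ...
    ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'XSMALL';  -- size back down for development work
    ALTER WAREHOUSE etl_wh SUSPEND;                        -- stop the meter entirely when idle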

Sometimes even the Snowflake cost is lower when you run a more expensive, larger warehouse. How so? If the compute warehouse is too small, it may have to spill data to local disk or even remote storage, and disks are a lot slower than RAM. A larger warehouse also has more RAM, so the query or load can complete so much faster that you save more than the extra cost of running large.

One Expandable Warehouse for All

It is typical for companies to assign each team or business unit its own warehouse. It’s one of the ways companies manage cost charge-back. However, it’s inefficient to have multiple warehouses running up charges when a single shared warehouse will do. To handle peaks in demand, set it up as a multi-cluster warehouse that spawns additional clusters when demand rises and shrinks back when demand goes away. Use roles or tags to divvy up the shared cost across those using the warehouse.
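A minimal sketch of such a shared warehouse (the name and limits are hypothetical; multi-cluster warehouses require Enterprise edition or above):

    CREATE WAREHOUSE shared_wh
      WAREHOUSE_SIZE    = 'MEDIUM'
      MIN_CLUSTER_COUNT = 1
      MAX_CLUSTER_COUNT = 5          -- spin up extra clusters only when demand requires it
      SCALING_POLICY    = 'STANDARD'
      AUTO_SUSPEND      = 60         -- suspend after 60 seconds of inactivity
      AUTO_RESUME       = TRUE;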

Break Large Load Files Into Many Smaller Ones

Snowflake is a massively parallel database. Each node in a Snowflake warehouse cluster has 8 processes, so a Large warehouse with 8 nodes has 64 processes. If you try to load a single large file, only one of those processes is used. If you break the file up (Snowflake recommends chunks of roughly 100-250 MB compressed), all 64 processes work in parallel, rocketing your loading performance.
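Once the file is split into chunks in a stage, a single COPY statement spreads those chunks across all of the available processes. A sketch with hypothetical stage, table, and file names:

    COPY INTO inventory
      FROM @load_stage/inventory/                    -- stage folder holding the 100-250 MB chunks
      PATTERN = '.*inventory_part_.*[.]csv[.]gz'
      FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);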

Judicious Use of Cluster Keys

Snowflake builds micro-partitions when data is loaded. For 90% of scenarios, you can just let Snowflake do its thing and you will get great performance. This is one of the ways Snowflake is so cost effective: it doesn’t take an army of tuning DBAs to operate. However, there will be times when you need to put a cluster key on a table to get the performance you need, and poorly performing queries cost extra money.

In one case, a 40-billion-row table was joined to a 3-billion-row table in a view that brought reporting to its knees. Clustering both tables on the join keys enabled the report to run in less than 2 minutes. For more information on clustering, see Snowflake’s documentation.
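Adding a cluster key is a one-line change per table; the statements below assume hypothetical table and column names:

    ALTER TABLE sales_fact    CLUSTER BY (store_id, item_id);
    ALTER TABLE inventory_dim CLUSTER BY (store_id, item_id);
    -- check how well a table is clustered on those keys
    SELECT SYSTEM$CLUSTERING_INFORMATION('sales_fact', '(store_id, item_id)');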

Lift and Shift Still Needs Tuning

One of the common mistakes is to assume that “if it worked in the old system, it should work in Snowflake.” You will encounter performance issues (and thus cost issues) whose solution will not lie in adjusting Snowflake warehouses.

Here are just some recent tuning scenarios I’ve encountered:

There was a data load costing $500 to $1,500 per day. Eight billion rows of inventory were loaded daily; every item in every store across the world was scanned. The loading procedure used a MERGE, so 8 billion searches to find the right row and update the data. And yet there was no history; once the merge happened, the current value was the only value. A merge wasn’t needed at all. In effect, the table was a daily snapshot of inventory, and the data coming in was all that was needed. Removing the merge took the process from 8 hours on a very expensive 64-node Snowflake warehouse to a couple of minutes on a 32-node Snowflake warehouse, a savings of $15k to $30k per month.
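Since the table was just a daily snapshot, the expensive MERGE could be replaced with a wholesale overwrite. A sketch with hypothetical names (Snowflake’s INSERT OVERWRITE truncates and reloads the target in one statement):

    -- before: MERGE INTO inventory_snapshot ... ON (store_id, item_id) ... 8 billion row lookups
    -- after: replace the snapshot wholesale
    INSERT OVERWRITE INTO inventory_snapshot
    SELECT * FROM staged_inventory;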

Just because “the query worked on XYZ database” doesn’t mean everything is okay. A very expensive, long-running query on Snowflake was fixed by discovering a Cartesian join. When all the proper keys were added to the join, the query ran fast.
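A generic before-and-after sketch of that kind of fix (table and column names here are hypothetical):

    -- before: joining on only part of the composite key produces a partial Cartesian product
    SELECT o.order_id, SUM(l.line_amount)
    FROM   orders o
    JOIN   order_lines l ON o.order_id = l.order_id
    GROUP  BY o.order_id;

    -- after: all of the proper keys are included in the join
    SELECT o.order_id, SUM(l.line_amount)
    FROM   orders o
    JOIN   order_lines l
           ON  o.order_id     = l.order_id
           AND o.company_code = l.company_code
    GROUP  BY o.order_id;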

Oftentimes in mature systems there are views built upon views built upon views. A slow report sent me spelunking through the “view jungle.” I discovered that one of the views joined to a table from which no fields were used, plus a DISTINCT to clean up the resulting duplicates. At half a billion rows, this unnecessary join, and the unnecessary DISTINCT it forced, caused the performance problem.

The takeaway is that a good deal of the work will be taking a fresh look at the problem and not taking “the old system” as gospel for the new system.

Monitor the Spend

Snowflake provides views to help you monitor cost and performance. They are located in the SNOWFLAKE database, in the ACCOUNT_USAGE schema. If you have multiple accounts, the combined values are in the ORGANIZATION_USAGE schema. There are prebuilt Tableau, Power BI, Sigma, and other dashboards you can download. There is no substitute, however, for getting familiar with the views themselves.
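For example, a query like the one below against the WAREHOUSE_METERING_HISTORY view shows which warehouses have burned the most credits over the last 30 days (adjust the time window to your own needs):

    SELECT warehouse_name,
           SUM(credits_used) AS credits_last_30_days
    FROM   snowflake.account_usage.warehouse_metering_history
    WHERE  start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
    GROUP  BY warehouse_name
    ORDER  BY credits_last_30_days DESC;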

Strive is a proud partner of Snowflake!

Strive Consulting is a business and technology consulting firm, and proud partner of Snowflake, having direct experience with query usage and helping our clients understand and maximize the benefits the Snowflake Data Platform presents. Our team of experts can work hand-in-hand with you to determine if leveraging Snowflake is right for your organization. Check out Strive’s additional Snowflake thought leadership HERE.

ABOUT SNOWFLAKE

Snowflake delivers the Data Cloud – a global network where thousands of organizations mobilize data with near-unlimited scale, concurrency, and performance. Inside the Data Cloud, organizations unite their siloed data, easily discover and securely share governed data, and execute diverse analytic workloads. Join the Data Cloud at SNOWFLAKE.COM.

An Example of a Living Data Mesh: The Snowflake Data Marketplace

The enterprise data world has been captivated by a new trend: Data Mesh. The “What Is Data Mesh” articles have already come out, but in this publication, I want to highlight a live, in production, worldwide Data Mesh example – The Snowflake Data Marketplace.

As with every “new thing” that comes down the pike, people will change the definition to suit their purposes and point of view, and I am no different. Zhamak Dehghani, a Director of Emerging Technologies at ThoughtWorks, writes that a Data Mesh must contain the following shifts:

  • Organization: From central controlled to distributed data owners. From enterprise IT to the domain business owners.
  • Technology: It shifts from technology solutions that treat data as a byproduct of running pipeline code to solutions that treat data and code that maintains it as one lively autonomous unit.
  • Value: It shifts our value system from data as an asset to be collected to data as a product to serve and delight the data users (internal and external to the organization).
  • Architecture: From central warehouses and data lakes to a distributed mesh of data products with a standardized interface. 

It is on this principle that I depart and advocate for the Snowflake Data Cloud. I believe that the advantages that have always come with a centralized data store can be retained, while the infinite scale of Snowflake’s Data Cloud facilitates the rest of the goals behind Data Mesh.

There is a lot to understand about the new paradigm and its benefits, and even more in grasping what an up-and-running Data Mesh would look like; to date, even simplified overview articles are lengthy. As I wrestled with coming to my own understanding of Data Mesh, and how Strive could bring our decades of successful implementations in all things data, software development, and organizational change management to bear, I was hit by a simple notion: there is already a great example of a successfully implemented, worldwide, multi-organization Data Mesh, the Snowflake Marketplace.

There are more than 1,100 data sets from more than 240 providers, available to any Snowflake customer. The data sets from the market become part of the customer’s own Snowflake account and yet are managed and kept up to date by providers. No ETL needed and no scheduling. When providers update their data, it is updated for all subscribers. This is the definition of “data as a product”.
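Under the hood this is Snowflake secure data sharing. For a direct share, mounting a provider’s data set is a single statement (the names below are hypothetical), and the data is queryable immediately, with no copy and no ETL:

    CREATE DATABASE weather_data FROM SHARE provider_account.weather_share;
    SELECT * FROM weather_data.public.daily_forecast LIMIT 10;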

In effect, the Snowflake Data Cloud is the self-service, data-as-a-platform infrastructure, and the Snowflake Marketplace is the discovery and governance tool within it. Everyone who has published data into the Marketplace has become a product owner and delivered data as a product.

We can see the promised benefit of the Snowflake Marketplace as a Data Mesh in one thing: massive scalability. I’m not speaking of the Snowflake platform’s near-infinite scalability, impressive as that is, but of how every team publishing data into the Marketplace has been able to do so without the cooperation of another team. None of the teams that have published data have had to wait in line for their priorities to bubble up to the top of IT’s agenda. A thousand new teams can publish data today. A hundred thousand new teams can publish their data tomorrow.

This meets the organizational shift from centralized control to decentralized domain ownership, delivers data as a product, and, technically, keeps the data and the code that maintains it together as one unit.

Data consumers can go to the Marketplace and find the data they need, regardless of which organization created it. If it’s in the Snowflake Marketplace, any Snowflake customer can use the data for their own needs. Each consumer of the data brings their own compute, so nobody’s use of the data impacts or slows down the performance of another team’s dashboards.

Imagine that instead of weather data published by AccuWeather and financial data by Capital One, it’s your own organization’s customer, employee, marketing, and logistics data. Each data set is owned by the business team that creates the data. They are the team that knows the data best. They curate, cleanse, and productize the data themselves. They do so on their own schedule and with their own resources. That data is then discoverable and usable by anyone else in the enterprise (gated by role-based security). Imagine that you can scale as your business demands, as new businesses are acquired, as ideation for new products occurs. All facilitated by IT, but never hindered by IT as a bottleneck.

With Snowflake’s hyper scalability and separation of storage and compute, and its handling of structured, semi-structured, and unstructured data, it’s the perfect platform to enable enterprise IT to offer “data as self-serve infrastructure” to the business domain teams. From there, it is a small leap to see how the Snowflake Data Marketplace is, in fact, a living example of a Data Mesh with all the benefits realized in Zhamak Dehghani’s papers.

As a data practitioner with over three decades of my own experience, I am as excited today as ever to see the continuous evolution of how we get value out of data and deal with the explosion in data types and volumes. I welcome Data Mesh and the innovations it promises, along with Data Vault 2.0 and hyper-scale cloud databases like Snowflake, to facilitate the scale and speed to value of today’s data environment.

Strive is a proud partner of Snowflake!

Strive Consulting is a business and technology consulting firm, and proud partner of Snowflake, having direct experience with query usage and helping our clients understand and maximize the benefits the Snowflake Data Platform presents. Our team of experts can work hand-in-hand with you to determine if leveraging Snowflake is right for your organization. Check out Strive’s additional Snowflake thought leadership HERE.

ABOUT SNOWFLAKE

Snowflake delivers the Data Cloud – a global network where thousands of organizations mobilize data with near-unlimited scale, concurrency, and performance. Inside the Data Cloud, organizations unite their siloed data, easily discover and securely share governed data, and execute diverse analytic workloads. Join the Data Cloud at SNOWFLAKE.COM.

Exercising Data Governance Best Practices – How to Stay the Course

Have you ever planned to wake up early in the morning to work out, but instead chose to lie in bed and catch up on some sleep? This can happen even after you have committed—mentally, at least—to a new workout regimen.

That’s because the hard part isn’t resolving to do something new; it’s adjusting your daily habits and generating enough momentum to carry the changes forward. This requires discipline and drive.

The same challenges apply to data governance initiatives. If you have ever been part of a data governance program that hesitated, backfired, or stopped completely in its tracks, you know what I’m talking about. Companies are accruing ever-increasing amounts of data and want to transform all that information into insights the same way you want to get in shape. The first step is data governance, but getting your organization to buy in to a new program conceptually is the easy part. Taking action and sticking to it can be much more challenging.

Indeed, many organizations believe that simply implementing technology—like a Master Data Management system—will improve the health of their data. But if you simply buy workout equipment, do you get healthier? Tools will help streamline your organizational processes and complement information governance and information management, but building and maintaining a culture that treats data as an asset to your organization is the key to ongoing success.

Below are some key factors to building good habits to generate momentum once your data governance program is underway:

1. Impart a sense of urgency for the program.

For every organization with a plan to manage its data assets, there needs to be a sense of urgency to keep the plan in place. The reasons vary from organization to organization; they might be driven by compliance, customer satisfaction, sales, revenue, or M&A. Regardless of the reason, it needs to resonate with senior leadership and ideally be tied to the company’s strategic goals in order to be most effective.

2. Communicate, communicate, communicate.

The cornerstone of a successful data governance program is a well-organized, cross-departmental communication plan. A solid plan helps remove silos and maintain cross-departmental support for the initiative. Seek out your champions throughout the organization and meet with key stakeholders regularly to document their pain points. It is important to get people engaged early to keep the excitement going.

3. Operationalize change within the organization.

Your delivery will need to be agile in nature because the plan you put in place will naturally evolve. The goal is to learn what works within your organization early on to ensure you deliver value quickly and the process is sustainable moving forward. Complete tasks iteratively and agree upon a small set of high-value data attributes to aid in validating your data governance process. In addition, manage your data elements to ensure their best quality.

4. Make the plan as RACI as possible.

Actively listen to your supporters and put together a plan that encompasses a RACI (Responsible, Accountable, Consulted & Informed) model so that everyone on the team knows their role across the process. This plan will keep your team focused and guide your initiatives moving forward. You’ll raise your odds of success by forming a strong governance organizational structure with roles and responsibilities in place (for data ownership, stewardship and data champions), along with approvals that complement your existing change management process.

5. Measure, Communicate, Repeat.

Keep in mind that “you can’t manage what you don’t measure.” You’ll need to face the facts and communicate your findings. It’s wise to document and implement KPIs (Key Performance Indicators) so that you can measure the progress of your initiative over time. Linking the KPIs to revenue or sales loss, for example, can be a strong indicator to help drive change. As you learn more about your data, it’s important to communicate what’s meaningful to your stakeholders and continue to move forward.

Similar to continuing on a workout regimen, data governance demands a discipline that takes time and patience to fine tune. This requires changing years of undisciplined behaviors regarding data within your organization, and the change will not happen overnight. Changing these behaviors is an ongoing process that needs to resonate throughout an organization’s culture in order for success to occur.

In addition, it’s important to keep things fresh. When working out, you need to rotate through different muscle groups and vary the routine to keep things interesting and progressive. It’s the same with data governance initiatives. Don’t let people get bored with the same repetitive activities day in and day out. Try conducting data discovery sessions where team members present findings from an internal or external dataset that would be interesting to other team members. You can also share successes and learnings from past data-related projects to drive discussion. Another suggestion is to discuss future cross-departmental data projects (or “wish list” items), which can lead to great data roadmap discussions. The objective is to keep everyone engaged and finding value in the meetings so that the team continues to show up and make progress.

Remember that data governance is a journey that requires commitment and hard work. As with exercise, just working out for a month is a great start, but it’s with continued dedication that you really start to notice the change. If you want to take your organization to the next level, you need to develop the discipline toward information management that your organization requires for long-term sustainable success. For those with little experience in implementing or maintaining a data governance plan, experienced consultants can be of great value.

Strive Can Help With Your Data Governance Needs! 

Here at Strive Consulting, our subject matter experts team up with you to understand your core business needs, while taking a deeper dive into your organization’s growth strategy. Whether you’re interested in modern data integration or an overall data and analytics assessment, Strive Consulting is dedicated to being your partner, committed to success. Learn more about our Data & Analytics practice HERE.


Have Your Data and Query It Too!

“Have your cake and eat it too.” How would it make sense to have cake and not be able to eat it? And yet, we have, for decades, had similar experiences with enterprise data warehouses. We have our data; we want to query it too!

Organizations spend so much time, effort, and resources building a single source of truth. Millions of dollars are spent on hardware and software, and then there is the cleansing, collating, aggregating, and applying of business rules to the data. Yet when it comes time to query… we pull data out of the enterprise data warehouse and put it into data marts. There simply is never enough power to serve everybody who wants to query the data.

With the Snowflake Data Cloud, companies of all sizes can store their data in one place – and every department, every team, every individual can query that data. No more need for the time, expense, effort, and delay to move data out of an enterprise data warehouse and into data marts.

The advent of the ‘data lake’ promised a place where all enterprise data could be stored. Structured, semi-structured, and unstructured data could be stored together, cost effectively. And yet, as so many soon found out, data ended up needing to be moved out to achieve the query performance desired. More data marts, more cost, and more time delay to get to business insight.

Snowflake solved this problem by separating data storage from compute. Departments and teams can each have their own virtual warehouse, a separate query compute engine sized appropriately for each use case. These query engines do not interfere with each other. Your data science team can run massive, complex queries without impacting the accounting team’s dashboards.
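A hedged sketch of that separation, with hypothetical warehouse names and sizes; both warehouses read the same stored data, but their compute never competes:

    CREATE WAREHOUSE data_science_wh WAREHOUSE_SIZE = 'XLARGE' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;
    CREATE WAREHOUSE accounting_wh   WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;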

Snowflake can do this because it was designed for the cloud from the ground up. A massively parallel processing database, Snowflake was built to use the cloud infrastructure and services of AWS, quickly followed by Azure and GCP. Organizations get all the scalability promised by “Hadoop-based Big Data” in an easy-to-use, ANSI-standard SQL data warehouse that delivers the 5 V’s of big data (Volume, Value, Variety, Velocity, and Veracity). Not to mention, all these benefits come with industry-leading cost and value propositions.

Speaking of Variety… Snowflake has broken out of the “data warehouse” box and become ‘The Data Cloud’. All your data types: structured, semi-structured, and now unstructured. All your workloads: Data Warehouse, Data Engineering, Data Science, Data Lake, Data Applications, and Data Marketplace. You have scalability in data volume and in query compute engines across all types of data and use cases.

With the Snowflake Data Cloud, you truly can have all your data and query it too, extracting business value for all departments and all employees along the way.

 

Want to learn more about the Snowflake Data Cloud? 

Strive Consulting is a business and technology consulting firm, and proud partner of Snowflake, having direct experience helping our clients understand and maximize the benefits the Snowflake Data Platform presents. Our team of experts can work hand-in-hand with you to determine if leveraging Snowflake is right for your organization. Check out Strive’s additional Snowflake thought leadership HERE.

About Snowflake

Snowflake delivers the Data Cloud – a global network where thousands of organizations mobilize data with near-unlimited scale, concurrency, and performance. Inside the Data Cloud, organizations unite their siloed data, easily discover and securely share governed data, and execute diverse analytic workloads. Join the Data Cloud at SNOWFLAKE.COM.


Why Choose Open-Source Technologies?

In 2022, almost every enterprise has some cloud footprint, especially around their data. These cloud platforms offer closed-source tools which, while offering many benefits, may not always be the best choice for some organizations. First and foremost, these proprietary services can be expensive. In addition to paying for the storage and compute needed to store and access data, you also end up paying for the software itself. You could also become locked into a multi-year contract, or you might find yourself locked into a cloud’s tech stack. Once that happens, it’s very difficult (and expensive) to migrate to a different technology or re-tool your tech stack. To put it simply, if you ever reach a roadblock your closed-source tool can’t solve, there may be no workarounds.

Since closed-source technologies can create a whole host of issues, open-source technologies may be the right choice. Open-source tech is not owned by any single vendor; anyone can use, repackage, and distribute the technology. Several companies have monetized open-source technology by packaging and distributing it in innovative ways. Databricks, for example, built a platform on Apache Spark, a big-data processing framework. In addition to providing Spark as a managed service, Databricks offers a lot of other features that organizations find valuable. However, a small organization might not have the capital or the use case that a managed service like Databricks aims to solve. Instead, you can deploy Apache Spark on your own server or a cloud compute instance and have total control. This is especially attractive when addressing security concerns: an organization can benefit from a tool like Spark without involving a third party and risking exposure of its data to that third party.

Another benefit is fine-tuning resource provisioning.

Because you’re deploying the code on your own server or compute instance, you can configure the specifications however you want. That way, you can avoid over-provisioning or under-provisioning. You can even manage scaling, failover, redundancy, security, and more. While many managed platforms offer auto-scaling and failover, it is rarely as granular as when you provision resources yourself.

Many proprietary tools, specifically ETL (Extract, Transform, Load) and data integration tools, are no-code, GUI-based solutions that require some prior experience to be implemented correctly. While the GUIs are intended to make it easier for analysts and less technical people to create data solutions, more technical engineers can find them frustrating. Unfortunately, as the market becomes more inundated with new tools, it can be difficult to find proper training and resources. Even documentation can be iffy! Open-source technologies can be similarly peculiar, but it’s entirely possible to create an entire data stack (data engineering, modeling, analytics, and more) using popular open-source tech. These tools will almost certainly lack a no-code GUI but are compatible with your favorite programming languages. Spark supports Scala, Python, Java, SQL, and R, so anyone who knows one of those languages can be effective using Spark.

But how does this work with cloud environments?

You can choose how much of the open-source stack you want to incorporate. A fully open-source stack would simply be running all your open-source data components on cloud compute instances: database, data lake, ETL, data warehouse, and analytics all on virtual machine(s). However, that’s quite a bit of infrastructure to set up, so it may make sense to unload some parts to cloud-native technologies. Instead of creating and maintaining your own data lake, it would make sense to use AWS S3, Azure Data Lake Storage gen2, or Google Cloud Storage. Instead of managing a compute instance for a database, it would make sense to use AWS RDS, Azure SQL DB, or Google Cloud SQL and use an open-source flavor of database like MySQL or MariaDB. Instead of managing a Spark cluster, it might make sense to let the cloud manage the scaling, software patching, and other maintenance, and use AWS EMR, Azure HDInsight, or Google Dataproc. You could also abandon the idea of using compute instances and architect a solution using a cloud’s managed open-source offerings: AWS EMR, AWS MWAA, AWS RDS, Azure Database, Azure HDInsight, GCP’s Dataproc and Cloud Composer, and those are just data-specific services. As mentioned before, these native services bear some responsibility for maintaining the compute/storage, software version, runtimes, and failover. As a result, the managed offering will be more expensive than doing it yourself, but you’re still not paying for software licensing costs.

In the end, there’s a tradeoff.

There’s a tradeoff between total control on one hand and ease of use, maintenance, and cost optimization on the other, but there is a myriad of options for building an open-source data stack. You have the flexibility to host it on-premises or in the cloud of your choice. Most importantly, you can reduce spend significantly by avoiding software licensing costs.

 

Interested in Learning More About Open-Source Technologies? 

Here at Strive Consulting, our subject matter experts team up with you to understand your core business needs, while taking a deeper dive into your organization’s growth strategy. Whether you’re interested in modern data integration or an overall data and analytics assessment, Strive Consulting is dedicated to being your partner, committed to success. Learn more about our Data & Analytics practice HERE.


How Snowflake Saved My Life

Let me tell you a story of how the Snowflake Data Cloud saved my life. I know it sounds dramatic, but just hear me out.

A few years ago, I worked with a multi-billion-dollar wholesale distributor that had never implemented a data warehouse. Their main goal? Consolidate all data into one location and enable KPI roll-ups across their disparate systems. In this case, however, they did not want to invest in additional licensing. So, my team set about building a traditional data warehouse leveraging their current platform, SQL Server. Initially, it was a successful four-layer architecture with Staging, Consolidation, Dimensional Model, and Tabular Cube layers, with Power BI as the end visualization solution… but within a few months, issues began to surface.

The number of sources feeding into this platform increased dramatically, and this increase started to impact load times. Initially, the batch load processes ran between two and three hours, but over time they grew to taking 5, 6, sometimes 7 hours to run! We needed a long-term solution, but in the short term we had to keep the platform running to deliver data to the organization.

What we were experiencing were challenges with constraints, indexing, locks, fragmentation, and so on. To mitigate these issues, I personally took to waking up every morning at 3:00 AM to log in and ensure certain process milestones completed successfully and in a timely manner. If those milestones were not achieved, the batch process would stall, fail, or run excessively long, and the last thing I wanted was to explain to the business why they were not going to have data until 9:00, 10:00, or 11:00 AM. After a couple weeks of doing this, it became apparent: we needed a better solution, and fast!

In the past, I had some experience with Big Data platforms, but I decided to research options outside of established technologies such as Cloudera or Hadoop-based solutions and instead looked into something new: Snowflake. Snowflake is a cloud data platform where organizations can store, access, share, and maintain their data at scale, so I thought, why not? Let’s give it a shot!

We set up a proof of concept, initially trying to mimic the four-layer architecture we had set up in SQL Server. After seeing limited success (and being laughed at for even trying it), we took a step back, reevaluated our approach, and flipped the architecture from ‘Extract Transform Load’ to ‘Extract Load Transform’. And… Eureka! With this change, we reduced overnight batch runtimes from the 5, 6, 7 hours on SQL Server to less than 20 minutes. In fact, our average runtime for the load processes was around 17 minutes, but now I’m just showing off.

Not only did this have an incredible effect on our ability to deliver data in a timely manner, it also let us increase the frequency with which we processed data. You see, with SQL Server we were never able to update data more than once a day, but with Snowflake we could run the batch process every 20 minutes and quickly deliver requested changes to the models, measures, and dimensions.

The implementation process went from taking weeks to taking days, or even hours, resulting in some very happy stakeholders. With these results, coupled with the fact that I no longer had to wake up at 3:00AM to verify successful batch processes…Snowflake truly saved my life.

Want to learn more about the Snowflake Data Cloud? 

Strive Consulting is a business and technology consulting firm, and proud partner of Snowflake, having direct experience helping our clients understand and maximize the benefits the Snowflake Data Platform presents. Our team of experts can work hand-in-hand with you to determine if leveraging Snowflake is right for your organization. Check out Strive’s additional Snowflake thought leadership here.

About Snowflake

Snowflake delivers the Data Cloud – a global network where thousands of organizations mobilize data with near-unlimited scale, concurrency, and performance. Inside the Data Cloud, organizations unite their siloed data, easily discover and securely share governed data, and execute diverse analytic workloads. Join the Data Cloud at Snowflake.com.


Snowflake: A Data Engineering Tool

Snowflake began as a best-in-class cloud data storage and querying tool. It set out to simply store data while providing the infinite scalability of the cloud. Right out of the gate it offered incredible flexibility with its decoupled storage and compute. You could model your data in a Kimball dimensional model, a Data Vault, or even a loosely structured data lake, all in the same instance. It also handled table clustering, auto-scaling, security, caching, and many other features most data engineers and analysts don’t want to worry about. However, it fell short when it came to data extraction and transformation. Other platforms like Redshift, BigQuery, and Synapse integrate tightly with their respective cloud data stacks, making data engineering and processing as simple as it gets. Snowflake, on the other hand, has three options for loading and transforming data.

Option 1: Snowpipe and Stages

Snowflake has a proprietary continuous data loading solution called Snowpipe. It uses named Snowflake objects called ‘pipes’, each of which contains a COPY statement that takes data from a stage and loads it into a table. A stage is a location in the cloud where data files are stored. Stages can be internal (within Snowflake itself) or external (on a cloud platform like AWS S3 or Azure Data Lake Storage). Snowpipe can be called from a public REST API endpoint or triggered by a cloud event messaging service like AWS SNS.

For example:

  1. Log files are uploaded to an AWS S3 bucket
  2. AWS SNS or S3 Event Notifications notify Snowpipe new data is available
  3. Snowpipe grabs the new data and loads it into the target table
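A sketch of the pipe behind step 3, assuming a hypothetical external stage and target table:

    CREATE PIPE log_pipe AUTO_INGEST = TRUE AS
      COPY INTO log_events
      FROM @log_stage                 -- external stage pointing at the S3 bucket
      FILE_FORMAT = (TYPE = JSON);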

There are numerous benefits to this approach:

  • It’s serverless, so you don’t need to configure or monitor resources.
  • It’s continuous, so it’s a good approach for constantly flowing data or small incremental loads.
  • You’re only billed for the serverless resources Snowflake uses to move the data, as opposed to paying for an entire running virtual warehouse.
  • If you already have a cloud data lake, this is a simple way to migrate data into Snowflake using native technology.

However, Snowpipe may not be appropriate for the following reasons:

  • Snowpipe alone doesn’t solve Change-Data-Capture (CDC), which requires the use of streams
  • Snowpipe requires additional cloud resources:
    • Cloud data store
    • Event messaging service
    • IAM
    • Snowflake file format and copy options
  • Integration between your Snowflake instance and your cloud instance is required.
  • File sizing, queuing, and other operational details need attention.

Option 2: Third-Party ETL Tools

From the beginning, Snowflake created a robust partner network and integrations with some of the most popular ETL tools on the market. These ETL tools have built native integrations with Snowflake and have simplified the process of extracting, transforming, and loading data from various sources into Snowflake. While each ETL tool is different, they should all be able to offer the following:

  • One single tool for extraction, transformation, and loading of data
  • Native CDC capabilities
  • Orchestration and automation of ETL/ELT processes
  • Less setup than using Snowpipe and external stages
  • No-code solutions for data transformations

There are many use cases where a third-party ETL tool is the right choice; when implemented successfully, it can save time and money for data engineering teams. There are also reasons not to use third-party tools:

  • Price for proprietary software can be very expensive
  • Some tools are not fully managed, meaning your team will have to set up the cloud infrastructure to serve the ETL tool
  • Potentially less scalability than Snowflake
  • Difficult to switch from one tool to another
  • Each tool requires additional skills in order to implement effectively
  • Continuous, near-real-time, and real-time loading is immature or nonexistent

Other Options

There are other options for loading data into Snowflake:

  1. BULK loading via COPY command
  2. Data loading wizard in the Snowflake UI

The data loading wizard allows you to load data directly into tables with no actual code. It is intended for smaller datasets and rare circumstances, and should not be used for regular data loading. BULK loading via the COPY command allows you to manually load data by staging it and then copying it to a table; this requires a running virtual warehouse. It too should not be used for regularly scheduled loads or any kind of volatile or business-critical data.
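For completeness, a hedged sketch of the bulk path with hypothetical names, staging a local file from SnowSQL and copying it with a running warehouse:

    PUT file:///tmp/customers.csv @%customers;   -- upload to the table stage (run from SnowSQL)
    COPY INTO customers
      FROM @%customers
      FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
      PURGE = TRUE;                              -- remove staged files after a successful load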

The Current State of Data Engineering in Snowflake

While Snowflake started off as a cloud storage and querying platform, competition has forced it to create more data engineering capabilities. Platforms like Databricks are starting to encroach on data storage, and Snowflake has responded by adding new features:

Stored Procedures

Stored procedures allow you to execute procedural logic. Procedures can take parameters and execute tasks like DML actions. As of October 2021, stored procedures can only be written in JavaScript.

Tasks

Tasks are named objects that execute a SQL statement at a scheduled time. The emphasis is on the scheduled aspect, as this is the first semblance of orchestration in Snowflake. A task will use a virtual warehouse to run a SQL statement, such as calling a stored procedure. In other words, you can combine tasks and stored procedures to create data pipelines!
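A minimal sketch of a scheduled task, assuming a hypothetical warehouse and tables; note that tasks are created in a suspended state and must be resumed:

    CREATE TASK refresh_daily_sales
      WAREHOUSE = etl_wh
      SCHEDULE  = 'USING CRON 0 6 * * * UTC'   -- every day at 06:00 UTC
    AS
      INSERT INTO daily_sales
      SELECT sale_date, SUM(amount) FROM raw_sales GROUP BY sale_date;

    ALTER TASK refresh_daily_sales RESUME;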

Because tasks are named database objects, you can use them in third party ETL solutions as you would a table or view. Certain limitations may apply depending on the tool.

Streams

These are the last piece of the puzzle for creating a data engineering environment. Streams are named objects created off an existing table. A stream will have the same columns as the table, but with three additional columns: METADATA$ACTION, METADATA$ISUPDATE, and METADATA$ROW_ID.

  • METADATA$ACTION specifies if the row is an insert or delete
  • METADATA$ISUPDATE specifies if the insert or delete is an update
  • METADATA$ROW_ID provides a unique and immutable ID for each row in the table.

With these three columns, you can implement Change-Data-Capture on tables. This article shows how to build a relatively simple Type II slowly changing dimension and all the upsert logic associated with it.

Streams can provide an easier alternative to stored procedures for processing Change-Data-Capture. In a traditional Kimball architecture, this would be best used in the ODS layer and for Type II slowly changing dimensions in the presentation layer. Instead of having to program a stored procedure in JavaScript, you can combine a stream, which provides the table history, with a view that handles the insert/update/delete logic. However, a stream only applies to a single table. Streams are not appropriate for ETL processes that require complex joins and aggregations like you would typically use in a staging layer for fact and dimension tables.

Streams and stored procedures still need to be used together for CDC, but the process is simplified. Instead of comparing the ODS table with the new raw data, the stream determines the table changes for you. The stream and a corresponding view that manages the upsert logic are used in a simple stored procedure to expire changed and deleted records, insert new and current changed records, and append metadata to all records.
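A hedged sketch of that pattern with hypothetical names: a stream on the raw table feeds a MERGE that keeps the ODS copy current.

    CREATE STREAM raw_orders_stream ON TABLE raw_orders;

    -- consume the stream: apply inserts, updates, and true deletes to the ODS copy
    MERGE INTO ods_orders o
    USING raw_orders_stream s
      ON o.order_id = s.order_id
    WHEN MATCHED AND s.METADATA$ACTION = 'DELETE' AND NOT s.METADATA$ISUPDATE THEN DELETE
    WHEN MATCHED AND s.METADATA$ACTION = 'INSERT' THEN
      UPDATE SET o.status = s.status, o.amount = s.amount
    WHEN NOT MATCHED AND s.METADATA$ACTION = 'INSERT' THEN
      INSERT (order_id, status, amount) VALUES (s.order_id, s.status, s.amount);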

Bringing it all together

With these native Snowflake objects, data processing no longer requires third-party tools. Achieving a robust persistent staging layer with auditability is now possible with streams, and when they are chained with tasks and called by stored procedures, the entire process can be automated. Pipes can continuously bring new data into Snowflake, so your data will never be stale. The data orchestration and processing components that previously lived in a third-party tool now reside entirely within Snowflake, and the code is entirely SQL.

Interested in Learning More about Snowflake as a Data Engineering Tool?

Strive Consulting is a business and technology consulting firm, and proud partner of Snowflake, having direct experience with query usage and helping our clients understand and maximize the benefits the Snowflake Data Platform presents. Our team of experts can work hand-in-hand with you to determine if leveraging Snowflake is right for your organization. Check out Strive’s additional Snowflake thought leadership here.

About Snowflake

Snowflake delivers the Data Cloud – a global network where thousands of organizations mobilize data with near-unlimited scale, concurrency, and performance. Inside the Data Cloud, organizations unite their siloed data, easily discover and securely share governed data, and execute diverse analytic workloads. Join the Data Cloud at Snowflake.com.


3 Steps to Building Rapport with Clients in a Hybrid Work Environment

Recently, the workforce has undergone a series of transitions. COVID-19 forced companies to quickly adopt a digital transformation across their organizations and workforces. Even in industries that formerly prohibited working remotely, working from the comfort of one’s home seemingly became the norm overnight. With the release of the COVID-19 vaccine, many companies adopted a hybrid working model, allowing employees to work from home or choose an in-office environment. In this new day and age, companies have a host of difficult challenges to overcome, with relationship building at the core of them all. Healthy relationships sit at the center of everything we do, so it left me wondering: in this dynamic environment, how do we continue to connect and build relationships with our clients? Well, you’re in luck! Check out these three ways you can build rapport with your clients, virtually and in person.

Why Is It Important to Build Rapport?

Rapport is a relationship in which the people or groups involved are “in sync” with each other, understand each other’s feelings or ideas, and communicate smoothly. In other words, when people or teams get to know and like one another, they become friendly and open, fostering a more collaborative work experience.

How Does One Build Better Rapport?

Rapport building takes time, commitment, and dedication. The most successful way to build rapport is to focus on three key fundamental areas: active listening, intentional relationship building, and consistent delivery. These are the things people mention most when describing a great experience with Strive and their own internal organizational needs.

1. Listen to your clients and stakeholders

One of the most valuable and underrated methods of building rapport and relationships with a client is listening. It is easier said than done, but with practice it can be one of the most successful tools in your toolbox. During client discussions and one-on-ones, ensure that you are practicing the three components of active listening: comprehend, retain, respond. In a remote environment, we are unable to rely on body language, so focus not only on what the client is saying, but on how they are saying it. This should give you an understanding of what they are trying to convey. Retain the key points and the emotions the client is trying to express. Reiterate what they stated in a clear and concise manner, avoiding overuse of jargon. Listening to our clients ensures that they feel understood and heard.

2. Create space for relationships

In this hybrid and dynamic work environment, it is easier than ever to lose sight of the relationship building that would naturally occur sitting in a conference room or chatting at a coffee machine. Now, to build relationships, we need to be intentional. Most experts agree that 70 to 93 percent of all communication is nonverbal. Luckily, with the shift to a more hybrid environment, we can now safely meet in person as well as utilize other digital mediums. Maximize this opportunity to host a happy hour, lunch, or quick coffee in a responsible and safe manner, such as on an open restaurant patio. The change of scenery can foster a change in conversation and create a shared experience that will translate into smoother communication in the digital space. Alternatively, set aside time at the beginning of a meeting or throughout your day to connect with your clients or peers. This can be a quick “ping” or Slack message asking them about their weekend, seeing if they need help on a task, or talking about a favorite sports team. We can, and should, utilize all these tools to support rapport building. A client who feels a deeper connection to you is likely to be more collaborative.

3. Deliver results

While this may seem straightforward, it tends to be overlooked. The ability to deliver impressive results is often what separates successful consulting companies from the others. Delivering results is, after all, what we are hired to do. If you are successful at points 1 and 2 above, delivery should be straightforward: you will have a list of key objectives, pain points, and deliverables the client not only expects, but that you can build on. Conversely, listening and relationship building become moot points if you or your team is not delivering or following through on commitments, contractual or otherwise. Continual delivery and follow-through will ensure a client is satisfied.

Although the future of traditional small and large business models is still unknown, understanding the fundamentals in successful relationship building is key in making sure we’re prepared for whatever life throws at us. Active listening and creating the space needed to foster genuine relationships, all while continuing to deliver your best results, will set you and your organization up to be the best of the very best.

Interested in learning more about how Strive puts its clients first?

Here at Strive Consulting, we foster an active, innovative culture, providing the coaching, mentoring, and support our employees and clients need to work at the top of their game and succeed, personally and professionally. Understanding that the corporate world has changed and been challenging on all parties involved in the workplace, Strive is here to provide guidance and assist in any way we can. Strive Consulting is dedicated to being your partner, committed to success.

Understanding Modern Data Integration Best Practices

Access to data and real-time analytics are incredibly vital to any business. There are important decisions to be made, process improvements that can be achieved, supply chain notifications that are needed for immediate business choices, as well as other insights and information used to create or drive data-led decisions. With the data landscape changing over time and an increasing number of tools and cloud-based solutions available, the data space is always evolving. Understanding best practices when moving to a modern data architecture can help organizations succeed. Follow along and learn how Strive sets up our clients with the tools needed to bring business value every time.

Centralize Data

Data should be centralized, creating a single source of truth. That’s easy to say and a lot more complicated to do, depending on current source systems, but data silos can create multiple versions of the data. As you can imagine, this complicates things: two individuals can view two data sets that should show the same results but have underlying differences. Creating that single source of truth delivers reliable data, trusted sources, and consistent results.

Scalability

One of the wonders of modern cloud tools is access to a massive amount of resources with no increase in on-premises hardware. With cloud tools, elasticity and scalability are essential to making all data accessible. Coinciding with the data lake concept, all data can be landed at low storage cost. If more data needs to move downstream into the different data layers, resources can be added and scaled accordingly. Servers can be added for more power and clusters for more concurrency. In a modern data architecture, scalability should always be top of mind.

Extract, Load, Transform (ELT) vs. Extract, Transform, Load (ETL)

Traditionally, resources needed to be planned out on-premises and ahead of time… plus databases had storage limits. This meant developers had to extract data from their sources, transform only what was needed, and load the data into the target. Tools or special coding were needed for this process, and the resources used for all the transforming and loading of data were cumbersome and time-consuming.

With the modern data integration approach, businesses can adopt an ELT approach instead. The data is extracted and loaded first, thanks to low storage costs among cloud providers, allowing data to be transformed downstream. For example, with all the unstructured and structured data loaded directly into a data lake, data scientists have access to everything and the ability to analyze every single bit. The data can then be transformed, based on that analysis and the business rules, as it moves downstream.
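A hedged sketch of the pattern in Snowflake SQL, with hypothetical stage, schema, and table names: land the raw data first, then transform it downstream.

    -- Extract + Load: land the raw JSON as-is into a single VARIANT column
    CREATE TABLE IF NOT EXISTS raw.web_events (v VARIANT);
    COPY INTO raw.web_events FROM @lake_stage/web_events/ FILE_FORMAT = (TYPE = JSON);

    -- Transform: shape only what downstream consumers need
    INSERT INTO curated.daily_events
    SELECT v:user_id::STRING             AS user_id,
           v:event_time::TIMESTAMP::DATE AS event_date,
           COUNT(*)                      AS events
    FROM   raw.web_events
    GROUP  BY 1, 2;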

Re-usable Coding

As the number of tools in the technology space grows, keeping re-usability in mind is very important. It improves speed to market and can reduce the time it takes to deploy code, send extracts, load data into different layers, and so on. For example, Strive partnered with a client who needed 150 different files sent to a vendor. Using our proprietary ELT Accelerator, Strive developed reusable code and a database that could be run once to load all 150 files quickly and efficiently, saving our client time, money, and additional resources while increasing speed to market.

Excellent Commenting and Documentation

Writing detailed documentation is difficult but immensely important, especially at the beginning of a project. Making commenting and documentation foundational saves rework and aids colleague understanding in the long run. Each piece of code, each SQL statement, and every deployment should have documentation and comments. This makes support, future debugging, and future code changes easier.

Tool Agnostic Thinking

To build a modern data architecture, it is important to think agnostically when deciding on tools. There are many cloud providers, AWS, Azure, and Google Cloud among them, and a growing number of cloud data platforms, such as Snowflake. When sifting through them, it’s always important to take a step back and remember what is happening behind the scenes: we are moving data from one area to another, and if you can learn one tool, it’s simple to learn another.

Minimal Approach

In the data landscape, there is no need to overcomplicate data pipelines. Complexity adds to the skillset needed by a support team, creates data latency, and increases the number of places where code changes are needed. Let’s keep it as simple as possible. For example, a data lake can be housed within an AWS S3 folder structure and moved to Snowflake, and all layers can then be built within Snowflake itself. If the design keeps deployment and tool use simple, everyone wins.

Remember, it’s important to understand the current and future state of modern data integration and architecture. Data lakes are needed for data scientists, conformed layers are needed for downstream consumption, and semantic layers are needed for reporting purposes. As long as we all look at data and solutions as scalable, centralized, and re-usable, we are working towards a purpose that makes everyone’s job easier.

Interested in Strive’s Data & Analytics Practice?

Here at Strive Consulting, our subject matter experts team up with you to understand your core business needs, while taking a deeper dive into your organization’s growth strategy. Whether you’re interested in modern data integration or an overall data and analytics assessment, Strive Consulting is dedicated to being your partner, committed to success. Learn more about our Data & Analytics practice here.

Navigating the Storm When You’re Drowning in Data

The rise in demand for accurate, verifiable, up-to-the-second data has put a massive amount of pressure on companies’ existing data warehouse architecture. Many companies are struggling with large sets of data, what to do with it, and how to make sense of it all. They built out a large, robust data lake, but now they don’t know what to do with it; the data sets are too overwhelming for IT teams to handle. One could even say they’re drowning in data!

Luckily, our team here at Strive, in partnership with Snowflake, the data cloud platform, can be your port in the storm, bringing a much-needed life preserver to organizations through our expertise in right-sizing data elements and simplifying data challenges. Interested? Let’s walk through a recent example of doing just that.

This year, a Client partnered with Strive to help assess their current data warehouse initiative. The Client had multiple source systems (15+) and wanted a single source of truth for reporting. Drowning in manual, highly cumbersome tasks and long hours for their employees, the Client needed a more efficient process for data extraction, transformation, loading, and sharing.

Enter Strive.

Our approach was simple – assist with the building of a strategic vision of future business processes, while implementing a proactive, rather than reactive, approach to technology adoption. Strive introduced the Client to our trusted partner Snowflake as the data platform. Snowflake enables data storage, processing, and analytic solutions that are faster, easier to use, and far more flexible than traditional offerings. Additionally, the sharing functionality makes it easy for organizations to quickly share governed and secure data in real time. 

Knowing that executives and senior leadership wanted “better” reporting accuracy, but development staff wanted a more efficient way to accomplish these tasks, Strive had their work cut out for them. Our Data & Analytics team turned to the implementation of Strive’s proprietary ELT Accelerator to help build a dedicated reporting environment using Snowflake. We analyzed the Client’s current processes, conducted technical interview meetings, and identified opportunities to achieve the desired future state and enable a modern data integration platform capable of high-volume data transfer.

Using the ELT Accelerator, Strive was able to:

  • Set up an Azure Database for ELT Accelerator source tables and stored procedures
  • Customize Azure Data Factory settings to support new environments
  • Create a scheduled pipeline for recurring data extraction to the Snowflake data lake

The Extract, Load, and Transform (ELT) process leverages Snowflake’s built-in distributed query processing and eliminates the resources needed to transform data prior to loading, saving the Client headaches and unnecessary spending on compute resources. Once our team fully implemented the ELT tool, we helped our Client load millions of rows of data, drive additional insights, and ultimately enable the business to make better decisions.

So again, we ask you – is your organization drowning in data? By partnering with our team of trusted advisors, Strive can help you navigate the rough waters of any large data lake and calm the storm.

Interested in Strive’s ELT Accelerator?

Strive Consulting is a business and technology consulting firm, and proud partner of Snowflake, having direct experience helping our clients understand and maximize the benefits the Snowflake Data Platform presents. Our team of experts can work hand-in-hand with you to determine if leveraging Snowflake is right for your organization. Check out Strive’s additional Snowflake thought leadership here.

About Snowflake

Snowflake delivers the Data Cloud – a global network where thousands of organizations mobilize data with near-unlimited scale, concurrency, and performance. Inside the Data Cloud, organizations unite their siloed data, easily discover and securely share governed data, and execute diverse analytic workloads. Join the Data Cloud at Snowflake.com.