Adventures in Snowflake Cost Management

Pay-for-use is both an exciting and challenging aspect of using the Snowflake Data Cloud. I've led workshops and proudly proclaimed, "Snowflake has brought an end to capacity planning!" And it has. You never have to figure out how much storage or processing power you are going to need. You don't have to plan for three-year storage needs and hope you haven't bought too little. That planning was a constant dance, but no more. With Snowflake you can just add whatever data you need and pay only for what you use. The same is true for query processing power: when Black Friday hits, you have power on demand, yet you're not paying for that power all year long.

Now budget planning? That's a different story. Typically, you will have bought a certain size machine to run your database, or contracted for a certain amount of cloud capacity, and whether you use it a little or a lot, you pay the same. When you see your Snowflake costs skyrocket, you'll start to think about usage in ways you never had to before.

Here are some tips for being more efficient with your Snowflake spend.

Think Small, Run Big

Thinking time and development time should be spent on an X-Small or Small compute warehouse. When it comes time to run a job or a long query, that's when you spin up a larger warehouse, run the job, and then shut the warehouse down. You have capacity on demand, so size your warehouse to optimize both what Snowflake charges and your human capital. Why wait two hours on a long job when you can run it in 15 minutes on a warehouse eight times the size? For the most part, each step up in size doubles the credits per hour but cuts the run time in half. So a bigger warehouse is roughly cost neutral to Snowflake while saving human cost.
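As a sketch of that pattern (the warehouse name is hypothetical; the parameters are standard Snowflake options):

```sql
-- Develop on a small warehouse that suspends itself when idle.
CREATE WAREHOUSE IF NOT EXISTS dev_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND   = 60      -- suspend after 60 seconds of inactivity
  AUTO_RESUME    = TRUE;

-- When the big job is ready, scale up just for the run...
ALTER WAREHOUSE dev_wh SET WAREHOUSE_SIZE = 'XLARGE';

-- ...run the long job here...

-- ...then scale back down and suspend so the meter stops.
ALTER WAREHOUSE dev_wh SET WAREHOUSE_SIZE = 'XSMALL';
ALTER WAREHOUSE dev_wh SUSPEND;
```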

Sometimes even the Snowflake cost is lower when you run a more expensive, larger warehouse. How so? If the compute warehouse is too small, a query may have to spill intermediate data to local disk, or even to remote storage. Disk is a lot slower than RAM, and a larger warehouse comes with more RAM. The query or load can complete so much faster that the time saved more than offsets the higher hourly rate.
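One way to spot this is to look for queries that spilled; a sketch, using the standard ACCOUNT_USAGE views covered later in this article:

```sql
-- Recent queries that spilled to disk: candidates for a larger warehouse.
SELECT query_id,
       warehouse_name,
       warehouse_size,
       bytes_spilled_to_local_storage,
       bytes_spilled_to_remote_storage,
       total_elapsed_time / 1000 AS elapsed_seconds
FROM   snowflake.account_usage.query_history
WHERE  start_time > DATEADD('day', -7, CURRENT_TIMESTAMP())
  AND  (bytes_spilled_to_local_storage  > 0
   OR   bytes_spilled_to_remote_storage > 0)
ORDER BY bytes_spilled_to_remote_storage DESC;
```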

One Expandable Warehouse for All

It is typical for companies to assign each team or business unit its own warehouse; it's one way to manage cost charge-back. However, it's inefficient to have multiple warehouses with their meters running up charges when a single shared warehouse will do. To handle spikes in demand, you set the shared warehouse up as a multi-cluster warehouse that spawns additional clusters when demand rises and shrinks back when demand goes away. You use roles or tags to divvy up the shared cost across the teams using the warehouse.
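A minimal multi-cluster setup might look like this (the warehouse name is hypothetical; the parameters are standard Snowflake options):

```sql
-- One shared warehouse that scales out under load and back in when idle.
CREATE WAREHOUSE IF NOT EXISTS shared_wh
  WAREHOUSE_SIZE    = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1         -- normally run a single cluster
  MAX_CLUSTER_COUNT = 4         -- spawn up to 4 clusters under concurrency
  SCALING_POLICY    = 'STANDARD'
  AUTO_SUSPEND      = 60
  AUTO_RESUME       = TRUE;
```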

Break Large Load Files Into Many Smaller Ones

Snowflake is a massively parallel database. Each node in a Snowflake warehouse cluster has 8 load processes, so a Large warehouse, with 8 nodes, has 64 of them. If you try to load a single large file, only one of those processes is used. If you have the file broken up (Snowflake recommends chunks of roughly 100-250 MB compressed), then all 64 processes will work in parallel, rocketing your loading performance.
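As a sketch (the table, stage, and file pattern are hypothetical), a parallel load of many chunked files looks like this:

```sql
-- COPY INTO fans the staged files out across all of the load processes.
COPY INTO inventory_raw
FROM @inventory_stage
PATTERN = '.*inventory_part_.*[.]csv[.]gz'    -- many ~100-250 MB chunks
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);
```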

Judicious Use of Cluster Keys

Snowflake builds micro-partitions as data is loaded. For 90% of scenarios, you can just let Snowflake do its thing and you will get great performance. This is one of the reasons Snowflake is so cost-effective: it doesn't take an army of tuning DBAs to operate. However, there are going to be times when you will need to put a cluster key on a table to get the performance needed. And poorly performing queries cost extra money.

There was a 40-billion-row table joined to a 3-billion-row table in a view that brought reporting to its knees. Clustering both tables on the join keys enabled the report to run in less than 2 minutes. For more information on clustering, see Snowflake's documentation.
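In a case like that, the fix is a one-line change per table (table and column names here are invented for illustration):

```sql
-- Cluster both sides of the join on the join keys.
ALTER TABLE sales_fact     CLUSTER BY (item_id, store_id);
ALTER TABLE inventory_fact CLUSTER BY (item_id, store_id);

-- Check how well clustered a table is on those keys.
SELECT SYSTEM$CLUSTERING_INFORMATION('sales_fact', '(item_id, store_id)');
```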

Lift and Shift Still Needs Tuning

One of the common mistakes is to assume that "if it worked in the old system, it should work in Snowflake." You will encounter performance issues (and thus cost issues) whose solution will not lie in adjusting Snowflake warehouses.

Here are just some recent tuning scenarios I’ve encountered:

There was a data load costing $500 to $1,500 per day. Eight billion rows of inventory were loaded per day: every item in every store across the world. The loading procedure used a MERGE, so that was 8 billion searches to find the right row and update the data. And yet there was no history; once the merge happened, the current value was the only value. A merge wasn't needed at all. In effect, the table was a daily snapshot of inventory, and the incoming data was all that was needed. Removing the merge took the process from 8 hours on a very expensive 64-node Snowflake warehouse to a couple of minutes on a 32-node warehouse, a savings of $15k-$30k per month.
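The shape of that fix, sketched with hypothetical table names: replace the row-by-row MERGE with a full snapshot swap.

```sql
-- Before (slow): billions of key lookups to update rows nobody keeps history on.
-- MERGE INTO inventory AS t
-- USING inventory_stage AS s
--   ON t.item_id = s.item_id AND t.store_id = s.store_id
-- WHEN MATCHED THEN UPDATE SET t.qty = s.qty
-- WHEN NOT MATCHED THEN INSERT (item_id, store_id, qty)
--   VALUES (s.item_id, s.store_id, s.qty);

-- After (fast): the table is just a daily snapshot, so atomically replace it.
INSERT OVERWRITE INTO inventory
SELECT item_id, store_id, qty
FROM   inventory_stage;
```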

Just because "the query worked on XYZ database" doesn't mean everything is okay. A very expensive, long-running query on Snowflake was fixed by discovering a Cartesian join; once all the proper keys were added to the join, the query ran fast.

Oftentimes in mature systems there are views built upon views built upon views. A slow report sent me spelunking through the "view jungle." I discovered that one of the views joined to a table from which no fields were used, plus a DISTINCT to undo the row multiplication the join created. At half a billion rows, this unnecessary join, and thus unnecessary DISTINCT, caused the performance problem.
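A simplified before-and-after of that pattern (view and table names are invented for illustration; this assumes the join wasn't also being used to filter rows):

```sql
-- Before: joins a half-billion-row table it never selects from, then
-- needs DISTINCT to undo the fan-out the join created.
CREATE OR REPLACE VIEW order_summary AS
SELECT DISTINCT o.order_id, o.customer_id, o.order_total
FROM   orders o
JOIN   order_events e ON e.order_id = o.order_id;  -- no columns of e used

-- After: drop the join, and the DISTINCT it forced goes with it.
CREATE OR REPLACE VIEW order_summary AS
SELECT o.order_id, o.customer_id, o.order_total
FROM   orders o;
```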

The take away is that a good deal of the work will be taking a fresh look at the problem and not taking “the old system” as gospel for the new system.

Monitor the Spend

Snowflake has views to help you monitor cost and performance. They are located in the SNOWFLAKE database, in the ACCOUNT_USAGE schema. If you have multiple accounts, the combined values are in the ORGANIZATION_USAGE schema. There are prebuilt Tableau, Power BI, Sigma, and other dashboards you can download. There is no substitute, however, for getting familiar with the views themselves.
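For example, a quick look at credit burn by warehouse over the last 30 days, using the standard ACCOUNT_USAGE metering view:

```sql
-- Which warehouses are spending the credits?
SELECT   warehouse_name,
         SUM(credits_used) AS credits_last_30_days
FROM     snowflake.account_usage.warehouse_metering_history
WHERE    start_time > DATEADD('day', -30, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY credits_last_30_days DESC;
```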

Strive is a proud partner of Snowflake!

Strive Consulting is a business and technology consulting firm and proud partner of Snowflake, with direct experience helping our clients understand and maximize the benefits the Snowflake Data Platform presents. Our team of experts can work hand-in-hand with you to determine if leveraging Snowflake is right for your organization. Check out Strive's additional Snowflake thought leadership HERE.

ABOUT SNOWFLAKE

Snowflake delivers the Data Cloud – a global network where thousands of organizations mobilize data with near-unlimited scale, concurrency, and performance. Inside the Data Cloud, organizations unite their siloed data, easily discover and securely share governed data, and execute diverse analytic workloads. Join the Data Cloud at SNOWFLAKE.COM.

An Example of a Living Data Mesh: The Snowflake Data Marketplace

The enterprise data world has been captivated by a new trend: Data Mesh. The "What Is Data Mesh" articles have already come out, but in this piece I want to highlight a live, in-production, worldwide Data Mesh example: the Snowflake Data Marketplace.

As with every "new thing" that comes down the pike, people change the definition to suit their own purposes and point of view, and I am no different. Zhamak Dehghani, Director of Emerging Technologies at ThoughtWorks, writes that a Data Mesh must contain the following shifts:

  • Organization: From central controlled to distributed data owners. From enterprise IT to the domain business owners.
  • Technology: It shifts from technology solutions that treat data as a byproduct of running pipeline code to solutions that treat data and code that maintains it as one lively autonomous unit.
  • Value: It shifts our value system from data as an asset to be collected to data as a product to serve and delight the data users (internal and external to the organization).
  • Architecture: From central warehouses and data lakes to a distributed mesh of data products with a standardized interface. 

It is on this principle that I depart and advocate for the Snowflake Data Cloud. I believe the advantages that have always come with a centralized data store can be retained, while the infinite scale of Snowflake's Data Cloud facilitates the rest of the goals behind Data Mesh.

There is so much to understand about the new paradigm and its benefits, and even about what an up-and-running Data Mesh would look like, that to date even the simplified overview articles are lengthy. As I wrestled with my own understanding of Data Mesh, and with how Strive could bring our decades of successful implementations in all things data, software development, and organizational change management to bear, I was struck by a simple notion: there is already a great example of a successfully implemented, worldwide, multi-organization Data Mesh, the Snowflake Marketplace.

There are more than 1,100 data sets from more than 240 providers available to any Snowflake customer. Data sets from the Marketplace become part of the customer's own Snowflake account, yet they are managed and kept up to date by the providers. No ETL is needed, and no scheduling. When providers update their data, it is updated for all subscribers. This is the definition of "data as a product."

In effect, the Snowflake Data Cloud is the self-service, data-as-a-platform infrastructure, and the Snowflake Marketplace is the discovery and governance tool within it. Everyone who has published data into the Marketplace has become a product owner and delivered data as a product.

We can see the promised benefit of the Snowflake Marketplace as a Data Mesh in one thing: massive scalability. I'm not speaking of the Snowflake platform's near-infinite scalability, impressive as that is, but of how every team publishing data into the Marketplace has been able to do so without the cooperation of any other team. None of the teams that have published data has had to wait in line for their priorities to bubble up to the top of IT's agenda. A thousand new teams can publish data today. A hundred thousand new teams can publish their data tomorrow.

This meets the organizational shift from centralized control to decentralized domain ownership, the value shift to data as a product, and the technology shift to data and the code that maintains it living together as one unit.

Data consumers can go to the Marketplace and find the data they need, regardless of which organization created it. If it's in the Snowflake Marketplace, any Snowflake customer can use it for their own needs. Each consumer of the data brings their own compute, so nobody's use of the data impacts or slows down another team's dashboards.
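As a sketch (the database, schema, table, and warehouse names are hypothetical), a Marketplace listing mounts as a read-only database in the consumer's account, and querying it looks like querying any other table, on your own warehouse:

```sql
-- Each consumer pays for, and is isolated by, their own compute.
USE WAREHOUSE analytics_wh;

SELECT region, observation_date, avg_temp_c
FROM   weather_share.public.daily_observations  -- provider-managed, always current
WHERE  observation_date >= DATEADD('day', -7, CURRENT_DATE());
```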

Imagine that instead of weather data published by AccuWeather and financial data by Capital One, it's your own organization's customer, employee, marketing, and logistics data. Each data set is owned by the business team that creates the data. They are the team that knows the data best. They curate, cleanse, and productize the data themselves, on their own schedule and with their own resources. That data is then discoverable and usable by anyone else in the enterprise (gated by role-based security). Imagine that you can scale as your business demands, as new businesses are acquired, as ideation for new products occurs. All facilitated by IT, but never hindered by IT as a bottleneck.

With Snowflake’s hyper scalability and separation of storage and compute, and its handling of structured, semi-structured, and unstructured data, it’s the perfect platform to enable enterprise IT to offer “data as self-serve infrastructure” to the business domain teams. From there, it is a small leap to see how the Snowflake Data Marketplace is, in fact, a living example of a Data Mesh with all the benefits realized in Zhamak Dehghani’s papers.

As a data practitioner with over three decades of my own experience, I am as excited today as ever to see the continuous evolution of how we get value out of data and deal with the explosion in data types and volumes. I welcome Data Mesh and the innovations it promises, along with Data Vault 2.0 and hyper-scale cloud databases like Snowflake, to facilitate the scale and speed-to-value of today's data environment.


Have Your Data and Query It Too!

“Have your cake and eat it too.” How would it make sense to have cake and not be able to eat it? And yet, we have, for decades, had similar experiences with enterprise data warehouses. We have our data; we want to query it too!

Organizations spend so much time, effort, and money building a single source of truth. Millions of dollars go to hardware and software, and then there is the cleansing, collating, aggregating, and applying of business rules to the data. Yet when it comes time to query, we pull data out of the enterprise data warehouse and put it into data marts, because there is simply never enough power to service everybody who wants to query the data.

With the Snowflake Data Cloud, companies of all sizes can store their data in one place – and every department, every team, every individual can query that data. No more need for the time, expense, effort, and delay to move data out of an enterprise data warehouse and into data marts.

The advent of the "data lake" promised a place where all enterprise data could be stored. Structured, semi-structured, and unstructured data could live together, cost-effectively. And yet, as so many soon found out, data still ended up needing to be moved out to achieve the desired query performance. More data marts, more cost, and more delay in getting to business insight.

Snowflake solved this problem by separating data storage from compute. Departments and teams can each have their own virtual warehouse, a separate query compute engine sized appropriately for each use case. These query engines do not interfere with one another: your data science team can run massive, complex queries without impacting the accounting team's dashboards.
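A sketch of that isolation (the warehouse names are hypothetical; both teams read the same tables):

```sql
-- Two engines, one copy of the data; neither slows the other down.
CREATE WAREHOUSE IF NOT EXISTS data_science_wh
  WAREHOUSE_SIZE = 'XLARGE' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;

CREATE WAREHOUSE IF NOT EXISTS accounting_wh
  WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;

-- Each team points its sessions at its own warehouse.
USE WAREHOUSE accounting_wh;
```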

Snowflake can do this because it was designed for the cloud from the ground up. A massively parallel processing database, Snowflake was built to use the cloud infrastructure and services of AWS, quickly followed by Azure and GCP. Organizations get all the scalability promised by "Hadoop-based Big Data" in an easy-to-use, ANSI-standard SQL data warehouse that delivers the five V's of big data (Volume, Value, Variety, Velocity, and Veracity). Not to mention, all of these benefits come with industry-leading cost and value propositions.

Speaking of Variety: Snowflake has broken out of the "data warehouse" box and become "The Data Cloud." All your data types: structured, semi-structured, and now unstructured. All your workloads: Data Warehouse, Data Engineering, Data Science, Data Lake, Data Applications, and Data Marketplace. You have scalability in data volume and in query compute engines across all types of data and use cases.

With the Snowflake Data Cloud, you truly can have all your data and query it too, extracting business value for all departments and all employees along the way.

 

Want to learn more about the Snowflake Data Cloud? 

Strive Consulting is a business and technology consulting firm and proud partner of Snowflake, with direct experience helping our clients understand and maximize the benefits the Snowflake Data Platform presents. Our team of experts can work hand-in-hand with you to determine if leveraging Snowflake is right for your organization. Check out Strive's additional Snowflake thought leadership HERE.

