Why Choose Open-Source Technologies?

In 2022, almost every enterprise has some cloud footprint, especially around their data. These cloud platforms offer closed-source tools which, while offering many benefits, may not always be the best choice for some organizations. First and foremost, these proprietary services can be expensive. In addition to paying for the storage and compute needed to store and access data, you also end up paying for the software itself. You could also become locked into a multi-year contract, or you might find yourself locked into a cloud’s tech stack. Once that happens, it’s very difficult (and expensive) to migrate to a different technology or re-tool your tech stack. To put it simply, if you ever reach a roadblock your closed-source tool can’t solve, there may be no workarounds.

Since closed-source technologies can create a whole host of issues, open-source technologies may be the right choice. Open-source tech is not owned by anyone. Rather, anyone can use, repackage, and distribute the technology. Several companies have monetized open-source technology by packaging and distributing it in innovative ways. Databricks, for example, built a platform on Apache Spark, a big-data processing framework. In addition to providing Spark as a managed service, Databricks offers a lot of other features that organizations find valuable. However, a small organization might not have the capital or the use case that a managed service like Databricks aims to solve. Instead, you can deploy Apache Spark on your own server or a cloud compute instance and have total control. This is especially attractive when addressing security concerns. An organization can benefit from a tool like Spark without having to involve a third party and risk exposing data to the third party.

Another benefit is fine-tuning resource provisioning.

Because you’re deploying the code on your own server or compute instance, you can configure the specifications however you want. That way, you can avoid over-provisioning or under-provisioning. You can even manage scaling, failover, redundancy, security, and more. While many managed platforms offer auto-scaling and failover, it is never so granular as it is when you provision resources yourself.

Many proprietary tools, specifically ETL (Extract, Transfer, Load) and data integration tools, are no-code GUI based solutions that require some prior experience to be implemented correctly. While the GUIs are intended to make it easier for analysts and less-technical people to create data solutions, more technical engineers can find it frustrating. Unfortunately, as the market becomes more inundated with new tools, it can be difficult to find proper training and resources. Even documentation can be iffy! Open-source technologies can be similarly peculiar, but it’s entirely possible to create an entire data stack – data engineering, modeling, analytics, and more – all using popular open-source tech. These tools will almost certainly lack a no-code GUI but are compatible with your favorite programming languages. Spark supports Scala, Python, Java, SQL and R, so anyone who knows one of those skills can be effective using Spark.

But how does this work with cloud environments?

You can choose how much of the open-source stack you want to incorporate. A fully open-source stack would simply be running all your open-source data components on cloud compute instances: database, data lake, ETL, data warehouse, and analytics all on virtual machine(s). However, that’s quite a bit of infrastructure to set up, so it may make sense to unload some parts to cloud-native technologies. Instead of creating and maintaining your own data lake, it would make sense to use AWS S3, Azure Data Lake Storage gen2, or Google Cloud Storage. Instead of managing a compute instance for a database, it would make sense to use AWS RDS, Azure SQL DB, or Google Cloud SQL and use an open-source flavor of database like MySQL or MariaDB. Instead of managing a Spark cluster, it might make sense to let the cloud manage the scaling, software patching, and other maintenance, and use AWS EMR, Azure HDInsight, or Google Dataproc. You could also abandon the idea of using compute instances and architect a solution using a cloud’s managed open-source offerings: AWS EMR, AWS MWAA, AWS RDS, Azure Database, Azure HDInsight, GCP’s Dataproc and Cloud Composer, and those are just data-specific services. As mentioned before, these native services bear some responsibility for maintaining the compute/storage, software version, runtimes, and failover. As a result, the managed offering will be more expensive than doing it yourself, but you’re still not paying for software licensing costs.

In the end, there’s a tradeoff.

There’s a tradeoff between having total control and ease of use, maintenance, and cost optimization, but there is a myriad of options for building an open-source data. You have the flexibility to host it on-premises or in the cloud of your choice. Most importantly, you can reduce spend significantly by avoiding software licensing costs.


Interested in Learning More About Open-Source Technologies? 

Here at Strive Consulting, our subject matter experts’ team up with you to understand your core business needs, while taking a deeper dive into your organization’s growth strategy. Whether you’re interested in modern data integration or an overall data and analytics assessment, Strive Consulting is dedicated to being your partner, committed to success. Learn more about our Data & Analytics practice HERE.

Contact Us