


The Strategic Case for Open-Source Data Engineering
by Ginger Grant | August 29, 2025
If your company has migrated data platforms two or three times in the past decade, you are tired of the disruption and cost. Data engineering forces companies into an expensive cycle: invest heavily in proprietary platforms, train teams extensively and then repeat this process when vendors discontinue products or technology shifts make current solutions obsolete.
Data Transformation Tools Have a 10-Year Lifespan
Data transformation platforms create a predictable business challenge. Informatica, DataStage, DTS, SSIS, and Oracle Warehouse Builder dominated enterprise data transformation for years, yet most have either disappeared or lost market relevance. These tools have a shelf life of about ten years before new ones replace them, and each transition forces companies to restart their investment cycle: new licensing costs, extensive retraining programs, and productivity losses during the migration period.
Proprietary Tools Limit Options
These proprietary tools are designed to work with specific databases or clouds, making it expensive to switch providers or adopt new technologies. Periodically, the people who use these tools are tasked with learning a new product to do data transformation, a task that itself has not changed.
Vendor-Agnostic Tools Might Be a Solution
Ideally, companies could teach their data staff a vendor-agnostic tool or language for transforming data. That way, when technology changes, the same code and skills would carry over, and it would be easier to switch to the vendor with the best price and feature set for transforming data.
Open-Source Advantages
Over the past few decades, creators of open-source technologies have worked to build vendor-agnostic tools for transforming data. Python has emerged as a leader among open-source tools because its libraries extend its functionality to many different tasks, including data manipulation. Python is further extended by the Spark libraries (PySpark), which are specifically designed to process large volumes of data efficiently. Together, they provide an open-source language and set of libraries that can be used with data across many platforms. No longer do companies need to be locked into a vendor-specific tool that will become obsolete in 5 to 10 years, and company leaders can be more confident that time invested in a tool will pay off over the long term rather than constantly chasing the next one.
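As a rough sketch of what vendor-agnostic transformation logic looks like, the example below uses only the Python standard library, with comments noting the equivalent PySpark DataFrame calls that would run unchanged on any Spark-based platform. The data, column names, and tax rate are hypothetical, chosen purely for illustration.

```python
# A minimal, vendor-agnostic transformation sketch in plain Python.
# Comments show the equivalent PySpark DataFrame calls; the records,
# column names, and tax rate are hypothetical.

orders = [
    {"order_id": 1, "region": "east", "amount": 120.0},
    {"order_id": 2, "region": "west", "amount": -5.0},   # bad record
    {"order_id": 3, "region": "east", "amount": 80.0},
]

# Filter out invalid rows.
# PySpark equivalent: df.filter(df.amount > 0)
valid = [row for row in orders if row["amount"] > 0]

# Derive a new column.
# PySpark equivalent: df.withColumn("amount_with_tax", df.amount * 1.08)
for row in valid:
    row["amount_with_tax"] = round(row["amount"] * 1.08, 2)

# Aggregate by region.
# PySpark equivalent: df.groupBy("region").agg(F.sum("amount"))
totals = {}
for row in valid:
    totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]

print(totals)  # {'east': 200.0}
```

Because the logic is filter, derive, aggregate regardless of vendor, the same mental model (and, with PySpark, the same code) transfers between platforms.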
Microsoft Moving to Open Source
Microsoft has fully embraced the open-source trend and has implemented it in its applications and development tools. Microsoft encourages the use of open-source coding languages, includes the open-source Jupyter notebook interface, and promotes the Medallion Architecture design pattern. Nowhere is this trend more prominent than in its latest data tool, Microsoft Fabric. Fabric stores data in the open-source Delta Lake format, which builds on Parquet files. Microsoft's previous data platform, Synapse, included .NET as a data language, but that support was dropped from Fabric; Microsoft Fabric now supports Scala, R, and Python. AWS Glue, Google's Dataproc, and Databricks also use these languages and storage formats. While each platform requires some proprietary knowledge, they all use the same data processing engine and libraries, Spark. Note that Databricks runs its own optimized version of Spark, which, while similar, can prevent code developed in its tool from being easily migrated. Moving to open source is a big change for Microsoft and for data engineers, who must learn the open-source tools. The payoff should be that companies will not need to retrain people as often, and it should be easier to move to a different technology stack.
Design Problems in Data Transformation
Data transformation processes, commonly known as Extract, Transform and Load (ETL) pipelines, have been in use for at least the last 25 years. Databricks studied common ETL implementations and found that, because incoming data is often inconsistent, the pipelines tended to be neither flexible nor scalable. Taking advantage of newer technologies such as cheap storage, Databricks developed a design pattern to address those problems.
Medallion Architecture – Learning from the Past
Databricks created the Medallion Architecture to provide a framework for designing data transformation processes. It consists of three layers: a bronze layer where all data is stored unchanged, a silver layer where the data is cleansed and transformed, and a gold layer where the final version is released to users. Databricks has published several white papers on this multi-layered architecture showing how it can improve data quality. Microsoft embraced the idea and heavily promotes Medallion Architecture within Microsoft Fabric, encouraging its adoption.
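The three layers can be sketched as plain functions, each consuming the previous layer's output. This is a schematic illustration only: real Medallion pipelines typically persist each layer as Delta tables and run the transformations in Spark, and the record shapes and field names below are hypothetical.

```python
# Schematic Medallion Architecture: bronze -> silver -> gold.
# Real pipelines persist each layer as Delta tables and use Spark;
# the record shapes and field names here are hypothetical.

def bronze(raw_records):
    """Bronze: land all data exactly as received, good or bad."""
    return list(raw_records)

def silver(bronze_records):
    """Silver: cleanse and standardize (drop bad rows, fix types)."""
    cleaned = []
    for rec in bronze_records:
        if rec.get("amount") is None:
            continue  # drop records missing a required field
        cleaned.append({"region": rec["region"].strip().lower(),
                        "amount": float(rec["amount"])})
    return cleaned

def gold(silver_records):
    """Gold: business-ready aggregate released to report users."""
    totals = {}
    for rec in silver_records:
        totals[rec["region"]] = totals.get(rec["region"], 0.0) + rec["amount"]
    return totals

raw = [{"region": " East ", "amount": "120"},
       {"region": "west", "amount": None},   # bad record lands in bronze anyway
       {"region": "east", "amount": "80"}]

report = gold(silver(bronze(raw)))
print(report)  # {'east': 200.0}
```

Keeping the raw data untouched in bronze means a bug found in the silver or gold logic can be fixed and replayed without re-extracting anything from the source systems.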
Business Impact and ROI
The increased adoption of open source offers many benefits to corporations and individuals who spend time and money on tools and languages, some of which have a noticeably short shelf life. Adopting open-source technologies decouples the investment from any specific vendor, meaning the knowledge and code can be transferred to other platforms. Over time, companies will reduce training costs because teams develop transferable skills across data platforms. Companies will also have more flexibility in hiring, because the skills candidates need are open-source tools rather than specialized proprietary platform experience. Code and expertise built on open standards will allow companies to migrate to new platforms and technologies in the future. Microsoft's embrace of open source in Fabric means companies can be confident that, when using Fabric, they are taking advantage of the best of the open-source community combined with Microsoft's existing reporting product, Power BI.