The rapid advancement of big data technologies has underscored the need for robust and efficient data processing solutions. Traditional Spark-based Platform-as-a-Service (PaaS) solutions, such as Databricks and Amazon Web Services Elastic MapReduce, provide powerful analytics capabilities but often result in high operational costs and vendor lock-in issues. These platforms, while user-friendly, can lead to significant inefficiencies due to their cost structures and lack of transparent pricing. This paper introduces a cost-effective and flexible orchestration framework using Dagster. Our solution aims to reduce dependency on any single PaaS provider by integrating various Spark execution environments. We demonstrate how Dagster’s orchestration capabilities can enhance data processing efficiency, enforce best coding practices, and significantly reduce operational costs. In our implementation, we achieved a 12% performance improvement over EMR and a 40% cost reduction compared to DBR, translating to over 300 euros saved per pipeline run. Our goal is to provide a flexible, developer-controlled computing environment that maintains or improves performance and scalability while mitigating the risks associated with vendor lock-in. The proposed framework supports rapid prototyping and testing, which is essential for continuous development and operational efficiency, contributing to a more sustainable model of large data processing.
TLDR: Spark-based data PaaS solutions are convenient. But they come with their own set of challenges such as a high vendor lock-in and obscured costs. We show how to use a dedicated orchestrator (dagster-pipes). It can not only make Databricks an implementation detail but also save cost. Also, it improves developer productivity. It allows you to take back control.
The big data landscape is changing fast. Spark-based PaaS solutions are part of this. They offer both convenience and power for data processing and analytics. But, this ease has downsides. These include lock-in risks, hidden operating costs, and scope creep. They all reduce developer productivity. This can lead to bad choices. It makes it easy to spend resources without understanding the cost. This often leads to inflated spending. In fact, most commercial platforms try to become all-encompassing. By this they are violating the unix philosophy of doing one thing - but one thing well. This results in further lock-in.
The data we are processing at ASCII is huge. One of the largest datasets we use is Commoncrawl. Every couple of months 400 TiB uncompressed and about 88 TiB of compressed data is added to the dataset. This means optimizing for performance and cost is crucial.
Inspired by the integration of Dagster, dbt and DuckDB for medium-sized datasets. This post shows how scaling this concept to extremely large-scale datasets works, whilst building on the same principles of:
We use Dagster`s remote execution integration (dagster-pipes). It abstracts the specific execution engine. We support the following flavours of Apache Spark:
This lets us make the Spark-based data PaaS platforms an implementation detail. It saves cost and boosts developer productivity. It also reduces lock-in. In particular, it lets us mix in Databricks’ extra capabilities where needed. But, we can run most workloads on EMR for less money. We can do this without changing the business logic. Also, following software engineering best practices for non-notebook development becomes very easy again. This results in a more maintainable codebase. We use ASCII in production. For it, we can observe huge cost savings due to: