Reduce the cost of spark pipeliens by using a dedicated orchestrator and playing the cloud vendors against each other.
TLDR: Spark-based data PaaS solutions are convenient. But they come with their own set of challenges such as a high vendor lock-in and obscured costs. We show how to use a dedicated orchestrator (dagster-pipes). It can not only make Databricks an implementation detail but also save cost. Also, it improves developer productivity. It allows you to take back control.
The big data landscape is changing fast. Spark-based PaaS solutions are part of this. They offer both convenience and power for data processing and analytics. But, this ease has downsides. These include lock-in risks, hidden operating costs, and scope creep. They all reduce developer productivity. This can lead to bad choices. It makes it easy to spend resources without understanding the cost. This often leads to inflated spending. In fact, most commercial platforms try to become all-encompassing. By this they are violating the unix philosophy of doing one thing - but one thing well. This results in further lock-in.
The data we are processing at ASCII is huge. One of the largest datasets we use is Commoncrawl. Every couple of months 400 TiB uncompressed and about 88 TiB of compressed data is added to the dataset. This means optimizing for performance and cost is crucial.
Inspired by the integration of Dagster, dbt and DuckDB for medium-sized datasets. This post shows how scaling this concept to extremely large-scale datasets works, whilst building on the same principles of:
We use Dagster`s remote execution integration (dagster-pipes). It abstracts the specific execution engine. We support the following flavours of Apache Spark:
This lets us make the Spark-based data PaaS platforms an implementation detail. It saves cost and boosts developer productivity. It also reduces lock-in. In particular, it lets us mix in Databricks’ extra capabilities where needed. But, we can run most workloads on EMR for less money. We can do this without changing the business logic. Also, following software engineering best practices for non-notebook development becomes very easy again. This results in a more maintainable codebase. We use ASCII in production. For it, we can observe huge cost savings due to:
Also a small link to the demo: