Cloud arbitrage for spark pipelines


Reduce the cost of spark pipeliens by using a dedicated orchestrator and playing the cloud vendors against each other.

Jun 21, 2024 12:00 AM — 12:00 AM

TLDR: Spark-based data PaaS solutions are convenient. But they come with their own set of challenges such as a high vendor lock-in and obscured costs. We show how to use a dedicated orchestrator (dagster-pipes). It can not only make Databricks an implementation detail but also save cost. Also, it improves developer productivity. It allows you to take back control.


The big data landscape is changing fast. Spark-based PaaS solutions are part of this. They offer both convenience and power for data processing and analytics. But, this ease has downsides. These include lock-in risks, hidden operating costs, and scope creep. They all reduce developer productivity. This can lead to bad choices. It makes it easy to spend resources without understanding the cost. This often leads to inflated spending. In fact, most commercial platforms try to become all-encompassing. By this they are violating the unix philosophy of doing one thing - but one thing well. This results in further lock-in.

The data we are processing at ASCII is huge. One of the largest datasets we use is Commoncrawl. Every couple of months 400 TiB uncompressed and about 88 TiB of compressed data is added to the dataset. This means optimizing for performance and cost is crucial.

Inspired by the integration of Dagster, dbt and DuckDB for medium-sized datasets. This post shows how scaling this concept to extremely large-scale datasets works, whilst building on the same principles of:

  • Containerization & testability & taking back control
  • Partitions and powerful orchestration & easy backfills
  • Developer productivity
  • Cost control

We use Dagster`s remote execution integration (dagster-pipes). It abstracts the specific execution engine. We support the following flavours of Apache Spark:

  • Pyspark
  • Databricks
  • EMR

This lets us make the Spark-based data PaaS platforms an implementation detail. It saves cost and boosts developer productivity. It also reduces lock-in. In particular, it lets us mix in Databricks’ extra capabilities where needed. But, we can run most workloads on EMR for less money. We can do this without changing the business logic. Also, following software engineering best practices for non-notebook development becomes very easy again. This results in a more maintainable codebase. We use ASCII in production. For it, we can observe huge cost savings due to:

  • Flexible environment selection: One job processed commoncrawl data in spark on a single partition. It cost over 700€ on Databricks. There was approximately a 50% Databricks markup for convenience features. Now we only pay less than 400€.
  • Developer productivity & taking back control: Using pyspark locally on small sample data allows rapid prototyping. No need to wait 10 minutes for cloud VMs to spin up. Furthermore, this allows for a fast development cycle & feedback loop.
  • Flexible orchestration: We can easily add partitions and orchestrate steps. We can do this on-premise and in the cloud.

See the full post for details.

Georg Heiler
Georg Heiler
Researcher & data scientist

My research interests include large geo-spatial time and network data analytics.

Hernan Picatto
Hernan Picatto
Researcher & data scientist

I’m interested in causal inference and forecasting of high-frequency time series data, with a special emphasis on extreme events.