Modern data orchestration using Dagster

From 0 to hero for the modern data stack. A series of blog posts.

Welcome 😄 to this blog series about the modern data stack ecosystem.

This blog series gives an overview of how dagster, as a central piece of the modern data stack, can easily interface with and orchestrate tools like:

  • Airbyte for ingesting data using connectors for various services
  • dbt for ETL-style SQL, but in a modern define-once, test, and reuse fashion
  • Jupyter notebooks in the domain of data science

With such an E2E integration capability, tool silos that adversely affect data governance, data quality and lineage of data assets are a matter of the past.

disclaimer

The following blog series was inspired by (and code for some of the examples is derived from):

Why another blog post on this topic? Well, firstly, one does not really learn a topic just by reading posts from other people; hands-on experimentation with the new technology and concrete examples are required. Secondly, this will hopefully be a good reference for me so I do not forget how dagster works.

Moreover, and most importantly, the official documentation and various pre-existing blog posts sometimes show fantastic examples. However, unless you have a cloud-based deployment of dagster and the required components (databases, APIs, connectors, SaaS services, blob storage, …) at hand, it can be hard to follow along.

Therefore, the examples in this post series are all structured to easily allow local experimentation (even in old-school enterprise scenarios where perhaps the cloud is still not yet a thing).

Why dagster? A great description of why not to use Apache Airflow is We're All Using Airflow Wrong and How to Fix It by Bluecore. TL;DR: operator madness with connectors of varying quality. No native notion of moving data/assets from one task to the next. No handling of resources, and therefore testability madness. Dagster changes this with a code-first approach, easy testability, and by not mixing orchestration with business logic. Furthermore, Lyft brought up some great points regarding reproducibility and resource isolation as weaknesses of Airflow.
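
To make the testability argument concrete, here is a minimal sketch (the resource and op names are made up for illustration) of how dagster separates business logic from the resources it needs, so a test can simply swap in a stub:

from dagster import build_op_context, job, op, resource

@resource
def fake_warehouse(_):
    # made-up in-memory stand-in for a real database connection
    return {"orders": [1, 2, 3]}

@op(required_resource_keys={"warehouse"})
def count_orders(context):
    return len(context.resources.warehouse["orders"])

@job(resource_defs={"warehouse": fake_warehouse})
def orders_job():
    count_orders()

def test_count_orders():
    # the op is exercised against a stubbed resource, no real warehouse required
    context = build_op_context(resources={"warehouse": {"orders": [1, 2, 3]}})
    assert count_orders(context) == 3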

A big thank you goes to Sandy Ryza as a co-author of these posts. He helped answer my questions when getting started with dagster and also helped simplify certain code examples (i.e. getting rid of external dependencies on cloud resources). To be fully transparent: he works at the company building dagster.

post series overview

  1. basic introduction (from hello-world to simple pipelines)
  2. assets: turning the data pipeline inside out using software-defined assets to focus on what we care about, the curated data assets, rather than the intermediate transformations (see the sketch after this list)
  3. a more fully-fledged example integrating multiple components, including resources, an API, and dbt
  4. integrating Jupyter notebooks into the data pipeline using dagstermill
  5. working on scalable data pipelines with pyspark
  6. ingesting data from foreign sources using Airbyte connectors
  7. SFTP sensor reacting to new files
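
As a tiny preview of the asset-based approach (the asset names and values below are made up for illustration), a downstream asset simply declares its upstream assets as function arguments, and dagster derives the dependency graph and lineage:

from dagster import asset

@asset
def raw_orders():
    # made-up ingestion step, e.g. data fetched via an Airbyte connection
    return [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]

@asset
def order_summary(raw_orders):
    # the dependency on raw_orders is derived from the argument name
    return {"count": len(raw_orders), "total": sum(o["amount"] for o in raw_orders)}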

The source code is available here: https://github.com/geoHeil/dagster-asset-demo

Requirements:

  • miniconda (https://docs.conda.io/en/latest/miniconda.html) is installed, is on your PATH, and has connectivity (direct or indirect via an artifact store) to install the required packages
  • optionally: docker (required for some of the later more complex examples)
  • git (to clone and access the example code)
  • make to execute the makefile

To be prepared for this tutorial execute:

git clone git@github.com:geoHeil/dagster-asset-demo.git
cd dagster-asset-demo

# prepare mamba https://github.com/mamba-org/mamba
conda activate base
conda install -y -c conda-forge mamba
conda deactivate

make create_environment

# follow the instructions below to set the DAGSTER_HOME
# and perform an editable installation (if you want to toy around with this dummy pipeline)
conda activate dagster-asset-demo
pip install --editable .
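
# DAGSTER_HOME example (illustrative path only, adjust to your setup):
# export DAGSTER_HOME="$(pwd)/dagster_home"
# mkdir -p "$DAGSTER_HOME"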

make dagit
# explore: Go to http://localhost:3000

# optionally (in a 2nd terminal; do not forget to activate the conda environment again)
# run the daemon to use schedules and backfills:
dagster-daemon run
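
As a teaser for what the daemon enables, a minimal schedule could look like the following sketch (job, op, and cron expression are made up for illustration); once it is registered in the repository, the daemon evaluates it and launches runs:

from dagster import ScheduleDefinition, job, op

@op
def say_hello():
    return "hello"

@job
def hello_job():
    say_hello()

# the dagster-daemon evaluates the cron expression and triggers runs of hello_job
hello_schedule = ScheduleDefinition(job=hello_job, cron_schedule="0 6 * * *")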

More involved examples, such as the one for Airbyte, might require access to Docker, which is needed to easily spin up containers for databases or further services.

The necessary docker-compose.yml file is contained in the example code. Instructions on how to use it are part of the separate posts which require such additional resources to follow along.

When debugging dagster, the interactivity of a Jupyter-notebook-based environment might be helpful.

This post explores the integration of the two in order to directly interact with a running dagster instance.
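
As a small taste, the sketch below (assuming DAGSTER_HOME points at the same instance dagit uses) shows how a notebook cell could query the running instance for its most recent runs:

# inside a Jupyter notebook cell
from dagster import DagsterInstance

instance = DagsterInstance.get()  # resolves the instance via $DAGSTER_HOME
for run in instance.get_runs(limit=5):
    print(run.run_id, run.status)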

learning more about the ecosystem

There are many more topics to cover beyond the scope of this simple introductory tutorial. The official dagster documentation contains some good examples. In particular, https://docs.dagster.io/guides/dagster/example_project is highly recommended to learn more.

The YouTube channel of dagster hosts additional great content. In particular, the community meetings like

can contain valuable further ideas to improve data pipelines.

Areas not covered by this series that might be of interest:

Furthermore, an interesting discussion recently evolved around tables vs. streams:

summary

❤️ Hopefully, you are inspired to experiment with the modern data stack tools. By following along with this series of posts and the accompanying source code, you should be able to:

  • improve the quality of your data pipelines
  • increase the quality of your data assets
  • get an understanding of the end-to-end lineage between the various data assets
Georg Heiler
Researcher & data scientist

My research interests include large geo-spatial time and network data analytics.