Modern data orchestration using Dagster
Welcome 😄 to this blog series about the modern data stack ecosystem.
This blog series will give an overview of how dagster, as a central piece of the modern data stack, can easily interface with and orchestrate tools like:
- Airbyte for ingesting data using connectors for various services
- DBT for ETL-style SQL transformations in a modern define-once, test, and reuse fashion
- jupyter notebooks in the domain of data science
With such an E2E integration capability, tool silos that adversely affect data governance, data quality and lineage of data assets are a matter of the past.
disclaimer
The following blog series was inspired by (and code for some of the examples is derived from):
- https://dagster.io/blog/rebundling-the-data-platform
- https://dagster.io/blog/dagster-0-14-0-never-felt-like-this-before
- https://dagster.io/blog/software-defined-assets
- https://www.sspaeti.com/blog/analytics-api-with-graphql-the-next-level-of-data-engineering/
- and the official dagster documentation https://docs.dagster.io/concepts, including the examples in code at https://github.com/dagster-io/dagster
Why another blog post on this topic? Well, firstly, one only really learns a topic not by reading other people's posts, but through hands-on experimentation with a new technology and concrete examples. Secondly, hopefully this will be a good reference for me not to forget how dagster works.
Moreover, and most importantly, the official documentation and various pre-existing blog posts sometimes show fantastic examples. However, unless you have a cloud-based deployment of dagster and the required components (databases, APIs, connectors, SaaS services, blob storage, …) at hand, it can be hard to follow along.
Therefore, the examples in this post series are all structured to easily allow local experimentation (even in old-school enterprise scenarios where perhaps the cloud is still not yet a thing).
Why dagster? A great description of why not to use Apache Airflow is We're All Using Airflow Wrong and How to Fix It by Bluecore. TLDR: operator madness with varying quality of the connectors, no native notion of moving data/assets from one task to the next, and no handling of resources - therefore testability madness. Dagster changes this with a code-first approach, easy testability, and by not mixing orchestration and business logic. Furthermore, Lyft brought up some great points regarding reproducibility & resource isolation as weaknesses of Airflow.
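To make the code-first claim a bit more tangible, here is a minimal sketch of a dagster job (the names are illustrative and not taken from the example repository): business logic lives in plain Python functions decorated as ops, the wiring happens separately in the job, and the whole thing can be executed in-process inside a unit test without any scheduler running.

```python
from dagster import job, op


@op
def fetch_greeting() -> str:
    # plain business logic: no orchestration concerns leak in here
    return "hello, modern data stack"


@op
def print_greeting(greeting: str) -> None:
    # the output of the upstream op arrives as a regular function argument
    print(greeting)


@job
def hello_job():
    # the wiring of ops is separate from the business logic itself
    print_greeting(fetch_greeting())


def test_hello_job():
    # jobs are testable in-process, e.g. in a plain pytest test
    assert hello_job.execute_in_process().success
```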
A big thank you goes to Sandy Ryza as a co-author of these posts. He helped answer my questions when getting started with dagster and furthermore helped simplify certain code examples (e.g. getting rid of external dependencies on cloud resources). To be fully transparent, I have to disclose that he works at the company building dagster.
post series overview
- basic introduction (from hello-world to simple pipelines)
- assets: turning the data pipeline inside out using software-defined-assets to focus on the things we care about: the curated data assets and not the intermediate transformations (a minimal sketch follows after this list)
- a more fully-fledged example integrating multiple components including resources, an API as well as DBT
- integrating jupyter notebooks into the data pipeline using dagstermill
- working on scalable data pipelines with pyspark
- ingesting data from foreign sources using Airbyte connectors
- SFTP sensor reacting to new files
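As a small teaser for the assets post: with software-defined assets you declare the data assets you care about as decorated Python functions, and dagster derives the dependency graph from the function signatures. The sketch below uses made-up asset names and is not part of the example repository:

```python
from dagster import asset


@asset
def raw_orders():
    # in a real pipeline this data would e.g. be ingested via Airbyte
    return [
        {"id": 1, "amount": 20},
        {"id": 2, "amount": 300},
    ]


@asset
def large_orders(raw_orders):
    # the raw_orders argument declares the dependency on the upstream asset
    return [o for o in raw_orders if o["amount"] > 100]
```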
The source code is available here: https://github.com/geoHeil/dagster-asset-demo
Requirements:
- miniconda: https://docs.conda.io/en/latest/miniconda.html is installed, is on your path, and has connectivity (direct or indirect via an artifact store) to install the required packages
- optionally:
  - docker (required for some of the later, more complex examples)
  - git (to clone and access the example code)
  - make (to execute the makefile)
To be prepared for this tutorial, execute:
git clone git@github.com:geoHeil/dagster-asset-demo.git
cd dagster-asset-demo
# prepare mamba https://github.com/mamba-org/mamba
conda activate base
conda install -y -c conda-forge mamba
conda deactivate
make create_environment
# follow the instructions below to set the DAGSTER_HOME
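# for example (illustrative path, adjust it to your setup):
# export DAGSTER_HOME=$(pwd)/dagster_home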
# and perform an editable installation (if you want to toy around with this dummy pipeline)
conda activate dagster-asset-demo
pip install --editable .
make dagit
# explore: Go to http://localhost:3000
# optionally, to use schedules and backfills, run the daemon in a 2nd terminal
# (do not forget to again activate the required conda environment):
dagster-daemon run
More involved examples such as the one for Airbyte might require access to docker, which is used to easily spin up containers for databases or further services. The necessary docker-compose.yml file is contained in the example code. Instructions on how to use it are part of the separate posts which require such additional resources to follow along.
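In case Docker Compose is new to you, bringing such services up and down usually boils down to the following (run from the directory containing the docker-compose.yml; the concrete services are described in the respective posts):

```bash
docker-compose up -d   # start the services defined in docker-compose.yml in the background
docker-compose down    # stop and remove them again once you are done
```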
related posts:
When debugging dagster, the interactivity of a jupyter notebook-based environment might be helpful. This post explores the integration of both to directly interact with a running dagster instance.
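As a minimal sketch of what such an interaction can look like (assuming the notebook process has DAGSTER_HOME pointing at the same instance that dagit and the daemon use; the details are covered in the linked post):

```python
from dagster import DagsterInstance

# connects to the instance configured via DAGSTER_HOME
instance = DagsterInstance.get()

# interactively inspect the most recent runs
for run in instance.get_runs(limit=5):
    print(run.run_id, run.status)
```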
learning more about the ecosystem
There are many more topics to cover beyond the scope of this simple introductory tutorial. The official dagster documentation contains some good examples: In particular, https://docs.dagster.io/guides/dagster/example_project is a great recommendation to learn more.
The youtube channel of dagster hosts additional great content. In particular, the community meetings can contain valuable further ideas to improve data pipelines.

Areas not covered by this series that might be of interest:
- Deployment of Dagster. We are using a locally running instance in the examples to make it easy to follow along. Obviously, this is not production-grade, and a proper deployment, e.g. using kubernetes, might be required for your use case.
- A further good example is https://github.com/MileTwo/dagster-example-pipeline and the accompanying blog post, which uses docker containers for development and deployment
- Integration with external data governance tools like Egeria or datahub. However, such tools should be considered an essential capability to move data governance forward in an enterprise setting.
- The reverse-ETL orchestration of DBT from dagster together with Hightouch, as outlined in https://blog.getdbt.com/dbt-and-hightouch-are-putting-transformed-data-to-work/. However, the link can still be a good reference.
- GraphQl-API-based data ecosystem: https://www.sspaeti.com/blog/analytics-api-with-graphql-the-next-level-of-data-engineering/
- Lightdash, a self-service BI tool which natively connects to DBT.
Furthermore, an interesting discussion recently evolved around tables vs. streams:
It's clear that the modern data stack - built on the CDW with tools for ELT, data modeling, metrics, and reverse ELT/operationalization - is becoming a standard. How is this architecture/are these workflows *fundamentally* wrong?
— Sarah Catanzaro (@sarahcat21) March 15, 2022
summary
❤️ Hopefully, you are inspired to experiment with the modern data stack tools. By following along in this series of posts and the accompanying source code you should be able to:
- improve the quality of your data pipelines
- increase the quality of your data assets
- get an understanding of the end-to-end lineage between the various data assets