Modern data orchestration using Dagster
Welcome 😄 to this blog series about the modern data stack ecosystem.
This blog series will give an overview of how dagster, as a central piece of the modern data stack, can easily interface with and orchestrate tools like:
- Airbyte for ingesting data using connectors for various services
- DBT for ETL-style SQL transformations in a modern define-once, test, and reuse fashion
- jupyter notebooks in the domain of data science
With such an E2E integration capability, tool silos that adversely affect data governance, data quality and lineage of data assets are a matter of the past.
disclaimer
The following blog series was inspired by (and code for some of the examples is derived from):
- https://dagster.io/blog/rebundling-the-data-platform
- https://dagster.io/blog/dagster-0-14-0-never-felt-like-this-before
- https://dagster.io/blog/software-defined-assets
- https://www.sspaeti.com/blog/analytics-api-with-graphql-the-next-level-of-data-engineering/
- and the official dagster documentation https://docs.dagster.io/concepts, including the examples in code at https://github.com/dagster-io/dagster
Why another blog post on this topic? Well, firstly, one only really learns a topic not by reading other people's posts, but through hands-on experimentation with a new technology and concrete examples. Secondly, hopefully this will be a good reference for me not to forget how dagster works.
Moreover, and most importantly, the official documentation and various pre-existing blog posts sometimes show fantastic examples. However, unless you have a cloud-based deployment of dagster and the required components (databases, APIs, connectors, SaaS services, blob storage, …) at hand, it can be hard to follow along.
Therefore, the examples in this post series are all structured to easily allow local experimentation (even in old-school enterprise scenarios where perhaps the cloud is still not yet a thing).
Why dagster? A great description of why not to use Apache Airflow is We're All Using Airflow Wrong and How to Fix It by Bluecore. TLDR: operator madness with varying quality of the connectors, no native notion of moving data/assets from one task to the next, and no handling of resources - therefore testability madness. Dagster changes this with a code-first approach, easy testability, and by not mixing orchestration and business logic. Furthermore, Lyft brought up some great points regarding reproducibility & resource isolation as weaknesses of Airflow.
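To make the code-first claim a bit more tangible, here is a minimal sketch of a dagster job (the names are illustrative and not taken from the example repository): business logic lives in plain Python functions decorated as ops, the wiring happens separately in the job, and the whole thing can be executed in-process inside a unit test without any scheduler running.

```python
from dagster import job, op


@op
def fetch_greeting() -> str:
    # plain business logic: no orchestration concerns leak in here
    return "hello, modern data stack"


@op
def print_greeting(greeting: str) -> None:
    # the output of the upstream op arrives as a regular function argument
    print(greeting)


@job
def hello_job():
    # the wiring of ops is separate from the business logic itself
    print_greeting(fetch_greeting())


def test_hello_job():
    # jobs are testable in-process, e.g. in a plain pytest test
    assert hello_job.execute_in_process().success
```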
A big thank you goes to Sandy Ryza as a co-author of these posts. He helped answer my questions when getting started with dagster and furthermore helped simplify certain code examples (e.g. getting rid of external dependencies on cloud resources). To be fully transparent, I have to disclose that he works at the company building dagster.
post series overview
- basic introduction (from hello-world to simple pipelines)
- assets: turning the data pipeline inside out using software-defined-assets to focus on the things we care about: the curated data assets and not the intermediate transformations (a minimal sketch follows after this list)
- a more fully-fledged example integrating multiple components including resources, an API as well as DBT
- integrating jupyter notebooks into the data pipeline using dagstermill
- working on scalable data pipelines with pyspark
- ingesting data from foreign sources using Airbyte connectors
- SFTP sensor reacting to new files
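As a small teaser for the assets post: with software-defined assets you declare the data assets you care about as decorated Python functions, and dagster derives the dependency graph from the function signatures. The sketch below uses made-up asset names and is not part of the example repository:

```python
from dagster import asset


@asset
def raw_orders():
    # in a real pipeline this data would e.g. be ingested via Airbyte
    return [
        {"id": 1, "amount": 20},
        {"id": 2, "amount": 300},
    ]


@asset
def large_orders(raw_orders):
    # the raw_orders argument declares the dependency on the upstream asset
    return [o for o in raw_orders if o["amount"] > 100]
```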
The source code is available here: https://github.com/geoHeil/dagster-asset-demo
Requirements:
- miniconda: https://docs.conda.io/en/latest/miniconda.html is installed, is on your path, and has connectivity (direct or indirect via an artifact store) to install the required packages
- optionally:
  - docker (required for some of the later, more complex examples)
  - git (to clone and access the example code)
  - make (to execute the makefile)
To be prepared for this tutorial, execute:
git clone git@github.com:geoHeil/dagster-asset-demo.git
cd dagster-asset-demo
# prepare mamba https://github.com/mamba-org/mamba
conda activate base
conda install -y -c conda-forge mamba
conda deactivate
make create_environment
# follow the instructions below to set the DAGSTER_HOME
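# for example (illustrative path, adjust it to your setup):
# export DAGSTER_HOME=$(pwd)/dagster_home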
# and perform an editable installation (if you want to toy around with this dummy pipeline)
conda activate dagster-asset-demo
pip install --editable .
make dagit
# explore: Go to http://localhost:3000
# optionally, to use schedules and backfills, run the daemon in a 2nd terminal
# (do not forget to again activate the required conda environment):
dagster-daemon run
More involved examples such as the one for Airbyte might require access to docker, which is used to easily spin up containers for databases or further services. The necessary docker-compose.yml file is contained in the example code. Instructions on how to use it are part of the separate posts which require such additional resources to follow along.
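In case Docker Compose is new to you, bringing such services up and down usually boils down to the following (run from the directory containing the docker-compose.yml; the concrete services are described in the respective posts):

```bash
docker-compose up -d   # start the services defined in docker-compose.yml in the background
docker-compose down    # stop and remove them again once you are done
```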
related posts:
When debugging dagster, the interactivity of a jupyter notebook-based environment might be helpful. This post explores the integration of both to directly interact with a running dagster instance.
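As a minimal sketch of what such an interaction can look like (assuming the notebook process has DAGSTER_HOME pointing at the same instance that dagit and the daemon use; the details are covered in the linked post):

```python
from dagster import DagsterInstance

# connects to the instance configured via DAGSTER_HOME
instance = DagsterInstance.get()

# interactively inspect the most recent runs
for run in instance.get_runs(limit=5):
    print(run.run_id, run.status)
```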
learning more about the ecosystem
There are many more topics to cover beyond the scope of this simple introductory tutorial. The official dagster documentation contains some good examples: In particular, https://docs.dagster.io/guides/dagster/example_project is a great recommendation to learn more.
The youtube channel of dagster hosts additional great content. In particular, the community meetings can contain valuable further ideas to improve data pipelines.

Areas not covered by this series that might be of interest:
- Deployment of Dagster. We are using a locally running instance in the examples to make it easy to follow along. Obviously, this is not production-grade, and a proper deployment, e.g. using kubernetes, might be required for your use case.
- A further good example is https://github.com/MileTwo/dagster-example-pipeline and the accompanying blog post, which uses docker containers for development and deployment
- Integration with external data governance tools like Egeria or datahub. However, such tools should be considered an essential capability to move data governance forward in an enterprise setting.
- The reverse-ETL orchestration of DBT from dagster together with Hightouch, as outlined in https://blog.getdbt.com/dbt-and-hightouch-are-putting-transformed-data-to-work/. However, the link can still be a good reference.
- GraphQl-API-based data ecosystem: https://www.sspaeti.com/blog/analytics-api-with-graphql-the-next-level-of-data-engineering/
- Lightdash, a self-service BI tool which natively connects to DBT.
Furthermore, an interesting discussion recently evolved around tables vs. streams:
It's clear that the modern data stack - built on the CDW with tools for ELT, data modeling, metrics, and reverse ELT/operationalization - is becoming a standard. How is this architecture/are these workflows *fundamentally* wrong?
— Sarah Catanzaro (@sarahcat21) March 15, 2022
summary
❤️ Hopefully, you are inspired to experiment with the modern data stack tools. By following along in this series of posts and the accompanying source code you should be able to:
- improve the quality of your data pipelines
- increase the quality of your data assets
- get an understanding of the end-to-end lineage between the various data assets