Local data stack template

Oct 25, 2024·
Dr. Georg Heiler
Dr. Georg Heiler
· 2 min read
blog

introduction

Almost a year ago Aleks and I blogged about: the local modern data stack as a pipeline around duckdb, dagster and parquet-ish files and remote storage And 6 months later Hernan and I about scaling that concept to spark and commoditizing the data-paas.

More and more people have picked up on the idea with regards to duckdb due to the simplicity and increasing popularity of lakehouse:

Therefore, github.com/l-mds/local-data-stack should serve as a template of how to get started with a local modern data stack. It is still a bit early, but over time it will be refined.

Usage

pixi run tpl-init-cruft

# alternatively:
cruft create cruft create git@github.com:l-mds/local-data-stack.git

cd <<your project name>>
git init
git add .
git commit -m "initial commit"

docker compose -f docker-compose.yml --profile dagster_onprem up --build

To update the template simply execute:

pixi run tpl-update

result

See for yourself:

Dagster asset graph
The following goodies are included:

  • Reproducibility
  • LMDS tools
    • dagster
    • dbt
    • duckdb
  • code quality
    • pyright
    • taplo
    • pytest
  • template updates via cruft
  • secops
    • age
    • sops

summary

This template is a starting point to get going with a local modern data stack and waiting for more collaborative refinement.

Currently, it is still not a full reference of the duckdb blog post and lacking around partition handling and lakehouse/iceberg/delta/Hudi integration. Are you interested in contributing? Please reach out to me.

With this I wish for a world where data practitioners (albeit all the customized glue code) can build on more best practices and reference examples for their pipelines. I hope that this can serve as building blocks to speed things up and build with higher quality & confidence.

Dr. Georg Heiler
Authors
senior data expert
Georg is a Senior data expert at Magenta and a ML-ops engineer at ASCII. He is solving challenges with data. His interests include geospatial graphs and time series. Georg transitions the data platform of Magenta to the cloud and is handling large scale multi-modal ML-ops challenges at ASCII.