Local data stack template

Oct 25, 2024 · Georg Heiler · 2 min read

introduction

Almost a year ago, Aleks and I blogged about the local modern data stack: a pipeline built around duckdb, dagster, parquet-ish files, and remote storage. Six months later, Hernan and I wrote about scaling that concept to spark and commoditizing the data PaaS.

More and more people have picked up on the idea, particularly around duckdb, thanks to its simplicity and the increasing popularity of the lakehouse.

Therefore, github.com/l-mds/local-data-stack is meant to serve as a template for getting started with a local modern data stack. It is still early, but it will be refined over time.

Usage

pixi run tpl-init-cruft

# alternatively:
cruft create git@github.com:l-mds/local-data-stack.git

cd <<your project name>>
git init
git add .
git commit -m "initial commit"

docker compose -f docker-compose.yml --profile dagster_onprem up --build
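
The commands above assume that pixi (or cruft), git, and docker are already on the PATH. As a minimal sketch, a pre-flight check could look like the following. The helper name `require_tools` is hypothetical and not part of the template:

```shell
#!/usr/bin/env sh
# Hypothetical helper (not shipped with the template): report which of the
# given command-line tools cannot be found on the PATH.
require_tools() {
  missing=""
  for tool in "$@"; do
    # command -v succeeds only if the tool resolves to something executable
    if ! command -v "$tool" >/dev/null 2>&1; then
      missing="$missing $tool"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing tools:$missing" >&2
    return 1
  fi
  return 0
}

# Possible usage before initializing the template:
#   require_tools pixi git docker && pixi run tpl-init-cruft
```

Failing early with a list of missing tools is usually easier to act on than an error halfway through the docker build.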

To update the template, simply execute:

pixi run tpl-update
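
What `pixi run tpl-update` executes is defined by the template itself. Assuming it wraps cruft, the update flow can be sketched roughly as below. The function name `update_template` is hypothetical; `.cruft.json` is the state file cruft writes when a project is created from a template:

```shell
#!/usr/bin/env sh
# Hypothetical wrapper; the real tpl-update task may do something different.
update_template() {
  project_dir="${1:-.}"
  # cruft tracks the template origin and commit in .cruft.json
  if [ ! -f "$project_dir/.cruft.json" ]; then
    echo "no .cruft.json in $project_dir: not a cruft-managed project" >&2
    return 2
  fi
  # cruft check exits non-zero when the project lags behind the template
  if (cd "$project_dir" && cruft check); then
    echo "project is up to date with the template"
  else
    # cruft update shows the pending changes and asks before applying them
    (cd "$project_dir" && cruft update)
  fi
}

# Possible usage from the project root:
#   update_template .
```

Committing local work before updating keeps the changes pulled in from the template easy to review and, if needed, to revert.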

result

See for yourself:

[Figure: Dagster asset graph]

The following goodies are included:

  • reproducibility
  • LMDS tools
    • dagster
    • dbt
    • duckdb
  • code quality
    • pyright
    • taplo
    • pytest
  • template updates via cruft
  • secops
    • age
    • sops

summary

This template is a starting point for getting going with a local modern data stack, and it is waiting for further collaborative refinement.

Currently, it is not yet a full reference implementation of the duckdb blog post: it is still lacking around partition handling and lakehouse (Iceberg, Delta, Hudi) integration. Are you interested in contributing? Please reach out to me.

With this, I wish for a world where data practitioners, despite all the customized glue code, can build on more best practices and reference examples for their pipelines. I hope that this template can serve as a set of building blocks to speed things up and to build with higher quality and confidence.

Georg Heiler
Authors
senior data expert
My research interests include large geo-spatial time and network data analytics.