Local data stack template

Oct 25, 2024·

Georg Heiler

· 2 min read

introduction

Almost a year ago Aleks and I blogged about: the local modern data stack as a pipeline around duckdb, dagster and parquet-ish files and remote storage And 6 months later Hernan and I about scaling that concept to spark and commoditizing the data-paas.

More and more people have picked up on the idea with regards to duckdb due to the simplicity and increasing popularity of lakehouse:

Therefore, github.com/l-mds/local-data-stack should serve as a template of how to get started with a local modern data stack. It is still a bit early, but over time it will be refined.

Usage

pixi run tpl-init-cruft

# alternatively:
cruft create cruft create git@github.com:l-mds/local-data-stack.git

cd <<your project name>>
git init
git add .
git commit -m "initial commit"

docker compose -f docker-compose.yml --profile dagster_onprem up --build

To update the template simply execute:

pixi run tpl-update

result

See for yourself:

The following goodies are included:

Reproducibility
- pixi
- docker
LMDS tools
- dagster
- dbt
- duckdb
code quality
- pyright
- taplo
- pytest
template updates via cruft
secops
- age
- sops

summary

This template is a starting point to get going with a local modern data stack and waiting for more collaborative refinement.

Currently, it is still not a full reference of the duckdb blog post and lacking around partition handling and lakehouse/iceberg/delta/Hudi integration. Are you interested in contributing? Please reach out to me.

With this I wish for a world where data practitioners (albeit all the customized glue code) can build on more best practices and reference examples for their pipelines. I hope that this can serve as building blocks to speed things up and build with higher quality & confidence.

Last updated on Nov 5, 2024