Data Vault User Group Vienna 2025 February


I attended the DDVUG spring workshop in Vienna.
About the event: https://datavaultusergroup.de/vortragsarchiv/16-tagung-fruehjahrstagung-2025/
The slides are here: https://datavaultusergroup.de/vortragsarchiv/
Find below some interesting things and notes from the workshop.
John Giles - Elephant in the boardroom
https://www.linkedin.com/in/john-giles-data
https://datavaultusergroup.de/wp-content/uploads/DV-Data-Town-Plans-v02.pdf
Great motivational talk about data vault.
My notes:
- do not build a source system oriented data vault
- conceptual models never have attributes
- even the best components must follow the overall plan in order to form a good joint solution
- if you look at today's systems you will not get the business vision for tomorrow
Sebastian Flucke - dbt Jinja macros
https://www.linkedin.com/in/sebastian-flucke-15624915b
https://datavaultusergroup.de/wp-content/uploads/Jinja_Vortrag.pdf
Introduction to dbt and Jinja macros.
My notes:
- DRY: define once, test and then reuse in many places
- bring software engineering patterns to data
- HDA (historical data archive) / persistent staging area can be generated directly from dbt
- it persists the whole history (forever) in order to reload from scratch easily. He acknowledged that data retention is tricky, but code generation can be useful to solve even that
- keys: the machine-readable metadata stores the list of columns which denote the primary key (see the sketch after this list)
- the code generator adds the generator version and generation timestamp as metadata - no manual intervention
- metadata lives in a YAML file, neatly version-controlled in Git. Code + config should be managed in the same place: a single source of truth. Enables fast generation of tests.
- 3-way merge was a key challenge for classical ETL tools
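To make the metadata-driven generation idea more concrete, here is a minimal sketch in Python using the jinja2 and PyYAML libraries. It is not Sebastian's actual dbt macro - model names, metadata layout, and the SQL are made up - but it shows the DRY principle: the key columns are declared once in YAML and drive the rendered hash-key expression.

```python
# Minimal sketch of metadata-driven code generation (illustrative, not the dbt
# macros from the talk): key columns are declared once in YAML metadata and are
# reused to render the hash-key expression of a staging model.
import yaml
from jinja2 import Template

# hypothetical metadata - in the setup described in the talk this lives in a
# YAML file version-controlled in Git
METADATA = """
customer:
  source_table: crm.customer
  key_columns: [customer_id, source_system]
"""

SQL_TEMPLATE = Template(
    """
select
    md5(concat_ws('||', {{ key_columns | join(', ') }})) as hash_key,
    *,
    current_timestamp as load_ts
from {{ source_table }}
"""
)

for model_name, meta in yaml.safe_load(METADATA).items():
    # the generator could also stamp its own version and a generation timestamp here
    print(f"-- generated model: stg_{model_name}")
    print(SQL_TEMPLATE.render(**meta))
```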
Interesting reference (as the example was around dbt + databricks):
Save 50% of Databricks cost - see https://georgheiler.com/post/paas-as-implementation-detail - this is possible when building around the orchestrator rather than a specific transformation engine
Michael Müller - DWH simple with a methodology
https://www.linkedin.com/in/michael-mueller-1035a424
https://datavaultusergroup.de/wp-content/uploads/20250327-DWH-Einfach-mit-Methode.pdf
My notes (the talk was held in German):
- Andrew Hunt, The Pragmatic Programmer
- https://www.buecher-stierle.at/item/47981837
- "the dog ate my source code"
- all that automation:
- all the work becomes less
- but the duration stays the same: preparing the source systems is the problem
- the DWH should grow with its tasks
- DWH automation comparison: https://dwa-compare.info/
- dwa-compare.info/automation-tools/
- data quality should be a separate process, i.e. according to data vault it should not prevent loading. Business/metadata handling still has to happen later.
- use case vs. loading
- fail fast vs. be more resilient - which works better depends on the context. Still, breaking the flow should be a deliberate decision. Often it is advisable to let the stream flow and clean up later
- should the raw vault store all the problems just in raw form?
- data quality is not the problem - rather, the interpretation logic/metadata is missing (in many cases)
- way too many outbound interfaces to people who are often not fully happy with a traditional DWH
- a DWH needs a business object model (subject-oriented, integrated, time-variant, non-volatile according to Bill Inmon)
- a single big corporate model is too big to understand in one go. Abstraction is key. Construct it from left to right (salesperson, order, payment)
- SAP's data model is built on this principle
- why do data models have such a bad reputation most of the time?
- there is no need to have the model 100% right - it is there for communication
- data modeler vs. terrorist: you can negotiate with the terrorist but not the modeler
- some methods
- ELM https://www.elmstandards.com/ https://europass.europa.eu/en/news/launch-european-learning-model-data-model-browser
- schema: connect to web ontologies for LLM semantics
- data contracts should be minimal
- validate the number of non-agreed changes in the last N days
- count mis-deliveries
- count delayed deliveries
- board of glory - praise those who deliver data well
- diagram drawing - Mermaid can be useful
- time
- measurable
- descriptive
- connections
- automation
- generate to business vault
- generate after business vault
- automate the mechanics
- but not possible to automate the logic/business vault
- though metadata may help to simplify certain rules and automation
- data mesh did one thing well: too many interfaces and unclear responsibility - requesting a domain owner
- but one thing wrong: SAP as a source system, for example, gets untangled into 5 systems only to be re-combined later
- Information product canvas https://wow.agiledata.io/project/information-product-canvas-templates/
- provide clear interfaces for people outside of the core DWH who want to access it
- lineage may be optimized by recording input/output columns and constructing the graph via graphviz (see the sketch after these notes)
- though dbt or sqlmesh can provide this out of the box a bit more neatly
- error handling
- do not overload - prioritize
- send to accountable person
- do not only send mail (no traceability) - generate tickets automatically
- track unplanned work - and price technical debt
- there is no one-stop-shop box for a data warehouse/data platform
- 15% hidden refactoring work is always possible
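On the lineage point above, a minimal sketch of my own (not from the talk) of how little is needed once input/output tables are recorded per load job, e.g. in a control table. It assumes the graphviz Python package plus the Graphviz binaries; the table names are made up.

```python
# Sketch of lineage rendering from recorded input/output relations
# (hypothetical table names; requires the graphviz package and binaries).
from graphviz import Digraph

# (input, output) pairs as they might come out of a control/mapping table
edges = [
    ("src.crm_customer", "rv.hub_customer"),
    ("src.crm_customer", "rv.sat_customer"),
    ("rv.hub_customer", "bv.dim_customer"),
    ("rv.sat_customer", "bv.dim_customer"),
]

dot = Digraph("lineage", graph_attr={"rankdir": "LR"})
for source, target in edges:
    dot.edge(source, target)

dot.render("lineage", format="svg", cleanup=True)  # writes lineage.svg
```

As noted above, dbt or sqlmesh give you this out of the box, but the snippet shows that a control table already contains everything needed.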
Interesting reference:
See https://georgheiler.com/event/magenta-pixi-25 on how to scale the DWH to many people and how embracing the orchestrator allows for flexibility that was missing in the past. As a bonus it outlines how to combine AI & BI and also how to reduce shadow IT along the way.
Christian Hädrich - Die vermaledeite Beziehung - Erfahrungsaustausch zum Data Vault Link (the accursed relationship - an exchange of experiences on the Data Vault link)
https://www.linkedin.com/in/christian-headrich
My notes:
- connections are important
- one hub per source system is not recommended for implementation
- history: "Das Selbe in Grün" ("the same thing in green") - the idiom goes back to a lawsuit, see https://de.wikipedia.org/wiki/Opel_4_PS
- peg-legged link - a weak hub may be useless - but may become useful later if an additional link on link happens
- though link on link is often considered an anti-pattern
- don't cross the streams
- a link is like a bridge - it must be anchored strongly on both sides. You would not build bridges on top of bridges
- basically link-ception
- is this just a dogma?
- in DV v1 the load patterns would break in the case of link on link. This is not a problem in DV v2, but it throws a different error
- it leads to recursion - the link also needs to be added to the hubs
- a graph only has nodes and edges
- a keyed instance hub may be a (generic) solution - but does it offer enough performance in lakehouses (the additional join)?
- n:m is usually a transaction. Ask about the follow-up processes
- should it be a hub in the logical model but modeled as a link in the physical model for increased performance?
- is the underlying relationship 1:n or n:m - this is the key question to answer in the background
- an activity satellite may be needed for most links to track the history of the link
- http://dvstandards.com/guidance/
- flight segment (link)
- flight number (hub)
- date (hub)??? (peg-legged; a dependent child key might be better? a schema sketch follows at the end of these notes)
- airport (hub)
- code sharing (makes it even more complex)
- car registration (link)
- code (hub)
- car (hub)
- person (hub)
- swappable car code (even more complex)
- binary vs. ternary vs. n-ary links?
- unit of work
- it comes from one source
- but this can blow up quite quickly
- data scientists need everything that is available in the source
- ensure no restrictions are propagated from the source
- Datavault Builder only supports binary links - but does not have any refactorings (it is more flexible; it must have status tracking, otherwise you can lose data)
Fun fact: there are people with the surname "Null". They have reported problems not being able to book a lot of services.
- driving key - is n:m always an advantage?
- example
- offer (hub)
- offers (link)
- customer (hub)
- ideally the patterns should make it easy to recognize the driving key (it can be inferred from the business object model)
- Data Vault (raw vault) is only meant for long-term storage - not for optimized query performance
- links may change over time: source-/record- tracking satellite
- technical
- business validity
- bi-temporality
- the raw vault is not temporal but only has attributes (one line of argument)
- only links are temporal
- hubs usually are not (unitemporal)
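To illustrate the flight-segment discussion, here is a small schema sketch - my own interpretation with made-up names, and duckdb used purely as a convenient SQL engine, not a reference implementation. The flight date sits on the link as a dependent child key instead of being peg-legged into its own date hub.

```python
# Illustrative Data Vault tables for the flight-segment example; not a reference
# implementation. DuckDB is only used here as a lightweight SQL engine.
import duckdb

con = duckdb.connect()

ddl_statements = [
    """create table hub_flight_number (
           hk_flight_number text primary key,
           flight_number    text,
           load_ts          timestamp)""",
    """create table hub_airport (
           hk_airport   text primary key,
           airport_code text,
           load_ts      timestamp)""",
    """create table lnk_flight_segment (
           hk_flight_segment    text primary key,  -- hash over parent keys + flight_date
           hk_flight_number     text,              -- -> hub_flight_number
           hk_departure_airport text,              -- -> hub_airport
           hk_arrival_airport   text,              -- -> hub_airport
           flight_date          date,              -- dependent child key, no separate date hub
           load_ts              timestamp)""",
]
for statement in ddl_statements:
    con.execute(statement)
```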
Thomas Herzog - Design Entscheidungen für Data Vault Architekturen (design decisions for Data Vault architectures)
My notes:
- the raw vault can be lean and just a pointer (may be slower as it is not hard-materialized)
- timestamps in persistent staging area
- technical timestamp of delivery
- extraction from source system
- business effective timestamp
- the semantics matter (direct load vs. reprocessing, e.g. for the insert timestamp) in order to support bi-temporality (a small sketch follows after these notes)
- raw vault vs persistent staging area
- raw vault only imports the columns that people care about and use for real use cases
- a marker is needed for garbage/buggy loaded data so that only good data is pushed over to the business vault
- raw vault
- 1:1 original names?
- some cleanup may be useful though
- business vault
- when to use business names and not technical names?
- interestingly, there is no real consensus on which tool is best for DV automation
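A tiny sketch (my own illustration, with made-up names) of the three timestamp roles mentioned above for a persistent staging area table:

```python
# Illustrative PSA table with the three timestamp semantics from the notes.
import duckdb

duckdb.connect().execute("""
    create table psa_customer (
        customer_id           text,
        payload               text,       -- raw record as delivered
        delivery_ts           timestamp,  -- technical timestamp of the delivery/batch
        source_extraction_ts  timestamp,  -- when the source system extracted the record
        business_effective_ts timestamp   -- business validity, enables bi-temporal queries
    )
""")
```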
Jörg Stahnke - Vollständig automatisierte Datenprojekte (fully automated data projects)
https://www.linkedin.com/in/j%C3%B6rg-stahnke-b73b8a219
https://datavaultusergroup.de/wp-content/uploads/vollstaendigeAutomatisierungDDVUG.pdf
My notes:
- Jörg does not want to program/code - he suggests modeling and code generation only
- operational data lineage is essential, via data orchestration in a control table
- code version and partition must be tracked
- same principle may be implemented in a full-blown orchestrator
My take: not a traditional dependency table but a full-blown data orchestrator might be an enhanced implementation; see https://github.com/l-mds/local-data-stack for a small example of an OSS implementation.
- data catalog should be the single source of truth
- the data catalog should be exportable into many serialization formats - even Excel if this is useful for business users
- bitemporal data model is good for data science model validation
- joining multiple bitemporal tables correctly can be tricky (see the worked example after these notes)
- this can be handled by automation
- exposed as an easy-to-consume API to external users
- backfilling/migration of old state
- code needs to know if a backload needs to be triggered (including for history) based on executing a specific code version for a partition
- a PSA (persistent staging area) is required for this
- GDPR and other legal retention/deletion requirements need to clearly mark which fields have to be deleted
- privacy hub concept - key deletion is instant
- some privacy lawyers are not satisfied with deleting the key and require physical deletion
- marketing and financial reporting have to be clearly separated
- deletion must be auditable (ideally when and what tables/fields were deleted)
- data quality is based on observability
- in case of a DV based data model certain checks may be simpler due to the availability of the hash keys compared to a full blown record-per-record diffing procedure
- test data for development is a must, and test/validation data should be comparable
- automatic data diffing may be quite relevant here to cross check the changes
- performance testing capability needs to be able to scale up the datasets in a referentially consistent way
- tracking statistics of added/deleted tables and columns
- blue/green production (cloned) deployment on snowflake ++ merge request
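On the tricky bitemporal joins mentioned above, a small worked example with made-up data and duckdb as the engine. It only shows the business-validity dimension: the join has to intersect the validity intervals rather than just match the business key; a fully bitemporal join would additionally filter both sides to an "as of" load timestamp.

```python
# Worked example: joining two satellites with validity intervals requires an
# overlap join and keeping the intersected interval (made-up data, duckdb).
import duckdb

con = duckdb.connect()
con.execute("""
    create table customer_sat as
    select * from (values
        ('C1', 'gold',   date '2024-01-01', date '2024-06-30'),
        ('C1', 'silver', date '2024-07-01', date '9999-12-31')
    ) as t(customer_id, segment, valid_from, valid_to)
""")
con.execute("""
    create table contract_sat as
    select * from (values
        ('C1', 'basic',   date '2024-03-01', date '2024-09-30'),
        ('C1', 'premium', date '2024-10-01', date '9999-12-31')
    ) as t(customer_id, plan, valid_from, valid_to)
""")

# overlap join: keep only the intersected validity interval of both satellites
rows = con.execute("""
    select c.customer_id, c.segment, k.plan,
           greatest(c.valid_from, k.valid_from) as valid_from,
           least(c.valid_to, k.valid_to)        as valid_to
    from customer_sat c
    join contract_sat k
      on c.customer_id = k.customer_id
     and c.valid_from <= k.valid_to
     and k.valid_from <= c.valid_to
    order by valid_from
""").fetchall()
for row in rows:
    print(row)
```

Exactly this kind of boilerplate is what the talk suggests generating rather than hand-writing per mart.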
workshop modeling
My notes:
- structural organization vs. process organization vs. business function
- business functions (Geschäftsfunktionen) are the most stable ways to slice domains
- a model of the company is very useful even for paperware (data that is not structured/machine-readable)
- based on https://github.com/ddvug/Willibald-Data
workshop automation
- soft rules vs hard rules
- automating the automation
- supernova
- abstract the physical data model back to the business model (Geschäftsobjekt, business object)
- https://medium.com/the-modern-scientist/data-vault-supernova-on-snowflake-e3c999f6d96e
- ELM as an intermediate metadata format before automation tools such as AutomateDV or others kick in
- model driven vs metadata driven automation
- development environment package locking (some useful tools exist)
- hubs for most businesses should be < 50 otherwise the DV model most likely has an issue
after workshop discussion
- Theseus is impressive
- for most other data needs Apache Ibis is a great fit (and it can even speak to Theseus)
- ADBC is not (only) about ABAP & SAP but also refers to Arrow (Arrow Database Connectivity). It can allow really fast data transfers
- https://github.com/sfu-db/connector-x may be useful
- often, for more reproducible results and less raw speed, https://slingdata.io/ or https://dlthub.com/ wrapped in https://docs.dagster.io/integrations/libraries/embedded-elt can be really interesting (a minimal sketch follows below)
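As a tiny illustration of the dlt option mentioned in the last point - pipeline, dataset, and table names are made up, and it assumes dlt with the duckdb destination installed:

```python
# Minimal dlt pipeline: load a small batch of rows into a local duckdb database.
import dlt

pipeline = dlt.pipeline(
    pipeline_name="workshop_demo",
    destination="duckdb",
    dataset_name="raw",
)

# in a real setup this would be an API, database, or file source
rows = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]

load_info = pipeline.run(rows, table_name="people")
print(load_info)
```

The same pipeline can then be wrapped in Dagster via the embedded-elt integration linked above.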
next events
- 9th April: Vienna Data Engineering Meetup: https://www.meetup.com/de-DE/vienna-data-engineering-meetup/events/306930427
- TDWI 2025 & DDVUG track there https://www.tdwi-konferenz.de/de/programm/konferenzprogramm?tx_dmconferences_session%5Btrack%5D=1605&cHash=9aba44cc0feba399a41b67ba598deee8
summary
It was a blast - I got to meet and connect with a couple of really interesting people. There were also some great inspirational talks.
Looking forward to continue the discussions at one of the next events. The slides are here: https://datavaultusergroup.de/vortragsarchiv/
Some key learnings:
- data vault is for storing data. But it is equally important to get the data out again and do something with it. This may require additional read-side optimization.
- automation is key
- by default most data should be stored in a bitemporal model
- handling gdpr data deletion requests is tricky but important
- positive nudges: Board of glory
- data diffing is key to approve a change. A blue/green deployment can support this
- ELM as intermediate metadata format before automation tools kick in reduces the lock-in
- hubs for most businesses should be < 50 otherwise the DV model most likely has an issue
Interesting quotes:
if you look at today's systems you will not get the business vision for tomorrow
data modeler vs. terrorist: you can negotiate with the terrorist but not the modeler