Data Vault User Group Vienna 2025 February


I attended the DDVUG spring workshop in Vienna.
About the event: https://datavaultusergroup.de/vortragsarchiv/16-tagung-fruehjahrstagung-2025/
The slides are here: https://datavaultusergroup.de/vortragsarchiv/
Find below some interesting things and notes from the workshop.
John Giles - Elephant in the boardroom
https://www.linkedin.com/in/john-giles-data
https://datavaultusergroup.de/wp-content/uploads/DV-Data-Town-Plans-v02.pdf
Great motivational talk about data vault.
My notes:
- do not build a source system oriented data vault
- conceptual models never have attributes
- even the best components must follow the overall plan in order to form a good joint solution
- if you look at today's systems you will not get the business vision for tomorrow
Sebastian Flucke - dbt Jinja macros
https://www.linkedin.com/in/sebastian-flucke-15624915b
https://datavaultusergroup.de/wp-content/uploads/Jinja_Vortrag.pdf
Introduction to dbt and Jinja macros.
My notes:
- DRY: define once, test and then reuse in many places
- bring software engineering patterns to data
- HDA (historical data archive) / persistent staging area can be generated directly from dbt
- it persists the whole history (forever) in order to reload from scratch easily. He acknowledged that data retention is tricky, but code generation can be useful to solve even that
- keys: the machine-readable metadata stores the list of columns which denote the primary key (see the sketch after this list)
- the code generator adds the generator version and generation timestamp as metadata - no manual intervention
- metadata lives in a YAML file, neatly version-controlled in Git. Code + config should be managed in the same place: a single source of truth. Enables fast generation of tests.
- 3-way merge was a key challenge for classical ETL tools
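To make the metadata-driven generation idea more concrete, here is a minimal sketch in Python using the jinja2 and PyYAML libraries. It is not Sebastian's actual dbt macro - model names, metadata layout, and the SQL are made up - but it shows the DRY principle: the key columns are declared once in YAML and drive the rendered hash-key expression.

```python
# Minimal sketch of metadata-driven code generation (illustrative, not the dbt
# macros from the talk): key columns are declared once in YAML metadata and are
# reused to render the hash-key expression of a staging model.
import yaml
from jinja2 import Template

# hypothetical metadata - in the setup described in the talk this lives in a
# YAML file version-controlled in Git
METADATA = """
customer:
  source_table: crm.customer
  key_columns: [customer_id, source_system]
"""

SQL_TEMPLATE = Template(
    """
select
    md5(concat_ws('||', {{ key_columns | join(', ') }})) as hash_key,
    *,
    current_timestamp as load_ts
from {{ source_table }}
"""
)

for model_name, meta in yaml.safe_load(METADATA).items():
    # the generator could also stamp its own version and a generation timestamp here
    print(f"-- generated model: stg_{model_name}")
    print(SQL_TEMPLATE.render(**meta))
```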
Interesting reference (as the example was around dbt + databricks):
Save 50% of Databricks cost - see https://georgheiler.com/post/paas-as-implementation-detail - this is possible when building around the orchestrator rather than a specific transformation engine
Michael Müller - DWH simple with a methodology
https://www.linkedin.com/in/michael-mueller-1035a424
https://datavaultusergroup.de/wp-content/uploads/20250327-DWH-Einfach-mit-Methode.pdf
My notes (the talk was held in German):
- Andrew Hunt, The Pragmatic Programmer
- https://www.buecher-stierle.at/item/47981837
- "the dog ate my source code"
- all that automation:
- all the work becomes less
- but the duration stays the same: preparing the source systems is the problem
- the DWH should grow with its tasks
- DWH automation comparison: https://dwa-compare.info/
- dwa-compare.info/automation-tools/
- data quality should be a separate process, i.e. according to data vault it should not prevent loading. Business/metadata handling still has to happen later.
- use case vs. loading
- fail fast vs. be more resilient - which works better depends on the context. Still, breaking the flow should be a deliberate decision. Often it is advisable to let the stream flow and clean up later
- should the raw vault store all the problems just in raw form?
- data quality is not the problem - rather, the interpretation logic/metadata is missing (in many cases)
- way too many outbound interfaces to people who are often not fully happy with a traditional DWH
- a DWH needs a business object model (subject-oriented, integrated, time-variant, non-volatile according to Bill Inmon)
- a single big corporate model is too big to understand in one go. Abstraction is key. Construct it from left to right (salesperson, order, payment)
- SAP's data model is built on this principle
- why do data models have such a bad reputation most of the time?
- there is no need to have the model 100% right - it is there for communication
- data modeler vs. terrorist: you can negotiate with the terrorist but not the modeler
- some methods
- ELM https://www.elmstandards.com/ https://europass.europa.eu/en/news/launch-european-learning-model-data-model-browser
- schema: connect to web ontologies for LLM semantics
- data contracts should be minimal
- validate the number of non-agreed changes in the last N days
- count mis-deliveries
- count delayed deliveries
- board of glory - praise those who deliver data well
- diagram drawing - Mermaid can be useful
- time
- measurable
- descriptive
- connections
- automation
- generate to business vault
- generate after business vault
- automate the mechanics
- but not possible to automate the logic/business vault
- though metadata may help to simplify certain rules and automation
- data mesh did one thing well: too many interfaces and unclear responsibility - requesting a domain owner
- but one thing wrong: SAP as a source system, for example, gets untangled into 5 systems only to be re-combined later
- Information product canvas https://wow.agiledata.io/project/information-product-canvas-templates/
- provide clear interfaces for people outside of the core DWH who want to access it
- lineage may be optimized by recording input/output columns and constructing the graph via graphviz (see the sketch after these notes)
- though dbt or sqlmesh can provide this out of the box a bit more neatly
- error handling
- do not overload - prioritize
- send to accountable person
- do not only send mail (no traceability) - generate tickets automatically
- track unplanned work - and price technical debt
- there is no one-stop-shop box for a data warehouse/data platform
- 15% hidden refactoring work is always possible
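On the lineage point above, a minimal sketch of my own (not from the talk) of how little is needed once input/output tables are recorded per load job, e.g. in a control table. It assumes the graphviz Python package plus the Graphviz binaries; the table names are made up.

```python
# Sketch of lineage rendering from recorded input/output relations
# (hypothetical table names; requires the graphviz package and binaries).
from graphviz import Digraph

# (input, output) pairs as they might come out of a control/mapping table
edges = [
    ("src.crm_customer", "rv.hub_customer"),
    ("src.crm_customer", "rv.sat_customer"),
    ("rv.hub_customer", "bv.dim_customer"),
    ("rv.sat_customer", "bv.dim_customer"),
]

dot = Digraph("lineage", graph_attr={"rankdir": "LR"})
for source, target in edges:
    dot.edge(source, target)

dot.render("lineage", format="svg", cleanup=True)  # writes lineage.svg
```

As noted above, dbt or sqlmesh give you this out of the box, but the snippet shows that a control table already contains everything needed.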
Interesting reference:
See https://georgheiler.com/event/magenta-pixi-25 on how to scale the DWH to many people and how embracing the orchestrator allows for flexibility that was missing in the past. As a bonus it outlines how to combine AI & BI and also how to reduce shadow IT along the way.
Christian Hädrich - Die vermaledeite Beziehung - Erfahrungsaustausch zum Data Vault Link (the accursed relationship - an exchange of experiences on the Data Vault link)
https://www.linkedin.com/in/christian-headrich
My notes:
- connections are important
- one hub per source system is not recommended for implementation
- history: "Das Selbe in Grün" ("the same thing in green") - the idiom goes back to a lawsuit, see https://de.wikipedia.org/wiki/Opel_4_PS
- peg-legged link - a weak hub may be useless - but may become useful later if an additional link on link happens
- though link on link is often considered an anti-pattern
- don't cross the streams
- a link is like a bridge - it must be anchored strongly on both sides. You would not build bridges on top of bridges
- basically link-ception
- is this just a dogma?
- in DV v1 the load patterns would break in the case of link on link. This is not a problem in DV v2, but it throws a different error
- it leads to recursion - the link also needs to be added to the hubs
- a graph only has nodes and edges
- a keyed instance hub may be a (generic) solution - but does it offer enough performance in lakehouses (the additional join)?
- n:m is usually a transaction. Ask about the follow-up processes
- should it be a hub in the logical model but modeled as a link in the physical model for increased performance?
- is the underlying relationship 1:n or n:m - this is the key question to answer in the background
- an activity satellite may be needed for most links to track the history of the link
- http://dvstandards.com/guidance/
- flight segment (link)
- flight number (hub)
- date (hub)??? (peg-legged; a dependent child key might be better? a schema sketch follows at the end of these notes)
- airport (hub)
- code sharing (makes it even more complex)
- car registration (link)
- code (hub)
- car (hub)
- person (hub)
- swappable car code (even more complex)
- binary vs. ternary vs. n-ary links?
- unit of work
- it comes from one source
- but this can blow up quite quickly
- data scientists need everything that is available in the source
- ensure no restrictions are propagated from the source
- Datavault Builder only supports binary links - but does not have any refactorings (it is more flexible; it must have status tracking, otherwise you can lose data)
Fun fact: there are people with the surname "Null". They have reported problems not being able to book a lot of services.
- driving key - is n:m always an advantage?
- example
- offer (hub)
- offers (link)
- customer (hub)
- ideally the patterns should make it easy to recognize the driving key (it can be inferred from the business object model)
- Data Vault (raw vault) is only meant for long-term storage - not for optimized query performance
- links may change over time: source-/record- tracking satellite
- technical
- business validity
- bi-temporality
- the raw vault is not temporal but only has attributes (one line of argument)
- only links are temporal
- hubs usually are not (unitemporal)
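To illustrate the flight-segment discussion, here is a small schema sketch - my own interpretation with made-up names, and duckdb used purely as a convenient SQL engine, not a reference implementation. The flight date sits on the link as a dependent child key instead of being peg-legged into its own date hub.

```python
# Illustrative Data Vault tables for the flight-segment example; not a reference
# implementation. DuckDB is only used here as a lightweight SQL engine.
import duckdb

con = duckdb.connect()

ddl_statements = [
    """create table hub_flight_number (
           hk_flight_number text primary key,
           flight_number    text,
           load_ts          timestamp)""",
    """create table hub_airport (
           hk_airport   text primary key,
           airport_code text,
           load_ts      timestamp)""",
    """create table lnk_flight_segment (
           hk_flight_segment    text primary key,  -- hash over parent keys + flight_date
           hk_flight_number     text,              -- -> hub_flight_number
           hk_departure_airport text,              -- -> hub_airport
           hk_arrival_airport   text,              -- -> hub_airport
           flight_date          date,              -- dependent child key, no separate date hub
           load_ts              timestamp)""",
]
for statement in ddl_statements:
    con.execute(statement)
```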
Thomas Herzog - Design Entscheidungen für Data Vault Architekturen (design decisions for Data Vault architectures)
My notes:
- the raw vault can be lean and just a pointer (may be slower as it is not hard-materialized)
- timestamps in persistent staging area
- technical timestamp of delivery
- extraction from source system
- business effective timestamp
- the semantics matter (direct load vs. reprocessing, e.g. for the insert timestamp) in order to support bi-temporality (a small sketch follows after these notes)
- raw vault vs persistent staging area
- raw vault only imports the columns that people care about and use for real use cases
- a marker is needed for garbage/buggy loaded data so that only good data is pushed over to the business vault
- raw vault
- 1:1 original names?
- some cleanup may be useful though
- business vault
- when to use business names and not technical names?
- interestingly, there is no real consensus on which tool is best for DV automation
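A tiny sketch (my own illustration, with made-up names) of the three timestamp roles mentioned above for a persistent staging area table:

```python
# Illustrative PSA table with the three timestamp semantics from the notes.
import duckdb

duckdb.connect().execute("""
    create table psa_customer (
        customer_id           text,
        payload               text,       -- raw record as delivered
        delivery_ts           timestamp,  -- technical timestamp of the delivery/batch
        source_extraction_ts  timestamp,  -- when the source system extracted the record
        business_effective_ts timestamp   -- business validity, enables bi-temporal queries
    )
""")
```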
Jörg Stahnke - Vollständig automatisierte Datenprojekte (fully automated data projects)
https://www.linkedin.com/in/j%C3%B6rg-stahnke-b73b8a219
https://datavaultusergroup.de/wp-content/uploads/vollstaendigeAutomatisierungDDVUG.pdf
My notes:
- Jörg does not want to program/code - he suggests modeling and code generation only
- operational data lineage is essential, via data orchestration in a control table
- code version and partition must be tracked
- same principle may be implemented in a full-blown orchestrator
My take: not a traditional dependency table but a full-blown data orchestrator might be an enhanced implementation; see https://github.com/l-mds/local-data-stack for a small example of an OSS implementation.
- data catalog should be the single source of truth
- the data catalog should be exportable into many serialization formats - even Excel if this is useful for business users
- bitemporal data model is good for data science model validation
- joining multiple bitemporal tables correctly can be tricky (see the worked example after these notes)
- this can be handled by automation
- exposed as an easy-to-consume API to external users
- backfilling/migration of old state
- code needs to know if a backload needs to be triggered (including for history) based on executing a specific code version for a partition
- a PSA (persistent staging area) is required for this
- GDPR and other legal retention/deletion requirements need to clearly mark which fields have to be deleted
- privacy hub concept - key deletion is instant
- some privacy lawyers are not satisfied with deleting the key and require physical deletion
- marketing and financial reporting have to be clearly separated
- deletion must be auditable (ideally when and what tables/fields were deleted)
- data quality is based on observability
- in case of a DV based data model certain checks may be simpler due to the availability of the hash keys compared to a full blown record-per-record diffing procedure
- test data for development is a must, and test/validation data should be comparable
- automatic data diffing may be quite relevant here to cross check the changes
- performance testing capability needs to be able to scale up the datasets in a referentially consistent way
- tracking statistics of added/deleted tables and columns
- blue/green production (cloned) deployment on snowflake ++ merge request
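On the tricky bitemporal joins mentioned above, a small worked example with made-up data and duckdb as the engine. It only shows the business-validity dimension: the join has to intersect the validity intervals rather than just match the business key; a fully bitemporal join would additionally filter both sides to an "as of" load timestamp.

```python
# Worked example: joining two satellites with validity intervals requires an
# overlap join and keeping the intersected interval (made-up data, duckdb).
import duckdb

con = duckdb.connect()
con.execute("""
    create table customer_sat as
    select * from (values
        ('C1', 'gold',   date '2024-01-01', date '2024-06-30'),
        ('C1', 'silver', date '2024-07-01', date '9999-12-31')
    ) as t(customer_id, segment, valid_from, valid_to)
""")
con.execute("""
    create table contract_sat as
    select * from (values
        ('C1', 'basic',   date '2024-03-01', date '2024-09-30'),
        ('C1', 'premium', date '2024-10-01', date '9999-12-31')
    ) as t(customer_id, plan, valid_from, valid_to)
""")

# overlap join: keep only the intersected validity interval of both satellites
rows = con.execute("""
    select c.customer_id, c.segment, k.plan,
           greatest(c.valid_from, k.valid_from) as valid_from,
           least(c.valid_to, k.valid_to)        as valid_to
    from customer_sat c
    join contract_sat k
      on c.customer_id = k.customer_id
     and c.valid_from <= k.valid_to
     and k.valid_from <= c.valid_to
    order by valid_from
""").fetchall()
for row in rows:
    print(row)
```

Exactly this kind of boilerplate is what the talk suggests generating rather than hand-writing per mart.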
workshop modeling
My notes:
- structural organization vs. process organization vs. business function
- business functions (Geschäftsfunktionen) are the most stable ways to slice domains
- a model of the company is very useful even for paperware (data that is not structured/machine-readable)
- based on https://github.com/ddvug/Willibald-Data
workshop automation
- soft rules vs hard rules
- automating the automation
- supernova
- abstract the physical data model back to the business model (Geschäftsobjekt, business object)
- https://medium.com/the-modern-scientist/data-vault-supernova-on-snowflake-e3c999f6d96e
- ELM as an intermediate metadata format before automation tools such as AutomateDV or others kick in
- model driven vs metadata driven automation
- development environment package locking (some useful tools exist)
- hubs for most businesses should be < 50 otherwise the DV model most likely has an issue
after workshop discussion
- Theseus is impressive
- for most other data needs Apache Ibis is a great fit (and it can even speak to Theseus)
- ADBC is not (only) about ABAP & SAP but also refers to Arrow (Arrow Database Connectivity). It can allow really fast data transfers
- https://github.com/sfu-db/connector-x may be useful
- often, for more reproducible results and less raw speed, https://slingdata.io/ or https://dlthub.com/ wrapped in https://docs.dagster.io/integrations/libraries/embedded-elt can be really interesting (a minimal sketch follows below)
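As a tiny illustration of the dlt option mentioned in the last point - pipeline, dataset, and table names are made up, and it assumes dlt with the duckdb destination installed:

```python
# Minimal dlt pipeline: load a small batch of rows into a local duckdb database.
import dlt

pipeline = dlt.pipeline(
    pipeline_name="workshop_demo",
    destination="duckdb",
    dataset_name="raw",
)

# in a real setup this would be an API, database, or file source
rows = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]

load_info = pipeline.run(rows, table_name="people")
print(load_info)
```

The same pipeline can then be wrapped in Dagster via the embedded-elt integration linked above.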
next events
- 9th April: Vienna Data Engineering Meetup: https://www.meetup.com/de-DE/vienna-data-engineering-meetup/events/306930427
- TDWI 2025 & DDVUG track there https://www.tdwi-konferenz.de/de/programm/konferenzprogramm?tx_dmconferences_session%5Btrack%5D=1605&cHash=9aba44cc0feba399a41b67ba598deee8
summary
It was a blast - I got to meet and connect with a couple of really interesting people. There were also some great inspirational talks.
Looking forward to continue the discussions at one of the next events. The slides are here: https://datavaultusergroup.de/vortragsarchiv/
Some key learnings:
- data vault is for storing data. But it is equally important to get the data out again and do something with it. This may require additional read-side optimization.
- automation is key
- by default most data should be stored in a bitemporal model
- handling gdpr data deletion requests is tricky but important
- positive nudges: Board of glory
- data diffing is key to approve a change. A blue/green deployment can support this
- ELM as intermediate metadata format before automation tools kick in reduces the lock-in
- hubs for most businesses should be < 50 otherwise the DV model most likely has an issue
Interesting quotes:
if you look at today's systems you will not get the business vision for tomorrow
data modeler vs. terrorist: you can negotiate with the terrorist but not the modeler