Perfecting the Art of Doing Nothing: Incremental Multimodal AI Pipelines with Metaxy

BRZ, Vienna
Abstract
AI pipelines are now expensive enough that recomputing more than necessary is the dominant cost. Tokens and GPU hours change the economics, and agentic workflows branch and converge in ways the traditional single-state data platform was never designed for.
This talk introduces Metaxy, an open-source metadata control plane that:
- tracks lineage at the level of individual fields per record (not per dataset, not per asset),
- computes a precise diff when something changes — rows to add or recompute, rows to retire,
- hands that diff to whichever orchestrator or compute engine you already run.
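The diff step above can be sketched in a few lines. This is a hypothetical illustration, not Metaxy's actual API: assume each snapshot maps a record id to its per-field version hashes, and the control plane compares two snapshots to find exactly which record/field pairs are stale and which records should be retired.

```python
# Hypothetical sketch (not Metaxy's real API): per-record, per-field
# version hashes let a control plane diff two snapshots and emit only
# the work that is actually stale.

# A snapshot maps record_id -> {field_name: version_hash}.
Snapshot = dict[str, dict[str, str]]

def field_diff(old: Snapshot, new: Snapshot):
    """Return (recompute, retire): record/field pairs whose version
    changed or is new, and record ids that disappeared entirely."""
    recompute = [
        (rid, field)
        for rid, fields in new.items()
        for field, version in fields.items()
        if old.get(rid, {}).get(field) != version  # new record or changed field
    ]
    retire = [rid for rid in old if rid not in new]
    return recompute, retire

old = {"s1": {"audio": "a1", "text": "t1"}, "s2": {"audio": "a2"}}
new = {"s1": {"audio": "a1", "text": "t2"}, "s3": {"audio": "a3"}}
recompute, retire = field_diff(old, new)
# only s1's "text" field and s3's fields need work; s2 is retired
```

Handing `recompute` and `retire` to an orchestrator, rather than materializing data, is the "control plane" part: the expensive compute stays wherever it already runs.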
We walk through field-versioning, field-level dependencies, and selective recompute, then look at two production applications:
- Anam — Cara 3 training-data workflows (face detection and cropping, audio extraction, transcription, embedding generation) over millions of multimodal samples since December 2025.
- Jubust — structured patent intelligence built on Docling for parsing, Metaxy as the incremental control plane, Ray and Dagster for execution, and a reviewer workspace that turns expert corrections into signal for prompts, models, and evaluation.
The second half of the talk zooms out to platform anatomy — building blocks vs. domain products, quality that lives in the graph, executable specifications as the seam between platform and domain teams, and compute flexibility from laptop to HPC cluster.
Two takeaways on what Metaxy actually buys you:
- Topological caching of expensive AI work. Field-level lineage and a precise diff mean GPU and token spend is scoped to what truly changed downstream — the usual incremental-recompute story, but at sample-and-field granularity instead of asset granularity.
- Efficient metadata access over multimodal data enables intelligent routing. Once per-sample, per-field metadata is queryable, the platform can route work conditionally — pick a transcription model by detected language, pick a vision model by document type, send only the slices that need a heavy VLM through the expensive path, and keep the rest on cheap defaults.
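The routing idea in the second takeaway reduces to a metadata lookup per sample. A minimal sketch, with all model names and metadata fields (`language`, `doc_type`) chosen for illustration rather than taken from Metaxy:

```python
# Hypothetical sketch: once per-sample, per-field metadata is queryable,
# routing is a cheap lookup. Model names and fields are illustrative.

def pick_transcriber(meta: dict) -> str:
    # Route by detected language; keep everything else on a cheap default.
    if meta.get("language") in {"de", "fr"}:
        return "large-multilingual-asr"
    return "small-default-asr"

def pick_vision_model(meta: dict) -> str:
    # Send only the hard document types through the expensive VLM path.
    if meta.get("doc_type") == "scanned_table":
        return "heavy-vlm"
    return "cheap-layout-parser"

samples = [
    {"id": "a", "language": "de", "doc_type": "plain"},
    {"id": "b", "language": "en", "doc_type": "scanned_table"},
]
routes = {s["id"]: (pick_transcriber(s), pick_vision_model(s)) for s in samples}
```

The point is not the routing rules themselves but where they live: because the metadata is already per-sample and queryable, conditional dispatch needs no extra scan over the underlying multimodal data.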