Metaxy + Dagster-Slurm for Efficient Multimodal Pipelines

Feb 22, 2026 · 5 min read

Dr. Georg Heiler, Daniel Gafni, Hernan Picatto

At ASCII (Supply Chain Intelligence Institute Austria), we process large-scale web crawls and extract structured knowledge from messy data. For us, fast iteration is essential: researchers need to test and compare extraction approaches quickly.

We had been using Spark, but it became a bottleneck for fast experimentation because of:

  • its complexity
  • a processing model that is not well aligned with AI workloads
  • missing sample-level versioning and field-level change semantics

This led us to build a new stack around:

  • Metaxy for sample-level and field-level incremental versioning
  • Ray for easy scalability

Metaxy connects orchestrators such as Dagster (table/asset-level control) with low-level engines such as Ray, so each step processes exactly the samples that need work, and no more.

For a research institute, cloud GPU compute is often expensive. Luckily, we also have access to fairly affordable EU-sovereign HPC AI systems such as MUSICA. We built dagster-slurm to get a clean control plane and observability across local, cloud, and Slurm execution.

Introducing Metaxy

Why Metaxy (visual intuition)

In multimodal pipelines, one upstream change should only affect the downstream fields that actually depend on that change. Imagine a pipeline that branches after a video-processing step:

Pipeline branches where naive recomputation is wasteful

Consider a simple change: a new crop resolution in one video stage. After a fan-out, multiple downstream branches depend on that object’s lineage.

Topology matters for lineage.

Some branches depend on frames, others only on audio-derived artifacts.

Without field-aware versioning, the orchestration layer often marks the whole downstream region as stale. That triggers expensive recomputation for branches that are technically unaffected by the change, including model inference and embedding jobs that do not consume the updated field at all. The challenge is therefore not only graph topology; it is field-level dependency tracking within each sample.
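To make this concrete, here is a small illustrative sketch (not Metaxy's API; the edge table and function names are made up for this example) of staleness propagation along field-level edges. An asset-level scheme would mark every downstream branch stale; a field-level one follows only the edges that actually consume the changed field:

```python
# Each downstream field maps to the upstream fields it reads.
# These edges mirror the video/audio fan-out described above.
FIELD_DEPS = {
    "video/face_crop.frames":  {"video/full.frames"},
    "transcript/whisper.text": {"video/full.audio"},
    "embeddings/clip.vector":  {"video/face_crop.frames"},
}

def stale_fields(changed: set[str]) -> set[str]:
    """Propagate staleness only along field-level dependency edges."""
    stale: set[str] = set()
    frontier = set(changed)
    while frontier:
        frontier = {
            field for field, deps in FIELD_DEPS.items()
            if deps & (frontier | stale) and field not in stale
        }
        stale |= frontier
    return stale

# A frame-side change (e.g. a new crop resolution) invalidates the
# frame-consuming branches but leaves the audio branch untouched:
stale_fields({"video/full.frames"})
```

With asset-level invalidation, `transcript/whisper.text` would have been recomputed here as well, even though it only reads the audio field.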

Field-aware dependency and provenance view

Metaxy addresses this by tracking provenance per sample and per field, then resolving incremental sets (new, stale, orphaned) with field-level lineage instead of coarse asset-level invalidation.

Partial data dependencies

Feature anatomy and version-aware updates

Consider three Metaxy features (data produced at each step): video/full, transcript/whisper, and video/face_crop.

Field-aware dependency and provenance view

Dependencies are field-level, not only feature-level. Each field version is computed from the upstream fields it depends on, and field versions are then combined into a feature version.

Example: transcript/whisper.text depends on the audio field of video/full. If video/full is resized or recropped (frame-side change), transcript/whisper does not need recomputation.

Metaxy detects these irrelevant updates by storing separate data versions per field, per sample, per feature, and only recomputes downstream fields affected by upstream changes.
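The versioning idea can be sketched with plain hashing. This is an illustration of the principle, not Metaxy's internal scheme: a field version is derived from the upstream field versions it depends on, so a frame-side change cannot perturb an audio-derived version:

```python
import hashlib

def version(*parts: str) -> str:
    """Derive a short content version from the given inputs."""
    return hashlib.sha256("|".join(parts).encode()).hexdigest()[:12]

# video/full has independent frame and audio fields
frames_v1 = version("frames", "crop=720p")
audio_v1 = version("audio", "pcm16")

# transcript/whisper.text depends only on the audio field
text_v1 = version("whisper.text", audio_v1)

# A frame-side change (new crop resolution) bumps the frames field only
frames_v2 = version("frames", "crop=1080p")

# The audio field is untouched, so the transcript's expected version
# is unchanged and no recomputation is triggered:
text_v2 = version("whisper.text", audio_v1)
assert text_v2 == text_v1
assert frames_v2 != frames_v1
```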

Incremental computations

The core operation is resolve_update. For a downstream feature, Metaxy computes the expected provenance and compares it with current state, returning exactly what changed.

with store:  # MetadataStore
    # Metaxy computes provenance_by_field and identifies changes
    increment = store.resolve_update(DownstreamFeature)

    # Process only the changed samples

In practice, pipeline code only needs to handle increment.new, increment.stale, and increment.orphaned, instead of reprocessing full datasets.
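The resulting handler logic is small. Below is a hypothetical stand-in for the increment object, using only the three fields named above (`new`, `stale`, `orphaned`); the real class comes from Metaxy, and `apply_increment` is an illustrative helper, not part of its API:

```python
from dataclasses import dataclass

@dataclass
class Increment:
    new: list       # samples never processed before
    stale: list     # samples whose upstream field versions changed
    orphaned: list  # samples whose upstream inputs disappeared

def apply_increment(increment: Increment, process, delete) -> None:
    # Touch only what changed; the rest of the dataset stays as-is.
    for sample in increment.new + increment.stale:
        process(sample)
    for sample in increment.orphaned:
        delete(sample)

inc = Increment(new=["s1"], stale=["s2"], orphaned=["s3"])
processed, deleted = [], []
apply_increment(inc, processed.append, deleted.append)
# processed == ["s1", "s2"], deleted == ["s3"]
```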

Composability

Metaxy is designed to stay composable with existing infrastructure choices. Dagster provides orchestration and observability, while engines such as Ray handle scale-out execution. This separation keeps control-plane logic stable while compute backends remain flexible across local, cloud, and HPC environments. Currently ClickHouse, BigQuery, DuckDB, Lance, DuckLake, and Delta Lake are supported as backends for Metaxy.

Use Case: RAG

Why Docling for RAG preprocessing?

Docling gives us a practical bridge from raw documents (PDF, Word, etc.) to structured markdown/text that is easier to use in LLM pipelines for chunking and embeddings.


With Metaxy on top, document conversion becomes incremental: changed documents are reprocessed, unchanged documents are skipped. More importantly, we can run extraction experiments on a subset of the corpus, materialize the full end-to-end subgraph, and merge the result back into main.

Why Ray?

Ray is useful because it keeps scale-out simple:

  • simple parallel map-style processing model
  • flexible worker sizing and batching
  • practical integration with existing Python data workloads
  • good fit for AI workloads with dynamic resource needs

Combined with Dagster-Slurm, this gives a clean control plane while still using HPC-native scheduling as the resource provider.

Hands-on example - Metaxy + Docling + Dagster-Slurm

Here is a minimal pseudocode sketch of the pipeline. For full details, see the referenced repository. We first register raw documents, then let Metaxy track provenance and resolve exactly what needs processing.

# 1) Register discovered docs with provenance fields
sources_samples = to_dataframe(files).with_columns(
    metaxy_provenance_by_field={
        "source_path": source_path,
        "file_size_bytes": file_size_bytes,
    }
)

with store:
    src_inc = store.resolve_update("docling/source_documents", samples=sources_samples)

with store.open("w"):
    store.write("docling/source_documents", src_inc.new)
    store.write("docling/source_documents", src_inc.stale)

# 2) Resolve what must be processed downstream
with store:
    inc = store.resolve_update("docling/converted_documents", versioning_engine="polars")

# Metaxy returns Increment(new, stale, orphaned)
to_process = pl.concat([inc.new.to_polars(), inc.stale.to_polars()]).filter(current_input_doc_uids)
if to_process.is_empty():
    return "up_to_date"

# 3) Scale-out execution with Ray (cost-balanced sharding)
shards = build_weighted_shards(to_process, num_workers)
result_ds = ray_dataset(shards).map_batches(DoclingMapper, batch_size=batch_size)

# 4) Persist results through Metaxy datasink
result_ds.write_datasink(MetaxyDatasink(feature="docling/converted_documents", store=store))
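The cost-balanced sharding in step 3 can be sketched as greedy bin packing: assign each document to the currently lightest shard, weighting by an estimated cost such as file size. The actual `build_weighted_shards` lives in the referenced repository; this version only illustrates the idea:

```python
import heapq

def build_weighted_shards(docs: list[dict], num_workers: int) -> list[list[dict]]:
    """Greedily assign each doc to the lightest shard, heaviest docs first."""
    shards: list[list[dict]] = [[] for _ in range(num_workers)]
    heap = [(0, i) for i in range(num_workers)]  # (total_cost, shard_index)
    for doc in sorted(docs, key=lambda d: d["file_size_bytes"], reverse=True):
        cost, i = heapq.heappop(heap)
        shards[i].append(doc)
        heapq.heappush(heap, (cost + doc["file_size_bytes"], i))
    return shards

docs = [{"path": f"doc{n}", "file_size_bytes": s}
        for n, s in enumerate([90, 10, 40, 60, 30, 70])]
shards = build_weighted_shards(docs, num_workers=2)
# Each shard carries roughly half of the 300 total bytes.
```

Balancing by estimated cost matters on Slurm, where a single oversized shard would keep one worker busy long after the others have finished.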

For full implementation details, see the referenced repository.

Contributing

We welcome contributions to both projects.

Special thanks to Daniel Gafni for the collaboration around Metaxy, and to Hernan for the Docling/document-processing collaboration.

Authors

Dr. Georg Heiler, senior data expert
Georg is a senior data expert at Magenta and an MLOps engineer at ASCII, where he solves challenges with data. His interests include geospatial graphs and time series. He is transitioning Magenta's data platform to the cloud and handles large-scale multimodal MLOps challenges at ASCII.

Daniel Gafni, data engineer

Hernan Picatto, researcher and data scientist
Hernan is a researcher at the Supply Chain Intelligence Institute Austria (ASCII). His research interests lie at the intersection of forecasting extreme events and causal analysis in high-frequency time series.