Docling + Metaxy: Patent Intelligence at Scale

Date
Apr 26, 2026 4:00 PM
Location
Online (Zoom, Linux Foundation)
Recording
Abstract
Patent corpora are an unforgiving stress test for document AI: documents are long, multilingual, dense with figures, tables, formulas and claim hierarchies, and the corpus keeps changing as new filings, translations, and corrections arrive. A naive batch pipeline either reprocesses everything (expensive on GPU and tokens) or silently goes stale.
This Docling Community Office Hours session shows how Jubust builds structured patent intelligence by combining:
- Docling for parsing — turning heterogeneous patent PDFs into a structured, reproducible representation,
- Metaxy as an incremental metadata control plane — tracking lineage at the level of individual fields per record, computing a precise diff when prompts, models, or upstream documents change, and handing that diff to whichever orchestrator or compute engine you already run,
- Ray and Dagster for execution at scale,
- a reviewer workspace that closes an active-learning loop: uncertain or high-value samples are routed to human experts first, and their corrections feed back into prompts, fine-tunes, and evaluation sets.
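The field-level diff that Metaxy hands to the orchestrator can be pictured as follows. This is an illustrative sketch in plain Python, not the Metaxy API: lineage is modeled as a mapping from (record, field) to the versions of everything that produced it, and the diff is exactly the set of fields whose recorded provenance no longer matches.

```python
# Illustrative sketch (not the Metaxy API): field-level lineage as a mapping
# from (record_id, field) to the provenance that produced the field. When a
# prompt, model, or upstream document changes, only the mismatching fields
# are handed to the orchestrator for recompute.
from dataclasses import dataclass

@dataclass(frozen=True)
class Provenance:
    doc_hash: str   # content hash of the source document
    model: str      # model version that produced the field
    prompt: str     # prompt version that produced the field

def diff_fields(lineage, current):
    """Return the (record_id, field) pairs whose provenance is stale.

    lineage: {(record_id, field): Provenance} recorded at the last run
    current: {(record_id, field): Provenance} desired for this run
    """
    return sorted(
        key for key, want in current.items()
        if lineage.get(key) != want
    )

# Example: bumping only the "claims" prompt invalidates only that field,
# so the "abstract" extraction is never re-run.
old = {
    ("US123", "claims"):   Provenance("h1", "gpt-4o", "claims-v1"),
    ("US123", "abstract"): Provenance("h1", "gpt-4o", "abs-v1"),
}
new = {
    ("US123", "claims"):   Provenance("h1", "gpt-4o", "claims-v2"),
    ("US123", "abstract"): Provenance("h1", "gpt-4o", "abs-v1"),
}
print(diff_fields(old, new))  # [('US123', 'claims')]
```

The resulting list of stale (record, field) pairs is what gets fanned out to Ray tasks or Dagster runs, which is how GPU and token spend stays scoped to what actually changed.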
Three takeaways on what this stack actually buys you:
- Topological caching of expensive AI work. Field-level lineage and a precise diff mean GPU and token spend is scoped to what truly changed downstream — incremental recompute at sample-and-field granularity instead of asset granularity.
- Efficient metadata access over multimodal data enables intelligent routing. Once per-sample, per-field metadata is queryable, the platform can route work conditionally — pick a parser configuration by document type, pick a vision model where it is needed, and keep the rest on cheap defaults.
- Provenance is non-negotiable in a world of AI slop and hallucinations. Every extracted field must be traceable back to the exact source document, page, region, and code/model version that produced it — so reviewers, downstream consumers, and auditors can verify claims instead of trusting a chat-style summary.
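The second takeaway, metadata-driven routing, can be sketched as a simple rule over per-sample metadata. The configuration names and thresholds below are hypothetical, not Jubust's actual setup; the point is that once metadata is cheaply queryable, the expensive vision path is chosen only where the document warrants it.

```python
# Hypothetical routing rule (parser/model names and thresholds are
# illustrative): pick a parser configuration and model per sample from its
# queryable metadata, keeping everything else on cheap defaults.

def route(meta):
    """meta: per-sample metadata dict -> (parser_config, model) choice."""
    # Formula-heavy or figure-dense documents justify the expensive path.
    if meta.get("has_formulas") or meta.get("figure_count", 0) > 10:
        return ("docling-full-ocr", "vision-large")
    # Non-English filings get a multilingual model.
    if meta.get("language") not in ("en", None):
        return ("docling-default", "multilingual-base")
    # Everything else stays on the cheap default.
    return ("docling-default", "text-small")

print(route({"has_formulas": True}))  # ('docling-full-ocr', 'vision-large')
print(route({"language": "de"}))      # ('docling-default', 'multilingual-base')
print(route({"figure_count": 2}))     # ('docling-default', 'text-small')
```

Because routing is just a function of stored metadata, the same diff machinery applies: changing a routing rule invalidates only the samples whose chosen path changes.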
Links
- Deck: jubust.com/decks/ibm-docling
- Recording: youtube.com/watch?v=7eyoqVMoguY
- Community office hours event: LinkedIn · Zoom (Linux Foundation)
- Docling: github.com/docling-project/docling
- Metaxy: docs.metaxy.io · github.com/anam-org/metaxy
- Jubust: jubust.com

Authors
Senior data expert
Georg is a co-founder @Jubust, a senior data expert at Magenta, and an ML-ops engineer at ASCII.
He solves challenges with data; his interests include geospatial graphs
and time series. Georg is transitioning Magenta's data platform to the cloud
and handles large-scale multimodal ML-ops challenges at ASCII.