Docling + Metaxy: Patent Intelligence at Scale

Sun, 26 Apr 2026 16:00:00 +0000

Recording

Abstract

Patent corpora are an unforgiving stress test for document AI: documents are long, multilingual, dense with figures, tables, formulas and claim hierarchies, and the corpus keeps changing as new filings, translations, and corrections arrive. A naive batch pipeline either reprocesses everything (expensive on GPU and tokens) or silently goes stale.

This Docling Community Office Hours session shows how to build structured patent intelligence by combining:

Docling for parsing: turning heterogeneous patent PDFs into a structured, reproducible representation,
Metaxy as an incremental metadata control plane: tracking lineage at the level of individual fields per record, computing a precise diff when prompts, models, or upstream documents change, and handing that diff to whichever orchestrator or compute engine you already run,
Ray and Dagster for execution at scale,
a reviewer workspace that closes an active-learning loop: uncertain or high-value samples are routed to humans first, and their corrections feed back into prompts, fine-tunes, and evaluation sets.
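
To make the field-level diff concrete, here is a minimal sketch in plain Python. It is not Metaxy's API: `FieldSpec`, `fingerprint`, `expected_versions`, and `diff` are hypothetical names invented for illustration, and the field names, prompts, and hashes are made up. The idea it illustrates is that each (document, field) pair gets a version derived from the document hash, the prompt, the model, and the versions of its upstream fields, so a change propagates only to what depends on it.

```python
import hashlib
import json
from dataclasses import dataclass


def fingerprint(*parts: str) -> str:
    """Stable short hash of everything a field's value depends on."""
    payload = json.dumps(parts, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


@dataclass(frozen=True)
class FieldSpec:
    """Hypothetical description of one extracted field."""
    name: str
    prompt: str
    model: str
    deps: tuple[str, ...] = ()  # names of upstream fields this field reads


def field_versions(doc_hash: str, specs: list[FieldSpec]) -> dict[str, str]:
    """Version every field of one document; specs must be in dependency order."""
    versions: dict[str, str] = {}
    for spec in specs:
        upstream = [versions[dep] for dep in spec.deps]
        versions[spec.name] = fingerprint(doc_hash, spec.prompt, spec.model, *upstream)
    return versions


def expected_versions(
    doc_hashes: dict[str, str], specs: list[FieldSpec]
) -> dict[tuple[str, str], str]:
    """Map (doc_id, field_name) to the version it should have right now."""
    return {
        (doc_id, name): version
        for doc_id, doc_hash in doc_hashes.items()
        for name, version in field_versions(doc_hash, specs).items()
    }


def diff(
    stored: dict[tuple[str, str], str], expected: dict[tuple[str, str], str]
) -> list[tuple[str, str]]:
    """Return the (doc_id, field_name) pairs that are missing or stale."""
    return sorted(key for key, version in expected.items() if stored.get(key) != version)


# Illustrative data only: two documents, three fields, one prompt change.
docs = {"EP-1": "hash-a", "EP-2": "hash-b"}
specs_v1 = [
    FieldSpec("title", "Extract the title.", "model-a"),
    FieldSpec("claims", "Extract the claims.", "model-a"),
    FieldSpec("claim_summary", "Summarize the claims.", "model-a", deps=("claims",)),
]
stored = expected_versions(docs, specs_v1)

specs_v2 = [
    specs_v1[0],
    FieldSpec("claims", "Extract independent claims only.", "model-a"),
    specs_v1[2],
]
stale = diff(stored, expected_versions(docs, specs_v2))
```

In this example, changing the `claims` prompt marks `claims` and its dependent `claim_summary` as stale for both documents, while `title` stays cached. The resulting list of pairs is what would be handed to an orchestrator such as Ray or Dagster for recomputation.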

Three takeaways on what this stack actually buys you:

Topological caching of expensive AI work. Field-level lineage and a precise diff mean GPU and token spend is scoped to what truly changed downstream — incremental recompute at sample-and-field granularity instead of asset granularity.
Efficient metadata access over multimodal data enables intelligent routing. Once per-sample, per-field metadata is queryable, the platform can route work conditionally — pick a parser configuration by document type, pick a vision model where it is needed, and keep the rest on cheap defaults.
Provenance is non-negotiable in a world of AI slop and hallucinations. Every extracted field must be traceable back to the exact source document, page, and region, and to the code and model version that produced it, so reviewers, downstream consumers, and auditors can verify claims instead of trusting a chat-style summary.
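
The routing and provenance takeaways can be sketched as two small data structures. This is an illustrative assumption, not the platform's actual schema: `SampleMeta`, `Route`, `route`, `Provenance`, and `ExtractedField` are hypothetical names, and the routing rules and example values are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleMeta:
    """Hypothetical per-sample metadata that is cheap to query."""
    doc_id: str
    doc_type: str      # assumed labels: "scanned" or "born_digital"
    figure_count: int


@dataclass(frozen=True)
class Route:
    parser_config: str
    use_vision_model: bool


def route(meta: SampleMeta) -> Route:
    """Pick the expensive path only where the metadata says it is needed."""
    parser_config = "ocr" if meta.doc_type == "scanned" else "default"
    return Route(parser_config, use_vision_model=meta.figure_count > 0)


@dataclass(frozen=True)
class Provenance:
    """Where an extracted value came from and what produced it."""
    doc_id: str
    page: int
    bbox: tuple[float, float, float, float]  # region on the page
    code_version: str
    model_version: str


@dataclass(frozen=True)
class ExtractedField:
    name: str
    value: str
    provenance: Provenance


# Illustrative values only.
scanned = SampleMeta("EP-1", "scanned", figure_count=3)
digital = SampleMeta("EP-2", "born_digital", figure_count=0)

field = ExtractedField(
    name="title",
    value="Example title",
    provenance=Provenance("EP-1", 1, (72.0, 90.0, 540.0, 120.0), "code-v1", "model-a"),
)
```

Here `route(scanned)` selects the OCR configuration and the vision model, while `route(digital)` stays on the cheap defaults. Keeping `Provenance` attached to every `ExtractedField` is what lets a reviewer jump from a value back to the page region and the versions that produced it.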
