Incremental Multimodal Graphs

At jubust we turn dense pharmaceutical patent PDFs into verified, source-linked structure-activity relationship (SAR) data. Get one value wrong and a research team can chase the wrong compound, so every datapoint has to trace back to the exact page it came from, and the expensive extraction behind it must never be recomputed without reason. Those needs, traceability and no wasted recompute, run through this whole post: they show up far beyond pharma, anywhere multimodal features are costly to produce.
Multimodal AI is moving from a side quest to the default shape of serious AI systems. Gartner expects multimodal generative AI to grow from 1% of GenAI solutions in 2023 to 40% by 2027, and surveys of multimodal foundation models now frame the field as a path from specialist models toward general-purpose assistants. The interesting data is no longer just text: it is image, video, audio, tables, documents, embeddings, model outputs, and the relationships between all of them. None of it is free: OCR, detection, embedding, captioning, layout parsing, speech transcription, and visual reasoning all burn time, money, and energy, and the resource-efficiency literature on multimodal models is explicit that the pressure only grows as the systems do. You pay to compute those features once, and you never want to pay again.
So whether the evidence is patents, public-safety archives, or anything else multimodal, the same question keeps coming back:
How do we keep multimodal features traceable, avoid recomputing expensive work, and still swap the graph engine when we want to explore relationships differently?
Graphs are where that exploration usually starts. They make entities, provenance, neighborhoods, paths, and evidence chains explicit. The growing GraphRAG literature is one expression of the same idea: vectors are useful, but relationships are often where the investigation begins.
The answer this post builds toward is to declare each feature once, as a kind of multimodal materialized view, and let the same facts feed whichever graph engine fits. To show the mechanics without the weight of the patent domain, the rest of the post uses a deliberately tiny “yajaaja, let’s poke at it” stand-in. Assume a multimodal pipeline has already detected faces in images. The detection itself, the slow and expensive model inference, is the part we skip: instead of running it, we mock its output with a handful of hand-written rows. For each detected face we have an image id, a person label, a workflow phase, a confidence score, and an embedding, and from those rows we derive a co-occurrence graph: two faces are connected if they appear in the same image context. The four faces are named for computing pioneers — Ada Lovelace, Grace Hopper, Katherine Johnson, and Alan Turing — which is about the only glamour this toy dataset gets.
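To make the mocked input concrete, here is a minimal sketch of what such hand-written rows and the derived co-occurrence edges could look like. It uses plain Python rather than the Polars frames the post works with. The image ids, the embeddings, the confidences for Katherine and Alan, and the `derive_co_occurrences` helper are all assumptions made for illustration, so the demo's real rows may differ.

```python
from itertools import combinations

# Hypothetical mock rows standing in for detector output. Image ids,
# embeddings, and the last two confidence values are illustrative only.
faces = [
    {"face_id": "ada", "image_id": "img_001", "person": "Ada",
     "phase": "conference", "confidence": 0.98, "embedding": [0.9, 0.1, 0.0]},
    {"face_id": "grace", "image_id": "img_001", "person": "Grace",
     "phase": "conference", "confidence": 0.96, "embedding": [0.1, 0.9, 0.0]},
    {"face_id": "katherine", "image_id": "img_002", "person": "Katherine",
     "phase": "archive", "confidence": 0.91, "embedding": [0.0, 0.2, 0.9]},
    {"face_id": "alan", "image_id": "img_003", "person": "Alan",
     "phase": "office", "confidence": 0.93, "embedding": [0.5, 0.5, 0.5]},
]


def derive_co_occurrences(rows):
    """Connect every pair of faces that share an image_id."""
    by_image = {}
    for row in rows:
        by_image.setdefault(row["image_id"], []).append(row)

    edges = []
    for image_id, group in sorted(by_image.items()):
        # Sort so the edge direction is deterministic (src < dst by face_id).
        group = sorted(group, key=lambda r: r["face_id"])
        for src, dst in combinations(group, 2):
            edges.append({
                "edge_id": f"{src['face_id']}-{dst['face_id']}-{image_id}",
                "src_face_id": src["face_id"],
                "dst_face_id": dst["face_id"],
                "image_id": image_id,
                # Assumes faces in the same image share a workflow phase.
                "phase": src["phase"],
            })
    return edges


edges = derive_co_occurrences(faces)
print(edges)
```

With these particular rows only Ada and Grace share an image, so a single edge comes out. A richer mock dataset would produce the longer paths used later in the benchmark.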
The point is the shape of the system: declare the feature once, then keep asking the same graph question while swapping everything underneath it. Here is the path the rest of the post walks:
- Store the Polars data in LanceDB and query it, with a scalar filter and a vector search.
- Track the same faces as Metaxy feature metadata in LanceDB, hashed with xxh3_64.
- Ask the face co-occurrence graph a question with Lance Graph Cypher.
- Swap Metaxy onto DuckDB and ask the same question with DuckPGQ.
- Put Lance Graph, DuckPGQ, and LadybugDB side by side on several graph questions as the toy dataset scales, and watch the fastest engine change with the query shape.
By the end the toy graph is the least interesting thing on the page. The durable result is a single feature model that fed every backend without changing shape.
The minimal data model
Everything downstream hangs off one declaration, so that is where we start. Metaxy feature declarations are typed metadata contracts: the embeddings stay in LanceDB for vector search, while the metadata feature tracks the durable, explainable facts we want to version.
class DetectedFaces(mx.BaseFeature, spec=mx.FeatureSpec(
    key="vision/detected_faces",
    id_columns=["face_id"],
    fields=["image_id", "person", "phase", "confidence"],
)):
    face_id: str
    image_id: str
    person: str
    phase: str
    confidence: float
The edge table is deliberately boring. It says which detected faces co-occurred and in which phase of the workflow.
class FaceCoOccurrences(mx.BaseFeature, spec=mx.FeatureSpec(
    key="vision/face_cooccurrences",
    id_columns=["edge_id"],
    fields=["src_face_id", "dst_face_id", "image_id", "phase"],
)):
    edge_id: str
    src_face_id: str
    dst_face_id: str
    image_id: str
    phase: str
LanceDB first
With the model declared, the data needs a home. LanceDB writes the local Polars frames directly, then answers two different questions over them: one filters by workflow phase, the other does a nearest-vector lookup over the face embedding. That second query is the multimodal part: the vector is expensive to compute, so we store it once and reuse it many times.
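faces_table = db.create_table("faces", data=faces)
db.create_table("co_occurs", data=edges)
# Scalar filter: a search without a query vector is a plain filtered scan.
conference_faces = faces_table.search().where("phase = 'conference'").to_polars()
# Vector search: nearest stored embedding to the query vector.
nearest = faces_table.search(query_embedding).limit(1).to_polars()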
Output:
Scalar filter: faces in the conference phase
| face_id | person | phase | confidence |
|---|---|---|---|
| ada | Ada | conference | 0.98 |
| grace | Grace | conference | 0.96 |
Vector search: nearest face embedding
| face_id | person | _distance |
|---|---|---|
| ada | Ada | 0.0003 |
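For intuition about what the nearest-vector lookup returns, the following sketch does the same thing by brute force in plain Python. It is not how LanceDB executes the query. The three-dimensional embeddings, the query vector, the choice of Euclidean distance, and the `nearest` helper are all assumptions made for illustration.

```python
import math

# Hypothetical 3-dimensional embeddings; real face embeddings are far larger.
embeddings = {
    "ada": [0.9, 0.1, 0.0],
    "grace": [0.1, 0.9, 0.0],
    "katherine": [0.0, 0.2, 0.9],
    "alan": [0.5, 0.5, 0.5],
}


def nearest(query, table, k=1):
    """Return the k closest (face_id, distance) pairs by Euclidean distance."""
    scored = [(math.dist(query, vector), face_id)
              for face_id, vector in table.items()]
    return [(face_id, distance) for distance, face_id in sorted(scored)[:k]]


# A query vector deliberately close to Ada's stored embedding.
query_embedding = [0.89, 0.11, 0.01]
print(nearest(query_embedding, embeddings))
```

The scan compares the query against every stored vector, which is fine for four rows. A vector database exists precisely so that this comparison does not have to be exhaustive at scale, and so that the expensive embedding is computed once and then only read.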
Metaxy on LanceDB
LanceDB now holds and serves the data. It does version its tables and can add columns without rewriting them, but that bookkeeping stays inside a single table and the Lance format, with no cross-table or cross-engine view of what is fresh, stale, or already computed. That gap is what Metaxy fills: the traceability and reuse promised at the top of this post, in a form that travels across tables and across engines. It adds versioned metadata columns on top of the same store. The Metaxy JOSS paper describes it as record-level feature metadata management for multimodal ML pipelines, with tags that line up exactly with this example: metadata, data lineage, incremental computation, caching, and multimodal data. This is the incremental-compute piece, and the mental model is a multimodal materialized view: expensive outputs computed once, stored with their provenance, and refreshed only where an input changed. So if a detection, embedding, or upstream feature has not changed, downstream work reuses it instead of blindly recomputing everything. The field-level provenance behind this, recording which inputs each record depends on and recomputing only what went stale, is the subject of a talk I gave at the Vienna Data Science Group on optimizing multimodal AI pipelines, including how it plugs into orchestrators like Dagster, Ray, and Slurm.
store = LanceDBMetadataStore(path, hash_algorithm=mx.HashAlgorithm.XXH3_64)
with store.open("w"):
    with_provenance = store.compute_provenance(DetectedFaces, faces)
    store.write(DetectedFaces, with_provenance, materialization_id="local-demo")
Output:
Read back Metaxy-managed face metadata
| face_id | person | phase |
|---|---|---|
| ada | Ada | conference |
| alan | Alan | office |
| grace | Grace | conference |
| katherine | Katherine | archive |
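The materialized-view idea can be sketched without any library: hash the tracked fields of each record, compare against the hash stored on the previous run, and recompute only the records whose hash changed. This is an illustration of the principle and not Metaxy's implementation. The helpers `record_version` and `stale_ids` are hypothetical, and `hashlib.blake2b` with an 8-byte digest stands in for xxh3_64 only because it ships with the standard library.

```python
import hashlib
import json

FIELDS = ["image_id", "person", "phase", "confidence"]


def record_version(record, fields):
    """Hash the tracked fields of one record into a short version string."""
    payload = json.dumps({f: record[f] for f in fields}, sort_keys=True)
    # blake2b is a stand-in here; the post's demo uses xxh3_64.
    return hashlib.blake2b(payload.encode("utf-8"), digest_size=8).hexdigest()


def stale_ids(records, stored_versions, id_column, fields):
    """Return ids whose current hash differs from (or is missing in) the store."""
    stale = []
    for record in records:
        key = record[id_column]
        if stored_versions.get(key) != record_version(record, fields):
            stale.append(key)
    return stale


# First run: everything is computed and its version is stored.
run_1 = [
    {"face_id": "ada", "image_id": "img_001", "person": "Ada",
     "phase": "conference", "confidence": 0.98},
    {"face_id": "grace", "image_id": "img_001", "person": "Grace",
     "phase": "conference", "confidence": 0.96},
]
stored = {r["face_id"]: record_version(r, FIELDS) for r in run_1}

# Second run: only Grace's confidence changed upstream (hypothetical change).
run_2 = [dict(run_1[0]), dict(run_1[1], confidence=0.97)]
print(stale_ids(run_2, stored, "face_id", FIELDS))  # ['grace']
```

Only the changed record is flagged, so downstream work for Ada would be reused as-is. A record that has never been stored is flagged as well, since its lookup in the store returns nothing.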
Lance Graph Cypher
So far the data has only been filtered and searched. Now it becomes a graph. Lance Graph receives Arrow tables from LanceDB and a tiny graph configuration: Face nodes are keyed by face_id, and CO_OCCURS_WITH edges point from src_face_id to dst_face_id. This keeps the data in the Lance ecosystem while giving us Cypher-style exploration. The question stays the same throughout: which faces co-occurred during the conference phase? Lance Graph answers it first.
config = GraphConfig.builder() \
    .with_node_label("Face", "face_id") \
    .with_relationship("CO_OCCURS_WITH", "src_face_id", "dst_face_id") \
    .build()
result = (
    CypherQuery(
        "MATCH (a:Face)-[r:CO_OCCURS_WITH]->(b:Face) "
        "WHERE r.phase = 'conference' "
        "RETURN a.person AS source, b.person AS target, r.phase AS phase"
    )
    .with_config(config)
    .execute({
        "Face": db.open_table("faces").to_arrow(),
        "CO_OCCURS_WITH": db.open_table("co_occurs").to_arrow(),
    })
)
Output:
| source | target | phase |
|---|---|---|
| Ada | Grace | conference |
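Logically, this one-hop pattern is a filter on the edge table followed by two key lookups into the node table. The sketch below spells that out in plain Python. It says nothing about how Lance Graph actually plans the query. The `one_hop` helper is hypothetical, and the second edge is invented so that the phase filter has something to exclude.

```python
faces = [
    {"face_id": "ada", "person": "Ada"},
    {"face_id": "grace", "person": "Grace"},
    {"face_id": "katherine", "person": "Katherine"},
]
edges = [
    {"src_face_id": "ada", "dst_face_id": "grace", "phase": "conference"},
    # Hypothetical extra edge, present only to exercise the filter.
    {"src_face_id": "grace", "dst_face_id": "katherine", "phase": "archive"},
]


def one_hop(faces, edges, phase):
    """Resolve (a)-[r]->(b) where r.phase matches, returning person names."""
    person_by_id = {face["face_id"]: face["person"] for face in faces}
    rows = []
    for edge in edges:
        if edge["phase"] != phase:
            continue  # the WHERE clause on the relationship
        rows.append({
            "source": person_by_id[edge["src_face_id"]],  # node a
            "target": person_by_id[edge["dst_face_id"]],  # node b
            "phase": edge["phase"],
        })
    return rows


print(one_hop(faces, edges, "conference"))
```

Seeing the pattern as a filter plus lookups also explains why a relational engine can answer it well: the predicate can be applied before any joining happens.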
DuckDB plus DuckPGQ
Now we change the engine underneath. Metaxy writes the same feature metadata into DuckDB instead of LanceDB, and DuckPGQ builds a temporary property graph over those tables. The question does not change; the engine answering it does.
CREATE PROPERTY GRAPH face_graph
VERTEX TABLES (Face)
EDGE TABLES (
    CoOccurs SOURCE KEY (src_face_id) REFERENCES Face (face_id)
             DESTINATION KEY (dst_face_id) REFERENCES Face (face_id)
             LABEL CO_OCCURS_WITH
);
FROM GRAPH_TABLE (
    face_graph
    MATCH (a:Face)-[r:CO_OCCURS_WITH]->(b:Face)
    WHERE r.phase = 'conference'
    COLUMNS (a.person AS source, b.person AS target, r.phase AS phase)
);
Output:
| source | target | phase |
|---|---|---|
| Ada | Grace | conference |
Ada and Grace, at the conference: the same row Lance Graph returned, now from a different engine. The facts are portable; the engine is a choice.
Tiny traversal benchmark
If the engines all return the same answer, the next question is what each one costs. The final step adds LadybugDB and times all three engines across two kinds of questions on the same data: predicate-heavy lookups, and topology-only multi-hop path counts. Graph benchmarks are easy to abuse, so this one stays deliberately modest: warm up, then run several measured rounds, with a fresh setup before each. The setup is not part of the timed section: Metaxy writes, DuckDB reads, Ladybug table creation/import into its own storage, and graph construction all happen before the clock starts. The timed section is query execution plus result materialization; validation runs after the timer stops. This run uses 1 warmup and 3 measured rounds.
The five graph questions deliberately separate predicate pushdown from path traversal:
-- Filtered one-hop edge lookup.
MATCH (a:Face)-[r:CO_OCCURS_WITH]->(b:Face)
WHERE a.person = 'Ada' AND r.phase = 'conference'
RETURN a.person AS source, b.person AS target, r.phase AS phase
-- Filtered two-hop path through Grace.
MATCH (a:Face)-[r1:CO_OCCURS_WITH]->(m:Face)-[r2:CO_OCCURS_WITH]->(b:Face)
WHERE a.person = 'Ada' AND m.person = 'Grace'
RETURN a.person AS source, b.person AS target, r2.phase AS phase
-- Topological two-hop path count, no property filter.
MATCH (a:Face)-[r1:CO_OCCURS_WITH]->(m:Face)-[r2:CO_OCCURS_WITH]->(b:Face)
RETURN count(*) AS paths
-- Topological three-hop path count, no property filter.
MATCH (a:Face)-[r1:CO_OCCURS_WITH]->(m1:Face)-[r2:CO_OCCURS_WITH]->(m2:Face)-[r3:CO_OCCURS_WITH]->(b:Face)
RETURN count(*) AS paths
-- Topological four-hop path count, no property filter.
MATCH (a:Face)-[r1:CO_OCCURS_WITH]->(m1:Face)-[r2:CO_OCCURS_WITH]->(m2:Face)-[r3:CO_OCCURS_WITH]->(m3:Face)-[r4:CO_OCCURS_WITH]->(b:Face)
RETURN count(*) AS paths
Every round checks that each engine returns the expected answer, so the timings come only from queries that produced the right rows.
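The timing discipline described above can be sketched as a small harness: a fresh setup before every round, the clock wrapped only around query execution and result materialization, and validation after the timer stops. The `benchmark` function and the in-memory stand-in workload are hypothetical and are not the post's benchmark code. Reporting the median is also an assumption, since the post does not say how rounds are aggregated.

```python
import statistics
import time


def benchmark(setup, query, validate, warmups=1, rounds=3):
    """Time `query` over measured rounds, excluding setup and validation."""
    timings_ms = []
    for i in range(warmups + rounds):
        state = setup()                     # untimed: fresh setup each round
        start = time.perf_counter()
        rows = list(query(state))           # timed: execute and materialize
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        validate(rows)                      # untimed: runs after the timer
        if i >= warmups:                    # warmup rounds are discarded
            timings_ms.append(elapsed_ms)
    return statistics.median(timings_ms), timings_ms


# Toy in-memory stand-in for an engine, used only to exercise the harness.
def setup():
    return [("ada", "grace", "conference"), ("grace", "alan", "office")]


def query(edge_rows):
    return (row for row in edge_rows if row[2] == "conference")


def validate(rows):
    if rows != [("ada", "grace", "conference")]:
        raise AssertionError(f"unexpected rows: {rows}")


median_ms, timings = benchmark(setup, query, validate)
print(f"median {median_ms:.4f} ms over {len(timings)} measured rounds")
```

Because validation raises on a wrong answer, a round that returns unexpected rows aborts the run instead of contributing a timing.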
Unlike DuckPGQ, which queries the DuckDB tables directly, this LadybugDB demo first loads the same Metaxy output into a Ladybug database, closes it, and reopens that native store for the timed query:
exported_faces = duckdb.sql("SELECT face_id, person, phase FROM vision__detected_faces").pl()
exported_edges = duckdb.sql("SELECT src_face_id, dst_face_id, edge_id, image_id, phase FROM vision__face_cooccurrences").pl()
conn.execute("CREATE NODE TABLE Face(face_id STRING PRIMARY KEY, person STRING, phase STRING)")
conn.execute("CREATE REL TABLE CO_OCCURS_WITH(FROM Face TO Face, edge_id STRING, image_id STRING, phase STRING)")
conn.execute("COPY Face FROM exported_faces")
conn.execute("COPY CO_OCCURS_WITH FROM exported_edges")
conn.close()
database.close()
database = lb.Database("ladybug.db", read_only=True)
conn = lb.Connection(database)
Below is one chart per query, all using the same benchmark settings. Each dot shows query time relative to Lance Graph at that dataset size. Values below 1.0x are faster than Lance Graph; values above 1.0x are slower. The x-axis is dataset size (400, 4k, 40k, 400k, and 4M nodes/edges).
These are local toy numbers, not a universal database ranking. After keeping connection cleanup and Ladybug import out of the timed section, the split is clearer: Lance Graph is strongest once predicates enter the plan, while LadybugDB shines when the query can stay topology-shaped and count paths. At 4M nodes/edges, the filtered two-hop query lands at 468.1 ms for Lance Graph, 578.7 ms for DuckPGQ, and 1117 ms for LadybugDB. On the topology-only two-hop count, LadybugDB flips that result: 47.78 ms versus 218.8 ms for DuckPGQ and 649.8 ms for Lance Graph.
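To connect those milliseconds to the relative scale used in the charts, the arithmetic is a division by the Lance Graph timing at the same dataset size. The timings below are the 4M figures quoted above. The `relative_to` helper is a hypothetical name used only for this sketch.

```python
# Timings in milliseconds at 4M nodes/edges, as quoted in the text.
filtered_two_hop_ms = {"Lance Graph": 468.1, "DuckPGQ": 578.7, "LadybugDB": 1117.0}
topology_two_hop_ms = {"Lance Graph": 649.8, "DuckPGQ": 218.8, "LadybugDB": 47.78}


def relative_to(timings, baseline="Lance Graph"):
    """Express each engine's time as a multiple of the baseline engine's time."""
    base = timings[baseline]
    return {engine: round(ms / base, 2) for engine, ms in timings.items()}


print(relative_to(filtered_two_hop_ms))
# {'Lance Graph': 1.0, 'DuckPGQ': 1.24, 'LadybugDB': 2.39}
print(relative_to(topology_two_hop_ms))
# {'Lance Graph': 1.0, 'DuckPGQ': 0.34, 'LadybugDB': 0.07}
```

On the filtered query both other engines sit above 1.0x, while on the topology-only count LadybugDB takes roughly 7% of Lance Graph's time.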
That shape is closer to what the engine designs suggest. Lance Graph today still gets a lot of performance from DataFusion’s relational planning and predicate handling, even though its path expansion is not a dedicated graph physical plan. LadybugDB already has graph-specific factorization, which matters most on pure traversal/count workloads. A planned native CSR adjacency index, plus more Lance-native physical operators, should help Lance Graph on those traversal-heavy shapes. So the result is not “Lance wins” or “Ladybug wins”; it is predicate pushdown versus factorized traversal. Benchmark on your own data and queries.
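One intuition for why a topology-only count can be cheap: counting k-hop paths does not require materializing each path. The sketch below keeps one running count per node and pushes it along the edges once per hop. It illustrates the counting idea only and is not LadybugDB's factorization. The edge list is hypothetical, and the function counts walks, so an engine that forbids repeated edges or nodes could report different numbers on cyclic data.

```python
from collections import Counter


def count_walks(edges, hops):
    """Count directed walks with exactly `hops` edges, without enumerating them."""
    out_edges = {}
    nodes = set()
    for src, dst in edges:
        out_edges.setdefault(src, []).append(dst)
        nodes.update((src, dst))

    # ways[node] = number of walks of the current length that end at node.
    ways = Counter({node: 1 for node in nodes})  # zero hops: one per node
    for _ in range(hops):
        next_ways = Counter()
        for src, count in ways.items():
            for dst in out_edges.get(src, []):
                next_ways[dst] += count
        ways = next_ways
    return sum(ways.values())


# Hypothetical co-occurrence edges, directed from src to dst.
edges = [
    ("ada", "grace"),
    ("grace", "katherine"),
    ("katherine", "alan"),
    ("ada", "katherine"),
]
for hops in (2, 3, 4):
    print(hops, count_walks(edges, hops))
# 2 3
# 3 1
# 4 0
```

Each hop costs one pass over the edges regardless of how many paths exist, whereas enumerating every path grows with the path count itself. Once a predicate on a node or edge property enters the query, the per-node counts have to be kept per qualifying value, which is where filter handling starts to matter more.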
What this shows
Go back to the question this started with: keep multimodal features traceable, avoid recomputing expensive work, and still swap the graph engine when you want to explore differently. The tiny example answers all three with one feature model that moved, unchanged, across:
- LanceDB for local multimodal and vector data.
- Metaxy as a multimodal materialized view: feature metadata and materialization tracking, here using xxh3_64.
- Lance Graph for Cypher over Lance-backed tables.
- DuckDB plus DuckPGQ for SQL/PGQ over the same Metaxy-shaped metadata.
- LadybugDB as a separate Cypher target, with this demo importing the Metaxy-backed DuckDB tables into Ladybug storage before timing the query.
The graph stayed deliberately boring. That was the point: swap the engine, scale the data, ask the same question, and the answer comes back unchanged. Only the cost moves, and the benchmark shows it moves a lot, which is exactly why being free to pick the engine that fits each query shape is worth the plumbing. That portability is what we lean on at jubust, carrying the same shape from a toy face graph to verified data pulled out of patent PDFs.
Runnable local project files
The complete working example is kept with the post bundle:
The bundle pins DuckDB 1.4.4, since DuckPGQ is not yet available on DuckDB 1.5.x.
References:
- https://joss.theoj.org/papers/10.21105/joss.10449
- https://www.gartner.com/en/newsroom/press-releases/2024-09-09-gartner-predicts-40-percent-of-generative-ai-solutions-will-be-multimodal-by-2027
- https://arxiv.org/abs/2309.10020
- https://arxiv.org/abs/2405.10739
- https://arxiv.org/abs/2501.00309
- https://docs.lancedb.com/tables/index#from-polars-dataframes
- https://docs.metaxy.io/stable/guide/quickstart/quickstart/
- https://docs.metaxy.io/stable/integrations/metadata-stores/databases/lancedb/
- https://github.com/lance-format/lance-graph
- https://duckdb.org/docs/stable/guides/sql_features/graph_queries
- https://duckpgq.org/
- https://docs.ladybugdb.com/
- https://georgheiler.com/event/vdsg-optimizing-multimodal-ai-pipelines-with-metaxy/
- https://jubust.com/
