The Bitter Lesson Stops at the Lab Door

Apr 14, 2026 · Dr. Georg Heiler · 9 min read

Rich Sutton’s Bitter Lesson still holds: general methods that leverage computation beat approaches built on hand-crafted human knowledge.

But the Bitter Lesson describes the direction of AI research. It does not explain what happens when you try to make AI useful inside an actual organization.

In deployment, deterministic code, typed interfaces, review steps, and domain constraints are not relics. They are the load-bearing structure around the model. The model handles what is fuzzy. Everything around it has to stay crisp enough that people can trust the result.

The framing I keep coming back to is simple: Connection, Context, and Control. I first heard the idea on the Linear Digressions podcast, and it has stuck with me because it points to where the real implementation work sits.

If you want a rough map of where AI value is getting built now, this is mine: not only inside the model, but in the systems that connect it, ground it, and constrain it. That is also why I am more interested in products that solve those layers than in yet another thin wrapper over the same model APIs.

The 3C framework: Connection, Context, and Control bridge the gap between model capability and organizational deployment.

The deployment gap

The numbers are imperfect, and most of them come from firms that also sell AI services. Even so, the directional signal is consistent.

  • Deloitte’s 2025 reporting says roughly 95% of generative AI pilots still fail to reach production.
  • Menlo Ventures reports that only 8.6% of organizations have AI agents in production.
  • Deloitte also reports that many CEOs still say they have seen little or no value from AI adoption.

That should not be surprising. Most organizations were never designed to consume probabilistic output.

If an LLM can summarize a document or extract a field, that is not yet a product. Somebody still has to decide where that output enters a system, how it is validated, what happens when it is wrong, who owns the exception path, and whether the surrounding business process can absorb it at all.

Not every process should be turned into a model call. If the task is stable, well-understood, and auditable, deterministic software is usually better. The pattern that works today is narrower: use models at the messy boundary, then hand off to structured systems as quickly as possible.

The deployment pattern: AI parses unstructured input, a verifier enforces contracts at the boundary, and the deterministic pipeline handles the rest.
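The verifier in that pattern can be sketched in a few lines. This is a minimal sketch, not a production design: the `InvoiceLine` contract and its field names are invented for illustration, and a plain list stands in for a real review queue; a real system would likely use a schema library such as pydantic and proper queueing.

```python
from dataclasses import dataclass

# Hypothetical contract for one parsed record; field names are illustrative.
@dataclass(frozen=True)
class InvoiceLine:
    sku: str
    quantity: int
    unit_price_cents: int

def verify(raw: dict) -> InvoiceLine:
    """Enforce the contract at the boundary: reject anything the
    deterministic pipeline downstream cannot safely consume."""
    line = InvoiceLine(
        sku=str(raw["sku"]).strip(),
        quantity=int(raw["quantity"]),
        unit_price_cents=int(raw["unit_price_cents"]),
    )
    if not line.sku:
        raise ValueError("empty sku")
    if line.quantity <= 0 or line.unit_price_cents < 0:
        raise ValueError("implausible quantity or price")
    return line

def ingest(model_output: dict, review_queue: list):
    """Valid records flow into the pipeline; everything else goes to
    human review instead of silently propagating."""
    try:
        return verify(model_output)
    except (KeyError, ValueError, TypeError) as exc:
        review_queue.append((model_output, str(exc)))
        return None
```

The important property is the asymmetry: the model may emit anything, but only records that satisfy the contract ever reach the deterministic side.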

The 3Cs are a way to think about that handoff.

Connection

Models live behind APIs. They do nothing on their own. Someone has to connect them to the place where the problem actually lives: a workflow, a business system, an approval chain, a reviewer queue, a machine, or a human operator.

Sometimes the connection layer is literal. In AI for science, models only become useful once they are connected to instruments, experiments, and domain workflows.

A model that extracts invoice line items is not valuable because the demo looks good. It becomes valuable when it is wired into authentication, field mapping, exception handling, observability, and the ERP system that actually decides what gets paid.
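Much of that wiring is mundane. Here is a minimal sketch of just the field-mapping piece, with the `FIELD_MAP` table and every field name invented for illustration; real ERP schemas are far messier, and the point is only that unmapped fields become an explicit exception path rather than silently disappearing.

```python
# Hypothetical mapping from model output fields to an ERP's schema.
FIELD_MAP = {"vendor": "supplier_id", "total": "amount_cents", "due": "due_date"}

def map_to_erp(extracted: dict):
    """Translate extracted fields into the ERP schema, collecting
    anything unmapped as an exception instead of dropping it."""
    mapped, unmapped = {}, []
    for key, value in extracted.items():
        if key in FIELD_MAP:
            mapped[FIELD_MAP[key]] = value
        else:
            unmapped.append(key)
    return mapped, unmapped
```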

The same is true in higher-stakes knowledge work. If a model helps parse complex source material, somebody still has to connect that output to the domain workflow that consumes it, the reviewer who checks edge cases, and the system of record that people trust.

I work with data engineers who spend much of their time on exactly this layer: connectors, orchestration, brittle upstream systems, and all the operational glue between a promising model and a useful system.

This is why so many AI pilots stall between demo and deployment. The missing piece is usually not intelligence. It is the connection layer: the integration work nobody sees in the benchmark, plus the organizational work required to get security, operations, and the budget owner to trust the system enough to use it.

Context

Context is not just “give the model more tokens.” The deeper problem is making an organization’s information legible enough that a model and the surrounding software can do something reliable with it.

Most AI demos assume clean data and clear semantics. Real organizations have PDFs, half-documented schemas, conflicting definitions, stale wikis, brittle jobs, and critical knowledge trapped in a few people’s heads. The gap is not only retrieval. It is structure: lineage, dependencies, freshness, provenance, and the ability to explore safely.

That makes Context one of the most interesting infrastructure layers right now. I think Bauplan and Metaxy are a particularly good pair of examples because they attack different parts of the same problem.

Bauplan is interesting because it treats exploration as a first-class concern. Its paper on supervaluationism for lakehouses points toward a world where agents can reason across branching or inconsistent data states instead of requiring one prematurely canonical view. That matters when you are still testing hypotheses and do not yet want the cost or rigidity of collapsing everything into one official version.

Metaxy picks up the problem once change becomes operational. When a dataset changes, when a prompt changes, or when a model changes, you want to know exactly which downstream artifacts are affected and which ones are not. Without that, experimentation is slow and expensive: you rerun too much, you trust stale outputs, or you avoid testing ideas because the feedback loop is too costly.

Metaxy’s lineage model attacks that directly. If you can trace the downstream impact of a change precisely, you can recompute only the affected slice instead of retraining or reprocessing everything. That is a cost win, but it is also a context win: it makes the system easier to reason about, easier to debug, and safer to evolve.
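At its core, that recompute decision is a reachability query over the lineage graph. The sketch below is my own illustration, not Metaxy's implementation; the edge list and node names are invented.

```python
from collections import defaultdict, deque

def downstream(edges, changed):
    """Given lineage edges (upstream -> downstream) and a changed node,
    return exactly the set of affected artifacts, so only that slice
    needs to be recomputed."""
    children = defaultdict(list)
    for up, down in edges:
        children[up].append(down)
    affected, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected
```

Everything outside the returned set can keep its cached outputs, which is where the cost and feedback-loop wins come from.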

Put differently: Bauplan makes exploration across uncertain states cheaper; Metaxy makes propagation of concrete changes cheaper. Together they point to a stronger version of context than “just retrieve more text.”

Another way to say the same thing: making dependencies explicit is itself a form of context engineering. Most organizations still have jobs that silently depend on other jobs, tables that break when upstream schemas change, and freshness assumptions that live in somebody’s head.

That is why systems like Dagster matter here. When dependencies become first-class, operational knowledge stops living only in tribal memory and starts living in the graph. Magenta Telekom’s data platform is a good example of that pattern: guardrails around IO, experimentation, and security emerged from the structure of the graph rather than from a checklist somebody had to remember.

So when I say Context, I do not just mean better retrieval. I mean making the surrounding data system explicit enough that both humans and machines can ask, “What does this depend on?”, “What changed?”, and “What needs to be rerun?” That is the layer where systems like Bauplan and Metaxy can become foundational. It is also where Context starts reinforcing Control: better provenance and cheaper exploration make the review loop cheaper to run.

Control

Even with strong connection and good context, a system still fails if nobody can tell when the model is wrong.

That is not a theoretical problem. In February 2026, reporting on AWS’s Kiro coding tool described a 13-hour disruption affecting AWS Cost Explorer in one China region after the tool was allowed to make changes and reportedly chose to “delete and recreate” part of its environment. Amazon disputed the characterization and said the underlying issue was misconfigured access controls rather than an AI judgment failure. Either way, the lesson is the same: once an agent can act on production systems, the surrounding controls matter more than the demo.

Control is about the surrounding mechanisms that make model behavior safe to use. Not micromanaging the model, but building validation, traceability, escalation paths, human review, and feedback loops that improve the system over time.

This is where Jubust is a strong example. What makes it interesting is not just that it uses AI on pharma patents. The interesting part is the operating model around the model: structured extraction, traceability back to source material, review of uncertain cases, and a path for reviewed corrections to improve future performance. That is what turns model output into something a serious domain can actually use.

That pattern is much more important than the buzzword layer around it. In many real systems, uncertainty is not handled by one magical confidence score. It is estimated through a mix of model signals, validators, disagreement checks, heuristics, and domain-specific rules. The point is not the exact mechanism. The point is that uncertain outputs do not silently flow into downstream decisions.
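One way to sketch that routing logic, with every signal invented for illustration: the structure matters, not the specific checks. Any failing check routes the output to review instead of letting it flow downstream.

```python
def route(output, signals):
    """Combine independent checks (validators, disagreement checks,
    domain rules); any failure routes the output to human review
    rather than letting it flow downstream silently."""
    failures = [name for name, check in signals if not check(output)]
    return "review:" + ",".join(failures) if failures else "accept"
```

Usage is just a list of named predicates, which keeps the escalation policy auditable: the review entry records which checks failed, not just that something did.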

That is what makes human-in-the-loop systems economically and operationally viable. The review step is not just a tax. Done well, it is a bootstrapping mechanism. You catch the cases that matter, create an audit trail, and turn corrections into training data or product improvements. Over time, the review burden can shrink because the system gets better on the cases that used to fail.

This is also why vertical AI often looks stronger than generic “AI for everything” products. In a domain with clear stakes, clear operators, and clear exception paths, Control can be designed into the product. In a vague horizontal workflow, trust usually breaks first.

The EU AI Act pushes in the same direction for high-risk systems with its human oversight requirements, but regulation is only part of the story. Even where oversight is not mandated, unreliable output destroys trust faster than any product team can rebuild it.

Why the 3Cs matter

The easiest mistake in AI discourse is to assume that once the base models get better, this surrounding work disappears. I do not think that is what happens.

As models improve, some existing control points will shrink and some kinds of integration will get easier. But every capability jump also opens a new application surface, and that new surface brings fresh connection work, fresh context problems, and fresh control requirements.

The 3Cs are useful because they tell you where the durable work sits after the demo, and where defensible product value can still be built even as base models improve.

  • Connection gets the model into the real system.
  • Context gives it something coherent to operate on.
  • Control makes the result trustworthy enough to use.

If I had to point to the kinds of products that fit this thesis, I would point to Metaxy on the context side and Jubust on the control side. Neither treats the model as the whole product. Metaxy makes change legible; Jubust makes uncertainty operable. They live in the gap between raw model capability and operational usefulness, and that gap is where a large share of the next AI infrastructure wave will be built.


Georg is a co-founder of Jubust, a senior data expert at Magenta, and an ML-ops engineer at ASCII. He solves challenges with data; his interests include geospatial graphs and time series. Georg is transitioning Magenta's data platform to the cloud and handles large-scale multi-modal ML-ops challenges at ASCII.