<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Ai |</title><link>https://georgheiler.com/tags/ai/</link><atom:link href="https://georgheiler.com/tags/ai/index.xml" rel="self" type="application/rss+xml"/><description>Ai</description><generator>HugoBlox Kit (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Tue, 14 Apr 2026 00:00:00 +0000</lastBuildDate><image><url>https://georgheiler.com/media/icon_hu_2b4e40339850646.png</url><title>Ai</title><link>https://georgheiler.com/tags/ai/</link></image><item><title>The Bitter Lesson Stops at the Lab Door</title><link>https://georgheiler.com/2026/04/14/the-bitter-lesson-stops-at-the-lab-door/</link><pubDate>Tue, 14 Apr 2026 00:00:00 +0000</pubDate><guid>https://georgheiler.com/2026/04/14/the-bitter-lesson-stops-at-the-lab-door/</guid><description>&lt;p&gt;Rich Sutton&amp;rsquo;s
&lt;em&gt;Bitter Lesson&lt;/em&gt; still holds.
General methods that scale with computation beat hand-crafted knowledge for &lt;em&gt;learning&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;But the Bitter Lesson describes the direction of AI &lt;em&gt;research&lt;/em&gt;. It does not explain what happens when you try to make AI useful inside an actual organization.&lt;/p&gt;
&lt;p&gt;In deployment, deterministic code, typed interfaces, review steps, and domain constraints are not relics. They are the load-bearing structure around the model. The model handles what is fuzzy. Everything around it has to stay crisp enough that people can trust the result.&lt;/p&gt;
&lt;p&gt;The framing I keep coming back to is simple: &lt;strong&gt;Connection&lt;/strong&gt;, &lt;strong&gt;Context&lt;/strong&gt;, and &lt;strong&gt;Control&lt;/strong&gt;.
I first heard the idea on the Linear Digressions podcast, and it has stuck with me because it points to where the real implementation work sits.&lt;/p&gt;
&lt;p&gt;If you want a rough map of where AI value is getting built now, this is mine: not only inside the model, but in the systems that connect it, ground it, and constrain it.
That is also why I am more interested in products that solve those layers than in yet another thin wrapper over the same model APIs.&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="The 3C framework: Connection, Context, and Control bridge the gap between model capability and organizational deployment."
src="https://georgheiler.com/2026/04/14/the-bitter-lesson-stops-at-the-lab-door/3c-framework.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h2 id="the-deployment-gap"&gt;The deployment gap&lt;/h2&gt;
&lt;p&gt;The numbers are imperfect, and most of them come from firms that also sell AI services. Even so, the directional signal is consistent.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Deloitte&amp;rsquo;s 2025 reporting says roughly 95% of generative AI pilots still fail to reach production.&lt;/li&gt;
&lt;li&gt;Menlo Ventures reports that only 8.6% of organizations have AI agents in production.&lt;/li&gt;
&lt;li&gt;Deloitte also reports that many CEOs still say they have seen little or no value from AI adoption.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That should not be surprising. Most organizations were never designed to consume probabilistic output.&lt;/p&gt;
&lt;p&gt;If an LLM can summarize a document or extract a field, that is not yet a product. Somebody still has to decide where that output enters a system, how it is validated, what happens when it is wrong, who owns the exception path, and whether the surrounding business process can absorb it at all.&lt;/p&gt;
&lt;p&gt;Not every process should be turned into a model call.
If the task is stable, well-understood, and auditable, deterministic software is usually better. The pattern that works today is narrower: use models at the messy boundary, then hand off to structured systems as quickly as possible.&lt;/p&gt;
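&lt;p&gt;As a minimal sketch of that handoff, assuming a hypothetical invoice-extraction flow (the schema, field names, and routing targets are illustrative, not from any specific system): model output is treated as untrusted input, checked against an explicit contract, and only then handed to deterministic code.&lt;/p&gt;

```python
# Hypothetical boundary between a model call and a deterministic pipeline.
# Schema, field names, and routes are illustrative assumptions.

REQUIRED_FIELDS = {"invoice_id": str, "amount_cents": int, "currency": str}

def validate(extraction):
    """Check the model's output against the contract; return (record, errors)."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        value = extraction.get(field)
        if not isinstance(value, expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}")
    record = {field: extraction.get(field) for field in REQUIRED_FIELDS}
    return record, errors

def handle(extraction):
    """Route valid records downstream; everything else goes to a human-owned queue."""
    record, errors = validate(extraction)
    if errors:
        # Malformed or incomplete output never flows silently into the ERP.
        return {"route": "exception_queue", "errors": errors}
    return {"route": "erp_pipeline", "record": record}
```

&lt;p&gt;The crisp part is everything outside the model call: the contract is explicit, and the failure path has an owner.&lt;/p&gt;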
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="The deployment pattern: AI parses unstructured input, a verifier enforces contracts at the boundary, and the deterministic pipeline handles the rest."
src="https://georgheiler.com/2026/04/14/the-bitter-lesson-stops-at-the-lab-door/deployment-pattern.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;The 3Cs are a way to think about that handoff.&lt;/p&gt;
&lt;h2 id="connection"&gt;Connection&lt;/h2&gt;
&lt;p&gt;Models live behind APIs. They do nothing on their own.
Someone has to connect them to the place where the problem actually lives: a workflow, a business system, an approval chain, a reviewer queue, a machine, or a human operator.&lt;/p&gt;
&lt;p&gt;Sometimes the connection layer is literal. In AI for science, models only become useful once they are connected to instruments, experiments, and domain workflows:&lt;/p&gt;
&lt;blockquote class="twitter-tweet"&gt;&lt;p lang="en" dir="ltr"&gt;We’re thrilled to open-source LabClaw — the Skill Operating Layer for LabOS by Stanford-Princeton Team&lt;br&gt;&lt;br&gt;One command turns any OpenClaw agent into a full AI Co-Scientist.&lt;br&gt;&lt;br&gt;Demo: &lt;a href="https://t.co/TgGtKO2lxQ"&gt;https://t.co/TgGtKO2lxQ&lt;/a&gt;&lt;br&gt;Dragon Shrimp Army reporting for duty 🦞🔬&lt;a href="https://twitter.com/hashtag/AIforScience?src=hash&amp;amp;ref_src=twsrc%5Etfw"&gt;#AIforScience&lt;/a&gt; &lt;a href="https://twitter.com/hashtag/OpenClaw?src=hash&amp;amp;ref_src=twsrc%5Etfw"&gt;#OpenClaw&lt;/a&gt; &lt;a href="https://t.co/lIpWVbuLO2"&gt;pic.twitter.com/lIpWVbuLO2&lt;/a&gt;&lt;/p&gt;&amp;mdash; AI4Science Catalyst (@AI4S_Catalyst) &lt;a href="https://twitter.com/AI4S_Catalyst/status/2031528955472392301?ref_src=twsrc%5Etfw"&gt;March 11, 2026&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src="https://platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;
&lt;p&gt;A model that extracts invoice line items is not valuable because the demo looks good. It becomes valuable when it is wired into authentication, field mapping, exception handling, observability, and the ERP system that actually decides what gets paid.&lt;/p&gt;
&lt;p&gt;The same is true in higher-stakes knowledge work. If a model helps parse complex source material, somebody still has to connect that output to the domain workflow that consumes it, the reviewer who checks edge cases, and the system of record that people trust.&lt;/p&gt;
&lt;p&gt;I keep meeting practitioners
who spend much of their time on exactly this layer: connectors, orchestration, brittle upstream systems, and all the operational glue between a promising model and a useful system.&lt;/p&gt;
&lt;p&gt;This is why so many AI pilots stall between demo and deployment.
The missing piece is usually not intelligence. It is the connection layer: the integration work nobody sees in the benchmark, plus the organizational work required to get security, operations, and the budget owner to trust the system enough to use it.&lt;/p&gt;
&lt;h2 id="context"&gt;Context&lt;/h2&gt;
&lt;p&gt;Context is not just &amp;ldquo;give the model more tokens.&amp;rdquo;
The deeper problem is making an organization&amp;rsquo;s information legible enough that a model and the surrounding software can do something reliable with it.&lt;/p&gt;
&lt;p&gt;Most AI demos assume clean data and clear semantics.
Real organizations have PDFs, half-documented schemas, conflicting definitions, stale wikis, brittle jobs, and critical knowledge trapped in a few people&amp;rsquo;s heads.
The gap is not only retrieval. It is structure: lineage, dependencies, freshness, provenance, and the ability to explore safely.&lt;/p&gt;
&lt;p&gt;That makes &lt;code&gt;Context&lt;/code&gt; one of the most interesting infrastructure layers right now. I think Bauplan and Metaxy are a particularly good pair of examples because they attack different parts of the same problem.&lt;/p&gt;
&lt;p&gt;Bauplan is interesting because it treats exploration as a first-class concern. Its paper on supervaluationism for lakehouses points toward a world where agents can reason across branching or inconsistent data states instead of requiring one prematurely canonical view. That matters when you are still testing hypotheses and do not yet want the cost or rigidity of collapsing everything into one official version.&lt;/p&gt;
&lt;p&gt;Metaxy picks up the problem once change becomes operational. When a dataset changes, when a prompt changes, or when a model changes, you want to know exactly which downstream artifacts are affected and which ones are not. Without that, experimentation is slow and expensive: you rerun too much, you trust stale outputs, or you avoid testing ideas because the feedback loop is too costly.&lt;/p&gt;
&lt;p&gt;Metaxy&amp;rsquo;s lineage model attacks that directly. If you can trace the downstream impact of a change precisely, you can recompute only the affected slice instead of retraining or reprocessing everything. That is a cost win, but it is also a context win: it makes the system easier to reason about, easier to debug, and safer to evolve.&lt;/p&gt;
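&lt;p&gt;A toy version of that lineage query, with a made-up graph (this is not Metaxy&amp;rsquo;s API, only the idea of computing the affected slice):&lt;/p&gt;

```python
# Toy lineage graph: edges point from an artifact to artifacts derived from it.
# Names and topology are invented for illustration.
from collections import deque

DOWNSTREAM = {
    "raw_docs": ["parsed_docs"],
    "parsed_docs": ["embeddings", "entity_table"],
    "embeddings": ["search_index"],
    "entity_table": ["report"],
    "prompt_v2": ["entity_table"],
}

def affected(changed):
    """Return everything transitively downstream of a changed artifact."""
    seen, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for child in DOWNSTREAM.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen
```

&lt;p&gt;Changing &lt;code&gt;prompt_v2&lt;/code&gt; touches only &lt;code&gt;entity_table&lt;/code&gt; and &lt;code&gt;report&lt;/code&gt;; the search index is outside the affected slice, so it is never recomputed.&lt;/p&gt;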
&lt;p&gt;Put differently: Bauplan makes exploration across uncertain states cheaper; Metaxy makes propagation of concrete changes cheaper. Together they point to a stronger version of context than &amp;ldquo;just retrieve more text.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Another way to say the same thing: making dependencies explicit is itself a form of context engineering.
Most organizations still have jobs that silently depend on other jobs, tables that break when upstream schemas change, and freshness assumptions that live in somebody&amp;rsquo;s head.&lt;/p&gt;
&lt;p&gt;That is why systems like Dagster
matter here. When dependencies become first-class, operational knowledge stops living only in tribal memory and starts living in the graph.
A concrete example of that pattern: guardrails around IO, experimentation, and security emerged from the structure of the graph rather than from a checklist somebody had to remember.&lt;/p&gt;
&lt;p&gt;So when I say &lt;code&gt;Context&lt;/code&gt;, I do not just mean better retrieval.
I mean making the surrounding data system explicit enough that both humans and machines can ask, &amp;ldquo;What does this depend on?&amp;rdquo;, &amp;ldquo;What changed?&amp;rdquo;, and &amp;ldquo;What needs to be rerun?&amp;rdquo;
That is the layer where systems like Bauplan and Metaxy can become foundational. It is also where &lt;code&gt;Context&lt;/code&gt; starts reinforcing &lt;code&gt;Control&lt;/code&gt;: better provenance and cheaper exploration make the review loop cheaper to run.&lt;/p&gt;
&lt;h2 id="control"&gt;Control&lt;/h2&gt;
&lt;p&gt;Even with strong connection and good context, a system still fails if nobody can tell when the model is wrong.&lt;/p&gt;
&lt;p&gt;That is not a theoretical problem. In February 2026, reporting on AWS&amp;rsquo;s &lt;code&gt;Kiro&lt;/code&gt; coding tool described a 13-hour disruption affecting AWS Cost Explorer in one China region after the tool was allowed to make changes and reportedly chose to &amp;ldquo;delete and recreate&amp;rdquo; part of its environment. Amazon disputed the characterization and said the underlying issue was misconfigured access controls rather than an AI judgment failure. Either way, the lesson is the same: once an agent can act on production systems, the surrounding controls matter more than the demo.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Control&lt;/code&gt; is about the surrounding mechanisms that make model behavior safe to use.
Not micromanaging the model, but building validation, traceability, escalation paths, human review, and feedback loops that improve the system over time.&lt;/p&gt;
&lt;p&gt;This is where Jubust is a strong example.
What makes it interesting is not just that it uses AI on pharma patents. The interesting part is the operating model around the model: structured extraction, traceability back to source material, review of uncertain cases, and a path for reviewed corrections to improve future performance.
That is what turns model output into something a serious domain can actually use.&lt;/p&gt;
&lt;p&gt;That pattern is much more important than the buzzword layer around it.
In many real systems, uncertainty is not handled by one magical confidence score. It is estimated through a mix of model signals, validators, disagreement checks, heuristics, and domain-specific rules. The point is not the exact mechanism. The point is that uncertain outputs do not silently flow into downstream decisions.&lt;/p&gt;
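&lt;p&gt;A deliberately simple sketch of that routing, with invented signal names and an assumed unanimity rule:&lt;/p&gt;

```python
# Illustrative review routing: combine several weak signals rather than
# trusting a single confidence score. Signal names are assumptions.

def needs_review(signals):
    """signals: dict of independent boolean checks on one model output."""
    checks = [
        signals.get("schema_valid", False),         # validator passed
        signals.get("two_runs_agree", False),       # disagreement check between samples
        signals.get("passes_domain_rules", False),  # e.g. totals add up, dates parse
    ]
    # Anything short of unanimous agreement goes to a human reviewer.
    return not all(checks)
```

&lt;p&gt;The exact signals and threshold would differ per domain; the property that matters is that uncertain outputs are routed to review instead of flowing onward by default.&lt;/p&gt;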
&lt;p&gt;That is what makes human-in-the-loop systems economically and operationally viable.
The review step is not just a tax. Done well, it is a bootstrapping mechanism. You catch the cases that matter, create an audit trail, and turn corrections into training data or product improvements. Over time, the review burden can shrink because the system gets better on the cases that used to fail.&lt;/p&gt;
&lt;p&gt;This is also why vertical AI often looks stronger than generic &amp;ldquo;AI for everything&amp;rdquo; products.
In a domain with clear stakes, clear operators, and clear exception paths, Control can be designed into the product. In a vague horizontal workflow, trust usually breaks first.&lt;/p&gt;
&lt;p&gt;The EU AI Act pushes in the same direction for high-risk systems with its human oversight requirements, but regulation is only part of the story. Even where oversight is not mandated, unreliable output destroys trust faster than any product team can rebuild it.&lt;/p&gt;
&lt;h2 id="why-the-3cs-matter"&gt;Why the 3Cs matter&lt;/h2&gt;
&lt;p&gt;The easiest mistake in AI discourse is to assume that once the base models get better, this surrounding work disappears.
I do not think that is what happens.&lt;/p&gt;
&lt;p&gt;As models improve, some existing control points will shrink and some kinds of integration will get easier. But every capability jump also opens a new application surface, and that new surface brings fresh connection work, fresh context problems, and fresh control requirements.&lt;/p&gt;
&lt;p&gt;The 3Cs are useful because they tell you where the durable work sits after the demo, and where defensible product value can still be built even as base models improve.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Connection&lt;/code&gt; gets the model into the real system.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Context&lt;/code&gt; gives it something coherent to operate on.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Control&lt;/code&gt; makes the result trustworthy enough to use.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If I had to point to the kinds of products that fit this thesis, I would point to Metaxy on the context side and Jubust on the control side.
Neither treats the model as the whole product.
Metaxy makes change legible; Jubust makes uncertainty operable.
They live in the gap between raw model capability and operational usefulness, and that gap is where a large share of the next AI infrastructure wave will be built.&lt;/p&gt;
&lt;h2 id="references"&gt;References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Rich Sutton, &lt;em&gt;The Bitter Lesson&lt;/em&gt; (2019)&lt;/li&gt;
&lt;li&gt;Linear Digressions podcast&lt;/li&gt;
&lt;li&gt;Deloitte, generative AI adoption reporting (2025)&lt;/li&gt;
&lt;li&gt;Menlo Ventures, enterprise AI adoption research (2025)&lt;/li&gt;
&lt;li&gt;Dagster — software-defined assets with dependency-first orchestration&lt;/li&gt;
&lt;li&gt;Bauplan — supervaluationism for lakehouses&lt;/li&gt;
&lt;li&gt;Metaxy — lineage-aware data and compute workflows&lt;/li&gt;
&lt;li&gt;Jubust — structured extraction with traceability and review loops&lt;/li&gt;
&lt;li&gt;EU AI Act, human oversight requirements for high-risk systems&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Stacked PRs and AI Worktrees</title><link>https://georgheiler.com/2026/03/17/stacked-prs-and-ai-worktrees/</link><pubDate>Tue, 17 Mar 2026 00:00:00 +0000</pubDate><guid>https://georgheiler.com/2026/03/17/stacked-prs-and-ai-worktrees/</guid><description>&lt;p&gt;AI coding agents can shorten implementation time. They do not reduce the need for code review, branch management, or merge coordination. In practice, those activities often become the limiting factor once agents can produce changes quickly.&lt;/p&gt;
&lt;p&gt;Without a clear workflow, parallel sessions tend to create large diffs, overlapping branches, and avoidable merge conflicts. The problem is usually not model capability. It is the absence of a structure that keeps work small, traceable, and reviewable.&lt;/p&gt;
&lt;h2 id="review-remains-the-constraint"&gt;Review remains the constraint&lt;/h2&gt;
&lt;p&gt;Code review still determines how quickly code reaches the main branch. When a pull request grows while earlier feedback is pending, the reviewer receives a large mixed diff instead of one coherent change. That slows review and increases the chance that defects or unintended side effects are missed.&lt;/p&gt;
&lt;p&gt;AI-assisted development amplifies that pattern. A model can generate a substantial amount of code in a short period, which makes it easier to accumulate work faster than a team can review it. If teams keep using a single long-lived branch, output increases but review quality often declines.&lt;/p&gt;
&lt;p&gt;That pressure is now visible at the platform level as well. On January 27, 2026, GitHub opened a community discussion and said maintainers were spending substantial time reviewing submissions that often failed project guidelines, were abandoned, and were often AI-generated. Among the options GitHub said it was exploring were collaborator-only pull requests and disabling pull requests for specific repository use cases, which The Register described as GitHub pondering a PR &amp;ldquo;kill switch.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;The practical requirement is straightforward: keep units of change small enough that they can be reviewed independently.&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Monolithic PR vs. stacked PRs: one large branch compared with a sequence of smaller, reviewable pull requests."
src="https://georgheiler.com/2026/03/17/stacked-prs-and-ai-worktrees/stacked-prs.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h2 id="stacked-prs-reduce-review-scope"&gt;Stacked PRs reduce review scope&lt;/h2&gt;
&lt;p&gt;Stacked pull requests
address that requirement by organizing work as a sequence of dependent branches rather than one large feature branch. A lower branch might introduce a schema change, the next branch might add API logic, and a later branch might update the user interface. Each pull request can then be reviewed on its own terms.&lt;/p&gt;
&lt;p&gt;This offers two practical advantages. Reviewers see a narrower diff, and authors can continue working while earlier pull requests are under review. Once the lower branch is merged, the next pull request can be retargeted or restacked onto the main branch.&lt;/p&gt;
&lt;p&gt;The trade-off is operational overhead. Managing dependent branches manually requires regular rebasing, careful branch targeting, and cleanup after merges. For stacked PRs to work consistently, the workflow needs tooling support.&lt;/p&gt;
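&lt;p&gt;Under the hood, a stack is just a chain of dependent Git branches. A sketch with plain git (branch names are illustrative) shows the structure that stacking tools then automate:&lt;/p&gt;

```shell
# Build a three-branch stack with plain git; stacking tools automate the
# rebasing and retargeting that this layout later requires.
set -e
git init -q stack-demo
cd stack-demo
git -c user.email=demo@example.com -c user.name=demo commit -q --allow-empty -m "init"

git checkout -q -b schema-change            # bottom of the stack
git -c user.email=demo@example.com -c user.name=demo commit -q --allow-empty -m "add schema"

git checkout -q -b api-logic                # stacked on schema-change
git -c user.email=demo@example.com -c user.name=demo commit -q --allow-empty -m "add API logic"

git checkout -q -b ui-update                # stacked on api-logic
git -c user.email=demo@example.com -c user.name=demo commit -q --allow-empty -m "update UI"

# Each PR targets the branch directly below it, so reviewers see one narrow diff.
git log --oneline schema-change..ui-update
```

&lt;p&gt;Once &lt;code&gt;schema-change&lt;/code&gt; merges, &lt;code&gt;api-logic&lt;/code&gt; must be retargeted and restacked onto the main branch; doing that by hand for every merge is exactly the overhead described above.&lt;/p&gt;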
&lt;h2 id="stax-as-a-supporting-tool"&gt;stax as a supporting tool&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;stax&lt;/code&gt;
is a Rust CLI for working with stacked branches and GitHub pull requests. It supports creating and submitting stacks, syncing and restacking branches, merging from the bottom of a stack, and recovering from risky history edits with &lt;code&gt;st undo&lt;/code&gt; and &lt;code&gt;st redo&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;For AI-assisted workflows, several functions are especially relevant:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;st resolve&lt;/code&gt; can assist with rebase conflicts by sending only the currently conflicted files to a configured AI agent.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;st generate --pr-body&lt;/code&gt; can draft pull request text from the branch diff.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;st submit --squash&lt;/code&gt; allows commit-heavy local iteration while still publishing a clean, one-commit pull request per branch.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;st standup --summary&lt;/code&gt; can generate a short status summary across recent work.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;st absorb&lt;/code&gt; helps route staged fixes back into the correct branch when review feedback spans multiple layers of a stack.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;st worktree&lt;/code&gt; / &lt;code&gt;st wt&lt;/code&gt; manage parallel worktrees that remain part of the normal stack model.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These features do not replace review or Git fundamentals. They reduce the overhead of keeping many small branches in order.&lt;/p&gt;
&lt;h2 id="agent-worktree-lanes"&gt;Agent worktree lanes&lt;/h2&gt;
&lt;p&gt;A useful part of the workflow is the worktree lane feature. Each lane is a separate Git worktree with its own tracked branch. That makes it possible to run multiple AI-assisted sessions in parallel without sharing a working directory or mixing unrelated edits.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Create a lane for a test stability task&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;st wt c fix-tests --agent claude -- &lt;span class="s2"&gt;&amp;#34;fix the flaky integration tests in tests/api/&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Create a second lane for a feature task&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;st wt c add-export --agent codex -- &lt;span class="s2"&gt;&amp;#34;add CSV export to the /reports endpoint&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Inspect active lanes and their branches&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;st wt ll
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;st ls
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;In practice, each lane:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;runs in its own worktree, which reduces interference between concurrent sessions,&lt;/li&gt;
&lt;li&gt;uses a normal branch that appears in &lt;code&gt;st ls&lt;/code&gt; and can be restacked or submitted like other stack entries,&lt;/li&gt;
&lt;li&gt;can be re-entered later with &lt;code&gt;st wt go&lt;/code&gt;,&lt;/li&gt;
&lt;li&gt;can be combined with tmux for persistent long-running sessions.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="AI worktree lanes: multiple isolated AI-assisted sessions managed as tracked branches within one stacked workflow."
src="https://georgheiler.com/2026/03/17/stacked-prs-and-ai-worktrees/ai-worktree-lanes.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;A typical loop looks like this:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create lanes for independent tasks.&lt;/li&gt;
&lt;li&gt;Re-enter a lane later to inspect progress or continue work.&lt;/li&gt;
&lt;li&gt;Sync or restack lanes when the trunk branch moves, for example with &lt;code&gt;st wt rs&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Submit the resulting branches as pull requests.&lt;/li&gt;
&lt;li&gt;Remove the lane when the work is merged.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="roborev-as-a-review-companion"&gt;roborev as a review companion&lt;/h2&gt;
&lt;p&gt;roborev
works at commit time: install a post-commit hook, commit frequently, and let it review each new commit in the background. Stacked PR workflows operate at a different level. In a stack, the unit that reviewers ultimately merge is the branch or pull request, not the individual commit.&lt;/p&gt;
&lt;p&gt;That difference used to make the pairing feel awkward. roborev tends to encourage many small commits during the fix-and-review loop, while stacked PR workflows usually want a clean branch boundary. With the merged &lt;code&gt;stax&lt;/code&gt; integration work, that boundary is now explicit instead of improvised.&lt;/p&gt;
&lt;p&gt;In practice, the combined workflow is straightforward:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;let roborev review incremental local commits while the branch is still in progress,&lt;/li&gt;
&lt;li&gt;publish the stack with &lt;code&gt;st submit --squash&lt;/code&gt; so each branch is presented upstream as one clean commit,&lt;/li&gt;
&lt;li&gt;use &lt;code&gt;st modify&lt;/code&gt; when you want review-driven edits folded back into the current branch before submission,&lt;/li&gt;
&lt;li&gt;use &lt;code&gt;st absorb&lt;/code&gt; when staged fixes belong to different branches in the stack.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The result is a better division of labor. roborev provides fast feedback during local iteration. &lt;code&gt;stax&lt;/code&gt; remains responsible for branch structure, restacking, and presenting reviewable pull requests at the right granularity. The tools still optimize for different layers of the workflow, but they now integrate cleanly enough to use together on the same stack.&lt;/p&gt;
&lt;h2 id="why-this-matters-for-ai-assisted-development"&gt;Why this matters for AI-assisted development&lt;/h2&gt;
&lt;p&gt;The main effect of AI coding tools is higher output volume. That makes branch structure and review discipline more important, not less. Teams still need a way to isolate changes, understand dependencies, and integrate work without turning every update into a large pull request.&lt;/p&gt;
&lt;p&gt;Stacked PRs address review scope. Worktree lanes address parallelism and isolation. Used together, they provide a practical way to keep AI-assisted development compatible with normal engineering controls: smaller diffs, clearer sequencing, and more predictable merges.&lt;/p&gt;
&lt;p&gt;This is less about any single tool than about workflow design. If AI-generated changes are going to be useful in a team setting, they need to arrive in a form that other engineers can review and merge without unnecessary friction.&lt;/p&gt;
&lt;h2 id="references"&gt;References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Stacked pull requests — the stacked PRs workflow pattern&lt;/li&gt;
&lt;li&gt;stax — Rust CLI for stacked branches with AI agent support&lt;/li&gt;
&lt;li&gt;stax worktree documentation — parallel AI coding sessions&lt;/li&gt;
&lt;li&gt;stax integration work — adds &lt;code&gt;st submit --squash&lt;/code&gt; and documents roborev integration patterns&lt;/li&gt;
&lt;li&gt;roborev — commit-based continuous review for coding agents&lt;/li&gt;
&lt;li&gt;another stacking tool (CLI + web UI)&lt;/li&gt;
&lt;li&gt;GitHub community discussion opened on January 27, 2026&lt;/li&gt;
&lt;li&gt;The Register coverage of GitHub&amp;rsquo;s discussion on low-quality, often AI-generated pull requests&lt;/li&gt;
&lt;/ul&gt;</description></item></channel></rss>