The promise of converting user intent into code is compelling.
Instead of manually building rigid integrations for every possible user intent, you give the LLM a sandbox and a simple instruction: "If you can write the code to solve this, I will run it."
This shifts the complexity of the world from the platform (which no longer needs 1,000 distinct buttons) to the model (which must now act as a dynamic systems engineer).
References to this pattern are popping up everywhere: "Code Mode: the better way to use MCP" from Cloudflare, "Code execution with MCP: Building more efficient agents" from Anthropic, and "Code execution with MCP" from Simon Willison.
Take a simple user request: "Email me a summary of my calendar for the next week."
```ts
// The Dream:
const events = await calendar.getEvents({ range: "next_week" });
const summary = await llm.summarize(events);
await email.send({ to: "me", subject: "Weekly Brief", body: summary });
```
It sounds simple, but underneath lies a minefield of unknowns. What service do you use for your calendar? What format are the events in? What email service do you use? When you try to bridge this gap between vague intent and concrete execution, the illusion often collapses. The model guesses an API that doesn't exist, assumes a data shape that's wrong, or fails to handle authentication nuances.
This is the gap between execution as a demo and execution as a reliable platform mechanism.
This post documents what we learned trying to bridge that gap, and the four layers of determinism we found necessary to make it work in production: Schema Discovery, Idempotent Execution, Runtime Self-Healing, and Type Coercion.
Non-Goal: We are not discussing how to build a sandbox for a hardcoded set of APIs (like a "GitHub Integration"). We are exploring how to build a universal engine for any schema-based integration, whether it's MCP, GraphQL, OpenAPI, or even REST (guided by documentation), without manual plumbing.
The Core Challenge
Our central thesis is simple: You cannot get reliable execution from an LLM by asking nicer.
Prompt engineering is not a type system. To trust AI-generated code in production, we have to surround the stochastic generator with deterministic guardrails. We must rectify, filter, and amplify the messy signal from that stochastic source into something valid.
Assume a tool-calling model with bounded side effects. Our goal is to let a user say "Find my churning users and email them a discount," and have the system execute that request.
The Integration Trap
Why build this engine? Why not write distinct wrappers for every tool?
Because the real world is the Messy Middle.
- Manual Integrations: You install the SDK for Jira. Then Slack. Then Salesforce.
- Trade-off: Linear Maintenance Limit. This is the "N+1 Problem." Even if you don't write the code yourself, identifying, installing, testing, and keeping 500 different libraries up-to-date is operationally impossible.
- Risk: Attack Surface. Every new SDK is a new vector for prompt injection and malicious code. A sandbox with 10,000 libraries is 10,000 times harder to secure. Simpler is better.
- Strict Schemas: You rely on perfect OpenAPI/GraphQL specs.
- Trade-off: Limited Coverage. This ignores the ecosystem of untyped REST APIs and dynamic MCP tools.
Our system flips this trade-off. We exchange Linear Human Effort (writing integrations) for Constant System Complexity (the engine described below). We accept the pain of building a complex runtime so that adding the 10,001st tool is instant.
The Minimal Example
Returning to our calendar request: "Send me a summary of my calendar for the next week."
In a naive eval() loop, the model might confidently hallucinate a bespoke Google API wrapper:
```ts
// Hallucinated code
const events = await google.calendar.getEvents({
  timeMin: "now",
  timeMax: "7d",
});
```
This fails if the google object isn't in scope, if the method is actually listEvents, or if the API returns a paginated object { "items": [...] } when the code expects a flat array.
Evidence: In our initial tests, composed requests like "Search for weather in Seattle and email it" had a <20% success rate using two separate MCP servers. The model consistently hallucinated the shape of the weather API response (e.g. guessing weather.result.temperature instead of weather.temperature), causing the email step to crash.
The 4 Layers of Determinism
To solve this, we moved away from "run the code" to a system built on four interlocking layers of determinism. These are not a linear pipeline; they form a self-healing loop designed to impose order on the chaos.
1. Pre-computation: The Contract
If you give an LLM a vague description of an API, it will invent a plausible (but wrong) interface.
```ts
// Without a schema, the model guesses:
await linear.createIssue({
  title: "Fix bug",
  priority: "high",
});
```
This crashes at runtime because the actual API requires priority to be a number (1-5), not a string.
Our Solution: We generate rigorous TypeScript definitions (d.ts) for input schemas.
The system searches its environment (or the web) for available integrations and inspects their capabilities. Whether it's a local MCP server, a public OpenAPI spec, or a GraphQL endpoint, we ingest the schema and turn it into a strict TypeScript interface.
Output schemas are often optional, variable, or entirely unknown. We explicitly avoid a "discovery" step (calling the API to see what it returns), because executing speculative code has side effects: we don't want to call createInvoice() just to see what the invoice ID looks like. Instead, we initially compile these return types as any.
This seems counter-intuitive. Aren't we trying to prevent runtime errors? Yes, but we prefer a controlled crash later (which we can learn from) over a blind guess now. We let the model write code against this any type, knowing that if it assumes a structure that isn't there, we'll catch it in the execution loop.
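To make the contract concrete, here is a sketch of what a generated definition might look like for a Linear-style issue tracker. The field names and shapes are illustrative, not Linear's actual schema:

```ts
// linear.d.ts (illustrative sketch; field names are examples, not the real Linear API)

// Input schemas are compiled into strict types the model must satisfy.
interface CreateIssueInput {
  title: string;
  /** Priority is a number from 1-5, not a string like "high". */
  priority: 1 | 2 | 3 | 4 | 5;
  teamId: string;
}

declare const linear: {
  // The return type starts as `any`: we refuse to execute speculative calls
  // just to discover the response shape, so it stays unknown until a real run.
  createIssue(input: CreateIssueInput): Promise<any>;
};
```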
2. Execution: The "Time Loop"
Coding is an iterative loop. You write, run, fail, edit, run again.
But when your code has side effects (like "Create Invoice"), you cannot re-run the whole script every time you fix a syntax error on line 50. You'd end up with 50 duplicate invoices.
Our Solution: Deterministic caching of service calls.
We wrap every external interaction in a smart proxy. We cache based on both the inputs and the code history leading up to the call.
This enables a powerful iteration strategy. We bias the model to edit code after the last successful line. Since the preceding code remains constant, our cache guarantees that we replay the previous side effects (like "Create Invoice") without repeating them.
This effectively gives us "Groundhog Day" debugging. We can replay the exact same morning 100 times to fix a bug in the data processing logic without ever acting on the external world again.
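Conceptually, the proxy looks something like this minimal sketch (the function names and details are illustrative, not our actual implementation): each call is keyed on the tool, its arguments, and a hash of the code executed so far, so replaying an unchanged prefix never repeats a real side effect.

```ts
import { createHash } from "node:crypto";

// Replay cache for side-effecting service calls (illustrative sketch).
const cache = new Map<string, unknown>();

function cacheKey(tool: string, args: unknown, codeHistory: string): string {
  // Key on the inputs AND the code history leading up to the call.
  return createHash("sha256")
    .update(JSON.stringify({ tool, args, codeHistory }))
    .digest("hex");
}

async function cachedCall<T>(
  tool: string,
  args: unknown,
  codeHistory: string,
  execute: () => Promise<T>, // the real, side-effecting call
): Promise<T> {
  const key = cacheKey(tool, args, codeHistory);
  if (cache.has(key)) return cache.get(key) as T; // replay: no second invoice
  const result = await execute();
  cache.set(key, result);
  return result;
}
```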
3. Self-Healing
The documentation is always wrong. Or missing entirely.
We often start with any for output types, or we might have a schema that claims to return CertifiedUser[] but actually returns null.
In a traditional script, this throws TypeError: Cannot read property 'id' of null (or undefined) and dies.
```ts
// definition.d.ts
interface User {
  id: string;
}
declare const tool: {
  getUsers(): Promise<User[]>;
};

// The model writes:
const users = await tool.getUsers();
users.map((u) => u.id);
```
If the API violates the contract and returns null instead of [], the runtime throws Uncaught TypeError: Cannot read properties of null.
Our Solution: We use runtime failure as a learning signal.
When the sandbox crashes with a type error, or when we detect a mismatch between the assumed type and the actual value, we inspect the actual value that caused the crash.
- Capture: "Expected
User, gotnull" (or "Inferred output type isX"). - Patch: Update the generated
d.tsto reflect reality (replacinganyor the incorrect type). - Regenerate: Ask the model to fix the code given the new, truthful type definition.
- Retry: Run the code again.
This loop turns "drift" into "calibration." We persist these learned definitions.
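In rough TypeScript, that loop could look like the following sketch. Every helper name here is a placeholder for illustration, not a real API:

```ts
// Sketch of the self-healing loop. All helpers below are placeholders.
type RunResult =
  | { ok: true; value: unknown }
  | { ok: false; error: string; failedCall: string; offendingValue: unknown };

declare function generateCode(task: string, typeDefs: string, hint?: { previousError: string }): Promise<string>;
declare function runInSandbox(code: string): Promise<RunResult>;
declare function inferTypeFromValue(value: unknown): string;
declare function patchDefinitions(defs: string, call: string, observedType: string): string;
declare function loadLearnedDefinitions(task: string): Promise<string>;
declare function saveLearnedDefinitions(task: string, defs: string): Promise<void>;

async function runWithHealing(task: string, maxAttempts = 3): Promise<unknown> {
  let typeDefs = await loadLearnedDefinitions(task); // persisted d.ts, may still contain `any`
  let code = await generateCode(task, typeDefs);

  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const result = await runInSandbox(code);
    if (result.ok) return result.value;

    // Capture: inspect the actual value that violated the assumed type.
    const observed = inferTypeFromValue(result.offendingValue);

    // Patch: rewrite the d.ts to reflect reality, and persist what we learned.
    typeDefs = patchDefinitions(typeDefs, result.failedCall, observed);
    await saveLearnedDefinitions(task, typeDefs);

    // Regenerate: ask the model for a fix against the truthful types, then retry.
    code = await generateCode(task, typeDefs, { previousError: result.error });
  }
  throw new Error("Could not self-heal within the attempt budget");
}
```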
But even a "learned" schema isn't perfect. It represents an inference based on the last successful call. We might have an incomplete schema, or dynamic APIs might return slightly different shapes next time. This residual uncertainty is why we need one final safety net.
4. The Glue: Inline Coercion
Sometimes, the type mismatch is trivial but blocking. The API returns "100" (string), but your math library demands 100 (number).
But sometimes, the gap is too wide for a schema fix. A weather API might return "The weather in SF is sunny" when your code expects { "precipitation": 0 }.
No amount of schema patching will make that sentence a number. This is fundamental non-determinism.
```ts
// definition.d.ts
interface Weather {
  precipitation: number;
}

// The model expects a number:
if (weather.precipitation > 0.5) { ... }
```
If the API returns a string like "It's barely drizzling", this results in a NaN comparison or a schema validation failure.
Our Solution: A "Type Coercion" layer. We use a small, fast model (or even heuristic logic) to intercept these near-misses at runtime. If a function demands structure A and gets structure B, and they are semantically identical, we coercively map them on the fly.
Security: Users Bring Chaos
We have built layers to handle the unpredictability of external APIs, but we must also handle the unpredictability of the user. If you give an LLM a shell, someone will try to read /etc/passwd.
We use a Sibling Sandbox architecture, wrapped in static analysis.
- Static Analysis: We lint out common attack patterns after generation. If the code resembles a known exploit or tries to access forbidden globals, the system rejects it before execution (a simplified sketch follows this list).
- The Worker: A locked-down Deno isolate. It has zero IO capabilities: no fetch, no filesystem, no environment access. It can run pure computation and call the specific functions we inject.
- The Supervisor: A privileged process that holds the API keys and manages the network.
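The static analysis pass doesn't try to be clever. A regex-based sketch of the idea follows; a real implementation would walk the AST, and this rule set is illustrative rather than exhaustive.

```ts
// Simplified pre-execution lint (illustrative sketch).
const FORBIDDEN_PATTERNS: RegExp[] = [
  /\beval\s*\(/,                 // dynamic evaluation
  /\bnew\s+Function\s*\(/,       // Function constructor
  /\bDeno\.(env|readTextFile|writeTextFile|run|Command)\b/, // host escape hatches
  /\bfetch\s*\(/,                // raw network access; only the injected proxy may talk out
  /\bprocess\.env\b/,            // environment access
];

function rejectSuspiciousCode(source: string): void {
  for (const pattern of FORBIDDEN_PATTERNS) {
    if (pattern.test(source)) {
      throw new Error(`Rejected before execution: source matched ${pattern}`);
    }
  }
}
```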
The Worker isn't even aware of the network. It sees a dumb interface:
```ts
const github = await getMcpClient("github");
const issues = await github.toolCall("listIssues", args);
```
When called, this function proxies a message to the Supervisor, which validates the request, injects secrets (never exposing them to the worker), performs the physical HTTP call, and returns a sanitized result.
Note: We check even the response data. We scan for patterns that look like keys or secrets before passing data back to the Worker. This isn't a guarantee (security is an arms race), but it serves as a final layer of defense to limit the risk of credentials leaking into the untrusted sandbox.
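To make the boundary concrete, here is a sketch of what the Supervisor's handler could look like. All helper names (assertAllowed, redactSecretsLike, and so on) are hypothetical stand-ins, not our actual code:

```ts
// Sketch of the Supervisor's request handler (illustrative).
interface WorkerRequest {
  server: string; // e.g. "github"
  tool: string;   // e.g. "listIssues"
  args: unknown;
}

declare function assertAllowed(server: string, tool: string): void;
declare function validateAgainstSchema(server: string, tool: string, args: unknown): unknown;
declare function callUpstream(server: string, tool: string, args: unknown, credentials: string): Promise<unknown>;
declare function redactSecretsLike(value: unknown): unknown;
declare const secretStore: { get(server: string): string };

async function handleWorkerRequest(req: WorkerRequest): Promise<unknown> {
  // 1. Validate: is this server/tool allowed for the session, and do the args fit its schema?
  assertAllowed(req.server, req.tool);
  const args = validateAgainstSchema(req.server, req.tool, req.args);

  // 2. Secrets stay on the privileged side; the Worker never sees them.
  const credentials = secretStore.get(req.server);

  // 3. Perform the physical HTTP call to the upstream service.
  const response = await callUpstream(req.server, req.tool, args, credentials);

  // 4. Scan the response for anything that looks like a key before it crosses
  //    back into the untrusted sandbox.
  return redactSecretsLike(response);
}
```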
Conclusion
We are moving from a world of "Stateless Chat" to "Stateful Work."
To make that transition, we have to stop treating Code Generation as a creative writing task and start treating it as a fault-tolerant systems problem. By stacking layers of determinism (static types, caching, runtime healing, and coercion), we can build a system that survives the chaos of the real world.
We don't know yet how this scales to 10,000-line programs. But for the glue code that powers most integrations, it's a solid foundation.
We’ve been building this engine at Subroutine. If you’re solving similar problems in production, we’d love to compare notes.