Part III · Chapter 9 of 18

Tool Engines

The convention every Microapp follows so it works on the web AND for agents from the same source. One file, one function, two surfaces.

11 min read

The first time I noticed the math was drifting was on a percentage calculator.

The widget on the page returned "15% of 200 is 30." The API endpoint I'd added for agents returned 30.00. The Custom GPT spec said "answer: 30". All correct, all subtly different — and three implementations of the same formula, each one drifting by a small detail that mattered to a different caller. Nobody had set out to write the same calculation three times. It happened because each surface had its own developer urgency and a copy of the math was the obvious shortcut at each step.

Two implementations of "what's 15% of 200" drift in their tenth pull request. Three drift faster. By the hundredth Microapp, every surface is computing a slightly different answer, and the question "which one is correct?" has no defensible answer because the source of truth is wherever the last bug got fixed.

The math lives in exactly one place.

We call that one place a tool engine. Every Microapp has exactly one engine. The widget that draws the tool on the page calls the engine. The REST endpoint that exposes it to agents calls the engine. The MCP catalogue calls the engine. The OpenAPI spec describes the engine. Same code, same answer, every surface, every time. Anything else is clothes the engine wears in different rooms.

The transferable why: when the same logic gets implemented in two places, the two copies will drift — not because the team is sloppy, but because the change pressure on each copy is independent. Make the logic live in one place and let every consumer call it. The cost is one extra level of indirection per call; the benefit is that the system has a single source of truth for what its outputs mean.

Locked 2026-05-10 · src/lib/engines/<slug>/index.ts · one engine per Microapp

The second decision was the one that made the engine usable by agents on its first call.

The temptation in early engines was to treat them like web handlers: read the request, talk to a database, hit a third-party API for live data, write a log line, return a response. Convenient. Also non-deterministic. The agent calling the engine has no way to know whether the answer it got is the answer it would get if it called the engine again with the same inputs — and if not, replay is impossible, caching is impossible, testing is brittle.

The rule that survives the failure modes is the strictest one: the engine is a pure function. Inputs go in, outputs come out, nothing else happens. No network calls. No file reads. No clock-checking. No randomness without a seed parameter the caller provides. If the engine needs external data — today's currency rate, a time-zone offset — the caller passes it in. The engine never fetches.

Same inputs in, same outputs out. Every time. Forever.

The benefit lands in the agent layer. An agent that calls the engine with bad inputs can replay with corrected inputs and trust the only thing that changed is the inputs. The CDN at the edge can cache aggressively on the input hash and never serve a stale answer, because the answer can't be stale — it's a property of the inputs. The unit tests are the simplest tests in the codebase because there's nothing to mock.

The transferable why: determinism is a property worth paying for. Most systems trade it for convenience early — a clock read here, a fetch there, a random seed nobody documented — and pay for the convenience forever in flaky tests, surprise behavior, and "why did this work yesterday" tickets. Pay the pure-function cost up front; the system gets simpler in every direction afterward.

Locked 2026-05-10 · pure-function engines · no side effects, ever

The third decision was the one that turned out to matter most for agents and almost not at all for humans.

An engine could return "15% of 200 is 30" and ship to the website unchanged — humans read prose just fine. The problem is on the agent side. An agent that gets "15% of 200 is 30" as a response has to regex out the number to use it in the next step. Regexing prose is brittle, locale-dependent, and the first place an agent pipeline fails when the model rewrites the answer slightly. Prose is structured information the wrong way around.

The engine returns an object with named fields instead. The result is a number in its own field. The formula is a string in another. The human-readable form is a third. Numbers carry their unit in a separate field. Currency carries its ISO code. Timestamps are ISO 8601 in UTC. Every piece is named, every piece is pickable.

Structured output. Never prose.

The widget reads the structured object and renders prose for the human. The agent reads the same object and uses the result field directly. Both win: the human gets a clean sentence, the agent gets a clean field, and neither side has to parse the other's representation. The agent's first call works because nothing in the protocol asks it to guess.

The transferable why: return data, not strings. Prose is structured information rendered for humans; once it's rendered, you've thrown away the structure. Other consumers of the same output — agents, dashboards, exports, replays — have to reverse-engineer the structure you discarded. Hand them the structure to begin with and let each surface render its own prose.

Locked 2026-05-10 · named-field outputs · prose lives in the widget, not the engine

The fourth decision was about how the engine talks back when something goes wrong.

Early engines threw "Validation failed" or "Invalid input" when an argument was off. Technically correct, operationally useless. An agent that gets that error has no idea what to do next. It re-tries the same inputs, gets the same error, escalates to the human, and the human spends three minutes reading the engine source to figure out what the agent did wrong. The error message was the wrong abstraction — it told the developer that an error happened; it didn't tell the agent how to fix it.

The fix was to treat every error as the next call's prompt. "Amount must be positive; got -5. Did you mean amount=5?" tells the agent exactly which field is wrong, what the constraint is, and what a plausible correction looks like. "Value must be a number; got '12kg' — did you mean value=12, from_unit='kg'?" teaches the agent the shape of the call without a human in the loop.

The error is the next call's prompt. Write it that way.

The cost of writing error messages this way is one extra sentence per failure path — maybe twenty lines of code per engine. The benefit is that agents resolve their own mistakes most of the time, without escalating, and humans only see the errors that genuinely indicate a bug rather than a misuse.

The transferable why: error messages are the place a system teaches its consumers what it expects. Most errors are written for the developer who wrote the code; the audience is actually the caller who triggered the error. Treat the error message as the documentation that fires only when it's needed — and the documentation gets read every time.

Locked 2026-05-10 · errors-that-teach · one sentence, named field, suggested fix

The last decision was about how to migrate the older Microapps that pre-dated the engine pattern.

Most of the tools in the catalogue were built before the engine rules were written down. Their math lives inside the widget that draws them on the page — fine for the website, useless for an agent. The first impulse was to migrate them all in a sprint. The math was already written; the lift looked mechanical. A week of focused work, the whole catalogue agent-ready.

I didn't do it. The reason was the same reason I don't do most one-shot sweeps anymore: a one-shot migration done in a hurry creates a class of subtle regressions that surface over the following months, each one annoying enough to investigate and not annoying enough to revert. A migration done one tool at a time, whenever someone touches the tool for any other reason, ships at the speed of the rest of the work, with the regressions caught by whoever was already in the file.

Migrate one at a time. Whenever someone touches the tool for any other reason.

The recipe is short. Find the pure core inside the widget — the function that takes inputs and returns outputs without touching state. Lift it into an engine file with a kebab-case name that matches the slug. Have the widget call the engine instead of inlining the math. Verify nothing changed for the user. Move on.

The catalogue migrates at the speed of the catalogue's own work. A year from now most engines will be migrated; the ones that aren't will be the ones nobody touched, which is exactly the right priority order.

The transferable why: one-shot rewrites are seductive because the work looks uniform. They tend to fail because the regressions they introduce are also uniform — and uniform regressions are the kind that ship to production together. Migrations that piggyback on existing work ship at the speed of the existing work and inherit the existing work's test coverage. Pay the slower migration; collect the cheaper regression bill.

Locked 2026-05-10 · migrate-on-touch · the catalogue moves with its own gravity

That's the engine pattern. One math, one place, no clothes-as-truth. Every Microapp the company ships from here on opens with the engine file; the widget and the API are downstream of it.