The Tidy-First Economics of the Diff
Writing code is cheap. Proving a massive AI rewrite did not break production remains wildly expensive.
The cost of drafting code has collapsed. An agent writes in seconds what used to take a sprint. But the cost of verifying that code has not moved at all. It is still a senior engineer reading a diff, line by line, deciding whether the system’s promises survived.
Kent Beck’s Tidy First? was written as a guide for human refactoring. Read in 2026, it is something stricter: a prerequisite for letting machines change production code without losing control of what the code actually does.
The Fluency Trap
There is a natural signal of uncertainty in human code. Junior developers write syntax that looks hesitant. You can read a pull request and spot the boundaries of their understanding. The rough edges tell you where to focus.
Language models do not hesitate. They produce polished, confident syntax even when hallucinating logic. A function that silently drops an authentication check reads exactly like a function that preserves it. The visual signals of uncertainty are gone.
This is the fluency trap. Teams treat AI code review like human code review, approving massive diffs because they look professional. The actual technical debt being accumulated is trust without verification.
The Tangled Commit
A tangled commit mixes structural changes with behavioral changes. For a human author, this is annoying to review. For an agentic pipeline, it is the failure mode that matters most.
Consider a concrete scenario. An agent is asked to add request logging to a service. To “clean up” the file, it also extracts three helper functions, renames two variables, and flattens a nested conditional. The PR is 400 lines. The logging feature is 12 of them.
Buried in the extraction, the agent drops an early-return guard clause that validated an authentication token. The remaining 388 lines of syntactically perfect structural changes provide cover. The tests pass because no test targeted that specific guard. The reviewer, scanning for the logging feature, approves.
The privilege escalation ships silently. It surfaces weeks later, traced back to a “refactoring” nobody asked for.
When behavior and structure change in the same diff, verifying that no existing invariants were broken becomes intractable. The blast radius of the change exceeds what any reviewer can hold in working memory. Reliability decays with the size of an unverified rewrite, and the decay is steep.
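The scenario above can be sketched in a few lines. This is a minimal illustration, not code from any real service: `isValidToken`, `handleBefore`, `handleAfter`, and `buildResponse` are hypothetical names, and the token check is a stand-in. The point is only that both versions read as equally tidy.

```javascript
// Hypothetical sketch: an "extract helper" refactor that silently drops a guard.
// None of these names come from a real codebase.

const isValidToken = (token) => token === 'valid-token'; // stand-in check

// Before: the early-return guard rejects unauthenticated requests.
function handleBefore(request) {
  if (!isValidToken(request.token)) {
    return { status: 401, body: null };
  }
  return { status: 200, body: `hello ${request.user}` };
}

// After: the response logic was extracted and logging was added,
// but the guard did not survive the move.
function buildResponse(request) {
  return { status: 200, body: `hello ${request.user}` };
}

function handleAfter(request, log = () => {}) {
  log(`request from ${request.user}`); // the requested logging feature
  return buildResponse(request);
}

// Only a check that targets the guard tells the two versions apart.
const badRequest = { token: 'forged', user: 'mallory' };
const beforeStatus = handleBefore(badRequest).status; // 401
const afterStatus = handleAfter(badRequest).status;   // 200
```

A suite that only exercises valid tokens passes against both versions, which is how the regression ships.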
The Cost Asymmetry
Beck roots Tidy First? in two economic principles: the time value of money and the option value of code. Software creates value in two ways: what it does today, and what it could do tomorrow. A well-structured module preserves the option to change cheaply later. A tangled module forecloses that option by coupling every future change to a full rewrite.
When agents write the code, the generation half of this equation collapses. The cost of producing a first draft approaches zero. But the verification half stays fixed. A senior engineer still has to read the diff, trace the invariants, and decide whether the system’s contracts survived. That labor does not scale with the speed of generation.
The result is a permanent asymmetry. There are tasks where AI delivers clear returns: converting a messy JSON payload into a strictly typed DTO against a deterministic schema, or writing a pure function that an existing test harness can validate instantly. Generation is expensive to do by hand, and verification is cheap. The economics work.
Then there are tasks where the savings are illusory. A cross-repository refactoring, a complex business logic change, a structural migration. The agent generates the code in seconds, but the reviewer must re-derive the agent’s reasoning path across the entire dependency graph. The cognitive labor does not disappear. It relocates downstream, lands on the reviewer, and gets more expensive because the diff is bigger and the intent is less legible than a human’s.
Organizations that measure productivity by commit throughput are measuring the wrong side of the equation. The constraint is verification. The metric that matters is how quickly a reviewer can confirm that a diff did what it claims and nothing else. Every tangled commit makes that confirmation slower. Every isolated tidy makes it faster.
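As an illustration of the cheap-to-verify class of task mentioned above, here is a minimal sketch of a payload-to-DTO conversion. The field names (`user_id`, `full_name`, `role`) and the function `toUserDto` are assumptions made up for this example. Because the function is pure and the schema is fixed, a reviewer can confirm it with a couple of assertions instead of tracing a dependency graph.

```javascript
// Hypothetical payload-to-DTO conversion: a pure function with a fixed schema.
// Field names are invented for illustration.
function toUserDto(payload) {
  const id = String(payload.user_id ?? '');
  const name = String(payload.full_name ?? '').trim();
  // Anything that is not explicitly 'admin' falls back to the least privilege.
  const role = payload.role === 'admin' ? 'admin' : 'viewer';

  if (id === '' || name === '') {
    throw new TypeError('payload is missing user_id or full_name');
  }
  // Unknown fields on the payload are dropped; the result is read-only.
  return Object.freeze({ id, name, role });
}

const dto = toUserDto({
  user_id: 42,
  full_name: '  Ada Lovelace ',
  role: 'admin',
  extra: true,
});
// dto is { id: '42', name: 'Ada Lovelace', role: 'admin' }
```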
Guard Clauses and Dead Code: Why Tidying Is Physical
Beck’s tidying maneuvers read differently when the coder is a transformer.
Guard clauses. Beck recommends returning early to reduce cognitive load. For an agent, a guard clause physically flattens the abstract syntax tree. Flat ASTs mean the model spends less attention capacity tracking nested conditional state. Each level of nesting is a branch the transformer must track across its context during generation. Flatten the branches and you reduce the probability of a logic inversion. The tidying changes the computation.
Dead code removal. Beck says delete dead code because it confuses humans. The agentic version is harsher: dead code poisons the prompt. An agent reading a deprecated branch conditions its output on it and can hallucinate calls to obsolete interfaces. Every unreachable function is a tax on the context window. You pay for that mess in tokens, latency, and degraded reasoning. Dead code removal is garbage collection for the prompt.
These are not metaphors. Guard clauses reduce the amount of nested conditional state the model has to track. Dead code removal shrinks the noise floor of the context window. Beck’s tidying maneuvers have mechanical consequences for transformer-based generation that he could not have anticipated.
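For readers who want the guard-clause tidy in concrete form, here is a small sketch. The functions `canEditNested` and `canEditFlat` and the shape of `user` and `doc` are hypothetical. The sketch shows only the structural change and a quick check that behavior is preserved across a handful of inputs.

```javascript
// Hypothetical example: the same permission check, nested and then flattened.

// Nested: three levels of conditional state to track before the result.
function canEditNested(user, doc) {
  if (user) {
    if (user.active) {
      if (doc.ownerId === user.id) {
        return true;
      }
    }
  }
  return false;
}

// Flattened with guard clauses: each failed precondition exits early.
function canEditFlat(user, doc) {
  if (!user) return false;
  if (!user.active) return false;
  if (doc.ownerId !== user.id) return false;
  return true;
}

// A pure tidy must not change behavior, so compare both versions
// over inputs that exercise every branch.
const cases = [
  [null, { ownerId: 1 }],
  [{ id: 1, active: false }, { ownerId: 1 }],
  [{ id: 1, active: true }, { ownerId: 2 }],
  [{ id: 1, active: true }, { ownerId: 1 }],
];
const sameBehavior = cases.every(
  ([user, doc]) => canEditNested(user, doc) === canEditFlat(user, doc)
);
```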
Ousterhout’s End-State and Beck’s Transition Function
I wrote previously about John Ousterhout’s “Deep Modules.” Deep modules define the spatial shape a codebase needs for constrained LLM attention to work. They give the agent a clean boundary to operate within.
But how do you reach that state safely when the current codebase is a shallow, coupled mess?
Beck’s answer is the transition function. You tidy first, verify that behavior did not change, then alter behavior in a separate step. The tidying moves the structure toward Ousterhout’s ideal. The separation ensures you can prove each step independently.
Skip the separation and watch what happens. An agent is asked to add a caching layer to a service with a shallow, highly coupled module structure. To make the cache fit, the agent refactors the data access layer: extracting interfaces, renaming internal methods, consolidating three query functions into one. The cache works. But the consolidated query function silently changed the sort order of results because the agent optimized for the cache’s access pattern rather than preserving the original contract. The tests pass because no test asserted sort order. The regression surfaces in a downstream report that nobody connects to the caching PR for weeks.
The agent did two things in one step. You can verify neither independently. Beck’s rule prevents this by making the structural tidy a separate, provable operation. The feature prompt inherits a clean structure and produces a small, focused diff.
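The sort-order regression is the kind of thing a characterization test pins down before any tidy begins. The sketch below is an assumption-laden illustration: `listOrders`, `listOrdersConsolidated`, the `rows` fixture, and the "oldest first" contract are all invented here to stand in for the data access layer described above.

```javascript
// Hypothetical fixture standing in for the data access layer.
const rows = [
  { id: 'a', createdAt: 3 },
  { id: 'b', createdAt: 1 },
  { id: 'c', createdAt: 2 },
];

// Original contract (assumed): results come back oldest first.
function listOrders(data) {
  return [...data].sort((x, y) => x.createdAt - y.createdAt);
}

// Consolidated version from the tangled refactor: ordered by id,
// because that suited the cache's access pattern.
function listOrdersConsolidated(data) {
  return [...data].sort((x, y) => x.id.localeCompare(y.id));
}

// Characterization check: pin the order the original produces.
function keepsOldestFirst(query) {
  const ids = query(rows).map((row) => row.id);
  return ids.join(',') === 'b,c,a';
}

const originalHolds = keepsOldestFirst(listOrders);                 // true
const consolidatedHolds = keepsOldestFirst(listOrdersConsolidated); // false
```

Written before the refactor, a check like this would have failed on the consolidated query instead of leaving the change to surface in a downstream report.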
You cannot ask an agent to refactor a shallow module into a deep module while simultaneously adding a new feature. In my experience, current models often stumble on multi-file structural refactoring, dropping branches or inverting logic along the way. The transition must be isolated. Tidy, then verify. Only then, change behavior.
The Two-Prompt Pipeline
Agents are fast and compute is cheap. Why bother tidying? Just let the agent rewrite the whole file.
Because a 400-line whole-file rewrite by a non-deterministic model is a reliability problem you cannot review your way out of. Writing is cheap. Proving the rewrite did not introduce a silent failure remains wildly expensive.
The defense is a strict two-prompt pipeline that isolates structural refactoring from feature generation.
Phase one: the pure refactor. The agent is prompted strictly to tidy the structure without altering behavior. This phase is constrained by a tripwire test: a locked unit test written before the agent touches the file, targeting the exact data contract of the module. The agent is forbidden from editing the tripwire. If the structural tidy accidentally alters runtime behavior, the tripwire fails and the loop terminates.
A tripwire test for a user service might look like this:
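// LOCKED: do not modify. Tripwire for structural tidy.
// The module path below is a placeholder; point it at the real user service.
const { createUser } = require('./userService');

test('user service contract', () => {
  const result = createUser({ name: 'test', role: 'viewer' });

  // Pin the exact return shape; only the generated id may vary.
  expect(result).toStrictEqual({
    id: expect.any(String),
    name: 'test',
    role: 'viewer',
    permissions: ['read'],
  });
});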
If the agent’s refactoring changes the return shape, the permissions array, or the role mapping, this test catches it before the PR exists.
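The rule that the agent may not edit the tripwire can itself be checked mechanically. One possible approach, sketched here as an assumption and not as a prescribed mechanism, is to fingerprint the test source before the agent runs and compare afterwards. `fingerprint` and `tripwireIntact` are hypothetical helper names; the hashing uses Node's built-in `crypto` module.

```javascript
const crypto = require('crypto');

// Fingerprint the tripwire's source text with SHA-256.
function fingerprint(source) {
  return crypto.createHash('sha256').update(source, 'utf8').digest('hex');
}

// Returns true only if the tripwire source is unchanged.
function tripwireIntact(lockedFingerprint, currentSource) {
  return fingerprint(currentSource) === lockedFingerprint;
}

// Usage with in-memory strings. A real pipeline would read the test file
// from disk before and after the agent runs.
const original = "test('user service contract', () => { /* ... */ });";
const locked = fingerprint(original);
const tampered = original.replace('contract', 'contract (relaxed)');

const untouchedOk = tripwireIntact(locked, original); // true
const tamperedOk = tripwireIntact(locked, tampered);  // false
```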
Phase two: the behavioral delta. Only after the structural changes have been committed and verified does the agent receive the second prompt: add the feature. Because the codebase was tidied in phase one, the feature diff is small and isolated. The reviewer reads a focused block of business logic instead of parsing 400 lines of mixed intent.
By separating the prompts, the pipeline creates a verification boundary. Each diff answers one question. Did the structure change safely? Did the behavior change correctly? The reviewer never has to answer both at once.
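The control flow of the two phases can be sketched as a small orchestration function. Everything here is hypothetical: `runTwoPromptPipeline` and its injected `tidy`, `addFeature`, and `runTripwire` callbacks stand in for agent calls and a test runner, and the sketch is kept synchronous for brevity where real agent calls would be asynchronous.

```javascript
// Hypothetical orchestration of the two phases. The callbacks are injected
// stand-ins: `tidy` and `addFeature` return a commit label, and
// `runTripwire` returns true when the locked test passes.
function runTwoPromptPipeline({ tidy, addFeature, runTripwire }) {
  const commits = [];

  // Phase one: structure only. The tripwire must stay green.
  commits.push(tidy());
  if (!runTripwire()) {
    // The tidy changed behavior: stop before any feature work begins.
    return { ok: false, stoppedAt: 'tidy', commits: [] };
  }

  // Phase two: behavior only, on top of the verified structure.
  commits.push(addFeature());
  return { ok: true, stoppedAt: null, commits };
}

const passing = runTwoPromptPipeline({
  tidy: () => 'tidy: extract helpers',
  addFeature: () => 'feat: add request logging',
  runTripwire: () => true,
});

const failing = runTwoPromptPipeline({
  tidy: () => 'tidy: extract helpers',
  addFeature: () => 'feat: add request logging',
  runTripwire: () => false,
});
```

In the passing run the reviewer receives two commits, one per question. In the failing run the feature prompt is never issued.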
The Diff as the Unit of Trust
The future of software economics belongs to the teams that enforce the strictest structural boundaries around their agents.
I do not review code anymore. I review diffs. The diff is the unit of trust in a codebase maintained by machines. When that diff is clean and isolated, I can verify it. When it tangles structure and behavior into a single commit, I cannot. No one can. The reviewer’s correction ability drops toward zero, and whatever the agent introduced ships unchecked.
The engineering role has shifted. I write the interface. I write the tripwire test. I set the boundary that triggers an alarm when the agent crosses it. The agent implements. I own what ships.
Writing is cheap. The proof is expensive. The only way to afford the proof is to keep the diff small enough to read.
Tidy first. The machine cannot guess what you meant.


