The Autonomy of AI Software Engineering Agents

2026-04-19 · 5 min

Software Engineering is among the hardest domains to verify in, and the limitations that arise are only a symptom of that, while context windows, proneness to hallucination, and tool limits get improved iteratively with each release. An agent can be deployed without a human in the loop only on tasks for which a cheap, sound verifier exists. Cheap and sound are the binding constraints. Where agents already work autonomously, there exists a verifier such that a proof either checks, a test either passes, and a program either crashes or runs. AlphaProof can iterate for hours. Fuzzing campaigns don't need supervision. Competitive programming/Leetcode-like problems are essentially solved. In these domains, capability has been there for a while, and what makes them work agentically is that "correct" is a decidable function that can run in a loop. Real Software Engineering has no such function, and it is not an engineering inconvenience, but a stack of theorems.

Rice's theorem (1953) is the founding result: for any non-trivial semantic property of programs ("returns a sorted list", "never leaks memory", "is free of deadlocks") the problem "does program M satisfy property P?" is undecidable. Rice generalizes halting, a non-trivial semantic property, and every interesting property inherits the same fate. There is no algorithm that takes arbitrary (M, P) and decides correctness in finite time, irrespective of speed or compute capacity. It serves as a structural ceiling on what any verifier can do, LLMs included. The escape routes are:

1. Restrict the program class (finite-state systems, typed fragments).
2. Restrict the property class (type-checkable properties, safety-only).
3. Accept incompleteness (verifier can return "unknown", or even lie).

None of these routes lead anywhere pleasant. Restriction trades undecidability for intractability: the decidable fragments, SAT, model checking, bounded verification, sit in NP-complete, PSPACE-complete, or worse, and state-space explosion (2^n for n booleans past 10^30 at n=100) eats any advantage SMT solvers and abstractions buy back. Accepting incompleteness means accepting a verifier that sometimes shrugs or lies, which just moves the human back into the loop to settle the shrugs.

Formal verification works when one pays for it. seL4 is the canonical number: roughly 200k lines of Isabelle/HOL proof for 10k lines of C, about 20 person-years for a microkernel with a narrow spec. That is the documented cost of pushing through the undecidability barrier for one artifact that cannot be done at the velocity of shipping features in weekly deadlines.

Production code sacrifices proof and relies on tests. But testing is sampling, and sampling has its own ceiling. A passing test suite bounded the probability of failure only on the distribution it sampled from. Bugs live in the tail, on distributions noone wrote a test for. This is Goodhart's law in statistical form, with the test being a proxy, an agent optimized to pass tests optimizes the proxy, not the target. The gap cannot be patched by writing more tests because the space of semantically distinct programs has infinite VC dimensions. As a result no finite suite PAC-verifies correctness, and Weyuker's oracle problem says that for a large class of programs, optimizers, simulations, anything without a closed-form spec, not even the correct output to test against can be determined. Tests require an oracle, and when the program under construction is the oracle, testing is circular.

That still assumes a specification to verify against that doesn't exist in practice. A user story is a lossy-compressed encoding of stakeholder preferences, transmitted over a noisy social channel. The real specification includes team conventions, what the PM meant versus what they typed, the architectural decision from years ago nobody wrote down, which deploy environment matters in a few weeks, etc. No finite ticket captures it. The fully written specification is an infinite object, approximated in real time by humans who have been around long enough to hold the priors. The human in the loop is a living specification oracle. Removing them removes the oracle.

These layers of theorems compose. Consider an agent taking n sequential actions, each checked by some verifier with soundness 1-α (probability a bad artifact passes). Assuming independence (generous, as errors often correlate) the probability the whole trajectory is correct is at best (1-α)^n. For α = 0.05 and n = 100, this is 0.6%. Small per-step soundness gaps compound exponentially, and α = 0 is precisely what Rice forbids. Reducing α has a cost (compute, time, or human review), and there is no known way to drive it to zero at shipping velocity. This is the mathematical reason humans remain in the loop, being the last α-reducer, deployed selectively where the automated verifier is known to be weak.

The natural response is "train end-to-end with RL against a learned reward model," but a reward model is itself a verifier of bounded soundness, and agents trained against it learn to exploit the soundness gap. The tighter the loop, the more aggressively the agent mines false-positive regions of the reward. This moves α from the outer loop to the reward model without eliminating it, i.e. Goodhart with gradient steps.

This predicts the frontier of autonomous Software Engineering to advance exactly as fast as verifiers that are cheaper than humans and sound enough to bound α are created. Domains with executable specs like compiler passes against a reference implementation, numerical kernels with mathematical oracles, refactors that must preserve test behavior, go autonomous first. They already are. Domains with dense tests and strong type systems like Rust services, purely functional backends, property-tested libraries, are next. Legacy codebases without tests, with specifications living in human memory, are last. Not because the models aren't smart enough, but because α is naturally close to 0.5 and cannot be reduced without first building the verifier, and that requires the human that the agent was meant to replace.

In the end, an autonomous agent deployed to iterate over and edit software components isn't bottlenecked on intelligence, but on the fact that Rice's theorem, the intractability of decidable verification, the statistical nature of testing, and the informality of specifications together form a collection of layers where each leaks probability mass into the agent's error budget. More parameters add per-step capability yet they do not close the verification bug, because the bug is a theorem.