Essay · Software Architecture × AI
Functionally Correct, Architecturally Broken: The Blind Spot in AI-Generated Code
Our models write code that compiles, passes its tests, and ships. They also quietly erode the architecture underneath — and almost no one is measuring it.
Gláucia (Glau) Santana
A developer on one of my squads asks an AI coding agent for a payment endpoint. Seconds later, there it is: clean, readable, functionally correct. It compiles. It passes the tests. It ships.
And in that frictionless moment, a question quietly goes unasked: does this code honor the architecture we decided on? Is the write idempotent? Is authentication where it needs to be? Will it hold at twenty thousand transactions a minute, or only on the developer's laptop?
Multiply that moment across every squad, every sprint, every quarter. What you get is not a bug. It's drift — slow, silent architectural erosion that no test suite is designed to catch.
I've spent fifteen years leading engineering in payments and fintech, and the last two watching this pattern repeat. It is now the subject of my doctoral research. I want to lay out the problem as precisely as I can, because I think our industry is measuring the wrong thing.
We benchmark correctness. We don't benchmark fitness.
The instinct is to treat this as anecdote — "developers should just review the code." The data says otherwise.
A 2025 systematic review by Owoola and colleagues, covering 146 studies, found that the non-functional attributes most consistently neglected in LLM-generated code are precisely the ones that decide whether a system survives contact with production: security, reliability, and maintainability. The review also makes a sharper point. The benchmarks we use to evaluate these models — HumanEval and its descendants — measure functional correctness only. They have nothing to say about whether the code is secure, scalable, or sound under load.
The downstream signal is already visible. A large-scale empirical study by Hassan et al. (2026) found that code churn — a well-established proxy for low quality — rose from 3.1% in 2020 to 5.7% in 2024, tracking the adoption curve of AI-assisted development. Silva et al. (2025) found vulnerabilities in 27.3% of generated code, spanning 43 distinct CWE categories.
So we have models optimized to produce code that works, evaluated by benchmarks that only check whether it works, accepted by developers who — research on developer interaction confirms — tend not to subject generated code to systematic architectural review. Every layer of that pipeline is blind to the same thing.
The problem isn't the code. It's the level above it.
Here is the distinction I think we keep collapsing.
Architecture and code live at different levels of abstraction. Code is implementation — line by line, component by component. Architecture is the level above: the structural decisions, the patterns, the trade-offs you accept on purpose. "Every write is idempotent." "Authentication happens at the gateway, never in the service." "We trade some latency for stronger consistency here, and the reverse over there."
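The first of those commitments is easy to state and easy to lose in a diff. Here is a minimal sketch of what "every write is idempotent" asks of the code, assuming an in-memory store in place of a database. The names `PaymentLedger`, `charge`, and `idempotency_key` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PaymentLedger:
    """Toy ledger: an in-memory stand-in for a real payments store."""
    _by_key: dict = field(default_factory=dict)   # idempotency key -> receipt
    charges: list = field(default_factory=list)   # every charge actually applied

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # A retry with a key we have already seen returns the original
        # receipt instead of charging the customer a second time.
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        receipt = {"id": len(self.charges) + 1, "amount_cents": amount_cents}
        self.charges.append(receipt)
        self._by_key[idempotency_key] = receipt
        return receipt


ledger = PaymentLedger()
first = ledger.charge("order-42", 1999)
retry = ledger.charge("order-42", 1999)   # e.g. the client timed out and retried
```

An endpoint written without the key check would still compile and still pass a test that calls it once. The difference only shows up when a request is retried.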
LLMs operate brilliantly at the code level and are, by design, absent from the level above. They solve the immediate, local problem. They do not — and currently cannot — reason about the system as a whole. The architectural decision simply isn't in their generation scope.
This is why code review doesn't save us. Review is tuned to catch code-level defects: a missing null check, an off-by-one, a leaked secret. It is not tuned to notice that, across forty pull requests, the system has quietly stopped honoring a structural commitment everyone agreed to six months ago. The violation isn't in any single diff. It's in the aggregate. And nobody owns the aggregate.
A counterintuitive direction: let architecture steer the model.
The most interesting recent work in this space points language models at architecture — using LLMs to help execute or accelerate architecture evaluation. The model evaluates the design.
My research proposes the opposite, complementary direction. Let a proven architecture-evaluation method produce the constraints, and use those constraints to steer the model before it writes a line of code. Evaluation first; generation second.
The method I'm adapting is ATAM — the Architecture Tradeoff Analysis Method, developed at Carnegie Mellon's Software Engineering Institute in 2000. ATAM is a structured way to surface how an architecture's decisions trade off against quality attributes like performance, availability, and security. It was built for a world of deterministic, human-written systems. I'm asking what it becomes when part of the system is generated by an AI agent.
The adaptation — I call it ATAM-LLM — works in three layers:
First, a focused ATAM workshop turns vague intentions ("it has to be reliable and secure") into concrete, checkable architectural constraints: OAuth 2.0 for authentication, idempotent write operations, mandatory structured logging on every external call.
Second, those constraints become structured prompt instructions the agent receives before generation. Instead of "write me a REST endpoint," the developer issues a template carrying the architectural constraints with it. The architecture enters the loop upstream, not after the fact.
Third, static analysis checks the generated code against the constraints, yielding a Constraint Conformance Index: the percentage of architectural constraints the delivered code actually satisfies. Then production telemetry checks whether runtime behavior matches what the architecture promised.
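To make the flow from constraint to prompt to score concrete, here is a minimal sketch under stated assumptions. The constraint IDs, the `build_prompt` and `conformance_index` helpers, and the string-matching checks are all hypothetical. The checks in particular are a crude stand-in for real static-analysis rules, which would inspect the syntax tree rather than search the text.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Constraint:
    id: str
    rule: str                      # wording handed to the coding agent
    check: Callable[[str], bool]   # placeholder for a static-analysis rule


# The three example constraints named in the text, with toy checks.
CONSTRAINTS = [
    Constraint("AUTH-01", "Authenticate every request with OAuth 2.0.",
               lambda src: "oauth2" in src.lower()),
    Constraint("IDEM-01", "Every write operation must be idempotent.",
               lambda src: "idempotency_key" in src),
    Constraint("LOG-01", "Use structured logging on every external call.",
               lambda src: re.search(r"logger\.\w+\(", src) is not None),
]


def build_prompt(task: str, constraints: list[Constraint]) -> str:
    """Render a prompt template that carries the constraints with the task."""
    lines = [f"Task: {task}", "Architectural constraints (all mandatory):"]
    lines += [f"- [{c.id}] {c.rule}" for c in constraints]
    return "\n".join(lines)


def conformance_index(source: str, constraints: list[Constraint]) -> float:
    """Percentage of constraints the given source satisfies."""
    if not constraints:
        raise ValueError("no constraints to check")
    satisfied = sum(1 for c in constraints if c.check(source))
    return 100.0 * satisfied / len(constraints)


# Hypothetical agent output: authenticated and idempotent, but no logging.
generated = '''
def create_payment(request, idempotency_key):
    token = oauth2.verify(request.headers["Authorization"])
    return ledger.write(request.body, key=idempotency_key)
'''

prompt = build_prompt("Write a REST endpoint that creates a payment.", CONSTRAINTS)
cci = conformance_index(generated, CONSTRAINTS)   # 2 of 3 constraints satisfied
```

Keeping the rule text and its check in one record means the prompt and the verifier cannot silently disagree about what the architecture requires.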
Making drift a number.
That last step is where the core idea of my work lives. You cannot manage what you cannot measure, and right now architectural drift is unmeasured. So I'm proposing to measure it directly.
I call it the Architectural Drift Index (IDA): the measurable distance between two views of the same system. Intent is the behavior the architecture workshop decided and formalized. Reality is the behavior production observability actually records. Concretely, for a payment system: an intended P99 latency of two seconds observed at 4.3 seconds; an intended 99.995% availability observed at 98.2%; an intended 100% of endpoints idempotent, measured at 70%.
Each gap is a quantified divergence between what we decided and what we built. Aggregate them into a single normalized index, and architectural drift stops being a feeling that something has gotten worse. It becomes a number you can track sprint over sprint, set thresholds on, and act on before the erosion compounds.
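One way such an aggregate could be computed is sketched below, using the three example gaps from the payment system. The normalization is an assumption made for illustration, not a formula from the research: each gap is the shortfall relative to its target, clipped to the range 0 to 1, and the index is the unweighted mean. The `Gap` class and `drift_index` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gap:
    name: str
    intended: float          # target from the architecture workshop (must be > 0)
    observed: float          # value recorded by production telemetry
    higher_is_better: bool

    def drift(self) -> float:
        """Relative shortfall against the target, clipped to [0, 1]."""
        if self.higher_is_better:
            shortfall = self.intended - self.observed
        else:
            shortfall = self.observed - self.intended
        # Meeting or beating the target counts as zero drift.
        return min(1.0, max(0.0, shortfall / self.intended))


def drift_index(gaps: list[Gap]) -> float:
    """Unweighted mean of per-metric drift: 0 is on target, 1 is fully off."""
    if not gaps:
        raise ValueError("no gaps to aggregate")
    return sum(g.drift() for g in gaps) / len(gaps)


gaps = [
    Gap("p99_latency_seconds", 2.0, 4.3, higher_is_better=False),
    Gap("availability_percent", 99.995, 98.2, higher_is_better=True),
    Gap("idempotent_endpoints_percent", 100.0, 70.0, higher_is_better=True),
]
ida = drift_index(gaps)   # roughly 0.44 under this normalization
```

The sketch also shows why the normalization is a real design decision. A relative gap makes the availability miss look small (under 2%) even though it consumes the error budget many times over, so a production version would likely weight metrics or normalize against the allowed budget instead.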
The central claim my research sets out to test: squads that generate code under architectural constraints produce significantly lower drift than squads that generate code from free-form prompts. Not "cleaner-feeling" code — measurably lower drift, under controlled conditions.
Why this is urgent now.
The gap I'm describing is widening on its own. As generation gets faster and cheaper, the volume of code that no human meaningfully architected grows, while our capacity to evaluate architectural fitness stays flat. Speed scales. Judgment doesn't — unless we move it upstream, into the generation loop itself.
I don't think the answer is to slow down generation. The answer is to make architecture a first-class input to it, and to make architectural drift as observable as latency or error rate already are.
I'm now running this as a controlled experiment — squads building a realistic payment system, half with architectural constraints in the loop and half without — and I'll be sharing what the numbers say as the research progresses.
If you lead or build software, I'd like to hear from you. Where have you watched architecture quietly erode under the speed of generated code? What would you most want to be able to measure that you currently can't? The most useful version of this work will be shaped by people fighting this in production, not only by the experiment in the lab.
Gláucia (Glau) Santana is an engineering leader in fintech and payments and a doctoral researcher at the Polytechnic School of the University of São Paulo (EPUSP), where her work adapts the Architecture Tradeoff Analysis Method for software generated by large language models.