Glau Santana — Engineering leadership · Software architecture · AI

Engenharia
com Propósito.

Engineering with purpose.

I build and lead engineering organizations in fintech and payments — and I research what happens to software architecture when machines write the code.

About

Glau Santana is a Head of Engineering with fifteen years across fintech, payments, marketplaces, and adtech. She leads cross-functional squads building large-scale loyalty and payments platforms, and is a doctoral researcher at the Polytechnic School of the University of São Paulo (EPUSP).

Her research adapts the Architecture Tradeoff Analysis Method (ATAM) for software generated by large language models — asking how we keep architectural intent intact when code is written by machines. Her work sits where engineering leadership, software architecture, and AI meet, and it carries one conviction: engineering with purpose.

Role: Head of Engineering · loyalty & payments platform
Focus: Software architecture · AI-assisted development · Engineering leadership
Research: PhD candidate, EPUSP / USP (PPGEE) — ATAM-LLM
Education: Executive MBA, Fordham Gabelli · MSc, IPT
Based in: São Paulo, Brazil
Previously: iFood · B2W / Americanas · Dasa · Itaú · Bradesco

Writing

Selected essays

Research

ATAM-LLM

My doctoral work adapts ATAM so that architecture steers code generation rather than being discovered after the fact. Architectural constraints are elicited, translated into prompt instructions before a model writes code, and then verified — in static analysis and in production telemetry.

The core metric is the Architectural Drift Index (IDA): the measurable distance between the architecture you decided on (Intent) and what production actually does (Reality).

Intent vs. Reality — worked example

Attribute	Intent	Reality	Δ
P99 latency	≤ 2.0 s	4.3 s	+2.3 s
Availability	99.995%	98.2%	−1.795 pp
Idempotent endpoints	100%	70%	−30 pp

Architectural Drift Index: 0.47 (high risk)

The thesis: squads that generate code under architectural constraints produce measurably lower drift than squads working from free-form prompts.

The foundation — a decade in the making

This question isn't new for me. Over a decade ago I began studying how teams hold on to non-functional quality — security, reliability, performance — under the pressure of fast, iterative delivery. My doctoral work is a direct extension of that thesis, carried into the age of AI-generated code.

Master's dissertation · 2014

A Roadmap for Identifying and Treating Non-Functional Requirements in Scrum Projects Through ATAM Practices

MSc in Software Engineering · Instituto de Pesquisas Tecnológicas (IPT), São Paulo

Original title: Roteiro para identificação e tratamento dos requisitos não funcionais em projetos Scrum aplicando as práticas do ATAM

The dissertation proposed a roadmap that brings ATAM's architecture evaluation into the agile flow of Scrum — where features get prioritized and non-functional requirements tend to be deferred. It ran architectural analysis at every stage of delivery, from backlog to sprint to integrated testing to post-deployment, so the risks tied to those requirements surfaced before code was written, not after. Validated on a real deployed system, it gave teams visibility into architectural risk and moved the critical conversations earlier. Its closing question — the technical debt those deferred requirements carry forward — is the very thread my doctoral research now picks up.

Read the full overview of the dissertation →

Mentorship

Mentoria de Gestão Tech

A mentorship program preparing women for leadership in engineering — across management, architecture, and software craft. The same conviction, pointed forward: opening the path for the next generation of women who lead with purpose.

Contact

Let's talk.

For speaking, research collaboration, advisory, or a conversation about engineering and AI:

LinkedIn · /in/glausantana
Email · glau@glausantana.com

Essay · Software Architecture × AI

Functionally Correct, Architecturally Broken: The Blind Spot in AI-Generated Code

Our models write code that compiles, passes its tests, and ships. They also quietly erode the architecture underneath — and almost no one is measuring it.

Gláucia (Glau) Santana

A developer on one of my squads asks an AI coding agent for a payment endpoint. Seconds later, there it is: clean, readable, functionally correct. It compiles. It passes the tests. It ships.

And in that frictionless moment, a question quietly goes unasked: does this code honor the architecture we decided on? Is the write idempotent? Is authentication where it needs to be? Will it hold at twenty thousand transactions a minute, or only on the developer's laptop?

Multiply that moment across every squad, every sprint, every quarter. What you get is not a bug. It's drift — slow, silent architectural erosion that no test suite is designed to catch.

I've spent fifteen years leading engineering in payments and fintech, and the last two watching this pattern repeat. It is now the subject of my doctoral research. I want to lay out the problem as precisely as I can, because I think our industry is measuring the wrong thing.

We benchmark correctness. We don't benchmark fitness.

The instinct is to treat this as anecdote — "developers should just review the code." The data says otherwise.

A 2025 systematic review by Owoola and colleagues, covering 146 studies, found that the non-functional attributes most consistently neglected in LLM-generated code are precisely the ones that decide whether a system survives contact with production: security, reliability, and maintainability. The review also makes a sharper point. The benchmarks we use to evaluate these models — HumanEval and its descendants — measure functional correctness only. They have nothing to say about whether the code is secure, scalable, or sound under load.

The downstream signal is already visible. A large-scale empirical study by Hassan et al. (2026) found that code churn — a well-established proxy for low quality — rose from 3.1% in 2020 to 5.7% in 2024, tracking the adoption curve of AI-assisted development. Silva et al. (2025) found vulnerabilities in 27.3% of generated code, spanning 43 distinct CWE categories.

So we have models optimized to produce code that works, evaluated by benchmarks that only check whether it works, accepted by developers who — research on developer interaction confirms — tend not to subject generated code to systematic architectural review. Every layer of that pipeline is blind to the same thing.

The problem isn't the code. It's the level above it.

Here is the distinction I think we keep collapsing.

Architecture and code live at different levels of abstraction. Code is implementation — line by line, component by component. Architecture is the level above: the structural decisions, the patterns, the trade-offs you accept on purpose. "Every write is idempotent." "Authentication happens at the gateway, never in the service." "We trade some latency for stronger consistency here, and the reverse over there."
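To make the distinction concrete, here is a minimal sketch of how "every write is idempotent" can be violated by code that is functionally correct. Everything in it is hypothetical and for illustration only: the `PaymentService` class, its in-memory storage, and the request key `req-42` are assumptions, not part of any real system described here.

```python
from dataclasses import dataclass, field


@dataclass
class PaymentService:
    """Hypothetical in-memory service; a real one would use a durable store."""
    charges: list = field(default_factory=list)
    _seen: dict = field(default_factory=dict)

    def charge_naive(self, account: str, amount_cents: int) -> int:
        # Functionally correct: records the charge and returns its id.
        # Not idempotent: a client retry records the same charge twice.
        self.charges.append((account, amount_cents))
        return len(self.charges)

    def charge_idempotent(self, key: str, account: str, amount_cents: int) -> int:
        # A repeated request key returns the original result
        # instead of recording a second charge.
        if key in self._seen:
            return self._seen[key]
        self.charges.append((account, amount_cents))
        charge_id = len(self.charges)
        self._seen[key] = charge_id
        return charge_id


naive = PaymentService()
naive.charge_naive("acct-1", 500)
naive.charge_naive("acct-1", 500)  # client retry after a timeout
naive_count = len(naive.charges)

safe = PaymentService()
first = safe.charge_idempotent("req-42", "acct-1", 500)
retry = safe.charge_idempotent("req-42", "acct-1", 500)  # same key, same result
safe_count = len(safe.charges)
```

A unit test that calls either method once passes in both cases. Only the architectural rule distinguishes them, which is why the difference tends to surface in production rather than in review.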

LLMs operate brilliantly at the code level and are, by design, absent from the level above. They solve the immediate, local problem. They do not — and currently cannot — reason about the system as a whole. The architectural decision simply isn't in their generation scope.

This is why code review doesn't save us. Review is tuned to catch code-level defects: a missing null check, an off-by-one, a leaked secret. It is not tuned to notice that, across forty pull requests, the system has quietly stopped honoring a structural commitment everyone agreed to six months ago. The violation isn't in any single diff. It's in the aggregate. And nobody owns the aggregate.

A counterintuitive direction: let architecture steer the model.

The most interesting recent work in this space points language models at architecture — using LLMs to help execute or accelerate architecture evaluation. The model evaluates the design.

My research proposes the opposite, complementary direction. Let a proven architecture-evaluation method produce the constraints, and use those constraints to steer the model before it writes a line of code. Evaluation first; generation second.

The method I'm adapting is ATAM, the Architecture Tradeoff Analysis Method, developed at Carnegie Mellon's Software Engineering Institute in the late 1990s. ATAM is a structured way to surface how an architecture's decisions trade off against quality attributes like performance, availability, and security. It was built for a world of deterministic, human-written systems. I'm asking what it becomes when part of the system is generated by an AI agent.

The adaptation — I call it ATAM-LLM — works in three layers:

1 — Elicit

A focused ATAM workshop turns vague intentions ("it has to be reliable and secure") into concrete, checkable architectural constraints: OAuth 2.0 for authentication, idempotent write operations, structured logging mandatory on every external call.
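One way to picture the output of this step is as plain, checkable records. The sketch below is an assumption about shape only: the `Constraint` fields, the identifiers such as `SEC-01`, and the mapping of each rule to a quality attribute are hypothetical and are not taken from the ATAM-LLM method itself.

```python
from dataclasses import dataclass
from enum import Enum


class QualityAttribute(Enum):
    SECURITY = "security"
    RELIABILITY = "reliability"
    OBSERVABILITY = "observability"


@dataclass(frozen=True)
class Constraint:
    id: str                      # hypothetical identifier scheme
    attribute: QualityAttribute  # assumed mapping, for illustration
    statement: str               # the rule, phrased so it can be checked


# The three example constraints named in the text above.
CONSTRAINTS = [
    Constraint("SEC-01", QualityAttribute.SECURITY,
               "Authentication uses OAuth 2.0."),
    Constraint("REL-01", QualityAttribute.RELIABILITY,
               "Every write operation is idempotent."),
    Constraint("OBS-01", QualityAttribute.OBSERVABILITY,
               "Every external call emits a structured log entry."),
]


def by_attribute(constraints):
    """Group constraints by the quality attribute they protect."""
    grouped = {}
    for c in constraints:
        grouped.setdefault(c.attribute, []).append(c)
    return grouped


grouped = by_attribute(CONSTRAINTS)
```

Keeping each rule as a single declarative sentence with a stable identifier is what lets later steps refer to it, both in a prompt and in a conformance report.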

2 — Translate

Those constraints become structured prompt instructions the agent receives before generation. Instead of "write me a REST endpoint," the developer issues a template carrying the architectural constraints with it. The architecture enters the loop upstream, not after the fact.
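As a rough illustration of what such a template might look like, the sketch below prepends a numbered list of constraints to the task. The function name `build_prompt`, the wording of the header line, and the layout are all assumptions; the actual ATAM-LLM templates are not specified here.

```python
def build_prompt(task: str, constraints: list[str]) -> str:
    """Render a prompt that states the architectural constraints before the task."""
    if not constraints:
        raise ValueError("expected at least one architectural constraint")
    lines = ["Architectural constraints (all must hold in the generated code):"]
    lines += [f"{i}. {c}" for i, c in enumerate(constraints, start=1)]
    lines += ["", f"Task: {task}"]
    return "\n".join(lines)


prompt = build_prompt(
    "Write a REST endpoint that creates a payment.",
    [
        "Authentication uses OAuth 2.0.",
        "Every write operation is idempotent.",
        "Every external call emits a structured log entry.",
    ],
)
print(prompt)
```

The point of the design is ordering: the constraints arrive in the same message as the request, so the model never sees the task without them.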

3 — Verify

Static analysis checks the generated code against the constraints — yielding a Constraint Conformance Index, the percentage of architectural constraints the delivered code actually satisfies. Then production telemetry checks whether runtime behavior matches what the architecture promised.
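A minimal sketch of the conformance calculation follows, using the definition given above (the percentage of constraints satisfied). The single static check is a toy built on Python's standard `ast` module, and it assumes a hypothetical naming convention in which write handlers are called `create_*` and take an `idempotency_key` parameter. The other two results are placeholders, not computed.

```python
import ast

# Hypothetical generated code under review.
GENERATED = '''
def create_payment(amount, idempotency_key):
    pass

def create_refund(amount):
    pass
'''


def writes_take_idempotency_key(source: str) -> bool:
    """Toy check: every function named create_* has an idempotency_key parameter.

    Returns True when no such function exists.
    """
    tree = ast.parse(source)
    handlers = [
        node for node in ast.walk(tree)
        if isinstance(node, ast.FunctionDef) and node.name.startswith("create_")
    ]
    return all(
        "idempotency_key" in {arg.arg for arg in h.args.args} for h in handlers
    )


def conformance_index(results: dict[str, bool]) -> float:
    """Percentage of constraints whose check passed."""
    if not results:
        raise ValueError("no constraints were checked")
    return 100.0 * sum(results.values()) / len(results)


results = {
    "SEC-01 OAuth 2.0 authentication": True,          # placeholder, not computed
    "REL-01 idempotent writes": writes_take_idempotency_key(GENERATED),
    "OBS-01 structured logging on external calls": True,  # placeholder
}
cci = conformance_index(results)
print(f"Constraint Conformance Index: {cci:.1f}%")
```

Here `create_refund` lacks the key parameter, so the reliability constraint fails and two of three constraints are satisfied. A real verifier would need far stronger checks than a parameter name, since the presence of a key says nothing about whether it is honored.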

Making drift a number.

That last step is where the core idea of my work lives. You cannot manage what you cannot measure, and right now architectural drift is unmeasured. So I'm proposing to measure it directly.

I call it the Architectural Drift Index (IDA): the measurable distance between two views of the same system. Intent is the behavior the architecture workshop decided and formalized. Reality is the behavior production observability actually records. Concretely, for a payment system: an intended P99 latency of 2.0 seconds that runs at 4.3 seconds; an intended 99.995% availability observed at 98.2%; an intended 100% of endpoints idempotent, measured at 70%.

Each gap is a quantified divergence between what we decided and what we built. Aggregate them into a single normalized index, and architectural drift stops being a feeling that something has gotten worse. It becomes a number you can track sprint over sprint, set thresholds on, and act on before the erosion compounds.
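The aggregation formula is not given in this essay, so the sketch below uses one assumed normalization purely to show the mechanics: each attribute's shortfall is taken relative to its intended value, clipped to the range 0 to 1, and the results are averaged with equal weight. The `Attribute` class, the `higher_is_better` flag, and the equal weighting are all assumptions, and the result is not expected to match any index value reported for the method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    name: str
    intent: float
    reality: float
    higher_is_better: bool

    def drift(self) -> float:
        """Relative shortfall against intent, clipped to [0, 1]."""
        if self.higher_is_better:
            gap = self.intent - self.reality
        else:
            gap = self.reality - self.intent
        return min(1.0, max(0.0, gap / self.intent))


def drift_index(attributes) -> float:
    """Equal-weight mean of per-attribute drift (assumed aggregation)."""
    if not attributes:
        raise ValueError("no attributes to aggregate")
    return sum(a.drift() for a in attributes) / len(attributes)


# Intent and Reality values from the payment-system example above.
ATTRS = [
    Attribute("P99 latency (s)", 2.0, 4.3, higher_is_better=False),
    Attribute("Availability (%)", 99.995, 98.2, higher_is_better=True),
    Attribute("Idempotent endpoints (%)", 100.0, 70.0, higher_is_better=True),
]

ida = drift_index(ATTRS)
for a in ATTRS:
    print(f"{a.name}: {a.drift():.3f}")
print(f"index: {ida:.3f}")
```

This naive normalization also shows why the choice of scale matters. Measured relative to its target, the availability gap of 1.795 percentage points contributes under 0.02, even though for a payment system it is arguably the most serious of the three. A usable index would likely normalize each attribute against its error budget or an agreed tolerance instead.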

The central claim my research sets out to test: squads that generate code under architectural constraints produce significantly lower drift than squads that generate code from free-form prompts. Not "cleaner-feeling" code — measurably lower drift, under controlled conditions.

Why this is urgent now.

The gap I'm describing is widening on its own. As generation gets faster and cheaper, the volume of code that no human meaningfully architected grows, while our capacity to evaluate architectural fitness stays flat. Speed scales. Judgment doesn't — unless we move it upstream, into the generation loop itself.

I don't think the answer is to slow down generation. The answer is to make architecture a first-class input to it, and to make architectural drift as observable as latency or error rate already are.

I'm now running this as a controlled experiment — squads building a realistic payment system, half with architectural constraints in the loop and half without — and I'll be sharing what the numbers say as the research progresses.

If you lead or build software, I'd like to hear from you. Where have you watched architecture quietly erode under the speed of generated code? What would you most want to be able to measure that you currently can't? The most useful version of this work will be shaped by people fighting this in production, not only by the experiment in the lab.

Gláucia (Glau) Santana is an engineering leader in fintech and payments and a doctoral researcher at the Polytechnic School of the University of São Paulo (EPUSP), where her work adapts the Architecture Tradeoff Analysis Method for software generated by large language models.


