Reading Guide
Reliable Agent Systems
This series examines how AI systems fail in production and what it takes to operate them with evidence, not assumptions. It is written for engineering leaders and teams building agent systems at enterprise scale.
The ten core essays follow a deliberate arc: the first five diagnose the problems, the next two build the framework, and the final three make it concrete. Companion articles go deeper on specific topics. You can read them in order or start wherever your team's problems are sharpest.
The diagnosis
These essays establish what goes wrong and why. Each one treats a different failure mode as a systems problem, not a model problem.
- 01
Agent failures are distributed systems failures
Why agent failures look like model problems but behave like distributed systems failures. Partial failures, silent degradation, and the gap between demo reliability and production reliability.
- 02
The eval gap
Why staging success does not predict production reliability. The mismatch between how teams test agents and how agents actually break.
Companion: Building an eval harness that survives production. Declarative specs, loader/runner/scorer separation, and what production eval infrastructure looks like when it is built to last.
- 03
Guardrails are not safety
Why runtime safety layers create an illusion of control. The difference between filtering outputs and engineering safety into the system.
- 04
Who owns the agent's mistake?
Accountability in multi-agent chains. When an agent acts on behalf of a user, a team, and an organization simultaneously, who is responsible for what it does?
- 05
Drift is the default
Models change, tools change, configs change, behavior changes. Drift is not a failure mode — it is the baseline. The question is whether you detect it before your customers do.
Companion: Drift detection patterns for production agents. Concrete patterns for catching model, tool, and behavioral drift before it reaches users.
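As a minimal illustration of the behavioral-drift pattern covered in essay 5's companion (the function name, window sizes, and threshold here are hypothetical, not taken from the series), a two-proportion z-test can flag when a current window of eval pass/fail results has shifted away from a frozen baseline:

```python
import math

def drift_detected(baseline: list[int], current: list[int],
                   z_threshold: float = 2.0) -> bool:
    """Flag drift when the current pass rate differs from the baseline
    pass rate by more than z_threshold standard errors."""
    n1, n2 = len(baseline), len(current)
    p1, p2 = sum(baseline) / n1, sum(current) / n2
    # Pooled proportion and standard error for a two-proportion z-test.
    pooled = (sum(baseline) + sum(current)) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:                      # both windows unanimous
        return p1 != p2
    return abs(p1 - p2) / se > z_threshold

# 95% pass rate at release vs. 80% this week: flagged before customers notice.
baseline = [1] * 95 + [0] * 5
current = [1] * 80 + [0] * 20
print(drift_detected(baseline, current))  # True
```

The same shape works for tool-call distributions or latency buckets; the point is a frozen baseline and an explicit threshold, not this particular statistic.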
The framework
These two essays shift from diagnosing what is broken to defining what a working system looks like. Together they introduce the obligation-control-evaluation-evidence-response loop that the rest of the series builds on.
- 06
What should an AI system actually prove?
The bridge from diagnosis to architecture. Introduces the five-object loop: obligation, control, evaluation, evidence artifact, response. Each object only has meaning in relation to the others.
Companion: From obligation to evidence in 90 minutes. A practical walkthrough of the full loop applied to a single use case, from identifying the obligation to producing the evidence artifact.
- 07
Controls are not guardrails
Controls are engineering constructs with owners, tests, and evidence. Guardrails are runtime filters. The difference determines whether your system is auditable or just decorated.
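The five-object loop from essay 6 can be sketched as plain data structures. The field names below are illustrative assumptions, not the series' schema; the point they capture is that each object references the one before it, so none stands alone.

```python
from dataclasses import dataclass

@dataclass
class Obligation:            # what the system must prove
    id: str
    statement: str

@dataclass
class Control:               # engineering construct with a named owner
    id: str
    obligation_id: str       # a control only has meaning against an obligation
    owner: str

@dataclass
class Evaluation:            # the test that exercises the control
    id: str
    control_id: str
    passed: bool

@dataclass
class EvidenceArtifact:      # durable record an auditor can inspect
    id: str
    evaluation_id: str
    uri: str

@dataclass
class Response:              # what happens when the evaluation fails
    id: str
    evaluation_id: str
    action: str

# A single pass around the loop, with hypothetical identifiers.
ob = Obligation("OB-1", "No PII leaves the agent boundary")
ctl = Control("CT-1", ob.id, owner="platform-team")
ev = Evaluation("EV-1", ctl.id, passed=True)
art = EvidenceArtifact("EA-1", ev.id, uri="s3://evidence/EV-1.json")
resp = Response("RS-1", ev.id, action="page on-call and roll back")
```

Because every object carries a reference to its parent, an auditor can walk from any evidence artifact back to the obligation it discharges.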
The evidence
These essays make the framework tangible. What does an evidence pack look like? How do regulations map to engineering work? What happens when an incident exposes the gap?
- 08
Anatomy of an evidence pack
What a real audit-ready evidence package contains: traces, test runs, approvals, model cards, change logs, and sign-offs. This is what the five-object loop produces when it is working.
Companion: What your agent logged vs. what the auditor needed. The gap between operational logging and audit-grade evidence, and what it costs teams when they discover the difference too late.
- 09
Mapping the EU AI Act to engineering evidence
How Articles 9 through 15, 26 through 27, and 72 through 73 of the EU AI Act translate into controls, evals, and evidence artifacts. With crosswalks to NIST AI RMF and ISO 42001.
Companion: The regulatory mapping table. The reference artifact from essay 9 turned into a usable table engineering teams can hand to legal.
- 10
The incident response gap in AI systems
Why traditional incident response does not work for AI systems. What changes when the system that failed is probabilistic, opaque, and continuously drifting.
Companion: Choosing your eval architecture.
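A sketch of what essay 8's evidence pack could look like as a manifest, using the contents the essay lists (traces, test runs, approvals, model cards, change logs, sign-offs). The keys and file paths are hypothetical; a real pack would follow your own audit schema.

```python
import json

def build_manifest(release: str) -> str:
    """Assemble a JSON manifest pointing at the artifacts an evidence pack bundles."""
    manifest = {
        "release": release,
        "traces": ["traces/run-001.jsonl"],            # raw agent execution traces
        "test_runs": ["evals/regression-suite.json"],  # eval results for this release
        "approvals": ["approvals/risk-review.pdf"],    # documented human approvals
        "model_cards": ["cards/model-v3.md"],
        "change_logs": ["changes/latest.md"],
        "sign_offs": ["signoffs/owner-ack.txt"],       # named-owner sign-offs
    }
    return json.dumps(manifest, indent=2)

print(build_manifest("2024.12.1"))
```

Even this much structure forces the useful question: for each key, does the artifact exist for the current release, and who owns producing it?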
How to read this
If you build AI systems
Start at essay 1 and read forward. The series is cumulative.
If you are in governance, compliance, or legal
Start at essay 6. The first five will make more sense after you see where they lead.
If you want the single most practical essay
Read essay 9. It is the one with the regulatory mapping table.