Reading Guide

Reliable Agent Systems

This series examines how AI systems fail in production and what it takes to operate them with evidence, not assumptions. It is written for engineering leaders and teams building agent systems at enterprise scale.

The ten core essays follow a deliberate arc: the first five diagnose the problems, the next two build the framework, and the final three make it concrete. Companion articles go deeper on specific topics. You can read them in order or start wherever your team's problems are sharpest.

The diagnosis

These essays establish what goes wrong and why. Each one treats a different failure mode as a systems problem, not a model problem.

  1. Essay 01: Agent failures are distributed systems failures

    Why agent failures look like model problems but behave like distributed systems failures. Partial failures, silent degradation, and the gap between demo reliability and production reliability.

  2. Essay 02: The eval gap

    Why staging success does not predict production reliability. The mismatch between how teams test agents and how agents actually break.

    Companion: Building an eval harness that survives production. Declarative specs, loader/runner/scorer separation, and what production eval infrastructure looks like when it is built to last.

  3. Essay 03: Guardrails are not safety

    Why runtime safety layers create an illusion of control. The difference between filtering outputs and engineering safety into the system.

  4. Essay 04: Who owns the agent's mistake?

    Accountability in multi-agent chains. When an agent acts on behalf of a user, a team, and an organization simultaneously, who is responsible for what it does?

  5. Essay 05: Drift is the default

    Models change, tools change, configs change, behavior changes. Drift is not a failure mode — it is the baseline. The question is whether you detect it before your customers do.

    Companion: Drift detection patterns for production agents. Concrete patterns for catching model, tool, and behavioral drift before it reaches users.
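To make the drift companion's premise concrete, here is a minimal sketch of a baseline-comparison check. Every name, the score format, and the tolerance value are illustrative assumptions, not code from the series; a production check would compare full distributions and per-tool behavior, not a single mean.

```python
from statistics import mean

def detect_drift(baseline_scores, current_scores, tolerance=0.05):
    """Flag drift when the mean eval score shifts beyond a tolerance.

    A deliberately simple proxy: the point is that drift is caught by
    comparing against a stored baseline, not by waiting for complaints.
    """
    baseline = mean(baseline_scores)
    current = mean(current_scores)
    delta = current - baseline
    return {
        "baseline_mean": baseline,
        "current_mean": current,
        "delta": delta,
        "drifted": abs(delta) > tolerance,
    }

# A quiet upstream model change shifts behavior between eval runs:
report = detect_drift([0.91, 0.89, 0.93, 0.90], [0.84, 0.82, 0.86, 0.83])
# report["drifted"] is True: the mean dropped by ~0.07, past the tolerance
```

The design choice worth noting is that the baseline is an explicit input: whoever runs this check has to decide what "normal" was, which is exactly the discipline the essay argues most teams skip.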

The framework

These two essays shift from diagnosing what is broken to defining what a working system looks like. Together they introduce the obligation-control-evaluation-evidence-response loop that the rest of the series builds on.

  1. Essay 06: What should an AI system actually prove?

    The bridge from diagnosis to architecture. Introduces the five-object loop: obligation, control, evaluation, evidence artifact, response. Each object only has meaning in relation to the others.

    Companion: From obligation to evidence in 90 minutes. A practical walkthrough of the full loop applied to a single use case, from identifying the obligation to producing the evidence artifact.

  2. Essay 07: Controls are not guardrails

    Controls are engineering constructs with owners, tests, and evidence. Guardrails are runtime filters. The difference determines whether your system is auditable or just decorated.
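The five-object loop can be sketched as linked records, which makes the claim that "each object only has meaning in relation to the others" mechanical: every object carries a reference to the one before it. The field names and ID scheme here are illustrative assumptions, not the series' schema.

```python
from dataclasses import dataclass

@dataclass
class Obligation:
    id: str
    source: str       # e.g. a regulation, contract, or internal policy
    statement: str

@dataclass
class Control:
    id: str
    obligation_id: str  # a control exists only in relation to an obligation
    owner: str

@dataclass
class Evaluation:
    id: str
    control_id: str     # an evaluation tests a specific control
    passed: bool

@dataclass
class EvidenceArtifact:
    id: str
    evaluation_id: str  # evidence records what an evaluation showed
    uri: str

@dataclass
class Response:
    id: str
    evidence_id: str    # a response acts on evidence, closing the loop
    action: str

def loop_is_closed(obligation, control, evaluation, evidence, response):
    """True when each object references the previous one in the chain."""
    return (
        control.obligation_id == obligation.id
        and evaluation.control_id == control.id
        and evidence.evaluation_id == evaluation.id
        and response.evidence_id == evidence.id
    )
```

An orphaned control (one whose `obligation_id` matches nothing) fails the check, which is the structural version of the series' point: a control that proves nothing about an obligation is just decoration.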

The evidence

These essays make the framework tangible. What does an evidence pack look like? How do regulations map to engineering work? What happens when an incident exposes the gap?

  1. Essay 08: Anatomy of an evidence pack

    What a real audit-ready evidence package contains: traces, test runs, approvals, model cards, change logs, and sign-offs. This is what the five-object loop produces when it is working.

    Companion: What your agent logged vs. what the auditor needed. The gap between operational logging and audit-grade evidence, and what it costs teams when they discover the difference too late.

  2. Essay 09: Mapping the EU AI Act to engineering evidence

    How Articles 9 through 15, 26 through 27, and 72 through 73 of the EU AI Act translate into controls, evals, and evidence artifacts. With crosswalks to NIST AI RMF and ISO 42001.

    Companion: The regulatory mapping table. The reference artifact from Essay 9 turned into a usable table engineering teams can hand to legal.

  3. Essay 10: The incident response gap in AI systems

    Why traditional incident response does not work for AI systems. What changes when the system that failed is probabilistic, opaque, and continuously drifting.
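The artifact categories listed in the Essay 8 summary (traces, test runs, approvals, model cards, change logs, sign-offs) suggest a simple completeness check over a pack manifest. The manifest format and filenames below are assumptions for illustration, not the series' actual schema.

```python
# Section names taken from the Essay 8 summary; everything else is illustrative.
REQUIRED_SECTIONS = [
    "traces", "test_runs", "approvals",
    "model_cards", "change_logs", "sign_offs",
]

def missing_sections(pack: dict) -> list[str]:
    """Return the evidence-pack sections that are absent or empty."""
    return [s for s in REQUIRED_SECTIONS if not pack.get(s)]

pack = {
    "traces": ["trace-run-418.jsonl"],     # hypothetical artifact names
    "test_runs": ["eval-run-418"],
    "approvals": [],                       # approval was never recorded
    "model_cards": ["model-card.md"],
    "change_logs": ["CHANGELOG.md"],
}
# missing_sections(pack) -> ["approvals", "sign_offs"]
```

The check is trivial by design: the hard part, which the companion essay describes, is that an empty `approvals` list usually goes unnoticed until an auditor asks for it.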

How to read this

If you build AI systems

Start at essay 1 and read forward. The series is cumulative.

If you are in governance, compliance, or legal

Start at essay 6. The first five will make more sense after you see where they lead.

If you want the single most practical essay

Read essay 9. It is the one with the regulatory mapping table.