Reading Guide

Reliable Agent Systems

This series examines how AI systems fail in production and what it takes to operate them with evidence, not assumptions. It is written for engineering leaders and teams building agent systems at enterprise scale.

The ten core essays follow a deliberate arc: the first five diagnose the problems, the next two build the framework, and the final three make it concrete. Companion articles go deeper on specific topics. You can read them in order or start wherever your team's problems are sharpest.

The diagnosis

These essays establish what goes wrong and why. Each one treats a different failure mode as a systems problem, not a model problem.

  1. Essay 01: Agent failures are distributed systems failures

    Why agent failures look like model problems but behave like distributed systems failures. Partial failures, silent degradation, and the gap between demo reliability and production reliability.

  2. Essay 02: The eval gap

    Why staging success does not predict production reliability. The mismatch between how teams test agents and how agents actually break.

    Companion: Building an eval harness that survives production. Declarative specs, loader/runner/scorer separation, and what production eval infrastructure looks like when it is built to last.

  3. Essay 03: Guardrails are not safety

    Why runtime safety layers create an illusion of control. The difference between filtering outputs and engineering safety into the system.

  4. Essay 04: Who owns the agent's mistake?

    Accountability in multi-agent chains. When an agent acts on behalf of a user, a team, and an organization simultaneously, who is responsible for what it does?

  5. Essay 05: Drift is the default

    Models change, tools change, configs change, behavior changes. Drift is not a failure mode — it is the baseline. The question is whether you detect it before your customers do.

    Companion: Drift detection patterns for production agents. Concrete patterns for catching model, tool, and behavioral drift before it reaches users.
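To make the drift companion's premise concrete, here is a minimal sketch of a baseline-comparison check. Every name, the score format, and the tolerance value are illustrative assumptions, not code from the series; a production check would compare full distributions and per-tool behavior, not a single mean.

```python
from statistics import mean

def detect_drift(baseline_scores, current_scores, tolerance=0.05):
    """Flag drift when the mean eval score shifts beyond a tolerance.

    A deliberately simple proxy: the point is that drift is caught by
    comparing against a stored baseline, not by waiting for complaints.
    """
    baseline = mean(baseline_scores)
    current = mean(current_scores)
    delta = current - baseline
    return {
        "baseline_mean": baseline,
        "current_mean": current,
        "delta": delta,
        "drifted": abs(delta) > tolerance,
    }

# A quiet upstream model change shifts behavior between eval runs:
report = detect_drift([0.91, 0.89, 0.93, 0.90], [0.84, 0.82, 0.86, 0.83])
# report["drifted"] is True: the mean dropped by ~0.07, past the tolerance
```

The design choice worth noting is that the baseline is an explicit input: whoever runs this check has to decide what "normal" was, which is exactly the discipline the essay argues most teams skip.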

The framework

These two essays shift from diagnosing what is broken to defining what a working system looks like. Together they introduce the obligation-control-evaluation-evidence-response loop that the rest of the series builds on.

  1. Essay 06: What should an AI system actually prove?

    The bridge from diagnosis to architecture. Introduces the five-object loop: obligation, control, evaluation, evidence artifact, response. Each object only has meaning in relation to the others.

    Companion: From obligation to evidence in 90 minutes. A practical walkthrough of the full loop applied to a single use case, from identifying the obligation to producing the evidence artifact.

  2. Essay 07: Controls are not guardrails

    Controls are engineering constructs with owners, tests, and evidence. Guardrails are runtime filters. The difference determines whether your system is auditable or just decorated.
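The five-object loop can be sketched as linked records, which makes the claim that "each object only has meaning in relation to the others" mechanical: every object carries a reference to the one before it. The field names and ID scheme here are illustrative assumptions, not the series' schema.

```python
from dataclasses import dataclass

@dataclass
class Obligation:
    id: str
    source: str       # e.g. a regulation, contract, or internal policy
    statement: str

@dataclass
class Control:
    id: str
    obligation_id: str  # a control exists only in relation to an obligation
    owner: str

@dataclass
class Evaluation:
    id: str
    control_id: str     # an evaluation tests a specific control
    passed: bool

@dataclass
class EvidenceArtifact:
    id: str
    evaluation_id: str  # evidence records what an evaluation showed
    uri: str

@dataclass
class Response:
    id: str
    evidence_id: str    # a response acts on evidence, closing the loop
    action: str

def loop_is_closed(obligation, control, evaluation, evidence, response):
    """True when each object references the previous one in the chain."""
    return (
        control.obligation_id == obligation.id
        and evaluation.control_id == control.id
        and evidence.evaluation_id == evaluation.id
        and response.evidence_id == evidence.id
    )
```

An orphaned control (one whose `obligation_id` matches nothing) fails the check, which is the structural version of the series' point: a control that proves nothing about an obligation is just decoration.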

The evidence

These essays make the framework tangible. What does an evidence pack look like? How do regulations map to engineering work? What happens when an incident exposes the gap?

  1. Essay 08: Anatomy of an evidence pack

    What a real audit-ready evidence package contains: traces, test runs, approvals, model cards, change logs, and sign-offs. This is what the five-object loop produces when it is working.

    Companion: What your agent logged vs. what the auditor needed. The gap between operational logging and audit-grade evidence, and what it costs teams when they discover the difference too late.

  2. Essay 09: Mapping the EU AI Act to engineering evidence

    How Articles 9 through 15, 26 through 27, and 72 through 73 of the EU AI Act translate into controls, evals, and evidence artifacts. With crosswalks to NIST AI RMF and ISO 42001.

    Companion: The regulatory mapping table. The reference artifact from Essay 9 turned into a usable table engineering teams can hand to legal.

  3. Essay 10: The incident response gap in AI systems

    Why traditional incident response does not work for AI systems. What changes when the system that failed is probabilistic, opaque, and continuously drifting.
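The artifact categories listed in the Essay 8 summary (traces, test runs, approvals, model cards, change logs, sign-offs) suggest a simple completeness check over a pack manifest. The manifest format and filenames below are assumptions for illustration, not the series' actual schema.

```python
# Section names taken from the Essay 8 summary; everything else is illustrative.
REQUIRED_SECTIONS = [
    "traces", "test_runs", "approvals",
    "model_cards", "change_logs", "sign_offs",
]

def missing_sections(pack: dict) -> list[str]:
    """Return the evidence-pack sections that are absent or empty."""
    return [s for s in REQUIRED_SECTIONS if not pack.get(s)]

pack = {
    "traces": ["trace-run-418.jsonl"],     # hypothetical artifact names
    "test_runs": ["eval-run-418"],
    "approvals": [],                       # approval was never recorded
    "model_cards": ["model-card.md"],
    "change_logs": ["CHANGELOG.md"],
}
# missing_sections(pack) -> ["approvals", "sign_offs"]
```

The check is trivial by design: the hard part, which the companion essay describes, is that an empty `approvals` list usually goes unnoticed until an auditor asks for it.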

How to read this

If you build AI systems

Start at essay 1 and read forward. The series is cumulative.

If you are in governance, compliance, or legal

Start at essay 6. The first five will make more sense after you see where they lead.

If you want the single most practical essay

Read essay 9. It is the one with the regulatory mapping table.