agents
Evidence Is What Someone Can Verify Later
An eval result becomes evidence only when the party that needs to verify a claim can retrieve the result, interpret it, and check it against that claim. Producing a result is not enough. Retaining it is not enough.
The Eval Pack Belongs to the Class
An eval pack is the reusable unit of evaluation for an agent class. It contains scenarios, scorers, thresholds, expected evidence, and obligation mappings — the content that turns a test suite into traceable evidence.
Drift Is When the Queue Does Not Know
Change routing works only for visible change. Drift is the movement that invalidates evidence without producing a change record.
Baseline Inheritance Is How Agent Evaluation Scales
An instance inherits its class baseline only while it stays inside the boundary the baseline was proven against. The hard part of fleet evaluation is detecting the moment that boundary has been crossed.
Evaluation Should Follow Change
Calendar-driven eval cadence wastes capacity on stable systems and misses risk on changing ones. Change-routed evaluation matches eval work to what actually changed.
The Next Eval Is the One with the Most Evidence at Risk
The next eval should be the one where delay puts the most load-bearing evidence at risk.
The Agent Is Not the Unit. The Agent Class Is.
Per-agent evaluation fails at fleet scale because the unit of review is wrong. The reviewable unit is the agent class: a shared pattern of purpose, tools, data access, autonomy, and risk surface.
Why Per-Agent Evaluation Breaks at Fleet Scale
Most evaluation systems assume a single agent. At fleet scale, the question shifts from whether one agent passed to where limited evaluation capacity should be spent now.
Choosing Your Eval Architecture
The question is not which eval tool. The question is what kind of eval infrastructure your system actually needs. Three architectures, three failure modes, and how they compose into an evidence pipeline.
Building an Eval Harness That Survives Production
Most eval harnesses die the same way. Five structural decisions separate the ones that survive production from the ones that quietly rot.
The Incident Response Gap in AI Systems
You built the controls. You still cannot contain the failure. Most organizations have started building AI controls. Far fewer have built AI incident response.
Mapping the EU AI Act to Engineering Evidence
The regulation tells you what to prove. It does not tell you how to build the proof. This essay maps every major obligation from the EU AI Act to a specific control, eval, and evidence artifact.
Anatomy of an Evidence Pack
Your system passed the eval. Can you prove it? An evidence pack is a structured, continuously generated collection of artifacts — traces, eval results, approvals, config snapshots, and incident records — that proves your AI system did what you said it would do.
Controls Are Not Guardrails
A guardrail catches the output. A control proves the system works. The difference is the evidence layer — obligation, mechanism, eval, evidence, owner.
Who Owns the Agent's Mistake?
The legal answer is converging fast. Courts are rejecting the 'AI did it' defense. The question is whether your organization has the infrastructure to assign accountability when an agent fails.
Guardrails Are Not Safety
Boundary guardrails are the AI equivalent of locking the front door while leaving the windows open. Real safety requires observability, containment, least privilege, and structured human review.
Agent Failures Are Distributed Systems Failures
You already have the mental models for agent reliability. Retries, circuit breakers, observability: the vocabulary changes, the physics doesn't.