reliability

Evidence Is What Someone Can Verify Later

An eval result becomes evidence only when the party relying on a claim can retrieve the result, interpret it, and verify it. Production alone is not enough. Retention alone is not enough.

May 4, 2026

The Eval Pack Belongs to the Class

An eval pack is the reusable unit of evaluation for an agent class. It contains scenarios, scorers, thresholds, expected evidence, and obligation mappings — the content that turns a test suite into traceable evidence.

May 4, 2026

Drift Is When the Queue Does Not Know

Change routing works only for visible change. Drift is the movement that invalidates evidence without producing a change record.

May 4, 2026

Baseline Inheritance Is How Agent Evaluation Scales

An instance inherits its class baseline only while it stays inside the boundary the baseline was proven against. The hard part of fleet evaluation is detecting the moment that boundary has been crossed.

May 4, 2026

The Next Eval Is the One with the Most Evidence at Risk

The next eval should be the one where delay puts the most load-bearing evidence at risk.

May 4, 2026

The Agent Is Not the Unit. The Agent Class Is.

Per-agent evaluation fails at fleet scale because the unit of review is wrong. The reviewable unit is the agent class: a shared pattern of purpose, tools, data access, autonomy, and risk surface.

May 3, 2026

Why Per-Agent Evaluation Breaks at Fleet Scale

Most evaluation systems assume a single agent. At fleet scale, the question shifts from whether one agent passed to where limited evaluation capacity should be spent now.

May 1, 2026

The Evidence Plane for AI Systems

The missing layer between what your system must prove and how your organization proves it. A framework synthesis connecting obligations, controls, evaluations, evidence artifacts, and the response loop.

Apr 5, 2026

Drift Detection Patterns for Production Agents

Your agent is still answering. That does not mean it is still behaving the same way. Five drift classes, three detection layers, and the patterns that catch regression before your customers do.

Apr 5, 2026

The Incident Response Gap in AI Systems

You built the controls. You still cannot contain the failure. Most organizations have started building AI controls. Far fewer have built AI incident response.

Apr 4, 2026

Drift Is the Default

Your agent worked yesterday. That is not a promise about today. Model updates, prompt changes, and shifting inputs cause silent behavioral regression that traditional monitoring doesn't catch.

Apr 4, 2026

The Eval Gap: Why Your Agent Works in Staging and Breaks in Production

Your benchmarks are passing. Your agent is failing. Most evals measure isolated performance under controlled conditions while production failure comes from distribution shift, tool-chain errors, and changing reality.

Apr 4, 2026

Agent Failures Are Distributed Systems Failures

You already have the mental models for agent reliability. Retries, circuit breakers, observability — the vocabulary changes, the physics don't.

Apr 3, 2026
