Essays
Essays on AI agents, evaluation, safety, privacy, compliance, and governance through the lens of distributed systems and production engineering.
Series: Reliable Agent Systems
A connected set of essays on failure, evaluation, controls, and proof.
Start here
New to LatentMesh? Start with the core essays on how agent systems fail, how to evaluate them in the real world, and how governance becomes operational.
Agent Failures Are Distributed Systems Failures
Why agent failures are usually system failures, not just model failures.
The Eval Gap
Why staging success does not predict production reliability.
Controls Are Not Guardrails
Why runtime safety layers are not enough without evidence, ownership, and testing.
All essays
Recent writing on reliable agent systems, evidence, safety, and governance.
The Evidence Plane for AI Systems
The missing layer between what your system must prove and how your organization proves it. A framework synthesis connecting obligations, controls, evaluations, evidence artifacts, and the response loop.
Choosing Your Eval Architecture
The question is not which eval tool. The question is what kind of eval infrastructure your system actually needs. Three architectures, three failure modes, and how they compose into an evidence pipeline.
The Regulatory Mapping Table
An interactive reference that turns EU AI Act high-risk obligations into operating controls, verification methods, evidence artifacts, owners, and review cadence. Filter by role, article, cluster, or cadence to see which obligations fall under your own operating responsibilities.
What Your Agent Logged vs. What the Auditor Needed
The trace says what happened. The auditor asks why, under what authority, and what changed. Most agent deployments log enough to debug a success but not enough to investigate a failure.
Drift Detection Patterns for Production Agents
Your agent is still answering. That does not mean it is still behaving the same way. Five drift classes, three detection layers, and the patterns that catch regression before your customers do.
From Obligation to Evidence in 90 Minutes
Pick one requirement. Map it to a control. Write the eval. Generate the artifact. Assign the owner. A hands-on walkthrough of the full compliance loop using EU AI Act Article 14.
Building an Eval Harness That Survives Production
Most eval harnesses die the same way. Five structural decisions separate the ones that survive production from the ones that quietly rot.
The Incident Response Gap in AI Systems
You built the controls. You still cannot contain the failure. Most organizations have started building AI controls. Far fewer have built AI incident response.
Mapping the EU AI Act to Engineering Evidence
The regulation tells you what to prove. It does not tell you how to build the proof. This essay maps every major obligation from the EU AI Act to a specific control, eval, and evidence artifact.
Anatomy of an Evidence Pack
Your system passed the eval. Can you prove it? An evidence pack is a structured, continuously generated collection of artifacts — traces, eval results, approvals, config snapshots, and incident records — that proves your AI system did what you said it would do.
Controls Are Not Guardrails
A guardrail catches the output. A control proves the system works. The difference is the evidence layer — obligation, mechanism, eval, evidence, owner.
What Should an AI System Actually Prove?
You diagnosed the problem five different ways. Now build the answer. The proof loop: obligation, control, evaluation, evidence, response.
Drift Is the Default
Your agent worked yesterday. That is not a promise about today. Model updates, prompt changes, and shifting inputs cause silent behavioral regression that traditional monitoring doesn't catch.
Who Owns the Agent's Mistake?
The legal answer is converging fast. Courts are rejecting the 'AI did it' defense. The question is whether your organization has the infrastructure to assign accountability when an agent fails.
Guardrails Are Not Safety
Boundary guardrails are the AI equivalent of locking the front door while leaving the windows open. Real safety requires observability, containment, least privilege, and structured human review.
The Eval Gap: Why Your Agent Works in Staging and Breaks in Production
Your benchmarks are passing. Your agent is failing. Most evals measure isolated performance under controlled conditions while production failure comes from distribution shift, tool-chain errors, and changing reality.
Agent Failures Are Distributed Systems Failures
You already have the mental models for agent reliability. Retries, circuit breakers, observability — the vocabulary changes, the physics don't.