New series

Evaluating Agent Fleets

The scarce resource is not inventory. It is eval capacity.

Reliable Agent Systems asks how a single agent should be evaluated. This series asks a different question: when you have dozens or hundreds of agents, classes, workloads, and changes, which ones should be evaluated now?

Most teams evaluate agent by agent. Every agent gets its own eval suite, its own schedule, its own thresholds. That works until the fleet needs more evaluation than the eval budget can cover. Then evaluation becomes an allocation problem, and the tools and mental models built for single-agent reliability stop scaling.

This series introduces the structures needed to make that allocation decision well: work units, baseline inheritance, trigger-routed re-evaluation, sampling, eval packs, and evidence.

Essays

  1. Why per-agent evaluation breaks at fleet scale

    Why the per-agent evaluation model stops working when you have more agents than eval capacity, and what the fleet-scale alternative looks like.

  2. The agent is not the unit. The agent class is.

    The reviewable unit is not each individual agent. It is the agent class: a shared pattern of purpose, tools, data access, autonomy, and risk surface. Inheritance, blast radius, and governed reuse.

  3. Baseline inheritance is how agent evaluation scales

    An instance inherits its class baseline only while it stays inside the boundary the baseline was proven against. The admissibility test, the inheritance contract, and detecting boundary crossings.

  4. Evaluation should follow change

    Calendar-driven eval cadence wastes capacity on stable systems and misses risk on changing ones. Change-routed evaluation matches eval work to what actually changed.

  5. The next eval is the one with the most evidence at risk

    Routing creates a queue. Prioritization sorts it. The next eval should be the one where delay puts the most load-bearing evidence at risk.

  6. Drift is when the queue does not know

    Change routing works only for visible change. Drift is the movement that invalidates evidence without producing a change record. Detection, observed records, and routing drift back into evaluation.

  7. The eval pack belongs to the class

    An eval pack is the reusable evaluation bundle for an agent class: scenarios, scorers, thresholds, expected evidence, and obligation mappings. The content that turns a test suite into traceable evidence.

  8. Evidence is what someone can verify later

    An eval result becomes evidence only when the party that needs to verify a claim can retrieve it, interpret it, and verify it. The three tests, evidence access, and the close of the series.

How this connects

If you are new to LatentMesh

Start with Reliable Agent Systems. That series covers how individual agents fail and how to build evidence that they work. This series builds on those foundations.

If you already manage multiple agents

Start here. This series assumes you know what evals are and why they matter. The question is how to allocate them.