How to Prove AI Systems Work

A ten-part series on building AI systems that are reliable enough to govern.

Most organizations shipping AI agents today cannot answer a simple question: how do you know this system is working correctly?

Not "did it pass a benchmark." Not "does it have a guardrail." Can you show, with evidence, that the system does what it is supposed to do, that you tested the things that matter, and that you will know when something changes?

That question sits at the intersection of engineering, evaluation, governance, and compliance. This series works through it from both directions. The first five essays diagnose why most teams cannot answer it today. The next five build toward what the answer actually looks like.

The diagnosis

  1. Agent failures are distributed systems failures

    AI agents fail the same way distributed systems fail: cascading errors, partial availability, Byzantine faults, missing observability. The mental models that make distributed systems reliable apply directly. Most teams have not made the connection yet.

  2. The eval gap

    Your agent works in staging. It breaks in production. The gap is not a bug. It is a structural mismatch between how teams evaluate AI systems and how those systems actually behave under real traffic, real user inputs, and real tool-call chains.

  3. Guardrails are not safety

    Input/output filters are the AI equivalent of perimeter security. Necessary, but nowhere near sufficient. Real safety requires observability into the reasoning chain, containment when confidence drops, least-privilege tool access, and structural human review. Most teams are in their "we have a firewall" era.

  4. Who owns the agent's mistake?

    When an agent chain makes a bad decision across three tool calls and two LLM hops, who is accountable? Courts are already answering this question. Air Canada's chatbot ruling, the Workday hiring discrimination case, and California's proposed liability laws all point in the same direction: the deployer owns it.

  5. Drift is the default

    The model you shipped last month is not the model running today. Provider updates, config changes, shifting user behavior, and tool API drift mean agent systems change constantly. If you are not testing for behavioral regression continuously, you are hoping. Hope is not an engineering strategy.
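The continuous behavioral regression testing that essay 5 argues for can be sketched in a few lines. This is a hypothetical illustration: `call_model` stands in for the deployed agent, and the baseline prompts and answers are invented.

```python
# Hypothetical sketch of a behavioral regression check.
# BASELINE and call_model are illustrative stand-ins, not a real API.

BASELINE = {
    "What is our refund window?": "30 days",
    "Do you ship internationally?": "yes",
}

def call_model(prompt: str) -> str:
    # Stand-in for the live agent; in practice this calls the deployed system.
    canned = {
        "What is our refund window?": "30 days",
        "Do you ship internationally?": "no",  # a silently drifted answer
    }
    return canned[prompt]

def regression_report(baseline: dict[str, str]) -> list[str]:
    """Return the prompts whose live answers no longer match the baseline."""
    drifted = []
    for prompt, expected in baseline.items():
        actual = call_model(prompt)
        if actual.strip().lower() != expected.strip().lower():
            drifted.append(prompt)
    return drifted

if __name__ == "__main__":
    drifted = regression_report(BASELINE)
    print(f"{len(drifted)} of {len(BASELINE)} baseline checks drifted: {drifted}")
```

Run on a schedule against stored baselines, a check like this turns "hoping" into a signal: the day a provider update changes an answer, the report says so.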

The answer

  6. What should an AI system actually prove?

    The bridge. Five essays of diagnosis lead to one structural question: what would it take to actually demonstrate that an AI system is reliable, fair, transparent, and safe? Not as a policy statement. As a system that runs in production and generates evidence. This essay introduces the obligation-control-evaluation-evidence-response loop.

  7. Controls are not guardrails

    A guardrail catches bad output. A control proves the system works. This essay draws the line between runtime filters and auditable controls, and shows what a control looks like when it is mapped to an obligation, tested by an eval, and backed by stored evidence. The Salesforce Agentforce breach is the opening case study.

  8. Anatomy of an evidence pack

    What does audit-ready proof actually look like? Traces, test runs, approval records, model cards, change logs, sign-offs. This essay makes the abstract framework from essays 6 and 7 tangible by walking through the artifacts a team would need to produce when someone with authority asks to see their work.

  9. Mapping the EU AI Act to engineering evidence

    The EU AI Act's high-risk obligations take effect in August 2026. This essay maps each major requirement (Articles 9 through 15, plus deployer obligations under Articles 26 and 27) to a specific control, evaluation, and evidence artifact. Shorter mappings cover NIST AI RMF and ISO 42001. This is the essay that gets forwarded between engineering leaders and legal teams.

  10. The incident response gap in AI systems

    What happens when an AI system fails and nobody has a process for capturing what went wrong? Most incident response playbooks were written for infrastructure outages, not for a model that silently started giving worse answers. This essay covers what AI-specific incident response looks like when the failure is behavioral, not mechanical.
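The shape of a control as essay 7 describes it, mapped to an obligation, tested by an eval, and backed by stored evidence, can be sketched as a record. This is a hypothetical sketch; the field names, IDs, and paths are illustrative, not a real schema.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a control mapped to an obligation.
# All names and values below are illustrative assumptions.

@dataclass
class Control:
    control_id: str
    obligation: str          # the requirement this control exists to satisfy
    description: str
    eval_suite: str          # the test suite that exercises the control
    evidence_paths: list[str] = field(default_factory=list)  # stored proof

    def is_auditable(self) -> bool:
        # A runtime filter becomes a control only when it is mapped to an
        # obligation, exercised by an eval, and backed by retrievable evidence.
        return bool(self.obligation and self.eval_suite and self.evidence_paths)

pii_control = Control(
    control_id="CTL-007",
    obligation="Do not expose customer PII in agent responses",
    description="Output scanning plus least-privilege access to the CRM tool",
    eval_suite="evals/pii_leakage.yaml",
    evidence_paths=["s3://evidence/ctl-007/latest-run.json"],
)
assert pii_control.is_auditable()
```

The design point is the three links, not the filter itself: drop any one of them and the same code is just a guardrail again.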
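The artifacts essay 8 walks through could be collected in a manifest like the one below. A hypothetical sketch: the categories mirror the essay's list, and the system name and paths are invented.

```python
import json

# Hypothetical evidence-pack manifest; artifact categories follow the
# list in essay 8, everything else is an illustrative placeholder.
evidence_pack = {
    "system": "support-agent-v3",
    "generated": "2025-06-01",
    "artifacts": {
        "traces": ["traces/2025-05/*.jsonl"],
        "test_runs": ["evals/runs/2025-05-28.json"],
        "approval_records": ["approvals/deploy-412.pdf"],
        "model_cards": ["cards/support-agent-v3.md"],
        "change_logs": ["changelog/2025-05.md"],
        "sign_offs": ["signoffs/risk-review-q2.pdf"],
    },
}
print(json.dumps(evidence_pack, indent=2))
```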
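The kind of mapping essay 9 builds can be sketched as a small table in code. The article subjects below are from the EU AI Act; the control, eval, and evidence names are hypothetical examples, not an authoritative or complete mapping.

```python
# Illustrative requirement-to-artifact mapping in the style essay 9 develops.
# Article subjects are real; each (control, eval, evidence) triple is a
# made-up example of what a team might map to the requirement.
AI_ACT_MAP = {
    "Art. 9 (risk management)": (
        "risk-register control", "evals/risk_scenarios", "risk-review sign-offs"),
    "Art. 12 (record-keeping)": (
        "trace-retention control", "evals/log_completeness", "stored agent traces"),
    "Art. 14 (human oversight)": (
        "escalation control", "evals/handoff_triggers", "approval records"),
}

for article, (control, evaluation, evidence) in AI_ACT_MAP.items():
    print(f"{article}: {control} -> {evaluation} -> {evidence}")
```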

How to read this

If you build AI systems

Start at essay 1 and read forward. The series is cumulative.

If you are in governance, compliance, or legal

Start at essay 6. The first five will make more sense after you see where they lead.

If you want the single most practical essay

Read essay 9. It is the one with the regulatory mapping table.

If you want the thesis in one sentence: AI systems that cannot produce evidence of their own reliability are not ready for production, and the organizations that figure out how to generate that evidence systematically will be the ones that ship with confidence while everyone else is still hoping for the best.