
What an AI Evidence Pack Should Contain

An evidence pack is a structured, continuously generated collection of artifacts that proves your AI system meets its obligations across time. This guide explains what belongs in one, who owns each piece, and what separates documentation from evidence.

Why evidence packs exist

An evaluation proves a control worked at a point in time. An evidence pack proves the system is trustworthy across time — under real conditions, with proper oversight, through incidents and changes.

Without evidence packs, compliance is a collection of assertions. With them, compliance is demonstrable.

A trace explains what happened in one run. An evidence pack explains why the system should be trusted across time.

Minimum evidence pack contents

A complete evidence pack covers nine areas. Each area maps to specific compliance obligations and produces concrete artifacts that an auditor can inspect.

1. System scope and purpose

What the system does, who it serves, what data it processes, and the boundaries of its intended use. This is the foundation — everything else in the pack is evidence that the system operates within this scope.

2. Obligation and requirement mapping

Every applicable obligation identified and linked to a specific control. If an obligation has no control, that gap is visible. If a control has no obligation, it is overhead.
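The two gap checks described above can be sketched as a pair of set comparisons. This is a minimal illustration, not a prescribed format: the function name `find_mapping_gaps` and the `OB-` and `CTRL-` identifiers are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def find_mapping_gaps(
    obligations: Set[str],
    control_to_obligations: Dict[str, List[str]],
) -> Tuple[Set[str], Set[str]]:
    """Return (obligations with no control, controls with no obligation)."""
    # Every obligation that at least one control claims to serve.
    covered = {ob for obs in control_to_obligations.values() for ob in obs}
    uncovered_obligations = obligations - covered
    # A control is overhead if none of its links point at a known obligation.
    orphan_controls = {
        control
        for control, obs in control_to_obligations.items()
        if not any(ob in obligations for ob in obs)
    }
    return uncovered_obligations, orphan_controls


# Hypothetical identifiers, for illustration only.
obligations = {"OB-1", "OB-2", "OB-3"}
controls = {
    "CTRL-pii-filter": ["OB-1"],
    "CTRL-human-review": ["OB-1", "OB-2"],
    "CTRL-legacy-rate-limit": [],
}
gaps, overhead = find_mapping_gaps(obligations, controls)
print("Obligations without a control:", sorted(gaps))
print("Controls without an obligation:", sorted(overhead))
```

Here `OB-3` surfaces as a visible gap and `CTRL-legacy-rate-limit` as overhead.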

3. Control inventory

The complete list of controls: what each one does, which obligation it serves, how it is evaluated, what evidence it produces, and who owns it.

4. Evaluation results

Outputs of structured tests that verify controls are working. Not benchmarks — production evaluations that reflect the conditions the system actually operates in.

5. Trace and logging evidence

Automatic records that allow any interaction to be reconstructed: inputs, model reasoning, tool calls, oversight decisions, and outputs. Obligation-mapped, not just infrastructure logs.
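One way to test whether a trace is reconstructable and obligation-mapped is a small completeness check. The sketch below makes several assumptions: the event kind labels are invented from the list above, every trace is assumed to need at least an input and an output, and every event is assumed to carry obligation identifiers. `TraceEvent` and `trace_problems` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List

# Event kinds taken from the list above; the string labels are assumptions.
EVENT_KINDS = {"input", "model_reasoning", "tool_call", "oversight_decision", "output"}
# Assumption: tool calls and oversight decisions are optional per interaction.
ALWAYS_REQUIRED = {"input", "output"}


@dataclass
class TraceEvent:
    kind: str
    payload: str
    obligation_ids: List[str] = field(default_factory=list)


def trace_problems(events: List[TraceEvent]) -> List[str]:
    """List the reasons a trace could not serve as evidence."""
    problems = []
    kinds_seen = {e.kind for e in events}
    for missing in sorted(ALWAYS_REQUIRED - kinds_seen):
        problems.append(f"missing {missing} event")
    for i, e in enumerate(events):
        if e.kind not in EVENT_KINDS:
            problems.append(f"event {i}: unknown kind {e.kind!r}")
        if not e.obligation_ids:
            problems.append(f"event {i}: not mapped to any obligation")
    return problems


# Hypothetical trace: no output was captured and the tool call is unmapped.
trace = [
    TraceEvent("input", "user question", ["OB-1"]),
    TraceEvent("tool_call", "search(query)", []),
]
problems = trace_problems(trace)
for p in problems:
    print(p)
```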

6. Approvals and sign-offs

Records showing that consequential actions received human review. Each record should include who approved, when, the rationale, and what was approved.
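The four fields named above map naturally onto a small record type. The following is a sketch under assumptions: the class name, field names, and the rule that timestamps must carry a timezone are illustrative choices, not requirements stated in this guide.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class ApprovalRecord:
    approver: str          # who approved
    approved_at: datetime  # when
    rationale: str         # why the action was approved
    subject: str           # what was approved


def missing_fields(record: ApprovalRecord) -> List[str]:
    """Return the names of fields that are empty or unusable as evidence."""
    missing = []
    if not record.approver.strip():
        missing.append("approver")
    # Assumption: a naive timestamp is ambiguous, so it is treated as missing.
    if record.approved_at.tzinfo is None:
        missing.append("approved_at (needs a timezone)")
    if not record.rationale.strip():
        missing.append("rationale")
    if not record.subject.strip():
        missing.append("subject")
    return missing


# Hypothetical record: approved, but with no rationale recorded.
record = ApprovalRecord(
    approver="j.doe",
    approved_at=datetime(2024, 5, 1, 9, 30, tzinfo=timezone.utc),
    rationale="",
    subject="tool call: issue refund",
)
print(missing_fields(record))
```

A check like this catches unrecorded rationales. It cannot tell whether a recorded rationale reflects a genuine review.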

7. Model and system version history

A change log covering model updates, prompt changes, configuration changes, and deployment history. Attributable and timestamped so any version can be reconstructed.
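If the change log is kept as timestamped, attributable records, answering "which version was running at this moment?" becomes a sorted lookup. This is a minimal sketch: `ChangeRecord`, `version_at`, and all version labels are hypothetical, and the history is assumed to be sorted by deployment time.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class ChangeRecord:
    deployed_at: datetime
    author: str
    model_version: str
    prompt_version: str
    config_version: str


def version_at(history: List[ChangeRecord], when: datetime) -> Optional[ChangeRecord]:
    """Return the record in effect at `when`; `history` must be sorted by deployed_at."""
    times = [r.deployed_at for r in history]
    idx = bisect_right(times, when)
    # idx == 0 means `when` is earlier than the first recorded deployment.
    return history[idx - 1] if idx else None


def utc(year: int, month: int, day: int) -> datetime:
    return datetime(year, month, day, tzinfo=timezone.utc)


# Hypothetical history, for illustration only.
history = [
    ChangeRecord(utc(2024, 1, 10), "alice", "model-a", "prompt-1", "cfg-1"),
    ChangeRecord(utc(2024, 3, 2), "bob", "model-a", "prompt-2", "cfg-1"),
]
running = version_at(history, utc(2024, 2, 15))
print(running)
```

For an incident on 15 February, the lookup returns the January record, so the prompt in effect was `prompt-1`.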

8. Incidents and remediation history

Structured records of incidents: detection, investigation, root cause, remediation, and follow-up. AI incidents need AI-specific playbooks — generic IT incident templates are insufficient.
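A structured incident record can be checked for the five stages listed above. In this sketch the stage labels and the `incident_gaps` helper are hypothetical, and the chronological-order check is an added assumption rather than something this guide requires.

```python
from datetime import datetime, timezone
from typing import Dict, List

# Stage names follow the list above; the exact labels are assumptions.
STAGES = ["detection", "investigation", "root_cause", "remediation", "follow_up"]


def incident_gaps(stage_times: Dict[str, datetime]) -> List[str]:
    """Report stages that are missing or timestamped out of order."""
    gaps = [f"missing stage: {s}" for s in STAGES if s not in stage_times]
    recorded = [s for s in STAGES if s in stage_times]
    # Compare each recorded stage with the next recorded one.
    for earlier, later in zip(recorded, recorded[1:]):
        if stage_times[later] < stage_times[earlier]:
            gaps.append(f"{later} is timestamped before {earlier}")
    return gaps


# Hypothetical incident: fixed, but root cause and follow-up were never written up.
incident = {
    "detection": datetime(2024, 5, 1, tzinfo=timezone.utc),
    "investigation": datetime(2024, 5, 2, tzinfo=timezone.utc),
    "remediation": datetime(2024, 5, 3, tzinfo=timezone.utc),
}
incident_report = incident_gaps(incident)
print(incident_report)
```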

9. Ownership and review cadence

A register showing who owns each control and evidence artifact, and how often each is reviewed. Ownership without cadence is a commitment without follow-through.
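Enforcing a cadence can be as simple as comparing each entry's last review date against its interval. The sketch below is illustrative: `RegisterEntry` and `overdue` are hypothetical names, the dates are invented, and quarterly and annual cadences are approximated as 90 and 365 days.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass(frozen=True)
class RegisterEntry:
    artifact: str
    owner: str
    cadence_days: int    # assumed encoding: 90 for quarterly, 365 for annual
    last_reviewed: date


def overdue(entries: List[RegisterEntry], today: date) -> List[RegisterEntry]:
    """Return entries whose last review is older than their cadence allows."""
    return [
        e for e in entries
        if today - e.last_reviewed > timedelta(days=e.cadence_days)
    ]


# Hypothetical register entries, for illustration only.
register = [
    RegisterEntry("Obligation map", "Legal / Compliance", 90, date(2024, 1, 15)),
    RegisterEntry("System scope document", "Product / ML Engineering", 365, date(2024, 1, 15)),
]
late = overdue(register, today=date(2024, 6, 1))
for entry in late:
    print(f"{entry.artifact} (owner: {entry.owner}) is overdue for review")
```

Run on a schedule, a check like this turns a defined cadence into an enforced one, with the named owner attached to each overdue item.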

Artifact reference table

Each row describes one artifact, what it proves, who typically owns it, what triggers an update, the expected review cadence, and the most common way it fails.

| Artifact | What it proves | Typical owner | Update trigger | Review cadence | Common failure mode |
|---|---|---|---|---|---|
| System scope document | The system's intended purpose, context of use, and deployment boundaries are defined and current. | Product / ML Engineering | New system or significant design change | Pre-release, reviewed annually | Scope described at proposal stage and never updated to match what was actually built. |
| Obligation map | Every applicable legal obligation is identified and linked to a specific control. | Legal / Compliance | New regulation, new system version, or change in intended purpose | Pre-release, reviewed quarterly | Obligations listed but not mapped to controls. The document says 'we comply' without saying how. |
| Control inventory | Each obligation has at least one verifiable mechanism that enforces or satisfies it. | Safety / ML Engineering | New obligation, failed evaluation, or control gap audit | Reviewed quarterly | Controls defined during initial assessment and never revisited as the system changed. |
| Evaluation results | Controls are working as intended under production conditions. | ML Engineering / Safety | Pre-release, post-update, scheduled cadence | Per-release and periodic | Benchmarks used as evidence instead of production evaluations. Results from controlled conditions presented as production proof. |
| Trace / logging evidence | The system's behaviour can be reconstructed for any given interaction. | ML Engineering / Infrastructure | Always-on (continuous) | Continuous capture, reviewed on incident | Logs exist but do not capture the full chain from input to output. Logs are voluminous but not obligation-mapped. |
| Approval and sign-off records | Consequential decisions or tool calls required human review and received it. | Operations / Safety | Oversight rule triggered during operation | Per-event | Human-in-the-loop process exists on paper but approvals are rubber-stamped or not recorded. |
| Model / system version history | Changes to the model, prompts, or system configuration are tracked and attributable. | ML Engineering | Any model update, prompt change, or config change | Per-change | System updated without a version record. Nobody can reconstruct which version was running during an incident. |
| Incident and remediation records | Incidents were detected, investigated, and remediated with a documented response. | Safety / Operations | Incident occurrence | Per-incident, reviewed in post-mortems | Incidents handled ad hoc without structured records. Post-mortem culture exists but outputs are not retained as compliance evidence. |
| Ownership and cadence register | Every control, evaluation, and artifact has a named owner and a defined review schedule. | Compliance / Program Management | Control inventory change, organizational change | Reviewed quarterly | Ownership assigned at setup and never updated when teams change. Cadences defined but not enforced. |

Logs are not evidence

Treating logs as evidence is the most common confusion. Logs record system events. Evidence proves the system meets its obligations. The difference:

Logs

  • Record what happened
  • Infrastructure-level
  • Useful for debugging
  • High volume, low signal
  • Not obligation-mapped

Evidence

  • Prove obligations are met
  • Obligation-mapped
  • Useful for audit
  • Curated and structured
  • Tied to controls and owners

Common failure modes

Snapshot compliance. The pack was assembled once before launch and never updated. The system changed; the evidence didn't.
Unowned artifacts. Evidence exists but no one is accountable for keeping it current. Review cadences are defined but not enforced.
Benchmark-as-evidence. Lab evaluation results presented as production evidence. The system passes the benchmark but fails under real conditions.
Logs-as-evidence. Infrastructure logs presented as compliance evidence. Volume is high but the logs are not obligation-mapped and cannot answer specific compliance questions.

Continue through the compliance system

LatentMesh organizes AI compliance as a practical loop: obligation, control, evaluation, evidence, and response.