What an AI Evidence Pack Should Contain

An evidence pack is a structured, continuously generated collection of artifacts that proves your AI system meets its obligations across time. This guide explains what belongs in one, who owns each piece, and what separates documentation from evidence.

Why evidence packs exist

An evaluation proves a control worked at a point in time. An evidence pack proves the system is trustworthy across time: under real conditions, with proper oversight, and through incidents and changes.

Without evidence packs, compliance is a collection of assertions. With them, compliance is demonstrable.

A trace explains what happened in one run. An evidence pack explains why the system should be trusted across time.

Minimum evidence pack contents

A complete evidence pack covers nine areas. Each area maps to specific compliance obligations and produces concrete artifacts that an auditor can inspect.

1. System scope and purpose

What the system does, who it serves, what data it processes, and the boundaries of its intended use. This is the foundation: everything else in the pack is evidence that the system operates within this scope.

2. Obligation and requirement mapping

Every applicable obligation identified and linked to a specific control. If an obligation has no control, that gap is visible. If a control serves no obligation, it is overhead.
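The two checks described above (obligations without a control, controls without an obligation) can be sketched as a small set comparison. This is a minimal illustration, not a prescribed schema: the `Control` class, the `find_mapping_gaps` function, and identifiers such as `OBL-1` and `CTL-A` are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Control:
    """Hypothetical control record: an ID plus the obligations it claims to serve."""
    control_id: str
    obligation_ids: tuple[str, ...]


def find_mapping_gaps(obligation_ids, controls):
    """Return (obligations with no control, controls serving no known obligation)."""
    known = set(obligation_ids)
    covered = {oid for c in controls for oid in c.obligation_ids}
    uncovered = sorted(known - covered)
    orphaned = sorted(
        c.control_id for c in controls if not known.intersection(c.obligation_ids)
    )
    return uncovered, orphaned


# Illustrative data only.
obligations = ["OBL-1", "OBL-2", "OBL-3"]
controls = [
    Control("CTL-A", ("OBL-1",)),
    Control("CTL-B", ("OBL-1", "OBL-2")),
    Control("CTL-C", ()),  # serves no obligation: overhead
]

uncovered, orphaned = find_mapping_gaps(obligations, controls)
print(uncovered)  # ['OBL-3']  -> a visible gap
print(orphaned)   # ['CTL-C']  -> overhead
```

Running a check like this whenever the obligation map or control inventory changes keeps the gap visible rather than discovered during an audit.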

3. Control inventory

The complete list of controls: what each one does, which obligation it serves, how it is evaluated, what evidence it produces, and who owns it.
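One way to keep an inventory complete is to check each entry for the five attributes listed above. The sketch below assumes a plain dictionary per control; the field names in `REQUIRED_FIELDS` and the sample entries are hypothetical and chosen only to mirror that list.

```python
from __future__ import annotations

# Hypothetical field names mirroring: what it does, which obligation it serves,
# how it is evaluated, what evidence it produces, and who owns it.
REQUIRED_FIELDS = ("description", "obligation", "evaluation", "evidence", "owner")


def missing_fields(entry: dict) -> list[str]:
    """List required fields that are absent or blank in one inventory entry."""
    return [f for f in REQUIRED_FIELDS if not str(entry.get(f) or "").strip()]


# Illustrative inventory only.
inventory = {
    "CTL-A": {
        "description": "Block responses that contain personal data",
        "obligation": "OBL-1",
        "evaluation": "Weekly sampled production review",
        "evidence": "Evaluation report",
        "owner": "Safety",
    },
    "CTL-B": {
        "description": "Require approval for refunds",
        "obligation": "OBL-2",
        "evaluation": "",  # blank: nobody has said how this control is evaluated
        "evidence": "Approval records",
        # no owner recorded
    },
}

incomplete = {
    cid: missing_fields(entry)
    for cid, entry in inventory.items()
    if missing_fields(entry)
}
print(incomplete)  # {'CTL-B': ['evaluation', 'owner']}
```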

4. Evaluation results

Outputs of structured tests that verify controls are working. Not benchmarks, but production evaluations that reflect the conditions the system actually operates in.

5. Trace and logging evidence

Automatic records that allow any interaction to be reconstructed: inputs, model reasoning, tool calls, oversight decisions, and outputs. These records are mapped to obligations; they are not just infrastructure logs.
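As a rough illustration of what "reconstructable" and "obligation-mapped" can mean in practice, the sketch below orders the events of one interaction and reports two problems: a chain that lacks an input or output, and events with no obligation attached. The `TraceEvent` fields, the stage names, and the choice of required stages are assumptions made for this example, not a standard trace format.

```python
from __future__ import annotations

from dataclasses import dataclass

# Assumption for this sketch: a reconstructable chain needs at least these stages.
REQUIRED_STAGES = ("input", "output")


@dataclass(frozen=True)
class TraceEvent:
    """Hypothetical trace event for one step of one interaction."""
    interaction_id: str
    seq: int                    # position within the interaction
    stage: str                  # e.g. "input", "reasoning", "tool_call", "oversight", "output"
    obligation_id: str | None   # None means the event is not obligation-mapped
    payload: str


def reconstruct(events, interaction_id):
    """Return the events of one interaction in sequence order."""
    return sorted(
        (e for e in events if e.interaction_id == interaction_id),
        key=lambda e: e.seq,
    )


def trace_problems(chain):
    """Report missing required stages and events without an obligation mapping."""
    stages = {e.stage for e in chain}
    problems = [f"missing stage: {s}" for s in REQUIRED_STAGES if s not in stages]
    problems += [
        f"event {e.seq} has no obligation mapping"
        for e in chain
        if e.obligation_id is None
    ]
    return problems


# Illustrative events, deliberately out of order.
events = [
    TraceEvent("run-1", 2, "tool_call", "OBL-2", "lookup_order(42)"),
    TraceEvent("run-1", 1, "input", "OBL-1", "Where is my order?"),
    TraceEvent("run-2", 1, "input", "OBL-1", "Cancel my account"),
    TraceEvent("run-1", 3, "output", None, "Your order ships tomorrow."),
]

chain = reconstruct(events, "run-1")
print([e.stage for e in chain])                   # ['input', 'tool_call', 'output']
print(trace_problems(chain))                      # ['event 3 has no obligation mapping']
print(trace_problems(reconstruct(events, "run-2")))  # ['missing stage: output']
```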

6. Approvals and sign-offs

Records showing that consequential actions received human review. Each record should include who approved, when, the rationale, and what was approved.
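The four fields named above can be enforced at the moment a record is created, so that an approval without a rationale is rejected rather than stored. This is a minimal sketch; the `ApprovalRecord` class and its field names are hypothetical, and a non-empty rationale is only a weak guard against rubber-stamping.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ApprovalRecord:
    """Hypothetical approval record: who approved, when, why, and what."""
    approver: str
    approved_at: datetime
    rationale: str
    action: str

    def __post_init__(self):
        # Reject records that leave any of the text fields blank.
        for name in ("approver", "rationale", "action"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} must not be empty")
        # Require an unambiguous timestamp.
        if self.approved_at.tzinfo is None:
            raise ValueError("approved_at must be timezone-aware")


# Illustrative values only.
record = ApprovalRecord(
    approver="j.doe",
    approved_at=datetime(2025, 5, 2, 14, 30, tzinfo=timezone.utc),
    rationale="Refund amount matches the order total and the return was received.",
    action="Issue refund for order 42",
)
print(record.approver, record.approved_at.isoformat())
```

Constructing `ApprovalRecord` with an empty `rationale` or a naive `datetime` raises `ValueError`, which makes an unrecorded or unexplained approval fail loudly at write time.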

7. Model and system version history

A change log covering model updates, prompt changes, configuration changes, and deployment history. Each entry is attributable and timestamped, so the version that was running at any point in time can be reconstructed.
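A timestamped change log makes it possible to answer the question "which version was running during the incident?" with a simple lookup. The sketch below assumes each change record is a full snapshot of model, prompt, and configuration versions; `ChangeRecord`, `version_at`, and all sample values are hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ChangeRecord:
    """Hypothetical change-log entry: a full snapshot plus who made it and when."""
    effective_at: datetime
    author: str
    model: str
    prompt_version: str
    config_version: str


def version_at(history, when):
    """Return the record in effect at `when`, or None if nothing was deployed yet."""
    ordered = sorted(history, key=lambda r: r.effective_at)
    idx = bisect_right([r.effective_at for r in ordered], when)
    return ordered[idx - 1] if idx else None


# Illustrative history only.
history = [
    ChangeRecord(datetime(2025, 1, 10, tzinfo=timezone.utc), "alice", "model-v1", "prompt-3", "cfg-7"),
    ChangeRecord(datetime(2025, 3, 2, tzinfo=timezone.utc), "bob", "model-v2", "prompt-3", "cfg-8"),
]

incident_time = datetime(2025, 2, 14, tzinfo=timezone.utc)
running = version_at(history, incident_time)
print(running.model, running.config_version, running.author)  # model-v1 cfg-7 alice
```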

8. Incidents and remediation history

Structured records of incidents: detection, investigation, root cause, remediation, and follow-up. AI incidents need AI-specific playbooks; generic IT incident templates are insufficient.

9. Ownership and review cadence

A register showing who owns each control and evidence artifact, and how often each is reviewed. Ownership without cadence is a commitment without follow-through.
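Cadence only means something if overdue reviews are surfaced. The sketch below scans a register for entries with no owner or with a review older than the stated cadence. `RegisterEntry`, `review_issues`, and the sample dates and cadences are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class RegisterEntry:
    """Hypothetical register row: artifact, owner, cadence, and last review date."""
    artifact: str
    owner: str
    cadence_days: int
    last_reviewed: date


def review_issues(register, today):
    """Return (artifact, reason) pairs for unowned or overdue entries."""
    issues = []
    for entry in register:
        if not entry.owner.strip():
            issues.append((entry.artifact, "no owner"))
        if today - entry.last_reviewed > timedelta(days=entry.cadence_days):
            issues.append((entry.artifact, "review overdue"))
    return issues


# Illustrative register only.
register = [
    RegisterEntry("Obligation map", "Legal", 90, date(2025, 1, 15)),
    RegisterEntry("Control inventory", "Safety", 90, date(2025, 4, 10)),
    RegisterEntry("System scope document", "", 365, date(2025, 3, 1)),
]

issues = review_issues(register, today=date(2025, 6, 1))
print(issues)
# [('Obligation map', 'review overdue'), ('System scope document', 'no owner')]
```

A check like this could run on a schedule and open a ticket for each issue, which turns a defined cadence into an enforced one.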

Artifact reference table

Each row describes one artifact, what it proves, who typically owns it, what triggers an update, the expected review cadence, and the most common way it fails.

Artifact	What it proves	Typical owner	Update trigger	Review cadence	Common failure mode
System scope document	The system's intended purpose, context of use, and deployment boundaries are defined and current.	Product / ML Engineering	New system or significant design change	Pre-release, reviewed annually	Scope described at proposal stage and never updated to match what was actually built.
Obligation map	Every applicable legal obligation is identified and linked to a specific control.	Legal / Compliance	New regulation, new system version, or change in intended purpose	Pre-release, reviewed quarterly	Obligations listed but not mapped to controls. The document says 'we comply' without saying how.
Control inventory	Each obligation has at least one verifiable mechanism that enforces or satisfies it.	Safety / ML Engineering	New obligation, failed evaluation, or control gap audit	Reviewed quarterly	Controls defined during initial assessment and never revisited as the system changed.
Evaluation results	Controls are working as intended under production conditions.	ML Engineering / Safety	Pre-release, post-update, scheduled cadence	Per-release and periodic	Benchmarks used as evidence instead of production evaluations. Results from controlled conditions presented as production proof.
Trace / logging evidence	The system's behaviour can be reconstructed for any given interaction.	ML Engineering / Infrastructure	Always-on (continuous)	Continuous capture, reviewed on incident	Logs exist but do not capture the full chain from input to output. Logs are voluminous but not obligation-mapped.
Approval and sign-off records	Consequential decisions or tool calls required human review and received it.	Operations / Safety	Oversight rule triggered during operation	Per-event	Human-in-the-loop process exists on paper but approvals are rubber-stamped or not recorded.
Model / system version history	Changes to the model, prompts, or system configuration are tracked and attributable.	ML Engineering	Any model update, prompt change, or config change	Per-change	System updated without a version record. Nobody can reconstruct which version was running during an incident.
Incident and remediation records	Incidents were detected, investigated, and remediated with a documented response.	Safety / Operations	Incident occurrence	Per-incident, reviewed in post-mortems	Incidents handled ad hoc without structured records. Post-mortem culture exists but outputs are not retained as compliance evidence.
Ownership and cadence register	Every control, evaluation, and artifact has a named owner and a defined review schedule.	Compliance / Program Management	Control inventory change, organizational change	Reviewed quarterly	Ownership assigned at setup and never updated when teams change. Cadences defined but not enforced.

Logs are not evidence

This is the most common confusion. Logs record system events. Evidence proves the system meets its obligations. The difference:

Logs

Record what happened
Infrastructure-level
Useful for debugging
High volume, low signal
Not obligation-mapped

Evidence

Prove obligations are met
Obligation-mapped
Useful for audit
Curated and structured
Tied to controls and owners

Common failure modes

Snapshot compliance. The pack was assembled once before launch and never updated. The system changed; the evidence didn't.

Unowned artifacts. Evidence exists but no one is accountable for keeping it current. Review cadences are defined but not enforced.

Benchmark-as-evidence. Lab evaluation results presented as production evidence. The system passes the benchmark but fails under real conditions.

Logs-as-evidence. Infrastructure logs presented as compliance evidence. Volume is high but the logs are not obligation-mapped and cannot answer specific compliance questions.

Continue through the compliance system

LatentMesh organizes AI compliance as a practical loop: obligation, control, evaluation, evidence, and response.