Drift Detection Patterns for Production Agents
Your agent is still answering. That does not mean it is still behaving the same way. Five drift classes, three detection layers, and the patterns that catch regression before your customers do.
Your agent is still answering. That does not mean it is still behaving the same way.
This is a companion essay to Drift Is the Default in the Reliable Agent Systems series.
In early August 2025, users of Anthropic’s Claude began reporting degraded responses. The initial reports were hard to distinguish from normal variation. By late August, the persistence of complaints prompted an investigation that uncovered not one bug but three, overlapping in time and producing different symptoms on different hardware platforms: a routing error sending requests to the wrong server pool, a TPU misconfiguration that injected Thai characters into English responses, and a latent compiler bug that excluded probable tokens during generation.
Anthropic’s standard quality checks, benchmarks and safety evaluations, caught none of them. The model often recovered from isolated mistakes, masking the systemic drift in aggregate metrics. Sticky routing meant some users experienced persistent degradation while others saw nothing wrong. The team’s own postmortem was direct: their evaluations were not broad or deep enough to catch the specific patterns users were reporting. Resolution took roughly six weeks.
This is a company that builds the model. They have access to the weights, the infrastructure, and the evaluation tooling. If it took them six weeks to localize three bugs producing visible output corruption, what happens when drift is subtler, when the model still produces clean English but chooses the wrong tool, retrieves from the wrong corpus, or approves a refund it should have escalated?
Drift in agents is not one thing
Drift detection is the problem of identifying which layer changed before business metrics tell you something is wrong.
Drift Is the Default made the case that change in agent systems is constant and comes from everywhere: model updates, prompt edits, tool API changes, retrieval corpus shifts, user population changes. The implication most teams miss is that drift detection cannot be a single metric or a single check. You are monitoring a changing system made of prompts, tools, retrieval, policies, user mix, and provider behavior. The job is to detect change at the right layer, fast enough to act, and with enough evidence to localize the cause.
Most teams monitor lagging indicators: CSAT collapse, escalation spikes, users complaining. By the time those move, the damage is already in production. Mature teams monitor structural change: tool-selection shifts, retrieval quality decay, route mix changes, rising judge-score variance, a step in the trace suddenly taking a different path.
Five drift classes matter in agent systems, and each requires different detection signals:
Input drift. The distribution of user intents, formats, languages, or task difficulty changes. A customer support agent trained on warranty questions starts receiving billing disputes after a product launch. The agent’s accuracy drops, but the drop is caused by harder work arriving, not by the agent getting worse.
Context drift. The quality or composition of retrieved documents, memory state, or system context changes. A retrieval index gets updated with new policy documents that contradict older ones. The agent starts producing conflicting answers depending on which document gets ranked first.
Decision drift. The agent’s tool choices, routing decisions, or planning behavior shift. The planner starts favoring one tool over another, or begins skipping a required approval step, even though the final output still looks acceptable.
Execution drift. Downstream tools or APIs change shape, latency, or semantics. A refund API starts returning a new error code the agent was not designed to handle, so it retries silently and processes duplicate refunds.
Outcome drift. Task success, policy compliance, cost, latency, or human escalation rates move. This is the layer most teams already watch, but it is the slowest to signal and the hardest to attribute to a root cause.
These five classes are not independent. A retrieval corpus update (context drift) can change tool selection (decision drift), which can change task success (outcome drift). The detection challenge is distinguishing cause from effect fast enough to fix the right layer.
Three detection layers
You need detectors at three levels: distribution, trace, and outcome. Each catches a different class of drift, and none of them is sufficient alone.
Distribution detectors watch the statistical shape of inputs, intermediate states, and outputs over time. Is the mix of intents changing? Are retrieval confidence scores trending down? Is the token-count distribution of agent responses shifting? These catch input drift and context drift early, before they affect outcomes. The signal is often noisy, so the practical pattern is to track rolling baselines per segment and alert on sustained deviation, not single-point anomalies.
Trace detectors watch the structure of individual agent runs. Did the agent call the expected tools in the expected order? Did a required control evaluation actually execute? Did the retrieval step return documents from the allowed corpus? These catch decision drift and execution drift, and they connect directly to the control integrity that the rest of the series has been building toward. A trace detector that asserts “destructive actions require explicit authorization scope” is not a quality check. It is a control verification running on every production request.
Outcome detectors watch the results. Task completion rates, policy compliance scores, escalation rates, cost per resolution, judge scores on sampled traces. These catch outcome drift, but they are lagging indicators. By the time the outcome detector fires, the structural change already happened. Their value is confirming that distribution or trace-level signals are real, not noise.
Together, these three layers tell you not only that performance moved, but whether the cause is traffic, context, decision logic, or downstream execution.
The patterns that matter
Pattern 1: Replay a frozen canary set
Maintain a small, versioned set of high-value tasks. Replay them against the live system or a shadow copy on a fixed schedule, daily at minimum. Score the results against the same rubric every time. This is your stable baseline when real traffic is noisy. Canary replay catches model or provider changes, prompt regressions, routing drift, and retrieval quality collapse. It works because the inputs are fixed, so any change in output is attributable to the system, not the traffic.
The practical constraint is coverage. A canary set of 50 tasks will not catch everything. Design it to cover the highest-risk paths: actions with financial impact, actions that touch customer data, actions that require human authorization. The eval harness architecture from Building an Eval Harness That Survives Production gives you the scaffolding: declarative specs, versioned rubrics, separated scoring from execution.
Pattern 2: Score production traces with control assertions
Attach evaluators to live traces. Some are deterministic: did the required policy check run? Was the authorization scope present? Did the retrieval step return documents from the allowed corpus? Others use judges: was the response grounded in the retrieved context? Was the escalation decision appropriate? Was the tone compliant with brand guidelines?
The key move is to assert things about the path, not just the final answer. A trace where the agent skipped the fraud-screening control but still produced a reasonable-looking response is a control failure, even if the customer is satisfied. This is where drift detection becomes control-aware rather than just quality-aware.
If you built the control evaluation pattern from Controls Are Not Guardrails, these assertions already exist. The drift detection layer reuses them as continuous monitors rather than one-time checks.
Pattern 3: Monitor structural signals
Even if final answers still look fine, the system may be choosing a different path to get there. Track tool-selection rates, route distributions, retrieval source distributions, and step-count distributions over time. A planner that starts routing 40% of requests through a fallback path when the baseline was 5% is signaling something, even if the fallback answers are decent. These structural signals often move before user-visible failure, making them the most useful early-warning layer.
Slice every metric by model version, prompt version, retriever version, tool version, customer segment, task type, language, and channel. Never look at one aggregate score. Aggregates are where drift goes to hide. The Anthropic incident demonstrated exactly why: aggregate metrics looked normal because most users were unaffected. The degradation was concentrated in specific routing cohorts that only showed up when sliced by server pool assignment.
Pattern 4: Target human review where uncertainty is rising
Not random review everywhere. Review where judge disagreement increases, variance increases, new slices appear, route choices change sharply, or canary performance diverges from live traffic. This keeps the cost manageable and focuses human attention on the traces most likely to reveal a real problem.
Human review also recalibrates your automated evaluators. Judges drift too. If your groundedness scorer was calibrated against a retrieval corpus that has since been updated, it may be scoring correctly against an outdated standard. Periodic human review on high-uncertainty traces is how you catch evaluator drift, not just agent drift.
Pattern 5: Pre-bind detectors to response playbooks
A drift alert without a response path is observability theater. Every alert should map to one of: rollback to a previous model or prompt version, prompt patch to address a specific behavioral shift, retriever reindex or corpus fix, tool contract update, evaluator recalibration, temporary routing restriction to a known-good path, or human-only fallback for the affected task type.
The mapping should be documented before the alert fires, not improvised during the incident. This connects directly to the incident response infrastructure from The Incident Response Gap in AI Systems. The drift detector generates the signal. The response playbook determines the action. The evidence record captures what changed, when, and what was done about it.
A running example
Consider a customer support agent handling refund requests. The system has been stable for three months. Then three things happen in the same week:
The retrieval corpus gets updated with a new return policy that tightens the refund window from 30 days to 14 days. The update is correct, but some older policy documents were not removed. The agent starts producing contradictory answers depending on which document gets ranked first.
The payment provider updates their refund API. The response schema adds a new field. The agent’s tool integration ignores unknown fields, so it does not break, but the new field contains a fraud-risk flag the agent never sees.
A product launch shifts the traffic mix. The agent starts receiving questions about a product category it was not optimized for. Task completion rates drop, but only for that category.
With lagging indicators alone, the team sees CSAT drop after a week and opens an investigation. With the detection stack described here, the signals arrive earlier and separately: the canary set catches the contradictory retrieval results on day one. The structural signal monitor flags the rising retrieval-source entropy. The trace detector notices the missing fraud-risk field in the tool response schema. The traffic-mix monitor shows the new intent cluster. Each signal points to a different layer, and the team can triage each one independently instead of debugging a single aggregate quality number.
What this is really about
Drift detection is not a dashboard. It is not a quality score. It is the continuous verification that your controls still hold under changing conditions. Drift Is the Default established that change is inevitable. This essay is about building the infrastructure to detect it before your customers do, and to produce the evidence trail that proves you handled it.
The point is not to know that drift happened. The point is to know which layer moved, what control no longer held, and what action to take before customers absorb the failure.
Previously: Drift Is the Default. Next: The Regulatory Mapping Table.
Selected references
- Anthropic. “A postmortem of three recent issues.” Anthropic Engineering Blog, September 2025. https://www.anthropic.com/engineering/a-postmortem-of-three-recent-issues
- OpenAI. “Sycophancy in GPT-4o: What happened and what we’re doing about it.” OpenAI Blog, April 29, 2025. https://openai.com/index/sycophancy-in-gpt-4o/
- OpenAI. “Expanding on what we missed with sycophancy.” OpenAI Blog, May 2, 2025. https://openai.com/index/expanding-on-sycophancy/
- Chen L, Zaharia M, Zou J. “How Is ChatGPT’s Behavior Changing over Time?” Harvard Data Science Review, Special Issue 6.2, Spring 2024. https://hdsr.mitpress.mit.edu/pub/y95zitmz
- EU AI Act, Articles 9 and 12: Risk Management System and Record-Keeping. Regulation (EU) 2024/1689, Official Journal of the European Union. https://eur-lex.europa.eu/eli/reg/2024/1689/oj