Ask your AI agent to book a flight, and it might succeed nine times out of ten. On the tenth try, it books the wrong date, picks a layover in the opposite direction, or confidently tells the user the reservation is confirmed when it is not. Traditional unit tests cannot catch this. The agent followed a valid code path each time. The problem is that AI agents are non-deterministic: same input, different output, every run.

This is why Anthropic published “Demystifying Evals for AI Agents” in January 2026, calling evaluations the single most important practice for shipping agents with confidence. Their recommendation: start with 20 to 50 test cases drawn from real failures. Not hundreds. Not a comprehensive test suite. Just the things that actually broke.

Related: What Are AI Agents? A Practical Guide for Business Leaders

Why Standard Testing Falls Apart for Agents

A function that calculates sales tax returns the same number every time you call it. You write a test, assert the output, and move on. AI agents break this model in three ways.

Non-deterministic outputs. The same prompt produces different tool calls, different reasoning chains, and different final answers across runs. Temperature settings, model updates, and context window contents all introduce variation. You cannot assert on exact output strings.

Multi-step execution. An agent that researches a topic might call a search API, read three documents, synthesize findings, and produce a report. A failure at step four might be caused by a bad decision at step two. The test needs to evaluate the entire trajectory, not just the final output.

Environmental dependence. Agents interact with APIs, databases, and web pages that change independently. A test that passed yesterday can fail today because a website updated its HTML structure, an API changed its response format, or rate limits kicked in.

Anthropic frames this clearly: “The capabilities that make agents useful (autonomy, intelligence, and flexibility) also make them more difficult to evaluate.”

The Eval Mindset: What to Test and How

Anthropic distinguishes two types of evaluations that every agent team needs.

Capability evals answer “What can this agent do?” They start at low pass rates and improve as the agent gets better. These are your frontier tests: tasks the agent cannot reliably handle yet, but should eventually.

Regression evals answer “Does it still work?” They should maintain close to 100% pass rates. When a regression eval fails, something broke that used to work. These are your safety net.
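The split can be made concrete in code. Here is a minimal sketch of tagging eval cases into the two suites and holding them to different standards; the case names and structure are illustrative, not from Anthropic's post:

```python
# Sketch: capability vs. regression suites with different expectations.
# Case names and thresholds are hypothetical examples.
from dataclasses import dataclass

@dataclass
class EvalCase:
    name: str
    suite: str   # "capability" or "regression"
    passed: bool

def suite_pass_rate(cases: list[EvalCase], suite: str) -> float:
    relevant = [c for c in cases if c.suite == suite]
    return sum(c.passed for c in relevant) / len(relevant)

cases = [
    EvalCase("multi_file_refactor", "capability", False),  # frontier task: expected to fail today
    EvalCase("single_bug_fix", "regression", True),        # must keep working
    EvalCase("refuse_out_of_scope_request", "regression", True),
]

# Regression suites should hold near 100%; capability suites start low
# and improve as the agent does.
assert suite_pass_rate(cases, "regression") == 1.0
```

A failing regression case is an alarm; a failing capability case is just the current frontier.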

Three Grader Types

Not every eval needs an LLM to grade it. Anthropic identifies three approaches, and the best teams combine all three:

Code-based graders are fast, cheap, and objective. String matching, regular expressions, database state checks, unit test execution. If the agent should create a file, check if the file exists. If it should update a database record, query the database. Use these whenever possible.

Model-based graders handle subjective quality. An LLM evaluates whether the agent’s response was helpful, accurate, or appropriately toned. These are slower and more expensive but necessary for conversational agents, content generation, and any task where “correct” is not binary. Calibrate them against human judgments.

Human graders are the gold standard for subjective tasks but do not scale. Use them to validate your automated graders, not to run daily evals. Anthropic recommends that two domain experts independently reach the same pass/fail verdict on a properly specified task.

The pass@k vs. pass^k Distinction

This is the metric that separates teams who understand agent reliability from those who do not.

pass@k measures whether the agent succeeds at least once in k attempts. If your agent has a 50% success rate, pass@3 is 87.5%. This is useful for tools where one success is enough: code generation (generate three candidates, pick the best one) or creative writing.

pass^k measures whether the agent succeeds on all k attempts. A 75% per-trial success rate gives you only ~42% pass^3. This is the metric for customer-facing agents. If a customer service bot fails one in four times, that is unacceptable, and pass^k makes that visible.
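Both metrics fall out of the per-trial success rate with two lines of arithmetic, assuming independent attempts:

```python
# pass@k: probability of at least one success in k independent attempts.
# pass^k: probability of success on all k attempts.
def pass_at_k(p: float, k: int) -> float:
    return 1 - (1 - p) ** k

def pass_pow_k(p: float, k: int) -> float:
    return p ** k

# A 50% per-trial agent looks strong under pass@3...
print(round(pass_at_k(0.50, 3), 3))   # 0.875
# ...while a 75% per-trial agent looks weak under pass^3.
print(round(pass_pow_k(0.75, 3), 3))  # 0.422
```

The asymmetry is the point: pass@k flatters an unreliable agent, pass^k exposes it.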

Sierra’s Tau-Bench uses pass^k as its primary metric, and the results are sobering: even GPT-4o succeeds on fewer than 50% of tasks, with pass^8 dropping below 25% in retail customer service scenarios.

Related: AI Agent Frameworks Compared: LangGraph, CrewAI, AutoGen

The Benchmarks That Matter in 2026

Dozens of agent benchmarks exist. These five are the ones teams actually reference when evaluating production agents.

GAIA

GAIA tests general AI assistant capabilities across three difficulty levels requiring reasoning, web browsing, multi-modality, and tool use. Humans score ~92%. The best AI agent (Writer’s Action Agent) reached 61% on Level 3 tasks in mid-2025. GPT-4 with plugins initially struggled to break 15%. GAIA exposes the gap between what agents can do on simple tasks and what they cannot do on complex, multi-step problems that require real-world knowledge.

SWE-bench Verified

SWE-bench Verified is the standard for evaluating coding agents. It contains 500 human-validated GitHub issues from real Python repositories. The task: generate a code patch that resolves the issue and passes the test suite. Performance improved from ~40% to over 70% in a single year (Warp’s agent scores 71%). This benchmark shows that coding agents are approaching practical usefulness for routine bug fixes, though multi-file refactoring remains hard.

Tau-Bench and Tau2-Bench

Sierra’s Tau-Bench simulates real customer service conversations. An LLM plays the customer, the agent handles the request, and grading checks whether the agent achieved the correct database state (not whether the conversation “sounded” right). This is a critical distinction: it measures whether the agent actually solved the problem, not whether it was polite about it. Tau2-Bench adds a telecom domain and expanded scenarios.

WebArena

WebArena tests agents on 812 tasks across self-hosted websites: browsing e-commerce, managing forums, editing code repos. Agent performance improved from ~14% to roughly 60% in two years. It is the standard for evaluating browser-based agents and web automation workflows.

BrowseComp

BrowseComp from OpenAI evaluates research agents on complex web browsing tasks. It tests whether agents can find accurate, grounded answers to questions that require navigating multiple sources and synthesizing information. Coverage, groundedness, and source quality are the key metrics.

Building Your Eval Suite: A Practical Playbook

Anthropic’s engineering team offers a step-by-step approach that works regardless of what your agent does.

Step 1: Start with Real Failures

Do not brainstorm test cases. Pull them from production incidents, customer complaints, and manual QA sessions. If you do not have production data yet, use the scenarios you tested manually before launch. Anthropic’s recommendation: “Begin with behaviors already verified before releases and common user-reported failures prioritized by impact.”

Twenty test cases that cover real failure modes are worth more than 200 synthetic scenarios that test happy paths.

Step 2: Make Tasks Unambiguous

Every eval task needs a reference solution proving it is solvable. Anthropic’s bar: “Two domain experts should independently reach the same pass/fail verdict on properly specified tasks.” If two engineers disagree about whether the agent’s output is correct, the task is underspecified, not the agent.

Step 3: Test Absence, Not Just Presence

Most eval suites are lopsided. They test “Does the agent do X when it should?” but not “Does the agent avoid doing X when it should not?” Balance your test set: include tasks where the correct behavior is to refuse, ask for clarification, or do nothing.
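A balanced set can be as simple as tagging each case with the behavior you expect, including refusal and clarification. The tasks and labels below are illustrative:

```python
# Sketch: a balanced eval set mixing "should act" and "should not act" cases.
# Task text and expected behaviors are hypothetical examples.
eval_cases = [
    {"task": "Cancel order #1234 for the verified account holder",
     "expected": "act"},
    {"task": "Cancel order #1234 for a caller who failed identity verification",
     "expected": "refuse"},
    {"task": "Book the cheapest flight to 'the usual place'",
     "expected": "clarify"},  # ambiguous: the agent should ask, not guess
]

def grade(case: dict, agent_behavior: str) -> bool:
    # Acting when the agent should refuse is a failure, not a near-miss.
    return agent_behavior == case["expected"]
```

Note that the second case passes only on a refusal; a confident, polite cancellation would fail it.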

Step 4: Isolate Each Trial

Shared state between eval runs causes correlated failures that look like systematic bugs but are actually test infrastructure problems. Give each trial a clean environment: fresh database state, empty context, no carryover from previous runs.
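One way to enforce this, sketched with only the standard library, is to wrap every trial in a scratch directory with its own seeded database; the schema and seed row are hypothetical:

```python
# Sketch: give each trial a fresh environment so failures cannot
# leak between runs. Schema and seed data are hypothetical examples.
import os
import shutil
import sqlite3
import tempfile

def run_isolated_trial(trial_fn):
    """Run one eval trial against a fresh database in a fresh directory."""
    workdir = tempfile.mkdtemp(prefix="eval_trial_")
    db_path = os.path.join(workdir, "state.db")
    try:
        with sqlite3.connect(db_path) as conn:
            conn.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
            conn.execute("INSERT INTO orders VALUES (1, 'open')")  # seed state
        return trial_fn(db_path)
    finally:
        shutil.rmtree(workdir)  # no carryover into the next trial
```

Whatever a trial does to its database, the next trial starts from the same seeded state, so a string of failures points at the agent rather than at residue from earlier runs.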

Step 5: Watch for Saturation

When an eval suite hits 100% pass rate, it stops providing useful signal. That does not mean your agent is perfect. It means your tests are too easy. Refresh capability evals as the agent improves. Keep regression evals at 100%.

Anthropic found this firsthand: Claude Opus 4.5 initially scored 42% on CORE-Bench, then jumped to 95% after grading bugs were fixed. The lesson: “Don’t take eval scores at face value until someone digs into the details and reads transcripts.”

Tools for Agent Evaluation

The tooling ecosystem has matured significantly. Here are the platforms teams use in production.

LangSmith

LangSmith is LangChain’s evaluation and observability platform. It traces every LLM call, captures prompts and outputs, tracks costs and latency, and supports dataset-based evaluation with LLM-as-judge workflows. If you are in the LangChain/LangGraph ecosystem, LangSmith is the default choice. It supports regression testing and provides tracing that is especially valuable for debugging multi-step agent chains.

Braintrust

Braintrust combines evaluation with production monitoring. Run evals against datasets, compare prompts and configurations, and score results with automated graders. It stands out for its TypeScript/JavaScript support and unified eval-plus-monitoring workflow. Enterprise teams like the self-hosting option and automated evaluation pipelines.

Evidently AI

Evidently focuses on testing and monitoring LLM applications with over 25 million downloads of their open-source library. Evidently Cloud provides a no-code workspace for generating synthetic data, running adversarial tests, and tracking agent performance. Their guide to 10 AI agent benchmarks is one of the most comprehensive references available.

AgentEval (Anthropic)

Anthropic’s internal eval framework, described in their engineering blog, uses a combination of code-based, model-based, and human graders organized into capability and regression suites. While not open-sourced as a product, the methodology is fully documented and replicable with any eval tooling.

| Tool | Best For | Key Feature | Open Source |
|------|----------|-------------|-------------|
| LangSmith | LangChain users | Tracing + eval integration | No (free tier) |
| Braintrust | Full-stack eval | Unified eval + monitoring | No (free tier) |
| Evidently AI | Open-source teams | Synthetic data + adversarial tests | Yes |
| Inspect AI | Research teams | UK AISI eval framework | Yes |

Related: MCP and A2A: The Protocols Making AI Agents Talk

Common Mistakes and How to Avoid Them

Evaluating the conversation instead of the outcome. A chatbot that sounds confident and polite can still give wrong answers. Sierra’s Tau-Bench grades on database state, not conversation quality. Apply the same principle: check what the agent actually did, not what it said it did.

Running evals once and calling it done. Models update. APIs change. User behavior shifts. Evals need to run continuously, ideally on every code change and on a schedule against production. Treat your eval suite like your test suite: it is part of CI/CD.

Ignoring transcript review. Automated graders catch quantifiable failures. But reading raw agent transcripts builds intuition for failure patterns that no metric captures. Anthropic recommends regular transcript reviews as a core practice, not an occasional audit.

Over-indexing on benchmarks. GAIA and SWE-bench measure general capability. They do not tell you whether your specific agent handles your specific use cases. Custom evals on your data always matter more than benchmark scores.

Skipping the “should not do” cases. The most dangerous agent failures are not when the agent fails to act. They are when the agent acts confidently in situations where it should not. Test for appropriate refusal and escalation as aggressively as you test for task completion.

Frequently Asked Questions

How do you test AI agents?

AI agents are tested through evaluations (evals): structured test tasks with defined inputs and grading logic. Anthropic recommends starting with 20-50 test cases drawn from real production failures. Use code-based graders for objective checks (did the agent update the database correctly?), model-based graders for subjective quality, and human graders to calibrate automated scoring. Run both capability evals (what can it do?) and regression evals (does it still work?) on every code change.

What is the difference between pass@k and pass^k for AI agents?

pass@k measures whether an agent succeeds at least once in k attempts, useful for tools where one success is enough (like code generation). pass^k measures whether an agent succeeds on all k attempts, critical for customer-facing reliability. A 75% per-trial success rate gives only about 42% pass^3. Sierra’s Tau-Bench uses pass^k as its primary metric because customer service agents must be reliable on every interaction, not just most.

What are the best benchmarks for evaluating AI agents in 2026?

The five most-referenced benchmarks are GAIA (general AI assistant tasks, top score 61% at Level 3), SWE-bench Verified (coding agents resolving real GitHub issues, top scores above 70%), Tau-Bench (customer service agent reliability with pass^k metric), WebArena (browser-based web tasks, improved from 14% to 60% in two years), and BrowseComp (complex web research). Custom evals on your own data always matter more than benchmark scores.

What tools are used for AI agent evaluation?

LangSmith (LangChain’s platform for tracing and evaluation), Braintrust (unified eval and production monitoring), and Evidently AI (open-source with 25M+ downloads) are the most widely used. LangSmith excels for LangGraph/LangChain teams with built-in tracing. Braintrust suits full-stack teams needing eval plus monitoring. Evidently AI is best for open-source teams wanting synthetic data generation and adversarial testing.

Why can’t you use unit tests for AI agents?

Unit tests assert exact outputs for given inputs. AI agents produce different outputs every run due to non-deterministic LLM behavior, temperature settings, and context variations. Agents also execute multi-step workflows where a failure at step four may be caused by a decision at step two, and they interact with external systems that change independently. Instead of exact assertions, agent testing uses outcome-based grading, statistical metrics like pass^k, and evaluation suites that check whether the agent achieved the correct end state.

Cover image by Tudor Baciu on Unsplash