Claude Opus 4.5 scores 80.9% on SWE-bench Verified. The same model scores 45.9% on SWE-bench Pro. That 35-point gap is not a rounding error. It is the difference between a benchmark that has been contaminated by training data and one that has not. In March 2026, OpenAI published evidence that every frontier model, including their own GPT-5.2, shows signs of having seen SWE-bench Verified solutions during training. They stopped reporting Verified scores entirely.
This matters because AI agent benchmarks are how the industry decides which model is “best.” If the benchmarks are broken, the rankings are fiction. Here is what the major benchmarks actually measure, where they fail, and what to use instead.
SWE-bench: The Benchmark That Ate the AI Industry
SWE-bench, introduced by Princeton researchers in 2023, asks AI agents to resolve real GitHub issues from open-source Python repositories. Give the agent a bug report and a codebase. See if it produces a patch that passes the repo’s test suite. Simple premise, enormous influence.
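The pass/fail criterion can be sketched in a few lines. SWE-bench's real harness runs each repository's test suite in a container; the fail-to-pass / pass-to-pass split below mirrors how the benchmark classifies tests, but the names and structure here are an illustrative sketch, not the actual harness API.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Task:
    issue: str
    fail_to_pass: List[str]   # tests the issue causes to fail; the patch must fix them
    pass_to_pass: List[str]   # tests that already pass and must keep passing

def is_resolved(task: Task, run_test: Callable[[str], bool]) -> bool:
    """A patch resolves a task only if every failing test now passes
    and no previously passing test breaks."""
    return all(run_test(t) for t in task.fail_to_pass + task.pass_to_pass)
```

In the real benchmark, `run_test` would apply the model's patch and invoke the repo's test runner inside a fresh container, so stale state cannot leak between attempts.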
The benchmark spawned three variants, each trying to fix problems with the last.
SWE-bench Verified contains 500 hand-validated tasks from the original dataset. Human annotators confirmed each task is solvable and the tests are correct. As of March 2026, the leaderboard shows Claude Opus 4.5 at 80.9%, Claude Opus 4.6 at 80.8%, Gemini 3.1 Pro at 80.6%, and GPT-5.2 at 80.0%. The scores are tightly clustered at the top because the benchmark has effectively saturated.
But the saturation is artificial. OpenAI’s investigation found that nearly 60% of problems their models failed contained fundamentally broken tests. When GPT-5.2 solved 31 tasks classified as “nearly impossible,” the team discovered the model had memorized information from release notes that described the exact fixes. Every frontier model showed similar contamination patterns.
SWE-bench Pro is Scale AI’s response. It draws from 1,865 tasks across multiple programming languages, using GPL-licensed and private codebases that are unlikely to appear in training data. The results are sobering: Claude Opus 4.5 drops to 45.9%. GPT-5.2 falls to around 23% on the private split. These numbers are probably closer to what agents can actually do on code they have never seen.
SWE-bench Live adds 50 new verified issues every month from active repositories. Because the tasks are created after current models’ training cutoffs, contamination is structurally impossible for models already trained. The dataset now includes over 1,565 tasks across 164 repositories, making it the most sustainable option for tracking genuine progress over time.
Why the Scaffold Matters as Much as the Model
Here is a number that should change how you read leaderboards: in February 2026, three different tools running the same Claude Opus 4.5 model produced scores 17 problems apart on the same 731 SWE-bench issues. Same model, different agent frameworks, different results. The scaffold (how the agent manages context, selects tools, retries failures, and structures its workflow) accounts for a significant portion of the final score.
This means SWE-bench does not measure model quality alone. It measures the combined performance of model plus agent architecture. A mediocre model in a well-designed scaffold can outperform a frontier model in a naive wrapper.
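Most of that scaffold variance lives in a loop like the one below: propose a patch, test it, and on failure feed the test log back into the context and retry. Every name here is hypothetical; real scaffolds differ exactly in how they build `context` and decide when to give up.

```python
def run_scaffold(model, task, apply_and_test, max_attempts=3):
    """Minimal agent scaffold: retry with failure feedback in context."""
    context = [f"Task: {task}"]
    for _ in range(max_attempts):
        patch = model(context)                   # propose a candidate patch
        passed, log = apply_and_test(patch)      # run the repo's test suite
        if passed:
            return patch
        context.append(f"Tests failed:\n{log}")  # feed the failure back
    return None
```

Two scaffolds wrapping the same model can diverge simply by choosing different values for `max_attempts` or by truncating `context` differently, which is why identical models produce different leaderboard scores.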
WebArena: Testing Agents That Browse the Web
While SWE-bench focuses on coding, WebArena tests whether AI agents can complete real tasks on websites. It deploys clones of real web applications (e-commerce sites, forums, content management systems, maps, code repositories) and asks agents to do things like “find the cheapest product matching these criteria” or “post a reply to this thread with specific information.”
The benchmark includes 812 tasks across these simulated environments. Each task requires multiple steps: navigating pages, filling forms, clicking buttons, interpreting results, and deciding what to do next. A human baseline scores around 78%.
Progress has been rapid. In 2023, the best agent scored 14.4%. By February 2026, optimized agents reached 61.7% on the full benchmark. That jump came from convergence on a modular architecture: a high-level planner that decomposes tasks, a specialized executor that interacts with pages, and a structured memory that tracks state across steps.
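The planner/executor/memory split can be sketched as a simple control loop. The `planner` and `executor` callables below are placeholders for whatever model calls and browser driver a real agent uses; only the division of labor is the point.

```python
def run_web_agent(goal, planner, executor, max_steps=15):
    """Planner decomposes the task; executor acts on the page;
    memory carries structured state across steps."""
    memory = []
    for _ in range(max_steps):
        step = planner(goal, memory)        # decide the next high-level action
        if step == "DONE":
            break
        observation = executor(step)        # interact with the page
        memory.append((step, observation))  # record what happened
    return memory
```

Keeping the planner free of page-level details is what lets it recover when an individual click or form fill fails: it replans from the memory trace rather than from raw HTML.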
WebArena-Verified is a recent addition from ServiceNow, applying the same hand-validation approach as SWE-bench Verified to remove ambiguous or broken tasks. Optimized Docker images became available in February 2026, making it significantly easier to run evaluations locally.
BrowserGym is the underlying framework that WebArena and several other web benchmarks are built on. It provides a unified environment for running browser-based agent tasks, including MiniWoB (simple web interactions), WebArena (complex web tasks), and WorkArena (enterprise workflows). If you are building a web-browsing agent and want to benchmark it, BrowserGym is where you start.
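BrowserGym exposes its tasks through a gymnasium-style reset/step interface. The stub environment below stands in for a real one so the control loop is visible; actual BrowserGym environments return page observations (DOM, accessibility tree, screenshot) and accept browser actions, and the class here is purely illustrative.

```python
class StubEnv:
    """Stand-in for a BrowserGym-style environment (gymnasium reset/step API)."""
    def reset(self):
        return "page: login form", {}
    def step(self, action):
        done = action == "click('submit')"
        obs = "page: dashboard" if done else "page: login form"
        # obs, reward, terminated, truncated, info
        return obs, float(done), done, False, {}

def run_episode(env, policy, max_steps=20):
    """Drive one agent policy through an environment episode."""
    obs, _ = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        obs, reward, terminated, truncated, _ = env.step(policy(obs))
        total_reward += reward
        if terminated or truncated:
            break
    return total_reward
```

Because the loop only touches `reset` and `step`, the same harness works across MiniWoB, WebArena, and WorkArena tasks; only the observation and action spaces change.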
ST-WebAgentBench adds a safety dimension that the others miss. Published in late 2025, it evaluates whether web agents handle sensitive data appropriately, avoid unauthorized actions, and respect user permissions. An agent that completes a task but leaks personal information along the way would score well on WebArena and fail on ST-WebAgentBench.
AgentBench and General-Purpose Evaluation
Not every agent writes code or browses the web. AgentBench, introduced in 2023 by Tsinghua University researchers, evaluates agents across eight distinct environments: operating systems, databases, knowledge graphs, card games, lateral thinking puzzles, household tasks, web shopping, and web browsing.
The breadth is the point. An agent that excels at coding might fail at navigating a file system or querying a database. AgentBench treats these as fundamentally different capabilities rather than variations of the same skill.
Other notable general-purpose benchmarks include:
GAIA (General AI Assistants) tests multi-step reasoning that requires combining web search, document reading, and logical deduction. Tasks are designed so humans can solve them in minutes but current AI systems struggle with the multi-step planning required.
Tau-bench from Sierra focuses specifically on customer service scenarios. Even GPT-4o succeeded on fewer than 50% of real-world customer service tasks in its evaluation, revealing a significant gap between chatbot performance on general benchmarks and actual business workflows.
The AI Agent Benchmark Compendium maintained by Philipp Schmid catalogs over 50 agent benchmarks, organized by category: function calling, tool use, coding, computer interaction, and general reasoning. If a benchmark exists for your agent’s domain, it is probably listed there.
Building Evals That Actually Matter for Your Agent
Public benchmarks tell you which model is trending upward. They do not tell you whether an agent will work for your specific use case. Anthropic’s engineering guide on agent evals offers a practical framework that has influenced how most teams think about evaluation in 2026.
Their core recommendation: start with 20 to 50 test cases drawn from real failures. Not synthetic scenarios, not comprehensive coverage, just the things that actually broke in production. Early in an agent’s lifecycle, changes have large effect sizes, so small sample sizes are sufficient to detect real improvements.
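A suite like that can be as simple as a list of named (input, check) pairs replayed against the agent. This is a generic sketch under those assumptions, not any particular framework's API.

```python
def run_suite(agent, cases):
    """Replay failure-derived test cases; return pass rate and failing names.

    Each case is a (name, prompt, check) tuple, where check maps the
    agent's output to pass/fail.
    """
    results = {name: bool(check(agent(prompt))) for name, prompt, check in cases}
    pass_rate = sum(results.values()) / len(results)
    failing = [name for name, ok in results.items() if not ok]
    return pass_rate, failing
```

Run it in CI on every prompt or model change; with 20 to 50 cases and large early effect sizes, even a few new failures is a meaningful regression signal.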
The guide describes three evaluation dimensions that apply regardless of what your agent does:
Don’t break things. The agent should not corrupt data, trigger unintended side effects, or leave systems in an inconsistent state. For a code agent, this means the existing test suite still passes. For a customer service agent, this means it does not make unauthorized promises.
Do what was asked. The agent should complete the requested task. This sounds obvious, but measuring it is tricky when tasks are open-ended. Descript, the video editing company, built evals that use a second LLM as judge, periodically calibrated against human graders, to assess whether editing actions matched user intent.
Do it well. Beyond completing the task, was the result high quality? A code patch that passes tests but introduces technical debt scores differently than an elegant solution. This dimension is the hardest to automate and where LLM-as-judge approaches add the most value.
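The three dimensions translate directly into a per-task scorecard. The `judge` callable below is a placeholder for whatever LLM-as-judge or human grading step you plug in; the field names are illustrative.

```python
def score_task(outcome, judge):
    """Score one agent run on the three dimensions from Anthropic's guide."""
    return {
        # Don't break things: cheap, fully automatable signal
        "no_breakage": outcome["tests_pass"],
        # Do what was asked: usually needs an LLM or human judge
        "task_done": judge("Did the result satisfy the request?", outcome),
        # Do it well: the hardest dimension to automate
        "quality": judge("Is the result well-crafted, not just passing?", outcome),
    }
```

Keeping the three signals separate, rather than collapsing them into one score, tells you whether a regression is a safety problem, a capability problem, or a quality problem.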
The payoff of investing in evals is speed. Teams with automated evaluation suites can upgrade models in days. Teams without them face weeks of manual testing every time a new model version ships. In a market where model capabilities improve quarterly, that difference determines whether you stay current or fall behind.
Frequently Asked Questions
What is SWE-bench and why does it matter for AI agents?
SWE-bench is a benchmark that tests whether AI agents can resolve real GitHub issues from open-source repositories. It matters because it measures practical coding ability rather than theoretical knowledge. However, the original SWE-bench Verified dataset has been contaminated by training data, with scores inflated by 30+ percentage points. SWE-bench Pro and SWE-bench Live are more reliable alternatives.
Why are SWE-bench Verified scores unreliable in 2026?
OpenAI found that every frontier model, including GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash, had been trained on SWE-bench Verified solutions. Models scoring 80% on Verified dropped to roughly 23-46% on SWE-bench Pro, which uses code that was not in training data. OpenAI has stopped reporting Verified scores and recommends SWE-bench Pro instead.
What is the difference between SWE-bench Verified and SWE-bench Pro?
SWE-bench Verified contains 500 Python-only tasks from public open-source repos, which have been contaminated by model training. SWE-bench Pro uses 1,865 multi-language tasks from GPL-licensed and private codebases that are unlikely to appear in training data. The performance gap is dramatic: Claude Opus 4.5 scores 80.9% on Verified but only 45.9% on Pro.
How does WebArena evaluate AI agents differently from SWE-bench?
WebArena tests whether AI agents can complete multi-step tasks on real websites, such as e-commerce shopping, forum posting, and content management. It includes 812 tasks requiring navigation, form filling, and decision making. While SWE-bench focuses exclusively on code editing, WebArena evaluates the kind of web-based work that many real-world agents need to perform.
How should I evaluate my own AI agent if public benchmarks don’t fit?
Anthropic recommends starting with 20 to 50 test cases drawn from real failures your agent has encountered. Focus on three dimensions: the agent should not break existing functionality, it should complete the requested task, and the results should be high quality. Using an LLM as a judge, calibrated against periodic human review, scales this evaluation without requiring manual checking of every output.
