
The best AI model in the world completes fewer than one in three professional work tasks correctly on its first try. That is the headline finding from APEX-Agents, a benchmark released by Mercor in January 2026 that tested frontier models on 480 real tasks from investment banking, management consulting, and corporate law. Claude Opus 4.6 topped the leaderboard at 29.8%. GPT-5.2 scored 23%. Most models clustered around 18% or below. These are not toy problems or academic puzzles. They are the actual daily work of professionals at Goldman Sachs, McKinsey, and Latham & Watkins.

The benchmark arrives at an awkward moment. Venture capital firms are declaring 2026 “the year of AGI.” Enterprise budgets for AI agents have tripled. CEOs keep telling earnings calls that AI will replace lawyers, accountants, and analysts. APEX-Agents does not say they are wrong. It says they are early, and the gap between demo and production is larger than most people think.

Related: What Are AI Agents? A Practical Guide

What APEX-Agents Actually Tests

Most AI benchmarks test narrow skills: answer a trivia question, write a function, solve a math problem. APEX-Agents tests something different: can an AI agent do an entire professional’s job across a realistic digital workspace?

The benchmark was built by Mercor’s research team working with over 200 domain experts, including practitioners from Goldman Sachs, McKinsey, and Cravath, Swaine & Moore. They created 33 simulated work environments containing 480 tasks spread across three professions:

Investment Banking Analyst. Tasks include building financial models from scattered source documents, drafting pitch book sections, analyzing comparable company data across multiple spreadsheets, and producing client-ready memos. The agent has to pull data from PDFs, emails, Slack threads, and Google Drive documents, then synthesize it into deliverables that a managing director would actually use.

Management Consultant. Tasks cover market sizing exercises, competitive analysis, process mapping, and deck writing. A typical task might require reading a client brief in one document, pulling financial data from a spreadsheet, cross-referencing Slack messages from “team members” for context, and producing a structured recommendation with supporting data.

Corporate Lawyer. Tasks include contract review, due diligence analysis, regulatory research, and memo drafting. The agent navigates document rooms, cross-references clauses across multiple agreements, and applies legal standards to specific factual scenarios.

Each task has 1 to 10 pass/fail rubrics written by the professionals who would actually evaluate this work. The standard is “client-ready,” not “technically present.” If the analysis is there but formatted wrong, it fails. If the data is correct but the reasoning is unsupported, it fails.
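As a minimal sketch of how this grading works, assuming (as the examples above suggest) that a task counts as solved only when every rubric passes, the logic looks like this; the rubric texts are invented for illustration, not Mercor's actual schema:

```python
from dataclasses import dataclass

@dataclass
class RubricResult:
    criterion: str  # e.g. "memo follows the firm's standard structure"
    passed: bool

def task_passes(rubrics: list[RubricResult]) -> bool:
    """Client-ready means every rubric passes: correct analysis
    wrapped in the wrong format still fails the whole task."""
    return all(r.passed for r in rubrics)

# Illustrative: the numbers are right, the presentation is not.
results = [
    RubricResult("Revenue figures match the data-room spreadsheet", True),
    RubricResult("Reasoning is supported by cited sources", True),
    RubricResult("Memo follows the required structure", False),
]
print(task_passes(results))  # False
```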

Crucially, web search is disabled. The agent can only work with the information available in its simulated environment, just like a junior analyst on a deal team who has to work with what is in the data room, not what is on the internet. This forces genuine information retrieval and synthesis rather than googling for answers.

Related: AI Agent Testing: How to QA Non-Deterministic Systems

The Leaderboard: How Every Major Model Performed

Here is the full leaderboard as of February 2026, measured by Pass@1 (first-try success rate):

| Model | Overall Score | Investment Banking | Consulting | Law |
|---|---|---|---|---|
| Claude Opus 4.6 (Thinking=High) | 29.8% ± 3.6% | 33% | 33% | 24% |
| Gemini 3 Flash (Thinking=High) | 24.0% ± 3.3% | - | 19% | 26% |
| GPT-5.2 (Thinking=High) | 23.0% ± 3.2% | 27% | 23% | - |
| Claude Opus 4.5 (Thinking=High) | 18.4% ± 2.9% | - | - | - |
| Gemini 3 Pro (Thinking=High) | 18.4% ± 2.7% | - | - | 24% |
| GPT-5 (Thinking=High) | 18.3% ± 2.9% | 27% | - | - |
| Grok 4 | 15.2% ± 2.4% | - | - | - |
| Kimi K2.5 | 14.4% ± 2.25% | - | - | - |
| Applied Compute 01-15 | 8.5% ± 2.0% | - | - | - |

A few patterns stand out.

The best model fails 70% of the time. Opus 4.6 leads the pack, but a 30% success rate would get a junior analyst fired within a week. In professional services, “wrong seven out of ten times” is not a rounding error; it is a liability.

Models differ by domain. Opus 4.6 dominates banking and consulting (33% each) but drops to 24% on law. Gemini 3 Flash, by contrast, scores best on law (26%) and worst on consulting (19%). GPT-5.2 is strongest in banking (27%). No model is uniformly good across all three professions.

Multiple attempts help, but not enough. When models get eight tries at each task (Pass@8), accuracy climbs to roughly 40% (the standard Pass@k estimator behind these numbers is sketched below). That is better, but production environments do not give agents eight attempts. A client expects the answer once, correctly.

Progress is real, but the hard part is ahead. A year ago, the best models scored 5 to 10% on comparable tasks. The jump to 24 to 30% is genuine progress. But the gap from 30% to 90% (minimum viable for autonomous professional work) is a different kind of problem than the gap from 5% to 30%.
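To make the Pass@1 and Pass@8 arithmetic concrete, here is the standard unbiased Pass@k estimator popularized by OpenAI's HumanEval work. Whether Mercor computes Pass@k exactly this way is not stated here, so treat this as the conventional formulation rather than APEX's documented method:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k: given n recorded attempts at a task, c of which
    passed, the probability that at least one of k sampled attempts passes."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so any k-sample contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# A task solved on 3 of 8 recorded runs:
print(pass_at_k(8, 3, 1))  # 0.375 -> its contribution to Pass@1
print(pass_at_k(8, 3, 8))  # 1.0   -> its contribution to Pass@8
```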

Why Professional Work Breaks AI Agents

The APEX-Agents results reveal specific failure modes that explain why agents struggle with professional tasks, even when they perform well on isolated benchmarks like coding or math.

Cross-Application Information Tracking

Mercor CEO Brendan Foody put it directly: “The way we do our jobs isn’t with one individual giving us all the context in one place. In real life, you’re operating across Slack and Google Drive.”

A typical APEX task requires the agent to find a relevant email thread, open three linked documents, pull numbers from a spreadsheet, cross-reference them with a PDF, and synthesize everything into a memo. Most models lose track of information when switching between these contexts. They “forget” data they saw two steps ago, or they fail to locate the right file in the first place.

This is fundamentally different from, say, a coding benchmark where all the relevant context sits in one repository. Professional work is distributed across tools and formats by default.
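One mitigation practitioners reach for is an explicit scratchpad: persist every finding, with its source, the moment it is seen, instead of trusting the model's conversational memory across tool switches. The sketch below is hypothetical (the task, file names, channels, and figures are all invented), not how any APEX agent was built:

```python
# Hypothetical trace of one multi-application task. Every hop between
# tools is a chance to drop state; persisting findings with provenance
# avoids relying on the model's memory of what it saw two steps ago.
scratchpad: dict[str, str] = {}

def remember(key: str, value: str, source: str) -> None:
    """Record a finding and where it came from."""
    scratchpad[key] = f"{value} (source: {source})"

remember("deal_size", "$450M", "email thread 'Project Falcon kickoff'")          # step 1: email
remember("peer_median_multiple", "8.2x EBITDA", "comps.xlsx, 'Peers' tab")       # step 2: spreadsheet
remember("client_preference", "exclude distressed peers", "#proj-falcon Slack")  # step 3: chat

# Step 4: draft the memo from the scratchpad, citing sources, rather
# than from whatever survived the context switches.
memo = "\n".join(f"- {key}: {note}" for key, note in scratchpad.items())
print(memo)
```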

Ambiguity and Judgment Calls

Coding tasks have correct answers. A function either passes the test suite or it does not. Professional services work is loaded with ambiguity. “Analyze the competitive landscape” does not specify how many competitors to include, what metrics to prioritize, or how deep the analysis should go. Professionals use experience and client knowledge to scope these decisions. Current models either over-scope (producing irrelevant detail) or under-scope (missing critical factors).

Format and Presentation Standards

Investment banks and consulting firms have exacting presentation standards. Numbers need specific decimal places. Charts need particular formatting. Memos follow rigid structures. Models frequently produce correct analysis wrapped in formatting that would never pass review. The APEX rubrics account for this, which is why scores are lower than raw “correctness” metrics would suggest.

Related: Long-Horizon AI Agents: What Sequoia's AGI Thesis Gets Right (and Wrong)

What This Means for Enterprise AI Strategy

APEX-Agents is not a doom-and-gloom story. It is a calibration tool. Here is what the data actually tells decision-makers.

AI Agents Are Copilots, Not Replacements (Yet)

The 30% first-try success rate for the best model means agents can handle some professional tasks autonomously, but most still need human review. The practical deployment model is not “replace the analyst” but “give the analyst an assistant that handles the 30% of tasks it can do, and drafts the other 70% for human editing.”

A Workday study found that 37% of time saved through AI use was lost to rework: correcting or verifying what the AI produced. That ratio matters. If checking the agent’s work takes longer than doing it yourself, the productivity gain disappears.
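The break-even arithmetic is worth writing out. The first function below applies the Workday rework figure directly; the second is an illustrative expected-cost model (our assumption, not taken from the study or the benchmark) of when delegation stops paying:

```python
def net_hours_saved(gross_hours_saved: float, rework_fraction: float = 0.37) -> float:
    """Hours actually saved once rework (correcting or verifying AI
    output) is subtracted. 0.37 is the Workday study's figure."""
    return gross_hours_saved * (1.0 - rework_fraction)

print(net_hours_saved(10.0))  # ~6.3: ten apparent hours saved keeps about 6.3

def worth_delegating(task_hours: float, review_hours: float,
                     first_try_success: float) -> bool:
    """Illustrative expected-cost check: delegate only if review time
    plus the expected cost of redoing failures beats doing it yourself."""
    expected_cost = review_hours + (1.0 - first_try_success) * task_hours
    return expected_cost < task_hours

# At a 30% first-try success rate, even modest review time erases the gain.
print(worth_delegating(task_hours=4.0, review_hours=1.5, first_try_success=0.30))  # False
```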

Domain-Specific Tuning Matters

The fact that models perform differently across banking, consulting, and law means generic “AI for enterprise” deployments will underperform domain-specific ones. An organization deploying agents for legal work should evaluate models on legal benchmarks, not overall scores. Gemini 3 Flash’s 26% on law tasks outperforms its 19% on consulting by a wide margin.
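In code, “evaluate on domain scores, not overall scores” is a one-liner over the leaderboard data reported above (blank cells are omitted; scores shown as fractions):

```python
# Domain-level Pass@1 scores from the APEX leaderboard above.
apex_scores: dict[str, dict[str, float]] = {
    "Claude Opus 4.6": {"banking": 0.33, "consulting": 0.33, "law": 0.24},
    "Gemini 3 Flash":  {"consulting": 0.19, "law": 0.26},
    "GPT-5.2":         {"banking": 0.27, "consulting": 0.23},
    "Gemini 3 Pro":    {"law": 0.24},
    "GPT-5":           {"banking": 0.27},
}

def best_model_for(domain: str) -> str:
    """Pick the model with the highest published score in one domain."""
    candidates = {m: s[domain] for m, s in apex_scores.items() if domain in s}
    return max(candidates, key=candidates.get)

print(best_model_for("law"))         # Gemini 3 Flash, despite a lower overall score
print(best_model_for("consulting"))  # Claude Opus 4.6
```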

Benchmarks Should Drive Procurement

Before APEX-Agents, enterprise buyers had limited ways to compare agent capabilities for knowledge work. SWE-Bench covers coding. GAIA covers general assistants. APEX now provides the first standardized benchmark for professional services. Procurement teams should require agent vendors to disclose their APEX scores (or equivalent domain-specific benchmarks) before signing contracts.

Related: AI Agent Adoption in 2026: The Numbers Behind the Hype

The Progress Curve Is What Matters

The absolute numbers (24%, 30%) are less important than the trajectory. Models went from 5-10% to 24-30% in roughly one year. If that pace continues, 50%+ accuracy by early 2027 is plausible. But benchmark progress does not always translate linearly to real-world capability. The last 20% (from 80% to “production-ready”) is historically the hardest in any engineering discipline.

Mercor publishes an open leaderboard and invites AI labs to submit their models. This means the benchmark will keep pace with model improvements. Watch the scores quarterly to track whether the “agents replace knowledge workers” thesis is converging with reality.

Frequently Asked Questions

What is the APEX-Agents benchmark?

APEX-Agents is a benchmark created by Mercor that tests AI agents on 480 real professional tasks from investment banking, management consulting, and corporate law. Tasks were designed by practitioners from Goldman Sachs, McKinsey, and Cravath, and require agents to navigate realistic digital workspaces with documents, spreadsheets, emails, and chat.

What is the highest score on the APEX-Agents benchmark?

As of February 2026, Claude Opus 4.6 holds the top score at 29.8% (Pass@1). Gemini 3 Flash scored 24.0% and GPT-5.2 scored 23.0%. No model has exceeded 33% in any single professional category.

Can AI agents replace lawyers and consultants?

Not yet. The APEX-Agents benchmark shows that the best AI models complete fewer than 30% of real professional tasks correctly on their first try. While progress is rapid (scores were 5-10% a year ago), models still struggle with cross-application information tracking, ambiguity, and professional formatting standards.

How does APEX-Agents differ from other AI benchmarks?

Unlike coding benchmarks (SWE-Bench) or general assistant benchmarks (GAIA), APEX-Agents simulates full professional workspaces with Slack, Google Drive, spreadsheets, PDFs, and email. It tests multi-step, multi-application tasks that mirror actual professional work rather than isolated skills.

Who created the APEX-Agents benchmark?

APEX-Agents was created by Mercor in collaboration with over 200 domain experts from leading professional services firms including Goldman Sachs, McKinsey, and Cravath, Swaine & Moore. The research paper is published on arXiv.