
An AI agent that solves 90% of tasks sounds production-ready. It is not. If those 10% failures are random and unpredictable, you cannot automate anything that matters with it. A human still has to watch every run, which defeats the entire point of an autonomous agent. Princeton researchers Sayash Kapoor and Arvind Narayanan (the authors of AI Snake Oil) just published a 66-page paper that formalizes this problem. Their core argument: the industry is measuring AI agents wrong, and that measurement failure is hiding a reliability crisis.

The paper, “Towards a Science of AI Agent Reliability,” proposes twelve concrete metrics across four dimensions. It evaluates 14 frontier models across two benchmarks spanning 18 months of releases. The headline finding: accuracy has improved dramatically over that period, but reliability has barely budged.

Related: AI Agent Reliability: Why OpenAI and Anthropic Are Becoming Consultants

Why Average Accuracy Hides the Real Problem

Every major agent benchmark works the same way. Run the agent on a set of tasks, count successes, divide by total, report the percentage. SWE-bench Verified, WebArena, GAIA: they all reduce agent performance to a single accuracy number. That number goes up every few months, and press releases celebrate the improvement.

The Princeton team argues this is fundamentally misleading. A single success rate compresses away everything that matters for deployment. An agent with 85% accuracy could be one that reliably solves 85% of task types and consistently fails on the remaining 15%. That agent is useful: you can route the hard tasks to humans and automate the rest. Or it could be an agent that solves a different random 85% each time you run it. Same accuracy, completely different operational profile. The second agent is nearly useless for automation because you never know which tasks will fail.

As the authors write: “An agent that succeeds on 90% of tasks but fails unpredictably on the remaining 10% may be a useful assistant yet an unacceptable autonomous system.” This distinction between assistant (human-in-the-loop) and autonomous system (no human watching) is the crux of the paper.

Related: AI Agent Benchmarks Explained: What SWE-bench, WebArena, and AgentBench Actually Measure

The Four Dimensions of Agent Reliability

The framework borrows from decades of reliability engineering in aviation, nuclear power, automotive systems, and industrial process control. These industries figured out long ago that “it works most of the time” is not a safety standard. The Princeton team maps those principles onto AI agents with four dimensions and twelve metrics.

Consistency: Does It Behave the Same Way Twice?

Consistency measures whether an agent produces the same outcomes, follows the same reasoning paths, and uses the same resources when given the same task multiple times. The paper breaks this into three metrics:

  • Outcome consistency: Does the agent reach the same final answer across runs?
  • Trajectory consistency: Does it take the same steps to get there?
  • Resource consistency: Does it use similar amounts of computation and tool calls?

An agent with high accuracy but low outcome consistency solves a task on some runs and fails on others, with no external change. This is the most dangerous reliability gap because it means you cannot predict whether a given run will succeed.
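To make this concrete, here is a minimal sketch of an outcome-consistency metric. It is an illustration of the idea, not the paper's exact formula: for each task, measure how often repeated runs agree with the majority outcome, then average across tasks. The function name and data shape are my own.

```python
import statistics

def outcome_consistency(run_outcomes):
    """Fraction of runs that agree with the majority outcome, averaged over tasks.

    `run_outcomes` maps each task id to a list of booleans (success/failure)
    from repeated runs of the same task. A score of 1.0 means every task
    resolved the same way on every run; 0.5 means a coin flip.
    (Illustrative metric, not the paper's exact definition.)
    """
    scores = []
    for outcomes in run_outcomes.values():
        majority = max(outcomes.count(True), outcomes.count(False))
        scores.append(majority / len(outcomes))
    return statistics.mean(scores)

# Two agents with identical 50% accuracy but opposite consistency profiles:
stable   = {"t1": [True] * 10, "t2": [False] * 10}
unstable = {"t1": [True, False] * 5, "t2": [False, True] * 5}

print(outcome_consistency(stable))    # 1.0 — each task resolves the same way every run
print(outcome_consistency(unstable))  # 0.5 — every run is a coin flip
```

Both agents report 50% accuracy on a single-run benchmark, yet only the first one can be deployed with a sensible routing policy.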

Robustness: Does It Break When Conditions Shift?

Robustness tests whether the agent holds up when inputs or conditions change slightly. The three metrics here cover:

  • Fault tolerance: Can it recover from tool failures or API errors?
  • Environmental robustness: Does it still work when external conditions change?
  • Prompt robustness: Does rephrasing the same request cause different outcomes?

Most agents are evaluated under ideal conditions. The paper tests what happens when conditions are merely realistic.
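A prompt-robustness check can be sketched in a few lines: run the agent on several rephrasings of the same request and report the fraction that produce the expected outcome. The agent callable and phrasings below are toy stand-ins, not from the paper.

```python
def prompt_robustness(agent, paraphrases, expected):
    """Fraction of rephrasings of the same request that yield the expected
    outcome. `agent` is any callable mapping prompt -> answer. (Illustrative.)
    """
    hits = sum(1 for p in paraphrases if agent(p) == expected)
    return hits / len(paraphrases)

def toy_agent(prompt):
    # Stand-in agent that breaks on one phrasing of the same question.
    return "4" if "2+2" in prompt else "unknown"

variants = ["What is 2+2?", "Compute 2+2.", "Add two and two."]
print(prompt_robustness(toy_agent, variants, "4"))  # ≈ 0.67 — one phrasing breaks it
```

A score well below 1.0 on semantically identical prompts is exactly the kind of gap a single-phrasing benchmark never surfaces.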

Predictability: Does It Know When It Is Wrong?

This dimension measures whether an agent knows its own limits: does its stated confidence track its actual chance of success, and can it signal uncertainty when it is likely to fail? The metrics include:

  • Calibration: When the agent says it is 80% confident, does it succeed about 80% of the time?
  • Discrimination: Can it distinguish between tasks it will solve and tasks it will not?
  • Brier score: The mean squared difference between stated confidence and actual outcomes, which folds calibration and discrimination into a single number.

An agent with good calibration is operationally valuable even when it fails because it warns you in advance. An agent with poor calibration fails silently, which is far worse.
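The Brier score is straightforward to compute from (confidence, outcome) pairs. Below is the standard formulation; the paper may use a variant, and the example agents are hypothetical.

```python
import statistics

def brier_score(predictions):
    """Mean squared error between stated confidence and actual outcome.

    `predictions` is a list of (confidence in [0, 1], succeeded) pairs.
    Lower is better; 0.0 is a perfect forecaster.
    """
    return statistics.mean((conf - float(ok)) ** 2 for conf, ok in predictions)

# Well-calibrated agent: claims 80% confidence, succeeds 4 of 5 times.
calibrated = [(0.8, True)] * 4 + [(0.8, False)]
# Overconfident agent: claims 95% confidence, same 4-of-5 success rate.
overconfident = [(0.95, True)] * 4 + [(0.95, False)]

print(brier_score(calibrated))     # ≈ 0.16
print(brier_score(overconfident))  # ≈ 0.18 — worse, despite identical accuracy
```

Note that both agents have the same accuracy; only the quality of their self-reported confidence differs, and the score penalizes the one that misleads you.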

Safety: How Bad Are the Failures?

The final dimension asks not “does it fail?” but “how much damage does the failure cause?” Two metrics:

  • Compliance: Does the agent follow task constraints and boundaries?
  • Harm severity: When it fails, are the consequences minor (wrong formatting) or catastrophic (deleted production database)?

A well-calibrated agent with bounded failure severity is deployable even at moderate accuracy. An uncalibrated agent with unbounded failure severity is dangerous even at 95% accuracy.
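One way to operationalize this is a severity-weighted failure rate instead of a raw one. The weights below are hypothetical, not from the paper, but they show why a 5% cosmetic-failure rate can be safer than a 1% failure rate that includes data loss.

```python
# Hypothetical severity weights per failure class (not from the paper):
SEVERITY = {"formatting": 0.1, "wrong_answer": 1.0, "data_loss": 100.0}

def expected_harm(failures, total_runs):
    """Failures per run, weighted by how much damage each failure class causes."""
    return sum(SEVERITY[kind] for kind in failures) / total_runs

# Agent A: 95% accuracy, every failure is cosmetic.
print(expected_harm(["formatting"] * 5, 100))  # ≈ 0.005
# Agent B: 99% accuracy, but its one failure deleted data.
print(expected_harm(["data_loss"], 100))       # 1.0 — 200x the expected harm
```

On raw accuracy, Agent B wins; on expected harm, it is two orders of magnitude worse.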

What 14 Models Actually Scored

The paper evaluates models spanning 18 months of releases from OpenAI, Anthropic, and Google. The lineup includes GPT-4o mini, GPT-4 Turbo, o1, and GPT-5.2 from OpenAI; Claude 3.5 Haiku, Claude 3.7 Sonnet, Claude 4.5 Sonnet, and Claude 4.5 Opus from Anthropic; and Gemini 2 Flash through Gemini 3 Pro from Google. They tested on two complementary benchmarks, and the results are sobering.

Accuracy Up, Reliability Flat

Over the 18-month window, raw accuracy improved substantially across all model families. Meanwhile, reliability as measured by the twelve metrics improved only modestly. The gap between capability and reliability is widening, not closing. More capable agents are not automatically more reliable agents.

Bigger Models Are Not Uniformly Better

This was the most counterintuitive finding. Scaling up model size within a family improves some reliability dimensions (calibration, robustness) but actively hurts others (consistency). The paper’s explanation: larger models have more strategies available for solving a given task. That is good for accuracy. But it means they take different approaches across runs, which tanks consistency.

A smaller model that only knows one way to solve a task will solve it the same way every time. A larger model with five approaches will pick a different one each run. Both might reach 90% accuracy, but the smaller model will be more consistent.

Calibration Is the Bright Spot

Claude models in particular demonstrated stronger calibration across both benchmarks, maintaining well-aligned confidence estimates even as task complexity increased. This matters because calibration is the dimension that makes human-agent collaboration practical: if the agent reliably flags uncertainty, a human can intervene on exactly the right tasks.

Consistency and Predictability Need the Most Work

The paper isolates consistency and predictability as the dimensions requiring “immediate research focus.” Outcome consistency remains low across all models, meaning agents that can solve a task often fail to solve it consistently. Agents show better distribution consistency (they pick similar action types across runs) but poor sequence consistency (the order of operations varies wildly).
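The distinction between distribution and sequence consistency can be illustrated with a few lines of Python. These are simplified overlap measures of my own construction, not the paper's definitions:

```python
from collections import Counter

def distribution_consistency(run_a, run_b):
    """Overlap between the multisets of action types in two runs (0..1)."""
    shared = sum((Counter(run_a) & Counter(run_b)).values())
    return 2 * shared / (len(run_a) + len(run_b))

def sequence_consistency(run_a, run_b):
    """Exact positional agreement between two action sequences (0..1)."""
    matches = sum(a == b for a, b in zip(run_a, run_b))
    return matches / max(len(run_a), len(run_b))

run1 = ["search", "read", "edit", "test"]
run2 = ["read", "search", "test", "edit"]  # same actions, different order

print(distribution_consistency(run1, run2))  # 1.0 — identical action mix
print(sequence_consistency(run1, run2))      # 0.0 — no step in the same place
```

Two runs can use exactly the same toolkit while agreeing on nothing about the order of operations, which is the pattern the paper reports.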

Related: AI Agent Testing: How to QA Non-Deterministic Systems

What This Means for Agent Builders

The Princeton framework is not purely academic. It reshapes how you should evaluate agents before deploying them.

Run Every Eval Multiple Times

If you are only running your evaluation suite once and reporting the number, you are hiding your own reliability gaps. The paper recommends running each task multiple times and reporting variance alongside accuracy. A task that passes 9 out of 10 times is not the same as a task that passes 10 out of 10, even though a single run cannot tell them apart.
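In practice that means reporting a pass rate per task over repeated runs, instead of one pooled number. A minimal sketch (task names are hypothetical):

```python
def per_task_pass_rates(results):
    """Repeated-run pass rate for each task, rather than one pooled accuracy.

    `results` maps task id -> list of booleans from repeated runs.
    """
    return {task: sum(runs) / len(runs) for task, runs in results.items()}

results = {
    "fix_typo":   [True] * 10,           # passes every time
    "refactor":   [True] * 9 + [False],  # flaky: 9 of 10
    "migrate_db": [False] * 10,          # consistently fails
}

print(per_task_pass_rates(results))
# {'fix_typo': 1.0, 'refactor': 0.9, 'migrate_db': 0.0}
# A single pooled run would report one accuracy number and hide the flaky task.
```

The flaky task is the one that needs attention, and only the per-task view exposes it.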

Match the Dimension to the Use Case

Not all four dimensions matter equally for every deployment. A coding agent where a human reviews every diff cares most about calibration (does it flag uncertain code?). An autonomous customer support agent cares most about consistency and safety (does it give the same answer to the same question, and does it avoid harmful responses?). A data pipeline agent cares most about robustness (does it recover from API failures?).

The paper suggests building reliability profiles per deployment rather than chasing a single reliability score.

Track Reliability Across Model Updates

The paper found that model updates within the same family can improve accuracy while degrading specific reliability dimensions. If you update from Claude 3.7 Sonnet to Claude 4.5 Sonnet, your accuracy will likely improve. But your consistency profile might change in ways that break your specific workflow. Test for this explicitly.

Use Calibration as a Routing Signal

The strongest practical takeaway: if your model has good calibration, use its confidence scores to route tasks. High-confidence tasks go to the autonomous path. Low-confidence tasks go to human review. This is not new advice, but the paper provides the first rigorous evidence that calibration varies significantly across models and should inform model selection.
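A confidence-based router is only a few lines. The threshold, agent interface, and mock agent below are assumptions for illustration; a real deployment would tune the threshold per model and task mix.

```python
# Hypothetical routing threshold; tune per deployment and model.
CONFIDENCE_THRESHOLD = 0.8

def route(task, agent):
    """Send high-confidence work down the autonomous path, the rest to
    human review. `agent` returns (answer, self-reported confidence).
    """
    answer, confidence = agent(task)
    if confidence >= CONFIDENCE_THRESHOLD:
        return ("autonomous", answer)
    return ("human_review", answer)

def mock_agent(task):
    # Stand-in: a real agent would return a calibrated confidence score.
    return ("answer for " + task, 0.92 if "easy" in task else 0.55)

print(route("easy ticket", mock_agent)[0])     # autonomous
print(route("weird edge case", mock_agent)[0]) # human_review
```

This routing only works as well as the model's calibration: with an overconfident model, the autonomous path silently absorbs the failures the threshold was meant to catch.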

The Princeton team plans to launch an AI agent reliability index to track these metrics systematically across releases. Until that exists, the twelve metrics in the paper give you a concrete checklist for evaluating any agent before deployment.

Related: Goal vs. Rules: The AI Agent Safety Benchmark Where 71% of Models Break Constraints

Frequently Asked Questions

What are the four dimensions of AI agent reliability?

Princeton researchers define four dimensions: consistency (same results across runs), robustness (stability under changing conditions), predictability (calibrated uncertainty signals), and safety (bounded failure severity). Together the dimensions comprise twelve concrete metrics.

Why is AI agent accuracy not enough to measure reliability?

A single accuracy score hides critical operational information. An agent with 90% accuracy that fails on a random 10% of tasks each run is far less useful than one that consistently fails on the same 10% of tasks. The first cannot be automated safely; the second can, because you can route its failure cases to humans.

Do bigger AI models mean more reliable AI agents?

Not uniformly. Princeton’s study of 14 models found that scaling up improves calibration and robustness but can decrease consistency. Larger models have more solution strategies available, which increases run-to-run variability. Smaller models within the same family sometimes score higher on consistency because they have fewer approaches to choose from.

How should teams evaluate AI agent reliability before deployment?

Run every evaluation multiple times and report variance alongside accuracy. Build reliability profiles matched to your specific use case. Track reliability across model updates, since accuracy improvements do not guarantee reliability improvements. Use calibration scores to route tasks between autonomous and human-reviewed paths.

Which AI agent reliability dimension needs the most improvement in 2026?

Consistency and predictability. Princeton’s study found that outcome consistency remains low across all 14 evaluated models: agents that can solve a task often fail to solve it consistently across runs. Predictability (how well agents signal their own uncertainty) also lags behind improvements in raw accuracy.