AI Agent Reliability: Why OpenAI and Anthropic Are Becoming Consultants

Photo from Pexels (free license) Source

OpenAI currently employs around 60 consulting engineers who customize models with customer data and build AI agents on-site. They have another 200+ staff in technical support. And they are hiring hundreds more. Anthropic is doing the same thing from the other side: signing a $200 million multi-year deal with Snowflake, embedding with ServiceNow, and publishing deployment playbooks distilled from dozens of customer implementations.

The two most advanced AI model companies in the world are becoming consulting firms. That is the single most honest signal about where AI agent reliability stands in 2026. The models are good. Getting them to work reliably inside a real enterprise is a different problem entirely.

The Concierge Phase of Enterprise AI

There is a term in startup circles for this: the concierge phase. You do things manually for your first customers because the product cannot do them alone yet. OpenAI calls their team “Forward Deployed Engineers” (FDEs), borrowing Palantir’s playbook. These engineers sit with customer teams, build custom agent workflows, and optimize model behavior for specific use cases. Anthropic takes the same approach with what they call direct customer implementation work.

This is not a temporary growth hack. It is structural. LangChain’s State of Agent Engineering survey of 1,340 teams shows that 57.3% now run agents in production, up from 51% a year ago. But 32% of those teams cite quality as their primary barrier: accuracy, consistency, tone adherence. Another 20% cite latency. For enterprises with 2,000+ employees, security jumps to the second-biggest concern at 24.9%.

The gap between “the model can do this” and “the model does this reliably at scale” requires human engineering to close. That is why both companies are sending people, not just shipping APIs.

What Actually Breaks

French retailer Fnac tested models from both OpenAI and Google for customer support and hit the same wall: the agents kept mixing up serial numbers. The model understood the question. It could generate a fluent response. But it pulled the wrong identifier from the context window, routing the customer to the wrong product.

This type of failure is representative. Error rates for autonomous multi-step reasoning have dropped from 8-12% to 3-5%, which sounds acceptable until you run the math on a customer service operation handling 10,000 conversations per day. At 3%, that is 300 wrong answers daily. At 5%, it is 500. Each one potentially damages a customer relationship, triggers a complaint, or creates liability.

The Deployment Playbook Both Companies Agree On

Anthropic published “Building Effective Agents” in late 2024 and has been updating it since. OpenAI documented deployment strategies from seven Frontier customers including HP, Intuit, Oracle, State Farm, Thermo Fisher Scientific, and Uber. Despite different product philosophies, both converge on the same core principles.

Start Simple, Stay Simple

The most counterintuitive finding from both playbooks: the most successful implementations do not use complex frameworks or specialized libraries. They use simple, composable patterns. Anthropic’s guide explicitly warns against reaching for multi-agent architectures when a single prompt chain will do.

Their recommended progression:

Prompt chaining: Break a task into sequential steps with validation between each one. Slower, but each LLM call is simpler and more reliable.
Routing: Classify inputs and direct them to specialized handlers. A customer complaint goes to one workflow, a billing question to another.
Parallelization: Run subtasks simultaneously or generate multiple outputs for comparison. Useful when you need confidence through redundancy.
Orchestrator-workers: A central LLM delegates to specialized workers. Best for unpredictable tasks like multi-file code changes.
Evaluator-optimizer: One LLM generates, another critiques. Iterate until quality meets a threshold.

The pattern to notice: each level adds complexity and latency. Most production deployments that work use levels one and two.

Categorize Tool Risk

The WorkOS Enterprise AI Agent Playbook synthesizes both companies’ guidance into a tool risk framework:

Data tools (lowest risk): Read-only database queries, document parsing, web search. If the agent calls the wrong one, the worst outcome is irrelevant information.
Action tools (medium risk): Sending emails, updating CRM records, escalating tickets. Mistakes here are visible to customers.
Orchestration tools (highest risk): Agent-as-tool deployments where one agent delegates to another. Permissions cascade, and a single misconfigured guardrail can propagate across the entire chain.

Both companies recommend a layered security model: LLM-based guardrails that detect sophisticated prompt injections through reasoning, rules-based protections for known attack patterns, and content safety APIs that flag harmful inputs before they reach the core agent.

What White-Glove Deployments Actually Produce

The companies that went through this hands-on deployment process with OpenAI or Anthropic report numbers that stand out from the general population of AI projects.

Klarna deployed a customer service agent (on OpenAI) that now handles two-thirds of all customer chats. Average resolution time dropped from 11 minutes to 2 minutes. The company reported a $40 million profit improvement attributable to the deployment.

Morgan Stanley achieved a 98% AI adoption rate across the firm, though details on which workflows and agent architectures they use remain under NDA.

BBVA, the Spanish banking group, deployed 2,900 custom agents in five months. Their credit risk team uses AI to assess creditworthiness faster than their previous rule-based system.

Lowe’s improved product tagging accuracy by 20% and error detection by 60% using agents to process unstructured product data.

These results share a common pattern: each company had dedicated engineering support from the model vendor, each scoped their initial deployment to a specific workflow, and each invested heavily in evaluation infrastructure before going live.

The Numbers for Everyone Else

For organizations without white-glove vendor support, the picture is murkier. Deloitte’s State of AI in the Enterprise report and the LangChain survey show 89% of teams have implemented some form of observability for their agents, but only 52% actually run evaluations. That is a dangerous gap: teams are watching their agents but not systematically testing whether the outputs are correct.

75% of enterprise leaders now rank security, compliance, and auditability as the most critical requirements for agent deployment. Three out of four current agentic AI projects have encountered or expect significant security challenges, including prompt injection, unauthorized data exposure, and agents exceeding their intended permissions.

What This Means for Your AI Agent Strategy

The consulting pivot by OpenAI and Anthropic is a pricing signal disguised as a service offering. It means:

Agent reliability is not a model problem. If it were, a model upgrade would fix it. The vendors are sending engineers because the gap is in integration, evaluation, and operational design. Better models will help, but they will not eliminate the need for careful engineering.

Start with one workflow, not a platform. Every successful deployment in the data started narrow. Klarna did customer service. BBVA did credit risk. Lowe’s did product tagging. None of them launched an “AI transformation initiative” that tried to agent-ify everything at once.

Invest in evaluation before you invest in features. Anthropic’s own guidance: start with 20-50 evals drawn from real failures. The teams that skip this step are the ones contributing to Gartner’s prediction that over 40% of agentic AI projects will be canceled by 2027.

Budget for the human bridge. Whether that is vendor consulting hours, internal AI engineers, or third-party integration partners, the human cost of making agents reliable in production is real and will remain real through at least 2027. Plan for it instead of being surprised by it.

The concierge phase will end eventually. Models will get more reliable. Tool-calling error rates will drop below 1%. Evaluation frameworks will mature. But for now, the most capable AI companies in the world are telling you, through their hiring patterns, that agents need human help to work. Believe them.

Frequently Asked Questions

Why are OpenAI and Anthropic sending engineers to enterprise customers?

AI agents frequently fail to work reliably in production environments. OpenAI employs 60+ consulting engineers (and is hiring hundreds more) to customize models with customer data and build agents on-site. Anthropic does similar work directly with customers. The models are capable, but reliable deployment requires human engineering for integration, evaluation, and operational design.

What is the current error rate for AI agents in production?

Error rates for autonomous multi-step AI agent reasoning have fallen from 8-12% to approximately 3-5% in 2026. While this is a significant improvement, at enterprise scale (e.g., 10,000 daily interactions), a 3% error rate still means 300 incorrect responses per day, which is why vendors invest heavily in deployment support.

What are the biggest barriers to AI agent deployment in enterprise?

According to LangChain’s 2026 survey of 1,340 teams, quality (accuracy, consistency, tone) is the #1 barrier at 32%. Latency is second at 20%. For larger enterprises with 2,000+ employees, security rises to the second-biggest concern at 24.9%. 75% of enterprise leaders rank security, compliance, and auditability as the most critical deployment requirements.

Both companies recommend starting with simple, composable patterns rather than complex multi-agent frameworks. Anthropic’s five recommended patterns in order of complexity are: prompt chaining, routing, parallelization, orchestrator-workers, and evaluator-optimizer. Most successful production deployments use only the first two levels.

What ROI have enterprises achieved with white-glove AI agent deployments?

Klarna’s customer service agent handles two-thirds of chats, cutting resolution from 11 to 2 minutes and contributing $40 million in profit improvement. BBVA deployed 2,900 custom agents in five months. Morgan Stanley achieved 98% AI adoption. Lowe’s improved product tagging accuracy by 20% and error detection by 60%. All had dedicated vendor engineering support.

The Concierge Phase of Enterprise AI#

What Actually Breaks#

The Deployment Playbook Both Companies Agree On#

Start Simple, Stay Simple#

Categorize Tool Risk#

What White-Glove Deployments Actually Produce#

The Numbers for Everyone Else#

What This Means for Your AI Agent Strategy#

Frequently Asked Questions#

Why are OpenAI and Anthropic sending engineers to enterprise customers?#

What is the current error rate for AI agents in production?#

What are the biggest barriers to AI agent deployment in enterprise?#

What deployment patterns do OpenAI and Anthropic recommend for AI agents?#

What ROI have enterprises achieved with white-glove AI agent deployments?#