
57.3% of teams building AI agents have them running in production. That is up from 51% a year ago and comes from LangChain’s State of Agent Engineering survey, which collected 1,340 responses from engineers, product managers, and executives between November and December 2025. The number itself is not the story. The story is what those teams are struggling with once the agent is live.

Quality remains the number one barrier at 32%. Latency has emerged as a new pain point. And here is the most telling statistic: 89% of teams have observability (they can see what their agents do), but only 52% run evaluations (they actually test whether agents do the right thing). Teams are watching their agents like hawks while skipping the part where they grade the homework.

Related: What Are AI Agents? A Practical Guide for Business Leaders

Who Responded and Why It Matters

The survey skews toward teams already building. 63% work in technology, 10% in financial services, 6% in healthcare. Nearly half (49%) come from companies with fewer than 100 employees, while 18% represent organizations with over 2,000 people. This is not a random sample of all businesses. It is a snapshot of the people actively building agent systems, which makes the findings more useful, not less.

Enterprise adoption tells a specific story: 67% of organizations with 10,000+ employees already have agents in production, compared to 50% of smaller companies. But smaller companies are closing the gap. 36% of sub-100-employee teams are actively developing with concrete deployment plans, compared to 24% at large enterprises. The small teams move faster. The big teams have more agents already deployed.

These numbers align with G2’s Enterprise AI Agents Report, which independently found 57% of organizations with agents in production. When two different surveys land on nearly the same number, the signal is real.

Related: AI Agent Adoption in 2026: The Numbers Behind the Hype

What Teams Are Building: Coding Agents Win

The official use case rankings put research and summarization first (58%), personal productivity second (53.5%), and customer service third (45.8%). But the most revealing data comes from the write-in responses where people described the agents they use every single day.

Coding agents dominated. Claude Code, Cursor, GitHub Copilot, Amazon Q, Windsurf. These tools appeared far more often than any enterprise use case. Research agents (ChatGPT, Claude, Gemini, Perplexity) came second. Custom internal agents built on LangChain and LangGraph came third, covering everything from QA testing to text-to-SQL to demand planning.

This is a gap between what companies officially deploy and what individual engineers actually rely on. The enterprise-sanctioned agent handles customer tickets. The coding agent that saves each developer two hours a day does not show up in the official project list. Deloitte’s State of AI in the Enterprise report confirms this pattern: 60% of workers now have access to sanctioned AI tools, up from under 40% a year prior. The unsanctioned tools are harder to count.

Primary Deployment by Use Case

When teams pick a single primary use case, the ranking shifts:

  • Customer service: 26.5% (triaging, resolving, accelerating response times)
  • Research and data analysis: 24.4%
  • Internal workflow automation: 18%

For enterprises with 10,000+ employees, internal productivity jumps to the top at 26.8%, with customer service close behind at 24.7%. Large companies are automating internal operations first and customer-facing workflows second. That priority order makes sense: internal mistakes are cheaper than customer-facing ones.

The Observability-Evaluation Gap

This is the finding that should worry engineering leaders most. Agent observability is nearly universal: 89% of teams have implemented some form of it, and 62% have detailed tracing where they can inspect individual agent steps and tool calls. For teams already in production, those numbers climb to 94% and 71.5%.

But evaluation, the practice of systematically testing whether agents produce correct outputs, trails far behind. Only 52.4% run offline evaluations on test sets. Only 37.3% run online evaluations monitoring real-world performance. And 29.5% are not evaluating at all, a number that drops to 22.8% for production deployments but is still startlingly high.

The gap makes a certain kind of sense. Observability answers “what did the agent do?” Evaluation answers “was that the right thing to do?” The first question is easier to instrument. The second requires defining what “right” means for your specific use case, which is genuinely hard for non-deterministic systems.

Related: AI Agent Testing: How to QA Non-Deterministic Systems

How Teams Evaluate (When They Bother)

Among teams that do evaluate, the methods are telling:

  • Human review: 59.8% (still the most trusted approach)
  • LLM-as-judge: 53.3% (using one model to grade another)
  • Traditional ML metrics (ROUGE/BLEU): limited adoption

About 25% of evaluation adopters use both offline and online approaches, which is the gold standard. The rest split between one or the other. LLM-as-judge adoption at 53.3% shows that the practice of using AI to evaluate AI has gone mainstream, even if the methodology is still being refined.
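To make the offline-plus-judge pattern concrete, here is a minimal, dependency-free sketch of an LLM-as-judge evaluation loop. The agent, the test cases, and the judge's string-match heuristic are all illustrative stand-ins: a real judge would prompt a second model with the question, the reference answer, and the candidate answer, then parse a PASS/FAIL verdict.

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    question: str
    reference: str   # expected answer, used for offline grading

def agent(question: str) -> str:
    # Stand-in for the agent under test.
    return {"Capital of France?": "Paris"}.get(question, "I don't know")

def llm_judge(question: str, reference: str, answer: str) -> bool:
    """Stub judge. A real implementation would prompt a second model,
    e.g. 'Question: ... Reference: ... Candidate: ... Reply PASS or FAIL.'
    Here a string match approximates the verdict so the sketch runs offline."""
    return reference.lower() in answer.lower()

def run_offline_eval(cases: list[EvalCase]) -> float:
    # Grade every case on the test set and return the pass rate.
    results = [llm_judge(c.question, c.reference, agent(c.question)) for c in cases]
    return sum(results) / len(results)

cases = [EvalCase("Capital of France?", "Paris"),
         EvalCase("Capital of Spain?", "Madrid")]
print(f"pass rate: {run_offline_eval(cases):.0%}")  # prints "pass rate: 50%"
```

The same loop becomes an online evaluation if `cases` is sampled from production traffic instead of a fixed test set; the 25% of teams doing both are running exactly these two variants.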

Quality, Latency, and the Fading Cost Problem

The barrier landscape has shifted meaningfully since the previous year’s survey.

Quality at 32% remains the top blocker. Hallucinations and output inconsistency are specifically cited by enterprise respondents. This matches Gartner’s prediction that over 40% of agentic AI projects started in 2025 will be canceled by 2027, primarily due to quality and trust failures.

Latency at 20% is the newcomer. As agents move from demos to production workflows where users wait for responses, speed matters. A research agent that takes 45 seconds is fine. A customer service agent that takes 45 seconds loses the customer.

Security rises to second place for enterprises at 24.9%, reflecting the reality that giving agents access to production systems expands the attack surface. Smaller companies rank it lower, likely because they have fewer production systems at risk.

Cost at 18.4% has dropped significantly from the previous year. Model pricing improvements from OpenAI, Anthropic, and open-source alternatives have taken the edge off. This is one of the few genuinely positive developments: the technology is getting cheaper fast enough that cost is becoming a secondary concern.

Frameworks and Models: The Multi-Everything Approach

75%+ of teams use multiple models. OpenAI leads adoption at 67%+, but Claude, Gemini, and open-source models see significant usage. 33% of teams are investing in self-hosted and open-source model infrastructure, and 57% rely on base models with prompt engineering and RAG rather than fine-tuning.

Related: AI Agent Frameworks Compared: LangGraph, CrewAI, AutoGen

On the framework side, the survey shows LangGraph as the most popular low-level orchestration framework with 12 million monthly downloads and production deployments at Uber, Klarna, LinkedIn, and J.P. Morgan. CrewAI claims 60% of Fortune 500 companies and 100,000+ daily agent executions. Microsoft merged AutoGen and Semantic Kernel into a unified Agent Framework.

LangChain’s own team has been refreshingly direct about this: “Use LangGraph for agents, not LangChain.” The original LangChain library is the starting point. LangGraph is the production tool.
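What "low-level orchestration framework" means in practice is easier to see in code. The sketch below is a dependency-free illustration of the state-graph idea behind frameworks like LangGraph: nodes are functions that transform a shared state, and a router decides which node runs next. The node names and routing rule are invented for illustration; LangGraph's actual API (StateGraph, edges, checkpointing) is richer than this.

```python
# Toy state-graph agent loop: nodes transform a shared state dict,
# and route() picks the next node (or None to stop).

def plan(state):
    # Decide which tool steps the agent will take.
    state["steps"] = ["look_up", "answer"]
    return state

def act(state):
    # Execute one step per pass and record it for tracing.
    state["done"] = len(state["steps"]) == 0
    if state["steps"]:
        state.setdefault("log", []).append(state["steps"].pop(0))
    return state

def route(state):
    # Conditional edge: return the next node name, or None to stop.
    if "steps" not in state:
        return "plan"
    return None if state.get("done") else "act"

NODES = {"plan": plan, "act": act}

def run(state):
    node = route(state)
    while node is not None:
        state = NODES[node](state)
        node = route(state)
    return state

final = run({})
print(final["log"])  # prints ['look_up', 'answer']
```

The payoff of the graph structure is that every node transition is an explicit, inspectable step, which is what makes the detailed tracing reported by 62% of teams possible.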

Enterprise Permission Patterns

The survey revealed an interesting split in how organizations handle agent permissions. Larger enterprises (2,000+ employees) lean heavily toward read-only agent permissions, restricting agents to information retrieval and analysis without allowing them to take actions. Smaller companies prioritize tracing and rapid iteration, giving agents write access but instrumenting everything they do.

This aligns with the finding that security is a top-two concern for enterprises. When an agent can read your Salesforce data but cannot modify it, the blast radius of any failure is limited. Regulated industries need agents that can prove exactly what they did and did not do.
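The read-only pattern and the trace-everything pattern can coexist in one gate, as in this minimal sketch. The tool names, the allowlist, and the audit-log shape are illustrative assumptions, not a prescribed design.

```python
# Permission gate: tools on the allowlist execute; everything else is
# denied. Every attempt, allowed or not, lands in the audit log.

READ_ONLY_TOOLS = {"search_records", "get_account"}

audit_log = []

def gated_call(tool: str, fn, *args):
    allowed = tool in READ_ONLY_TOOLS
    audit_log.append({"tool": tool, "allowed": allowed})  # trace everything
    if not allowed:
        return f"denied: '{tool}' is not read-only"
    return fn(*args)

result = gated_call("search_records", lambda q: f"3 hits for {q!r}", "refunds")
blocked = gated_call("update_account", lambda: "written")

print(result)   # prints "3 hits for 'refunds'"
print(blocked)  # prints "denied: 'update_account' is not read-only"
```

For a regulated industry, `audit_log` is the artifact that proves what the agent did and did not do; for a smaller team iterating quickly, the same log is the tracing data, with the allowlist simply widened.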

What This Means for 2026

The LangChain survey captures a field that is past the hype cycle and into the “actually building things” phase. The 57.3% production figure is real, corroborated by independent surveys, and growing. But the quality barrier at 32%, the evaluation gap (89% observability vs. 52% evaluation), and the governance deficit (only 21% of companies have mature agent governance per Deloitte) point to a field that is deploying faster than it is learning to verify.

Three things are worth watching:

The evaluation tooling market will explode. If 89% of teams have observability but only 52% evaluate, there is a massive gap waiting for better tools. Expect LangSmith, Braintrust, Cleanlab, and a wave of startups to compete hard on agent evaluation in 2026.

Coding agents will reshape how we measure “AI agent adoption.” They are already the most-used daily agents by a wide margin, but they do not appear in most enterprise deployment counts. As companies start tracking developer tool usage as AI agent adoption, the headline numbers will change.

The governance reckoning is coming. PwC reports that 78% of organizations plan to increase agent autonomy in the next year. But only 21% have governance models ready for it. That gap will produce the first wave of high-profile agent failures in regulated industries.

Agent engineering is no longer a niche practice. It is a discipline with its own survey data, its own failure modes, and its own rapidly forming set of best practices. The teams that treat it as such, investing in evaluation alongside observability, governance alongside deployment, will be the ones still running agents in production a year from now.

Frequently Asked Questions

What percentage of teams have AI agents in production in 2026?

57.3% of teams surveyed by LangChain have AI agents running in production, up from 51% the previous year. This figure is corroborated by G2’s independent survey, which found 57% production adoption among enterprise respondents.

What is the biggest barrier to AI agent deployment?

Quality and unreliable performance remain the top barrier at 32%, according to LangChain’s State of Agent Engineering survey. Hallucinations and output inconsistency are the most commonly cited quality issues. Latency has emerged as the second major barrier at 20%, while cost concerns have decreased significantly due to model pricing improvements.

How many teams evaluate their AI agents?

Only 52.4% of teams run offline evaluations on test sets, and 37.3% run online evaluations monitoring real-world performance. 29.5% of teams are not evaluating their agents at all. This contrasts sharply with observability adoption at 89%, revealing a significant gap between watching agents and testing them.

What are the most common AI agent use cases?

Research and summarization leads at 58%, followed by personal productivity at 53.5% and customer service at 45.8%. However, write-in responses reveal that coding agents (Claude Code, Cursor, GitHub Copilot) are the most frequently used daily agents by a wide margin, despite not topping the official categories.

What is agent engineering as a discipline?

Agent engineering is an emerging discipline that combines product thinking, software engineering, and data science to build and maintain AI agent systems. It focuses on the iterative refinement of non-deterministic systems, including evaluation, observability, governance, and production deployment of autonomous AI agents.