A single AI agent running at 95% accuracy sounds fine. Chain ten sequential calls together to resolve a complex incident, and that accuracy compounds down to roughly 60%. Dynatrace CTO Bernd Greifeneder demonstrated this math at Perform 2026, and it captures the core problem with agentic AI in production: probabilistic models compound errors across steps, and traditional application performance monitoring has no way to detect the degradation.
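The compounding arithmetic is simple enough to verify yourself. A quick sketch (the 95% figure and ten-step chain are from the talk; the code itself is purely illustrative):

```python
# Per-step accuracy compounds multiplicatively across a chain of
# sequential agent calls. Illustrative arithmetic only.
def chain_accuracy(per_step: float, steps: int) -> float:
    """Probability that every step in an n-step chain succeeds."""
    return per_step ** steps

# One call at 95% looks fine; ten chained calls do not.
print(f"1 step:   {chain_accuracy(0.95, 1):.0%}")   # 95%
print(f"10 steps: {chain_accuracy(0.95, 10):.0%}")  # 60%
```

The model assumes step failures are independent; correlated failures make the real number worse, not better.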
This is why observability for AI agents is no longer just about watching dashboards. It is becoming the control plane: the layer that grounds agent decisions in deterministic data, enforces governance boundaries, and intervenes before cascading failures reach customers. The agentic AI monitoring market reflects this urgency, growing from $550 million in 2025 to a projected $2.05 billion by 2030 at a 30% CAGR.
What Makes Agent Observability Different from Traditional APM
Traditional APM watches requests flow through services. It measures latency, error rates, and throughput. It works because each service does the same thing every time: receive request, process it, return response. AI agents break every one of those assumptions.
Non-deterministic execution paths. The same input produces different tool calls, different reasoning chains, and different outputs across runs. A traditional trace shows a fixed call graph. An agent trace is a tree that branches differently every time.
Multi-step reasoning with compounding errors. An agent researching a topic might call a search API, read three documents, decide two are irrelevant, synthesize the third, and produce a report. A bad decision at step two (discarding the right document) causes a wrong answer at step five. APM sees five successful API calls. Agent observability needs to track the reasoning behind each decision.
Tool orchestration with side effects. Agents call external tools: databases, APIs, browsers, code interpreters. Each tool call can modify state. A customer service agent that processes a refund cannot undo it. You need to see what the agent decided, why, and whether it had the right context before the action happened.
Token economics. Every LLM call costs money. An agent stuck in a reasoning loop can burn through $50 in tokens before anyone notices. Traditional APM tracks compute costs at the infrastructure level. Agent observability tracks token usage per decision step, per model, per agent.
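A minimal sketch of per-step token accounting with a budget guard. The ledger class, flat pricing rate, and agent/step names are all hypothetical, not any vendor's API:

```python
from collections import defaultdict

# Hypothetical token ledger: attributes spend to (agent, model, step)
# and flags a run that blows its budget, e.g. a reasoning loop.
class TokenLedger:
    def __init__(self, budget_usd: float, usd_per_1k_tokens: float = 0.01):
        self.budget_usd = budget_usd
        self.rate = usd_per_1k_tokens / 1000  # USD per token, assumed flat
        self.spend = defaultdict(float)       # (agent, model, step) -> USD

    def record(self, agent: str, model: str, step: str,
               input_tokens: int, output_tokens: int) -> None:
        self.spend[(agent, model, step)] += (input_tokens + output_tokens) * self.rate

    def total(self) -> float:
        return sum(self.spend.values())

    def over_budget(self) -> bool:
        return self.total() > self.budget_usd

ledger = TokenLedger(budget_usd=0.50)
ledger.record("support-agent", "gpt-4o", "classify", 1200, 300)
ledger.record("support-agent", "gpt-4o", "draft_reply", 2500, 800)
print(round(ledger.total(), 3), ledger.over_budget())
```

The point of the per-(agent, model, step) key is that it localizes a cost spike to a single decision step instead of a monthly invoice.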
LangChain’s State of Agent Engineering survey puts numbers on this gap: 89% of teams have some form of observability, but only 52% run evaluations that actually test whether the agent’s output was correct. Teams are watching their agents without grading them.
Dynatrace’s Bet: Determinism Before Autonomy
Dynatrace is making the boldest play in this space, reframing observability from passive visibility into active governance. Their 2026 Pulse of Agentic AI report surveyed organizations running agentic AI and found that 52% cite security and privacy concerns as their top barrier, while 51% point to technical challenges, specifically limited visibility into agent behavior.
The numbers explain why 69% of agentic decisions still undergo human review: organizations do not trust agents to act alone because they cannot see agent behavior well enough to verify its decisions.
Dynatrace Intelligence: Three Layers of Deterministic Grounding
At Perform 2026, Dynatrace announced an architecture that places deterministic analysis before any generative AI invocation:
Root Cause Agent analyzes causal dependencies using their Smartscape topology graph across millions of entity relationships. No LLM involved. Pure deterministic graph traversal.
Analytics Engine transforms data in their Grail data lakehouse into context-rich, AI-optimized information. Structured data, not probabilistic inference.
Forecasting Agent runs predictive analytics across millions of metrics simultaneously using Davis AI, their deterministic AI engine.
Only after these three layers provide structured context does a generative model get involved. The results, according to Dynatrace benchmarks: 12x higher success rates compared to LLM-only approaches, 3x faster resolution times, and 50% reduction in token costs.
Steve Tack, Dynatrace’s Chief Product Officer, framed the shift directly: observability is no longer just about visibility. It is about “shaping when and how action is taken.” When every agentic interaction is unique, the monitoring system must decide whether the agent should proceed, pause, or escalate.
Real Results: United Airlines
United Airlines consolidated 800 applications onto Dynatrace and uses it to monitor their operations. Head of IT Operations Ramiro Zavala reported that incident response went from involving “upwards of 250 people” to minutes-long resolution, contributing to two consecutive years of best-on-record operational performance and a number one industry ranking for on-time departures. When CareSource integrated Dynatrace with ServiceNow, they saw a 45% reduction in mean time to resolution and 35% increase in self-healing through closed-loop incident management.
OpenTelemetry GenAI: The Emerging Standard
While Dynatrace builds a proprietary control plane, the open-source world is converging on OpenTelemetry GenAI semantic conventions as the vendor-neutral standard for agent tracing.
OpenTelemetry already dominates traditional observability. Now their GenAI working group has defined semantic conventions specifically for agent systems, including:
create_agent spans that capture agent initialization with attributes like gen_ai.agent.name, gen_ai.agent.id, and gen_ai.request.model.
invoke_agent spans that trace each agent invocation with conversation IDs, input/output messages, tool definitions, system instructions, and token usage (gen_ai.usage.input_tokens / gen_ai.usage.output_tokens).
execute_tool spans for individual tool calls within agent workflows, linking each external action back to the agent decision that triggered it.
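A minimal sketch of the attribute schema an invoke_agent span carries. A real setup would attach these via the OpenTelemetry SDK's span API; the attribute names follow the GenAI semantic conventions quoted above, but the helper function and its values are hypothetical:

```python
# Hypothetical helper that assembles the attribute dict for an
# invoke_agent span using OpenTelemetry GenAI semantic convention names.
# In production these would be set via span.set_attribute() on a live span.
def invoke_agent_attributes(agent_name: str, agent_id: str, model: str,
                            conversation_id: str,
                            input_tokens: int, output_tokens: int) -> dict:
    return {
        "gen_ai.agent.name": agent_name,
        "gen_ai.agent.id": agent_id,
        "gen_ai.request.model": model,
        "gen_ai.conversation.id": conversation_id,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }

attrs = invoke_agent_attributes("research-agent", "agent-42", "gpt-4o",
                                "conv-7", 1800, 420)
print(attrs["gen_ai.agent.name"], attrs["gen_ai.usage.output_tokens"])
```

Because the keys are standardized, any convention-compliant backend can aggregate token usage or group traces by agent without custom mapping.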
This matters because it lets you instrument an agent built with LangGraph, trace it through Langfuse, and analyze it in Datadog, all using the same attribute schema. 85% of organizations already use some form of GenAI for observability according to Elastic’s 2026 observability trends report, and 89% of production users consider compliance with full OpenTelemetry specifications “very important” or higher.
The Tooling Landscape: Who Monitors the Agents
The observability tooling market for AI agents splits into three tiers: AI-native tools built specifically for LLM applications, traditional APM vendors adding agent capabilities, and open-source frameworks.
AI-Native: Langfuse, LangSmith, Arize Phoenix
Langfuse is the most widely adopted open-source LLM observability tool. It provides tracing, evaluations, prompt management, and cost tracking. Self-hosted deployment gives teams full control over where data lives, which matters for regulated industries bound by the GDPR (DSGVO) or data residency requirements. Langfuse traces capture the full agent execution tree: every LLM call, every tool invocation, every decision point.
LangSmith integrates tightly with LangChain and LangGraph, making it the default for teams in that ecosystem. It combines tracing with dataset-based evaluation, so you can both watch what agents do and systematically test whether they do it right. The integration is the selling point and the limitation: if you are not using LangChain, LangSmith’s value drops.
Arize Phoenix focuses on deep agent evaluation, capturing complete multi-step traces that let teams assess how agents make decisions over time. It is open-source and self-hosted, with stronger support for agent-specific evaluation compared to Langfuse.
Traditional APM: Datadog, New Relic, Elastic
Datadog LLM Observability added native support for OpenTelemetry GenAI semantic conventions, making it the natural choice for teams already running Datadog. It provides service maps across interconnected agents, token cost tracking, and integration with existing infrastructure monitoring. If your ops team already lives in Datadog, adding LLM Observability avoids vendor sprawl.
New Relic AI Monitoring takes a similar approach: extend your existing New Relic deployment with AI-specific traces, metrics, and dashboards. The advantage is correlation. You can see an agent’s LLM call latency alongside the database query it triggered alongside the infrastructure metrics of the host running it, all in one timeline.
Elastic Observability leverages their search heritage for log-based agent analysis. Their 2026 prediction: within two years, 98% of organizations will use GenAI for observability tasks, up from 85% today.
Choosing the Right Tool
| Scenario | Recommended Tool | Why |
|---|---|---|
| LangChain/LangGraph stack | LangSmith | Native integration, eval + tracing |
| Data residency requirements | Langfuse (self-hosted) | Full data control, open source |
| Already running Datadog/New Relic | Add their LLM module | Avoids vendor sprawl, correlates infra |
| Deep agent evaluation focus | Arize Phoenix | Multi-step decision analysis |
| Vendor-neutral, portable | OpenTelemetry + any backend | Standard semantic conventions |
From Monitoring to Control Plane: What Changes
The shift from “observability” to “control plane” is not marketing language. It describes a concrete architectural change in how organizations govern autonomous agents.
Traditional monitoring is reactive. Something breaks, an alert fires, a human investigates. Agent observability as a control plane is proactive: the monitoring layer actively participates in agent execution by providing context, enforcing boundaries, and triggering interventions.
Three Functions of an Observability Control Plane
Context grounding. The observability layer feeds real-time system state into agent decisions. Dynatrace’s approach is the clearest example: their Root Cause Agent provides the LLM with deterministic causal data before it generates a response. Without this grounding, the LLM hallucinates about system state. With it, the LLM reasons over facts.
Governance enforcement. The control plane monitors agent actions against policy. If an agent attempts an action outside its authorized scope, the observability layer can block the action, log the attempt, and escalate to a human. Dynatrace’s 2026 Pulse report found that 44% of organizations still rely on manual monitoring for this, creating operational bottlenecks that a control plane automates.
Cascading failure detection. When agents chain calls, each step’s output becomes the next step’s input. The control plane tracks confidence and quality across the chain, detecting when compounding errors push the overall success probability below an acceptable threshold. This is the math Greifeneder demonstrated: 95% per step becomes 60% by step ten. The control plane catches the degradation at step three, not step ten.
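One way to sketch that early intervention, assuming each step emits a confidence estimate (from eval scores, tool-call success, or self-reported uncertainty; the guard function and threshold are illustrative):

```python
# Sketch of a chain guard: multiply per-step confidence estimates and
# pause the agent once cumulative success probability drops below a
# threshold, instead of letting errors compound to the end of the chain.
def check_chain(step_confidences, threshold: float = 0.85):
    cumulative = 1.0
    for step, conf in enumerate(step_confidences, start=1):
        cumulative *= conf
        if cumulative < threshold:
            return ("escalate", step, cumulative)  # intervene here, not at step ten
    return ("proceed", len(step_confidences), cumulative)

# At 95% per step, cumulative confidence falls below 0.85 by step four.
action, step, cum = check_chain([0.95] * 10)
print(action, step, round(cum, 4))
```

The threshold is a policy decision: set it from the human-review accuracy bar the organization already applies, not from what the agent happens to achieve.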
The 90-Day Implementation Path
Dynatrace’s report outlines a phased approach that applies regardless of which tooling you choose:
Days 1-30: Establish governance boundaries and baseline instrumentation. Define which agent actions require human approval. Instrument all LLM calls with OpenTelemetry GenAI conventions. Capture token usage, latency, and tool call success rates.
Days 31-60: Build human-in-the-loop workflows with data-quality checks. Use the observability data to identify which agent decisions are reliable enough to automate and which still need human review.
Days 61-90: Scale proven use cases with incrementally higher autonomy levels. Remove human checkpoints only where observability data confirms consistent quality.
The report’s core principle: “Autonomy only scales with trust.” Observability is what generates that trust.
What to Monitor: The Six Signals of Agent Health
If you are instrumenting an agent system today, these are the signals that separate agent observability from traditional APM:
Token cost per task. Not total cost, but cost broken down by agent, by task type, by step. A coding agent spending $0.50 per code review is fine. The same agent spending $8 because it entered a reasoning loop is not.
Tool call success rate. What percentage of external tool calls succeed on the first attempt? Retries compound latency and cost. A tool call failure rate above 10% usually indicates a prompt or schema issue, not an infrastructure issue.
Decision chain length. How many reasoning steps does the agent take per task? Longer chains correlate with lower accuracy (the 95%-to-60% problem). Track the median and 95th percentile.
Hallucination rate on grounded tasks. For tasks where the correct answer is verifiable (database lookups, API responses), what percentage of agent outputs contradict the source data? This is the most direct measure of agent reliability.
Human escalation rate. How often does the agent escalate to a human? Trending upward means the agent is hitting more edge cases. Trending downward might mean improved capability or decreased caution. Both signals require investigation.
Latency per decision step. End-to-end latency is not enough. Break it down by step to identify bottlenecks: is the agent slow because of the LLM call, the tool call, or because it is reasoning through too many options?
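The six signals reduce to a per-task record plus a few aggregates. The field names and sample values below are illustrative, not a standard schema:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical per-task record covering the six agent-health signals.
@dataclass
class TaskRecord:
    token_cost_usd: float
    tool_calls: int
    tool_call_failures: int
    chain_length: int                  # reasoning steps taken
    escalated: bool                    # handed off to a human
    step_latencies_ms: List[float] = field(default_factory=list)

def tool_success_rate(records) -> float:
    """Fraction of tool calls that succeeded on the first attempt."""
    calls = sum(r.tool_calls for r in records)
    if calls == 0:
        return 1.0
    return 1 - sum(r.tool_call_failures for r in records) / calls

def p95_chain_length(records) -> int:
    """Nearest-rank 95th percentile of reasoning-chain length."""
    lengths = sorted(r.chain_length for r in records)
    return lengths[min(len(lengths) - 1, int(0.95 * len(lengths)))]

records = [
    TaskRecord(0.42, 5, 0, 4, False, [300.0, 1200.0]),
    TaskRecord(0.55, 3, 1, 6, False, [280.0, 900.0]),
    TaskRecord(7.80, 2, 0, 12, True, [450.0]),  # runaway chain, escalated
]
print(tool_success_rate(records), p95_chain_length(records))
```

Even this toy aggregate surfaces the pattern the prose describes: the $7.80 task is also the longest chain and the one that escalated, which is exactly the correlation worth alerting on.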
Frequently Asked Questions
What is agentic AI observability?
Agentic AI observability is the practice of monitoring autonomous AI agents in production by tracking their reasoning steps, tool calls, token usage, decision quality, and outcomes. Unlike traditional APM which watches request-response patterns, agent observability must handle non-deterministic execution paths, multi-step reasoning chains, and compounding error rates. It is increasingly treated as a control plane that actively governs agent behavior rather than passively monitoring it.
Why is observability called the control plane for AI agents?
Observability becomes a control plane when it actively participates in agent execution rather than just watching. It provides context grounding (feeding real-time system state into agent decisions), enforces governance (blocking unauthorized actions), and detects cascading failures (catching compounding errors across chained agent calls). Dynatrace demonstrated that a single agent at 95% accuracy drops to 60% by step ten of a chain. The observability control plane catches this degradation early and intervenes.
What tools are used for AI agent observability?
AI-native tools include Langfuse (open-source, self-hosted), LangSmith (best for LangChain/LangGraph stacks), and Arize Phoenix (deep agent evaluation). Traditional APM vendors like Datadog, New Relic, and Elastic have added LLM observability modules. OpenTelemetry GenAI semantic conventions provide a vendor-neutral standard for instrumenting agents. Enterprise platforms like Dynatrace Intelligence offer deterministic grounding layers that combine observability with active agent governance.
How big is the AI observability market?
The agentic AI monitoring and observability tools market was valued at $550 million in 2025 and is projected to reach $2.05 billion by 2030, growing at a 30% CAGR. The broader observability market is expected to grow from $3.35 billion in 2026 to $6.93 billion by 2031. This growth is driven by the fact that 72% of organizations now run 2-10 agentic AI initiatives, but 44% still rely on manual monitoring methods.
What metrics should you track for AI agent monitoring?
The six key signals for agent observability are: token cost per task (broken down by agent and step), tool call success rate (first-attempt success percentage), decision chain length (reasoning steps per task), hallucination rate on grounded tasks (where answers are verifiable), human escalation rate (trending direction matters more than absolute number), and latency per decision step (to identify bottlenecks in LLM calls, tool calls, or reasoning).
