The team that wrote the original RAG paper at Meta in 2020 now says standalone RAG is a dead end for enterprise use cases. Douwe Kiela, who led that research, founded Contextual AI and launched Agent Composer in January 2026: a platform that wraps retrieval into agentic loops where the system decides when, what, and how often to retrieve. One manufacturer cut root-cause analysis from 8 hours to 20 minutes. A logistics provider saw 60x faster issue resolution. These are not benchmarks on synthetic datasets. They are production numbers from companies like Qualcomm and Advantest.

What Agent Composer Actually Does

Agent Composer is not another RAG wrapper. It is an orchestration layer that turns knowledge-intensive engineering workflows into autonomous agents. The platform ships three ways to build agents:

Pre-built templates cover common enterprise patterns: root-cause analysis (sensor data parsing, log correlation, failure diagnosis), deep research across technical documentation, compliance checking against regulatory requirements, and structured extraction from unstructured documents. These templates are production-ready out of the box for aerospace, semiconductors, manufacturing, and logistics.

A natural language prompt builder generates a working agent architecture from a text description. Describe what you need in plain English, and Agent Composer scaffolds the retrieval strategy, tool connections, and reasoning chain.

A visual drag-and-drop canvas lets engineers compose custom logic with specialized integrations, mixing strict rules (compliance gates, data validation, approval workflows) with dynamic reasoning for exploratory analysis.

The key architectural insight: all components are jointly optimized as a single system. Document understanding, retrieval, reranking, generation, and evaluation are not stitched together from different vendors. They share gradients and training signals, which is why Contextual AI’s Grounded Language Model (GLM) scores 88% on the FACTS factuality benchmark, beating Gemini 2.0 Flash (84.6%), Claude 3.5 Sonnet (79.4%), and GPT-4o (78.8%).

From RAG to Agents: The Architecture That Makes It Work

Traditional RAG follows a rigid pipeline: query goes in, documents come back, LLM generates a response. It works for simple Q&A. It falls apart when the answer requires reasoning across multiple sources, conditional retrieval, or multi-step analysis.
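That rigid pipeline can be sketched in a few lines. Everything here is a toy stand-in (the `embed`, `retrieve`, and `generate` functions are illustrative, not any vendor's API): one retrieval pass, then generation, with no chance to reconsider.

```python
# Minimal sketch of a traditional one-shot RAG pipeline.
# All helpers are toy stand-ins, not real retrieval or LLM APIs.

def embed(text: str) -> list[float]:
    # Toy embedding: letter-frequency vector (real systems use a model).
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Rank documents by dot product with the query embedding.
    q = embed(query)
    def score(d: str) -> float:
        return sum(a * b for a, b in zip(q, embed(d)))
    return sorted(docs, key=score, reverse=True)[:k]

def generate(query: str, context: list[str]) -> str:
    # Stand-in for an LLM call: stitches the retrieved context into a prompt.
    return f"Answer to {query!r} using: {context}"

docs = ["pump seal failure log", "sensor calibration spec", "holiday schedule"]
answer = generate("why did the pump fail?", retrieve("pump seal failure", docs))
```

Note the shape of the failure mode: retrieval happens exactly once, whether or not the returned documents actually answer the question.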

Active Retrieval Changes the Game

Agent Composer introduces what Kiela calls “RAG 2.0.” Instead of retrieving once and hoping for the best, agents decide dynamically:

  • When to retrieve: A simple factual question might not trigger retrieval at all if the model is confident. A complex root-cause analysis triggers multiple retrieval rounds across sensor logs, maintenance records, and engineering specs.
  • What to retrieve: The system routes queries to the right data sources, whether that is a vector store, a SQL database, a web search, or a custom API endpoint.
  • Whether to course-correct: If retrieved documents do not answer the question, the agent reformulates the query and tries again, rather than hallucinating an answer.
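Those three decisions can be sketched as a control loop. This is a hypothetical illustration of the active-retrieval pattern, not the Agent Composer API; the routing keywords and corpora are invented for the example.

```python
# Sketch of an active-retrieval loop: the agent decides whether to
# retrieve, routes to a source, and reformulates on a miss.
# All names and data are illustrative.

def route(query: str) -> str:
    # "What to retrieve": pick a source via simple keyword routing.
    if "sensor" in query or "log" in query:
        return "telemetry_db"
    if "spec" in query:
        return "doc_store"
    return "web_search"

def search(source: str, query: str) -> list[str]:
    # Stand-in for real connectors (vector store, SQL, web, custom API).
    corpora = {
        "telemetry_db": ["sensor 7 overheated at 14:02"],
        "doc_store": ["cooling spec rev C"],
        "web_search": [],
    }
    return [d for d in corpora[source] if any(w in d for w in query.split())]

def answer(query: str, confident: bool, max_rounds: int = 3) -> str:
    # "When to retrieve": skip retrieval entirely if the model is confident.
    if confident:
        return f"(parametric) answer to {query!r}"
    for _ in range(max_rounds):
        docs = search(route(query), query)
        if docs:
            return f"grounded answer using {docs}"
        # "Whether to course-correct": reformulate instead of hallucinating.
        query = query + " sensor"
    return "unable to find supporting documents"
```

The loop structure is the point: retrieval is a decision made per step, not a fixed first stage.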

The Grounded Language Model

The GLM, built on Meta Llama 3.3, is trained specifically to favor retrieved context over parametric knowledge. When the model generates a response, it provides inline attributions citing exactly which source documents support each claim. This is not a post-hoc citation layer bolted on top. The grounding behavior is baked into the model weights through joint training.
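The practical consequence of inline attribution is that each claim in a response carries its own supporting sources. A minimal sketch of that response shape, with field names that are purely illustrative (not Contextual AI's actual schema):

```python
# Hypothetical data structure for a grounded response where every
# claim cites the documents that support it. Field names are invented.
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    source_ids: list[str]  # documents supporting this specific claim

@dataclass
class GroundedResponse:
    claims: list[Claim]

    def render(self) -> str:
        # Render each claim with bracketed inline citations.
        return " ".join(
            f"{c.text} [{', '.join(c.source_ids)}]" for c in self.claims
        )

resp = GroundedResponse(claims=[
    Claim("The seal failed due to thermal cycling.", ["maint-042"]),
    Claim("Replacement is specified in rev C.", ["spec-7", "maint-042"]),
])
```

Claim-level citations like these are auditable per statement, which is what distinguishes them from a single bibliography appended after the answer.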

On the RAG-QA Arena benchmark, the full Contextual AI stack scores 71.2%, a 4.4-point improvement over the next best system (Cohere + Claude 3.5 Sonnet at 66.8%). On document understanding (OmniDocBench), it hits 87.0, beating LlamaParse Premium by 4.6%.

Hybrid Agentic Behavior

Most agent frameworks force a choice: deterministic workflows or fully autonomous reasoning. Agent Composer mixes both. Compliance checks, data validation steps, and approval gates follow strict rules. Exploratory analysis, cross-document reasoning, and hypothesis generation use dynamic planning. This hybrid approach matters in regulated industries where you cannot let an agent freestyle through a compliance audit but still want it to reason creatively about root-cause analysis.
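The hybrid pattern amounts to wrapping a free-form reasoning step in hard-coded gates. A minimal sketch, with all function names and thresholds invented for illustration:

```python
# Hybrid agentic control flow: deterministic gates run first and can
# hard-fail; only then does a dynamic reasoning step execute.
# All names, fields, and thresholds are illustrative.

def compliance_gate(record: dict) -> None:
    # Strict rule: enforced in code, never delegated to model judgment.
    if not record.get("operator_certified"):
        raise PermissionError("uncertified operator: blocked by policy")

def validate(record: dict) -> None:
    # Strict rule: reject malformed input before any reasoning happens.
    if "sensor_readings" not in record:
        raise ValueError("missing sensor data")

def dynamic_analysis(record: dict) -> str:
    # Stand-in for an LLM-driven exploratory step (hypothesis generation,
    # cross-document reasoning); here just a placeholder heuristic.
    peak = max(record["sensor_readings"])
    return "investigate cooling loop" if peak > 90 else "no anomaly"

def run(record: dict) -> str:
    compliance_gate(record)          # deterministic
    validate(record)                 # deterministic
    return dynamic_analysis(record)  # dynamic reasoning

result = run({"operator_certified": True, "sensor_readings": [71, 96, 88]})
```

The design choice is that the gates raise exceptions rather than return scores: a compliance failure halts the agent outright instead of becoming one more signal the planner can reason its way around.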

Production Numbers That Actually Matter

Benchmarks are one thing. Production deployments are another. Here is what Agent Composer delivers in the real world.

Qualcomm: Thousands of Engineers, Millions of Pages

Qualcomm deployed Contextual AI across its Customer Engineering organization. The system ingests millions of pages of multimodal content (PDFs, HTML, Excel) and handles tens of thousands of annual support cases. New documentation becomes available within 24 hours of publication. Qualcomm’s VP of Engineering, Yogi Chiniga, described the challenge as requiring “more than a basic AI assistant” capable of understanding “the technical depth and specialization of our work.”

Manufacturing: 8 Hours to 20 Minutes

An advanced manufacturer uses Agent Composer for root-cause analysis. Before: engineers manually parsed sensor data, correlated logs across systems, and diagnosed failures in a process that took roughly 8 hours. After: the agent handles sensor data parsing, log correlation, and failure diagnosis in 20 minutes. That is a 96% reduction in time-to-diagnosis.

Rocket Propulsion: When It Actually Is Rocket Science

Contextual AI’s benchmark use case for aerospace shows what is possible for technically dense workflows:

| Task | Before | After |
| --- | --- | --- |
| Test telemetry analysis | 4 hours | 20 minutes |
| Technical Q&A across engineering docs | 4 hours | 10 minutes |
| Test code creation | 4-8 hours | 30-60 minutes |
| Test Readiness Review package assembly | 8-10 hours | 1-2 hours |

Advantest, a major test equipment manufacturer, has rolled out Agent Composer to multiple teams and select end customers for test code generation and customer engineering workflows.

How It Compares to Building It Yourself

The obvious question: why not use LangChain, LlamaIndex, or another open-source framework to build the same thing?

You can. Many teams try. Most get stuck at the “impressive demo, unreliable in production” stage. The difference comes down to three things:

Joint optimization vs. component assembly. With LangChain or LlamaIndex, you pick a retriever, a reranker, an LLM, and a generation strategy from different providers. Each component is optimized independently. Agent Composer optimizes the entire pipeline end-to-end, which is why its factuality scores beat systems that use individually stronger components.

Enterprise readiness vs. engineering effort. Agent Composer ships with SOC2 Type II certification, HIPAA compliance, SAML/SSO, role-based access control, and VPC deployment options. Building equivalent security and compliance infrastructure on top of an open-source framework takes months of engineering time.

Domain engineering vs. AI engineering. LangChain and LlamaIndex target ML engineers who think in embeddings and prompt templates. Agent Composer targets domain engineers, the semiconductor designers, aerospace engineers, and chemical researchers who know the problem domain but should not need to understand vector databases to build an agent.

The pricing reflects this positioning. The self-serve tier starts with $25 in free credits and pay-as-you-go pricing: $3 per 1,000 pages for text parsing, $0.05 per million tokens for reranking, and $3/$15 per million input/output tokens for generation. Enterprise pricing is custom.
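To make the pay-as-you-go rates concrete, here is a back-of-envelope cost calculator using the figures listed above; the workload numbers in the example are hypothetical.

```python
# Cost estimate from the listed pay-as-you-go rates:
# $3 per 1,000 parsed pages, $0.05 per 1M reranked tokens,
# $3 / $15 per 1M input / output generation tokens.

def estimate_cost(pages: int, rerank_tokens: int,
                  input_tokens: int, output_tokens: int) -> float:
    parse = pages / 1_000 * 3.00
    rerank = rerank_tokens / 1_000_000 * 0.05
    gen = (input_tokens / 1_000_000 * 3.00
           + output_tokens / 1_000_000 * 15.00)
    return round(parse + rerank + gen, 2)

# Hypothetical monthly workload: 10,000 pages parsed, 5M tokens reranked,
# 2M input and 0.5M output generation tokens.
cost = estimate_cost(10_000, 5_000_000, 2_000_000, 500_000)
# parse $30.00 + rerank $0.25 + generation $13.50 = $43.75
```

At this scale, parsing dominates the bill, which suggests the $25 free credit is sized for prototyping on a modest document set rather than a full corpus ingest.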

What This Means for Enterprise AI Teams

Contextual AI’s bet is that the future of enterprise AI is not general-purpose chatbots but specialized agents that automate expert-level knowledge work. Agent Composer is the first production platform from the team that invented the underlying retrieval technique, and the early results suggest the approach works.

If your organization is building RAG pipelines that need to do more than simple Q&A, or if your engineers spend hours on analysis tasks that follow predictable patterns, Agent Composer is worth evaluating. The $25 free tier makes it low-risk to prototype. The real question is whether the jointly optimized approach delivers enough accuracy improvement over open-source alternatives to justify the platform lock-in.

For teams already deep into LangChain or LlamaIndex: watch the benchmarks. If Contextual AI’s factuality advantage holds across more diverse workloads, the “build vs. buy” math changes fast.

Frequently Asked Questions

What is Contextual AI Agent Composer?

Agent Composer is an enterprise platform from Contextual AI that turns RAG pipelines into production-grade AI agents. It provides pre-built templates, a natural language builder, and a visual canvas for creating agents that automate knowledge-intensive tasks like root-cause analysis, compliance checking, and technical research.

Who founded Contextual AI and what is their connection to RAG?

Contextual AI was founded by Douwe Kiela and Amanpreet Singh, both former researchers at Meta AI (FAIR). Kiela led the team that published the original RAG paper in 2020, which introduced retrieval-augmented generation as a technique for grounding LLM outputs in external knowledge. The company has raised approximately $100M in funding.

How does Agent Composer compare to LangChain and LlamaIndex?

Unlike LangChain and LlamaIndex, which are open-source frameworks where you assemble components from different providers, Agent Composer jointly optimizes all pipeline components (retrieval, reranking, generation, evaluation) as a single system. This produces higher factuality scores (88% on FACTS benchmark). It also ships with SOC2 Type II, HIPAA compliance, and VPC deployment out of the box.

What are the real-world performance results of Agent Composer?

Production deployments show significant time savings: one manufacturer reduced root-cause analysis from 8 hours to 20 minutes, a logistics provider achieved 60x faster issue resolution, and Qualcomm deployed it across thousands of engineers handling tens of thousands of annual support cases.

How much does Contextual AI Agent Composer cost?

The self-serve tier starts with $25 in free credits. Pay-as-you-go pricing includes $3 per 1,000 pages for text parsing, $0.05 per million tokens for reranking, and $3/$15 per million input/output tokens for generation. Enterprise pricing requires contacting sales.