
Mastra’s observational memory scores 84.23% on LongMemEval with gpt-4o, beating RAG-based memory (80.05%) by over 4 points and outperforming Supermemory’s previous state of the art by 2.6 points. With gpt-5-mini, it hits 94.87%. The technique compresses raw conversation history 5-40x, needs no vector database, and works with Anthropic and OpenAI prompt caching out of the box.

The core idea is simple: instead of storing raw message history and retrieving it via embeddings, an Observer agent watches the conversation in real time and writes compressed observation notes. When those notes pile up, a Reflector agent garbage-collects the stale ones. The result is a memory system that acts like a sharp note-taker rather than a tape recorder.

Related: Context Engineering: The Architecture Pattern Replacing Prompt Engineering

Why RAG Falls Short for Agent Memory

RAG works well for static knowledge retrieval: query a vector database, pull the top-k chunks, inject them into the prompt. For long-running agent sessions, though, RAG has three structural problems that observational memory solves.

Retrieval noise. A RAG pipeline searching over conversation history returns chunks by embedding similarity, not by relevance to the agent’s current task. If a user discussed three different projects across 50 messages, the retrieval might surface fragments from the wrong project. Mastra’s own benchmarks on LongMemEval confirm this: their RAG implementation scored 80.05% with gpt-4o, while observational memory hit 84.23% on the same model.

Token bloat. Each RAG retrieval dumps 1,500-3,000 tokens of context into the prompt, regardless of whether the agent needs all of it. Multiply that by multiple retrievals per turn, and you are burning 10,000+ tokens on context that may or may not help. Observational memory’s 5-40x compression means the same information fits in a fraction of the tokens.

No prompt caching. RAG results change with every query, which means the prompt prefix changes too. That kills prompt caching on both Anthropic and OpenAI’s APIs. Observational memory appends observations sequentially, so the prefix stays stable across turns. Full cache hits on every turn until the next observation cycle.
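To see why an append-only prefix matters for caching, here is a minimal sketch. `buildPrompt` is a hypothetical helper, not a Mastra API; the point is that because observations are only ever appended, the prompt up to the last observation is byte-identical across turns, which is exactly what provider-side prompt caches key on.

```typescript
type Observation = string;

// Hypothetical prompt builder: system prompt, then observations in
// insertion order, then the current user turn.
function buildPrompt(system: string, observations: Observation[], userTurn: string): string {
  return [system, ...observations, userTurn].join("\n");
}

const system = "You are a helpful agent.";
const obs: Observation[] = ["[red] User prefers TypeScript."];

const turn1 = buildPrompt(system, obs, "Turn 1");
obs.push("[yellow] Discussed Playwright MCP."); // appended, never reordered
const turn2 = buildPrompt(system, obs, "Turn 2");

// The prefix shared with turn 1 is unchanged, so it stays cacheable:
const sharedPrefix = [system, "[red] User prefers TypeScript."].join("\n");
console.log(turn1.startsWith(sharedPrefix)); // true
console.log(turn2.startsWith(sharedPrefix)); // true
```

Contrast this with RAG, where re-retrieved chunks reshuffle the prefix on every query and invalidate the cache.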

The Mem0 research team reported similar findings: their memory approach (which also uses compression rather than raw retrieval) achieved a 26% accuracy boost over baseline RAG on conversational benchmarks. The pattern is clear across implementations: compressed, structured memory outperforms raw retrieval for conversational agents.

How Observer-Reflector Architecture Works

Observational memory runs two background agents alongside your primary agent. Neither changes how the user interacts with the system; both operate behind the scenes, managing context.

The Observer

The Observer watches the conversation between the user and the primary agent. When the raw message history crosses a configurable threshold (default: 30,000 tokens), the Observer activates and compresses the messages into observation notes.

These notes are not summaries. They are structured observations with priority tags:

  • Red priority: Facts the agent must remember (user preferences, project requirements, decisions made)
  • Yellow priority: Contextual information that might matter later (tools used, approaches discussed)
  • Green priority: Background information (timestamps, session metadata)

The Observer also tracks two special fields: the current task the agent is working on, and a suggested response for picking up where it left off. This means if a session gets interrupted, the agent resumes with full context from a few hundred tokens of observations instead of replaying thousands of tokens of raw history.
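A sketch of what an observation record might look like, with the priority tags and the two special fields described above. This is an assumed shape for illustration; Mastra's internal format may differ.

```typescript
type Priority = "red" | "yellow" | "green";

interface ObservationNote {
  priority: Priority;
  text: string;
}

interface ObservationState {
  notes: ObservationNote[];
  currentTask?: string;       // what the agent is working on right now
  suggestedResponse?: string; // how to pick up after an interruption
}

// Example state after the Observer compresses a session:
const state: ObservationState = {
  notes: [
    { priority: "red", text: "User wants the billing export as CSV, not JSON." },
    { priority: "yellow", text: "Agent used Playwright MCP to reach the Stripe dashboard." },
    { priority: "green", text: "Session started Monday morning, UTC." },
  ],
  currentTask: "Export failed charges from the last 24 hours",
  suggestedResponse: "Continue from the filtered payments list and trigger the CSV export.",
};
```

On resume, the agent receives `state` (a few hundred tokens) instead of the raw transcript.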

The Reflector

When observations themselves grow past their threshold (default: 40,000 tokens), the Reflector steps in. It garbage-collects outdated observations, merges related ones, and distills patterns. Think of it as the difference between daily notes and a weekly summary: the Reflector turns a week of observations into a concise brief.

This two-tier compression is what makes the system scale. Without the Reflector, observations would eventually grow as large as the original conversation. With it, even months-long agent sessions maintain a manageable memory footprint.

Architecture in Practice

A real-world example makes this concrete. Consider a coding agent using Playwright MCP to browse and interact with web pages. Each page snapshot can be 50,000+ tokens. Without observational memory, the agent’s context fills up after two or three pages. With it, the Observer compresses each page visit into a few hundred tokens: “User visited the Stripe dashboard. Found three failed payments from the last 24 hours. Exported CSV of failed charges.”

The raw page content is gone. What remains is exactly what the agent needs to continue working.

import { Memory } from '@mastra/memory';

const memory = new Memory({
  options: {
    observationalMemory: {
      model: 'google/gemini-2.5-flash',
      observation: { messageTokens: 30_000 },
      reflection: { observationTokens: 40_000 },
    },
  },
});

The configuration is minimal. The model used for the Observer and Reflector can be a smaller, cheaper model than the primary agent. Mastra recommends Gemini 2.5 Flash for its 1M token context window, which gives the Reflector enough space to process large observation histories in a single pass.

Related: The Open-Source Agentic AI Stack in 2026: What Teams Actually Run in Production

Benchmark Results: Where Observational Memory Wins

The LongMemEval benchmark tests how well memory systems retain and retrieve information across long conversations. It measures accuracy on questions that require the model to recall specific details from earlier in a multi-turn dialogue.

| Model | Observational Memory | RAG | Difference |
| --- | --- | --- | --- |
| gpt-5-mini | 94.87% | N/A | SOTA (3+ points above any prior score) |
| gemini-3-pro-preview | 93.27% | N/A | Second highest recorded |
| gemini-3-flash-preview | 89.20% | N/A | Third highest recorded |
| gpt-4o | 84.23% | 80.05% | +4.18 points |

The gpt-4o comparison is the most telling because it controls for model quality. Same model, same benchmark, different memory architecture. Observational memory’s 4-point advantage comes entirely from better context management.

The gpt-5-mini result (94.87%) is notable for a different reason: it suggests that observational memory’s compression actually helps smaller models perform better by reducing context noise. A smaller model with clean, compressed context outperforms a larger model with noisy, raw context.

Cost Implications

The token savings translate directly to API costs. Mem0’s research quantified this for their memory approach: roughly 1,800 tokens per conversation turn with structured memory versus 26,000 tokens for full-context methods. That is a 93% reduction in token consumption.

Observational memory’s 5-40x compression range depends on the conversation type. Coding sessions with large tool outputs compress at the high end (30-40x). Text-heavy conversations with shorter messages compress less aggressively (5-10x). Either way, the cost reduction is substantial: a team spending $10,000/month on inference for long-running agent sessions could realistically cut that to $1,000-2,000.

Prompt caching amplifies the savings further. With consistent observation prefixes, Anthropic’s prompt caching gives you 90% cost reduction on cached tokens. OpenAI’s equivalent offers 50%. Combine memory compression with prompt caching and the total cost reduction can exceed 95% compared to naive full-context approaches.
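The back-of-envelope arithmetic behind those numbers can be made explicit. The compression ratio, cached-token fraction, and discounts below are assumptions for illustration, not measurements.

```typescript
// Estimated monthly spend after compression plus prompt caching.
function monthlyCost(
  baseSpend: number,        // spend with raw full context
  compressionRatio: number, // 5-40x from observational memory
  cachedFraction: number,   // share of input tokens that hit the cache
  cacheDiscount: number,    // e.g. 0.9 (Anthropic) or 0.5 (OpenAI)
): number {
  const afterCompression = baseSpend / compressionRatio;
  return afterCompression * (1 - cachedFraction * cacheDiscount);
}

// $10,000/month, 10x compression, 80% of tokens cached at a 90% discount:
console.log(monthlyCost(10_000, 10, 0.8, 0.9)); // ≈ 280, a ~97% reduction
```

Even with conservative inputs (5x compression, 50% cache discount), the reduction lands well above 80%.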

How Observational Memory Fits the Agent Memory Landscape

Observational memory is not the only structured memory approach gaining traction. The broader AI agent memory ecosystem is converging on a key insight: raw retrieval is losing to structured compression.

Mem0 uses a hybrid architecture combining graph, vector, and key-value stores. Their graph memory achieves a 26% accuracy improvement over baseline on conversational benchmarks and offers managed infrastructure for teams that do not want to run their own graph database.

Zep builds temporal knowledge graphs that track how facts change over time. Their approach delivers 18.5% accuracy improvements with 90% latency reduction versus baseline RAG, and works particularly well for enterprise scenarios requiring audit trails and relationship modeling.

Letta (formerly MemGPT) takes the most radical approach: agents that manage their own memory through self-editing memory blocks. Letta agents decide what to keep in context, what to archive, and what to retrieve, scoring 74.0% on LoCoMo with GPT-4o mini.

Observational memory’s differentiator is its simplicity. No vector database, no graph database, no self-editing memory loops. Just two background agents compressing text. That makes it the easiest approach to adopt for teams that want better memory without a significant infrastructure investment.

When to Use What

| Memory Type | Best For | Infrastructure | Complexity |
| --- | --- | --- | --- |
| Observational memory | Long-running sessions, coding agents, high-context workflows | None (text-only) | Low |
| RAG | Static knowledge retrieval, document QA | Vector database | Medium |
| Graph memory (Mem0/Zep) | Relationship tracking, temporal reasoning, enterprise compliance | Graph database | High |
| Self-editing (Letta) | Fully autonomous agents managing their own context | Agent runtime | High |

For most teams building agents today, observational memory offers the best ratio of improvement to complexity. You add a few lines of configuration, and your agents immediately handle longer sessions with lower costs and higher accuracy.

Related: AI Agent Frameworks Compared: LangGraph, CrewAI, AutoGen

Getting Started: Practical Integration

If you are building with Mastra, observational memory is a configuration flag. If you are not, the pattern is straightforward to implement in any framework.

With Mastra (TypeScript)

import { Mastra } from '@mastra/core';
import { Memory } from '@mastra/memory';

const memory = new Memory({
  options: {
    observationalMemory: true, // Uses gemini-2.5-flash by default
  },
});

const mastra = new Mastra({
  memory,
  agents: { myAgent },
});

Two storage options work immediately: PostgreSQL (@mastra/pg) and LibSQL (@mastra/libsql). MongoDB support is available through @mastra/mongodb.

Without Mastra (Any Framework)

The Observer-Reflector pattern ports to any agent framework. The core logic is:

  1. Maintain a running token count of the agent’s message history
  2. When it crosses your threshold (30K tokens works well), call a smaller model with instructions to compress the messages into structured observations
  3. Replace the raw messages with the compressed observations in the agent’s context
  4. When observations themselves grow past a second threshold (40K tokens), call the model again to consolidate

The key implementation detail is priority tagging. Without it, the Observer treats all information equally and compression quality drops. Assigning red/yellow/green priorities to observations lets the Reflector make intelligent decisions about what to keep and what to discard.
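The four steps above can be sketched in a few dozen lines. Everything here is framework-agnostic: `countTokens` stands in for your tokenizer, and the `observe`/`reflect` callbacks stand in for calls to a small compressor model. None of this is a Mastra API.

```typescript
interface Message { role: "user" | "assistant"; content: string }

interface MemoryConfig {
  messageTokens: number;     // step 2 threshold (e.g. 30_000)
  observationTokens: number; // step 4 threshold (e.g. 40_000)
  countTokens: (text: string) => number;
  observe: (messages: Message[]) => string[];    // compress messages -> observations
  reflect: (observations: string[]) => string[]; // consolidate observations
}

class ObservationalMemory {
  private messages: Message[] = [];
  observations: string[] = [];

  constructor(private cfg: MemoryConfig) {}

  add(message: Message): void {
    this.messages.push(message);

    // Step 1: running token count of raw message history.
    const msgTokens = this.messages
      .reduce((sum, m) => sum + this.cfg.countTokens(m.content), 0);

    // Steps 2-3: compress raw messages into observations, then drop them.
    if (msgTokens > this.cfg.messageTokens) {
      this.observations.push(...this.cfg.observe(this.messages));
      this.messages = [];
    }

    // Step 4: consolidate when observations themselves grow too large.
    const obsTokens = this.observations
      .reduce((sum, o) => sum + this.cfg.countTokens(o), 0);
    if (obsTokens > this.cfg.observationTokens) {
      this.observations = this.cfg.reflect(this.observations);
    }
  }

  // The agent's context: observations first, then any uncompressed messages.
  context(): string[] {
    return [...this.observations, ...this.messages.map((m) => m.content)];
  }
}

// Toy usage with a character-count "tokenizer" and stub compressors:
const memory = new ObservationalMemory({
  messageTokens: 10,
  observationTokens: 1_000,
  countTokens: (t) => t.length,
  observe: (ms) => [`[red] compressed ${ms.length} messages`],
  reflect: (obs) => obs.slice(-1),
});
memory.add({ role: "user", content: "a message longer than ten tokens" });
console.log(memory.observations.length); // 1
```

In production, `observe` and `reflect` would prompt a cheap model to emit priority-tagged notes; the toy stubs here exist only to show the control flow.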

Migration Path

Existing agent sessions work without manual migration. When a thread’s messages cross the observation threshold for the first time, the Observer processes the entire backlog. Subsequent interactions benefit from compression immediately. There is no need to re-index, re-embed, or restructure existing data.

Frequently Asked Questions

What is observational memory for AI agents?

Observational memory is a memory architecture for AI agents that uses two background agents, an Observer and a Reflector, to compress raw conversation history into structured observation notes. Instead of storing and retrieving full message histories via vector search (like RAG), it creates compressed, prioritized summaries that reduce token usage by 5-40x while maintaining higher accuracy on memory benchmarks. Mastra’s implementation achieves state-of-the-art results on LongMemEval.

How does observational memory compare to RAG for AI agents?

On the LongMemEval benchmark with gpt-4o, observational memory scores 84.23% compared to RAG’s 80.05%, a 4+ point improvement. Beyond accuracy, observational memory compresses context 5-40x (reducing API costs proportionally), maintains stable prompt prefixes (enabling prompt caching), and requires no vector database infrastructure.

How much does observational memory reduce AI agent costs?

Observational memory reduces token consumption by 5-40x depending on conversation type. Combined with prompt caching (which observational memory enables due to stable prefixes), total cost reductions can exceed 90-95% compared to full-context or naive RAG approaches. Mem0’s research confirmed similar findings: roughly 1,800 tokens per turn with structured memory versus 26,000 for full-context methods.

What is the Observer-Reflector architecture pattern?

The Observer-Reflector pattern uses two background agents. The Observer watches conversations and creates compressed observation notes when message history exceeds a token threshold (typically 30,000 tokens). The Reflector activates when observations grow past their own threshold (typically 40,000 tokens) and garbage-collects outdated observations while merging related ones. This two-tier compression enables long-running agent sessions without context overflow.

Which AI agent memory approach should I use in 2026?

Use observational memory for long-running sessions, coding agents, and high-context workflows where simplicity matters. Use RAG for static knowledge retrieval and document QA. Use graph memory (Mem0 or Zep) when you need relationship tracking, temporal reasoning, or enterprise compliance. Use self-editing memory (Letta) for fully autonomous agents managing their own context.