Mastra Observational Memory is a new state-of-the-art (SOTA) memory system for AI agents, now available in the latest version of Mastra. It delivers the highest scores ever recorded on the LongMemEval benchmark, 84.2% with GPT-4o and 94.9% with GPT-5-mini.
Unlike traditional approaches that rely on retrieval-augmented generation (RAG), vector databases, or graph-based memory systems, Observational Memory uses a continuously evolving text representation. The result: higher accuracy, massive token compression, and effortless memory integration for AI agents.
This article explains what Observational Memory is, how it works, why it matters, and what its benchmark results reveal about the future of agent memory systems.
What Is Mastra Observational Memory?
Mastra Observational Memory (OM) is an open-source memory architecture designed for AI agents. It replaces query-based retrieval systems with a background observation model that continuously compresses conversation history into dense, evolving summaries.
Instead of:
- Storing embeddings in a vector database
- Running retrieval queries per prompt
- Managing graph-based memory structures
OM maintains a single evolving text “blob” that represents the agent’s accumulated knowledge.
The primary innovation is that the main agent does not directly read from or write to memory. Memory formation happens automatically via background agents.
How Mastra Observational Memory Works
Background Agents as Subconscious Memory
Observational Memory introduces background agents that:
- Monitor full message histories
- Extract high-signal observations
- Compress interactions into dense summaries
- Continuously update a consolidated memory representation
These background agents function like a subconscious layer. The primary agent simply receives the updated memory context without issuing retrieval calls.
This eliminates:
- Input-based retrieval steps
- Manual memory writes
- Memory query overhead
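Conceptually, the loop can be pictured with the sketch below. This is only an illustration of the idea, not Mastra's actual API; names such as `MemoryBlob`, `observe`, and `buildContext` are assumptions, and the `summarize` callback stands in for whatever LLM call the background agent makes.

```ts
// Illustrative sketch only — these types and names are assumptions,
// not Mastra's Observational Memory API.

type Message = { role: "user" | "assistant" | "tool"; content: string };

// The single evolving text "blob" that represents accumulated knowledge.
interface MemoryBlob {
  observations: string; // dense, continuously rewritten summary
  lastUpdated: Date;
}

// A background observer folds new messages into the blob.
// `summarize` stands in for an LLM call that extracts high-signal
// observations and merges them with the existing summary.
async function observe(
  blob: MemoryBlob,
  newMessages: Message[],
  summarize: (existing: string, recent: string) => Promise<string>
): Promise<MemoryBlob> {
  const recent = newMessages.map((m) => `${m.role}: ${m.content}`).join("\n");
  const observations = await summarize(blob.observations, recent);
  return { observations, lastUpdated: new Date() };
}

// The primary agent never queries memory; it simply receives the
// current blob as part of its prompt context.
function buildContext(blob: MemoryBlob, latestUserMessage: string): string {
  return `Known so far:\n${blob.observations}\n\nUser: ${latestUserMessage}`;
}
```

The key property is that the primary agent only ever sees the current blob; it never issues a memory read or write of its own.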
Async Buffering vs Blocking Mode
The currently available version processes observations in a blocking manner, pausing the conversation while memory updates occur.
An upcoming async buffering mode allows memory updates to happen in parallel, removing conversation delays. In async mode:
- Conversations continue uninterrupted
- Memory processing happens in the background
- Performance feels seamless
The async implementation ships under the `oma` tag in Mastra dependencies.
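The difference between the two modes can be sketched as follows. This is again a hand-written illustration under the same assumptions as the previous sketch, not the `oma` implementation itself; `updateMemory` stands in for the background observation step.

```ts
// Sketch of blocking vs async-buffered memory updates (names are assumptions).

type UpdateMemory = () => Promise<void>;
type ProduceReply = () => Promise<string>;

// Blocking mode: the reply waits until observations have been written.
async function replyBlocking(
  updateMemory: UpdateMemory,
  produceReply: ProduceReply
): Promise<string> {
  await updateMemory(); // conversation pauses here while memory updates
  return produceReply();
}

// Async buffering mode: the update is queued in the background and the
// reply is produced immediately; later turns see the refreshed memory.
let buffer: Promise<void> = Promise.resolve();

async function replyBuffered(
  updateMemory: UpdateMemory,
  produceReply: ProduceReply
): Promise<string> {
  buffer = buffer.then(updateMemory); // chain updates so they apply in order
  return produceReply(); // not blocked on the memory write
}
```

Chaining updates onto a single promise keeps observations applied in order while the conversation keeps streaming.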
No RAG, No Graphs, No Retrieval
One of the most notable aspects of Mastra Observational Memory is what it does not use:
- No retrieval-augmented generation (RAG)
- No graph databases
- No input-triggered memory searches
Traditional RAG systems require embedding storage and similarity search per query. OM removes this layer entirely.
This simplifies architecture and reduces latency while still improving benchmark performance.
Benchmark Results: LongMemEval Performance
LongMemEval is a memory benchmark containing:
- 500 questions
- ~57 million tokens of conversation history
- Complex cross-session memory requirements
On this benchmark, Observational Memory achieved:
| Model | Score with OM | Previous SOTA | Improvement |
|---|---|---|---|
| GPT-4o | 84.2% | ~81.6% | +2.6% |
| GPT-5-mini | 94.9% | — | New record (highest score ever) |
Notably, GPT-4o scored roughly 2% higher with OM than when it was given the raw answer conversations directly.
This suggests that dense observational summaries can outperform full transcript exposure under certain conditions.
Token Compression: 6×–40× Reduction
A major benefit of Observational Memory is aggressive token compression.
Reported compression rates range from 6× to as much as 40×, with the heaviest compression in conversations that involve extensive tool use.
Why This Matters
Large conversation histories quickly exceed model context windows. OM allows:
- Smaller context windows to behave like much larger ones
- Reduced token costs
- Better scalability for long-running agents
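To make the ratio concrete: at a 40× reduction, a hypothetical 200,000-token conversation history condenses to roughly 5,000 tokens of observations, which fits comfortably inside even modest context windows.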
Traditional vs Observational Memory
| Approach | Token Efficiency | Retrieval Required | Complexity |
|---|---|---|---|
| RAG-based systems | Medium | Yes | High |
| Graph memory systems | Medium | Yes | High |
| Raw conversation history | Low | No | Low |
| Observational Memory (OM) | High (6–40×) | No | Moderate |
OM achieves high efficiency without retrieval queries.
Why Observational Memory Outperforms Raw Transcripts
A common concern with compression is information loss.
However, LongMemEval results show that dense observational summaries can outperform direct exposure to full answer conversations.
Possible reasons based on benchmark evidence:
- High-signal extraction reduces noise
- Redundant conversational tokens are removed
- Context becomes structured and semantically focused
- Cognitive load on the model decreases
The benchmark result (+2% over raw answer sessions for GPT-4o) suggests that structured summarization can enhance memory accuracy rather than degrade it.
Real-World Applications
Observational Memory is especially relevant for:
1. Long-Running AI Agents
Agents that operate across days or weeks benefit from continuous memory evolution without ballooning context sizes.
2. Tool-Heavy Systems
Compression increases with more tool calls, making OM suitable for:
- Autonomous workflows
- Developer agents
- Multi-step reasoning systems
3. Constrained Context Windows
Smaller models or deployments with token limits can simulate large-memory behavior.
4. Production AI Assistants
Simplified architecture (no retrieval layer) reduces system complexity and maintenance overhead.
Benefits of Mastra Observational Memory
- State-of-the-art benchmark performance
- Highest LongMemEval score recorded
- Massive token compression (6×–40×)
- No retrieval overhead
- Open-source availability
- Works with multiple LLMs
Limitations and Considerations
While Observational Memory shows strong benchmark performance, practical considerations include:
- Blocking mode may introduce latency (until async buffering is fully adopted)
- Memory quality depends on observation fidelity
- Requires background agent infrastructure
As with all memory systems, production behavior should be evaluated under domain-specific workloads.
Practical Implementation Notes
To experiment with async buffering:
- Update the Mastra dependencies to the `oma` tag:
  - `@mastra/core@oma`
  - `@mastra/memory@oma`
  - `mastra@oma`
The standard release pauses the conversation while observations are generated; the async-buffering build removes that pause by processing memory in the background.
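For orientation, wiring a memory instance into a Mastra agent generally follows the pattern sketched below. The imports mirror Mastra's documented setup, but the option for switching on Observational Memory is not spelled out here, so it is left as a commented placeholder rather than a confirmed API.

```ts
// Sketch of agent + memory wiring in Mastra. The commented option is a
// placeholder assumption; check the Mastra docs for the exact flag that
// enables Observational Memory in the oma release.
import { Agent } from "@mastra/core/agent";
import { Memory } from "@mastra/memory";
import { openai } from "@ai-sdk/openai";

const memory = new Memory({
  // observational: true, // hypothetical option name, not confirmed
});

export const assistant = new Agent({
  name: "assistant",
  instructions: "You are a helpful assistant with long-term memory.",
  model: openai("gpt-4o"),
  memory,
});
```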
My Final Thoughts
Mastra Observational Memory introduces a fundamentally different approach to AI agent memory. By replacing retrieval systems with continuously evolving observational summaries, it achieves state-of-the-art performance on LongMemEval while dramatically reducing token usage.
With compression rates up to 40× and record-breaking benchmark scores, including 94.9% with GPT-5-mini, Observational Memory demonstrates that dense structured memory can outperform raw transcript exposure and retrieval-based architectures.
As AI agents become longer-lived and more autonomous, efficient memory systems will be critical. Observational Memory represents a strong step toward scalable, high-performance, low-overhead memory architectures for next-generation AI systems.
Frequently Asked Questions (FAQs)
1. What makes Mastra Observational Memory different from RAG?
Observational Memory does not perform retrieval queries. Instead, it continuously updates a dense memory summary via background agents, eliminating the need for vector searches.
2. What is LongMemEval?
LongMemEval is a memory benchmark consisting of 500 questions across approximately 57 million tokens of conversation history, designed to test long-term memory performance.
3. Does compression cause loss of important details?
Benchmark results show that compressed observational summaries can outperform raw conversation transcripts, suggesting that structured compression may improve signal quality.
4. What models achieved SOTA with OM?
GPT-4o scored 84.2%, and GPT-5-mini scored 94.9%—the highest score recorded on LongMemEval as of the announcement date.
5. Is Observational Memory open source?
Yes. It is available in the latest Mastra release and can be used today.
6. What is async buffering in Observational Memory?
Async buffering allows memory observations to process in the background without blocking the main conversation flow.