OpenExp Scores 98.6% on LongMemEval — How We Built AI Memory That Actually Works
The Problem: AI Agents Have No Long-Term Memory
Your AI agent handles a customer conversation brilliantly today. Tomorrow, it has no idea the conversation happened. It doesn't remember that the customer prefers email over Slack, that the last proposal was rejected because of pricing, or that you already tried approach A and it didn't work.
This isn't a model problem — it's a memory problem. And most solutions either cost a fortune in API calls or simply don't work well enough to trust.
We built OpenExp to solve this: a fully local memory layer for AI agents that learns from outcomes. But "it works in production" isn't enough — we needed numbers. So we put it through the standard benchmark.
What is LongMemEval?
LongMemEval is the industry-standard benchmark for AI agent memory systems, published at ICLR 2025. It tests one thing: can your memory system find the right information across many conversations?
The setup is simple but demanding: 500 questions, each with ~48 conversation sessions as a "haystack." The system must find the needle — the session that contains the answer. Six question types test different memory abilities:
| Question Type | What It Tests | Example |
|---|---|---|
| Single-session (user) | Find what the user said | "What book did I mention?" |
| Single-session (assistant) | Find what AI responded | "What did you recommend?" |
| Preferences | Find user preferences | "What coffee do I like?" |
| Multi-session | Connect info across conversations | "How did my plan evolve?" |
| Knowledge update | Find updated information | "Where do I live now?" |
| Temporal reasoning | Time-based logic | "What happened before I moved?" |
How We Tested
For each of 500 questions, our benchmark:
- Builds a corpus from ~48 haystack sessions
- Embeds everything using BAAI/bge-small-en-v1.5 (384 dimensions — same model OpenExp uses in production)
- Indexes into Qdrant (in-memory, fresh collection per question)
- Searches for the most relevant sessions
- Checks if the correct session appears in the top-k results
This is retrieval-only evaluation — no LLM generates an answer. We measure whether the memory system finds the right document, period.
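For the curious, here is roughly what one evaluation step looks like in code. This is an illustrative sketch rather than the actual benchmark script; the session format and the `hit_at_k` helper are simplified stand-ins, but the moving parts (bge-small embeddings, a fresh in-memory Qdrant collection, a top-k vector search) match the steps above.

```python
# Illustrative sketch of one evaluation step, not the actual benchmark script.
# The session dicts and helper names here are simplified stand-ins.
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

encoder = SentenceTransformer("BAAI/bge-small-en-v1.5")  # 384-dim embeddings

def hit_at_k(question: str, sessions: list[dict], answer_session_id: str, k: int = 10) -> bool:
    client = QdrantClient(":memory:")  # fresh in-memory collection per question
    client.create_collection(
        collection_name="haystack",
        vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    )
    vectors = encoder.encode([s["text"] for s in sessions])
    client.upsert(
        collection_name="haystack",
        points=[
            PointStruct(id=i, vector=vec.tolist(), payload={"session_id": s["id"]})
            for i, (s, vec) in enumerate(zip(sessions, vectors))
        ],
    )
    hits = client.search(
        collection_name="haystack",
        query_vector=encoder.encode(question).tolist(),
        limit=k,
    )
    # Did the session that contains the answer make the top-k?
    return any(h.payload["session_id"] == answer_session_id for h in hits)
```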
We tested three scoring strategies to understand what each component contributes:
- Raw — pure vector similarity (cosine distance between embeddings)
- Hybrid — vector similarity (90%) + BM25 keyword matching (10%)
- Full — vector (75%) + BM25 (10%) + recency boost (15%)
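A back-of-the-envelope version of the three modes looks like the sketch below. The weights are the ones listed above; the BM25 normalization and the 30-day recency half-life are illustrative assumptions, not OpenExp's exact formulas.

```python
def score(cosine_sim: float, bm25: float, bm25_max: float, age_days: float, mode: str) -> float:
    # Normalize BM25 against the best keyword score in the candidate set so it
    # lives on roughly the same [0, 1] scale as cosine similarity (assumption).
    bm25_norm = bm25 / bm25_max if bm25_max > 0 else 0.0
    if mode == "raw":
        return cosine_sim
    if mode == "hybrid":
        return 0.90 * cosine_sim + 0.10 * bm25_norm
    # "full": exponential recency decay; the 30-day half-life is illustrative.
    recency = 0.5 ** (age_days / 30.0)
    return 0.75 * cosine_sim + 0.10 * bm25_norm + 0.15 * recency
```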
Results
Overall performance across 500 questions
| Metric | Raw | Hybrid | Full |
|---|---|---|---|
| Recall@1 | 83.0% | 88.0% | 87.8% |
| Recall@5 | 96.2% | 96.4% | 96.4% |
| Recall@10 | 97.8% | 98.6% | 98.6% |
| NDCG@10 | 89.3% | 92.4% | 92.5% |
Recall@k asks: did the correct session appear in the top-k results? At Recall@10 = 98.6%, only 7 of the 500 questions were missed. NDCG@10 measures how high in the list the correct result ranks; 92.5% means it is almost always at or near the top.
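Both metrics are straightforward to compute from the ranked list for each question. The sketch below assumes a single gold session per question, which makes the ideal DCG equal to 1; questions with multiple evidence sessions need the general NDCG formula.

```python
import math

def recall_at_k(ranked_ids: list[str], answer_id: str, k: int) -> int:
    # 1 if the correct session is anywhere in the top-k, else 0
    return int(answer_id in ranked_ids[:k])

def ndcg_at_k(ranked_ids: list[str], answer_id: str, k: int = 10) -> float:
    # With one relevant session the ideal DCG is 1, so NDCG reduces to the
    # discounted gain at the rank where the correct session shows up.
    for rank, sid in enumerate(ranked_ids[:k], start=1):
        if sid == answer_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0
```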
Breakdown by question type (Recall@10)
| Question Type | Raw | Hybrid | Full |
|---|---|---|---|
| Knowledge update | 98.7% | 100% | 100% |
| Multi-session | 100% | 100% | 100% |
| Single-session (user) | 98.6% | 100% | 100% |
| Preferences | 90.0% | 96.7% | 96.7% |
| Single-session (assistant) | 98.2% | 94.6% | 92.9% |
| Temporal reasoning | 96.2% | 97.7% | 98.5% |
What Each Component Contributes
BM25 keyword matching: the biggest win
Adding just 10% BM25 weight to vector search improved Recall@1 by 5 percentage points (83.0% to 88.0%) and NDCG@10 by 3.1 points (89.3% to 92.4%). Why? Vector embeddings understand that "book" and "novel" are related, but they can also blur the distinction between any conversation about coffee and the specific one where you said you prefer espresso. BM25 catches the exact keyword matches that embeddings miss.
The biggest per-type improvement: preference questions jumped from 90.0% to 96.7% Recall@10. These are exactly the questions where keyword precision matters most.
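Here's a toy illustration of the effect using the rank_bm25 package; OpenExp's internal BM25 scoring and tokenization may differ, but the intuition is the same: rare, exact query terms pull the right session to the top.

```python
from rank_bm25 import BM25Okapi

sessions = [
    "we talked about coffee shops near the new office",
    "I told you I prefer a double espresso, no sugar",
    "coffee prices went up again this month",
]
bm25 = BM25Okapi([s.lower().split() for s in sessions])

# "coffee" appears almost everywhere and carries little weight; "prefer" is
# rare and points straight at the session with the actual preference.
scores = bm25.get_scores("which coffee do i prefer".lower().split())
print(scores.argmax())  # -> 1, the espresso session
```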
Recency: matters in production, not in benchmarks
Adding recency scoring barely changed the overall numbers but improved temporal reasoning by 0.8 percentage points (97.7% to 98.5% Recall@10). This makes sense: LongMemEval distributes questions uniformly across time, whereas in real usage people ask about recent events far more often, so recency provides more value than this benchmark shows.
Key insight: Simple beats complex. Adding BM25 keyword matching to vector search gives a bigger improvement than any fancy architecture. MemPalace's structured "palace" system actually makes retrieval worse (see comparison below). The right 10% of keyword signal outperforms months of architectural complexity.
How OpenExp Compares to Other Memory Systems
Direct comparison (retrieval-only, same methodology)
| System | Recall@5 | LLM Required? | License |
|---|---|---|---|
| OpenExp hybrid | 96.4% | No | MIT (open source) |
| MemPalace raw | 96.6% | No | Open source |
| MemPalace AAAK | 84.2% | No | Open source |
OpenExp matches MemPalace's raw retrieval (96.4% vs 96.6% — within noise). But here's the interesting part: MemPalace's structured "palace" architecture (AAAK mode), which is their main selling point, drops to 84.2%. Their added complexity hurts, not helps.
Broader landscape (end-to-end QA — different methodology)
Most other systems report end-to-end accuracy (retrieval + LLM answer generation), which is a different metric. For context:
| System | End-to-End Accuracy | Requires LLM |
|---|---|---|
| OMEGA | 95.4% | Yes (on-device) |
| Mastra OM | 94.9% | Yes (gpt-5-mini) |
| Letta (MemGPT) | 91.4% | Yes (GPT-4o) |
| Zep | 63.8% | Yes (GPT-4o) |
| ChatGPT memory | ~53% | Yes (GPT-4o) |
| Mem0 | 49.0% | Yes (GPT-4o) |
These numbers aren't directly comparable to retrieval recall — it's like comparing engine horsepower to lap times. But with 98.6% retrieval accuracy, an LLM would only need to read the right context and generate a correct answer, which is the easy part. We plan to add end-to-end evaluation in a future update.
What Makes OpenExp Different
Most memory systems stop at retrieval. OpenExp adds a layer that none of the systems above have: Q-learning.
Every memory in OpenExp has a Q-value (0 to 1) that changes based on outcomes. When a memory contributes to a successful task — a closed deal, a merged PR, a resolved ticket — its Q-value increases. Memories that lead to dead ends sink over time.
This means OpenExp doesn't just remember — it learns what's worth remembering. After a month of use, the memories that actually led to results surface first. This Q-learning signal was not included in the benchmark above — it's an additional boost on top of the 98.6% retrieval score.
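In spirit, the update is a simple reinforcement-learning moving average. The sketch below is a minimal illustration; the learning rate and reward shaping OpenExp actually uses may differ.

```python
def update_q(q: float, reward: float, alpha: float = 0.1) -> float:
    """Nudge a memory's Q-value toward the observed outcome.

    reward = 1.0 when the memory contributed to a success (closed deal,
    merged PR, resolved ticket), 0.0 when it led to a dead end.
    """
    q = q + alpha * (reward - q)
    return min(1.0, max(0.0, q))  # keep Q in [0, 1]

q = 0.5
for _ in range(5):        # a memory repeatedly tied to wins...
    q = update_q(q, 1.0)  # ...drifts toward 1 and surfaces earlier in search
```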
Want to add persistent memory to your AI agent? OpenExp is open source and runs entirely on your machine. No API costs, no data leaves your system. Check it out on GitHub or book a call if you want help integrating it.
Reproduce These Results
Everything is open source. Run the benchmark yourself:
```bash
git clone https://github.com/anthroos/openexp.git
cd openexp && pip install -e .

# Download dataset (264 MB)
mkdir -p benchmarks/data
curl -L -o benchmarks/data/longmemeval_s_cleaned.json \
  "https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned/resolve/main/longmemeval_s_cleaned.json"

# Run (~60 min each on Apple Silicon)
python benchmarks/longmemeval_bench.py benchmarks/data/longmemeval_s_cleaned.json --mode raw
python benchmarks/longmemeval_bench.py benchmarks/data/longmemeval_s_cleaned.json --mode hybrid
python benchmarks/longmemeval_bench.py benchmarks/data/longmemeval_s_cleaned.json --mode full
```
Key Takeaways
- Hybrid search is the sweet spot. Vector embeddings + 10% BM25 keyword matching gives the best balance. This is OpenExp's production default.
- Simple scoring beats complex architecture. MemPalace's "palace" structure regresses from 96.6% to 84.2% Recall@5; our straightforward hybrid reaches 96.4% Recall@5 and 98.6% Recall@10. Don't over-engineer.
- Zero LLM calls, fully local. No API costs, no latency, no data leaving your machine. Every search completes in under 100ms in production.
- Q-learning is the next frontier. The benchmark doesn't capture OpenExp's unique strength — memories that led to real outcomes rank higher over time. This is the difference between a search engine and a learning system.
Building AI agents that need to remember? Whether it's a sales assistant that learns from closed deals, a support bot that recalls customer preferences, or a coding agent that remembers what worked — we can help you integrate persistent, learning memory. Book a free 30-min call or email us.