OpenExp Scores 98.6% on LongMemEval — How We Built AI Memory That Actually Works
The Problem: AI Agents Have No Long-Term Memory
Your AI agent handles a customer conversation brilliantly today. Tomorrow, it has no idea the conversation happened. It doesn't remember that the customer prefers email over Slack, that the last proposal was rejected because of pricing, or that you already tried approach A and it didn't work.
This isn't a model problem — it's a memory problem. And most solutions either cost a fortune in API calls or simply don't work well enough to trust.
We built OpenExp to solve this: a fully local memory layer for AI agents that learns from outcomes. But "it works in production" isn't enough — we needed numbers. So we put it through the standard benchmark.
What is LongMemEval?
LongMemEval is the industry-standard benchmark for AI agent memory systems, published at ICLR 2025. It tests one thing: can your memory system find the right information across many conversations?
The setup is simple but demanding: 500 questions, each with ~48 conversation sessions as a "haystack." The system must find the needle — the session that contains the answer. Six question types test different memory abilities:
| Question Type | What It Tests | Example |
|---|---|---|
| Single-session (user) | Find what the user said | "What book did I mention?" |
| Single-session (assistant) | Find what AI responded | "What did you recommend?" |
| Preferences | Find user preferences | "What coffee do I like?" |
| Multi-session | Connect info across conversations | "How did my plan evolve?" |
| Knowledge update | Find updated information | "Where do I live now?" |
| Temporal reasoning | Time-based logic | "What happened before I moved?" |
How We Tested
For each of 500 questions, our benchmark:
- Builds a corpus from ~48 haystack sessions
- Embeds everything using BAAI/bge-small-en-v1.5 (384 dimensions — same model OpenExp uses in production)
- Indexes into Qdrant (in-memory, fresh collection per question)
- Searches for the most relevant sessions
- Checks if the correct session appears in the top-k results
This is retrieval-only evaluation — no LLM generates an answer. We measure whether the memory system finds the right document, period.
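For the curious, here is roughly what one evaluation step looks like in code. This is an illustrative sketch rather than the actual benchmark script; the session format and the `hit_at_k` helper are simplified stand-ins, but the moving parts (bge-small embeddings, a fresh in-memory Qdrant collection, a top-k vector search) match the steps above.

```python
# Illustrative sketch of one evaluation step, not the actual benchmark script.
# The session dicts and helper names here are simplified stand-ins.
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

encoder = SentenceTransformer("BAAI/bge-small-en-v1.5")  # 384-dim embeddings

def hit_at_k(question: str, sessions: list[dict], answer_session_id: str, k: int = 10) -> bool:
    client = QdrantClient(":memory:")  # fresh in-memory collection per question
    client.create_collection(
        collection_name="haystack",
        vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    )
    vectors = encoder.encode([s["text"] for s in sessions])
    client.upsert(
        collection_name="haystack",
        points=[
            PointStruct(id=i, vector=vec.tolist(), payload={"session_id": s["id"]})
            for i, (s, vec) in enumerate(zip(sessions, vectors))
        ],
    )
    hits = client.search(
        collection_name="haystack",
        query_vector=encoder.encode(question).tolist(),
        limit=k,
    )
    # Did the session that contains the answer make the top-k?
    return any(h.payload["session_id"] == answer_session_id for h in hits)
```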
We tested three scoring strategies to understand what each component contributes:
- Raw — pure vector similarity (cosine distance between embeddings)
- Hybrid — vector similarity (90%) + BM25 keyword matching (10%)
- Full — vector (75%) + BM25 (10%) + recency boost (15%)
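A back-of-the-envelope version of the three modes looks like the sketch below. The weights are the ones listed above; the BM25 normalization and the 30-day recency half-life are illustrative assumptions, not OpenExp's exact formulas.

```python
def score(cosine_sim: float, bm25: float, bm25_max: float, age_days: float, mode: str) -> float:
    # Normalize BM25 against the best keyword score in the candidate set so it
    # lives on roughly the same [0, 1] scale as cosine similarity (assumption).
    bm25_norm = bm25 / bm25_max if bm25_max > 0 else 0.0
    if mode == "raw":
        return cosine_sim
    if mode == "hybrid":
        return 0.90 * cosine_sim + 0.10 * bm25_norm
    # "full": exponential recency decay; the 30-day half-life is illustrative.
    recency = 0.5 ** (age_days / 30.0)
    return 0.75 * cosine_sim + 0.10 * bm25_norm + 0.15 * recency
```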
Results
Overall performance across 500 questions
| Metric | Raw | Hybrid | Full |
|---|---|---|---|
| Recall@1 | 83.0% | 88.0% | 87.8% |
| Recall@5 | 96.2% | 96.4% | 96.4% |
| Recall@10 | 97.8% | 98.6% | 98.6% |
| NDCG@10 | 89.3% | 92.4% | 92.5% |
Recall@k asks: did the correct session appear in the top-k results? At Recall@10 = 98.6%, only 7 of the 500 questions were missed. NDCG@10 measures how high in the list the correct result ranks; 92.5% means it is almost always at or near the top.
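Both metrics are straightforward to compute from the ranked list for each question. The sketch below assumes a single gold session per question, which makes the ideal DCG equal to 1; questions with multiple evidence sessions need the general NDCG formula.

```python
import math

def recall_at_k(ranked_ids: list[str], answer_id: str, k: int) -> int:
    # 1 if the correct session is anywhere in the top-k, else 0
    return int(answer_id in ranked_ids[:k])

def ndcg_at_k(ranked_ids: list[str], answer_id: str, k: int = 10) -> float:
    # With one relevant session the ideal DCG is 1, so NDCG reduces to the
    # discounted gain at the rank where the correct session shows up.
    for rank, sid in enumerate(ranked_ids[:k], start=1):
        if sid == answer_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0
```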
Breakdown by question type (Recall@10)
| Question Type | Raw | Hybrid | Full |
|---|---|---|---|
| Knowledge update | 98.7% | 100% | 100% |
| Multi-session | 100% | 100% | 100% |
| Single-session (user) | 98.6% | 100% | 100% |
| Preferences | 90.0% | 96.7% | 96.7% |
| Single-session (assistant) | 98.2% | 94.6% | 92.9% |
| Temporal reasoning | 96.2% | 97.7% | 98.5% |
What Each Component Contributes
BM25 keyword matching: the biggest win
Adding just 10% BM25 weight to vector search improved Recall@1 by 5 percentage points (83.0% to 88.0%) and NDCG@10 by 3.1 points (89.3% to 92.4%). Why? Vector embeddings understand that "book" and "novel" are related, but they can also blur the distinction between any conversation about coffee and the specific one where you said you prefer espresso. BM25 catches the exact keyword matches that embeddings miss.
The biggest per-type improvement: preference questions jumped from 90.0% to 96.7% Recall@10. These are exactly the questions where keyword precision matters most.
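Here's a toy illustration of the effect using the rank_bm25 package; OpenExp's internal BM25 scoring and tokenization may differ, but the intuition is the same: rare, exact query terms pull the right session to the top.

```python
from rank_bm25 import BM25Okapi

sessions = [
    "we talked about coffee shops near the new office",
    "I told you I prefer a double espresso, no sugar",
    "coffee prices went up again this month",
]
bm25 = BM25Okapi([s.lower().split() for s in sessions])

# "coffee" appears almost everywhere and carries little weight; "prefer" is
# rare and points straight at the session with the actual preference.
scores = bm25.get_scores("which coffee do i prefer".lower().split())
print(scores.argmax())  # -> 1, the espresso session
```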
Recency: matters in production, not in benchmarks
Adding recency scoring barely changed the overall numbers but improved temporal reasoning by 0.8 percentage points (97.7% to 98.5% Recall@10). This makes sense: LongMemEval distributes questions uniformly across time, whereas in real usage people ask about recent events far more often, so recency provides more value than this benchmark shows.
Key insight: Simple beats complex. Adding BM25 keyword matching to vector search gives a bigger improvement than any fancy architecture. MemPalace's structured "palace" system actually makes retrieval worse (see comparison below). The right 10% of keyword signal outperforms months of architectural complexity.
How OpenExp Compares to Other Memory Systems
Direct comparison (retrieval-only, same methodology)
| System | Recall@5 | LLM Required? | License |
|---|---|---|---|
| OpenExp hybrid | 96.4% | No | MIT (open source) |
| MemPalace raw | 96.6% | No | Open source |
| MemPalace AAAK | 84.2% | No | Open source |
OpenExp matches MemPalace's raw retrieval (96.4% vs 96.6% — within noise). But here's the interesting part: MemPalace's structured "palace" architecture (AAAK mode), which is their main selling point, drops to 84.2%. Their added complexity hurts, not helps.
Broader landscape (end-to-end QA — different methodology)
Most other systems report end-to-end accuracy (retrieval + LLM answer generation), which is a different metric. For context:
| System | End-to-End Accuracy | Requires LLM |
|---|---|---|
| OMEGA | 95.4% | Yes (on-device) |
| Mastra OM | 94.9% | Yes (gpt-5-mini) |
| Letta (MemGPT) | 91.4% | Yes (GPT-4o) |
| Zep | 63.8% | Yes (GPT-4o) |
| ChatGPT memory | ~53% | Yes (GPT-4o) |
| Mem0 | 49.0% | Yes (GPT-4o) |
These numbers aren't directly comparable to retrieval recall — it's like comparing engine horsepower to lap times. But with 98.6% retrieval accuracy, an LLM would only need to read the right context and generate a correct answer, which is the easy part. We plan to add end-to-end evaluation in a future update.
What Makes OpenExp Different
Most memory systems stop at retrieval. OpenExp adds a layer that none of the systems above have: Q-learning.
Every memory in OpenExp has a Q-value (0 to 1) that changes based on outcomes. When a memory contributes to a successful task — a closed deal, a merged PR, a resolved ticket — its Q-value increases. Memories that lead to dead ends sink over time.
This means OpenExp doesn't just remember — it learns what's worth remembering. After a month of use, the memories that actually led to results surface first. This Q-learning signal was not included in the benchmark above — it's an additional boost on top of the 98.6% retrieval score.
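In spirit, the update is a simple reinforcement-learning moving average. The sketch below is a minimal illustration; the learning rate and reward shaping OpenExp actually uses may differ.

```python
def update_q(q: float, reward: float, alpha: float = 0.1) -> float:
    """Nudge a memory's Q-value toward the observed outcome.

    reward = 1.0 when the memory contributed to a success (closed deal,
    merged PR, resolved ticket), 0.0 when it led to a dead end.
    """
    q = q + alpha * (reward - q)
    return min(1.0, max(0.0, q))  # keep Q in [0, 1]

q = 0.5
for _ in range(5):        # a memory repeatedly tied to wins...
    q = update_q(q, 1.0)  # ...drifts toward 1 and surfaces earlier in search
```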
Want to add persistent memory to your AI agent? OpenExp is open source and runs entirely on your machine. No API costs, no data leaves your system. Check it out on GitHub or book a call if you want help integrating it.
Reproduce These Results
Everything is open source. Run the benchmark yourself:
```bash
git clone https://github.com/anthroos/openexp.git
cd openexp && pip install -e .

# Download dataset (264 MB)
mkdir -p benchmarks/data
curl -L -o benchmarks/data/longmemeval_s_cleaned.json \
  "https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned/resolve/main/longmemeval_s_cleaned.json"

# Run (~60 min each on Apple Silicon)
python benchmarks/longmemeval_bench.py benchmarks/data/longmemeval_s_cleaned.json --mode raw
python benchmarks/longmemeval_bench.py benchmarks/data/longmemeval_s_cleaned.json --mode hybrid
python benchmarks/longmemeval_bench.py benchmarks/data/longmemeval_s_cleaned.json --mode full
```
Key Takeaways
- Hybrid search is the sweet spot. Vector embeddings + 10% BM25 keyword matching gives the best balance. This is OpenExp's production default.
- Simple scoring beats complex architecture. MemPalace's "palace" structure regresses from 96.6% to 84.2% Recall@5; our straightforward hybrid reaches 96.4% Recall@5 and 98.6% Recall@10. Don't over-engineer.
- Zero LLM calls, fully local. No API costs, no latency, no data leaving your machine. Every search completes in under 100ms in production.
- Q-learning is the next frontier. The benchmark doesn't capture OpenExp's unique strength — memories that led to real outcomes rank higher over time. This is the difference between a search engine and a learning system.
Building AI agents that need to remember? Whether it's a sales assistant that learns from closed deals, a support bot that recalls customer preferences, or a coding agent that remembers what worked — we can help you integrate persistent, learning memory. Book a free 30-min call or email us.