Benchmarks / Apr 2026

Atlas-CE vs MemPalace

We ran a head-to-head on all four of MemPalace's published benchmarks. Same data, same embedder, same bench scripts. Here is what we found.

Apollo Raines · atlas-rag.com · ATLAS-LRAG-CE on GitHub

TL;DR

Why we ran this

A claim made the rounds: "Actress Milla Jovovich just released a free open-source AI memory system that scored 100% on LongMemEval, beating every paid solution."

It got our attention. We publish strong numbers on the same benchmarks, and in eighteen months of building Atlas we had not seen another serious project come close on deterministic conversational memory retrieval. If someone had actually cracked 100%, that is real news.

So we ran MemPalace on our hardware, on the same dataset, with the same embedding model, through the same bench scripts. Forked the scripts with a one-line swap of the collection constructor to point at Atlas-CE's adapter. Identical input, identical scoring. The goal was not to prove who is better. The goal was to see -- honestly, with identical data -- where two serious attempts at this problem actually land.

What "100%" actually means

MemPalace's tagline number is R@5 = 96.6% on LongMemEval raw -- the one in their README. The viral "100%" figure is R@50.

LongMemEval's "s" dataset averages about 53 haystack sessions per question. If you return 50 of those 53, you mathematically hit every session that matters and recall clocks 1.0. That is not a retrieval achievement. That is the haystack being smaller than K.

Every vector retriever on MiniLM-L6-v2 embeddings hits R@50 = 1.000 on that dataset. MemPalace does. ChromaDB raw does. Atlas-CE does. A random shuffle does, often enough. The denominator is doing the work, not the retriever.
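The arithmetic is easy to verify. A minimal sketch of recall@K (hypothetical session IDs, not LongMemEval's real format) showing how K = 50 against a ~53-item haystack all but guarantees a perfect score regardless of ranking quality:

```python
def recall_at_k(relevant: set, ranked: list, k: int) -> float:
    """Fraction of relevant items that appear in the top-k of a ranking."""
    hits = relevant & set(ranked[:k])
    return len(hits) / len(relevant)

# Hypothetical question: 53 haystack sessions, 2 of them relevant.
haystack = [f"session-{i}" for i in range(53)]
relevant = {"session-7", "session-31"}

# With k=50 over 53 sessions, only 3 positions are excluded, so almost
# any ordering -- even a reversed one -- returns every relevant session.
print(recall_at_k(relevant, haystack, k=50))
print(recall_at_k(relevant, list(reversed(haystack)), k=50))
```

Both calls print 1.0: the denominator (haystack barely larger than K) is doing the work, exactly as described above.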

The real numbers at every K that matters for an LLM actually reading results:

K       MemPalace   Atlas-CE SM   Atlas-CE HM
R@1     0.806       0.806         0.944 (+13.8)
R@3     0.926       0.926         0.980 (+5.4)
R@5     0.966       0.966         0.990 (+2.4)
R@10    0.982       0.982         0.996 (+1.4)
R@30    0.996       0.996         0.996
R@50    1.000       1.000         1.000

Credit where due: after an exhaustive audit filed as MemPalace issue #875 on April 14, the MemPalace team merged PR #897 the following day. The "100% on LongMemEval" headline, the "highest-scoring AI memory system ever benchmarked" banner, the "+34% palace boost" line, and the cross-system comparison tables (which were mixing retrieval-recall with competitor QA-accuracy numbers) were all removed. The README now reports the honest held-out figure of 98.4% R@5 on 450 unseen questions. Co-founder bensig acknowledged publicly that "100% shouldn't be headlined like that." The repo is now clean.

The social-media claim that triggered this benchmark -- screenshots, videos, third-party posts -- is still circulating uncorrected in the wild, which is what got us to run the comparison in the first place.

"Beating every paid solution" is doing even more work. No paid retrieval system publishes a raw LongMemEval R@1 meaningfully different from 0.80 on MiniLM -- roughly 0.80 is a shared ceiling of the embedding model, not something MemPalace unlocked.

Built on top vs. built from scratch

MemPalace imports chromadb. Their storage backend is a thin wrapper around ChromaDB's vector store; the retrieval numbers everyone is measuring -- theirs, ours, @gizmax's M2 Ultra reproduction -- are measuring ChromaDB's Rust HNSW index and default sentence-transformer embeddings, not anything MemPalace ships. Wings, rooms, drawers, and AAAK are UX metaphors layered on top of that substrate. Organizational, not a retrieval mechanism. The team's own PR #897 acknowledges this explicitly: palace filtering is "standard metadata filtering in the vector store, not a novel retrieval mechanism." And MemPalace's own proprietary modes (AAAK, rooms) actually score worse than the raw ChromaDB they wrap -- -12.4 and -7.2 points on LongMemEval R@5 respectively, per issue #39.

Atlas-CE LRAG has no ChromaDB underneath. BM25 is our own lexical index. The vector store is ours. Three-lane fusion is ours. The deterministic sort by (-score, chunk_id) that guarantees bit-identical retrieval across runs is ours. Content-addressable chunk IDs derived from SHA-256 of content are ours. When we want to add a cross-encoder rerank (the HM mode below), we ship it in a single file. When we want mandatory tenant isolation on every call, we ship it. When we want an audit trace with nine introspection fields per query, we ship it.
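A minimal sketch of the two determinism ingredients just named -- chunk IDs derived from SHA-256 of content, and the (-score, chunk_id) sort key. Function names here are illustrative, not Atlas-CE's actual API:

```python
import hashlib

def chunk_id(content: str) -> str:
    # Content-addressable: identical content always yields the same ID,
    # so the tiebreak below is stable across machines, runs, and restarts.
    return hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]

def rank(scored):
    """scored: list of (score, content) pairs from any retrieval lane."""
    # Primary key: descending score. Secondary key: chunk_id, which breaks
    # floating-point score ties identically on every run -- never by
    # insertion order or hash-map iteration.
    return sorted(
        ((s, chunk_id(c), c) for s, c in scored),
        key=lambda t: (-t[0], t[1]),
    )

# Two chunks with an exact score tie: their relative order is decided by
# their content hashes, so shuffling the input cannot change the output.
results = rank([(0.9, "alpha"), (0.9, "beta"), (0.7, "gamma")])
print([content for _, _, content in results])
```

Reordering the input list, re-ingesting the corpus, or rerunning on another machine produces the same ranking -- which is what "deterministic by construction" means in practice.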

This is the difference between a product and a substrate. MemPalace is a product -- a thoughtful, well-designed consumer memory UX over a retrieval engine they do not own. Atlas-CE LRAG is the retrieval engine.

The two are at different tiers of the stack, and the benchmark reveals exactly what you would expect. On the rank where the benchmark is really testing ChromaDB's embeddings -- R@5 raw, which both systems inherit from the same MiniLM-L6-v2 vectors -- we tie. On the rank where the architecture of the retrieval layer matters -- R@1 with a cross-encoder reranker that requires owning the pipeline to ship -- we pull ahead by 14 points. Nothing about that is a comment on MemPalace's product quality. It is a comment on which layer each system sits at.

Standard Mode results -- the honest part

Atlas-CE SM versus MemPalace on raw-mode retrieval is really Atlas-CE SM versus ChromaDB. Same vectors, different retriever wrappers around the same embedder. The SM ties below are not a MemPalace achievement and not an Atlas miss -- they are both systems correctly reading the same embedding space.

Atlas-CE Standard Mode (three-lane hybrid fusion: metadata, BM25, vector; locked weights w_meta=0.10, w_lex=0.25, w_vec=0.65; deterministic sort by (-score, chunk_id)) ties MemPalace bit-for-bit on three of four benchmarks:

Benchmark                      MemPalace   Atlas-CE SM   Delta
LongMemEval R@5 (500 Q)        0.966       0.966          0.000
LoCoMo avg (1986 Q, top-50)    1.000       1.000          0.000
ConvoMem avg (1250 items)      0.890       0.890          0.000
MemBench R@5 (8500 items)      0.803       0.797         -0.006

MemBench was our one SM loss -- by 0.6 points. Two caveats worth adding: on the two hardest categories (noisy and post-processing, where vector similarity smears across distractors), CE beats MemPalace. Noisy: CE 44.2% vs MemPalace 43.4% (+0.8). Post-processing: CE 58.1% vs MemPalace 56.6% (+1.5). Fusion earns its keep exactly where single-lane vector gives up.

Why the ties? Same embedder. Both systems read the same MiniLM 384-dim vector space. ChromaDB raw is vector-only. CE's fusion with w_vec=0.65 collapses to vector-dominated output when BM25 is uniform across candidates and metadata is trivial. Both sort the same space.
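The collapse is easy to see in a toy sketch of the weighted fusion. The weights are the locked values quoted above; the normalization and names are illustrative, not Atlas-CE's implementation. When the lexical lane is uniform across candidates and the metadata lane trivially satisfied, the fused ranking reduces to the vector ranking:

```python
W_META, W_LEX, W_VEC = 0.10, 0.25, 0.65  # locked fusion weights

def fuse(meta, lex, vec):
    """Linear fusion of per-candidate lane scores (dicts: candidate -> score)."""
    return {
        c: W_META * meta[c] + W_LEX * lex[c] + W_VEC * vec[c]
        for c in vec
    }

cands = ["a", "b", "c"]
vec  = {"a": 0.92, "b": 0.85, "c": 0.40}   # vector lane differentiates
lex  = {c: 0.5 for c in cands}             # BM25 uniform across candidates
meta = {c: 1.0 for c in cands}             # metadata trivially satisfied

fused = fuse(meta, lex, vec)
order = sorted(cands, key=lambda c: -fused[c])
# Uniform lanes add the same constant to every candidate, so the fused
# order is exactly the vector-lane order -- hence the bit-for-bit ties.
print(order)
```

This is why the SM ties are neither an achievement nor a miss: when only one lane carries signal, any correct linear fusion sorts the same space.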

LoCoMo at a meaningful K

The LoCoMo 1.000 vs 1.000 tie is another artifact of the top-K exceeding haystack size (~20 sessions per conversation, top-50 retrieval). Rescored at top-5:

Category             MemPalace   Atlas-CE SM   CE wins by
Single-hop           0.388       0.408         +0.020
Temporal             0.588       0.610         +0.022
Temporal-inference   0.343       0.355         +0.012
Open-domain          0.448       0.483         +0.035
Adversarial          0.471       0.493         +0.022
Avg                  0.462       0.489         +0.027

CE SM wins every category. Three-lane fusion is doing real work at a K that actually measures retrieval quality; it was invisible at top-50 because the corpus was smaller than K.

High-Accuracy Mode

HM adds a deterministic cross-encoder reranker (BAAI/bge-reranker-v2-m3) over the top 20 first-stage candidates. Ranks 21 through K come from the first-stage tail unchanged, so large-K recall is unaffected; small-K recall benefits from the cross-encoder signal.
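A sketch of that splice, with a crude stand-in scorer in place of the real BGE cross-encoder (everything here is illustrative, not Atlas-CE's actual code): only the top 20 first-stage candidates are rescored, and the tail is appended unchanged, so the set of documents at any K >= 20 is identical to the first stage's.

```python
RERANK_DEPTH = 20

def cross_encoder_score(query: str, doc: str) -> float:
    # Stand-in for the real cross-encoder (BAAI/bge-reranker-v2-m3 in HM);
    # crude token overlap, just to make the sketch runnable.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def rerank(query: str, first_stage: list[str], k: int) -> list[str]:
    head = first_stage[:RERANK_DEPTH]
    tail = first_stage[RERANK_DEPTH:]   # untouched beyond rank 20
    # Python's sort is stable: candidates the cross-encoder scores equally
    # keep their first-stage order, preserving determinism.
    head = sorted(head, key=lambda d: -cross_encoder_score(query, d))
    return (head + tail)[:k]
```

Because only the head is permuted, large-K recall is mathematically unaffected; small-K recall inherits whatever signal the cross-encoder adds.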

LongMemEval HM, 500 questions, same bench -- the per-K numbers are in the table near the top of this page.

The 13.8-point lift at K=1 is the rank that matters: top-1 is what an LLM actually reads when you hand it a retrieval result. The HM path is opt-in via accuracy_mode="HM" in config, or the ATLAS_ACCURACY_MODE=HM env var. Latency cost is roughly 10x on CPU due to the reranker; a pinned batch size preserves determinism.

MemPalace has no reranker path. Their "hybrid" mode is a keyword-overlap scalar multiplier baked into their bench code, not a cross-encoder.

Honest accounting on the other benchmarks. HM is not a uniform lift across every workload -- it is a lift on workloads where a cross-encoder's training distribution matches the retrieval shape.

MemBench HM on the noisy category (1000 items, the hardest category in MemBench where vector similarity smears across distractors) delivered a clean win:

System                    MemBench noisy R@5
MemPalace (SM / hybrid)   0.434
Atlas-CE (SM)             0.442
Atlas-CE (HM)             0.601 (+16.7)

Sixteen-point lift on the hardest MemBench category, same embedder, same bench harness. The reranker earns its keep exactly where vector similarity alone gives up.

ConvoMem HM (1250 items, same corpus, same embedder) landed as a tie overall at 0.890 avg -- splitting the per-category picture: wins on Abstention (+1.6), Preferences (+0.8), User Facts (+0.8); losses on Assistant Facts (-0.4) and Implicit Connections (-3.1). The Implicit Connections drop is genuine -- multi-hop memory reasoning is out of the BGE reranker-v2-m3 training distribution (MS-MARCO passage retrieval), and the reranker actively mis-scores those queries. A domain-trained reranker would likely flip that result; that is one of the things SAIQL Enterprise's HM configuration does.

LoCoMo HM (top-5) was attempted twice and failed to complete both times at the same conversation (conv-48), for reasons we have not fully diagnosed. Partial data through conversation 48 tracked positive, but we did not produce a clean full-run result.

Net HM picture across the four benchmarks: two clean wins (LongMemEval R@1 +13.8, MemBench noisy R@5 +16.7), one tie overall with mixed per-category results (ConvoMem), one incomplete (LoCoMo). The pattern is visible: HM lifts at K=1 when the reranker's training distribution matches, and ties or slightly underperforms when it does not. Cross-encoder reranking is not magic; it is a tuned tool. Atlas-CE ships it as an opt-in mode precisely so it can be deployed where it helps and disabled where it does not.

Property tests -- what recall does not measure

Four tests designed around things ChromaDB-backed systems structurally cannot do without rewriting their foundation:

Property                                              | MemPalace / ChromaDB                                          | Atlas-CE
Prompt-injection rejection at ingest (100 payloads)   | 0% blocked                                                    | 40% blocked (stock patterns; configurable up)
Multi-tenant isolation (5 tenants, 100 queries)       | Caller-dependent (naive: 80% cross-tenant leak; isolated: 0%) | API-enforced (0% leak, tenant_id required on every call)
Audit trace fields per query                          | 5                                                             | 9 in SM, 12 in HM
Per-tenant rate limit (500 rapid queries)             | 500 / 500 accepted (no cap)                                   | 100 / 500 accepted (default 100/min)
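What "API-enforced" means here can be sketched with hypothetical names (TenantStore, RATE_LIMIT -- none of this is Atlas-CE's actual interface): tenant_id is a required parameter on the query path and the search is scoped before scoring, so isolation does not depend on caller discipline, and the default 100-queries-per-minute cap is applied per tenant.

```python
import time
from collections import defaultdict, deque

RATE_LIMIT = 100          # default per-tenant cap (100/min per the table)
WINDOW_SECONDS = 60.0

class TenantStore:
    def __init__(self):
        self._docs = defaultdict(list)   # tenant_id -> documents
        self._hits = defaultdict(deque)  # tenant_id -> request timestamps

    def add(self, tenant_id: str, doc: str) -> None:
        self._docs[tenant_id].append(doc)

    def query(self, tenant_id: str, needle: str) -> list[str]:
        # tenant_id is a required argument: there is no "query everything"
        # code path for a caller to forget to filter.
        hits = self._hits[tenant_id]
        now = time.monotonic()
        while hits and now - hits[0] > WINDOW_SECONDS:
            hits.popleft()               # drop timestamps outside the window
        if len(hits) >= RATE_LIMIT:
            raise RuntimeError(f"rate limit exceeded for tenant {tenant_id}")
        hits.append(now)
        # Search is scoped to this tenant's partition only.
        return [d for d in self._docs[tenant_id] if needle in d]
```

Contrast with the caller-dependent row above: a wrapper over a shared vector store can achieve the same isolation, but only if every caller remembers to filter; an interface that cannot express the unscoped query cannot leak.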

The audit trace alone is the difference between "the system returned these chunks" and "the system filtered X candidates at the metadata stage, narrowed to Y at lexical, refined to Z at vector, fused with these weights, and the cache served result W." Compliance officers can follow that. They cannot follow ChromaDB.

Architectural properties

Where Atlas came from

Atlas-CE LRAG began development in September 2024 as part of a full stack called SAIQL (Semantic Artificial Intelligence Query Language), well before the "local-first AI memory" category had a public face. It was built for Nova, an enterprise crypto trading AI with requirements that off-the-shelf databases could not meet.

Postgres could not do it. Redis could not. Elasticsearch, Pinecone, Weaviate, ChromaDB -- not individually, not combined. So the stack was built from first principles. The LLM memory use case that MemPalace targets is a strict subset: fewer writes per second, slightly more latency tolerance, same fundamental requirements on determinism, temporal truth, and provenance.

Atlas-CE LRAG is the source-available retrieval module -- the open-source version of Atlas. It is the piece we benchmarked. The full paid product is SAIQL Enterprise, a proprietary semantic database engine in which Atlas is one module among several.

None of that reached the benchmark. The benchmark tested Atlas-CE LRAG against MemPalace at the retrieval layer, where they are actually direct peers. SAIQL Enterprise's other modules solve different problems -- compression, indexing speed, agent-native storage, content safety -- that the four benchmarks here do not measure.

Methodology & repro

Source: ATLAS-LRAG-CE (Open Lore License v1.1, source-available). Full technical writeup with per-category breakdowns, bench scripts, and property-test code are in the repo.

A note on how we see MemPalace

We did not initially see MemPalace as a threat -- they sit at the personal-memory product tier, Atlas-CE LRAG sits at the retrieval-substrate tier, and those are different problems. Our first read was "the two stacks could compose into something neither team ships alone today" -- MemPalace's UX, hook integrations, and community reach paired with Atlas's deterministic retrieval, LoreToken compression for cheaper inference, and SAIQL Enterprise's indexing and agent-native storage underneath. That was what motivated the outreach.

That said: after reading MemPalace's own issue tracker (#39, #125, #875) it became clear the viral "100% AI memory" story is ChromaDB doing the retrieval work with a palace UX on top. MemPalace's own proprietary modes (AAAK, rooms) score worse than the raw ChromaDB they wrap. What reached 48,000 stars and a celebrity backer is a product layer over an off-the-shelf vector database. What we built is the retrieval substrate underneath. That gap is a distribution story, not an engineering one.

We have tried to reach them. We have been ghosted. So perhaps they do not feel the same. Or perhaps they are simply buried -- a comment saying "hey, we built a provable memory substrate worth benchmarking" does not win the thread against "OMG QUEEN loved u in Resident Evil!!! marry me!!!"

That is not the audience MemPalace meant to reach. Nine out of ten of those fans could not tell a vector from a vending machine. They are there for the star, not the stack -- wonderful energy for a movie premiere, less useful for onboarding enterprise retrieval customers.

Get in touch

If you are a builder, buyer, investor, or a MemPalace team member, the form below reaches us directly. Invisible spam protection via reCAPTCHA; no tracking beyond that. If we sound overly excited when we respond, it is because it is rare we get an email. Last one we got was from Prince Buzakwhappo offering half his $15 billion if we would just help transfer the funds to the U.S. But even he never wrote back.


One more thing, for the record

Atlas-CE LRAG is the first true Deterministic RAG in the industry. Development started September 2024 as part of SAIQL, built for Nova. Atlas ran in private for over a year before we stripped the proprietary pieces and open-sourced the retrieval core as Atlas-CE LRAG in 2026. The rest of SAIQL stays proprietary. Same query, same corpus, same config -- bit-identical top-K results every run, guaranteed by the sort key, by content-addressable chunk IDs, and by locked fusion weights validated at startup. Not "usually deterministic." Not "deterministic if you squint at a single-threaded batch ingest on an ephemeral client." Deterministic by construction.

Other systems are starting to use the word. Read their source. Most are non-deterministic under any real load -- concurrent writes, parallel index construction, floating-point tiebreakers, HNSW layer randomization, RNG-seeded rerank. The word is cheap. The property is not.

And Atlas-CE LRAG is only the community edition. SAIQL Enterprise is a dramatically more powerful system -- a proprietary semantic database engine in which Atlas is one module among several.

Atlas-CE LRAG is the smallest defensible slice of the stack -- the piece we publish to prove the architecture works. Every property that makes the benchmark numbers above reproducible -- deterministic sort, content-addressable IDs, locked weights -- also holds at every upper layer of SAIQL Enterprise, where the retrieval layer is the cheapest tier of what we do.

If a competitor ships a "deterministic RAG" in the next twelve months whose implementation looks essentially like Atlas-CE LRAG -- three-lane fusion over a vector store, content-addressable chunks, (-score, chunk_id) tiebreak, cross-encoder rerank with pinned batches -- understand what you are looking at. The public history on this repository predates theirs, and the SAIQL Enterprise stack above it is years of separate work they still do not have. The copy proves the category. It does not catch the team.

About this work

Atlas-CE LRAG is in the open because deterministic RAG should spread across the industry, not sit locked behind a paywall or a trade secret. Fork it, read it, run it, build on it. That is what open is for.

It was designed and built by Apollo Raines, starting September 2024, as part of the SAIQL stack. Atlas ran in private for over a year powering Nova -- first as Atlas Semantic RAG, later renamed Atlas-LRAG -- before it was extracted, stripped of proprietary code, and open-sourced in 2026 as Atlas-CE LRAG. The numbers on this page are reproducible on any machine with the benchmark scripts and a MiniLM embedder. The work behind them was long, and mostly solitary. The code, the benchmarks, and the harness are the receipts.

Atlas-RAG · Apollo Raines · home · ai-readme · contact · github