Benchmarks / Apr 2026

Atlas-CE vs MemPalace

We ran a head-to-head on all four of MemPalace's published benchmarks. Same data, same embedder, same bench scripts. Here is what we found.

Apollo Raines · atlas-rag.com · ATLAS-LRAG-CE on GitHub

TL;DR

Why we ran this

A claim made the rounds: "Actress Milla Jovovich just released a free open-source AI memory system that scored 100% on LongMemEval, beating every paid solution."

It got our attention. We publish strong numbers on the same benchmarks, and in eighteen months of building Atlas we had not seen another serious project come close on deterministic conversational memory retrieval. If someone had actually cracked 100%, that is real news.

So we ran MemPalace on our hardware, on the same dataset, with the same embedding model, through the same bench scripts. Forked the scripts with a one-line swap of the collection constructor to point at Atlas-CE's adapter. Identical input, identical scoring. The goal was not to prove who is better. The goal was to see -- honestly, with identical data -- where two serious attempts at this problem actually land.

What "100%" actually means

MemPalace's tagline number is R@5 = 96.6% on LongMemEval raw -- the one in their README. The viral "100%" figure is R@50.

LongMemEval's "s" dataset averages about 53 haystack sessions per question. If you return 50 of those 53, you mathematically hit every session that matters and recall clocks 1.0. That is not a retrieval achievement. That is the haystack being smaller than K.

Every vector retriever on MiniLM-L6-v2 embeddings hits R@50 = 1.000 on that dataset. MemPalace does. ChromaDB raw does. Atlas-CE does. A random shuffle does, often enough. The denominator is doing the work, not the retriever.
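The arithmetic is easy to verify. A minimal sketch of recall@K (hypothetical session IDs, not LongMemEval's real format) showing how K = 50 against a ~53-item haystack all but guarantees a perfect score regardless of ranking quality:

```python
def recall_at_k(relevant: set, ranked: list, k: int) -> float:
    """Fraction of relevant items that appear in the top-k of a ranking."""
    hits = relevant & set(ranked[:k])
    return len(hits) / len(relevant)

# Hypothetical question: 53 haystack sessions, 2 of them relevant.
haystack = [f"session-{i}" for i in range(53)]
relevant = {"session-7", "session-31"}

# With k=50 over 53 sessions, only 3 positions are excluded, so almost
# any ordering -- even a reversed one -- returns every relevant session.
print(recall_at_k(relevant, haystack, k=50))
print(recall_at_k(relevant, list(reversed(haystack)), k=50))
```

Both calls print 1.0: the denominator (haystack barely larger than K) is doing the work, exactly as described above.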

The real numbers at every K that matters for an LLM actually reading results:

K       MemPalace   Atlas-CE SM   Atlas-CE HM
R@1     0.806       0.806         0.944 (+13.8)
R@3     0.926       0.926         0.980 (+5.4)
R@5     0.966       0.966         0.990 (+2.4)
R@10    0.982       0.982         0.996 (+1.4)
R@30    0.996       0.996         0.996
R@50    1.000       1.000         1.000

Credit where due: after an exhaustive audit filed as MemPalace issue #875 on April 14, the MemPalace team merged PR #897 the following day. The "100% on LongMemEval" headline, the "highest-scoring AI memory system ever benchmarked" banner, the "+34% palace boost" line, and the cross-system comparison tables (which were mixing retrieval-recall with competitor QA-accuracy numbers) were all removed. The README now reports the honest held-out figure of 98.4% R@5 on 450 unseen questions. Co-founder bensig acknowledged publicly that "100% shouldn't be headlined like that." The repo is now clean.

The social-media claim that triggered this benchmark -- screenshots, videos, third-party posts -- is still circulating uncorrected in the wild, which is what got us to run the comparison in the first place.

"Beating every paid solution" is doing even more work. No paid retrieval system publishes a raw LongMemEval R@1 meaningfully different from 0.80 on MiniLM -- roughly 0.80 is a shared ceiling of the embedding model, not something MemPalace unlocked.

Built on top vs. built from scratch

MemPalace imports chromadb. Their storage backend is a thin wrapper around ChromaDB's vector store; the retrieval numbers everyone is measuring -- theirs, ours, @gizmax's M2 Ultra reproduction -- are measuring ChromaDB's Rust HNSW index and default sentence-transformer embeddings, not anything MemPalace ships. Wings, rooms, drawers, and AAAK are UX metaphors layered on top of that substrate. Organizational, not a retrieval mechanism. The team's own PR #897 acknowledges this explicitly: palace filtering is "standard metadata filtering in the vector store, not a novel retrieval mechanism." And MemPalace's own proprietary modes (AAAK, rooms) actually score worse than the raw ChromaDB they wrap -- -12.4 and -7.2 points on LongMemEval R@5 respectively, per issue #39.

Atlas-CE LRAG has no ChromaDB underneath. BM25 is our own lexical index. The vector store is ours. Three-lane fusion is ours. The deterministic sort by (-score, chunk_id) that guarantees bit-identical retrieval across runs is ours. Content-addressable chunk IDs derived from SHA-256 of content are ours. When we want to add a cross-encoder rerank (the HM mode below), we ship it in a single file. When we want mandatory tenant isolation on every call, we ship it. When we want an audit trace with nine introspection fields per query, we ship it.
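A minimal sketch of the two determinism ingredients just named -- chunk IDs derived from SHA-256 of content, and the (-score, chunk_id) sort key. Function names here are illustrative, not Atlas-CE's actual API:

```python
import hashlib

def chunk_id(content: str) -> str:
    # Content-addressable: identical content always yields the same ID,
    # so the tiebreak below is stable across machines, runs, and restarts.
    return hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]

def rank(scored):
    """scored: list of (score, content) pairs from any retrieval lane."""
    # Primary key: descending score. Secondary key: chunk_id, which breaks
    # floating-point score ties identically on every run -- never by
    # insertion order or hash-map iteration.
    return sorted(
        ((s, chunk_id(c), c) for s, c in scored),
        key=lambda t: (-t[0], t[1]),
    )

# Two chunks with an exact score tie: their relative order is decided by
# their content hashes, so shuffling the input cannot change the output.
results = rank([(0.9, "alpha"), (0.9, "beta"), (0.7, "gamma")])
print([content for _, _, content in results])
```

Reordering the input list, re-ingesting the corpus, or rerunning on another machine produces the same ranking -- which is what "deterministic by construction" means in practice.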

This is the difference between a product and a substrate. MemPalace is a product -- a thoughtful, well-designed consumer memory UX over a retrieval engine they do not own. Atlas-CE LRAG is the retrieval engine.

The two are at different tiers of the stack, and the benchmark reveals exactly what you would expect. On the rank where the benchmark is really testing ChromaDB's embeddings -- R@5 raw, which both systems inherit from the same MiniLM-L6-v2 vectors -- we tie. On the rank where the architecture of the retrieval layer matters -- R@1 with a cross-encoder reranker that requires owning the pipeline to ship -- we pull ahead by 14 points. Nothing about that is a comment on MemPalace's product quality. It is a comment on which layer each system sits at.

Standard Mode results -- the honest part

Atlas-CE SM versus MemPalace on raw-mode retrieval is really Atlas-CE SM versus ChromaDB. Same vectors, different retriever wrappers around the same embedder. The SM ties below are not a MemPalace achievement and not an Atlas miss -- they are both systems correctly reading the same embedding space.

Atlas-CE Standard Mode (three-lane hybrid fusion: metadata, BM25, vector; locked weights w_meta=0.10, w_lex=0.25, w_vec=0.65; deterministic sort by (-score, chunk_id)) ties MemPalace bit-for-bit on three of four benchmarks:

Benchmark                      MemPalace   Atlas-CE SM   Delta
LongMemEval R@5 (500 Q)        0.966       0.966          0.000
LoCoMo avg (1986 Q, top-50)    1.000       1.000          0.000
ConvoMem avg (1250 items)      0.890       0.890          0.000
MemBench R@5 (8500 items)      0.803       0.797         -0.006

MemBench was our one SM loss -- by 0.6 points. Two caveats worth adding: on the two hardest categories (noisy and post-processing, where vector similarity smears across distractors), CE beats MemPalace. Noisy: CE 44.2% vs MemPalace 43.4% (+0.8). Post-processing: CE 58.1% vs MemPalace 56.6% (+1.5). Fusion earns its keep exactly where single-lane vector gives up.

Why the ties? Same embedder. Both systems read the same MiniLM 384-dim vector space. ChromaDB raw is vector-only. CE's fusion with w_vec=0.65 collapses to vector-dominated output when BM25 is uniform across candidates and metadata is trivial. Both sort the same space.
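The collapse is easy to see in a toy sketch of the weighted fusion. The weights are the locked values quoted above; the normalization and names are illustrative, not Atlas-CE's implementation. When the lexical lane is uniform across candidates and the metadata lane trivially satisfied, the fused ranking reduces to the vector ranking:

```python
W_META, W_LEX, W_VEC = 0.10, 0.25, 0.65  # locked fusion weights

def fuse(meta, lex, vec):
    """Linear fusion of per-candidate lane scores (dicts: candidate -> score)."""
    return {
        c: W_META * meta[c] + W_LEX * lex[c] + W_VEC * vec[c]
        for c in vec
    }

cands = ["a", "b", "c"]
vec  = {"a": 0.92, "b": 0.85, "c": 0.40}   # vector lane differentiates
lex  = {c: 0.5 for c in cands}             # BM25 uniform across candidates
meta = {c: 1.0 for c in cands}             # metadata trivially satisfied

fused = fuse(meta, lex, vec)
order = sorted(cands, key=lambda c: -fused[c])
# Uniform lanes add the same constant to every candidate, so the fused
# order is exactly the vector-lane order -- hence the bit-for-bit ties.
print(order)
```

This is why the SM ties are neither an achievement nor a miss: when only one lane carries signal, any correct linear fusion sorts the same space.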

LoCoMo at a meaningful K

The LoCoMo 1.000 vs 1.000 tie is another artifact of the top-K exceeding haystack size (~20 sessions per conversation, top-50 retrieval). Rescored at top-5:

Category             MemPalace   Atlas-CE SM   CE wins by
Single-hop           0.388       0.408         +0.020
Temporal             0.588       0.610         +0.022
Temporal-inference   0.343       0.355         +0.012
Open-domain          0.448       0.483         +0.035
Adversarial          0.471       0.493         +0.022
Avg                  0.462       0.489         +0.027

CE SM wins every category. Three-lane fusion is doing real work at a K that actually measures retrieval quality; it was invisible at top-50 because the corpus was smaller than K.

High-Accuracy Mode

HM adds a deterministic cross-encoder reranker (BAAI/bge-reranker-v2-m3) over the top 20 first-stage candidates. Ranks 21 through K come from the first-stage tail unchanged, so large-K recall is unaffected; small-K recall benefits from the cross-encoder signal.
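A sketch of that splice, with a crude stand-in scorer in place of the real BGE cross-encoder (everything here is illustrative, not Atlas-CE's actual code): only the top 20 first-stage candidates are rescored, and the tail is appended unchanged, so the set of documents at any K >= 20 is identical to the first stage's.

```python
RERANK_DEPTH = 20

def cross_encoder_score(query: str, doc: str) -> float:
    # Stand-in for the real cross-encoder (BAAI/bge-reranker-v2-m3 in HM);
    # crude token overlap, just to make the sketch runnable.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def rerank(query: str, first_stage: list[str], k: int) -> list[str]:
    head = first_stage[:RERANK_DEPTH]
    tail = first_stage[RERANK_DEPTH:]   # untouched beyond rank 20
    # Python's sort is stable: candidates the cross-encoder scores equally
    # keep their first-stage order, preserving determinism.
    head = sorted(head, key=lambda d: -cross_encoder_score(query, d))
    return (head + tail)[:k]
```

Because only the head is permuted, large-K recall is mathematically unaffected; small-K recall inherits whatever signal the cross-encoder adds.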

LongMemEval HM, 500 questions, same bench -- the per-K numbers are in the table near the top of this page.

The 13.8-point lift at K=1 is the rank that matters: top-1 is what an LLM actually reads when you hand it a retrieval result. The HM path is opt-in via accuracy_mode="HM" in config, or the ATLAS_ACCURACY_MODE=HM env var. Latency cost is roughly 10x on CPU due to the reranker; a pinned batch size preserves determinism.

MemPalace has no reranker path. Their "hybrid" mode is a keyword-overlap scalar multiplier baked into their bench code, not a cross-encoder.

Honest accounting on the other benchmarks. HM is not a uniform lift across every workload -- it is a lift on workloads where a cross-encoder's training distribution matches the retrieval shape.

MemBench HM on the noisy category (1000 items, the hardest category in MemBench where vector similarity smears across distractors) delivered a clean win:

System                    MemBench noisy R@5
MemPalace (SM / hybrid)   0.434
Atlas-CE (SM)             0.442
Atlas-CE (HM)             0.601 (+16.7)

Sixteen-point lift on the hardest MemBench category, same embedder, same bench harness. The reranker earns its keep exactly where vector similarity alone gives up.

ConvoMem HM (1250 items, same corpus, same embedder) landed as a tie overall at 0.890 avg -- splitting the per-category picture: wins on Abstention (+1.6), Preferences (+0.8), User Facts (+0.8); losses on Assistant Facts (-0.4) and Implicit Connections (-3.1). The Implicit Connections drop is genuine -- multi-hop memory reasoning is out of the BGE reranker-v2-m3 training distribution (MS-MARCO passage retrieval), and the reranker actively mis-scores those queries. A domain-trained reranker would likely flip that result; that is one of the things SAIQL Enterprise's HM configuration does.

LoCoMo HM (top-5) was attempted twice and failed to complete both times at the same conversation (conv-48), for reasons we have not fully diagnosed. Partial data through conversation 48 tracked positive, but we did not produce a clean full-run result.

Net HM picture across the four benchmarks: two clean wins (LongMemEval R@1 +13.8, MemBench noisy R@5 +16.7), one tie overall with mixed per-category results (ConvoMem), one incomplete (LoCoMo). The pattern is visible: HM lifts at K=1 when the reranker's training distribution matches, and ties or slightly underperforms when it does not. Cross-encoder reranking is not magic; it is a tuned tool. Atlas-CE ships it as an opt-in mode precisely so it can be deployed where it helps and disabled where it does not.

Property tests -- what recall does not measure

Four tests designed around things ChromaDB-backed systems structurally cannot do without rewriting their foundation:

Property                                              | MemPalace / ChromaDB                                          | Atlas-CE
Prompt-injection rejection at ingest (100 payloads)   | 0% blocked                                                    | 40% blocked (stock patterns; configurable up)
Multi-tenant isolation (5 tenants, 100 queries)       | Caller-dependent (naive: 80% cross-tenant leak; isolated: 0%) | API-enforced (0% leak, tenant_id required on every call)
Audit trace fields per query                          | 5                                                             | 9 in SM, 12 in HM
Per-tenant rate limit (500 rapid queries)             | 500 / 500 accepted (no cap)                                   | 100 / 500 accepted (default 100/min)
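What "API-enforced" means here can be sketched with hypothetical names (TenantStore, RATE_LIMIT -- none of this is Atlas-CE's actual interface): tenant_id is a required parameter on the query path and the search is scoped before scoring, so isolation does not depend on caller discipline, and the default 100-queries-per-minute cap is applied per tenant.

```python
import time
from collections import defaultdict, deque

RATE_LIMIT = 100          # default per-tenant cap (100/min per the table)
WINDOW_SECONDS = 60.0

class TenantStore:
    def __init__(self):
        self._docs = defaultdict(list)   # tenant_id -> documents
        self._hits = defaultdict(deque)  # tenant_id -> request timestamps

    def add(self, tenant_id: str, doc: str) -> None:
        self._docs[tenant_id].append(doc)

    def query(self, tenant_id: str, needle: str) -> list[str]:
        # tenant_id is a required argument: there is no "query everything"
        # code path for a caller to forget to filter.
        hits = self._hits[tenant_id]
        now = time.monotonic()
        while hits and now - hits[0] > WINDOW_SECONDS:
            hits.popleft()               # drop timestamps outside the window
        if len(hits) >= RATE_LIMIT:
            raise RuntimeError(f"rate limit exceeded for tenant {tenant_id}")
        hits.append(now)
        # Search is scoped to this tenant's partition only.
        return [d for d in self._docs[tenant_id] if needle in d]
```

Contrast with the caller-dependent row above: a wrapper over a shared vector store can achieve the same isolation, but only if every caller remembers to filter; an interface that cannot express the unscoped query cannot leak.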

The audit trace alone is the difference between "the system returned these chunks" and "the system filtered X candidates at the metadata stage, narrowed to Y at lexical, refined to Z at vector, fused with these weights, and the cache served result W." Compliance officers can follow that. They cannot follow ChromaDB.

Architectural properties

Where Atlas came from

Atlas-CE LRAG began development in September 2024 as part of a full stack called SAIQL (Semantic Artificial Intelligence Query Language), well before the "local-first AI memory" category had a public face. It was built for Nova, an enterprise crypto trading AI with requirements that off-the-shelf databases could not meet.

Postgres could not do it. Redis could not. Elasticsearch, Pinecone, Weaviate, ChromaDB -- not individually, not combined. So the stack was built from first principles. The LLM memory use case that MemPalace targets is a strict subset: fewer writes per second, slightly more latency tolerance, same fundamental requirements on determinism, temporal truth, and provenance.

Atlas-CE LRAG is the source-available retrieval module -- the open-source version of Atlas. It is the piece we benchmarked. The full paid product is SAIQL Enterprise, a proprietary semantic database engine in which Atlas is one module among several.

None of that reached the benchmark. The benchmark tested Atlas-CE LRAG against MemPalace at the retrieval layer, where they are actually direct peers. SAIQL Enterprise's other modules solve different problems -- compression, indexing speed, agent-native storage, content safety -- that the four benchmarks here do not measure.

Methodology & repro

Source: ATLAS-LRAG-CE (Open Lore License v1.1, source-available). Full technical writeup with per-category breakdowns, bench scripts, and property-test code are in the repo.

A note on how we see MemPalace

We did not initially see MemPalace as a threat -- they sit at the personal-memory product tier, Atlas-CE LRAG sits at the retrieval-substrate tier, and those are different problems. Our first read was "the two stacks could compose into something neither team ships alone today" -- MemPalace's UX, hook integrations, and community reach paired with Atlas's deterministic retrieval, LoreToken compression for cheaper inference, and SAIQL Enterprise's indexing and agent-native storage underneath. That was what motivated the outreach.

That said: after reading MemPalace's own issue tracker (#39, #125, #875) it became clear the viral "100% AI memory" story is ChromaDB doing the retrieval work with a palace UX on top. MemPalace's own proprietary modes (AAAK, rooms) score worse than the raw ChromaDB they wrap. What reached 48,000 stars and a celebrity backer is a product layer over an off-the-shelf vector database. What we built is the retrieval substrate underneath. That gap is a distribution story, not an engineering one.

We have tried to reach them. We have been ghosted. So perhaps they do not feel the same. Or perhaps they are simply buried -- a comment saying "hey, we built a provable memory substrate worth benchmarking" does not win the thread against "OMG QUEEN loved u in Resident Evil!!! marry me!!!"

That is not the audience MemPalace meant to reach. Nine out of ten of those fans could not tell a vector from a vending machine. They are there for the star, not the stack -- wonderful energy for a movie premiere, less useful for onboarding enterprise retrieval customers.

Get in touch

If you are a builder, buyer, investor, or a MemPalace team member, the form below reaches us directly. Invisible spam protection via reCAPTCHA; no tracking beyond that. If we sound overly excited when we respond, it is because it is rare we get an email. Last one we got was from Prince Buzakwhappo offering half his $15 billion if we would just help transfer the funds to the U.S. But even he never wrote back.


One more thing, for the record

Atlas-CE LRAG is the first true Deterministic RAG in the industry. Development started September 2024 as part of SAIQL, built for Nova. Atlas ran in private for over a year before we stripped the proprietary pieces and open-sourced the retrieval core as Atlas-CE LRAG in 2026. The rest of SAIQL stays proprietary. Same query, same corpus, same config -- bit-identical top-K results every run, guaranteed by the sort key, by content-addressable chunk IDs, and by locked fusion weights validated at startup. Not "usually deterministic." Not "deterministic if you squint at a single-threaded batch ingest on an ephemeral client." Deterministic by construction.

Other systems are starting to use the word. Read their source. Most are non-deterministic under any real load -- concurrent writes, parallel index construction, floating-point tiebreakers, HNSW layer randomization, RNG-seeded rerank. The word is cheap. The property is not.

And Atlas-CE LRAG is only the community edition. SAIQL Enterprise is a dramatically more powerful system -- a proprietary semantic database engine in which Atlas is one module among several.

Atlas-CE LRAG is the smallest defensible slice of the stack -- the piece we publish to prove the architecture works. Every property that makes the benchmark numbers above reproducible -- deterministic sort, content-addressable IDs, locked weights -- also holds at every upper layer of SAIQL Enterprise, where the retrieval layer is the cheapest tier of what we do.

If a competitor ships a "deterministic RAG" in the next twelve months whose implementation looks essentially like Atlas-CE LRAG -- three-lane fusion over a vector store, content-addressable chunks, (-score, chunk_id) tiebreak, cross-encoder rerank with pinned batches -- understand what you are looking at. The public history on this repository predates theirs, and the SAIQL Enterprise stack above it is years of separate work they still do not have. The copy proves the category. It does not catch the team.

About this work

Atlas-CE LRAG is in the open because deterministic RAG should spread across the industry, not sit locked behind a paywall or a trade secret. Fork it, read it, run it, build on it. That is what open is for.

It was designed and built by Apollo Raines, starting September 2024, as part of the SAIQL stack. Atlas ran in private for over a year powering Nova -- first as Atlas Semantic RAG, later renamed Atlas-LRAG -- before it was extracted, stripped of proprietary code, and open-sourced in 2026 as Atlas-CE LRAG. The numbers on this page are reproducible on any machine with the benchmark scripts and a MiniLM embedder. The work behind them was long, and mostly solitary. The code, the benchmarks, and the harness are the receipts.

Atlas-RAG · Apollo Raines · home · ai-readme · contact · github