The AI memory system MemPalace, developed with Milla Jovovich's involvement, claimed perfect benchmark scores and quickly went viral, but the community later accused it of gaming its tests and presenting misleading data. Hands-on verification found the results were exaggerated and the code riddled with errors. The team has admitted the flaws and is working on fixes.
Yesterday (4/7), there was a big piece of news in the AI community: Hollywood star Milla Jovovich (known for Resident Evil and The Fifth Element) teamed up with developer Ben Sigman to use Claude Code to build the open-source AI memory system “MemPalace.”
For a while, the claim that “a Hollywood megastar crosses over to deliver a perfect-scoring project” spread widely. MemPalace has gained more than 20k stars on GitHub so far, but it didn’t take long for the developer community to start asking: is it genuinely impressive, or just hype?
First, let’s talk about the motivation behind MemPalace. The official documentation says it aims to solve the limitation that, in current AI systems, user-AI conversation content, decision-making processes, and architecture discussions typically disappear after a work session ends, causing months of effort to go to waste.
To address this issue, MemPalace uses a spatial architecture to store memories, clearly categorizing information into wing areas representing individuals or projects, along with structures at different levels such as corridors, rooms, and drawers, while preserving the original dialogue text for later semantic retrieval.
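The article describes this layout only in prose. As a rough illustration of the idea (not the project's actual code; all class and method names here are hypothetical), a wing/room/drawer hierarchy that preserves the original dialogue text might be sketched like this:

```python
from dataclasses import dataclass, field

# Illustrative sketch only: names follow the article's description of
# MemPalace's spatial layout, not its real implementation.
@dataclass
class Drawer:
    memories: list[str] = field(default_factory=list)  # raw dialogue text, kept verbatim

@dataclass
class Room:
    drawers: dict[str, Drawer] = field(default_factory=dict)

@dataclass
class Wing:
    # One wing per person or project, per the article's description.
    rooms: dict[str, Room] = field(default_factory=dict)

class MemoryPalace:
    def __init__(self) -> None:
        self.wings: dict[str, Wing] = {}

    def store(self, wing: str, room: str, drawer: str, text: str) -> None:
        """File the original dialogue text under wing/room/drawer for later retrieval."""
        w = self.wings.setdefault(wing, Wing())
        r = w.rooms.setdefault(room, Room())
        r.drawers.setdefault(drawer, Drawer()).memories.append(text)

palace = MemoryPalace()
palace.store("project-x", "architecture", "decisions", "We agreed to keep raw transcripts.")
```

Keeping the verbatim text in the leaves (rather than only summaries) is what would later allow semantic retrieval over the original wording.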
The development team claims that MemPalace achieved a perfect score of 100% on the long-term memory evaluation benchmark LongMemEval, and also reached 96.6% accuracy without calling any external APIs. It can run fully locally, without needing to subscribe to cloud services, and it is paired with an AAAK dialect system claimed to achieve 30x lossless compression.
Hollywood star Milla Jovovich builds an AI memory palace, drawing outside attention. (Image source: GitHub)
However, the claimed perfect benchmark scores quickly drew scrutiny from industry peers.
PenfieldLabs, which also builds AI memory systems, pointed out that MemPalace’s claim of a perfect score on the LoCoMo dataset is mathematically impossible, because the dataset’s own answer key already contains 99 incorrect entries.
After analysis, PenfieldLabs found that MemPalace’s 100% score came from setting the retrieval count (top-k) to 50, while the conversations in the test dataset contain at most 32 dialogue sessions. In effect, the system bypasses the retrieval stage entirely and hands all the data directly to the AI model to read.
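The arithmetic behind this bypass is easy to demonstrate. In the sketch below (a stand-in retriever, not MemPalace's code), any top-k retriever with k = 50 run over a conversation of at most 32 sessions necessarily returns every session, so the "retrieval" step filters nothing:

```python
def retrieve(corpus: list[str], query: str, k: int) -> list[str]:
    """Stand-in top-k retriever: rank by naive keyword overlap.
    The point below holds for any scoring function."""
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda doc: -len(q & set(doc.lower().split())))
    return ranked[:k]

# LoCoMo conversations contain at most 32 sessions; with k = 50 the
# retriever hands back all of them regardless of the query.
sessions = [f"session {i}: some dialogue text" for i in range(32)]
hits = retrieve(sessions, "what did the user decide?", k=50)
assert set(hits) == set(sessions)  # nothing was filtered out
```

With the full conversation always in context, the score measures the underlying language model's reading comprehension, not the memory system's retrieval.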
As for the 100% result on LongMemEval, the development team was found to have written dedicated patch code targeting three specific questions the system kept getting wrong during development, raising suspicions of overfitting to the test set.
Peer PenfieldLabs points out that MemPalace’s claim of a perfect score on the LoCoMo dataset is mathematically impossible. (Image source: Reddit)
GitHub user hugooconnor, after hands-on testing, commented that although MemPalace claims a retrieval accuracy as high as 96.6%, that figure does not come from the “memory palace” architecture at all: the project’s test simply calls the default functions of the underlying database ChromaDB and never touches the classification logic for the wings, rooms, and drawers the project emphasizes.
After testing, hugooconnor found that when the memory-palace classification logic is actually enabled, retrieval performance declines instead: in room mode, accuracy drops to 89.4%, and with the AAAK compression enabled it falls further to 84.2%, both below the plain default-database baseline.
hugooconnor also criticized the testing methodology: MemPalace’s test environment deliberately narrows the retrieval range for each question to about 50 dialogue sessions, making it trivially easy to find answers in such a tiny candidate pool.
If the range is expanded to the more than 19,000 dialogue sessions of a realistic scenario, the accuracy of traditional keyword search plummets to 30%, showing that MemPalace’s current testing approach masks the real difficulty of retrieval.
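A back-of-the-envelope sketch shows why a small candidate pool flatters any retriever: even a random baseline's chance of including the answer in its top k results scales as k over the pool size (k = 10 here is an arbitrary illustrative choice, not a figure from the article):

```python
def random_recall(pool_size: int, k: int) -> float:
    """Chance that a uniformly random top-k list even contains the one
    correct session. Purely illustrative arithmetic."""
    return min(k, pool_size) / pool_size

tiny_pool = random_recall(50, 10)      # the narrowed test setting: 20% for free
real_pool = random_recall(19_000, 10)  # the realistic scale cited above: ~0.05%
```

Any accuracy number earned on a 50-item pool therefore says little about performance at realistic scale.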
A GitHub user’s real-world tests show that MemPalace’s benchmark testing contains misleading elements. (Image source: GitHub)
Meanwhile, although the development team has released a correction acknowledging that the AAAK technology is in fact lossy compression, and has promised to revise the documentation and system design in response to the community’s harsh criticism, the project’s main documentation still retains multiple uncorrected exaggerations, including the claim of 30x lossless compression, a 34% retrieval improvement, and competitor comparison charts with no sources at all.
As more and more developers download and test it, many bug reports about MemPalace’s source code have appeared on the GitHub platform.
User cktang88 lists multiple serious issues, including a compression command that crashes the system when run, errors in the summary word-count logic, inaccurate statistics for the “mining rooms,” and a server that reloads all interpretation data into memory on every call, causing severe resource consumption.
Other reported issues include the name of one of the developer’s family members being hard-coded into the default configuration file, and a forced display limit of 10k entries when querying status.
In response, the open-source community has begun actively fixing these problems. User adv3nt3 submitted multiple fix requests, including correcting the mining statistics, removing the default family-member name, and deferring initialization of the knowledge graph. The development team has since acknowledged these errors and is gradually resolving the code issues through community collaboration.
Hacker News user darkhanakh summed up the MemPalace affair this way: MemPalace gives off an OpenClaw kind of impression, manually massaging benchmark results until they look flawless and then packaging that as some major breakthrough to market.
He believes MemPalace’s underlying technology may genuinely be interesting, but given the flaws in its testing methodology, advertising it as “the highest public score in history” hardly seems appropriate. “But anyway, the fact that Milla Jovovich is doing vibe coding is still, I think, pretty cool.”
Further reading:
AI coding goes wrong! The convenience-store near-expiry product app “Waste Not Hunter” explodes with security issues, fully exposing users’ home GPS coordinates