The AI memory system MemPalace, developed with the participation of Milla Jovovich, claimed a perfect benchmark score and went viral, only to be accused by the community of gaming the tests and misrepresenting the data. Real-world testing found the results exaggerated and riddled with errors. The team has acknowledged the flaws and is working on fixes.
Yesterday (4/7), a big story hit the AI community: Hollywood actress Milla Jovovich (known for Resident Evil and The Fifth Element) teamed up with developer Ben Sigman to build the open-source AI memory system "MemPalace," with help from Claude Code.
For a time, the story of a "Hollywood superstar crossing over to ship a perfect-score project" spread widely. MemPalace has since collected more than 20k stars on GitHub, but it didn't take long for the developer community to ask: is there real substance here, or just hype?
First, let's talk about the motivation behind MemPalace's creation. The official documentation says it aims to address a limitation of today's AI systems: the content of user-AI conversations, decision-making processes, and architecture discussions usually disappears once a work session ends, wiping out months of accumulated context.
To solve this problem, MemPalace stores memories in a spatial architecture: information is explicitly filed into "wings" representing people or projects, then into levels such as corridors, rooms, and drawers, with the original conversation text preserved for later semantic retrieval.
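The spatial hierarchy described above can be pictured as nested containers. The following is a minimal, hypothetical sketch of such a structure; the class and function names are illustrative and are not MemPalace's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Drawer:
    label: str
    memories: list[str] = field(default_factory=list)  # raw conversation text

@dataclass
class Room:
    name: str
    drawers: dict[str, Drawer] = field(default_factory=dict)

@dataclass
class Wing:
    name: str  # represents a person or a project
    rooms: dict[str, Room] = field(default_factory=dict)

def store(wing: Wing, room: str, drawer: str, text: str) -> None:
    """File a raw conversation snippet under wing -> room -> drawer."""
    r = wing.rooms.setdefault(room, Room(room))
    d = r.drawers.setdefault(drawer, Drawer(drawer))
    d.memories.append(text)

# Example: filing an architecture decision under a project wing.
project = Wing("project-alpha")
store(project, "architecture", "decisions", "We chose SQLite for local storage.")
```

The point of such a layout is that retrieval can later be scoped to a wing or room instead of scanning everything, which is exactly the behavior the community's tests later called into question.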
The development team claims that MemPalace achieved a 100% perfect score on the long-term memory evaluation benchmark LongMemEval, and reached a 96.6% accuracy rate without calling any external APIs. It can run completely locally without subscribing to cloud services, and includes an AAAK dialect system touted to achieve 30x lossless compression.
Image source: GitHub. Hollywood star Milla Jovovich builds an AI "memory palace," drawing outside attention.
However, MemPalace’s claimed 100% score on LongMemEval quickly drew doubts from peers.
PenfieldLabs, which also builds AI memory systems, pointed out that MemPalace's claimed perfect score on the LoCoMo dataset is mathematically impossible, because the dataset's reference answers themselves contain 99 errors.
PenfieldLabs' analysis found that MemPalace's 100% score came from setting the number of retrieved results to 50, while the longest conversation in the test dataset contains only 32 sessions. In other words, the system effectively bypasses the retrieval stage and hands the entire conversation to the AI model to read.
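PenfieldLabs' point is easy to demonstrate: when a top-k retriever is asked for more results than the corpus contains, it returns everything, so the "retrieval" step filters nothing. A minimal sketch (the function and data are illustrative, not MemPalace's code):

```python
def retrieve_top_k(sessions: list[str], scores: list[float], k: int) -> list[str]:
    """Generic top-k retrieval: return the k highest-scoring sessions."""
    ranked = sorted(zip(scores, sessions), key=lambda pair: pair[0], reverse=True)
    return [session for _, session in ranked[:k]]

# The longest conversation in the dataset has only 32 sessions.
sessions = [f"session-{i}" for i in range(32)]
scores = [float(i) for i in range(32)]  # any scoring function at all

# Asking for 50 results from 32 sessions returns all of them:
top = retrieve_top_k(sessions, scores, k=50)
assert len(top) == 32  # retrieval is a no-op; the model sees everything
```

With k larger than the corpus, the benchmark is really measuring the model's reading comprehension over the full conversation, not the memory system's retrieval quality.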
As for the 100% score on LongMemEval, the development team was also found to have written special-case repair code targeting three specific questions the system had been getting wrong, raising suspicion that it was tuned directly against the test set.
Image source: Reddit. Peer company PenfieldLabs points out that MemPalace's claimed perfect score on the LoCoMo dataset is mathematically impossible.
GitHub user hugooconnor, after hands-on testing, commented that although MemPalace claims retrieval accuracy as high as 96.6%, its benchmark runs don't actually use the memory-palace architecture the project advertises. According to hugooconnor, the tests simply call the default query functions of the underlying database, ChromaDB, and never touch the categorization logic the project emphasizes, such as wings, rooms, or drawers.
hugooconnor further found that when the memory-palace categorization logic is actually enabled, retrieval performance declines: in room mode, accuracy drops to 89.4%, and with the AAAK compression technique enabled it falls further to 84.2%, both below the performance of the default database.
hugooconnor also criticized the test methodology: in MemPalace's test setup, the retrieval scope for each question is deliberately narrowed to roughly 50 conversation turns, a sample so small that finding the answer is trivial.
Expand the scope to the 19,000-plus conversation turns of a realistic scenario, and even traditional keyword search plummets to 30% accuracy, suggesting that MemPalace's current test setup masks the real retrieval challenge.
Image source: GitHub. A GitHub user's real-world testing shows that MemPalace's benchmarks contain misleading elements.
Meanwhile, although the development team has published a clarification acknowledging that AAAK is in fact lossy rather than lossless compression, and has promised to revise the documentation and system design in response to the community's harsh criticism, the project's main documentation still retains multiple uncorrected, exaggerated claims, including 30x lossless compression and a 34% retrieval boost, and its comparison charts against competitors cite no sources at all.
As more and more developers download and test, many bug reports about MemPalace’s source code have appeared on the GitHub platform.
User cktang88 listed multiple serious issues, including compression commands that crash the system when run, faulty summary word-count logic, inaccurate statistics for "mining" rooms, and a server that reloads all interpretation data into memory on every call, causing severe resource consumption.
Other reported problems also include the system hard-coding developers’ family member names into the default configuration file, as well as a forced display limit of 10k records when querying status.
To address these issues, the open-source community has started actively contributing fixes. User adv3nt3 submitted multiple fix requests, including correcting the mining statistics, removing the default family-member names, and deferring knowledge-graph initialization. The development team has since acknowledged these mistakes and is gradually resolving the code issues through community collaboration.
For Hacker News user darkhanakh, MemPalace is déjà vu of OpenClaw: manipulate benchmark results until they look flawless, then package it as some kind of major breakthrough for marketing.
He believes MemPalace's underlying technology may genuinely be interesting, but given the flaws in its test methodology, promoting it with claims like "the highest public score ever" is simply inappropriate. "But then again, I think it's still pretty cool that Milla Jovovich is doing vibe coding."
Further reading:
AI-written code backfires! "Food-Waste Hunter," a convenience-store expiring-item app, blows up over data-security issues, exposing users' home GPS data in full.