The AI memory system MemPalace, built with the participation of Milla Jovovich, claimed perfect benchmark scores and quickly went viral, only to be accused by the community of gaming the tests and presenting misleading data. Independent testing found the results were exaggerated and the code contained large numbers of errors. The team has acknowledged the flaws and is working on fixes.
Yesterday (4/7), big news broke in AI circles: Hollywood actress Milla Jovovich (known for Resident Evil and The Fifth Element), working with developer Ben Sigman and using Claude Code, released the open-source AI memory system “MemPalace.”
For a while, the claim that a “Hollywood superstar crossed over to ship a perfect-score project” spread widely, and MemPalace has since earned more than 20,000 stars on GitHub. But it didn’t take long for the developer community to start asking: is it genuinely solid, or just hype?
Let’s first talk about the motivation behind MemPalace. According to its official documentation, the goal is to address a limitation of current AI systems: the content of user-AI conversations, the decision-making process, and architecture discussions typically disappear once a work session ends, so months of effort drop to zero.
To solve this, MemPalace stores memories in a spatial architecture: information is explicitly filed into wings representing people or projects, and then into nested levels of structure such as hallways, rooms, and drawers, with the original conversation text preserved for later semantic retrieval.
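As described, that hierarchy is essentially a nested mapping. A minimal sketch in Python, with all class and function names hypothetical (the article does not show MemPalace's actual schema):

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the wing -> hallway -> room -> drawer hierarchy
# described in the docs; the real MemPalace data model may differ.

@dataclass
class Drawer:
    label: str
    memories: list[str] = field(default_factory=list)  # raw conversation text

@dataclass
class Room:
    name: str
    drawers: dict[str, Drawer] = field(default_factory=dict)

@dataclass
class Hallway:
    name: str
    rooms: dict[str, Room] = field(default_factory=dict)

@dataclass
class Wing:
    name: str  # a wing represents a person or a project
    hallways: dict[str, Hallway] = field(default_factory=dict)

def store(wing: Wing, hallway: str, room: str, drawer: str, text: str) -> None:
    """File raw conversation text under a wing/hallway/room/drawer path."""
    h = wing.hallways.setdefault(hallway, Hallway(hallway))
    r = h.rooms.setdefault(room, Room(room))
    d = r.drawers.setdefault(drawer, Drawer(drawer))
    d.memories.append(text)

project = Wing("project-alpha")
store(project, "architecture", "decisions", "2026-02",
      "We chose SQLite for persistence.")
```

The point of such a layout is that a query could, in principle, be scoped to one wing or room instead of searching everything, which is exactly the categorization logic later testers said the benchmarks never exercised.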
The development team claims that MemPalace scores a perfect 100% on the long-term memory benchmark LongMemEval, and reaches 96.6% retrieval accuracy without calling any external API. It can also run fully locally with no cloud subscription, and ships with an AAAK dialect system that allegedly achieves 30x lossless compression.
Image source: GitHub. Hollywood star Milla Jovovich builds an AI “memory palace,” drawing outside attention.
However, the perfect scores claimed by MemPalace quickly drew skepticism from peers.
PenfieldLabs, which also develops AI memory systems, pointed out that it is mathematically impossible for MemPalace to achieve a perfect score on the LoCoMo dataset, because the dataset’s own gold answers contain 99 errors.
PenfieldLabs’ analysis found that MemPalace’s 100% score comes from setting the number of retrieved items to 50, while the conversations in the test dataset contain at most 32 sessions. In other words, the system effectively bypasses the retrieval stage and hands all of the data to the AI model to read.
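The arithmetic behind that criticism is easy to demonstrate: when a retriever is asked for more items than the corpus contains, the top-k cut does nothing and everything is returned. A toy sketch (the ranking function is a placeholder, not MemPalace's code):

```python
# If top-k retrieval asks for more items than exist, retrieval degenerates
# into "return everything": no selection happens at all.

def retrieve_top_k(sessions: list[str], k: int) -> list[str]:
    """Stand-in for a real retriever: rank the sessions, keep the top k."""
    ranked = sorted(sessions)  # placeholder ranking, not a real relevance score
    return ranked[:k]

sessions = [f"session-{i:02d}" for i in range(32)]  # 32 sessions, the dataset maximum
result = retrieve_top_k(sessions, k=50)             # k=50 > 32
print(len(result) == len(sessions))  # True: every session is returned
```

With k larger than the corpus, the "retrieval" step is indistinguishable from simply pasting the entire conversation history into the model's context.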
As for the claimed 100% score on LongMemEval, it emerged that the development team had written dedicated patch code targeting three specific questions it kept getting wrong, raising suspicions of overfitting to the test set.
Image source: Reddit. Peer developer PenfieldLabs pointed out that the perfect score MemPalace claims on the LoCoMo dataset is mathematically impossible.
GitHub user hugooconnor weighed in after running their own tests: MemPalace claims retrieval accuracy as high as 96.6%, but in reality never uses the memory-palace architecture it advertises. According to hugooconnor, the project’s test simply calls the default functionality of the underlying database, ChromaDB, and never touches the categorization logic the project emphasizes, such as wings, rooms, or drawers.
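In other words, the advertised 96.6% would be the score of an ordinary flat similarity search. A self-contained illustration of such a flat lookup, using a toy embedding rather than ChromaDB itself (all names here are hypothetical):

```python
import math

# Illustration only: a flat nearest-neighbour lookup of the kind a vector
# store like ChromaDB performs by default. No wings, rooms, or drawers are
# consulted anywhere. The embedding is a toy bag-of-letters, not a real model.

def embed(text: str) -> list[float]:
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isascii() and ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def flat_query(corpus: list[str], query: str, n_results: int = 1) -> list[str]:
    """Rank every document by similarity to the query; no categories involved."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(embed(doc), q), reverse=True)
    return ranked[:n_results]

memories = ["we picked sqlite", "lunch was sushi", "api keys rotate monthly"]
# An exact-phrase query scores cosine 1.0 against itself, so it ranks first.
print(flat_query(memories, "we picked sqlite"))  # -> ['we picked sqlite']
```

The criticism is that MemPalace's benchmark exercised only this kind of flat search, so the 96.6% says nothing about the wing/room/drawer machinery the project is named after.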
When hugooconnor actually enabled the memory-palace-specific categorization logic, retrieval performance declined. In room mode, for example, accuracy drops to 89.4%, and with AAAK compression enabled it falls further to 84.2%, both below the plain default-database baseline.
hugooconnor also criticized the testing methodology: in MemPalace’s test environment, the retrieval scope for each question is deliberately narrowed to about 50 conversation sessions, making answers trivially easy to find in such a small sample pool.
If the scope is expanded to the more than 19,000 conversation sessions of a realistic scenario, the accuracy of traditional keyword search plummets to 30%, suggesting that MemPalace’s current testing setup masks the real difficulty of search.
Image source: GitHub. GitHub users’ hands-on tests: MemPalace’s benchmarks contain misleading elements.
Meanwhile, although the development team has released a correction admitting that the AAAK technology is in fact lossy compression, and has promised to revise the documentation and system design in response to the community’s harsh criticism, the project’s main documentation still retains multiple uncorrected exaggerations. These include the claims of 30x lossless compression and a 34% retrieval improvement, and the comparison charts against competitors still cite no sources.
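Lossless versus lossy is mechanically checkable: a codec is lossless only if every input round-trips exactly. A small sketch using Python's zlib as a stand-in (nothing here is AAAK's actual code):

```python
import zlib

def is_lossless(compress, decompress, samples) -> bool:
    """A codec is lossless iff every sample survives a round trip unchanged."""
    return all(decompress(compress(s)) == s for s in samples)

samples = [b"months of design discussion", b"\x00\xff" * 100]

# zlib is a genuine lossless codec: every byte comes back.
print(is_lossless(zlib.compress, zlib.decompress, samples))  # True

# A "summarising" codec discards information and fails the round trip.
def summarise(data: bytes) -> bytes:
    return data[:10]  # keeps only a prefix: lossy by construction

def expand(data: bytes) -> bytes:
    return data  # cannot recover what was discarded

print(is_lossless(summarise, expand, samples))  # False
```

A test of this shape is what the community effectively ran against AAAK: any compression claiming "30x lossless" must pass the round trip on arbitrary input, and summarization-style schemes cannot.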
As more and more developers download and test it, bug reports against MemPalace’s source code have piled up on GitHub.
User cktang88 listed multiple serious flaws: compression commands that fail and crash the system, faulty word-count logic in summaries, inaccurate mining-room statistics, and a server that loads all interpretive data into memory on every call, causing severe resource consumption.
Other reported issues include the developer’s family members’ names being hard-coded into the default configuration, and a hard-coded cap of 10,000 records when querying status.
To address these issues, the open-source community has already begun actively contributing fixes. User adv3nt3 submitted multiple pull requests, including fixes for the mining statistics, removal of the default family-member names, and deferred initialization of the knowledge graph. The development team has since acknowledged the errors and is working through the code issues in collaboration with the community.
Hacker News user darkhanakh summed up MemPalace this way: it gives off the same vibe as OpenClaw, manually massaging benchmark results to look flawless and then packaging that as some kind of major breakthrough for marketing.
They added that while MemPalace’s underlying technology might genuinely be interesting, heavily promoting it as “the highest public score ever” when the testing methodology is this flawed simply isn’t appropriate. “That said, Milla Jovovich getting into vibe coding? I still think that’s pretty cool.”
Further reading:
AI-written code gone wrong! A convenience-store near-expiry-deals app, “Food-Saving Hunter,” sparks security concerns after exposing every user’s home GPS location.