---
title: "MemMachine v0.2 Delivers Top Scores and Efficiency on LoCoMo Benchmark"
date: 2025-12-01T13:22:00-08:00
featured_image: "featured_image2.jpg"
tags: ["AI Agent", "LoCoMo Benchmark", "Generative AI", "LLM", "Agent Memory", "featured"]
author: "Tom Wong"
description: "Discover MemMachine v0.2's industry-leading results on the LoCoMo benchmark. Learn how its advanced memory system for AI agents delivers top LLM scores, reduced token usage, and optimized conversational performance compared to previous releases."
aliases:
  - /blog/2025/12/locomo-results-v0.2/
---

## Introduction

In their paper "[Evaluating Very Long-Term Conversational Memory of LLM Agents](https://arxiv.org/abs/2402.17753)", Snap researchers introduced the open-source [LoCoMo benchmark](https://github.com/snap-research/LoCoMo), a new standard for evaluating the true long-term conversational memory of AI agents.

[Mem0](https://mem0.ai/) published an [evaluation harness for the LoCoMo benchmark](https://github.com/mem0ai/mem0) that uses the LoCoMo dataset to compare scores across different memory systems.

MemMachine v0.2 sets a new standard for long-term conversational memory in AI agents, achieving industry-leading scores on this trusted measure of memory system efficiency and agent performance.

In this post, we compare MemMachine v0.2 against our previous v0.1.x release and competing systems using rigorous LoCoMo evaluations. You'll find quantitative results, performance summaries across multiple categories, and insights into how improvements in token usage and search speed deliver cost-effective, state-of-the-art agent memory for generative AI applications.

We also explore how MemMachine leverages optimized retrieval, embedding, and reranking, validated by comparisons between the gpt-4o-mini and the newer gpt-4.1-mini LLMs.

### Key Results at a Glance

- **Best-in-class LoCoMo benchmark scores** using top LLMs (gpt-4.1-mini, gpt-4o-mini)
- **~80% reduction in token usage** vs. other systems, such as Mem0
- **Up to 75% faster memory add/search times** than other systems
- **Robust performance across multi-hop, temporal, open-domain, and single-hop reasoning tasks**

| Memory System | Eval Mode | LLM Score | Token Usage Reduction | Add/Search Speedup |
|--------------------|---------------|-----------|----------------------|--------------------|
| MemMachine v0.2 | gpt-4.1-mini (memory) | 0.9123 | 80% | 75% |
| MemMachine v0.2 | gpt-4.1-mini (agent) | 0.9169 | 75% | 75% |
| Mem0 main/HEAD | gpt-4.1-mini (memory) | 0.8000 | baseline | baseline |

*Table: LoCoMo benchmark comparison - MemMachine v0.2 vs. Mem0 using SoTA LLM agents*


The test environment is set up as follows.

### Code for the benchmark

The code for the benchmark is obtained from the [Mem0 evaluation of the LoCoMo benchmark](https://github.com/mem0ai/mem0/tree/main/evaluation).
The dataset for the benchmark is obtained from the [Snap Research LoCoMo repo](https://github.com/snap-research/locomo/tree/main/data).

### Eval-LLM

The eval-LLM is the chat LLM used to answer the questions in the LoCoMo benchmark. The choice of eval-LLM can significantly influence the resulting score. The Mem0 evaluation of the LoCoMo benchmark has historically used OpenAI's gpt-4o-mini as the default eval-LLM. To compare different memory systems fairly, the same eval-LLM is used for all memory systems under test.

Since the original Mem0 evaluation of the LoCoMo benchmark was published, OpenAI released the newer gpt-4.1-mini LLM. In this post, we also compare the original gpt-4o-mini against the newer gpt-4.1-mini as the eval-LLM.

### Embedder for memory system

When using a memory system, the embedder is the essential element used to index the memories in a chat history. The embedder facilitates the retrieval of saved memories so that questions can be answered correctly and factually. The choice of embedder can significantly influence the quality of the answers provided by the memory system. The Mem0 evaluation of the LoCoMo benchmark has historically used OpenAI's text-embedding-3-small as the default embedder. To compare different memory systems fairly, the same embedder is used for all memory systems under test.
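
The index-then-retrieve flow can be sketched as follows. This is a minimal illustration under stated assumptions, not MemMachine's actual implementation: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedder such as text-embedding-3-small, which would return dense vectors from an API call.

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    # Hypothetical stand-in for a real embedder (e.g. text-embedding-3-small):
    # a sparse bag-of-words vector keyed by lowercased tokens.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Index each saved memory once, then retrieve the closest match for a query.
memories = [
    "Alice adopted a golden retriever named Max in June.",
    "Bob started a pottery class last winter.",
]
index = [(m, toy_embed(m)) for m in memories]

query = "which dog did alice adopt"
query_vec = toy_embed(query)
best_memory = max(index, key=lambda pair: cosine(query_vec, pair[1]))[0]
```

With a real dense embedder the same cosine-similarity retrieval applies; only the vector representation changes.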


### Judge-LLM

The judge-LLM is the chat LLM used to decide whether the eval-LLM's response correctly answers a question in the LoCoMo benchmark.

The choice of judge-LLM can significantly influence the resulting score, since different LLMs may give false positives or false negatives when judging the same response. The Mem0 evaluation of the LoCoMo benchmark has historically used OpenAI's gpt-4o-mini as the default judge-LLM. To compare different memory systems fairly, the same judge-LLM is used for all memory systems under test.
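
In outline, a judging step looks like the sketch below. The prompt wording and the CORRECT/WRONG labels are assumptions for illustration only; the Mem0 harness defines its own judge prompt.

```python
def build_judge_prompt(question: str, golden: str, response: str) -> str:
    # Hypothetical judge prompt; the actual evaluation harness
    # uses its own wording.
    return (
        "Decide whether the response gives the same answer as the golden "
        "answer. Reply with exactly one word: CORRECT or WRONG.\n"
        f"Question: {question}\n"
        f"Golden answer: {golden}\n"
        f"Response: {response}"
    )

def verdict_to_score(judge_output: str) -> int:
    # Map the judge-LLM's one-word verdict to a binary llm-score (1 or 0).
    return 1 if "CORRECT" in judge_output.upper().split() else 0
```

The parser checks whole tokens, so a verdict of "INCORRECT" is not mistaken for "CORRECT".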


### Reranker

When using a memory system, the reranker re-evaluates the best-matching memories retrieved by vector search. It provides a second level of evaluation, yielding the best set of saved memories for answering questions correctly and factually. The choice of reranker can significantly influence the quality of the answers provided by the memory system.

MemMachine v0.2 supports a variety of rerankers, and can also run without one. For the Mem0 evaluation of the LoCoMo benchmark, MemMachine v0.2 uses AWS cohere.rerank-v3-5:0 as the reranker.
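
The two-stage retrieve-then-rerank flow can be sketched as below. The scoring function is a toy token-overlap stand-in for a real reranker model such as cohere.rerank-v3-5:0, which would be invoked through AWS; it is not MemMachine's actual scorer.

```python
def toy_rerank_score(query: str, candidate: str) -> float:
    # Toy stand-in for a real reranker model: fraction of query tokens
    # that also appear in the candidate memory.
    q = set(query.lower().split())
    c = set(candidate.lower().split())
    return len(q & c) / len(q) if q else 0.0

def rerank(query: str, candidates: list[str], top_k: int = 2) -> list[str]:
    # Second stage: re-order the first-stage (vector search) candidates
    # and keep only the top_k best matches.
    ranked = sorted(candidates, key=lambda c: toy_rerank_score(query, c),
                    reverse=True)
    return ranked[:top_k]

candidates = [
    "alice paints landscapes on weekends",
    "alice does work at the bakery where the bread is made",
    "bob likes green tea",
]
top = rerank("where does alice work", candidates)
```

The design point is that the reranker only scores the short candidate list produced by vector search, so a more expensive model can be used without rescoring the whole memory store.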


### Question categories

The original LoCoMo benchmark has 5 categories of questions. The Mem0 evaluation of the LoCoMo benchmark uses questions from 4 of the 5 categories:

| Category number | Description |
| --------------- | ----------- |
| **1** | **Multi-Hop:** Questions that require synthesizing information from multiple sessions. |
| **2** | **Temporal Reasoning:** Questions that can be answered through temporal reasoning and capturing time-related data cues within the conversation. |
| **3** | **Open-Domain:** Questions that can be answered by integrating a speaker's provided information with external knowledge, such as commonsense or world facts. |
| **4** | **Single-Hop:** Questions asking for specific facts directly mentioned in a single session of the conversation. |


### LLM-score

For each question in the categories above, the judge-LLM compares the eval-LLM's response to the golden answer and assigns an llm-score of 1 if the answers match, and 0 otherwise. The llm-score is tabulated for each question category, and the weighted mean across categories gives the overall mean llm-score.
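
As a concrete check, the overall score can be recomputed from the per-category numbers. The values below are the MemMachine v0.2 gpt-4.1-mini memory-mode results reported later in this post:

```python
# Per-category llm-scores and question counts
# (MemMachine v0.2, gpt-4.1-mini, memory mode).
llm_scores = {"multi-hop": 0.8972, "temporal": 0.8910,
              "open-domain": 0.7500, "single-hop": 0.9441}
counts = {"multi-hop": 282, "temporal": 321,
          "open-domain": 96, "single-hop": 841}

# Weighted mean: each category's score weighted by its question count.
total_questions = sum(counts.values())
overall = sum(llm_scores[c] * counts[c] for c in llm_scores) / total_questions
# overall rounds to 0.9123, matching the reported overall mean llm-score.
```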


### Memory and agent modes

MemMachine operates in either memory mode or agent mode.

In [memory mode](https://docs.memmachine.ai/install_guide/integrate/GPTStore), MemMachine directly provides the context for the question being asked. There is a single request to the eval-LLM for each question.

In [agent mode](https://docs.memmachine.ai/core_concepts/agentic_workflow), MemMachine is presented to the eval-LLM as an OpenAI agent. When a question is posed, the eval-LLM uses the MemMachine agent as a tool to retrieve context, and may perform several rounds of requests to the agent to formulate the best response.
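
Memory mode's single-request pattern can be sketched as a prompt-assembly step. The template below is a hypothetical illustration, not MemMachine's actual prompt format; in agent mode the same retrieval would instead be exposed as a tool the eval-LLM may call several times.

```python
def memory_mode_prompt(question: str, retrieved: list[str]) -> str:
    # Memory mode: retrieved memories are injected up front as context,
    # so the eval-LLM is called exactly once per question.
    context = "\n".join(f"- {m}" for m in retrieved)
    return (
        "Answer the question using only the retrieved memories below.\n"
        f"Memories:\n{context}\n"
        f"Question: {question}"
    )

prompt = memory_mode_prompt(
    "What did Alice adopt in June?",
    ["Alice adopted a golden retriever named Max in June."],
)
```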


## LLM-score results

Here are the observed llm-scores for MemMachine v0.2.


### LLM-score using gpt-4o-mini

We use gpt-4o-mini as the eval-LLM to allow comparison with other memory systems.


#### Memory mode (gpt-4o-mini)

Mean score per category

| LoCoMo category | bleu-score | f1-score | llm-score | count |
| --------------- | ---------- | -------- | --------- | ----- |
| 1. | 0.1407 | 0.1993 | 0.8759 | 282 |
| 2. | 0.0977 | 0.1847 | 0.7352 | 321 |
| 3. | 0.0871 | 0.1191 | 0.7083 | 96 |
| 4. | 0.1436 | 0.2519 | 0.9465 | 841 |

Overall mean score

| Category | Score |
| -------- | ----- |
| bleu-score | 0.1300 |
| f1-score | 0.2200 |
| llm-score | 0.8747 |


#### Agent mode (gpt-4o-mini)

Mean score per category

| LoCoMo category | bleu-score | f1-score | llm-score | count |
| --------------- | ---------- | -------- | --------- | ----- |
| 1. | 0.1147 | 0.1684 | 0.8404 | 282 |
| 2. | 0.1402 | 0.2242 | 0.8069 | 321 |
| 3. | 0.0666 | 0.1037 | 0.7396 | 96 |
| 4. | 0.1415 | 0.2508 | 0.9394 | 841 |

Overall mean score

| Category | Score |
| -------- | ----- |
| bleu-score | 0.1316 |
| f1-score | 0.2210 |
| llm-score | 0.8812 |


*Figure 1. MemMachine v0.2 llm-score gpt-4o-mini*


### LLM-score using gpt-4.1-mini

The newer gpt-4.1-mini provides better results than the previous LLM. Here are the observed llm-scores for MemMachine v0.2 when the eval-LLM is gpt-4.1-mini. We also re-ran the Mem0 memory system with gpt-4.1-mini for comparison.


#### Memory mode (gpt-4.1-mini)

Mean score per category

| LoCoMo category | bleu-score | f1-score | llm-score | count |
| --------------- | ---------- | -------- | --------- | ----- |
| 1. | 0.1795 | 0.2497 | 0.8972 | 282 |
| 2. | 0.1521 | 0.2549 | 0.8910 | 321 |
| 3. | 0.1059 | 0.1429 | 0.7500 | 96 |
| 4. | 0.1868 | 0.3127 | 0.9441 | 841 |

Overall mean score

| Category | Score |
| -------- | ----- |
| bleu-score | 0.1732 |
| f1-score | 0.2785 |
| llm-score | 0.9123 |


#### Agent mode (gpt-4.1-mini)

Mean score per category

| LoCoMo category | bleu-score | f1-score | llm-score | count |
| --------------- | ---------- | -------- | --------- | ----- |
| 1. | 0.1460 | 0.2125 | 0.8830 | 282 |
| 2. | 0.1363 | 0.2366 | 0.9159 | 321 |
| 3. | 0.0744 | 0.1167 | 0.7188 | 96 |
| 4. | 0.1613 | 0.2836 | 0.9512 | 841 |

Overall mean score

| Category | Score |
| -------- | ----- |
| bleu-score | 0.1479 |
| f1-score | 0.2503 |
| llm-score | 0.9169 |


*Figure 2. MemMachine v0.2 llm-score gpt-4.1-mini*

## Token usage results

When using a memory system, the retrieved memories are added to the question before it is presented to the eval-LLM. The amount of context generated by the memory system adds to the input (prompt) token usage. The final prompt also influences the number of output tokens that the eval-LLM emits.

Here is the observed token usage for MemMachine in memory mode and in agent mode. Mem0's token usage is shown for comparison.

| memory system | input tokens | output tokens |
| ------------- | ------- | ------- |
| memmachine v0.2 gpt-4.1-mini memory mode | 4,199,096 | 43,169 |
| memmachine v0.2 gpt-4.1-mini agent mode | 8,571,936 | 93,210 |
| mem0 main/HEAD gpt-4.1-mini memory mode | 19,206,707 | 14,840 |

Note: mem0 main/HEAD is at commit cc2894aaec8e


*Figure 3. MemMachine v0.2 token usage gpt-4.1-mini*


In memory mode, MemMachine v0.2 retrieved the memories needed for correct, factual responses using only a small fraction (about 20%) of the input tokens used by Mem0. This is a significant cost reduction, and it also lowers time-to-first-token, yielding much faster responses to user queries.
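
The "about 20%" figure follows directly from the memory-mode input-token totals in the table above:

```python
# Input-token totals from the token usage table (memory mode vs. Mem0).
memmachine_in = 4_199_096
mem0_in = 19_206_707

fraction_used = memmachine_in / mem0_in  # fraction of Mem0's input tokens
reduction = 1 - fraction_used            # fractional reduction vs. Mem0
# fraction_used is about 0.22, i.e. roughly a 78% reduction in input tokens.
```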


## Search time results

The MemMachine v0.2 release includes many new optimizations to the handling of episodic memory. Both the add-memory and search-memory times are significantly improved.

Add-memory time is reduced by approximately 75% compared to the previous release.

*Figure 4. MemMachine v0.2 add memory time comparison*


Search-memory time is reduced by up to 75% compared to the previous release.

*Figure 5. MemMachine v0.2 search memory time comparison*


Compared to the previous release, MemMachine v0.2 retrieves memories that yield better responses to the questions, while adding and searching memories much faster. The result is much faster responses to user queries.

## Conclusion

MemMachine v0.2 delivers significant advancements in conversational memory and efficiency, establishing itself as one of the highest-scoring AI memory systems available. The results demonstrate substantial reductions in token usage, faster memory operations, and improved benchmark scores, making MemMachine ideal for demanding generative AI applications.

**Ready to experience the benefits of MemMachine v0.2?**

- 👉 [Download and try MemMachine on GitHub](https://github.com/MemMachine/MemMachine) yourself and see the performance firsthand.
- 📖 [Explore the comprehensive documentation](https://docs.memmachine.ai) to discover integration guides, workflows, and advanced features.
- 💬 [Join our Discord community](https://discord.gg/usydANvKqD) to connect with fellow developers, share feedback, and collaborate with teams already building innovative solutions on top of MemMachine.

Don't miss the opportunity to join a fast-growing ecosystem of organizations and engineers leveraging MemMachine for state-of-the-art conversational AI. Your feedback and contributions are welcome!

## Frequently Asked Questions

### How does MemMachine v0.2 outperform previous agent memory systems on LoCoMo?
Using a new, innovative architecture, MemMachine v0.2 achieves up to 80% token savings and 75% faster memory operations, while consistently scoring higher on complex reasoning benchmarks with leading LLMs.

### Why is long-term conversational memory critical for AI agents?
Strong conversational memory enables AI agents to handle multi-session, time-aware, and open-domain reasoning, driving more accurate, context-rich user experiences in generative AI applications.

### What is the significance of token efficiency in LLM evaluation?
Efficient token usage reduces operational costs and latency, enabling longer, more complex agent interactions, which is critical for scalable deployments.

### Where can I learn more or try MemMachine?
Visit [MemMachine.ai](https://memmachine.ai/) to download MemMachine and see for yourself why it's state-of-the-art. You'll find documentation, use cases, examples, the playground, and a growing community of developers using MemMachine in our Discord server.