Your AI Has Digital Amnesia. And the Fix Is Harder Than Anyone Admits.
You mention to your AI assistant that you’re vegetarian. A few days later you ask for restaurant recommendations. It suggests a steakhouse.
You remind it. It apologizes. Next session — same thing.
You’ve told your AI health coach about your knee injury four times. It keeps suggesting running plans. The customer support bot at your bank has no idea you called last Tuesday about the exact same issue. Your AI travel planner books a hiking tour after you explicitly said you can’t walk long distances.
It’s gone. Every bit of it. Every time.
I’ve been building a memory layer for LLM agents for the past few weeks — memo-mesh, a self-hostable, evidence-first system for persistent agent context. That work forced me to understand this problem more deeply than any article I read about it. This post is what I actually learned.
The short version: memory is not a solved problem. Not even close.
The space has gotten serious fast. Mem0 raised $24M from YC, Peak XV, and Basis Set. Supermemory — founded by Dhravya Shah — raised a ~$2.6M seed co-led by Susa Ventures and Browder Capital, with Jeff Dean of Google as an angel investor. AWS selected Mem0 as a memory provider for its Agent SDK. The money has noticed. The infrastructure is being built.
And still, most agents in production treat every conversation like the first one. Here’s why.
The obvious fixes don’t work
“Just use a bigger context window.”
Modern frontier models support 128k, 200k, even 1M token contexts. Surely you just stuff the history in there?
The problems compound fast. You’re paying for every token on every request whether it’s relevant or not. Models trained to reason over long contexts still demonstrably miss facts buried deep — a phenomenon researchers call “lost in the middle.” And none of it persists — the session ends, it’s gone. You’ve made the amnesia more expensive, not solved it.
“Just store everything in a database and retrieve it.”
You’ve solved storage. You haven’t solved memory.
Memory isn’t about whether the data exists somewhere. It’s about retrieving the right piece of information at the right moment — under uncertainty, with incomplete queries, without knowing exactly what to look for. A database is a filing cabinet. Memory is a colleague who says “wait, this customer called us about this exact issue last month.”
"Fine-tune the model.”
Fine-tuning bakes information into model weights. Parameter-efficient methods like LoRA have reduced the cost considerably, but none of them solve the real-time update problem: if a user changes their preference today, fine-tuning can’t respond until the next training cycle. More fundamentally, fine-tuning mixes memory (what this person told us) with capability (how the model reasons). You can’t change what the model remembers without risking a change in how it thinks.
The common thread across all three: they treat memory as a storage problem. It isn’t. Memory is a structured recall problem. That distinction matters enormously for how you design a system to solve it.
What “memory” actually means for an AI
Before talking about solutions, it helps to be precise about what you’re actually storing. Most people talk about “AI memory” as if it’s one thing. It isn’t. When I think through it, there are at least four meaningfully different categories:
Facts — “This user is lactose intolerant.” “This customer is on the Premium plan.” “This patient is allergic to penicillin.”
Preferences — “She prefers email over phone calls.” “He always wants a summary upfront before the detail.” “This user reads in Hindi.”
Constraints — “This account has a spending limit of ₹50,000.” “This patient cannot take NSAIDs.” “This customer has opted out of promotional messages.”
Events — “Last week this user reported a billing discrepancy that was escalated.” “This customer was offered a retention discount in March and declined.”
Each of these has fundamentally different properties. Facts can become wrong over time. Preferences drift. Constraints can be overridden by new decisions. Events happened at a specific point in time and their relevance may fade.
Most naive memory systems treat all four the same — as flat text blobs dropped into a vector store. That’s why they feel approximately right but break on edge cases. The system retrieves “this user likes spicy food” without knowing that fact was stored six months before the user mentioned having acid reflux. It retrieves the semantically relevant thing. It recalls the wrong thing.
For a personal assistant, that’s mildly annoying. For a healthcare agent or a financial advisor bot, it’s a liability.
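To make the distinction concrete, here’s a minimal sketch of how the four categories could be modeled with different metadata, so that retrieval code is forced to treat them differently. All names and thresholds here are illustrative, not any particular system’s schema:

```typescript
// Each memory category carries different metadata, because each one
// goes wrong in a different way.
type MemoryItem =
  | { kind: "fact"; text: string; recordedAt: Date }            // can become wrong over time
  | { kind: "preference"; text: string; lastConfirmed: Date }   // drifts; needs reconfirmation
  | { kind: "constraint"; text: string; overriddenBy?: string } // may be superseded by a decision
  | { kind: "event"; text: string; occurredAt: Date };          // fixed in time; relevance fades

// A per-category staleness check: a flat vector store can't express this,
// because it has thrown the category away.
function isPossiblyStale(m: MemoryItem, now: Date, maxAgeDays: number): boolean {
  const ageDays = (d: Date) => (now.getTime() - d.getTime()) / 86_400_000;
  switch (m.kind) {
    case "fact":       return ageDays(m.recordedAt) > maxAgeDays;
    case "preference": return ageDays(m.lastConfirmed) > maxAgeDays;
    case "constraint": return m.overriddenBy !== undefined;
    case "event":      return false; // events never become false; they just recede
  }
}
```

The point isn’t the specific fields; it’s that once categories are first-class, “should this memory be trusted right now?” becomes an answerable question per category instead of a single similarity score.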
What’s being built right now
Two companies have defined the current wave. They’ve made genuinely different architectural bets — and understanding the trade-offs they’ve accepted is more useful than a feature comparison.
Mem0: LLM-in-the-loop extraction and consolidation
Mem0’s core insight is that writing to memory should be as intelligent as reading from it. Their pipeline has three stages:
Extraction — on every conversation turn, a small LLM (GPT-4o-mini class) reads the latest exchange, a rolling summary, and recent message history, then extracts discrete candidate facts.
Consolidation — each candidate fact is checked against existing memories via vector similarity. The LLM then reviews the match and decides: ADD (new fact), UPDATE (modify existing), DELETE (contradicts something stored), or NOOP (already known, skip).
Retrieval — at query time, dense embeddings + vector similarity surface the most relevant memories and inject them into context.
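The consolidation step is the interesting part, so here is my reading of its control flow as a sketch. The LLM review is stubbed out as an injected `judge` function, and the threshold and names are my assumptions, not Mem0’s actual implementation:

```typescript
// Sketch of an LLM-in-the-loop consolidation step: vector similarity finds
// the candidate's closest existing memory, then a judge (an LLM in the real
// pipeline) picks ADD / UPDATE / DELETE / NOOP.
type Op = "ADD" | "UPDATE" | "DELETE" | "NOOP";
interface Memory { id: string; text: string; embedding: number[] }

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] ** 2; nb += b[i] ** 2; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function consolidate(
  candidate: Memory,
  store: Memory[],
  judge: (candidate: Memory, match: Memory) => Op, // stand-in for the LLM review
  threshold = 0.8,
): { op: Op; target?: Memory } {
  // 1. Find the most similar existing memory.
  let best: Memory | undefined;
  let bestScore = -1;
  for (const m of store) {
    const s = cosine(candidate.embedding, m.embedding);
    if (s > bestScore) { bestScore = s; best = m; }
  }
  // 2. Nothing close enough → the fact is new.
  if (!best || bestScore < threshold) return { op: "ADD" };
  // 3. Close match → let the judge decide UPDATE / DELETE / NOOP.
  return { op: judge(candidate, best), target: best };
}
```

Note what this structure implies: the intelligence lives in `judge`, which is exactly the part you pay an LLM call for on every write, and exactly the part you can’t inspect afterward.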
The graph variant, Mem0ᵍ, adds a parallel layer where entities become nodes and relationships become labeled edges — enabling reasoning across relationships, not just “does this match?” but “what does this connect to?”
Per their own benchmarks, the results are impressive: 26% accuracy improvement over OpenAI’s memory system, 91% lower latency than full-context approaches, 90% token cost reduction.
What this architecture gets right: consolidation is genuinely intelligent. The LLM deciding ADD/UPDATE/DELETE means contradictions get resolved rather than silently piling up.
Where it falls short: every write operation hits an LLM. That’s latency and cost on every turn, not just at retrieval. More importantly — and this is what concerns me most as a builder — the consolidation is a black box. When your customer support agent gives a wrong answer based on a stale memory, you’re often left without a clean way to trace why it believed what it believed. The system worked. But you can’t see its reasoning.
Supermemory: decay, hierarchy, and human-inspired forgetting
Supermemory took a different direction — instead of making writes more intelligent, they modeled the architecture on how human memory actually works.
Four mechanisms underpin it: smart decay (less-accessed memories gradually deprioritize), recency bias (recently surfaced context gets priority independent of semantic similarity), context rewriting (summaries continuously update as new information arrives, with links between related facts detected automatically), and hierarchical storage (recent “hot” memories in fast edge storage, older memories loaded on demand).
What this architecture gets right: it handles scale gracefully. You’re not brute-forcing a flat vector index — you’re mimicking how a human brain manages finite working memory. The decay mechanism is particularly elegant; it sidesteps the “what to keep vs. discard” question by letting relevance emerge from access patterns.
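A decay-weighted relevance score might look something like the sketch below. To be clear, the formula is my own illustration of the idea, not Supermemory’s published mechanism; the half-life and the log dampening are arbitrary choices:

```typescript
// Illustrative "smart decay" scoring: final relevance blends semantic
// similarity with how recently and how often a memory has been accessed.
interface AccessStats { lastAccessed: Date; accessCount: number }

function decayScore(
  similarity: number,       // raw vector similarity, 0..1
  stats: AccessStats,
  now: Date,
  halfLifeDays = 30,        // assumed tuning knob
): number {
  const ageDays = (now.getTime() - stats.lastAccessed.getTime()) / 86_400_000;
  const recency = Math.pow(0.5, ageDays / halfLifeDays); // exponential decay by age
  const frequency = Math.log1p(stats.accessCount);       // diminishing returns on repeat access
  return similarity * recency * (1 + frequency);
}
```

Under any scoring of this shape, a memory nobody touches slides down the ranking on its own, which is how “what to keep vs. discard” turns into an emergent property of access patterns rather than an explicit decision.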
Where it falls short: the model is automatic. You route calls through their proxy, relevant context gets injected, you get on with your day. That’s genuinely appealing for simple use cases. But the flip side is near-zero visibility: you can’t inspect what your agent knows, why it knows it, or when that knowledge was formed. For consumer apps where personalization is a nice-to-have, this is fine. For a business running agents that touch customer data, medical history, or financial information — you’re trusting a black box with some of your most sensitive operational data.
The hard problems nobody is talking about
Both approaches are genuinely impressive engineering. Both also sidestep the same three problems that I think are the real unsolved frontier.
1. Temporal validity
Memories go stale. Silently.
Here’s what this looks like in practice: a customer contacts your support agent in January and mentions they’re on the Basic plan. That fact gets stored. In April, they upgrade to Premium. Nobody in that conversation explicitly says “I’m on Premium now” — they just start asking Premium-level questions. The old Basic fact never gets contradicted. It just coexists with the new behavior. In July, your agent recommends a workaround for a limitation that hasn’t applied to them for three months. The customer is frustrated. Your support team has no idea why the agent said that.
Neither Mem0 nor Supermemory has a principled answer to: how does a memory know when to question its own validity? The current approach — let consolidation handle contradictions when they surface — only works when contradictions are explicit. Most of the time, they aren’t.
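One lightweight mitigation is to stop presenting memories as timeless truths: surface each memory’s age at retrieval time so the model can at least reason about staleness itself. A sketch, with illustrative names:

```typescript
// Render retrieved facts with their age attached, so "customer is on the
// Basic plan (recorded 180 days ago)" invites doubt in a way that
// "customer is on the Basic plan" does not.
interface StoredFact { text: string; recordedAt: Date }

function renderWithAge(facts: StoredFact[], now: Date): string {
  return facts
    .map(f => {
      const days = Math.floor((now.getTime() - f.recordedAt.getTime()) / 86_400_000);
      return `- ${f.text} (recorded ${days} days ago)`;
    })
    .join("\n");
}
```

This doesn’t solve temporal validity; it delegates the judgment to the model at read time. But it’s strictly better than injecting a six-month-old fact with the same authority as one learned this morning.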
2. Conflict resolution under ambiguity
Related but distinct: what happens when two memories don’t contradict but create tension?
“She hates spicy food” (stored 8 months ago) and “She loved the new Thai restaurant downtown” (stored last week).
Neither is wrong. But they create a question the system has to resolve, and the answer isn’t just “most recent wins.” Context matters — maybe the Thai place she loved isn’t actually that spicy. An intelligent system would surface the tension and let the agent reason about it explicitly. Most systems silently favor whichever retrieval score is higher.
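What “surface the tension” could mean mechanically: when two retrieved memories score nearly the same, return both as an explicit conflict instead of silently keeping the winner. The band threshold and shape of the result are assumptions for illustration:

```typescript
// Instead of returning only the top-scoring memory, flag near-ties as a
// "tension" the agent must reason about explicitly.
interface Scored { text: string; storedAt: Date; score: number }

type RetrievalResult =
  | { kind: "single"; memory: Scored }
  | { kind: "tension"; memories: Scored[] }; // rival claims, surfaced together

function surfaceConflicts(hits: Scored[], tensionBand = 0.05): RetrievalResult {
  if (hits.length === 0) throw new Error("no retrieval hits");
  const sorted = [...hits].sort((a, b) => b.score - a.score);
  const top = sorted[0];
  // Anything scoring within `tensionBand` of the top hit is a rival claim.
  const rivals = sorted.filter(m => top.score - m.score <= tensionBand);
  return rivals.length > 1
    ? { kind: "tension", memories: rivals }
    : { kind: "single", memory: top };
}
```

The spicy-food example above would come back as a `tension` with both memories and both timestamps, and the agent can do what a thoughtful human would: ask, or hedge, rather than pick one and pretend the other doesn’t exist.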
3. The provenance problem
If your agent gives a user a bad recommendation, can you trace it back to the specific memory that caused it?
This is the question that drove a lot of memo-mesh’s design. In production systems — customer support, healthcare, finance, anything where the agent speaks on behalf of your brand or your business — this isn’t optional. You can’t debug what you can’t audit. You can’t explain to a user why the agent said what it said if the memory layer is a black box. You can’t build organizational trust in AI tooling if the answer to “why did it do that?” is “we’re not sure.”
The provenance problem is not a developer experience issue. It’s a fundamental correctness property. And almost no current memory systems have it.
What good memory infrastructure actually looks like
Here’s what I think the architecture should have — not based on what exists, but on what the problems above demand.
Evidence-first. Every stored fact links back to the source message that produced it. Memory is a claim; the source is the evidence. Without this chain, debugging is archaeology.
Temporal metadata. Facts should know when they were written. The system should be able to reason about staleness — flagging facts past an age threshold for review, or at minimum surfacing the timestamp alongside retrieval results so the model can reason about it.
Explicit conflict surface. When two memories create tension, surface it — don’t silently resolve it by score. Transparency over false precision.
Self-hostable by default. Memory is categorically different from other AI infrastructure. A vector database holds embeddings. A model server runs weights. A memory layer holds everything the agent has ever learned about your users: their preferences, past decisions, health context, account history, behavioral patterns. That’s not something you should route through a third-party cloud without deliberate intent. The self-hosted option shouldn’t be an afterthought — it should be the default for any business that takes user data seriously.
Composable via standard protocols. The Model Context Protocol (MCP) — now supported by Claude, Cursor, and a growing ecosystem — is turning memory into a pluggable capability. A memory server becomes something any agent can call, regardless of which model or framework it runs on. Memory infrastructure that locks you into a specific provider will lose to memory infrastructure that composes.
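Pulling the first three principles together, a memory record might carry fields like these. This is an illustrative shape, not memo-mesh’s actual schema:

```typescript
// One record combining evidence link, temporal metadata, and an explicit
// conflict surface. Field names are hypothetical.
interface MemoryRecord {
  id: string;
  claim: string;             // the extracted fact
  sourceMessageId: string;   // evidence-first: the message that produced the claim
  conversationId: string;
  recordedAt: Date;          // temporal metadata: when the system learned this
  conflictsWith: string[];   // IDs of memories in unresolved tension with this one
}

// With those fields, "where did the agent learn this?" is a lookup,
// not archaeology.
function traceProvenance(record: MemoryRecord): string {
  return `claim "${record.claim}" ← message ${record.sourceMessageId}` +
    ` in conversation ${record.conversationId}` +
    ` (recorded ${record.recordedAt.toISOString()})`;
}
```

Notice that none of this is exotic: three extra columns and a join. The hard part is the discipline of never writing a claim without them.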
Where this is heading
Two threads are converging that will shape this space over the next year.
MCP as the distribution layer. When memory becomes an MCP server, it decouples from any specific product or provider. Your memory layer becomes infrastructure — the same way a database is infrastructure. This is exactly the right framing: memory shouldn’t be a feature baked into your AI assistant, it should be a capability your assistant plugs into. Any agent, any model, any platform.
The self-hosted moment. The enterprises and businesses who matter most in production AI won’t route long-term customer and user context through a cloud API they don’t control. The same dynamic that created serious markets for self-hosted vector databases and on-prem LLM inference is arriving for memory. The difference: memory data is more sensitive than vectors or weights, because it’s interpretable. Anyone can read a memory store and understand exactly what it reveals about your users.
Whoever builds the self-hosted memory stack that production teams actually trust will own a foundational layer of the agent ecosystem.
What I’m building
Memo-mesh is my exploration of these problems. It’s a self-hostable memory layer for LLM agents built on a single design conviction: every extracted fact should link back to its source message.
That means you can always ask “where did the agent learn this?” and get a real answer — not a confidence score, but an actual conversation message. Consolidation happens automatically (cosine similarity deduplication at a configurable threshold), but the audit trail stays intact throughout. The knowledge graph maps entity relationships so the agent can reason across connections, not just recall isolated facts.
The stack is Node.js + TypeScript + PostgreSQL with pgvector. The core pipeline — ingestion, embedding, extraction, semantic search, knowledge graph, consolidation, context packs — is complete. Authentication is next, and after that, an MCP server that lets any compatible agent plug in without custom integration work.
It’s early and opinionated. The opinions are that auditable, self-hostable, and composable matter more right now than fast and automatic. The market will eventually tell me if I’m right.
If you’re building agents that interact with real people — support bots, personal assistants, health tools, financial advisors — and the provenance problem keeps you up at night: come build with me.