3 months ago, I was rejected from a technical interview because I couldn’t build an agent that never forgets.
Every approach I knew worked… until it didn’t.
I walked into that room confident. I’d built chatbots. I understood embeddings. I knew how to use vector databases.
But when the interviewer asked me to design an agent that could remember a user’s preferences across weeks, not just within a single conversation, I froze.
My instinct was the standard playbook: Store everything in a vector database and retrieve similar conversations when needed.
The questions that killed me were simple: What about scale? After a thousand sessions, how do you handle conflicting data? How do you stop it from faking memories just to fill the gaps?
I had no answer.
That failure forced me to dig deep and actually find a solution:
Most tutorials about "agents with memory" are really just teaching you how to implement RAG and call it memory.
The problem isn't embeddings. It isn't token limits. It isn't even retrieval.
The problem is that memory is infrastructure, not a feature.
Here is the entire system I built to solve it and the code I used to do it.
The Real Problem With "Standard" Memory
Here is what I thought memory meant: Keeping the conversation history and stuffing it into the context window.
That works for about 10 exchanges. Then the context window fills up.
So you truncate old messages. Now your agent forgets the user is vegan and recommends a steakhouse.
You realize conversation history isn't memory; it's just a chat log.
"Fine," I thought. "I'll embed every message and retrieve relevant ones using similarity search."
This worked better. For a while.
But after two weeks, the vector database had 500 entries. When the user asked, "What did I tell you about my work situation?" the retrieval system returned fragments from 12 different conversations.
The agent saw:
"I love my job" (Week 1)
"I'm thinking about quitting" (Week 2)
"My manager is supportive" (Week 1)
"My manager micromanages everything" (Week 2)
Which one is true?
The agent had no idea. It hallucinated a synthesis: "You love your supportive manager but you're thinking about quitting because of micromanagement."
Completely wrong. The user had switched jobs between Week 1 and Week 2.
This is the crucial realization: Embeddings measure similarity, not truth.
Vector databases have a blind spot: they don't understand time, context, or updates. They just spit back text that looks mathematically close to what you asked for. That isn’t remembering; it’s guessing.
The fix required a mental shift. Memory isn't a hard drive. It’s a process. You can't just store data; you have to give it a lifespan and let it evolve.
Short-Term Memory: The Solved Problem
Before tackling the hard part (long-term memory), we need to handle short-term continuity.
Short-term memory is the ability to remember what was said 30 seconds ago. This is actually a solved problem.
The solution is Checkpointing.
Every agent operates as a state machine. It receives input, updates internal state, calls tools, generates output, and updates state again. A checkpoint is a snapshot of this entire state at a specific moment.
This gives you three capabilities:
Determinism: Replay any conversation.
Recoverability: Resume exactly where you left off if the agent crashes.
Debuggability: Rewind to inspect the agent's "thoughts."
In production, I use Postgres-backed checkpointers. Here is the pattern:
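The sketch below uses LangGraph's PostgresSaver (from the langgraph-checkpoint-postgres package); the connection string, the echo node, and the thread_id are placeholders to swap for your own graph and LLM call.

```python
from langgraph.graph import StateGraph, MessagesState, START, END
from langgraph.checkpoint.postgres import PostgresSaver

# Placeholder connection string -- point this at your own Postgres instance.
DB_URI = "postgresql://agent:secret@localhost:5432/agent_memory"

def respond(state: MessagesState):
    # A real agent would call an LLM here; echoing keeps the sketch runnable.
    last = state["messages"][-1].content
    return {"messages": [{"role": "assistant", "content": f"You said: {last}"}]}

builder = StateGraph(MessagesState)
builder.add_node("respond", respond)
builder.add_edge(START, "respond")
builder.add_edge("respond", END)

with PostgresSaver.from_conn_string(DB_URI) as checkpointer:
    checkpointer.setup()  # creates the checkpoint tables on first run
    graph = builder.compile(checkpointer=checkpointer)

    # Every call with the same thread_id resumes from the saved checkpoint,
    # so the second turn still "sees" the first one after a crash or restart.
    config = {"configurable": {"thread_id": "user-42"}}
    graph.invoke({"messages": [{"role": "user", "content": "I'm vegan, remember that."}]}, config)
    graph.invoke({"messages": [{"role": "user", "content": "Suggest a restaurant."}]}, config)
```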
This handles the "now." But checkpoints are ephemeral. They don't build wisdom. For that, we need Long-Term Architectures.
Long-Term Memory Architectures
After months of failure, I found two architectures that actually work.
Architecture A: File-Based Memory (The Self-Organizing System)
This mimics how humans categorize knowledge. It works best for assistants, therapists, or companions.
The Three-Layer Hierarchy:
Layer 1: Resources (Raw Data). The source of truth. Unprocessed logs, uploads, transcripts. Immutable and timestamped.
Layer 2: Items (Atomic Facts). Discrete facts extracted from resources ("User prefers Python," "User is allergic to shellfish").
Layer 3: Categories (Evolving Summaries). The high-level context. Items are grouped into files like work_preferences.md or personal_life.md.
The Write Path: Active Memorization
When new information arrives, the system doesn't just file it away; it processes it. It pulls up the existing summary for that category and actively weaves the new detail into the narrative. This handles contradictions automatically: if a user mentions they’ve switched to Rust, the system doesn't just add 'Rust' to the list; it rewrites the profile to replace the old preference.
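A minimal sketch of that write path, assuming an OpenAI-style chat client; the model name, the memory/categories directory, and the memorize() helper are all illustrative, not a fixed API.

```python
from pathlib import Path
from openai import OpenAI  # any chat-completion client works here

client = OpenAI()
MEMORY_DIR = Path("memory/categories")  # one evolving .md file per category

def memorize(category: str, new_fact: str) -> None:
    path = MEMORY_DIR / f"{category}.md"
    existing = path.read_text() if path.exists() else "(empty)"

    # Ask the model to rewrite the whole summary, not append to it.
    prompt = (
        "You maintain a user profile file. Rewrite it to incorporate the new fact.\n"
        "If the fact contradicts the profile, replace the outdated claim and note the change.\n\n"
        f"Current profile:\n{existing}\n\nNew fact:\n{new_fact}\n\nUpdated profile:"
    )
    updated = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(updated)

memorize("work_preferences", "User has switched from Python to Rust for new services.")
```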
The Read Path (Tiered Retrieval): To save tokens, you don't pull everything. The flow, sketched in code after this list, is:
Pull Category Summaries.
Ask LLM: "Is this enough?"
If yes -> Respond.
If no -> Drill down into specific items.
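Continuing the same sketch (it reuses client and MEMORY_DIR from the write-path example), the read path might look like this; the YES/NO check and the keyword drill-down are stand-ins for whatever item search you actually run.

```python
def search_items(question: str, items_path: Path = Path("memory/items.txt")) -> str:
    # Layer 2 drill-down. A real system would use vector or SQL search;
    # keyword overlap keeps the sketch self-contained.
    words = {w.lower().strip(",.?") for w in question.split()}
    lines = items_path.read_text().splitlines() if items_path.exists() else []
    hits = [line for line in lines if words & {w.lower() for w in line.split()}]
    return "\n".join(hits[:10])

def retrieve_context(question: str) -> str:
    # Step 1: pull the cheap Layer 3 category summaries.
    summaries = "\n\n".join(p.read_text() for p in MEMORY_DIR.glob("*.md"))

    # Step 2: ask the model whether the summaries alone can answer the question.
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
                   f"Question: {question}\n\nNotes:\n{summaries}\n\n"
                   "Can the question be answered from these notes alone? Reply YES or NO."}],
    ).choices[0].message.content.strip().upper()

    # Step 3: only drill into atomic items when the summaries fall short.
    if verdict.startswith("YES"):
        return summaries
    return summaries + "\n\n" + search_items(question)
```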
This works beautifully for narrative coherence. But it struggles with complex relationships. For that, you need graphs.
Architecture B: Context-Graph Memory (The Knowledge Web)
File-based memory struggles with complex relationships. For precise systems (CRM, Research), you need a Graph.
Hybrid Structure
Vector store for discovery, used to surface related or similar text.
Knowledge graph for precision, storing facts as subject–predicate–object relationships.
Conflict resolution
We also built in conflict resolution. If the graph currently says the user works at Google, but a new message places them at OpenAI, the system doesn't just add a second job. Instead, it recognizes the contradiction, archives the Google connection as 'past history,' and makes OpenAI the active employer.
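A toy version of that temporal conflict handling, using an in-memory triple store; the Edge and KnowledgeGraph classes below stand in for a real graph database.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Edge:
    subject: str
    predicate: str
    obj: str
    valid_from: datetime
    valid_to: datetime | None = None   # None means "currently true"

@dataclass
class KnowledgeGraph:
    edges: list[Edge] = field(default_factory=list)

    def assert_fact(self, subject: str, predicate: str, obj: str) -> None:
        now = datetime.now(timezone.utc)
        for edge in self.edges:
            # Same subject and predicate with a different object is a contradiction:
            # archive the old edge as history instead of keeping both as "current".
            if (edge.subject == subject and edge.predicate == predicate
                    and edge.obj != obj and edge.valid_to is None):
                edge.valid_to = now
        self.edges.append(Edge(subject, predicate, obj, valid_from=now))

    def current(self, subject: str, predicate: str) -> list[str]:
        return [e.obj for e in self.edges
                if e.subject == subject and e.predicate == predicate and e.valid_to is None]

kg = KnowledgeGraph()
kg.assert_fact("user", "works_at", "Google")
kg.assert_fact("user", "works_at", "OpenAI")   # retires the Google edge
print(kg.current("user", "works_at"))          # ['OpenAI']
```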
Hybrid Search
Retrieval runs two searches in parallel (sketched in code after this list):
Vector Search: Find semantically similar conversations.
Graph Traversal: Find entities connected to the query.
The results merge into a unified context. This prevents the "remembers everything but knows nothing" problem.
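A rough sketch of that merge, reusing the toy KnowledgeGraph above; vector_search here is a keyword stand-in for a real embedding store (pgvector, Qdrant, and so on).

```python
def vector_search(query: str, docs: list[str], top_k: int = 3) -> list[str]:
    # Placeholder scoring by shared keywords; swap in real embeddings in production.
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: -len(q & set(d.lower().split())))[:top_k]

def hybrid_retrieve(query: str, subject: str, kg: KnowledgeGraph, docs: list[str]) -> str:
    semantic_hits = vector_search(query, docs)                      # discovery
    graph_facts = [f"{subject} {e.predicate} {e.obj}"               # precision
                   for e in kg.edges
                   if e.subject == subject and e.valid_to is None]
    merged = list(dict.fromkeys(graph_facts + semantic_hits))       # de-dupe, keep order
    return "\n".join(merged)

docs = ["User mentioned loving the new team at OpenAI.",
        "User asked about relocation benefits last spring."]
print(hybrid_retrieve("where does the user work?", "user", kg, docs))
```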
Memory refresh, Decay, and Cron jobs
Here is what nobody tells you: Memory must decay.
"Never forget" doesn't mean "remember every single token." It means "remember what matters."
If you don't prune your database, your agent becomes confused, slow, and expensive.
I run background Cron jobs to keep the system healthy:
Nightly Consolidation
Every night at 3 AM, a background process reviews the day's conversations. It looks for patterns the agent missed during live operation. It merges redundant memories. It promotes frequently-accessed items to higher-priority storage.
Weekly Summarization
Once a week, the system re-summarizes category files. It compresses old items into higher-level insights. It prunes memories that haven't been accessed in 90 days.
Monthly Re-indexing
Once a month, we run a full re-index of the memory store.
Embeddings are rebuilt with the latest model version, and graph edges are adjusted based on real usage.
Anything that hasn’t been touched in a while gets archived.
This maintenance keeps memory systems healthy for months.
Without it, they rot.
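As a concrete example, here is what the 90-day pruning pass might look like as a cron-driven script; the directory layout and the last_accessed field are assumptions about how items are stored.

```python
# Run from cron, e.g. "0 3 * * 0": archive memory items not read in 90 days.
import json
import shutil
import time
from pathlib import Path

ITEMS_DIR = Path("memory/items")      # one JSON file per atomic memory item
ARCHIVE_DIR = Path("memory/archive")
MAX_AGE_DAYS = 90

def prune_stale_items() -> int:
    ARCHIVE_DIR.mkdir(parents=True, exist_ok=True)
    cutoff = time.time() - MAX_AGE_DAYS * 86400
    moved = 0
    for item in ITEMS_DIR.glob("*.json"):
        meta = json.loads(item.read_text())
        # Each item records the last time retrieval surfaced it.
        if meta.get("last_accessed", 0) < cutoff:
            shutil.move(str(item), str(ARCHIVE_DIR / item.name))
            moved += 1
    return moved

if __name__ == "__main__":
    print(f"Archived {prune_stale_items()} stale memories")
```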
How retrieval works at inference time
Most retrieval systems fail because they rely solely on vector similarity. That’s a mistake. A robust memory system works backwards from the constraints of the context window. It starts with a broad search using a synthesized query, not the raw user input. Then it treats those search results as prospects, not answers. The prospects are filtered through a relevance scorer and a time-decay function, so a slightly less relevant but highly recent memory often beats a perfect match from six months ago. The result is a prompt that contains only the 5-10 memories that actually move the needle, rather than a wall of similar-sounding text.
This ensures the agent sees only what it needs. Nothing more. Nothing less.
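A sketch of that scoring step; the 30-day half-life, the multiplicative weighting, and the budget of 8 memories are illustrative knobs, not fixed values.

```python
import time

HALF_LIFE_DAYS = 30
MAX_MEMORIES = 8

def decayed_score(similarity: float, last_seen_ts: float, now: float | None = None) -> float:
    # Relevance times an exponential time decay: 1.0 today, 0.5 after a month, and so on.
    now = now or time.time()
    age_days = (now - last_seen_ts) / 86400
    return similarity * 0.5 ** (age_days / HALF_LIFE_DAYS)

def select_memories(candidates: list[dict]) -> list[dict]:
    # candidates look like {"text": ..., "similarity": ..., "last_seen": ...}
    ranked = sorted(candidates,
                    key=lambda c: decayed_score(c["similarity"], c["last_seen"]),
                    reverse=True)
    return ranked[:MAX_MEMORIES]
```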
Why most people fail at this
After building this system, I understood why I failed that interview. Most implementations fail in production because they make five critical mistakes:
Mistake 1: Storing raw conversations forever. Conversations are noisy. If you store every "um" and "like," your memory becomes polluted. Extract facts, not transcripts.
Mistake 2: Blind embedding usage. Embeddings find similarity, not truth. "I love my job" and "I hate my job" embed very similarly. You need resolution logic.
Mistake 3: No memory decay. Without decay, your agent drowns in the past. It remembers your vacation plans from two years ago but forgets your current deadline.
Mistake 4: No write rules. If the agent writes to memory whenever it wants, it will write junk. Define explicit rules for what deserves to be remembered (see the write-gate sketch after this list).
Mistake 5: Treating memory as chat history. This is the fatal mistake. Chat history is ephemeral. Memory is a structured representation of what was learned.
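For Mistake 4, the write gate can be as simple as a rules function that runs before anything touches the memory store; the markers and thresholds below are examples, not a canon.

```python
import re

FACT_MARKERS = ("i am", "i'm", "my ", "i prefer", "i work", "i live", "call me")
FILLER = re.compile(r"^(ok(ay)?|thanks?|lol|hmm+|sure)\b", re.IGNORECASE)

def should_remember(user_message: str) -> bool:
    # Only durable, user-specific facts get written; filler and chit-chat are dropped.
    text = user_message.strip().lower()
    if len(text) < 15 or FILLER.match(text):
        return False
    return any(marker in text for marker in FACT_MARKERS)

assert should_remember("I'm allergic to shellfish, by the way.")
assert not should_remember("ok thanks!")
```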
The Mental model
The real breakthrough happened when we stopped looking at agents as simple chatbots and started treating them like operating systems. An agent needs the exact same capabilities:
Process Management: Track multiple concurrent tasks.
Memory Management: Allocate, update, and free knowledge.
I/O Management: Interface with tools and users.
Most importantly, it requires a sophisticated memory architecture. You need "RAM" for the fast, volatile context of the current conversation, but you also need a "hard drive": a persistent, indexed way to store knowledge that survives after the session ends. If you don't run regular maintenance on that memory, much like garbage collection, the system eventually breaks down.
The Before and After
Three months ago: my agents forgot who they were talking to the moment the context window filled up, and I couldn't explain why.
Today: they carry structured, evolving memory that survives across weeks of sessions and thousands of interactions.
The difference between a chatbot and a companion is memory.
The difference between memory and good memory is architecture.
If you're building agents, this is no longer optional. Users expect persistence. They expect learning. They expect the agent to remember who they are.
Three months ago, I couldn't build this. Now I've shipped agents that remember customer preferences across thousands of sessions.
The interview rejection that felt like failure became the catalyst for understanding what production systems actually require.
Storage is cheap. Structure is hard. But structure is what transforms a stateless language model into something that genuinely never forgets.
The agents of tomorrow won't just have more parameters or better training data. They'll have memory systems that learn, evolve, and improve with every interaction.
And now you know how to build them.

