Master AI Agent Memory: Engineer LLMs for Persistent Recall

Picture your AI agent like a goldfish in a tank with a giant library next to it. The fish is brilliant in the moment but forgets everything when it swims past the rock. The library is full of useful info, yet the fish has no idea how to grab the right book at the right time. That’s basically every LLM today without memory engineering wrapped around it.

That analogy stuck with me while watching Matt Berman’s livestream conversation about agent memory. The expert he brought on, Richmond Alake from Oracle (developer experience for AI), has been heads-down in this space for over three years. He co-taught a deep learning course with Andrew Ng on agent memory and has been shipping packages, papers, and benchmarks while most of us were still arguing about prompt length.

I was honestly blown away by how he simplified a topic that usually melts brains. So here’s the breakdown of what Richmond shared, mapped to the goldfish-and-library picture.

🧠 Mapping the analogy to real components

Richmond walked through how agent memory mirrors human memory, and the parallels are surprisingly clean.

The fish (LLM): A reasoning engine with no persistent state. Powerful, but stateless between sessions.
The librarian (memory engineering): The system that decides what to keep, what to toss, and what to surface.
The shelves (database substrate): Long-term storage. Richmond made the case for one unified database (like the Oracle AI database) instead of stitching five together. One brain, not five.
The reading desk (context window): Short-term memory. Limited capacity, needs constant curation.
Dreaming (overnight consolidation): Anthropic’s new feature in Claude managed agents. The librarian works overnight, organizing notes, resolving conflicts, and pre-staging the right books for tomorrow.

Richmond pointed out that the dreaming idea isn’t brand new. The Letta team (the MemGPT folks, Sarah and Charles) published “sleep-time compute” about a year ago. He puts them roughly a year ahead of the rest of the field.

🛠 How to actually apply this

Here’s where Richmond got practical. Memory engineering, in his framing, is different from context engineering. Context engineering is about what’s inside the window. Memory engineering is about what’s outside it and how information flows in and out.

Four cognitive memory types he mapped to computational versions:

Procedural memory: Skills.md files, tool descriptions, standard operating procedures for agents.
Episodic memory: Back-and-forth conversation history with timestamps.
Semantic memory: Facts, embeddings, knowledge stored for retrieval.
Working memory: Whatever’s loaded in the context window right now.
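
To make the four types concrete, here is a minimal sketch of how they might sit side by side in code. Everything in it is hypothetical: the AgentMemory class, its field names, and the load_working method are my own illustration, not the Oracle package or anything Richmond showed. The point is only that working memory gets assembled from the other three.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Episode:
    """One conversational turn with a timestamp (episodic memory)."""
    role: str
    text: str
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class AgentMemory:
    """Hypothetical container for the four memory types (illustration only)."""
    procedural: dict[str, str] = field(default_factory=dict)  # skill name -> instructions
    episodic: list[Episode] = field(default_factory=list)     # timestamped turns
    semantic: dict[str, str] = field(default_factory=dict)    # fact key -> fact text
    working: list[str] = field(default_factory=list)          # current context window

    def remember_turn(self, role: str, text: str) -> None:
        self.episodic.append(Episode(role, text))

    def load_working(self, skill: str, fact_keys: list[str], last_n: int = 2) -> list[str]:
        """Pick what flows from long-term storage into the context window."""
        window = [self.procedural[skill]]
        window += [self.semantic[k] for k in fact_keys if k in self.semantic]
        recent = self.episodic[-last_n:] if last_n > 0 else []
        window += [f"{e.role}: {e.text}" for e in recent]
        self.working = window
        return window


memory = AgentMemory()
memory.procedural["refund"] = "Follow the refund SOP."
memory.semantic["plan"] = "User is on the pro plan."
memory.remember_turn("user", "I want a refund")
print(memory.load_working("refund", ["plan"]))
```

A real system would swap the dictionaries for a database and pick facts by retrieval rather than by key, but the shape stays the same: three stores outside the window, one curated list inside it.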

For forgetting (a real question from the chat), Richmond pointed back to the Stanford Generative Agents paper from 2023. They scored memories on recency, relevance, and importance, then surfaced or decayed them based on a weighted formula. Simple math, surprisingly strong results.
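
Here is a rough sketch of that kind of scoring, in the spirit of the Generative Agents approach rather than a reproduction of it. The Memory class, the decay value, the equal weights, and the toy two-dimensional embeddings are all assumptions for illustration. Recency is an exponential decay over hours since last access, relevance is cosine similarity to the query, and importance is a stored rating.

```python
import math
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    importance: float          # 0..1, assumed to be rated when the memory is written
    embedding: list[float]     # toy vector standing in for a real embedding
    hours_since_access: float


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def score(mem: Memory, query: list[float], decay: float = 0.995,
          w_recency: float = 1.0, w_relevance: float = 1.0,
          w_importance: float = 1.0) -> float:
    """Weighted sum of recency, relevance, and importance (weights are illustrative)."""
    recency = decay ** mem.hours_since_access   # older memories fade toward 0
    relevance = cosine(mem.embedding, query)
    return w_recency * recency + w_relevance * relevance + w_importance * mem.importance


def retrieve(memories: list[Memory], query: list[float], k: int = 2) -> list[Memory]:
    """Return the k highest-scoring memories for this query."""
    return sorted(memories, key=lambda m: score(m, query), reverse=True)[:k]


memories = [
    Memory("likes tea", 0.2, [1.0, 0.0], 1.0),
    Memory("allergic to nuts", 0.9, [0.0, 1.0], 500.0),
    Memory("said hello", 0.1, [0.7, 0.7], 2000.0),
]
for m in retrieve(memories, query=[0.0, 1.0]):
    print(m.text)
```

Forgetting falls out of the same math: a memory that is old, off-topic, and unimportant scores low and simply stops being surfaced, without anyone having to delete it.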

📊 The benchmark that made me sit up

This is the part I think is worth screenshotting. Richmond ran a 100-turn agent conversation two ways:

Naive agent: Just append every turn to context. Token usage climbed steeply across the 100 turns.
Engineered agent (using Oracle’s agent memory package): Token usage stayed flat the whole way through.

Flat. Across 100 turns. He hinted you could probably push it to a thousand turns and stay stable, which is as close to “infinite context” as a practical system gets right now.

And the LLM-as-judge eval preferred the engineered version’s answers over the naive one. So you get lower cost and better quality at the same time. That’s the kind of result that changes how you build.
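
To see why the two curves diverge, here is a toy simulation. It is not Richmond’s benchmark and not the Oracle package: the function names are made up, word counts stand in for tokens, and the "summary" is just a truncation where a real system would call a summarizer or retrieve from a database. It only shows the shape: append-everything grows linearly, while a capped summary plus a short window of recent turns levels off.

```python
def count_tokens(text: str) -> int:
    """Crude proxy: whitespace-separated words stand in for tokens."""
    return len(text.split())


def naive_context(turns: list[str]) -> str:
    """Append every turn to the context."""
    return "\n".join(turns)


def bounded_context(turns: list[str], keep_last: int = 4, summary_budget: int = 30) -> str:
    """Keep the last few turns verbatim plus a size-capped stand-in summary."""
    recent = turns[-keep_last:]
    older = turns[:-keep_last]
    # Placeholder for a real summarizer: keep only the last words of older turns.
    summary_words = " ".join(older).split()[-summary_budget:]
    parts = ["Summary: " + " ".join(summary_words)] if summary_words else []
    return "\n".join(parts + recent)


turns = [f"turn {i}: user asks something and the agent replies" for i in range(100)]
naive_sizes = [count_tokens(naive_context(turns[:n])) for n in range(1, 101)]
bounded_sizes = [count_tokens(bounded_context(turns[:n])) for n in range(1, 101)]

print("naive at turn 100:", naive_sizes[-1])
print("bounded at turn 100:", bounded_sizes[-1])
```

The bounded version’s size depends only on keep_last and summary_budget, not on how long the conversation has run, which is the property that makes a thousand turns plausible.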

💡 Use cases where this matters most

Richmond grouped real workloads into three application modes, and memory needs differ for each.

Deep research: Long horizon, multi-source synthesis. Memory keeps the thread across hours of crawling.
Assistants: Conversational back and forth. Episodic memory dominates here. This is where most consumer products live.
Workflows: Repeatable automation with light LLM decisions. Procedural memory and tool retrieval matter most.

He also dropped a sharp line that landed with me: “Most people want autonomy, but what they need is automation.” Translation: stop trying to build a full AGI butler. Build the workflow first, drop an LLM into the right step, and add memory so it doesn’t forget what it learned yesterday.
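
A tiny sketch of what that can look like, under heavy assumptions: a made-up ticket-routing workflow where every step is plain code except one decision, and that decision is passed in as a function. The names run_ticket_workflow and fake_classifier are hypothetical, and the classifier is a keyword rule standing in for a real LLM call so the example runs on its own.

```python
from typing import Callable


def run_ticket_workflow(ticket: str,
                        classify: Callable[[str, list[str]], str],
                        memory: list[str]) -> dict:
    cleaned = ticket.strip()                       # deterministic step
    label = classify(cleaned, memory)              # the single LLM decision
    routes = {"billing": "finance", "bug": "engineering"}
    queue = routes.get(label, "triage")            # deterministic routing
    memory.append(f"{cleaned} -> {label}")         # persist what was decided
    return {"ticket": cleaned, "label": label, "queue": queue}


def fake_classifier(text: str, memory: list[str]) -> str:
    """Placeholder for an LLM call; a keyword rule keeps the sketch runnable."""
    return "billing" if "invoice" in text.lower() else "bug"


memory: list[str] = []
print(run_ticket_workflow("  Invoice is wrong ", fake_classifier, memory))
print(memory)
```

Because the model is just one swappable function, the workflow keeps working if you change models, and the memory list gives the next run something to learn from.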

🌙 The dreaming question

On Anthropic’s dreaming feature, Richmond’s read is that the labs are offloading compute to off-peak hours. Overnight, the agent reviews past sessions, consolidates patterns, prunes noise, and reinforces what worked. Next day, you ask a question and get a tighter answer with fewer tokens. Cheaper for them, faster for you.
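
For a feel of what an overnight pass might do, here is a deliberately simple sketch. It is not Anthropic’s implementation, and the consolidate function and its thresholds are my own assumptions. It merges notes that say the same thing, drops one-off noise, and keeps whatever was reinforced most often.

```python
from collections import Counter


def consolidate(notes: list[str], min_count: int = 2, max_keep: int = 3) -> list[str]:
    """Offline pass: merge duplicate notes, drop one-offs, keep the most reinforced."""
    # Normalize case and whitespace so near-identical notes collapse together.
    normalized = [" ".join(n.lower().split()) for n in notes]
    counts = Counter(normalized)
    kept = [(note, c) for note, c in counts.items() if c >= min_count]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [note for note, _ in kept[:max_keep]]


session_notes = [
    "User prefers metric units",
    "user prefers  metric units",
    "Typo in greeting",
    "Deploys happen on Fridays",
    "deploys happen on fridays",
    "user prefers metric units",
]
print(consolidate(session_notes))
```

A production version would compare meaning rather than exact text and would resolve conflicting notes, but the payoff is the same: tomorrow’s context starts from a few consolidated notes instead of every raw one.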

OpenAI is doing something parallel with thumbs up and thumbs down on memory sources. Richmond thinks the human-in-the-loop part is short term. Long term, the systems learn to curate themselves. The meta-harness paper Matt mentioned is a hint at where this goes.

🎯 What to do with this

Richmond’s call to action was simple. Get in the arena. Try the Oracle agent memory Python package. Open the notebooks in the Oracle AI Developer Hub. Read the Generative Agents paper if you haven’t. Take the Andrew Ng course; it’s free.

The scaffolding around the model is where 2026 gets won. Memory is the most accessible piece of that scaffolding. You don’t need a frontier lab to ship something useful here.

Go watch the full conversation for the live demos, the benchmark charts, and the back and forth on continual learning. Worth the hour.