Sawtooth Memory Stops LLM Agents From Freezing

A developer just shipped Sawtooth Memory, an open-source memory framework for LLM agents that aims to remove the lag many chat agents suffer between turns. The launch surfaced on Hacker News, where it pulled a score of 159 and landed in the Product Launch category. The pitch is simple: stop freezing your app every time the agent has to remember something.

Here’s the problem it targets. Standard memory systems, like LangChain’s ConversationSummaryMemory, process conversation history on the main application thread. Every time a user sends a message, the whole app stalls while an LLM generates a fresh summary of the chat so far. Worse, those summaries are lossy. They can hallucinate, they run into the “Lost in the Middle” effect, and they quietly drop specific UUIDs, names, or rules to save tokens. So your agent forgets the exact transaction ID a user just gave it.

Sawtooth Memory attacks both issues at once. According to the project’s Hacker News post, it stores the user’s message instantly and hands control back to the app in milliseconds, then offloads the heavy summarization to a background worker. The freeze disappears because the slow part runs off the main thread.
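The store-now, summarize-later pattern described above can be sketched with nothing but the Python standard library. This is a minimal illustration, not Sawtooth's code: the BackgroundMemory class, its add() and wait() methods, and the slow_summarize() stand-in are all hypothetical names, and the 0.2-second sleep simply imitates a slow LLM call.

```python
import queue
import threading
import time


class BackgroundMemory:
    """Hypothetical illustration of non-blocking ingestion; not Sawtooth's API."""

    def __init__(self, summarize):
        self._summarize = summarize      # slow callable, e.g. an LLM call
        self._messages = []              # raw messages, stored immediately
        self._summary = ""               # only the worker thread updates this
        self._lock = threading.Lock()
        self._pending = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def add(self, message):
        """Store the message and return without waiting for a summary."""
        with self._lock:
            self._messages.append(message)
        self._pending.put(message)

    def _run(self):
        # Worker loop: the slow summarization happens off the caller's thread.
        while True:
            message = self._pending.get()
            with self._lock:
                previous = self._summary
            updated = self._summarize(previous, message)
            with self._lock:
                self._summary = updated
            self._pending.task_done()

    def wait(self):
        """Block until every queued message has been summarized."""
        self._pending.join()

    @property
    def summary(self):
        with self._lock:
            return self._summary


def slow_summarize(previous, message):
    time.sleep(0.2)  # stand-in for a slow LLM summarization call
    return message if not previous else previous + " | " + message


memory = BackgroundMemory(slow_summarize)
start = time.perf_counter()
memory.add("My transaction ID is 42.")
elapsed = time.perf_counter() - start   # time the caller actually waited
memory.wait()                           # only needed here so the demo can inspect the result
```

In this sketch add() costs the caller a list append and a queue put, while the 0.2-second summarization runs on the worker thread. A real application would not call wait() on every turn; it is there so the example can read the finished summary.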

What it actually does

  1. Non-blocking ingestion. Messages get saved and control returns immediately. No more 5-to-10-second stalls while an LLM writes a summary mid-conversation.
  2. An immutable fact ledger. Before summarizing anything, Sawtooth extracts critical facts (IDs, names, paths, UUIDs) into a separate layer it calls L1.5. Summarization can’t delete what it never touches.
  3. A hierarchical memory stack. Four layers get stitched into each prompt: L0 holds the immutable persona and tool schemas, L2 holds compressed narrative memory, L1.5 holds the exact entities, and L1 holds recent raw conversation.
  4. Explainability traces. You can call explain_prompt() and get a deterministic audit trail showing exactly why each fact stayed in the context. That cracks open the usual black box of agent memory.
  5. Drop-in integrations. It ships a native SawtoothMemorySaver adapter for LangGraph and works with local air-gapped models through Ollama as well as cloud APIs from OpenAI, Anthropic, and Google.
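To make the fact ledger and the layered prompt more concrete, here is a rough sketch of how the two ideas could fit together. Everything in it is an assumption for illustration: extract_facts(), build_prompt(), the UUID-only regex, and the bracketed section labels are invented here and are not Sawtooth's API. The layer order simply follows the order given in the list above.

```python
import re

# Matches canonical 8-4-4-4-12 hexadecimal UUID strings.
UUID_RE = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
    r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)


def extract_facts(message):
    """Return the exact UUID strings found in a message (hypothetical helper)."""
    return UUID_RE.findall(message)


def build_prompt(persona, summary, ledger, recent):
    """Stitch the four layers together in the order the article lists them."""
    sections = [
        ("L0 persona and tools", persona),
        ("L2 narrative summary", summary),
        ("L1.5 fact ledger", "\n".join(ledger)),
        ("L1 recent messages", "\n".join(recent)),
    ]
    return "\n\n".join(f"[{title}]\n{body}" for title, body in sections)


history = [
    "Hi, I need help with a refund.",
    "The transaction is 123e4567-e89b-12d3-a456-426614174000.",
    "Can you check the status?",
]

# Facts are copied out verbatim before any summarization takes place.
ledger = []
for message in history:
    for fact in extract_facts(message):
        if fact not in ledger:      # append-only: entries are never rewritten
            ledger.append(fact)

# A lossy summary that dropped the ID, as a summarizer might.
summary = "User wants a refund for a recent transaction."

prompt = build_prompt("You are a support agent.", summary, ledger, history[-1:])
```

Even though the summary and the most recent message no longer contain the transaction ID, the prompt still carries it verbatim, because the ledger entry never passed through the summarizer.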

The benchmark numbers

The author ran a local GPU test on an NVIDIA RTX 5060 using the phi4-mini model over a 20-message conversation. The results, as detailed on Hacker News:

  • Main thread latency: 64.15 seconds on standard summary memory versus 5.70 seconds on Sawtooth. That’s 11.3x faster.
  • Final prompt payload: 506 tokens versus 454 tokens, or about 10% fewer tokens per prompt.
  • UUID and fact recall: the author describes standard memory’s recall as variable and prone to hallucination, while Sawtooth reports 100% retention through the L1.5 ledger.
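The two headline ratios follow directly from the reported figures. A quick check of the arithmetic, using only the numbers quoted above (the variable names are just labels for this example):

```python
baseline_latency_s = 64.15   # standard summary memory, main thread
sawtooth_latency_s = 5.70    # Sawtooth, main thread
baseline_tokens = 506
sawtooth_tokens = 454

speedup = baseline_latency_s / sawtooth_latency_s
token_saving = (baseline_tokens - sawtooth_tokens) / baseline_tokens

print(f"speedup: {speedup:.1f}x")            # speedup: 11.3x
print(f"token saving: {token_saving:.1%}")   # token saving: 10.3%
```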

What stands out here is the trade the author refuses to make. Most memory systems treat speed and accuracy as opposites: compress harder, lose more facts. By moving compression to the background and anchoring critical facts in a separate layer, Sawtooth claims both at once.

Availability

It’s free and open source under the MIT License. You install it with pip install sawtooth-memory, plus optional packages for whichever cloud provider you use. The project is taking pull requests and points developers to its DOCUMENTATION.md for deeper architecture details and API specs.

A few caveats worth flagging. The headline benchmark is a single local run on one small model and one 20-message conversation, so performance on larger models and longer chats is unproven. The author says cloud comparisons and reproducibility steps live in a separate benchmark writeup, which means the splashy 11.3x figure comes from a controlled local setup, not a broad test suite.

Why this matters: agent memory is one of the least-solved problems in production LLM apps right now. Teams building support bots, coding agents, and long-running assistants keep running into the same wall: slow turns and forgotten details. A lightweight, MIT-licensed library that addresses both, with LangGraph support baked in, is the kind of tool that spreads fast if the numbers hold up outside the author’s GPU. Developers can find the full methodology and API reference at the original source.
