Agents that remember: Cloudflare launches Agent Memory to give AI agents persistent memory
🧠 Memory sounds like a simple word, but for artificial intelligence agents it is one of the hardest problems in modern engineering.
Think about it this way: what good is an incredibly capable AI agent if it forgets everything it learned the moment a conversation ends? You train it, fine-tune it, configure behaviors, define a personality, and in the next session the agent acts like it has never seen you before. It is frustrating for the user and a real problem for any product that depends on continuity and context accumulated over time.
This is exactly the problem Cloudflare decided to tackle head-on with the launch of Agent Memory, now in private beta. The service is a managed solution that extracts information from agent conversations and surfaces it at the right moment, without needing to stuff everything into the context window all at once. But before diving into the technical details of how this works in practice, it is worth understanding why this matters so much right now.
Even with context windows reaching 1 million tokens, a phenomenon called context rot still haunts developers working with more sophisticated agents. The more information you push into the context, the worse the quality of the model's responses gets. It is a classic tension: keep everything and watch performance degrade, or prune aggressively and risk losing exactly what the agent will need later. Agent Memory proposes a different path, with a well-defined API, a retrieval-based architecture, and a set of operations designed for the real lifecycle of an agent in production. 🚀
The real problem with context in AI agents
When we talk about artificial intelligence agents in production, context is not just a technical detail. It is the backbone of any interaction that needs to make sense over time. An agent that helps a user plan projects, manage tasks, or make complex decisions necessarily needs to remember what was said before, what was decided, which preferences were expressed, and how things evolved since the beginning of the conversation. Without that, every session starts from scratch, and the user gets stuck in an endless loop of re-explanations that wear down any experience.
The challenge becomes even sharper when you consider that large language models (LLMs) do not have persistent memory by nature. They process what is in the context window at inference time and nothing else. Anything left outside simply does not exist for the model. This architectural limitation triggered a race among AI labs to keep increasing context window sizes, and some models today can process up to one million tokens at once. That sounds like a lot, and it is. But it does not solve the problem as elegantly as it might seem at first glance.
Context rot is the progressive degradation of response quality as the context grows too large. Benchmarks such as LongMemEval, LoCoMo, and BEAM help quantify the effect, and practical experiments show that models tend to pay less attention to information sitting in the middle of a very long context, favoring what appears at the beginning and end. So even if everything technically fits inside the window, the model may ignore or underweight critical information buried among thousands of tokens. For agents operating in long and complex conversations, this is a real risk that directly undermines the usefulness of the system.
The current landscape of memory for agents
Memory for agents is one of the fastest-moving areas within AI infrastructure right now. New open-source libraries, managed services, and research prototypes pop up almost weekly, each with different approaches to what to store, how to retrieve information, and what types of agents they were designed for.
The architectural differences between these solutions are significant. Some are managed services that handle extraction and retrieval in the background. Others are self-hosted frameworks where the developer needs to operate the entire memory pipeline on their own. Some expose limited, purpose-specific APIs, keeping memory logic outside the main agent context. Others give the model direct access to a database or file system and let it build its own queries, which ends up burning tokens on storage and retrieval strategy instead of focusing on the actual task.
There are also solutions that try to fit everything inside the context window, distributing the load across multiple agents when needed, while others use retrieval to surface only what is relevant at that moment. Each approach has its merits, but Cloudflare made a deliberate choice by positioning Agent Memory as a managed service with a well-defined API and retrieval-based architecture, arguing that more controlled ingestion and retrieval pipelines are superior to giving the agent direct access to the file system.
How Cloudflare Agent Memory works in practice
The Cloudflare approach with Agent Memory is straightforward: instead of trying to solve the context problem by pushing more tokens into the model, the service adds an intermediate layer that manages what enters the context and when. The core idea is that not all information needs to be present all the time. What an agent needs is access to the right information, at the right time, without wasting tokens.
Agent Memory organizes memories within a concept called a profile, which is addressed by name. A profile offers several operations to the developer:
- Ingest — the main bulk ingestion path, typically called when the agent harness compacts the context
- Remember — allows the model to store something important right away, explicitly
- Recall — executes the full retrieval pipeline and returns a synthesized response
- List — lists stored memories
- Forget — marks a specific memory as no longer relevant or true
In practice, this means the agent no longer needs to manually manage what to remember and what to forget. That logic is delegated to the memory layer, which uses retrieval to surface what is most relevant to the current moment of the conversation. If a user brings up a topic that was discussed weeks ago, the agent can retrieve that context naturally, without the developer needing to implement the search, indexing, and ranking logic by hand.
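To make the five operations concrete, here is a minimal in-memory sketch. It is not the Agent Memory API: the class name ToyProfile, the method signatures, and the keyword-overlap scoring are all assumptions made for illustration. The real service extracts, verifies, and classifies memories on ingest and returns a synthesized answer on recall, while this toy stores text verbatim and returns the best-matching strings.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    memory_id: int
    text: str
    forgotten: bool = False


@dataclass
class ToyProfile:
    """Hypothetical stand-in for a named profile; not the real service API."""
    name: str
    memories: list[Memory] = field(default_factory=list)

    def remember(self, text: str) -> Memory:
        # Store one item explicitly, right away.
        memory = Memory(memory_id=len(self.memories) + 1, text=text)
        self.memories.append(memory)
        return memory

    def ingest(self, messages: list[str]) -> list[Memory]:
        # Bulk path. The toy keeps every message verbatim instead of extracting memories.
        return [self.remember(message) for message in messages]

    def list_memories(self) -> list[Memory]:
        return [m for m in self.memories if not m.forgotten]

    def forget(self, memory_id: int) -> None:
        # Mark as no longer relevant or true rather than deleting the record.
        for memory in self.memories:
            if memory.memory_id == memory_id:
                memory.forgotten = True

    def recall(self, query: str, limit: int = 3) -> list[str]:
        # Naive keyword overlap stands in for the real retrieval pipeline.
        terms = set(query.lower().split())
        scored = []
        for memory in self.list_memories():
            overlap = len(terms & set(memory.text.lower().split()))
            if overlap:
                scored.append((overlap, memory.memory_id, memory.text))
        scored.sort(key=lambda item: (-item[0], -item[1]))
        return [text for _, _, text in scored[:limit]]
```

A caller would create one profile per agent or team, for example ToyProfile("engineering-team"), and route every read and write through these five methods instead of touching storage directly.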
Agent Memory can be accessed via binding from any Cloudflare Worker or via REST API for agents running outside the Workers environment. Those already using the Cloudflare Agents SDK will find a native integration, since the service works as the reference implementation for the memory portion of the Sessions API. 🔧
The ingestion pipeline under the hood
Memory ingestion is where much of the complexity that Agent Memory abstracts away for the developer lives. When a conversation arrives for processing, it goes through a multi-stage pipeline that extracts, verifies, classifies, and stores memories.
The first step is deterministic ID generation. Each message receives a content-based ID, a SHA-256 hash of the session ID, role, and content, truncated to 128 bits. If the same conversation is ingested twice, each message resolves to the same ID, making re-ingestion idempotent. This prevents duplicates and ensures consistency.
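The ID scheme can be sketched in a few lines. The article only says the hash covers the session ID, role, and content and is truncated to 128 bits; how the three fields are joined is not stated, so the NUL separator below is an assumption, as is the function name message_id.

```python
import hashlib


def message_id(session_id: str, role: str, content: str) -> str:
    """Return a deterministic 128-bit ID as 32 hex characters."""
    # Assumption: fields are joined with a NUL byte so that
    # ("ab", "c") and ("a", "bc") cannot produce the same payload.
    payload = "\x00".join([session_id, role, content]).encode("utf-8")
    digest = hashlib.sha256(payload).digest()
    return digest[:16].hex()  # 16 bytes = 128 bits
```

Because the ID depends only on the inputs, ingesting the same message twice yields the same ID, which is what makes a later duplicate-safe write possible.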
Next, the extractor runs two passes in parallel. A full pass divides messages into chunks of roughly 10,000 characters, with a two-message overlap, processing up to four chunks concurrently. Each chunk receives a structured transcript with role labels, relative dates resolved to absolute ones (for example, "yesterday" becomes 2026-04-14), and line indices for traceability. For longer conversations with nine or more messages, a detail pass runs in parallel, using overlapping windows focused specifically on extracting concrete values like names, prices, version numbers, and entity attributes that the broader extraction tends to generalize or miss.
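The chunking step of the full pass might look roughly like the sketch below. Only the 10,000-character target and the two-message overlap come from the description above; the function name chunk_messages, the greedy packing strategy, and the choice to count carried-over messages toward the next chunk's budget are assumptions. Concurrency, role labels, and date resolution are left out.

```python
def chunk_messages(
    messages: list[str], max_chars: int = 10_000, overlap: int = 2
) -> list[list[str]]:
    """Greedily pack messages into chunks, repeating the last `overlap` messages."""
    chunks: list[list[str]] = []
    current: list[str] = []
    size = 0
    for message in messages:
        # Close the chunk when adding this message would exceed the budget.
        if current and size + len(message) > max_chars:
            chunks.append(current)
            # Carry the tail of the previous chunk forward as shared context.
            current = current[-overlap:] if overlap else []
            size = sum(len(m) for m in current)
        current.append(message)
        size += len(message)
    if current:
        chunks.append(current)
    return chunks
```

The overlap means a fact that spans a chunk boundary still appears with some of its surrounding context in at least one chunk, at the cost of extracting some messages twice.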
The next step is verification. Each extracted memory is checked against the original conversation transcript. The verifier runs eight checks covering entity identity, object identity, location context, temporal accuracy, organizational context, completeness, relational context, and whether inferred facts are actually supported by the conversation. Each item is approved, corrected, or discarded.
Classification into four memory types
After verification, the pipeline classifies each verified memory into one of four types:
- Facts — represent what is true right now, atomic and stable knowledge like "the project uses GraphQL" or "the user prefers dark mode"
- Events — capture what happened at a specific moment, like a deploy or a decision
- Instructions — describe how to do something, such as procedures, workflows, and runbooks
- Tasks — track what is currently being worked on, and are ephemeral by design
Facts and instructions receive normalized topic keys. When a new memory has the same key as an existing one, the old one is superseded rather than deleted, creating a version chain with a pointer from the old memory to the new one. Tasks are excluded from the vector index to keep it lean, but remain discoverable via full-text search.
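The supersession behavior can be illustrated with a small sketch. The names StoredMemory, KeyedStore, and superseded_by, as well as the example topic key, are hypothetical; the only behavior taken from the text is that a new memory with an existing topic key replaces the old one as current while the old one stays in storage with a pointer to its successor.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoredMemory:
    memory_id: int
    topic_key: str
    content: str
    superseded_by: Optional[int] = None  # pointer from old memory to new one


class KeyedStore:
    def __init__(self) -> None:
        self.memories: dict[int, StoredMemory] = {}
        self.current_by_key: dict[str, int] = {}

    def write(self, topic_key: str, content: str) -> StoredMemory:
        new = StoredMemory(len(self.memories) + 1, topic_key, content)
        self.memories[new.memory_id] = new
        old_id = self.current_by_key.get(topic_key)
        if old_id is not None:
            # Supersede instead of delete: the old record keeps its content.
            self.memories[old_id].superseded_by = new.memory_id
        self.current_by_key[topic_key] = new.memory_id
        return new

    def current(self, topic_key: str) -> Optional[StoredMemory]:
        memory_id = self.current_by_key.get(topic_key)
        return self.memories[memory_id] if memory_id is not None else None

    def history(self, topic_key: str) -> list[str]:
        # Walk the version chain from the oldest memory for this key.
        ids = [m.memory_id for m in self.memories.values() if m.topic_key == topic_key]
        chain: list[str] = []
        cursor = min(ids) if ids else None
        while cursor is not None:
            chain.append(self.memories[cursor].content)
            cursor = self.memories[cursor].superseded_by
        return chain
```

Keeping the chain makes it possible to answer both "what is true now" and "what did we believe before" from the same store.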
Finally, everything is written to storage using INSERT OR IGNORE, so content-based duplicates are silently ignored. After returning the response to the harness, vectorization runs asynchronously in the background. The text used for embedding prepends 3 to 5 search queries generated during classification to the memory content, bridging the gap between how memories are written declaratively and how they are searched interrogatively. 🧩
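Both ideas in this step are easy to try locally with Python's built-in sqlite3 module. The table layout, the function names, and the newline used to join queries and content are assumptions for illustration; INSERT OR IGNORE itself is standard SQLite and skips a row whose primary key already exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: the content-based ID is the primary key.
conn.execute("CREATE TABLE memories (id TEXT PRIMARY KEY, content TEXT NOT NULL)")


def store_memory(memory_id: str, content: str) -> None:
    # A second write with the same ID is silently skipped, not an error.
    conn.execute(
        "INSERT OR IGNORE INTO memories (id, content) VALUES (?, ?)",
        (memory_id, content),
    )
    conn.commit()


def count_memories() -> int:
    return conn.execute("SELECT COUNT(*) FROM memories").fetchone()[0]


def embedding_text(content: str, search_queries: list[str]) -> str:
    # Prepend the generated questions so the embedded text resembles
    # both the declarative memory and the way people will ask about it.
    return "\n".join([*search_queries, content])
```

For a memory such as "the user prefers dark mode", the prepended queries might be "What theme does the user like?" or "Does the user want dark mode?", which pulls the vector closer to question-shaped searches.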
The retrieval pipeline
When an agent searches for a memory, the query goes through a separate and quite sophisticated retrieval pipeline. During development, the Cloudflare team discovered that no single retrieval method works best for all query types. So they run multiple methods in parallel and fuse the results.
The first stage runs query analysis and embedding concurrently. The query analyzer produces ranked topic keys, full-text search terms with synonyms, and a HyDE (Hypothetical Document Embedding), which is a statement formulated as if it were the answer to the question. The raw query is also embedded directly, and both embeddings are used in subsequent stages.
In the second stage, five retrieval channels run in parallel:
- Full-text search with Porter stemming — for keyword precision when you know the exact term but not the surrounding context
- Exact fact key lookup — returns results where the query maps directly to a known topic key
- Raw message search — queries the original conversation messages via full-text, acting as a safety net for details that the extraction pipeline may have generalized
- Direct vector search — finds semantically similar memories using the query embedding
- HyDE vector search — finds memories similar to what the answer would look like, which frequently surfaces results that direct search misses, especially for abstract or multi-hop queries
In the third stage, results from all five channels are merged using Reciprocal Rank Fusion (RRF), where each result receives a weighted score based on its position within each channel. Exact fact key matches receive the highest weight. Full-text search, HyDE vectors, and direct vectors are weighted according to signal strength. Raw messages come in with low weight as a safety net. Ties are broken by recency, with newer results ranked higher.
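A weighted form of Reciprocal Rank Fusion fits in a short function. The channel names, the weight values, and the recency map below are hypothetical, since the article gives only the relative ordering of the weights; the constant k = 60 is the conventional default for RRF, not a value confirmed for this service.

```python
def weighted_rrf(
    channels: dict[str, list[str]],
    weights: dict[str, float],
    recency: dict[str, int],
    k: int = 60,
) -> list[str]:
    """Fuse ranked result lists; each hit adds weight / (k + rank) to its score."""
    scores: dict[str, float] = {}
    for name, ranked_ids in channels.items():
        for rank, memory_id in enumerate(ranked_ids, start=1):
            scores[memory_id] = scores.get(memory_id, 0.0) + weights[name] / (k + rank)
    # Highest score first; on a tie, the newer memory (larger timestamp) wins.
    return sorted(scores, key=lambda m: (-scores[m], -recency.get(m, 0)))


# Hypothetical weights: fact keys highest, raw messages lowest.
EXAMPLE_WEIGHTS = {
    "fact_key": 3.0,
    "full_text": 1.5,
    "hyde_vector": 1.2,
    "direct_vector": 1.0,
    "raw_message": 0.3,
}
```

Because the score depends on rank rather than on each channel's raw similarity value, channels with incomparable scoring scales (keyword relevance versus cosine similarity) can be combined without normalization.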
The pipeline then passes the top candidates to the synthesis model, which generates a natural language response to the original query. Certain specific query types receive special treatment. Temporal computation, for example, is handled deterministically via regex and arithmetic, not by the LLM. Results are injected into the synthesis prompt as pre-computed facts, since models are notoriously unreliable for things like date calculations. 📊
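As a rough illustration of handling date arithmetic outside the model, the sketch below pulls ISO dates out of two candidate memories with a regular expression and computes the gap in plain code. The function names and the wording of the resulting fact line are assumptions; the article does not describe the actual patterns or prompt format.

```python
import re
from datetime import date
from typing import Optional

ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def first_date(text: str) -> Optional[date]:
    """Return the first YYYY-MM-DD date in the text, if any."""
    match = ISO_DATE.search(text)
    if match is None:
        return None
    year, month, day = map(int, match.groups())
    # Raises ValueError for impossible dates such as 2026-13-40.
    return date(year, month, day)


def precomputed_fact(earlier: str, later: str) -> Optional[str]:
    """Build a fact line that a synthesis prompt could quote verbatim."""
    start, end = first_date(earlier), first_date(later)
    if start is None or end is None:
        return None
    days = (end - start).days
    return f"Days between {start.isoformat()} and {end.isoformat()}: {days}"
```

The synthesis model then only has to phrase the answer around a number that was computed exactly, rather than attempt the subtraction itself.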
The infrastructure behind the scenes
Cloudflare built Agent Memory using its own products, and that is an interesting point. Under the hood, the service is a Cloudflare Worker that coordinates several systems:
- Durable Objects — store raw messages and classified memories, with SQLite as the backend, ensuring strong tenant isolation
- Vectorize — provides vector search over embedded memories
- Workers AI — runs the LLMs and embedding models used in the pipeline
Each memory context maps to its own Durable Object instance and Vectorize index, keeping data completely isolated across different contexts. This also makes it easy to scale as demand grows.
The model choices
One interesting finding from development was that a bigger, more powerful model is not always better. The team chose to use Llama 4 Scout (17B active parameters, MoE with 16 experts) for extraction, verification, classification, and query analysis, and Nemotron 3 (120B MoE, 12B active parameters) for synthesis. Scout handles structured classification tasks efficiently, while Nemotron’s greater reasoning capacity improves the quality of natural language responses. Synthesis was the only stage where throwing more parameters at the problem consistently helped. For everything else, the smaller model struck a better balance of cost, quality, and latency.
What you can build with this
Agent Memory was designed to work with a wide variety of agent architectures, and the use cases are pretty diverse:
Memory for individual agents. It does not matter if you are using coding agents like Claude Code or OpenCode with a human in the loop, self-hosted frameworks like OpenClaw or Hermes acting on your behalf, or managed services like Anthropic Managed Agents. Agent Memory can serve as a persistent memory layer without requiring changes to the core agent loop.
Memory for custom agent harnesses. Many teams are building their own agent infrastructure, including background agents that run autonomously without human supervision. Companies like Ramp, Stripe, and Spotify have publicly described systems like this. These harnesses can benefit enormously from memory that persists across sessions and survives restarts.
Shared memory across agents, people, and tools. A memory profile does not have to belong to a single agent. An engineering team can share a profile so that knowledge learned by one person’s coding agent becomes available to everyone: code conventions, architectural decisions, tribal knowledge that normally lives only in people’s heads or gets lost when context is pruned. A code review bot and a coding agent can share memory so that review feedback influences future code generation. The knowledge your agents accumulate stops being ephemeral and becomes a durable team asset. 🤝
How Cloudflare is using it internally
The team is already using Agent Memory internally as a testing ground and source of ideas for next steps.
In the coding agent workflow, they use an internal OpenCode plugin that connects Agent Memory to the development loop. The less obvious but perhaps most valuable benefit was shared memory across the team. With a shared profile, the agent knows what other team members have already learned, which means it stops asking questions that have already been answered and stops making mistakes that have already been corrected.
In agentic code review, the most interesting result was that the reviewer learned to stay quiet. It now remembers that a specific comment was not relevant in a past review, or that a pattern was flagged and the author chose to keep it for a good reason. Reviews get less noisy over time, not just smarter.
For chatbots, memory was connected to an internal bot that ingests message history and then keeps watching for and remembering new messages. When someone asks a question, the bot can respond based on previous conversations.
Your data is yours
As agents become more capable and more deeply integrated into business processes, the memory they accumulate becomes genuinely valuable. Not just as operational state, but as institutional knowledge that took real work to build. Cloudflare acknowledges the growing concern from customers about what it means to tie that asset to a single vendor.
The positioning is clear: Agent Memory is a managed service, but the data belongs to the customer. All memory is exportable, and the company commits to ensuring that knowledge accumulated by agents on the platform can leave with you if your needs change. The stated philosophy is that the best way to build long-term trust is to make leaving easy and keep building something good enough that nobody wants to.
What is coming next
The team continues testing and refining Agent Memory internally, improving the extraction pipeline, tuning retrieval quality, and expanding background processing capabilities. An interesting analogy that Cloudflare itself uses is with the human brain: just as our brain consolidates memories during sleep by replaying and strengthening connections, they see opportunities for memory storage to improve asynchronously and are already implementing and testing several strategies in that direction.
Agent Memory is still in private beta, so access is limited for now. Developers interested in early access can join the waitlist directly on the Cloudflare website. The launch itself already signals an important direction for where the agent development ecosystem is heading: less manual infrastructure, more high-level primitives, and a clear bet that managed memory will be as fundamental a building block as databases or message queues are today for any serious application in production. 💡
