Workers AI now runs large models and Kimi K2.5 is first in line
Workers AI just made the leap a lot of people were waiting for.
Cloudflare announced that its inference platform now supports large language models, and the first one to arrive is Kimi K2.5 from Moonshot AI. If you work with or have any interest in building autonomous agents, this news changes the landscape quite a bit.
We are not talking about yet another generic model in the catalog of some random platform. We are talking about a frontier open-source model with a 256k token context window, multi-turn tool calling support, vision inputs and structured outputs, running directly inside the Cloudflare ecosystem. And with a cost argument that, honestly, is hard to ignore. 💡
Cloudflare itself used the model internally before releasing it to the public, and the numbers they shared are pretty concrete:
- A security agent processing more than 7 billion tokens per day
- More than 15 confirmed issues found in a single codebase by the agent running with Kimi K2.5
- A 77% savings compared to proprietary models at a similar level
- And an annual cost difference that would reach USD 2.4 million in the alternative scenario, just for that single use case on a single codebase
But the launch goes beyond the model itself. Cloudflare brought along infrastructure improvements like prefix caching with discounts on cached tokens, a new session affinity header and a redesigned async API, all designed specifically for the usage patterns that modern agents demand. In the sections ahead you will understand what Kimi K2.5 is, why Cloudflare chose it, what changed in the Workers AI infrastructure and how all of this fits into the bigger movement of making the platform the ideal environment for running the complete lifecycle of AI agents. 🚀
The context behind the launch: Cloudflare primitives for agents
Before talking about the model itself, it is worth understanding why Cloudflare sees this launch as a missing piece in a larger puzzle. The company has been building, for years, what it calls primitives for autonomous agents. These primitives are fundamental building blocks that give agents the capabilities they need to actually work in production.
Durable Objects provides state persistence, meaning the ability for an agent to remember where it left off and maintain information across different sessions. Workflows allows orchestration of long-running tasks, essential when an agent needs to execute flows that last minutes or even hours. Dynamic Workers and Sandbox containers provide secure execution environments where code can run in isolation without compromising the rest of the system. And the Agents SDK works as a high-level abstraction that connects everything together and makes life easier for developers when putting an agent together.
All of these components solved the execution environment side. The agent had somewhere to run, a way to maintain state, a way to execute long tasks and a way to communicate with the outside world. But the brain was missing. The AI model that actually makes decisions, reasons over data and drives the agent had to come from outside, usually from a proprietary provider with costs that scale fast. With Workers AI now running large models like Kimi K2.5, that gap has been filled. The platform now offers the complete lifecycle of an agent within a single ecosystem, from inference to storage.
What Kimi K2.5 is and why it matters so much
Kimi K2.5 is a large language model developed by Moonshot AI, a Chinese company that has been gaining significant attention on the global artificial intelligence stage. Unlike many models that hit the market with generic capabilities and vague promises, Kimi K2.5 was built with explicit focus on long reasoning, external tool usage and execution of complex tasks across multiple conversation turns. This places it squarely at the center of what autonomous agent developers need today: a model that does not just respond, but acts, plans and iterates.
The 256k token context window is one of the most important features of the model. For those not familiar with this metric, tokens are fragments of text that the model processes at once. The larger the window, the more information the model can consider at the same time, which is critical for agents that need to maintain long conversation histories, track past actions, interpret lengthy documents or coordinate multiple data flows in parallel. With 256k tokens available, Kimi K2.5 can operate in scenarios that simply freeze or lose coherence in models with smaller windows, making it a naturally more robust choice for real-world intelligent automation applications.
Beyond the context window, multi-turn tool calling support is what truly sets Kimi K2.5 apart in the context of autonomous agents. Tool calling is the ability for the model to invoke external functions, APIs or tools during response generation. When this happens across multiple turns, it means the model can chain actions, verify results, adjust its plan and keep operating without losing track. Combined with vision support, which allows it to interpret images and visual data, and structured outputs, which ensure responses in formats like JSON reliably, the model becomes a very complete piece for anyone looking to build AI pipelines that actually work in production.
Why Cloudflare chose this model to debut LLMs on Workers AI
Cloudflare’s decision to debut large model support with Kimi K2.5 does not seem random. The company has been positioning Workers AI as a platform geared toward the complete lifecycle of AI agents, not just one-off inference. In that context, choosing a frontier open-source model with robust agentic workflow capabilities makes perfect sense: it validates the platform proposition and delivers a real, concrete use case from day one. Cloudflare did not need to sell the model on narrative alone — they brought their own internal numbers to show that the case works.
The company shared that it spent weeks testing Kimi K2.5 as the engine for its internal development tools. Inside the OpenCode environment, Cloudflare engineers used the model as a daily driver for coding tasks with agents. On top of that, the model was integrated into the company’s automated code review pipeline, which is publicly visible through the Bonk agent on Cloudflare’s GitHub repositories. In production, the model proved to be a fast and efficient alternative to larger proprietary models, without sacrificing quality in the results.
The internal usage that Cloudflare described is quite revealing about the model’s potential in a demanding production environment. A security agent running inside the company’s infrastructure processed more than 7 billion tokens per day, which is a massive volume by any metric. That agent found more than 15 confirmed issues in a single codebase. When you put that in perspective with the 77% savings compared to equivalent proprietary models, the economic argument becomes clear. A difference that would reach USD 2.4 million per year in the alternative scenario is not marginal. That is the kind of number that shows up in budget spreadsheets and changes architecture decisions for entire systems.
It is also worth highlighting the fact that Kimi K2.5 is open-source in this equation. Proprietary models charge not just for usage, but also for the dependency they create. When you build on an open model and host it on a platform like Workers AI, you maintain control over how the model is used, where data flows and how the architecture evolves over time. For companies operating in sectors with compliance requirements or that simply want to avoid lock-in to a specific vendor, this combination of open-source with managed infrastructure is very attractive.
The cost-benefit ratio that changes the math for personal and enterprise agents
Cloudflare made a point of positioning this launch not just as a technical improvement, but as a direct answer to an economic problem that is becoming increasingly urgent. As AI adoption grows, the company observes a fundamental shift in how engineering teams and even individuals operate day to day. It is becoming more and more common to have a personal agent, like OpenClaw, running 24 hours a day, seven days a week. Inference volume is skyrocketing.
This new reality of personal and coding agents means cost is no longer a secondary concern — it becomes the main obstacle to scaling. When every employee at a company has multiple agents processing hundreds of thousands of tokens per hour, the bill with proprietary models simply stops making sense. Cloudflare’s expectation is that companies will increasingly migrate to open-source models that offer frontier-level reasoning without the price tag of closed models. And Workers AI positions itself as the facilitator of that transition, offering everything from serverless endpoints for a personal agent to dedicated instances that power autonomous agents across an entire organization.
The inference stack for large models: what is going on under the hood
Workers AI has been serving models, including LLMs, since its launch two years ago, but historically it prioritized smaller models. Part of the reason was that, for a while, open-source LLMs fell significantly behind the models from frontier labs. That changed with models like Kimi K2.5, but to serve this kind of very large LLM, Cloudflare needed to make important changes to the inference stack.
The company developed custom kernels for Kimi K2.5, built on top of the proprietary inference engine called Infire. These kernels optimize how the model is served, improving performance and GPU utilization and unlocking gains that simply do not exist when you run the model straight out of the box without any tuning.
Beyond that, there are multiple techniques and hardware configurations that can be used to serve a large model. Developers typically combine data parallelism, tensor parallelism and expert parallelism techniques to optimize performance. Strategies like disaggregated prefill, where the prefill and generation stages are separated onto different machines for better throughput and higher GPU utilization, also play an important role. Implementing these techniques and incorporating them into the inference stack takes a lot of dedicated expertise to get right.
That is exactly the point Cloudflare wants to emphasize: Workers AI has already done all of that experimentation and engineering behind the scenes. Much of it simply does not come ready when you host an open-source model on your own. The benefit of using a managed platform like this is that you do not need to be a Machine Learning engineer, a DevOps specialist or a site reliability engineer to make the necessary optimizations. The hard part has already been done, and the developer just needs to call an API. 🛠️
Prefix caching and session affinity: smart savings on tokens
The Kimi K2.5 announcement came alongside infrastructure improvements that deserve special attention, starting with prefix caching. When you work with agents, you are very likely sending a large number of input tokens as part of the context. These can be detailed system prompts, tool definitions, MCP server tools or entire codebases. In theory, with the Kimi K2.5 256k token window, a single request could contain nearly 256 thousand input tokens. That is a lot.
To understand how caching helps, you need to know how an LLM processes a request. Processing is divided into two stages: the prefill stage, which processes the input tokens, and the generation stage, which produces the output tokens. These stages are sequential, meaning the input tokens need to be fully processed before generation begins. In multi-turn conversations, each new prompt sent by the client includes all previous prompts, tools and session context. The difference between consecutive requests is usually just a few new lines of input, while everything else has already gone through the prefill stage in a previous request.
This is where prefix caching comes into play. Instead of redoing the prefill on the entire request, Workers AI can cache the input tensors from a previous request and prefill only the new input tokens. This saves time and computation on the prefill stage, resulting in a faster Time to First Token and higher Tokens Per Second throughput, since the GPU is not stuck waiting for prefill to finish.
Workers AI has always done prefix caching, but now it is exposing cached tokens as a usage metric and offering a discount on cached tokens compared to regular input tokens. Additionally, Cloudflare introduced a new header called x-session-affinity. When you send this header with a unique string per session or per agent, the request is routed to the same model instance, increasing the prefix cache hit rate. More cached tokens mean faster TTFT, higher TPS and lower inference costs. Some clients like OpenCode already implement this automatically, and the Cloudflare Agents SDK starter also comes configured to use this feature.
Redesigned async API: durable inference for agents that do not need real time
The other big infrastructure update is the redesigned async API, which addresses a very specific usage pattern of modern agents: tasks that take time to complete and do not need to keep a connection open the entire time.
Cloudflare is upfront about a fact that many providers prefer not to admit: serverless inference is really hard. With a pay-per-token business model, it is cheaper on a per-request basis because you do not need to pay for entire GPUs to serve your requests. But there is a trade-off: you need to compete with other people’s traffic and capacity constraints, and there is no strict guarantee that your request will be processed. This is not unique to Workers AI — it is the reality of serverless model providers in general, as evidenced by frequent news about overloaded providers and service outages.
For request volumes that exceed synchronous rate limits, it is now possible to submit batches of inferences to be completed asynchronously. The new async API works more like flex processing than a traditional batch API: queued requests are processed as soon as there is available headroom on the model instances. For async use cases, you will not hit capacity errors, and inference will be durably executed at some point. In Cloudflare’s internal testing, async requests were generally executed within five minutes, but this depends on live traffic volume.
Under the hood, Cloudflare migrated from a push-based system to a pull-based system. This allows queued requests to be pulled as soon as capacity is available. The company also added better controls to adjust async request throughput, monitoring GPU utilization in real time and pulling async requests when utilization is low. This way, critical synchronous requests maintain priority while async requests continue to be processed efficiently. It is an ideal solution for non-real-time use cases, like code scanning agents or research agents. There is also the option to set up event notifications to know when inference is complete, instead of polling the request. 👀
What this means for anyone building with AI today
Cloudflare’s move with Workers AI and Kimi K2.5 represents something bigger than adding a new AI model to the catalog. It is a clear signal that the platform wants to be the environment where autonomous agents live end to end, from inference to storage, through routing, security and workflow orchestration. When you consider that Cloudflare already offers Workers for edge code execution, KV and R2 for storage, D1 for databases and Durable Objects for persistent state, Workers AI with large LLM support closes an important loop for anyone wanting to build AI applications without needing to piece together five different vendors.
For developers working with agents, the combination of a model with multi-turn tool calling, a wide context window, prefix caching with discounts, session affinity and an async API within an already familiar infrastructure significantly reduces the friction of getting something into production. It is not just about cost, although the cost argument is strong. It is about the ability to iterate fast, test different agent architectures and scale without needing to rewrite everything when volume grows. That kind of flexibility is what separates projects that stay in proof of concept from those that reach production with consistency.
The Cloudflare Agents SDK starter already uses Kimi K2.5 as the default model, which makes it easy for anyone looking to start from scratch. You can also connect to Kimi K2.5 on Workers AI directly through OpenCode and test the model in the Cloudflare interactive playground. The ecosystem is already set up so developers can experiment and validate before scaling.
The overall AI infrastructure landscape is changing fast, and initiatives like this show that the competition is no longer just between the models themselves, but between the ecosystems that surround them. Having a powerful model available is the starting point. What makes the difference day to day is latency, cost per token, ease of integration with the rest of the stack and platform reliability during peak hours. Workers AI, with Kimi K2.5 as the flagship, is clearly betting it can deliver all of that together — and the numbers Cloudflare shared from internal usage suggest it is not just a promise. 🚀
