Cloudflare Workers AI Launches Large Model Support with Kimi K2.5
- Cloudflare Workers AI adds frontier-scale model support featuring Moonshot AI’s Kimi K2.5.
- Integrated prefix caching and session affinity headers reduce inference costs by up to 77%.
- Revamped asynchronous API enables durable execution for high-volume, non-real-time agentic workloads.
Cloudflare is positioning its Developer Platform as the premier destination for building and deploying AI agents by integrating frontier-scale open-source models directly into its infrastructure. The rollout begins with Kimi K2.5, a model from Moonshot AI featuring a massive 256k context window (the amount of information a model can process at once) alongside robust vision and tool-calling capabilities. This move shifts Cloudflare from hosting smaller specialized models to offering a full-stack environment capable of handling the entire agent lifecycle.
To optimize these demanding workloads, Cloudflare introduced prefix caching and session affinity headers. When an agent engages in a multi-turn conversation, most of the previous context remains static. Prefix caching stores the mathematical representations (tensors) of these initial inputs, allowing the system to skip redundant processing for subsequent requests. By using a session affinity header, developers ensure requests route to the same model instance, maximizing cache hits and slashing costs by up to 77% compared to proprietary alternatives.
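The caching mechanic described above can be illustrated with a toy sketch. This is not Cloudflare's implementation or API (the real header name and internals are not given in the announcement); it only shows why routing a multi-turn conversation to the same warm instance lets repeat requests skip recomputing the static prefix:

```python
import hashlib

class PrefixCache:
    """Toy prefix cache: stores a stand-in for the computed tensors of a
    static conversation prefix, keyed by a hash of that prefix, so that
    later turns in the same session reuse the work instead of redoing it."""

    def __init__(self):
        self.store = {}
        self.hits = 0
        self.misses = 0

    def _key(self, prefix_tokens):
        return hashlib.sha256(" ".join(prefix_tokens).encode()).hexdigest()

    def process(self, prefix_tokens, new_tokens):
        key = self._key(prefix_tokens)
        if key in self.store:
            self.hits += 1
            state = self.store[key]          # reuse cached prefix state
        else:
            self.misses += 1
            state = f"state({len(prefix_tokens)} prefix tokens)"  # stand-in for real compute
            self.store[key] = state
        # Only the newly appended tokens need fresh processing.
        return state, len(new_tokens)

cache = PrefixCache()
system_prompt = ["you", "are", "a", "helpful", "agent"]
cache.process(system_prompt, ["user", "turn", "1"])  # miss: full prefix computed once
cache.process(system_prompt, ["user", "turn", "2"])  # hit: prefix work skipped
print(cache.hits, cache.misses)
```

Session affinity matters here because the cache lives on a specific model instance: if turn two landed on a different instance, its cache would be cold and the prefix would be recomputed from scratch.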
Addressing the inherent instability of serverless environments, Cloudflare also overhauled its asynchronous API. This pull-based system manages high-volume batches of inference by utilizing spare GPU capacity, effectively eliminating capacity errors for non-real-time tasks like code scanning or deep research. This infrastructure update suggests a shift where cost-efficiency and reliable scaling become the primary benchmarks for enterprise-grade autonomous agents.
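A pull-based batch queue of this kind can be sketched in a few lines. The class below is a hypothetical in-memory model, not Cloudflare's API: submission always succeeds immediately, work is drained only when spare capacity appears, and callers poll for results rather than holding a connection open, which is why capacity errors disappear for non-real-time jobs:

```python
from collections import deque

class AsyncInferenceQueue:
    """Toy pull-based batch queue: jobs are accepted unconditionally and
    processed whenever spare capacity frees up; submitters poll for results."""

    def __init__(self):
        self.pending = deque()
        self.results = {}
        self.next_id = 0

    def submit(self, prompt):
        job_id = self.next_id
        self.next_id += 1
        self.pending.append((job_id, prompt))
        return job_id  # returns at once; no capacity check, no error path

    def drain(self, capacity):
        # Called when spare GPU capacity becomes available.
        for _ in range(min(capacity, len(self.pending))):
            job_id, prompt = self.pending.popleft()
            self.results[job_id] = f"result for: {prompt}"

    def poll(self, job_id):
        return self.results.get(job_id)  # None until the job completes

q = AsyncInferenceQueue()
jid = q.submit("scan repository for vulnerabilities")
assert q.poll(jid) is None   # queued, not yet processed
q.drain(capacity=4)          # spare capacity arrives later
print(q.poll(jid))
```

The design trade-off is latency for reliability: a code-scanning or deep-research job tolerates minutes of queueing, and in exchange it is never rejected when real-time traffic has saturated the GPUs.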