The AI Engineering stack Cloudflare built internally using its own products

AI Engineering has gone from theory to everyday routine in Cloudflare’s development workflow. And we’re not talking about a pilot project or a proof of concept that got buried in the backlog.

Over the past 30 days, 93% of the company’s R&D team actively used AI-powered coding tools, all within an environment built end-to-end with the very products Cloudflare offers to the market.

That alone would be a noteworthy data point, but what makes this case even more relevant is the journey to get there. It took eleven months of real work, a dedicated team called iMARS (Internal MCP Agent/Server Rollout Squad), and an architecture designed around three distinct layers, each solving a specific challenge of AI adoption at enterprise scale.

The result? Nearly 48 million AI messages in a single month, 295 teams actively using agents, 3,683 internal users, and a spike in merge request volume the company had never seen before in a single quarter. 📈

In this article, we take a deep dive into how this stack was built, which products make it up, how Workers AI is being used to cut costs by up to 77% compared to proprietary models, and why this architecture can serve as a blueprint for any team that wants to take AI agents seriously in the development process.

The team that made it all happen: meet iMARS

Before talking about technology, it’s worth understanding who was behind the decisions. iMARS, short for Internal MCP Agent/Server Rollout Squad, is the cross-functional group that brought together engineers from multiple areas at Cloudflare to get this entire AI Engineering stack up and running. The team wasn’t created to study AI in controlled environments or publish whitepapers. It was created to solve real problems around productivity, scale, and security inside a company with more than 6,100 employees, 3,683 of whom are active users of AI tools.

The iMARS effort started with an honest assessment: how do you ensure AI tools get adopted safely, without engineers having to work around security policies or rely on unaudited external solutions? The answer came in the form of a layered architecture, where each part of the system solves a specific problem without creating new risks.

After the initial push, ongoing maintenance responsibility shifted to the Dev Productivity team, which already owned most of the company’s internal tooling, including CI/CD, build systems, and automation. This ensured the project didn’t die after the initial hype. iMARS didn’t work in a silo: it operated in close collaboration with product, security, and infrastructure teams, which explains why adoption was so fast and buy-in was so high. 🤝

The three-layer architecture that holds it all together

Cloudflare’s internal stack is organized into three well-defined layers, as the company itself describes them: the platform layer, the knowledge layer, and the enforcement layer. Understanding each one is essential to grasping why the system works so well at scale.

Platform layer: access, routing, and inference

It all starts with authentication. Cloudflare Access handles all identity and Zero Trust policy enforcement. Every engineer accessing any AI tool must be authenticated via corporate SSO, no exceptions. There are no shared tokens, no untracked access, and no way to use a language model without it being logged and auditable.

Once authenticated, every request to language models flows through the AI Gateway, which acts as a centralized point for managing provider keys, cost tracking, and data retention policies. In practical terms, this means the company knows exactly how much it’s spending on each model, broken down by team and use case.

The AI Gateway numbers from the past 30 days are impressive: 20.18 million requests and 241.37 billion tokens routed. Among the providers used, frontier models from OpenAI, Anthropic, and Google account for 91.16% of monthly requests, while Workers AI already covers 8.84% and is steadily growing.
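As a quick back-of-the-envelope check, the figures above imply how many requests Workers AI served and how large the average request was. The sketch below uses only the numbers quoted in this section; the variable names are illustrative, and the results are averages across all providers, not per-provider measurements.

```python
# Figures quoted above for the past 30 days on AI Gateway.
gateway_requests = 20.18e6   # total requests
gateway_tokens = 241.37e9    # total tokens routed
workers_ai_share = 0.0884    # 8.84% of monthly requests

# Requests attributable to Workers AI versus frontier providers.
workers_ai_requests = round(gateway_requests * workers_ai_share)
frontier_requests = round(gateway_requests) - workers_ai_requests

# Average tokens per request across every provider combined.
avg_tokens_per_request = gateway_tokens / gateway_requests

print(f"Workers AI requests: {workers_ai_requests:,}")
print(f"Frontier requests:   {frontier_requests:,}")
print(f"Avg tokens/request:  {avg_tokens_per_request:,.0f}")
```

That works out to about 1.78 million Workers AI requests in the month and roughly 12,000 tokens per request on average.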

One architectural decision that proved crucial was routing everything through a single proxy Worker from day one. The team could have allowed clients to connect directly to the AI Gateway, which would have been simpler to set up initially. But centralizing through a Worker meant that features like per-user attribution, model catalog management, and permission controls could be added later without touching any client configuration. This proxy pattern creates a control plane that direct connections simply can’t offer.

How it works in practice: a single URL configures everything

The engineer’s experience begins with a single login command. That command triggers a chain that configures providers, models, MCP servers, agents, commands, and permissions, all without the user having to manually edit any configuration file.

The flow works like this: the client fetches configuration from a discovery endpoint served by a Worker. That endpoint returns an authentication block explaining how to authenticate, along with a configuration block containing providers, MCP servers, agents, and default permissions. The user authenticates through the same SSO they use for everything at Cloudflare, receives a signed JWT, and they’re good to go. Every subsequent request carries that token automatically.
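To make the flow concrete, here is a minimal sketch of what a discovery document and the resulting client settings might look like. Every field name, path, and URL below is an assumption made for illustration; the article does not publish the actual schema.

```python
import json

def build_discovery_document(base_url: str) -> dict:
    """Return a hypothetical discovery payload: an auth block plus a config block."""
    return {
        "auth": {
            "type": "sso",                      # assumed label for the corporate SSO flow
            "login_url": f"{base_url}/login",   # hypothetical path
            "token_format": "jwt",
        },
        "config": {
            # Every model call is pointed at the proxy, never at a provider directly.
            "providers": [{"id": "gateway", "base_url": f"{base_url}/v1"}],
            "mcp_servers": [{"id": "portal", "url": f"{base_url}/mcp"}],
            "agents": ["reviewer"],
            "permissions": {"edit": "ask", "shell": "ask"},
        },
    }

def client_settings(document: dict, jwt: str) -> dict:
    """Build client settings that attach the signed JWT to every later request."""
    headers = {"Authorization": f"Bearer {jwt}"}
    config = document["config"]
    return {
        "providers": [dict(p, headers=headers) for p in config["providers"]],
        "mcp_servers": [dict(s, headers=headers) for s in config["mcp_servers"]],
        "agents": list(config["agents"]),
        "permissions": dict(config["permissions"]),
    }

if __name__ == "__main__":
    doc = build_discovery_document("https://ai.example.internal")
    print(json.dumps(client_settings(doc, "<signed-jwt>"), indent=2))
```

The point of the design is that the client only ever needs the one URL: everything else arrives in the response, so changing providers or permissions is a server-side deploy.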

The proxy Worker does three things: it serves the shared configuration compiled at deploy time, proxies requests to the AI Gateway while validating the JWT and rewriting headers, and keeps the model catalog updated via a cron trigger that runs every hour. No API keys exist on user machines. The Worker injects the real keys on the server side. 🔐

For anonymous tracking, the Worker maps the user’s email to a UUID using D1 for persistent storage and KV as a read cache. The AI Gateway only sees the anonymous UUID, never the email. This enables per-user cost tracking without exposing identities to model providers or Gateway logs.
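The mapping itself is a small read-through cache. The sketch below is an assumption-laden stand-in: an in-memory SQLite table plays the role of D1, a plain dictionary plays the role of KV, and the class and table names are invented for the example.

```python
import sqlite3
import uuid

class AnonymousIdMapper:
    """Maps an email to a stable random UUID, with a cache in front of the database."""

    def __init__(self) -> None:
        self.db = sqlite3.connect(":memory:")   # stand-in for D1
        self.db.execute(
            "CREATE TABLE user_ids (email TEXT PRIMARY KEY, anon_id TEXT NOT NULL)"
        )
        self.cache: dict[str, str] = {}         # stand-in for the KV read cache

    def anon_id_for(self, email: str) -> str:
        key = email.strip().lower()
        if key in self.cache:                   # fast path: cache hit
            return self.cache[key]
        row = self.db.execute(
            "SELECT anon_id FROM user_ids WHERE email = ?", (key,)
        ).fetchone()
        if row is None:                         # first time this user is seen
            anon_id = str(uuid.uuid4())
            self.db.execute(
                "INSERT INTO user_ids (email, anon_id) VALUES (?, ?)", (key, anon_id)
            )
            self.db.commit()
        else:
            anon_id = row[0]
        self.cache[key] = anon_id               # populate the cache for later reads
        return anon_id
```

Using a random UUID rather than a hash of the email means the identifier cannot be reversed or guessed by anyone who only sees Gateway logs; only the database holds the link back to a person.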

The MCP Portal: a single OAuth for multiple tools

Cloudflare’s internal MCP server portal aggregates 13 production servers that expose over 182 tools covering Backstage, GitLab, Jira, Sentry, Elasticsearch, Prometheus, Google Workspace, an internal release manager, and more. Everything is unified under a single endpoint and a single authentication flow via Cloudflare Access.

Each MCP server is built on the same foundation: McpAgent from the Agents SDK, workers-oauth-provider for OAuth, and Cloudflare Access for identity. Everything lives in a monorepo with shared authentication infrastructure. Adding a new server is basically copying an existing one and swapping out the API it wraps.

Code Mode: solving the token overhead problem

MCP is the right protocol for connecting AI agents to tools, but it has a practical problem: each tool definition consumes tokens from the context window before the model even starts working. As the number of servers and tools grows, the token overhead grows with it.

The GitLab MCP server, for example, originally exposed 34 individual tools. Those 34 schemas consumed approximately 15,000 tokens from the context window per request. In a 200K token window, that’s 7.5% of the budget wasted before you even ask a question.

Code Mode solves this at the portal level. Instead of exposing each tool definition upstream to the client, the portal collapses everything into just two portal-level tools: search and execute. The model discovers and calls tools through code instead of loading all schemas upfront. Without Code Mode, each new MCP server added more overhead to every request. With Code Mode at the portal, the client still sees only two tools regardless of how many servers are connected behind it. Less wasted context, lower token costs, and a cleaner architecture overall. ⚡
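A toy version of the search-and-execute pattern shows why the client's view stays constant. This is a sketch under heavy assumptions: the class, the call_tool helper, and the tool names are all hypothetical, and Python's exec with emptied builtins is used only to keep the example short. It is not a security sandbox, and a real portal would run model-written code in an isolated runtime.

```python
from typing import Any, Callable

class CodeModePortal:
    """Hides any number of registered tools behind two client-facing tools."""

    def __init__(self) -> None:
        self._tools: dict[str, tuple[str, Callable[..., Any]]] = {}

    def register(self, name: str, description: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = (description, fn)

    def client_visible_tools(self) -> list[str]:
        # Constant, regardless of how many tools sit behind the portal.
        return ["search", "execute"]

    def search(self, query: str) -> list[dict[str, str]]:
        """Return only the matching tool descriptions, on demand."""
        q = query.lower()
        return [
            {"name": name, "description": desc}
            for name, (desc, _) in sorted(self._tools.items())
            if q in name.lower() or q in desc.lower()
        ]

    def execute(self, code: str) -> Any:
        """Run a model-written snippet that calls tools via call_tool()."""
        def call_tool(name: str, **kwargs: Any) -> Any:
            return self._tools[name][1](**kwargs)

        scope: dict[str, Any] = {"__builtins__": {}, "call_tool": call_tool}
        exec(code, scope)   # illustration only: NOT a real sandbox
        return scope.get("result")

if __name__ == "__main__":
    portal = CodeModePortal()
    portal.register("gitlab.get_merge_request", "Fetch a merge request by ID",
                    lambda iid: {"iid": iid, "state": "opened"})
    portal.register("jira.get_issue", "Fetch an issue by key",
                    lambda key: {"key": key})
    print(portal.search("merge"))
    print(portal.execute("result = call_tool('gitlab.get_merge_request', iid=7)"))
```

Registering a third or a thirtieth tool changes what search can return but not what the client has to load into context up front.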

The knowledge layer: how agents understand the systems

Backstage as the engineering knowledge graph

Before building MCP servers that were actually useful, the team had to solve a more fundamental problem: structured data about services and infrastructure. Agents needed to understand context outside the codebase, like who owns what, how services depend on each other, where documentation lives, and which databases each service communicates with.

Cloudflare runs Backstage, the open source developer portal originally created by Spotify, as its service catalog. The numbers from that catalog give a sense of the scale of the challenge: 2,055 services, 167 libraries, 122 packages, 228 APIs with schema definitions, 544 systems (products) across 45 domains, 1,302 databases, 277 ClickHouse tables, 173 clusters, 375 teams, and 6,389 users with ownership mappings, plus dependency graphs connecting services to the databases, Kafka topics, and cloud resources they depend on.

The Backstage MCP server, with its 13 tools, is available through the MCP Portal. An agent can look up who owns a service, check its dependencies, find related API specs, and pull Tech Insights scores, all without leaving the coding session. Without this structured data, agents are flying blind. The catalog transforms individual repositories into a connected map of the entire engineering organization.
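The kind of lookup an agent performs against the catalog can be sketched as a small graph query. The data model below is a deliberate simplification and not Backstage's actual entity schema; the service names, team names, and resource labels are invented.

```python
from dataclasses import dataclass, field

@dataclass
class Service:
    name: str
    owner: str
    depends_on: list[str] = field(default_factory=list)

class Catalog:
    def __init__(self, services: list[Service]) -> None:
        self._services = {s.name: s for s in services}

    def owner_of(self, name: str) -> str:
        return self._services[name].owner

    def transitive_dependencies(self, name: str) -> list[str]:
        """Everything the service depends on, directly or indirectly."""
        seen: set[str] = set()
        stack = list(self._services[name].depends_on)
        while stack:
            dep = stack.pop()
            if dep in seen or dep == name:      # guard against cycles
                continue
            seen.add(dep)
            if dep in self._services:           # leaf resources have no entry
                stack.extend(self._services[dep].depends_on)
        return sorted(seen)

if __name__ == "__main__":
    catalog = Catalog([
        Service("api", "team-edge", ["auth", "db:users"]),
        Service("auth", "team-identity", ["db:sessions", "kafka:audit"]),
    ])
    print(catalog.owner_of("auth"))
    print(catalog.transitive_dependencies("api"))
```

With this in hand, an agent editing the api service can learn that a change may touch a Kafka topic owned by another team before it writes a line of code.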

AGENTS.md: getting thousands of repositories AI-ready

Early in the rollout, the team noticed a recurring failure pattern: coding agents produced changes that looked plausible but were wrong for the specific repository. The issue was usually local context. The model didn’t know the correct test command, the team’s current conventions, or which parts of the codebase shouldn’t be touched.

This led to the creation of AGENTS.md: a short, structured file in each repository that tells coding agents how the codebase actually works. A typical file includes information about the runtime, test and lint commands, how to navigate the codebase, team conventions, boundaries that shouldn’t be crossed, and service dependencies.

The generation pipeline pulls entity metadata from the Backstage catalog, analyzes the repository structure to detect language, build system, test framework, and directory layout, and maps the detected stack to relevant Engineering Codex patterns. A capable model then generates the structured document from those inputs, and the system opens a merge request so the owning team can review and refine it.
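The detection and rendering steps can be approximated with marker files and a template. This is a rough sketch, not the actual pipeline: a fixed template stands in for the model-generated document, and the marker table and default commands are assumptions that an owning team would be expected to correct in review.

```python
# Marker file -> (language, assumed test command, assumed lint command).
STACK_MARKERS = {
    "Cargo.toml": ("Rust", "cargo test", "cargo clippy"),
    "go.mod": ("Go", "go test ./...", "go vet ./..."),
    "package.json": ("TypeScript/JavaScript", "npm test", "npm run lint"),
    "pyproject.toml": ("Python", "pytest", "ruff check ."),
}

def detect_stacks(file_paths: list[str]) -> list[tuple[str, str, str]]:
    """Detect stacks from well-known build files anywhere in the repository."""
    names = {path.rsplit("/", 1)[-1] for path in file_paths}
    return [stack for marker, stack in STACK_MARKERS.items() if marker in names]

def render_agents_md(service: str, owner: str, file_paths: list[str]) -> str:
    """Render a first-draft AGENTS.md from catalog metadata and the file listing."""
    lines = [f"# AGENTS.md for {service}", "", f"Owner: {owner}", ""]
    stacks = detect_stacks(file_paths)
    if not stacks:
        lines.append("Stack not detected. Fill in runtime and commands manually.")
    for language, test_cmd, lint_cmd in stacks:
        lines += [f"## {language}", f"- Test: `{test_cmd}`", f"- Lint: `{lint_cmd}`", ""]
    return "\n".join(lines).rstrip() + "\n"

if __name__ == "__main__":
    listing = ["Cargo.toml", "src/main.rs", "web/package.json", "web/src/app.ts"]
    print(render_agents_md("example-service", "team-example", listing))
```

A polyglot repository like the one in the demo yields one section per detected stack, which is exactly the case the article notes needed the most human cleanup.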

About 3,900 repositories were processed this way. The first pass wasn’t always perfect, especially for polyglot repos or unusual build setups, but even that baseline was far better than asking agents to infer everything from scratch. And to keep these files current, the AI Code Reviewer can flag when changes to the repository suggest the AGENTS.md needs an update. 📋

The enforcement layer: quality at scale

The AI Code Reviewer that analyzes 100% of merge requests

Every merge request at Cloudflare gets an AI code review. The integration is straightforward: teams add a single CI component to their pipeline, and from that point on every MR is reviewed automatically.

The reviewer is implemented as a GitLab CI component that, when an MR is opened or updated, runs OpenCode with a multi-agent coordinator. This coordinator classifies the MR by risk level (trivial, light, or full) and delegates to specialized review agents: code quality, security, Codex compliance, documentation, performance, and release impact.
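The classify-then-delegate step might look something like the sketch below. The three risk levels and six agent names come from the description above; the heuristics, thresholds, and the mapping from risk level to agents are purely hypothetical, since the article does not describe how the coordinator decides.

```python
REVIEW_AGENTS = (
    "code_quality", "security", "codex_compliance",
    "documentation", "performance", "release_impact",
)

# Hypothetical signals; the real classifier's inputs are not described.
SENSITIVE_PATH_PARTS = ("auth", "crypto", "migrations")
DOC_SUFFIXES = (".md", ".txt", ".rst")

def classify_risk(changed_files: list[str], lines_changed: int) -> str:
    """Return 'trivial', 'light', or 'full' for a merge request."""
    if any(part in path for path in changed_files for part in SENSITIVE_PATH_PARTS):
        return "full"
    if lines_changed > 300:
        return "full"
    docs_only = bool(changed_files) and all(
        path.endswith(DOC_SUFFIXES) for path in changed_files
    )
    if docs_only and lines_changed <= 50:
        return "trivial"
    return "light"

# Assumed mapping: cheaper reviews run fewer specialized agents.
AGENTS_BY_RISK = {
    "trivial": ("documentation",),
    "light": ("code_quality", "documentation"),
    "full": REVIEW_AGENTS,
}

def plan_review(changed_files: list[str], lines_changed: int) -> tuple[str, tuple[str, ...]]:
    risk = classify_risk(changed_files, lines_changed)
    return risk, AGENTS_BY_RISK[risk]

if __name__ == "__main__":
    print(plan_review(["README.md"], 4))
    print(plan_review(["src/auth/session.rs"], 12))
```

Tiering like this is what makes 100% coverage affordable: a typo fix in a README does not need to pay for a security and performance pass.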

Each agent connects to the AI Gateway for model access, pulls rules from the Engineering Codex in a central repository, and reads the repository’s AGENTS.md for context. Results are posted as structured comments on the MR, organized by categories like Security, Code Quality, and Performance, with severity levels like Critical, Important, Suggestion, or Optional Note.

The reviewer maintains context between iterations. If it flagged something in a previous round that has since been fixed, it acknowledges that instead of raising the same issue again. And when a finding maps to an Engineering Codex rule, it cites the specific rule ID, turning an AI suggestion into a reference to an organizational standard.
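One simple way to get this behavior is to diff the findings from consecutive review rounds. The sketch below assumes findings can be matched by category, location, and message, which is cruder than whatever the real reviewer does; the Finding fields and the rule ID format are invented for the example, while the severity names are the ones listed above.

```python
from dataclasses import dataclass
from typing import Optional

SEVERITY_ORDER = ("Critical", "Important", "Suggestion", "Optional Note")

@dataclass(frozen=True)
class Finding:
    category: str
    severity: str
    location: str
    message: str
    rule_id: Optional[str] = None   # set when the finding maps to a Codex rule

    @property
    def key(self) -> tuple[str, str, str]:
        return (self.category, self.location, self.message)

def diff_rounds(previous: list[Finding], current: list[Finding]):
    """Split findings into those fixed since last round and those newly raised."""
    previous_keys = {f.key for f in previous}
    current_keys = {f.key for f in current}
    resolved = [f for f in previous if f.key not in current_keys]
    new = [f for f in current if f.key not in previous_keys]
    return resolved, new

def render_comment(resolved: list[Finding], new: list[Finding]) -> str:
    """Acknowledge fixes first, then list new findings by category and severity."""
    lines = [f"Resolved since last review: {f.message} ({f.location})" for f in resolved]
    ordered = sorted(new, key=lambda f: (f.category, SEVERITY_ORDER.index(f.severity)))
    for f in ordered:
        cite = f" [rule {f.rule_id}]" if f.rule_id else ""
        lines.append(f"{f.category} / {f.severity}: {f.message} ({f.location}){cite}")
    return "\n".join(lines)
```

Findings that appear in both rounds are deliberately left out of the new comment, which is what keeps the reviewer from repeating itself on every push.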

Over the past 30 days, the reviewer operated at 100% coverage across all repositories on the standard CI pipeline, consuming 5.47 million AI Gateway requests and processing 24.77 billion tokens. Workers AI handles about 15% of the reviewer’s traffic, primarily for documentation review tasks where models like Kimi K2.5 perform well at a fraction of the cost of frontier models.

Engineering Codex: engineering standards as agent skills

The Engineering Codex is Cloudflare’s new internal system for its core engineering standards. Those standards go through a multi-stage AI distillation process that produces a set of rules in conditional format, such as “if you need X, use Y” or “you must do X if you are doing Y or Z.”

The Codex is also packaged as an agent skill that engineers can use locally while they build, with prompts like “how should I handle errors in my Rust service” or “review this TypeScript code for compliance.” The company’s Network Firewall team, for example, audited rampartd using a multi-agent consensus process in which each requirement was classified as COMPLIANT, PARTIAL, or NON-COMPLIANT, with specific violation details and remediation steps. That reduced what used to take weeks of manual work to a structured, repeatable process.
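The consensus step in an audit like this can be as simple as a vote across agents. The sketch below assumes majority voting with ties broken toward the more severe verdict; the article does not say how consensus was actually reached, and the requirement IDs are made up.

```python
from collections import Counter

# Ordered from least to most severe; ties resolve toward the right.
VERDICTS = ("COMPLIANT", "PARTIAL", "NON-COMPLIANT")

def consensus(votes: list[str]) -> str:
    """Majority verdict across agents, breaking ties toward the stricter one."""
    if not votes:
        raise ValueError("at least one vote is required")
    unknown = set(votes) - set(VERDICTS)
    if unknown:
        raise ValueError(f"unknown verdicts: {sorted(unknown)}")
    counts = Counter(votes)
    top = max(counts.values())
    tied = [verdict for verdict in VERDICTS if counts[verdict] == top]
    return tied[-1]

def audit(votes_by_requirement: dict[str, list[str]]) -> dict[str, str]:
    """Apply the consensus rule to every requirement in an audit."""
    return {req: consensus(votes) for req, votes in votes_by_requirement.items()}

if __name__ == "__main__":
    print(audit({
        "REQ-1": ["COMPLIANT", "COMPLIANT", "PARTIAL"],
        "REQ-2": ["PARTIAL", "NON-COMPLIANT"],
    }))
```

Breaking ties toward the stricter verdict is a conservative choice for a compliance audit: a disputed requirement gets a human look instead of a pass.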

The integration between Backstage, AGENTS.md, the AI Code Reviewer, and the Engineering Codex is what makes the system greater than the sum of its parts. When an agent can pull context from the service catalog, read the AGENTS.md of the repository it’s editing, and be reviewed against Codex rules by the same toolchain, the first draft is usually close enough to ship. That wasn’t true six months ago.

Tools in daily use

Workers AI and its central role in cost reduction

Workers AI is Cloudflare’s serverless inference platform, which runs open source models on GPUs distributed across the company’s global network. Beyond the large cost savings compared to frontier models, a key advantage is that inference runs on the same network as your Workers, Durable Objects, and storage. There are no cross-cloud hops, which reduces latency, removes a source of network instability, and cuts down on extra networking configuration.

Over the past 30 days, Workers AI processed 51.47 billion input tokens and 361.12 million output tokens. Kimi K2.5, launched on Workers AI in March 2026, is a frontier-scale open source model with a 256K context window, tool calling, and structured outputs. A Cloudflare security agent processes over 7 billion tokens per day using Kimi. On a mid-tier proprietary model, that would cost an estimated $2.4 million per year. On Workers AI, it comes in 77% cheaper. 💰
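The cost claim is easy to unpack with the numbers given. The sketch below treats "over 7 billion tokens per day" as exactly 7 billion and assumes a 365-day year, so the derived per-token price is a rough implied figure, not a published rate.

```python
# Figures quoted above for the security agent workload.
tokens_per_day = 7e9                      # "over 7 billion", treated as a floor
proprietary_cost_per_year = 2_400_000     # estimated, mid-tier proprietary model (USD)
discount = 0.77                           # Workers AI is 77% cheaper

tokens_per_year = tokens_per_day * 365
workers_ai_cost_per_year = proprietary_cost_per_year * (1 - discount)
savings_per_year = proprietary_cost_per_year - workers_ai_cost_per_year

# Blended price per million tokens implied by the proprietary estimate.
implied_price_per_million = proprietary_cost_per_year / (tokens_per_year / 1e6)

print(f"Workers AI cost/year: ${workers_ai_cost_per_year:,.0f}")
print(f"Savings/year:         ${savings_per_year:,.0f}")
print(f"Implied price:        ${implied_price_per_million:.2f} per million tokens")
```

Under those assumptions the workload costs about $552,000 per year on Workers AI, a saving of roughly $1.85 million, and the proprietary estimate corresponds to about $0.94 per million tokens.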

Beyond security, Workers AI is used for documentation review in the CI pipeline, for generating AGENTS.md files across thousands of repositories, and for lightweight inference tasks where same-network latency matters more than peak model capability. As open source models continue to improve, the expectation is that Workers AI will absorb an increasingly larger share of internal workloads.

The numbers that sum up the impact

From launching the effort to reaching 93% R&D adoption took less than a year. Here’s the full picture from the past 30 days:

  • 3,683 active users using AI-powered coding tools (60% of the entire company, 93% of R&D)
  • 47.95 million AI messages
  • 295 teams using agents and coding assistants
  • 27.08 million messages via OpenCode
  • 434.9K messages via Windsurf
  • 20.18 million requests on AI Gateway
  • 241.37 billion tokens routed through AI Gateway
  • 51.83 billion tokens processed on Workers AI

The 4-week moving average of merge requests climbed from roughly 5,600 per week to over 8,700. During the week of March 23, the number hit 10,952, nearly double the Q4 baseline. That’s probably the most concrete indicator that AI tools are having a real impact on development velocity.
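For readers who want to reproduce this kind of metric on their own data, a trailing moving average takes only a few lines. The function below is a generic sketch; the growth figures are computed from the rounded numbers quoted above and assume the roughly 5,600 per week figure is the Q4 baseline.

```python
def moving_average(values: list[float], window: int = 4) -> list[float]:
    """Trailing moving average; the first result covers the first full window."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

# Rounded figures quoted above (merge requests per week).
baseline_avg = 5600     # earlier 4-week moving average, assumed to be the Q4 baseline
current_avg = 8700      # later 4-week moving average ("over 8,700")
peak_week = 10952       # week of March 23

growth = current_avg / baseline_avg - 1
peak_ratio = peak_week / baseline_avg

print(f"Moving-average growth: {growth:.0%}")
print(f"Peak week vs baseline: {peak_ratio:.2f}x")
```

On those figures the moving average grew by about 55%, and the peak week came in at roughly 1.96 times the baseline, consistent with "nearly double."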

What comes next: background agents

The next evolution of the internal stack includes background agents, which can be spun up on demand with the same tools available locally (MCP Portal, git, test runners) but running entirely in the cloud. The architecture uses Durable Objects and the Agents SDK for orchestration, delegating to Sandbox containers when the work requires a full development environment, like cloning a repository, installing dependencies, or running tests.

The Sandbox SDK reached general availability during Agents Week, and long-running agents are now natively supported in the Agents SDK. This solves the durable session problem that previously required workarounds. The SDK now supports sessions that run for extended periods without eviction, enough time for an agent to clone a large repository, run a full test suite, iterate over failures, and open an MR in a single session. 🔄

What this case teaches about AI adoption at scale

Looking at the big picture, the most striking thing about this case isn’t the technology itself, but the adoption process. Getting 93% of R&D at a company the size of Cloudflare to actively use AI tools in less than a year is a result most organizations can’t come close to. The reasons are almost always the same three: lack of trust in the tools, data security concerns, or tools that simply don’t fit into engineers’ actual workflow.

Cloudflare addressed all three problems systematically. Trust came from the fact that the tools were built internally, with the same security standards the company applies to its commercial products. Security concerns were addressed by the Zero Trust model, which ensures full traceability. And the workflow fit problem was solved through direct editor integrations and automatic context enrichment.

One detail worth highlighting: all of the infrastructure listed, with the exception of Backstage, is made up of products Cloudflare sells commercially. This means any company can replicate this architecture using the same tools. None of it is proprietary internal infrastructure. These are public products, battle-tested at scale by the very company that built them.

For teams thinking about building something similar, the most valuable lesson from this case is probably this: start with engineers’ real problems, not with model capabilities. Cloudflare didn’t ask “what can AI do?” but rather “where are our engineers losing the most time?” and worked backward from there. That kind of problem-first orientation, rather than technology-first, is what separates a successful adoption from yet another project stuck in pilot mode forever. 🎯

Rafael

Operations
