Efficiency at Scale: The Challenge Few Can Actually See
Efficiency at scale is one of the biggest challenges in modern engineering.
When a system serves over 3 billion people, even the smallest performance variation can translate into an absurd amount of wasted energy — and money straight down the drain.
That was exactly the scenario that brought Meta to a critical turning point.
The company already had robust tools for detecting performance issues across its infrastructure. But identifying the problem is only half the battle.
The other half — investigating, diagnosing, and fixing — still relied on engineers dedicating precious hours of their day to resolve what were often tiny regressions with a massive impact on operations.
That is where the Capacity Efficiency Program came in: an initiative that uses artificial intelligence agents to automate both the search for optimization opportunities and the resolution of performance regressions, all within a unified platform. 🚀
The result? Hundreds of megawatts recovered, investigations that used to take around 10 hours compressed to under 30 minutes, and engineers finally free to focus on what truly matters: innovating on new products.
Here is how Meta built this system and what it means for the future of engineering at scale.
The Real Problem Behind Performance Regressions
Before understanding the solution, it is worth stepping back and grasping the scale of the problem. When we talk about performance regressions in globally scaled infrastructure, we are not talking about lag that a user notices on their screen. We are talking about subtle variations in computational resource consumption — CPU, memory, I/O — that, multiplied across thousands of servers running around the clock, turn into monumental energy and financial losses.
At Meta, where the volume of data flowing and being processed is simply colossal, an efficiency drop of just 0.1% can mean megawatts of energy burned for no reason. It is the kind of problem that no traditional dashboard can solve on its own.
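To see how a fraction of a percent turns into megawatts, here is a back-of-the-envelope calculation. The fleet power draw of 1 GW is a hypothetical round number chosen only for illustration; it is not a figure reported by Meta. The regression rate r = 0.1% is the one mentioned above.

```latex
P_{\text{wasted}} = r \cdot P_{\text{fleet}} = 0.001 \times 1\,\text{GW} = 1\,\text{MW}
```

Under that assumption, a single 0.1% regression left unresolved costs about one megawatt for as long as it stays in production.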
The process for investigating these regressions, up until recently, followed a very manual workflow. An engineer would receive an alert, start correlating data from different sources, try to isolate which code change, configuration tweak, or infrastructure shift had caused the performance variation, and then propose a fix from there. That entire cycle consumed, on average, around 10 hours of specialized work per occurrence.
Considering that these regressions happened frequently — and often simultaneously across different parts of the system — the human and operational cost was extremely high. Highly skilled engineers were spending a significant chunk of their time solving repetitive, structured problems when they could have been channeling that energy into higher-impact projects.
This context created an urgent need to rethink the model. It was not enough to simply improve existing monitoring tools or add more alerts to the system. A qualitative leap was needed: building something capable of not just detecting the problem, but understanding the context, tracing the root cause, and suggesting — or even executing — the fix. And it was out of that need that Meta’s engineering team started designing what would become the Capacity Efficiency Program.
Offense and Defense: A Two-Front Structure
Within Meta’s Capacity Efficiency organization, efficiency is treated as a two-front effort:
- Offense: proactively seeking out opportunities — conceptual code changes — to make existing systems more efficient, and shipping them to production.
- Defense: monitoring resource usage in production to detect regressions, identifying the root cause down to a specific pull request, and applying mitigations quickly.
Both sides of the problem were already addressed by internal tools that worked well and played a meaningful role in Meta’s efficiency efforts for years. However, actually resolving the issues these tools identified introduced a new bottleneck: human engineering time.
That time went into activities like querying profiling data to find hot functions ripe for optimization, reviewing descriptions and documentation for efficiency opportunities, checking recent code and configuration deployments that might have caused an abrupt shift in resource usage, or digging through internal discussions about launches that could be linked to a regression.
Many engineers at Meta used these efficiency tools daily. But no matter how good the tools were, engineers had limited time to tackle performance problems when the top priority was innovating on new products.
The question that changed everything was straightforward: what if AI could handle the investigation and the resolution?
The Insight That Unified Everything
The big breakthrough from the engineering team was realizing that both offense and defense share the same work structure. Both involve gathering context about the system, applying specialized knowledge to interpret the data, and generating an action — whether a fix or an optimization.
That meant there was no need to build two separate AI systems. It was possible to create a single platform capable of serving both sides.
The platform was built on two foundational layers:
- MCP Tools: standardized interfaces, based on the Model Context Protocol (MCP), that allow large language models (LLMs) to invoke code. Each tool does one thing: query profiling data, fetch experiment results, retrieve configuration history, search code, or extract documentation.
- Skills: these encode domain knowledge about performance efficiency. A skill can tell the LLM which tools to use and how to interpret the results, capturing reasoning patterns that experienced engineers developed over years. For example, a skill might direct the model to query the top GraphQL endpoints when investigating latency regressions, or to check recent schema changes when the affected function deals with serialization.
Together, tools and skills transform a general-purpose language model into something capable of applying domain knowledge that would normally be locked inside the heads of senior engineers. The same tools power both offense and defense. Only the skills change. 💡
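The split between tools and skills can be sketched in a few lines of Python. This is only an illustration of the idea, not Meta’s implementation: every name below (the tool functions, the `Skill` class, the two example skills) is hypothetical, and the tools return placeholder strings instead of calling real services.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-ins for single-purpose tools. Real tools would call
# profiling, code search, or configuration services.
Tool = Callable[[str], str]


def query_profiling_data(target: str) -> str:
    return f"profile({target})"


def search_code(target: str) -> str:
    return f"code({target})"


def fetch_config_history(target: str) -> str:
    return f"config_history({target})"


# One shared registry: every agent draws on the same tools.
TOOLS: Dict[str, Tool] = {
    "query_profiling_data": query_profiling_data,
    "search_code": search_code,
    "fetch_config_history": fetch_config_history,
}


@dataclass
class Skill:
    """Domain knowledge: which tools to call and how to read their output."""
    name: str
    tool_names: List[str]
    guidance: str

    def gather_context(self, target: str) -> List[str]:
        # Run each selected tool, in order, against the same target.
        return [TOOLS[name](target) for name in self.tool_names]

    def build_prompt(self, target: str) -> str:
        # Combine the guidance with the gathered context for the model.
        context = "\n".join(self.gather_context(target))
        return f"{self.guidance}\n\nContext:\n{context}"


# Defense and offense share the registry; only the skill differs.
REGRESSION_SKILL = Skill(
    name="mitigate_regression",
    tool_names=["query_profiling_data", "fetch_config_history", "search_code"],
    guidance="Find the change that raised resource usage and propose a fix.",
)
OPPORTUNITY_SKILL = Skill(
    name="implement_opportunity",
    tool_names=["query_profiling_data", "search_code"],
    guidance="Implement the described optimization in the hot function.",
)
```

Adding a new capability in this model means writing another `Skill` that selects from the existing registry, which is why new agents need little or no new data integration.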
Defense: Catching Regressions Before They Pile Up
FBDetect is Meta’s internal tool for performance regression detection. It can identify regressions as small as 0.005% in noisy production environments by analyzing time-series data of resource usage.
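To give a feel for what detecting a small shift in noisy time-series data involves, here is a deliberately simplified sketch. It is not FBDetect’s algorithm, which the text above does not describe; it only compares the mean of a baseline window against a recent window and asks whether the increase is large relative to sampling noise. The function names and the threshold of 3.0 are assumptions made for the example.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence


def relative_shift(before: Sequence[float], after: Sequence[float]) -> float:
    """Relative change in mean resource usage between two windows."""
    return (mean(after) - mean(before)) / mean(before)


def is_regression(
    before: Sequence[float],
    after: Sequence[float],
    z_threshold: float = 3.0,  # assumed cut-off, chosen for illustration
) -> bool:
    """Flag an increase in the mean that is large relative to noise."""
    # Standard error of the difference between the two window means.
    se = sqrt(stdev(before) ** 2 / len(before) + stdev(after) ** 2 / len(after))
    diff = mean(after) - mean(before)
    if se == 0:
        # No noise at all: any increase counts.
        return diff > 0
    return diff / se > z_threshold
```

Because the standard error shrinks with the square root of the number of samples, resolving very small shifts with this kind of test requires a large number of measurements, which is one reason fleet-wide data matters.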
When FBDetect finds a regression, the system immediately tries to trace the root cause back to a code or configuration change. This is the critical first step in understanding what happened, and it is done primarily through traditional techniques like correlating functions impacted by the regression with recent pull requests.
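A minimal sketch of that correlation step, assuming each recent pull request is annotated with the set of functions it touched. The `PullRequest` type, its fields, and the overlap-count scoring are hypothetical simplifications; a production system would weigh many more signals.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class PullRequest:
    pr_id: str                          # hypothetical identifier
    touched_functions: FrozenSet[str]   # functions modified by the change


def rank_candidates(
    regressed: Set[str], recent: List[PullRequest]
) -> List[PullRequest]:
    """Order pull requests by how many regressed functions they touched.

    Pull requests with no overlap are dropped. Ties keep their input order.
    """
    scored = [(len(regressed & pr.touched_functions), pr) for pr in recent]
    scored = [item for item in scored if item[0] > 0]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pr for _, pr in scored]
```

The first entry in the returned list is the most likely root cause under this simple heuristic, and it is the change an engineer or agent would inspect first.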
Once a root cause is determined, engineers are notified and expected to take action — such as optimizing the recent code change. But Meta added an extra layer to make this process much faster.
AI Regression Solver
The AI Regression Solver is the newest and most promising component of FBDetect. It automatically produces a pull request that fixes the regression without reverting the change that caused it. Traditionally, pull requests that caused performance regressions were either reverted, which slowed down engineering velocity, or simply ignored, which unnecessarily increased infrastructure resource usage.
Now, Meta’s internal coding agent is activated to follow three steps:
- Gather context with tools: find the symptoms of the regression, such as the functions that regressed, and look up the root cause (a pull request), including the exact files and lines that were changed.
- Apply domain knowledge with skills: use regression mitigation knowledge for that specific codebase, language, or regression type. For example, regressions caused by logging can be mitigated by sampling the log calls more aggressively, so that only a fraction of them are actually written.
- Create a resolution: produce a new pull request and send it to the original author of the root cause for review.
This automated workflow compresses what used to take hours into a process that can be completed in less than 30 minutes.
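The logging mitigation mentioned in the second step can be illustrated with a sampled logger, in which only a fraction of calls write an entry. This is a generic sketch, not code from Meta’s agent: `make_sampled_logger`, `emit`, and `sample_rate` are names invented for the example.

```python
import random
from typing import Callable, Optional


def make_sampled_logger(
    emit: Callable[[str], None],
    sample_rate: float,
    rng: Optional[random.Random] = None,
) -> Callable[[str], bool]:
    """Wrap `emit` so that only about `sample_rate` of calls reach it.

    `sample_rate` is the fraction of messages kept, between 0.0 and 1.0.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    if rng is None:
        rng = random.Random()

    def log(message: str) -> bool:
        # random() returns a float in [0.0, 1.0), so a rate of 1.0 always
        # emits and a rate of 0.0 never does.
        if rng.random() < sample_rate:
            emit(message)
            return True
        return False

    return log
```

With `sample_rate=0.01`, a hot code path that logged on every request would write roughly one entry per hundred requests, cutting most of the logging cost while still leaving a representative trail.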
Offense: Turning Opportunities Into Production-Ready Code
On the offensive side, so-called efficiency opportunities are conceptual proposals for code changes believed to improve the performance of existing code. Meta built a system where engineers can view an opportunity and request an AI-generated pull request that implements it. What used to require hours of investigation now takes minutes to review and ship to production.
The pipeline mirrors the AI Regression Solver from the defensive side:
- Gather context with tools: the AI agent fetches the opportunity metadata, documentation explaining the optimization pattern, examples showing how similar opportunities were resolved, the specific files and functions involved, and validation criteria to confirm the fix works.
- Apply domain knowledge with skills: use specialist engineering knowledge about that specific type of efficiency opportunity, encoded in a skill. For example, applying memoization to a function to reduce CPU usage.
- Create a resolution: produce a candidate fix with safeguards, verify syntax and style, confirm it addresses the correct problem, and present the generated code in the engineer’s editor, ready to be applied with a single click.
The crucial point is that the same tools from the defensive side are reused here: profiling data, documentation, and code search. The only thing that changes is the skills.
One Platform, Compounding Returns
The unified architecture, with shared tools and data sources, turned out to be an exceptionally clean abstraction. Every existing agent and every new agent has a straightforward way to gather performance context using the interfaces already built, without needing to reinvent the wheel.
Although the first use cases were performance regressions and efficiency opportunities, in less than a year the same foundation began powering additional applications: conversational assistants for efficiency questions, capacity planning agents, personalized opportunity recommendations, guided investigation workflows, and AI-assisted validation. Each new capability required little to no new data integration, since it only needed to compose existing tools with new skills.
This composition model is what makes the system especially powerful over the long run. As more skills are encoded and more agents are added, the platform’s value grows non-linearly — every new component benefits from everything that was built before it. 🔧
The Numbers That Prove the System’s Impact
Efficiency claims mean little without concrete numbers, and Meta provides them. The program recovered hundreds of megawatts of power that was previously being consumed unnecessarily. To put that in perspective, that is comparable to the power needed to supply hundreds of thousands of American homes for an entire year.
On the defensive front, FBDetect captures thousands of regressions per week. With faster automated resolution, less wasted power accumulates across the server fleet. On the offensive front, AI-assisted resolution of opportunities is expanding to more product areas each half-year, handling a growing volume of gains that engineers would never have had time to address manually.
When it comes to regression resolution time, the transformation was equally impressive. The process that used to consume around 10 hours of an engineer’s work now gets completed in roughly 30 minutes by the automated system. That represents a reduction of about 95% in average resolution time.
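The percentage follows directly from the two durations quoted above, taking 10 hours as 600 minutes:

```latex
\frac{600\,\text{min} - 30\,\text{min}}{600\,\text{min}} = \frac{570}{600} = 0.95
```

At exactly 30 minutes the reduction is 95%, and any resolution faster than that pushes it slightly higher.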
Multiply that saving by how often these regressions occur in an infrastructure the size of Meta’s, and the cumulative engineering hours recovered are enormous. That time was redirected toward initiatives of higher strategic value, accelerating the pace of internal innovation.
Offense and Defense Reinforce Each Other
The most profound shift driven by the program lies in how offense and defense started feeding into each other.
Engineers who used to spend their mornings on defensive triage now review AI-generated analyses in minutes. Engineers who use the efficiency tools can get AI-assisted code instead of starting from scratch. The intimidating question of “where do I even start?” has been replaced by reviewing and shipping high-impact fixes.
Together, these two fronts are what allow Meta’s Capacity Efficiency Program to keep growing the number of megawatts it saves without proportionally increasing team size. The ultimate goal is a self-sustaining efficiency engine where AI handles the long tail of problems. 🤝
What This Means for the Future of Engineering at Scale
Meta’s Capacity Efficiency Program is not just an internal solution for an internal problem. It is a clear signal of how artificial intelligence is beginning to transform the very way technology infrastructure is managed at a global scale.
For years, the dominant model was based on reactive monitoring: wait for the problem to surface, ping the team, and fix it. Intelligent automation is shifting that paradigm toward a predictive and proactive model, where the system anticipates problems, learns from past occurrences, and acts with increasing autonomy to keep infrastructure performance within optimal parameters.
This shift has deep implications for how companies think about the makeup of their engineering teams. If repetitive, structured tasks are being handled by AI, the most valuable skill profile changes. Engineers will increasingly need the ability to design and train these systems, interpret their outputs, and make strategic decisions based on automatically generated recommendations. Human work moves from the operational to the strategic — and that requires continuous adaptation from both people and organizations.
There is also a scalability angle that deserves special attention. What Meta built was designed to operate on one of the largest technology infrastructures in the world, but the principles behind the system are applicable at different scales. The combination of specialized AI agents, working in a coordinated way within a unified platform with reusable tools and interchangeable skills, is an architecture that can be adapted for companies of varying sizes and industries.
Meta’s learning, therefore, goes far beyond its own data centers — it contributes to a broader understanding of how to build engineering systems that are truly intelligent, efficient, and resilient for the future. 🌐
