Why more AI agents don’t always mean better results

Multi-Agent is probably the most hyped term in the artificial intelligence universe in 2025, and it’s not hard to see why. The idea of having multiple AI agents working together, each handling a specific part of the problem, sounds brilliant in theory. And in some cases, it genuinely is. Klarna, for example, built a multi-agent architecture using LangGraph that handles 2.3 million conversations per month — the equivalent of 700 full-time human agents. Resolution time dropped from 11 minutes to under 2, repeat inquiries decreased by 25%, customer satisfaction jumped 47%, and the cost per support transaction went from $0.32 to $0.19. The total projected savings by the end of 2025 hover around $60 million. That kind of result makes any tech leader want to replicate the strategy immediately, but the reality is far more complex than it looks at first glance.

The cold splash of water comes straight from Gartner, which predicts that more than 40% of agentic AI projects will be canceled by the end of 2027. We’re not talking about projects paused or reduced in scope — we’re talking about outright cancellations, driven by runaway costs, unclear business value, and inadequate risk controls. Same technology. Same year. Completely opposite outcomes. That statistic shows there’s a massive gap between putting together a fun prototype with three agents chatting with each other and deploying a multi-agent system in production that actually delivers consistent value for the business.

The cascade effect: when errors multiply by 17x

In December 2025, a Google DeepMind team led by Yubin Kim rigorously tested the premise that more agents produce better results. They ran 180 different configurations across 5 agent architectures and 3 LLM (Large Language Model) families. The finding should be taped to every AI team’s monitor:

Unstructured multi-agent networks amplify errors by up to 17.2x compared to a single-agent baseline.

Not 17% worse. Seventeen times worse.

When agents are thrown together without a defined topology — what the study calls a bag of agents — each agent’s output becomes the next one’s input. The errors don’t cancel out. They cascade. Imagine a pipeline where Agent 1 extracts the customer’s intent from a support ticket. It confuses a billing dispute with a simple billing inquiry — a subtle difference, right? Agent 2 pulls the wrong response template. Agent 3 generates a reply addressing the wrong problem. Agent 4 sends it. The customer responds frustrated. The system processes the angry response through the same broken chain. Each loop amplifies the original misinterpretation. That’s the 17x effect in practice: not a catastrophic, obvious failure, but a silent accumulation of small errors that produces nonsense dressed up as confidence.

The same study found a saturation threshold: coordination gains plateau after 4 agents. Below that number, adding agents to a structured system helps. Above it, coordination overhead eats into the benefits.

This isn’t an isolated finding. The MAST (Multi-Agent Systems Failure Taxonomy) study, published in March 2025, analyzed 1,642 execution traces across 7 open-source frameworks. Failure rates ranged from 41% to 86.7%. The largest failure category: coordination breakdowns, responsible for 36.9% of all recorded failures.

The compound reliability problem

This is the math that most architecture documents simply ignore, and it can make or break your project.

A single agent completes a step with 99% reliability. Sounds excellent, right? Now chain 10 sequential steps: 0.99 raised to the 10th power gives you 90.4% overall reliability. Still acceptable.

But if it drops to 95% per step — which is still a strong number for most AI tasks — ten steps yield only 59.9% total reliability. Twenty steps? A mere 35.8%.

You started with agents that get it right 19 out of 20 times. You ended up with a system that fails nearly two-thirds of the time. 😬
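
The arithmetic is easy to verify yourself. A minimal sketch in Python, using the illustrative per-step rates from above:

```python
def chain_reliability(per_step: float, steps: int) -> float:
    """End-to-end success rate of a sequential agent chain."""
    return per_step ** steps

for per_step in (0.99, 0.95):
    for steps in (10, 20):
        print(f"{per_step:.0%} per step, {steps} steps -> "
              f"{chain_reliability(per_step, steps):.1%} end-to-end")

# 99% per step, 10 steps -> 90.4% end-to-end
# 99% per step, 20 steps -> 81.8% end-to-end
# 95% per step, 10 steps -> 59.9% end-to-end
# 95% per step, 20 steps -> 35.8% end-to-end
```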

Token costs follow the same multiplicative logic. A document analysis workflow consuming 10,000 tokens with a single agent can require 35,000 tokens in a 4-agent implementation. That’s a 3.5x cost multiplier before even accounting for retries, error handling, and coordination messages between agents.

This is why Klarna’s architecture works and most copies of it don’t. The difference isn’t the number of agents. It’s the topology — the way agents are organized and communicate with each other.

Three multi-agent architecture patterns that work in production

Instead of asking how many agents you need, the smarter question is: what would guarantee failure when building a multi-agent system? The research answers clearly. Chaining agents without structure. Ignoring coordination overhead. Treating every problem as a multi-agent problem when a single well-configured agent would handle it just fine.

Three patterns avoid these failure modes. Each one serves a different task format.

Plan-and-Execute

A more capable model creates the full plan. Cheaper, faster models execute each step. The planner handles reasoning; the executors handle action.

This is essentially what Klarna runs. A frontier model analyzes the customer’s intent and maps out resolution steps. Smaller models execute each step: pulling account data, processing refunds, generating responses. The planning model touches the task once. Execution models handle the volume. The cost impact is significant: routing planning to a capable model and execution to cheaper models can reduce costs by up to 90% compared to using frontier models for everything.
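
In code, the pattern is just two tiers of model calls. A minimal sketch with the OpenAI Python client; the model names are placeholders and the plan parser is deliberately naive:

```python
from openai import OpenAI

client = OpenAI()
PLANNER_MODEL = "gpt-4.1"        # placeholder: any capable frontier model
EXECUTOR_MODEL = "gpt-4.1-mini"  # placeholder: any cheap, fast model

def parse_steps(plan: str) -> list[str]:
    # Naive parser: one step per non-empty line of the numbered plan.
    return [line.strip() for line in plan.splitlines() if line.strip()]

def plan_and_execute(task: str) -> list[str]:
    # The expensive model touches the task exactly once: to produce the plan.
    plan = client.chat.completions.create(
        model=PLANNER_MODEL,
        messages=[{"role": "user",
                   "content": f"Break this task into numbered steps:\n{task}"}],
    ).choices[0].message.content

    # Cheap models handle the volume: one call per step.
    return [
        client.chat.completions.create(
            model=EXECUTOR_MODEL,
            messages=[{"role": "user", "content": f"Execute this step:\n{step}"}],
        ).choices[0].message.content
        for step in parse_steps(plan)
    ]
```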

When it works: tasks with clear objectives that decompose into sequential steps. Document processing, customer service workflows, research pipelines.

When it breaks: environments that change during execution. If the original plan becomes invalid midway through, you need replanning checkpoints or an entirely different pattern.

Supervisor-Worker

A supervisor agent manages routing and decisions. Worker agents handle specialized subtasks. The supervisor splits requests, delegates, monitors progress, and consolidates results.

The Google DeepMind research validates this pattern directly. A centralized control plane suppresses the 17x error amplification that bag-of-agents networks produce. The supervisor acts as a single coordination point, preventing situations where, for instance, a support agent approves a refund while a compliance agent simultaneously blocks the same request.
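
Stripped of any framework, the pattern is a single routing function in front of specialized workers. A sketch with the router and workers stubbed out as plain functions:

```python
from typing import Callable

def classify(request: str) -> str:
    # Stub router: a real supervisor would use a model or rules here.
    return "billing" if "refund" in request.lower() else "support"

def escalate_to_human(request: str) -> str:
    return f"[escalated to human] {request}"

# Each worker owns one specialization; implementations are stubs.
WORKERS: dict[str, Callable[[str], str]] = {
    "billing":    lambda req: f"[billing worker handled] {req}",
    "compliance": lambda req: f"[compliance worker checked] {req}",
    "support":    lambda req: f"[support worker answered] {req}",
}

def supervise(request: str) -> str:
    """Single coordination point: classify, delegate, consolidate."""
    domain = classify(request)
    worker = WORKERS.get(domain)
    return worker(request) if worker else escalate_to_human(request)

print(supervise("I want a refund for my last invoice"))
```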

When it works: heterogeneous tasks requiring different specializations. Customer support with escalation paths, content pipelines with review stages, financial analysis combining multiple data sources.

When it breaks: when the supervisor becomes a bottleneck. If every decision funnels through a single agent, you’ve recreated the monolith you were trying to escape. The fix: give workers limited autonomy for decisions within their domain and escalate only edge cases.

Swarm (Decentralized Handoffs)

No supervisor. Agents pass the ball to each other based on context. Agent A handles triage, identifies the request as a billing issue, and transfers to Agent B, the billing specialist. Agent B resolves it or, if needed, transfers to Agent C for escalation.

OpenAI’s original Swarm framework was educational only — they made that explicit in the README. The Agents SDK, the production version released in March 2025, implements this pattern with guardrails: each agent declares its handoff targets, and the framework ensures transfers follow declared paths.
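
A sketch of that triage flow using the Agents SDK’s documented primitives (the openai-agents package); treat it as an outline and check the current docs for exact signatures:

```python
# pip install openai-agents
from agents import Agent, Runner

billing = Agent(
    name="Billing specialist",
    instructions="Resolve billing issues. Escalate anything you cannot resolve.",
)

triage = Agent(
    name="Triage",
    instructions="Classify the request and hand off to the right specialist.",
    handoffs=[billing],  # declared handoff targets: transfers follow this list
)

result = Runner.run_sync(triage, "I was charged twice this month.")
print(result.final_output)
```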

When it works: high-volume, well-defined workflows where routing logic is embedded in the task itself. Chat-based customer support, multi-step onboarding, triage systems.

When it breaks: complex handoff graphs. Without a supervisor, debugging why a user ended up at Agent F instead of Agent D requires production-grade observability tooling. If you don’t have distributed tracing, don’t use this pattern.

Which multi-agent framework to use

Three frameworks dominate multi-agent system deployments in production right now. Each one reflects a different philosophy on how agents should be organized.

LangGraph uses graph-based state machines. With 34.5 million monthly downloads, it offers typed state schemas that enable precise checkpointing and inspection. It’s what Klarna runs in production. The best choice for stateful workflows where you need human-in-the-loop intervention, branching logic, and durable execution. The trade-off: a steeper learning curve than the alternatives.

CrewAI organizes agents as role-based teams. With 44,300 GitHub stars and growing, it offers the lowest barrier to entry: define agent roles, assign tasks, and the framework handles coordination. It deploys teams roughly 40% faster than LangGraph for straightforward use cases. The trade-off: limited support for cycles and complex state management.

OpenAI Agents SDK provides lightweight primitives — Agents, Handoffs, and Guardrails. It’s the only major framework with equal support for Python and TypeScript/JavaScript. Clean abstraction for the Swarm pattern. The trade-off: tighter coupling with OpenAI’s models.

One protocol worth knowing: the Model Context Protocol (MCP) has become the de facto standard for agent tool interoperability. Anthropic donated the protocol to the Linux Foundation in December 2025, where it now lives under the Agentic AI Foundation, co-founded by Anthropic, Block, and OpenAI. More than 10,000 active public MCP servers exist today. All three frameworks above support the protocol. If you’re evaluating tools, MCP compatibility is a baseline requirement.

If you’re unsure where to start, the combination of Plan-and-Execute with LangGraph is the most battle-tested. It covers the widest range of use cases. And switching patterns later is a reversible decision — you don’t have to get it right on the first try.
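
For orientation, here is what the skeleton of a Plan-and-Execute graph looks like in LangGraph; a minimal sketch with the node bodies stubbed out:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class State(TypedDict):
    task: str
    plan: list[str]
    results: list[str]

def plan(state: State) -> dict:
    # Stub: a capable model would decompose state["task"] here.
    return {"plan": [f"step 1 of {state['task']}", f"step 2 of {state['task']}"]}

def execute(state: State) -> dict:
    # Stub: cheap models would run each planned step here.
    return {"results": [f"done: {s}" for s in state["plan"]]}

builder = StateGraph(State)
builder.add_node("plan", plan)
builder.add_node("execute", execute)
builder.add_edge(START, "plan")
builder.add_edge("plan", "execute")
builder.add_edge("execute", END)

graph = builder.compile()
print(graph.invoke({"task": "summarize a contract", "plan": [], "results": []}))
```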

Five failure modes that kill projects in production

The MAST study identified 14 failure modes across 3 categories. The five below account for most production failures. Each one includes a specific prevention measure you can implement before your next deploy.

Compound reliability decay

Calculate end-to-end reliability before going to production. Multiply per-step success rates across the entire chain. If the number drops below 80%, shorten the chain or add verification checkpoints. The recommendation is to keep chains to 5 sequential steps or fewer, and to insert a verification agent at steps 3 and 5 that validates output quality before passing it along. If verification fails, route to a human or an alternative path, not to a retry of the same chain.
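
One way to wire that in; run_step, verify, and route_to_human are hypothetical stand-ins for your step agents, verification agent, and escalation path:

```python
def run_step(step: str, payload: str) -> str:
    # Stub: one agent call per step in production.
    return f"{payload} -> {step}"

def verify(output: str) -> bool:
    # Stub: a verification agent scoring output quality in production.
    return bool(output)

def route_to_human(task: str, failed_at: int) -> str:
    return f"[escalated at step {failed_at}] {task}"

def run_chain(task: str, steps: list[str], checkpoints=(3, 5)) -> str:
    """Sequential chain with verification checkpoints; no blind retries."""
    output = task
    for i, step in enumerate(steps, start=1):
        output = run_step(step, output)
        if i in checkpoints and not verify(output):
            # Verification failed: route out, never retry the same chain.
            return route_to_human(task, failed_at=i)
    return output
```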

Coordination tax

This failure mode accounts for 36.9% of all failures in multi-agent systems. When two agents receive ambiguous instructions, they interpret them differently. A support agent approves a refund while a compliance agent blocks it. The user gets contradictory signals. Prevention requires explicit input and output contracts between each pair of agents. Define the data schema at every boundary and validate. No implicit shared state. If Agent A’s output feeds Agent B, both need to agree on the format before deploy, not at runtime.
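
Contracts are cheap to enforce in code. A sketch using Pydantic to validate a hypothetical Agent A to Agent B boundary:

```python
from pydantic import BaseModel, ValidationError

class RefundDecision(BaseModel):
    """Contract for the Agent A -> Agent B boundary. Both sides import this."""
    ticket_id: str
    approved: bool
    amount_cents: int
    reason: str

def handoff_to_agent_b(raw_output: dict) -> RefundDecision:
    try:
        return RefundDecision(**raw_output)  # validate at the boundary
    except ValidationError as err:
        # A schema mismatch is a deploy-time bug, not a runtime patch.
        raise RuntimeError(f"Agent A broke the contract: {err}") from err
```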

Cost explosion

Token costs multiply across agents — 3.5x in documented cases. Retry loops can burn through over $40 in API fees within minutes, with nothing useful to show for it. Prevention includes setting hard token budgets per agent and per workflow. Implement circuit breakers: if an agent exceeds its budget, halt the workflow and surface an error instead of retrying endlessly. Log cost per completed workflow to catch regressions early.
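
The circuit breaker can be a plain counter that every model call passes through. A minimal sketch with an illustrative budget:

```python
class BudgetExceeded(Exception):
    pass

class TokenBudget:
    """Hard token ceiling per workflow: halt instead of retrying endlessly."""
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, tokens: int) -> None:
        self.used += tokens
        if self.used > self.max_tokens:
            raise BudgetExceeded(
                f"workflow spent {self.used} of {self.max_tokens} tokens")

budget = TokenBudget(max_tokens=50_000)  # illustrative ceiling
budget.charge(35_000)  # e.g. usage reported by a model response
try:
    budget.charge(20_000)
except BudgetExceeded as err:
    print(f"circuit open: {err}")  # halt the workflow and surface the error
```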

Security gaps

The OWASP Top 10 for LLM Applications found prompt injection vulnerabilities in 73% of evaluated production deployments. In multi-agent systems, a compromised agent can propagate malicious instructions to every downstream agent. Prevention requires input sanitization at every boundary between agents, not just at the entry point. Treat inter-agent messages with the same distrust you’d apply to external user input. Run a red team exercise against your agent chain before going live.
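
Even a deliberately naive screen at each boundary beats nothing. A sketch that flags common injection phrasing in inter-agent messages; a real deployment would use a guardrail model or classifier rather than regexes:

```python
import re

# Naive screen: patterns commonly seen in prompt-injection attempts.
SUSPICIOUS = re.compile(
    r"ignore (all )?previous instructions|disregard your system prompt",
    re.IGNORECASE,
)

def screen_message(message: str, source_agent: str) -> str:
    """Treat inter-agent messages like external input: screen every boundary."""
    if SUSPICIOUS.search(message):
        raise ValueError(
            f"possible injection from {source_agent}; quarantine for review")
    return message
```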

Infinite retry loops

Agent A fails. Retries. Fails again. In multi-agent systems, Agent A’s failure triggers Agent B’s error handling, which calls Agent A again. The loop runs until the budget is gone. Prevention: a maximum of 3 retries per agent per workflow execution. Exponential backoff between retries. Dead-letter queues for tasks that fail beyond the retry limit. And an absolute rule: never allow an agent to trigger another without a cycle check in the orchestration layer.
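
A sketch of that discipline; call_agent is a hypothetical stand-in for one agent invocation:

```python
import time

MAX_RETRIES = 3
dead_letter_queue: list[dict] = []  # tasks that exceed the retry limit

def call_agent(task: str) -> str:
    # Hypothetical agent call; replace with your framework's invocation.
    raise TimeoutError("agent unavailable")

def run_with_retries(task: str) -> str | None:
    for attempt in range(MAX_RETRIES):
        try:
            return call_agent(task)
        except Exception:
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s
    # Beyond the limit: park the task, never hand it to another agent blindly.
    dead_letter_queue.append({"task": task, "reason": "retry limit exceeded"})
    return None
```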

Tool versus worker: the $60 million difference

In February 2026, the National Bureau of Economic Research (NBER) published a study surveying nearly 6,000 executives across the United States, United Kingdom, Germany, and Australia. The finding: 89% of companies reported zero productivity change from AI. 90% of managers said AI had no impact on employment. These companies were using AI an average of 1.5 hours per week per executive.

Fortune called it a resurrection of Robert Solow’s 1987 paradox: you can see the computer age everywhere except in the productivity statistics. History repeating itself forty years later with a different technology and the same pattern.

The 90% that saw no impact deployed AI as a tool. The companies that saved millions deployed AI as workers.

The contrast with Klarna isn’t about better models or more compute power. It’s a structural choice. The 90% treated AI as a copilot — a tool assisting a human in the loop, used 1.5 hours a week. The companies with real returns, like Klarna, Ramp, and Reddit via Salesforce Agentforce, treated AI as a workforce: autonomous agents executing structured workflows with human oversight at decision points, not at every step.

This isn’t a technology gap. It’s an architecture gap. The opportunity cost is staggering: the same engineering budget producing zero ROI on one side versus $60 million in savings on the other. The variable isn’t how much you spend. It’s how you structure it.

How to choose the right path for your scenario

The decision between using a single agent or a multi-agent architecture shouldn’t be driven by hype — it should be guided by well-defined technical and business criteria. First ask whether the problem you’re trying to solve requires multiple distinct specializations that can’t be combined in a single prompt or context. If the answer is no, start simple.

Also ask what operational cost per request is acceptable and what maximum latency the end user will tolerate. Each additional agent in the system increases both metrics. Without clarity on these limits from the start, the project runs a serious risk of becoming another entry in Gartner’s 40% statistic.

Invest in observability from day zero. Before writing the first agent, define how you’ll monitor every call to the language model, every message exchange between agents, and every routing decision. Tools like LangSmith and distributed tracing platforms create the visibility layer that’s essential for identifying errors, optimizing performance, and controlling costs in production. Without it, you’re flying blind — it might work for a while, but when turbulence hits, and it will, you won’t have instruments to navigate. 🧭

Adopt a gradual evolution mindset. Start with the minimum number of agents needed, measure real results — not just technical metrics, but business impact — and add complexity only when the data justifies it.

More than 40% of agentic AI projects will be canceled by the end of 2027; the rest will have to earn their place in production. The difference won’t be which LLM they chose or how much they spent on compute. It will be whether they understood the right architecture patterns, ran the compound reliability math, and built their systems to withstand the failure modes that take down everything else.

Klarna didn’t deploy 700 agents to replace 700 humans. They built a structured multi-agent system where a smart planner routes work to cheap executors, where every handoff has an explicit contract, and where the architecture was designed to fail gracefully instead of cascading. The same patterns, the same frameworks, and the same failure data are available to every team. What you build with them is the only variable left. 🎯
