SHARE:

AI agents are redefining what it means to use artificial intelligence in everyday life.

Just two years ago, most people interacted with AI models basically the way they would with a smarter search bar: ask a question, get an answer.

Simple as that.

But that landscape has changed — and fast.

Today, tools like Claude Code and Claude Cowork from Anthropic allow AI to write and run code, manage files, navigate corporate systems, and complete tasks that span multiple applications, almost without needing someone holding its hand the whole time.

It is a real game changer. 🔑

And with it come impressive productivity gains — but also a whole new set of responsibilities and risks that did not exist when AI only answered questions.

When an agent acts autonomously, with less human oversight, the room for misinterpretation grows. New vulnerabilities emerge, like prompt injection attacks, where malicious instructions are hidden inside content the agent processes.

And as these systems become more powerful and companies trust them with increasingly important actions, both types of risk are likely to intensify.

How do you make sure AI agents are truly trustworthy?

That is exactly what Anthropic is trying to answer with its framework for building trustworthy agents, originally published last August and now detailed in practical terms. The framework rests on five core principles: keeping humans in control, aligning behavior with human values, protecting agent interactions, maintaining transparency, and preserving privacy.

From the technical architecture that makes an agent work to the practical principles of security, human control, and transparency, all the way to what goes beyond any single company and involves the entire industry — we are going to explore each of these layers here.

Receive the best innovation content in your email.

All the news, tips, trends, and resources you're looking for, delivered to your inbox.

By subscribing to the newsletter, you agree to receive communications from Método Viral. We are committed to always protecting and respecting your privacy.

Let’s break down what is at stake. 👇

What makes an AI agent different from a regular chatbot

The difference between a traditional chatbot and an AI agent goes way beyond the name. A chatbot responds. An agent acts. And that distinction completely changes how we need to think about these systems.

Anthropic defines an agent as an AI model that directs its own processes and tool use while completing a task — meaning it decides on its own how to achieve what the user wants, rather than following a fixed script. In practice, that means the agent operates in a self-directed loop: it plans, executes an action, observes the result, adjusts the plan, and repeats this process until the task is done or it needs to check in with the human.

To make this more concrete, think of a real example. If you asked Claude in Claude Cowork to submit receipts from a business trip, it would plan the steps one by one — transcribe each receipt photo, extract the amount and vendor, categorize the expense, and submit it through your company system. If a hotel charge got rejected for exceeding the daily limit, Claude would not only notice the submission failed but would also recognize that it does not know what the allowed cap is or what other rules might apply. At that point, it would pause to ask whether it should look up the expense policy on the company shared drive before trying again. With your approval, it would incorporate what it learned into the plan and move forward.

That level of integration with the real world — where actions have consequences outside the chat window — is what defines a modern agent and is also what demands a completely new approach to security and governance.

The anatomy of an agent: four essential components

To understand where both the capabilities and the risks of an agent come from, you need to look at its four fundamental components. Each one is simultaneously a source of power and a potential vulnerability:

  • The model: this is the intelligence that makes everything possible. That intelligence is the product of the training process, which shapes both what the model knows and how it reasons and behaves.
  • The harness: these are the instructions and guardrails under which the model operates. In the receipts example, the harness could instruct Claude to flag any amount over one hundred dollars or to never submit expenses without user confirmation.
  • The tools: these are the services and applications the model can use — your email, calendar, expense software, and so on. Without tools, Claude can read the receipt but cannot file it.
  • The environment: this is where the agent runs — whether it is set up in Claude Code, Claude Cowork, or another product — and which files, websites, or systems it can access. The same agent on a corporate laptop inside an enterprise network will have completely different access and risks than it would on a personal phone.

Most of the conversation around AI policy today revolves around the model, which is understandable — that is where the core capabilities come from. But an agent’s behavior depends on all four layers working together. A well-trained model can still be exploited through a poorly configured harness, a tool with excessive permissions, or an exposed environment. That is why protections need to cover all of these levels.

Real autonomy brings real risks

The more autonomy an agent has, the larger the attack surface available for something to go wrong — whether through a genuine misinterpretation or an intentional malicious exploit.

One of the most discussed risks in the artificial intelligence field right now is called prompt injection, and it works in a pretty insidious way: malicious instructions are embedded inside content the agent will process during task execution. Imagine an agent that reads emails to manage your calendar — a carefully crafted email could contain hidden instructions ordering the agent to forward confidential information to an external address. The agent, without realizing it, would execute the instruction as if it were legitimate.

Anthropic acknowledges that as models become more capable, the understanding of prompt injection has also evolved considerably — both in terms of how attacks work and why no single line of defense is enough to guarantee total protection. The more open an agent’s environment is, the more entry points exist. The more tools it can use, the more an attacker can accomplish once they gain access.

That is why the company builds defenses in multiple layers: it trains the model to recognize injection patterns, monitors production traffic to block attacks in the real world, and relies on external red-teamers to test and stress its systems.

Even so, even when combined, these protections are not an absolute guarantee. That is why Anthropic recommends that customers themselves think carefully about which tools and data they make available to an agent, which permissions they grant, and in which environments they allow it to operate. Irreversible actions — like permanently deleting data or sending messages on the user’s behalf — should have a human confirmation checkpoint before being executed.

That balance between efficiency and control is one of the major design challenges the entire artificial intelligence industry faces right now.

Principles in practice: how Anthropic makes product decisions

Building agents that are both useful and trustworthy requires very careful product decisions. Anthropic’s framework establishes five principles to guide those choices. Let’s look at three of them with concrete examples: human control, alignment with user expectations, and security. The other two — transparency and privacy — run through all the rest.

Designed to keep humans in control

The central tension with agents is clear: to be useful, they need to operate with autonomy, but to be safe, humans need to maintain meaningful control over how they work.

The most direct way for a user to maintain control over Claude is by deciding what it can and cannot do. In Claude.ai and Claude Desktop, users choose which tools to activate and configure permissions — like always allow, needs approval, or block — for each action Claude takes. This allows you, for instance, to define that it is always safe for Claude to read your calendar but still require approval before sending an invite to someone.

This approach works well for simple tasks. But when a task involves dozens of actions, repeated approval requests become a source of friction, and users end up ignoring those prompts out of fatigue. To bridge that gap, Anthropic introduced a feature in Claude Code called Plan Mode. Instead of asking for approval action by action, Claude presents the user with its complete action plan upfront. The user can review, edit, and approve everything before anything happens — and can still intervene at any point during execution. This shifts the level of oversight from the individual step to the overall strategy, which tends to be the point where users most want to exercise their judgment.

There are also more complex usage patterns emerging. Increasingly, agents in products like Claude Code delegate part of the work to subagents — other Claudes working in parallel on different parts of a task. Subagents raise new questions about how users can understand and direct workflows that are no longer visible as a linear sequence of actions. Anthropic is exploring different coordination patterns to address this, and the insights gained will feed into the oversight design for this next generation of agents.

Helping agents truly understand their goals

Making sure agents pursue the right goals, in the way users actually want, is one of the hardest unsolved problems in agent development. An agent can only act in line with what the user really wants if it knows when to stop and ask for clarification — whether it is uncertain or about to make a mistake.

While working on a task, an agent frequently runs into situations its initial plan did not cover. It can resolve many of those gaps on its own — by researching information it needs, for example. But others will be questions of preference or intent that only the user can resolve. The challenge for Anthropic is helping its models recognize which is which, finding the right balance between pausing too much and not pausing enough. An agent that stops at every possible doubt gives up most of the autonomy that makes it useful. One that always pushes forward risks misinterpreting what the user actually wanted.

Anthropic tackles this problem from multiple angles during Claude’s training. First, it builds training scenarios that place the model in ambiguous situations and reinforces the choice to pause rather than assume. Second, Claude’s Constitution — which directly influences how models are trained — reinforces a similar instinct, favoring raising concerns, seeking clarification, or refusing to proceed over acting on assumptions.

Anthropic’s research on agent usage illustrates the impact of this training well. On complex tasks, users interrupt Claude only slightly more often than on simple tasks, but Claude’s own rate of checking in nearly doubles. This shows the importance of calibrating agents on the decision of when to act and when to hand the decision back to the human.

Security and transparency as non-negotiable pillars

Within the architecture of modern AI agents, security and transparency are not optional features you bolt on afterward — they are foundations that need to be present from the very beginning of the system’s design.

Transparency enters this picture in a very practical way: if an agent cannot explain what it is doing or why it is making a particular decision, it is impossible for a human to effectively oversee its behavior. That is why well-designed systems include mechanisms for detailed logging, audit trails, and in more critical cases, the ability for the agent itself to flag when it encounters an ambiguous or potentially risky situation and request human guidance before moving forward.

That behavior — stopping and asking instead of guessing — is a sign of maturity in an agent’s design and significantly reduces the risk of unintended consequences.

Tools we use daily

There is also a broader dimension here that goes beyond any specific company. Building security and transparency standards for autonomous agents is a challenge that the entire artificial intelligence industry needs to tackle collaboratively. Transparency needs to be not just a technical feature but a cultural commitment from those who build these tools.

What the broader ecosystem can do

The measures described so far represent what Anthropic can do within its own products. But the security and reliability of agents cannot be achieved by any single company alone. The question for the whole ecosystem is how to create the conditions for companies to experiment with agents and for developers to keep building safely.

There are a few fronts where the industry, standards bodies, and governments can make meaningful contributions:

Standardized benchmarks

Right now, there is no rigorous, standardized way to compare agent systems in terms of resistance to prompt injection or how reliably they flag uncertainties. Each company tests its own systems using its own methods, and no results are independently verified. Standards bodies like NIST, working alongside industry groups, are well positioned to maintain shared benchmarks and encourage a broader ecosystem of third-party evaluation.

Evidence sharing

Anthropic has already published extensively on how Claude is used as an agent and where it struggles, and hopes this practice becomes common across the field. The more developers share this kind of evidence, the more complete the picture policymakers will have of how agents are actually being used.

Open standards

Anthropic created the Model Context Protocol (MCP) as an open standard for how models communicate with external data sources and tools — and later donated the protocol to the Linux Foundation’s Agentic AI Foundation, so it belongs to the broader community. The logic behind this is that open protocols allow security properties to be designed into the infrastructure once, rather than patched individually with every deployment. Open standards also keep competition focused on agent quality and safety, rather than on who controls the integrations.

None of these measures replaces the work that model developers need to do to build safe agents, but this is the kind of infrastructure no single company can build on its own.

The near future of agents and the role of human control

The debate over how much autonomy to grant an AI agent does not have a single answer — and it probably never will. The appropriate level of independence depends on the context, the criticality of the task, the maturity of the system, and how well the human operator understands what is happening behind the scenes. An agent that manages calendar reminders can operate with far more freedom than one that has access to financial systems or critical infrastructure. That calibration between efficiency and control will be an ongoing exercise as the technology advances and use cases multiply.

What is becoming increasingly clear is that the future of AI agents is not about replacing humans but about building systems where humans and agents work together seamlessly — with the human in the role of strategic supervisor and the agent in the role of a highly capable tactical executor. For that partnership to work well, transparency about what the agent is doing needs to be constant, and human intervention mechanisms need to be simple, fast, and effective.

Agents are going to reshape the way people work. Whether that happens on a secure and open foundation depends on how the industry, civil society, and governments build that foundation together.

The evolution of AI agents is already happening, with or without consensus on best practices. What changes is the speed at which the industry can establish solid foundations of security, transparency, and accountability before systems become too complex to be easily audited. That is the real challenge of the moment — and how it gets resolved will define not just the future of artificial intelligence, but the relationship society will have with this technology for decades to come. 🤖

Picture of Rafael

Rafael

Operations

I transform internal processes into delivery machines — ensuring that every Viral Method client receives premium service and real results.

Fill out the form and our team will contact you within 24 hours.

Related publications

AI SDR Agent on WhatsApp: How SMBs Can Cut Costs and Scale Sales

Respond 21x faster your leads and scale your sales operation with a fraction of the cost of expanding your sales

Robot Detects Unusual Browser Activity Using JavaScript and Cookies

Learn why sites require JavaScript and cookies for unusual activity and how to fix blocks with quick, simple steps

Productivity with Agentic Artificial Intelligence in execution and workflows.

Agentic AI: how to operationalize AI agents to improve workflows, metrics, and governance, turning pilots into real productivity gains.

Receive the best innovation content in your email.

All the news, tips, trends, and resources you're looking for, delivered to your inbox.

By subscribing to the newsletter, you agree to receive communications from Método Viral. We are committed to always protecting and respecting your privacy.

Rafael

Online

Atendimento

Calculadora Preço de Sites

Descubra quanto custa o site ideal para seu negócio

Páginas do Site

Quantas páginas você precisa?

4

Arraste para selecionar de 1 a 20 páginas

📄

⚡ Em apenas 2 minutos, descubra automaticamente quanto custa um site em 2026 sob medida para o seu negócio

👥 Mais de 0+ empresas já calcularam seu orçamento

Fale com um consultor

Preencha o formulário e nossa equipe entrará em contato.