
If you have ever worked with AI agents, you know that the feeling of relief after a successful demo can be pretty short-lived.

The agent impressed everyone during testing, passed every simulated scenario, and seemed ready for the real world.

Then production hit, and the story changed completely. 😅

Real users ran into wrong tool calls, inconsistent responses, and failures nobody had imagined during development.

That gap between expected behavior and the actual user experience is one of the biggest challenges for anyone working with language models today.

And the problem has a very specific root cause: LLMs are non-deterministic.

That means the same user query can trigger different tool selections, distinct reasoning paths, and completely variable outputs depending on the run. A single passing test shows what can happen, not what typically happens. Without systematically measuring those variations, teams get stuck in cycles of manual testing and reactive debugging, burning through API costs without really knowing whether changes are actually improving the agent.

This is exactly the problem AWS built Amazon Bedrock AgentCore Evaluations to solve — a fully managed service focused on agent performance evaluation across the entire lifecycle, from development through production.

In this article, you will learn how this service works in practice, which approaches it uses to measure agent quality, and how it can transform the way AI teams make decisions based on real data instead of guesswork. 🚀

Why AI agent evaluation demands a completely new approach

Before talking about the solution, it is worth understanding the problem a bit better. When a user sends a request to an agent, multiple decisions happen in sequence. The agent determines which tools to call, executes those calls, and generates a response based on the results. Each step introduces potential failure points: selecting the wrong tool, calling the right tool with incorrect parameters, or synthesizing tool outputs into an inaccurate final response.

Unlike traditional applications where you test the output of a single function, agent evaluation requires measuring quality across that entire interaction flow. This creates specific challenges that agent developers have to work through:

  • Defining evaluation criteria for what constitutes a correct tool selection, valid parameters, an accurate response, and a useful experience for the user.
  • Building test datasets that represent real user requests and expected behaviors.
  • Choosing scoring methods that can assess quality consistently across repeated runs.

Each of those definitions directly determines what the evaluation system measures, and getting them wrong means optimizing for the wrong outcomes. Without this foundational work, the gap between what teams expect their agents to do and what they can prove they do becomes a real business risk.

Closing that gap requires a continuous evaluation cycle. Teams build test cases, run them against the agent, score the results, analyze failures, and implement improvements. Each failure becomes a new test case, and the cycle continues with every agent iteration. However, running this cycle end to end requires significant infrastructure beyond the evaluation logic itself. Teams need to curate datasets, select and host scoring models, manage inference capacity and API limits, build data pipelines that transform agent traces into evaluation-ready formats, and create dashboards to visualize trends.

For organizations running multiple agents, that overhead multiplies with each one. The result is that teams end up spending more time maintaining evaluation tooling than acting on what it tells them. This is exactly the problem Amazon Bedrock AgentCore Evaluations was built to solve.

What is Amazon Bedrock AgentCore Evaluations

Initially launched in public preview at AWS re:Invent 2025, the service is now generally available. It manages the evaluation models, inference infrastructure, data pipelines, and scalability so teams can focus on improving agent quality instead of building and maintaining evaluation systems. For built-in evaluators, model quota and inference capacity are fully managed. This means organizations evaluating many agents are not consuming their own quotas or provisioning separate infrastructure for evaluation workloads.

AgentCore Evaluations examines agent behavior end to end using OpenTelemetry (OTEL) traces with generative AI semantic conventions. OTEL is an open source observability standard for collecting distributed traces from applications. The generative AI semantic conventions extend that standard with fields specific to language model interactions, including prompts, completions, tool calls, and model parameters. Because it is built on this standard, the service works consistently with agents built on any framework — such as Strands Agents or LangGraph — instrumented with OpenTelemetry and OpenInference.
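
To make this concrete, here is a minimal sketch of manual instrumentation with the OpenTelemetry Python SDK. The gen_ai attribute names follow the generative AI semantic conventions, but the specific set shown is a small illustrative sample; in practice your framework or the OpenInference instrumentation emits these spans for you.

```python
# Minimal sketch: a user turn with a nested tool-call span, instrumented by hand
# with the OpenTelemetry Python SDK. The gen_ai.* attributes shown are a small
# illustrative subset; agent frameworks normally emit them automatically.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("my-agent")

# One user message and its response form a trace; each agent action is a span.
with tracer.start_as_current_span("invoke_agent") as turn:
    turn.set_attribute("gen_ai.system", "aws.bedrock")        # model provider
    turn.set_attribute("gen_ai.request.model", "<model-id>")  # placeholder model identifier

    # A nested span for a single tool call the agent made during this turn.
    with tracer.start_as_current_span("execute_tool get_weather") as tool_call:
        tool_call.set_attribute("gen_ai.tool.name", "get_weather")
```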

Evaluations can be configured with different approaches:

  • LLM-as-a-Judge — where an LLM evaluates each agent interaction against structured rubrics with clearly defined criteria.
  • Ground Truth-based evaluation — which compares agent responses against pre-defined or simulated datasets.
  • Custom code evaluators — where you can bring a Lambda function as an evaluator with your own code.

In the LLM-as-a-Judge approach, the judge model examines the full context of the interaction — including conversation history, available tools, tools used, parameters passed, and system instructions — and provides detailed reasoning before assigning a score. Every score comes with an explanation. Teams can use these explanations to verify judgments, understand exactly why an interaction received a specific rating, and identify what should have happened differently.

Three principles guide the service’s approach to evaluation:

  • Evidence-driven development — replaces intuition with quantitative metrics so teams can measure the real impact of changes.
  • Multidimensional evaluation — assesses different aspects of agent behavior independently, making it possible to pinpoint exactly where improvements are needed.
  • Continuous measurement — connects the performance baselines established during development directly to production monitoring.

Evaluation across the agent lifecycle

An agent’s journey from prototype to production creates two distinct evaluation needs. During development, teams need controlled environments where they can compare alternatives, test the agent against curated datasets, reproduce results, and validate changes before they reach users. Once the agent is live, the challenge shifts to monitoring real-world interactions at scale, where users encounter edge cases and interaction patterns that no amount of pre-deployment testing could have anticipated.

AgentCore Evaluations maps two complementary approaches to these lifecycle phases:

Online evaluation for production monitoring

Online evaluation monitors live agent interactions by continuously sampling a configurable percentage of traces and scoring them against your chosen evaluators. You define which evaluators to apply, configure sampling rules that control what fraction of production traffic gets evaluated, and set appropriate filters. The service handles reading the traces, running the evaluations, and surfacing results in the AgentCore Observability dashboard, powered by Amazon CloudWatch.

If you are already collecting traces for observability, online evaluation adds explained quality scores alongside your existing operational metrics — without requiring code changes or redeployments.

Production quality issues often show up in ways that traditional monitoring does not catch. Operational dashboards might show everything green on latency and error rates while the user experience silently degrades because the agent starts selecting wrong tools or providing less helpful responses. Continuous quality scoring catches these silent failures by tracking evaluation metrics alongside operational metrics. Since AgentCore observability runs on CloudWatch, you can build custom dashboards and set up alarms to get alerted the moment scores drop below your thresholds.
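
As a sketch of what that looks like, the boto3 call below creates an alarm on an evaluation-score metric. The namespace, metric name, and dimension are assumptions for illustration; use the metric names your AgentCore Observability setup actually publishes.

```python
# Minimal sketch: a CloudWatch alarm that fires when an evaluation score drops
# below a threshold. The Namespace, MetricName, and dimension below are
# assumptions; substitute the metric names your observability setup emits.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="agent-helpfulness-below-threshold",
    Namespace="AgentCore/Evaluations",           # hypothetical namespace
    MetricName="HelpfulnessScore",               # hypothetical metric name
    Dimensions=[{"Name": "AgentId", "Value": "my-agent"}],
    Statistic="Average",
    Period=3600,                                 # evaluate hourly averages
    EvaluationPeriods=2,                         # require two consecutive breaches
    Threshold=0.7,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:agent-quality-alerts"],
)
```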

On-demand evaluation for development

On-demand evaluation is a real-time API designed for development and CI/CD workflows. Teams use it to test changes before deployment, run evaluation suites as part of CI/CD pipelines, perform regression testing across builds, and gate deployments based on quality thresholds.

Developers select a complete session and specify exact spans or traces by providing their IDs. The service considers the full session conversation and scores individual spans and traces against the same evaluators used in production. Common use cases include validating prompt changes, comparing model performance across alternatives, and preventing quality regressions.

Because both modes use the same evaluators, what you test in CI/CD is what you monitor in production — ensuring consistent quality standards throughout the entire development cycle. Together, the two modes form a continuous feedback loop between development and production.
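
A quality gate of this kind can be as simple as a pipeline script that reads the evaluator scores produced by an on-demand run and fails the build when they fall below agreed thresholds. The sketch below assumes a previous pipeline step wrote those scores to a scores.json file; the evaluator names and thresholds are illustrative.

```python
# Minimal sketch of a CI/CD quality gate over on-demand evaluation results.
# Assumes a prior pipeline step wrote evaluator scores to scores.json,
# e.g. {"Builtin.Correctness": 0.84, "Builtin.Helpfulness": 0.79}.
import json
import sys

THRESHOLDS = {
    "Builtin.Correctness": 0.80,
    "Builtin.Helpfulness": 0.75,
}

def gate(scores: dict[str, float]) -> bool:
    """Return True only if every thresholded evaluator meets its minimum score."""
    ok = True
    for name, minimum in THRESHOLDS.items():
        score = scores.get(name)
        if score is None or score < minimum:
            print(f"FAIL {name}: {score} < {minimum}")
            ok = False
    return ok

if __name__ == "__main__":
    with open("scores.json") as f:
        scores = json.load(f)
    sys.exit(0 if gate(scores) else 1)  # non-zero exit blocks the deployment
```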

How AgentCore evaluates your agent

AgentCore Evaluations organizes agent interactions into a three-level hierarchy that determines what can be evaluated and at what granularity:

  • Session — represents a complete conversation between a user and the agent, grouping all related interactions from a single user or workflow.
  • Trace — captures everything that happens during a single exchange. When a user sends a message and receives a response, that round trip produces a trace containing every step the agent took to generate its answer.
  • Span — represents specific individual actions the agent performed, such as invoking a tool, retrieving information from a knowledge base, or generating text.

The service offers 13 pre-configured built-in evaluators organized across these three levels, each measuring a distinct aspect of agent behavior:

Session level

Goal Success Rate — evaluates whether all user objectives were completed within a conversation.

Trace level

Includes evaluators like Helpfulness, Correctness, Coherence, Conciseness, Faithfulness, Harmfulness, Instruction Following, Response Relevance, Context Relevance, Refusal, and Stereotyping. These evaluators measure response quality, accuracy, safety, and communication effectiveness.

Tool level

Tool Selection Accuracy and Tool Parameter Accuracy — evaluate tool selection decisions and parameter extraction precision.

Understanding the distinctions between evaluators

Some evaluators naturally interact with each other, so scores should be read together rather than in isolation. Evaluators that sound similar often measure fundamentally different things:

  • Correctness checks whether the response is factually accurate, while Faithfulness checks whether it is grounded in the context the agent was given, such as conversation history and retrieved material. An agent can be faithful to incorrect source material and still be wrong.
  • Helpfulness asks whether the response moves the user toward their goal, while Response Relevance asks whether it answers what was originally asked. An agent can thoroughly answer the wrong question.
  • Coherence checks for internal contradictions in reasoning, while Context Relevance checks whether the agent had the right information available. One reveals a generation problem, the other a retrieval problem.

Some dependencies matter too: Tool Parameter Accuracy only makes sense when the agent selected the correct tool, so low Tool Selection Accuracy should be addressed first. Correctness often depends on Context Relevance because an agent cannot generate accurate responses without the right information. And Conciseness and Helpfulness often conflict because brief responses may omit context that users need.
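
One way to act on these dependencies is to triage low scores in a fixed order. The sketch below is an illustrative helper, not part of the service; the evaluator names and the 0.7 threshold are assumptions.

```python
# Minimal sketch of the triage ordering described above: address tool selection
# before tool parameters, and context relevance before correctness.
def triage(scores: dict[str, float], threshold: float = 0.7) -> list[str]:
    """Return evaluator names worth investigating, in dependency order."""
    order = [
        "ToolSelectionAccuracy",  # fix wrong tool choices first...
        "ToolParameterAccuracy",  # ...then how those tools are called
        "ContextRelevance",       # fix retrieval before judging accuracy
        "Correctness",
        "Helpfulness",
        "Conciseness",
    ]
    return [name for name in order if scores.get(name, 1.0) < threshold]

print(triage({"ToolSelectionAccuracy": 0.55, "ToolParameterAccuracy": 0.40, "Correctness": 0.9}))
# ['ToolSelectionAccuracy', 'ToolParameterAccuracy']
```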

Custom and code-based evaluators

Beyond built-in evaluators, the service supports custom evaluators using LLM-as-a-Judge with your own models, instructions, criteria, and scoring schemas. These are particularly valuable for industry-specific evaluations like compliance verification in healthcare or financial services, brand voice consistency, or adherence to organizational quality standards.
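
The exact configuration format is defined by the service, but conceptually a custom LLM-as-a-Judge evaluator combines a judge model, an instruction, and a scoring schema with clearly separated levels. The dictionary below is only an illustrative shape, not the service's actual schema.

```python
# Illustrative sketch of the pieces a custom LLM-as-a-Judge evaluator needs.
# The dictionary structure is an assumption for illustration only.
brand_voice_evaluator = {
    "name": "BrandVoiceConsistency",
    "judge_model": "<judge-model-id>",  # placeholder for the judge model you choose
    "instructions": (
        "Rate whether the agent response matches our brand voice: "
        "professional, direct, and free of unsupported claims."
    ),
    "scoring_schema": {
        "scale": [1, 2, 3, 4, 5],
        "levels": {
            1: "Off-brand tone or unsupported claims",
            3: "Mostly on-brand with minor lapses",
            5: "Fully on-brand: professional, direct, no unsupported claims",
        },
    },
}
```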

Code-based evaluators let you bring an AWS Lambda function to perform the evaluations. This type of evaluator is ideal when you have heuristic scoring methods that do not require language comprehension to verify. An LLM evaluator can judge whether a response sounds correct, but it cannot reliably confirm that a specific paycheck value appears literally in the response, or that a request ID follows a specific format. For these deterministic checks, custom code is faster, cheaper, and more reliable.

Four situations where code-based evaluators are especially useful:

  • Exact data validation — when the agent must return specific values from a data source, such as balances, transaction IDs, or prices.
  • Format compliance — when responses must follow structural constraints like size limits, mandatory phrases, or output schemas.
  • Business rule enforcement — policies that require precise interpretation, such as correct application of discount rules or citation of the right regulatory clause.
  • High-volume production monitoring — Lambda invocations cost a fraction of LLM inference, making code-based evaluators the right choice when every production session needs to be continuously scored at scale.
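
As a minimal sketch of the first two cases, the Lambda function below checks that a literal value appears in the response and that any request IDs follow a fixed format. The shape of the incoming event and of the returned score object are assumptions for illustration; adapt the field names to the payload your evaluator actually receives.

```python
# Minimal sketch of a code-based evaluator as an AWS Lambda function.
# The event fields and the returned score object are assumed shapes.
import json
import re

def lambda_handler(event, context):
    response_text = event.get("agent_response", "")
    expected_balance = event.get("expected_balance")  # e.g. "$2,450.10" from the source system

    # Exact data validation: the literal value must appear in the response.
    value_present = expected_balance is not None and expected_balance in response_text

    # Format compliance: every request ID mentioned must follow REQ-XXXXXX (six digits).
    ids = re.findall(r"\bREQ-\w+\b", response_text)
    ids_valid = all(re.fullmatch(r"REQ-\d{6}", i) for i in ids)

    passed = value_present and ids_valid
    return {
        "score": 1.0 if passed else 0.0,
        "explanation": json.dumps({
            "value_present": value_present,
            "ids_valid": ids_valid,
        }),
    }
```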

Evaluating agents with Ground Truth

LLM-as-a-Judge scoring tells you whether responses seem correct and helpful by the standards of a general-purpose language model. Ground Truth evaluation goes further by letting you specify the expected answer, the tools that should have been called, and the outcomes the session should have achieved.

AgentCore Evaluations supports three types of Ground Truth reference inputs, each consumed by a specific set of evaluators:

  • expected_response — used by the Builtin.Correctness evaluator to measure similarity between the agent response and the known correct answer.
  • expected_trajectory — used by trajectory matching evaluators to verify whether the agent called the right tools in the correct sequence.
  • assertions — used by Builtin.GoalSuccessRate to verify whether the session satisfied a set of natural language statements about expected outcomes.

These inputs are optional and independent. Evaluators that do not require Ground Truth, such as Builtin.Helpfulness and Builtin.ResponseRelevance, can be included in the same call as Ground Truth evaluators, and each evaluator reads only the fields it needs.
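
A single test case might therefore combine all three reference inputs. The record below is illustrative; the three field names come from the list above, while the surrounding structure is an assumption rather than the service's exact dataset schema.

```python
# Illustrative Ground Truth record combining the three optional reference inputs.
# Only the field names expected_response, expected_trajectory, and assertions
# come from the text above; everything else is example data.
ground_truth_record = {
    "user_query": "What is my checking account balance?",
    "expected_response": "Your checking account balance is $2,450.10.",
    "expected_trajectory": [
        {"tool": "lookup_account", "order": 1},
        {"tool": "get_balance", "order": 2},
    ],
    "assertions": [
        "The agent retrieves the balance for the checking account, not savings.",
        "The agent does not ask the user to repeat information already provided.",
    ],
}
```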

Best practices for agent evaluation

Success criteria for your agent typically combine three dimensions: the quality of responses, the latency at which users receive them, and inference cost. AgentCore Evaluations focuses on the quality dimension, while operational metrics like latency and cost are available through AgentCore Observability in CloudWatch.

Evidence-driven development

  • Establish a baseline of your agent’s performance with both synthetic and real-world data. Measure before and after every change so improvements are based on evidence, not intuition.
  • Run A/B tests with statistical rigor for every change. Whether updating a system prompt, swapping a model, or adding a tool, compare performance using the same set of evaluators before and after deployment.
  • Run repeated tests — at least 10 per question — organized by category to measure reliability. Variation across repeated runs reveals where your agent is consistent and where it needs work.
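
Measuring that variation does not require anything elaborate. The sketch below aggregates scores from repeated runs by question and flags questions whose scores swing widely; the data and the 0.15 spread threshold are illustrative.

```python
# Minimal sketch: measuring consistency across repeated runs of the same questions.
from collections import defaultdict
from statistics import mean, pstdev

# Evaluator scores per (category, question), one entry per repeated run.
results = {
    ("billing", "Q1"): [0.9, 0.85, 0.9, 0.8, 0.9, 0.95, 0.9, 0.85, 0.9, 0.9],
    ("billing", "Q2"): [1.0, 0.4, 0.9, 0.3, 1.0, 0.5, 0.9, 0.4, 1.0, 0.6],  # unstable
}

by_category = defaultdict(list)
for (category, question), scores in results.items():
    by_category[category].append((question, mean(scores), pstdev(scores)))

for category, rows in by_category.items():
    print(f"== {category} ==")
    for question, avg, spread in rows:
        flag = "  <- inconsistent, needs work" if spread > 0.15 else ""
        print(f"{question}: mean={avg:.2f} stdev={spread:.2f}{flag}")
```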

Multidimensional evaluation

  • Define what success looks like early on, using multidimensional criteria that reflect the agent’s real purpose.
  • Evaluate each step in the agent workflow, not just final outcomes. Measuring tool selection, parameter accuracy, and response quality independently gives you the diagnostic precision to fix problems where they actually occur.
  • Involve subject matter experts in designing your metrics and defining task coverage. Expert input keeps your evaluators grounded in real-world expectations and catches blind spots that automated scoring alone might miss.
  • Start with built-in evaluators to establish baseline measurements, then build custom evaluators as your needs mature.

Continuous measurement

  • Detect drift by comparing production behavior against your testing baselines. Set up CloudWatch alarms on key metrics to catch regressions before they reach a broad user base.
  • Remember that your test dataset evolves alongside your agent, your users, and the adversarial scenarios you encounter. Update it regularly as edge cases emerge in production and requirements change.

Diagnosing common evaluation patterns

Understanding how to interpret evaluation results is just as important as setting up the service. Here are some specific patterns you might encounter as you scale your application:

  • Low scores across all evaluators — the problem is typically fundamental. Start by reviewing Context Relevance scores to determine whether the agent has access to the information it needs. Check the system prompt for clarity and completeness.
  • Inconsistent scores for similar interactions — this usually points to evaluation configuration issues rather than agent problems. If you are using custom evaluators, check whether your instructions are specific enough and whether each scoring level has clear, distinguishable definitions.
  • High Tool Selection Accuracy but low Goal Success Rate — the agent selects appropriate tools but fails to complete user objectives. This pattern suggests you may need to add tools to handle certain requests, or that the agent struggles with tasks requiring multiple sequential tool calls.
  • Slow evaluations or failures due to throttling — reduce your sampling rate to evaluate a smaller percentage of sessions. Cut back on the number of evaluators. For custom evaluators, request quota increases or switch to a model with higher default quotas.

The real impact for teams working with AI

For teams scaling the use of AI agents within their organizations, the biggest benefit of Amazon Bedrock AgentCore Evaluations is not just technological — it is cultural. Having reliable evaluation infrastructure changes how the team makes decisions. Instead of relying on the intuition of whoever built the agent or on sparse user feedback, decisions become guided by objective metrics that everyone can follow and discuss. This reduces internal conflicts, speeds up the iteration cycle, and increases the team’s confidence in the changes they are making.

From a business perspective, continuous performance evaluation also helps justify investments in more advanced language models or improvements to the agent architecture. When you can show with data that a specific change increased response accuracy or reduced the number of incorrect tool calls, it becomes much easier to get approval to keep evolving the product. This is especially relevant in enterprise environments where AI adoption still faces skepticism and where concrete results make all the difference.

With on-demand evaluation anchoring the development workflow and online evaluation providing continuous visibility in production, quality becomes a measurable and improvable property across the entire agent lifecycle. The relationships between evaluators and the diagnostic patterns offer a framework not just for scoring agents, but for understanding where and why quality issues occur and where to focus improvement efforts.

At the end of the day, it is the user experience that determines the success or failure of any AI agent in production. There is no point in having the most sophisticated model on the market if it cannot answer your users’ real questions consistently and reliably. AgentCore Evaluations exists precisely to ensure that quality standard is measurable, trackable, and continuously improved — transforming agent development from an art based on intuition into a discipline based on evidence. 💡
