Evaluation of AI Agents: Frameworks, Memory, and Protocols

08/06/2026 · 15 min read · By Rafael

Index

AI Agents in 2026: The Complete Stack For Building Agents That Actually Work
Before the Stack, There Is the Loop
How to Evaluate Each Layer of the Stack
Layer 1: Models and Inference
Layer 2: Protocols and Tools
Layer 3: Memory and Knowledge — Way Beyond the Vector Database
Context Engineering Replaced Prompt Engineering
The Three Tiers of Memory
Layer 4: Frameworks and SDKs — Choosing Without Getting Lost
Layer 5: Evaluation and Observability — The Layer Nobody Put on the Map
Three Dimensions of Evaluation
Layer 6: Guardrails and Security
What Type of Agent Are You Building?
Stateless Tool Caller
Multi-Step Workflow
Learning Agent
Multi-Agent System
Code Agents: All 6 Layers in Action
The Big Picture: Stop Building Like It Is 2024

AI Agents in 2026: The Complete Stack For Building Agents That Actually Work

AI Agents are everywhere, and with them came a classic problem: people are picking tools before understanding what the problem actually needs.

Picture a team spending three weeks configuring LangGraph for a support chatbot, with 14 nodes in a state graph, a custom checkpointer writing to Redis, and retry logic for tool calls that fail once a week. At the end of the day, the agent answers refund questions and calls a single API. Fifty lines of code using the OpenAI SDK with two MCP servers would have solved the exact same thing. The problem was not the tool choice itself but rather not mapping out which layers the problem actually needed before starting to build.

In November 2024, Letta published an AI agent stack diagram that became the go-to reference for a large share of engineering teams. If you have seen that layered agent image floating around LinkedIn or pinned in a Slack channel, it probably came from there. The thing is, that diagram is now 14 months old, and a lot has changed since then 🗺️

MCP did not exist yet. Memory was treated as a subset of the vector database. Nobody was shipping native agent SDKs from providers. And evaluation was not even on the map. In 2026, the stack has six layers, at least three of which did not exist as distinct categories when the original diagram was drawn. The original article mapping this new landscape was published by Paolo Perrone on the Substack The AI Engineer and republished by O’Reilly with the author’s permission.

Before the Stack, There Is the Loop

Before talking about layers and tools, it is worth understanding what an AI Agent actually is. An agent is defined by the think-act-observe cycle: the model reasons about a task, takes an action like calling a tool or writing to memory, observes the result, and repeats the process until the task is done. This loop is the atomic unit. The entire stack exists to make this loop work reliably, at scale, in production.
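
To make the loop concrete, here is a minimal sketch of think-act-observe in plain Python. Everything in it is an assumption for illustration: `fake_model` stands in for a real model call, `lookup_order` for a real tool, and the `Step` shape is invented here rather than taken from any SDK.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Step:
    """One model decision: either call a tool or finish with an answer."""
    tool: Optional[str] = None
    args: dict = field(default_factory=dict)
    answer: Optional[str] = None


def run_agent(task: str,
              think: Callable[[list[str]], Step],
              tools: dict[str, Callable[..., str]],
              max_steps: int = 5) -> str:
    """Repeat think -> act -> observe until the model answers or the step budget runs out."""
    observations = [f"task: {task}"]
    for _ in range(max_steps):
        step = think(observations)                        # think
        if step.answer is not None:
            return step.answer
        result = tools[step.tool](**step.args)            # act
        observations.append(f"{step.tool} -> {result}")   # observe
    return "stopped: step budget exhausted"


# Hypothetical stand-ins for a model and a tool.
def fake_model(observations: list[str]) -> Step:
    if len(observations) == 1:
        return Step(tool="lookup_order", args={"order_id": "A17"})
    return Step(answer=f"Done. {observations[-1]}")


def lookup_order(order_id: str) -> str:
    return f"order {order_id} is refundable"


final = run_agent("Can order A17 be refunded?", fake_model,
                  {"lookup_order": lookup_order})
```

The `max_steps` budget is the one piece worth keeping even in a toy version: without it, a model that never produces an answer loops forever.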

And here is a fundamental distinction: the agent stack is not the LLM stack. A chatbot needs inference and maybe RAG. An agent needs state management across multi-step runs, tool access governed by protocols, memory that persists across sessions, autonomous reasoning loops, and guardrails that constrain behavior in real time. These are fundamentally different infrastructure problems.

Three things redrew the map between 2024 and 2026. MCP standardized tool connectivity, and the entire protocol layer is new because of it. Reasoning models like o1 and o3 changed what agents can do autonomously, with single calls replacing some multi-step chains. And memory became a first-class architectural primitive, not a bolt-on hack over a vector database.

How to Evaluate Each Layer of the Stack

When choosing tools for each layer, three questions cut through the noise:

How much state management do you need? A stateless tool caller and a multi-session agent that learns over time are completely different engineering problems. The layers where state management is hardest, like memory and frameworks, are where most teams get stuck.
How much vendor lock-in can you tolerate? MCP is an open standard, provider SDKs are not. Every tool choice either increases or decreases the pain of your next migration.
How far is it from demo to production? Some layers, like model serving, have almost no gap. Others, like eval and guardrails, have a chasm. The layer where you feel this distance the most is the one that deserves investment first.

Layer 1: Models and Inference

The first layer of the stack is how you run the model powering your agent: calling an API, using a managed provider for open-weight models, or self-hosting.

The inference layer has changed more in tone than in substance. Reasoning models like o1, o3, DeepSeek R1, and Claude with extended thinking changed what agents can plan and execute. Agents that previously needed multi-step chains now solve problems in a single reasoning call. Open-weight models like Llama 3.3, DeepSeek V3, and Qwen 2.5 have closed the quality gap dramatically, so the old advice of always using the biggest closed model is no longer the default. The emerging pattern is to prototype on closed source and deploy on open weight.

In practice, this layer is commoditizing. The differences between models matter less every quarter. The real decision is the cost versus latency trade-off, not which model is the smartest. API calls are stateless: send a request, get a response, nothing to manage. Lock-in risk is high for closed APIs because each model reasons differently, and switching providers means readjusting prompts, adapting to different failure modes, and retesting your entire eval suite. For open weight, the risk is low. The gap between prototype and production is the smallest across all layers.

Consider self-hosting when your agent’s call volume makes API pricing unsustainable or when you need sub-100ms latency that API round-trips cannot deliver.
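
The volume argument comes down to simple arithmetic. The sketch below compares per-token API spend against a flat monthly self-hosting cost. All numbers in the usage lines are hypothetical placeholders, not real prices, and the model ignores engineering time and idle capacity.

```python
def monthly_api_cost(calls_per_month: int, tokens_per_call: int,
                     usd_per_million_tokens: float) -> float:
    """API spend for a month, assuming a flat per-token price."""
    return calls_per_month * tokens_per_call * usd_per_million_tokens / 1_000_000


def break_even_calls(fixed_monthly_usd: float, tokens_per_call: int,
                     usd_per_million_tokens: float) -> float:
    """Calls per month at which API spend equals the fixed self-hosting cost."""
    return fixed_monthly_usd * 1_000_000 / (tokens_per_call * usd_per_million_tokens)


# Hypothetical inputs: 4,000 USD/month for GPUs, 2,000 tokens per call,
# 5 USD per million tokens on the API.
threshold = break_even_calls(4_000, 2_000, 5.0)
api_spend_at_threshold = monthly_api_cost(int(threshold), 2_000, 5.0)
```

With these placeholder inputs the break-even lands at roughly 400,000 calls per month. Below that volume the API is cheaper, and above it the fixed cost starts to win.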

Layer 2: Protocols and Tools

This layer covers how your agent calls external tools and APIs: through MCP servers, browser automation, or agent-to-agent protocols.

This layer did not exist as a distinct category in 2024. Every framework had its own JSON schema for tool definitions. Now MCP is the standard, with 97 million monthly SDK downloads, adoption by OpenAI, Google, and Microsoft, and donation to the Linux Foundation. The protocol debate is over. MCP won 🏆

Browser Use exploded in parallel, hitting 78,000 GitHub stars in less than a year. Nobody was running browser agents in production in 2024. And agents can now talk to other agents: IBM launched ACP and Google launched A2A. Neither is a standard yet, but the problem they solve, agents coordinating with other agents, is real and growing.

Security is the open problem. Endor Labs analyzed 2,614 MCP servers and found that 82% were vulnerable to path traversal and 67% to code injection. The only question left is how you secure your MCP servers before someone exploits them. OWASP published the MCP Top 10 in beta, which is the first real security checklist for tool-connected agents.
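
Path traversal is the easiest of these to illustrate. A minimal sketch of the usual defense, assuming a tool that reads files under a fixed root: resolve the requested path first, then refuse anything that lands outside the root. The function names and the `agent-files` directory are hypothetical, and this covers only one class of vulnerability.

```python
from pathlib import Path


def resolve_inside(root: Path, requested: str) -> Path:
    """Resolve a tool-supplied path and refuse anything that escapes root."""
    root = root.resolve()
    candidate = (root / requested).resolve()   # collapses ".." and follows symlinks
    if not candidate.is_relative_to(root):     # requires Python 3.9+
        raise PermissionError(f"path escapes sandbox: {requested!r}")
    return candidate


def is_allowed(root: Path, requested: str) -> bool:
    """Convenience wrapper that reports the decision instead of raising."""
    try:
        resolve_inside(root, requested)
        return True
    except PermissionError:
        return False


SANDBOX = Path("agent-files")  # hypothetical directory the tool may read
```

The check runs on the resolved path rather than the raw string, because string checks such as rejecting `..` miss absolute paths and symlinks.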

State management here is nonexistent: your agent calls a tool, gets a response, done. Lock-in risk is low because MCP is an open standard. The gap between demo and production is medium: your demo MCP server works until someone sends a malicious tool description.

Layer 3: Memory and Knowledge — Way Beyond the Vector Database

The idea that memory in AI Agents boils down to embeddings and similarity search is behind us. In 2024, memory meant picking a vector database and doing RAG. In 2026, memory is a first-class architectural primitive with three distinct tiers, and all of them feed the same place: the context window your agent sees on each call.

Context windows have gotten massive: Gemini reached over 1 million tokens and Claude hit 200,000. But bigger windows did not eliminate the need for memory. They shifted the trade-off: what do you put directly in context versus what do you retrieve on demand?

Context Engineering Replaced Prompt Engineering

Instead of writing a better prompt, the core discipline is now architecting what information the agent sees on each call. Memory blocks emerged as named, structured fields in the context window that the agent can read and overwrite on every turn. Instead of dumping everything into the system prompt, the agent manages its own state: what to keep, what to update, what to discard.
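
One way to picture memory blocks is as a small dictionary of named fields that gets rendered into the prompt on every call. The sketch below is an illustration under that assumption. The class name, the tag-style rendering, and the per-block character cap are all invented here and do not reflect any particular library's format.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MemoryBlocks:
    """Named fields the agent can read and overwrite, rendered into every call."""
    blocks: dict[str, str] = field(default_factory=dict)
    char_limit: int = 500  # hypothetical per-block cap to protect the context budget

    def write(self, name: str, value: str) -> None:
        """Create or overwrite a block; oversized values are rejected."""
        if len(value) > self.char_limit:
            raise ValueError(f"block {name!r} exceeds {self.char_limit} characters")
        self.blocks[name] = value

    def discard(self, name: str) -> None:
        """Drop a block the agent no longer needs."""
        self.blocks.pop(name, None)

    def render(self) -> str:
        """Serialize all blocks for inclusion in the next model call."""
        return "\n".join(f"<{name}>\n{value}\n</{name}>"
                         for name, value in self.blocks.items())


memory = MemoryBlocks()
memory.write("user", "Prefers refunds over store credit")
memory.write("task", "Refund order A17")
memory.write("task", "Refund issued, awaiting confirmation")  # overwrite, not append
prompt_section = memory.render()
```

The overwrite on `task` is the point: the block holds current state, so the context does not grow with every turn the way an appended transcript does.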

The Three Tiers of Memory

Short-term memory, or in-context memory, is what lives inside the context window during a single run. It is the fastest, the most expensive in tokens, and the most volatile. When the conversation ends, it is gone. For agents that only need to complete a task within a single session, this is enough. But when the agent needs to resume a conversation days later or remember user preferences, in-context memory alone does not cut it.

The second tier is retrieval: knowledge kept outside the context window and pulled in on demand. On the infrastructure side, pgvector has become the default for teams that do not need a dedicated vector database, since it is just Postgres with an extension. GraphRAG emerged as a second retrieval option: it follows relationships between entities instead of matching embeddings, with Neo4j leading that space.

Long-term memory is where things get interesting. Tools like Letta, Mem0, and Zep started treating persistent memory as a first-class citizen, with explicit read and write operations, update policies, and consolidation mechanisms. The agent does not just retrieve information — it decides what is worth storing, when to update an existing memory, and when to discard something outdated. This active management capability is what separates an agent that feels intelligent from one that just feels reactive.

Most teams overcomplicate memory. Start with conversation history in Postgres and a structured system prompt. Add vector search when history exceeds context limits. Add agentic memory management only when your agent needs to learn across sessions. The gap between prototype and production is large: demo memory works because context windows are big enough, but production memory breaks when conversations get long and the agent starts forgetting the important parts.
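
The first step of that progression fits in a few lines. The sketch below stores conversation history in a SQL table and loads only the most recent messages that fit a budget. It uses the standard library's `sqlite3` as a stand-in for Postgres so it runs anywhere, and counts words as a crude proxy for tokens. The table layout and function names are assumptions.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """In-memory SQLite as a stand-in for Postgres in this sketch."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE messages ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, session TEXT, role TEXT, content TEXT)"
    )
    return conn


def append(conn: sqlite3.Connection, session: str, role: str, content: str) -> None:
    conn.execute(
        "INSERT INTO messages (session, role, content) VALUES (?, ?, ?)",
        (session, role, content),
    )


def recent_within_budget(conn: sqlite3.Connection, session: str, max_words: int):
    """Walk newest-first, stop before the budget is exceeded, return oldest-first."""
    rows = conn.execute(
        "SELECT role, content FROM messages WHERE session = ? ORDER BY id DESC",
        (session,),
    )
    kept, used = [], 0
    for role, content in rows:
        cost = len(content.split())  # word count as a rough token estimate
        if used + cost > max_words:
            break
        kept.append((role, content))
        used += cost
    return kept[::-1]


conn = open_store()
append(conn, "s1", "user", "My order A17 arrived damaged")
append(conn, "s1", "assistant", "Sorry to hear that, I can refund it")
append(conn, "s1", "user", "Yes please refund to my card")
context = recent_within_budget(conn, "s1", max_words=14)
```

With a 14-word budget the oldest message is dropped. That silent drop is exactly the production failure described above, and it is the signal to add retrieval.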

Layer 4: Frameworks and SDKs — Choosing Without Getting Lost

The framework ecosystem for AI Agents has exploded. Every major AI lab now has its own agent SDK. OpenAI has the Agents SDK, which evolved from Swarm. Google launched ADK. Microsoft has Semantic Kernel and AutoGen. Hugging Face built smolagents. Two years ago, LangChain was effectively the default option. Now you choose between three camps:

Provider SDKs: fast to get started but locked to one model
Graph-based frameworks like LangGraph: portable but require more setup
No framework: thin wrappers over provider APIs + MCP

This choice simply did not exist in 2024.

LangGraph established itself as the leader in graph-based orchestration with version 1.0 released in October 2025 and production deployments at Uber, JPMorgan, LinkedIn, and Klarna. Meanwhile, the DIY camp grew. Teams that tried LangChain in 2024 and fought with the abstraction are now writing thin wrappers over provider APIs + MCP. No framework means full control, and that works until your agent needs state management or complex branching.

An important note on naming: LangChain and LangGraph are not the same thing. LangChain is the integration layer that handles model connectors, tool calling, and prompt templates. LangGraph is the orchestration engine that manages state, control flow, and graphs. Most teams in production use both together, but LangGraph is where the agent logic lives.

Lock-in risk at this layer is the highest in the entire stack. Your orchestration code is not portable. A LangGraph agent rewritten for CrewAI is a brand new codebase. Provider SDKs are worse because you are locked to one model too. Most teams over-choose on frameworks. If your agent calls a model and a few tools, you do not need LangGraph. A provider SDK and a few tool calls will get you to production faster than any graph.

Layer 5: Evaluation and Observability — The Layer Nobody Put on the Map

If there is one thing the original diagram did not cover that is now treated as critical, it is evaluation. This layer barely existed in 2024. Now it is the gap. LangChain’s State of Agent Engineering survey found that 89% of teams with agents in production have implemented observability, but only 52% have evals. That 37-point gap is where production quality dies 📊

Evaluating an AI Agent is fundamentally different from evaluating a standalone LLM. When you test a model, you pass a prompt, get a response, and compare it to a reference. When you evaluate an agent, you are dealing with decision sequences, calls to external tools, intermediate states, memory that evolves over time, and outcomes that sometimes only make sense in the context of the agent’s entire trajectory.

Three Dimensions of Evaluation

The approach that is converging separates at least three levels:

Fast checks on every PR: did the agent call the right tools?
Nightly regression suites: use an LLM to judge output quality
Continuous production monitoring: alerts when agent performance starts to degrade
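
The first level is cheap enough to run in CI. A minimal sketch, assuming the agent run is recorded as a list of event dictionaries with `type` and `tool` keys (a hypothetical trace format, not any tool's schema):

```python
def tool_names(trace: list) -> list:
    """Extract the ordered list of tools the agent called."""
    return [event["tool"] for event in trace if event["type"] == "tool_call"]


def check_tool_sequence(trace: list, expected: list, forbidden: tuple = ()) -> list:
    """Return a list of failures; an empty list means the trajectory passes."""
    called = tool_names(trace)
    failures = []
    if called != list(expected):
        failures.append(f"expected {list(expected)}, got {called}")
    for name in forbidden:
        if name in called:
            failures.append(f"forbidden tool called: {name}")
    return failures


# Hypothetical recorded run of a refund agent.
trace = [
    {"type": "message", "content": "Customer asks for a refund"},
    {"type": "tool_call", "tool": "lookup_order"},
    {"type": "tool_call", "tool": "issue_refund"},
    {"type": "message", "content": "Refund issued"},
]

failures = check_tool_sequence(trace, ["lookup_order", "issue_refund"],
                               forbidden=("delete_account",))
```

A check like this says nothing about answer quality, which is what the nightly LLM-judged suite is for. It only catches the agent taking the wrong path.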

New agent-specific benchmarks have also emerged, including Context-Bench for memory management, Recovery-Bench for error recovery, and Terminal-Bench for code agents.

Tools like LangSmith, Braintrust, and Arize are evolving rapidly to cover these dimensions. But the most important point is not which tool you use — it is when you start thinking about it. Teams that leave evaluation for the end invariably discover they need to rearchitect parts of the system to make it observable enough to be evaluated. Most prototypes have zero eval. You do not feel the pain until production users find the failures for you.

Layer 6: Guardrails and Security

Agent guardrails have become a separate discipline from LLM guardrails. In 2024, guardrails meant input/output filters on a model. In 2026, your agent calls tools, spends money, and takes actions. Guardrails now means authorizing tool calls, enforcing rate limits, and validating what the agent actually did.

The pre-action guardrails pattern emerged from teams that learned the hard way. They now apply authorization at the tool execution layer, not at the output layer. Because by the time you filter the response, the agent has already sent the email.
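
A minimal sketch of the pattern: wrap each tool so a policy function runs before the side effect, not after. The policy, the `send_email` tool, and the `example.com` allowlist are hypothetical. A real deployment would also need identity, logging, and rate limiting.

```python
from typing import Callable, Optional


class ToolDenied(Exception):
    """Raised when a policy refuses a tool call before it executes."""


def guarded(tool: Callable[..., str],
            policy: Callable[[dict], Optional[str]]) -> Callable[..., str]:
    """Wrap a tool so the policy is checked before any side effect happens."""
    def wrapper(**args) -> str:
        reason = policy(args)          # None means allowed
        if reason is not None:
            raise ToolDenied(reason)
        return tool(**args)
    return wrapper


sent = []  # records side effects so the example can show what actually ran


def send_email(to: str, body: str) -> str:
    sent.append(to)
    return f"sent to {to}"


def internal_only(args: dict) -> Optional[str]:
    """Hypothetical policy: only allow recipients on the company domain."""
    if args["to"].endswith("@example.com"):
        return None
    return f"external recipient blocked: {args['to']}"


safe_send = guarded(send_email, internal_only)
ok = safe_send(to="ops@example.com", body="Refund processed")
```

Calling `safe_send` with an outside address raises `ToolDenied` and leaves `sent` untouched. An output filter cannot give that guarantee.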

This is the least mature layer in the entire stack. There is no dominant framework and no established standards. You are writing policy code from scratch. NeMo Guardrails is the closest thing to a framework, but you will still write most of the rules manually. The gap between demo and production is effectively infinite: your demo has no guardrails because nobody is trying to break it. Production will.

And if you are running multi-agent workflows where agents delegate to each other, guardrail propagation across agent boundaries is an open problem. You will need custom authorization logic.

What Type of Agent Are You Building?

This is the decision that cuts through all the framework confusion. The type of agent determines which layers you invest in and which tools to pick for each one.

Stateless Tool Caller

Answers questions from a knowledge base, looks up an order, or checks inventory. You need a provider SDK, MCP, and Postgres. No framework, no vector database. This is a weekend project.

Multi-Step Workflow

Processes a refund end to end, reviews a PR across five files, or triages and routes support tickets. Steps depend on each other, things fail in the middle, and humans need to approve before the agent acts. You need LangGraph, MCP, and eval. Build evals before you deploy because these agents break silently.
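
What a graph framework provides here is, at its core, an explicit state machine that can pause. The sketch below shows that idea in plain Python for a refund flow with a human approval gate. It is not LangGraph code, and the states and context keys are invented for illustration.

```python
from enum import Enum


class State(Enum):
    VALIDATE = "validate"
    AWAIT_APPROVAL = "await_approval"
    REFUND = "refund"
    DONE = "done"
    FAILED = "failed"


def advance(state: State, ctx: dict) -> State:
    """Compute the next state from the current state and the shared context."""
    if state is State.VALIDATE:
        return State.AWAIT_APPROVAL if ctx["amount"] > 0 else State.FAILED
    if state is State.AWAIT_APPROVAL:
        if ctx.get("approved") is None:
            return State.AWAIT_APPROVAL   # parked until a human decides
        return State.REFUND if ctx["approved"] else State.FAILED
    if state is State.REFUND:
        ctx["refunded"] = True            # the side effect would happen here
        return State.DONE
    return state


def run(ctx: dict, state: State = State.VALIDATE) -> State:
    """Advance until a terminal state, or until the workflow parks on a human."""
    while state not in (State.DONE, State.FAILED):
        nxt = advance(state, ctx)
        if nxt is state:
            break
        state = nxt
    return state


ctx = {"amount": 40, "approved": None}
parked = run(ctx)                 # stops at the approval gate
ctx["approved"] = True            # a human records the decision
finished = run(ctx, parked)       # resumes from the saved state
```

Persisting `parked` and `ctx` between the two calls is the job of a checkpointer. Once you need that, a framework starts to earn its setup cost.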

Learning Agent

Remembers your preferences across sessions, gets better on your codebase over time, or tracks project context for weeks. You need a memory-first architecture, a vector database, and eval. Orchestration is the easy part. The hard part is deciding what to remember, what to discard, and how to prevent old context from polluting new responses.

Multi-Agent System

Agents that delegate to other agents, split a research task among specialists, or run parallel workstreams. You need the full stack. Two agents passing context to each other is already hard to debug. Five is impossible without trace-level evals at every handoff. Build eval infrastructure before you build the second agent.

Code Agents: All 6 Layers in Action

Code agents like Cursor, Claude Code, Codex, and Windsurf are the most proven application of the AI Agent stack. All six layers working together.

At the inference layer, these tools serve hundreds of millions of daily requests. Cursor routes between Claude, GPT-4, and its own fine-tuned models depending on the task. At the protocol layer, MCP servers connect to editors, terminals, file systems, and Git, which is how the agent reads your code and runs commands. At the memory layer, codebase-aware retrieval and reranking ensure the agent does not read your entire repo — just the files relevant to that specific edit.

At the framework layer, these are custom orchestration systems with RL loops. Not LangGraph, not a provider SDK. Control flow built from scratch for code generation, review, and iteration. At the eval layer, Cursor retrains its acceptance rate model every 90 minutes based on whether users accept or reject suggestions. That is eval running in production, continuously. And at the guardrails layer, sandboxed execution prevents runaway agents. The agent can write and run code, but inside a container that limits what it can touch.
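
Continuous eval of this kind can start much smaller than a retrained model. As an illustration only, and not a description of how Cursor implements it, the sketch below tracks a rolling acceptance rate over the last N suggestions and flags degradation when it falls under a threshold. The window size and threshold are hypothetical.

```python
from collections import deque
from typing import Optional


class AcceptanceMonitor:
    """Rolling acceptance rate over the most recent suggestions."""

    def __init__(self, window: int = 100, alert_below: float = 0.3):
        self.events = deque(maxlen=window)   # oldest events fall off automatically
        self.alert_below = alert_below

    def record(self, accepted: bool) -> None:
        self.events.append(accepted)

    @property
    def rate(self) -> Optional[float]:
        """Share of accepted suggestions in the window, or None if empty."""
        if not self.events:
            return None
        return sum(self.events) / len(self.events)

    def degraded(self) -> bool:
        """Alert only once the window is full, to avoid noise on small samples."""
        full = len(self.events) == self.events.maxlen
        return full and self.rate < self.alert_below


monitor = AcceptanceMonitor(window=4, alert_below=0.5)
for accepted in (True, True, False, False):
    monitor.record(accepted)
steady = monitor.degraded()    # rate is 0.5, which is not below the threshold
monitor.record(False)          # window is now True, False, False, False
dropped = monitor.degraded()   # rate is 0.25
```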

The Big Picture: Stop Building Like It Is 2024

Most teams are building like it is still 2024. They pick LangGraph before knowing if they need state. They add a vector database before outgrowing Postgres. They design multi-agent architectures before shipping a single agent that works.

A chatbot that calls tools and a multi-agent research system share almost no infrastructure. Treating both the same means overbuilding the first and underbuilding the second.

The teams that have moved past this point run evals on every deploy, not once a quarter. Their guardrails sit at the tool call layer, not the output layer. Their memory architecture was designed intentionally, not inherited from the framework default. Most teams do the opposite: zero evals, output-only filtering, and a system prompt that grows until the context window chokes.

The gap is not talent or budget. It is knowing which layers matter for your specific agent instead of half-building all six.

The AI Agent stack in 2026 is more mature, more fragmented, and more demanding than it was 14 months ago. Each layer has its own design decisions, its own competing tools, and its own trade-offs that only become visible when you are building something real. And a clear trend is on the horizon: provider SDKs are already absorbing memory, tool calling, and basic eval into a single API. By early 2027, most teams will probably not build each layer separately. They will get an increasingly opinionated stack from their model provider, and that will work for 80% of use cases. The other 20%, agents at scale where the defaults break, will still build custom at every layer. But even in those cases, when something fails in production, you need to know which layer failed. Mapping these layers before picking any tool is not engineering bureaucracy — it is what separates systems that work in production from systems that only work in the demo.
